Curio: A Swift package for analysis of social media data

Authors (affiliation)

James R. Wallace (1)
Mingchung Xia (1)
Adrian Davila (1)
Abhinav Jain (1)
Peter Li (1)
Nicole Mathis (1)
Jean Nordmann (1)
Henry Tian (1)
George Wang (1)
Ali Raza Zaidi (1)
Jason Zhao (1)

Published: December 6, 2024

Important: Under Review

This paper is under review on the experimental track of the Journal of Visualization and Interaction. See the reviews and issues for this paper.

Summary

Social media is a rich and influential source of information for qualitative researchers. However, while computational techniques like topic modeling are ubiquitous in the machine learning community, they are less frequently used by qualitative researchers, who often lack the programming skills required to apply them in their own work. There is therefore an opportunity to explore how computational tools can be designed to align with the existing workflows and capabilities of qualitative researchers, using visual interfaces and commodity computer hardware.

Statement of need

Curio is a research-based, unsupervised topic modeling pipeline for social media, written in Swift. It draws on existing libraries to support data collection, document encoding (e.g., Core ML, Model2Vec (“Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World” 2024), Apple’s Natural Language framework), dimensionality reduction (e.g., PCA, t-SNE (Van der Maaten and Hinton 2008), UMAP (McInnes, Healy, and Melville 2020)), clustering (e.g., HDBSCAN (Campello, Moulavi, and Sander 2013), k-means), and topic modeling. Our goals are to provide a modular and efficient set of tools that work across a variety of data sources. We use modern Swift concurrency and libraries like MLX to provide performant and safe implementations that run well on commodity Mac hardware. Curio will enable the development of new qualitative data analysis tools for edge devices like laptops, tablets, and smartphones.
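To make the shape of such a pipeline concrete, the following is a minimal sketch of the encode, cluster, and topic-extraction stages in plain Swift. It is not Curio's API: every function name here (`tokenize`, `encode`, `kMeans`, `topTerms`) is hypothetical, the example posts are invented, and the techniques are deliberately simplified stand-ins (term counts instead of learned embeddings, k-means with a fixed seed instead of HDBSCAN, and no dimensionality reduction step).

```swift
import Foundation

// Illustrative only: these names are hypothetical and are NOT Curio's API.

/// Lowercases text and splits it on non-letter characters.
func tokenize(_ text: String) -> [String] {
    text.lowercased().split(whereSeparator: { !$0.isLetter }).map(String.init)
}

/// Encoding stage: term-count vectors over a sorted corpus vocabulary.
func encode(_ documents: [String]) -> (vocabulary: [String], vectors: [[Double]]) {
    let vocabulary = Set(documents.flatMap(tokenize)).sorted()
    var position: [String: Int] = [:]
    for (i, term) in vocabulary.enumerated() { position[term] = i }
    let vectors = documents.map { document -> [Double] in
        var vector = [Double](repeating: 0, count: vocabulary.count)
        for token in tokenize(document) { vector[position[token]!] += 1 }
        return vector
    }
    return (vocabulary, vectors)
}

func squaredDistance(_ a: [Double], _ b: [Double]) -> Double {
    zip(a, b).reduce(0.0) { sum, pair in sum + (pair.0 - pair.1) * (pair.0 - pair.1) }
}

/// Clustering stage: plain k-means, seeded with the first k vectors so runs are reproducible.
func kMeans(_ vectors: [[Double]], k: Int, iterations: Int = 10) -> [Int] {
    precondition(k > 0 && vectors.count >= k, "need at least k vectors")
    var centroids = Array(vectors.prefix(k))
    var labels = [Int](repeating: 0, count: vectors.count)
    for _ in 0..<iterations {
        // Assignment step: nearest centroid for each vector.
        for (i, vector) in vectors.enumerated() {
            labels[i] = centroids.indices.min(by: {
                squaredDistance(vector, centroids[$0]) < squaredDistance(vector, centroids[$1])
            })!
        }
        // Update step: each centroid becomes the mean of its members.
        for c in 0..<k {
            let members = vectors.indices.filter { labels[$0] == c }.map { vectors[$0] }
            guard let first = members.first else { continue }
            var sum = [Double](repeating: 0, count: first.count)
            for member in members { for d in sum.indices { sum[d] += member[d] } }
            centroids[c] = sum.map { $0 / Double(members.count) }
        }
    }
    return labels
}

/// Topic stage: most frequent terms in one cluster (ties broken alphabetically).
func topTerms(cluster: Int, labels: [Int], vectors: [[Double]],
              vocabulary: [String], count: Int) -> [String] {
    var totals = [Double](repeating: 0, count: vocabulary.count)
    for (i, vector) in vectors.enumerated() where labels[i] == cluster {
        for d in totals.indices { totals[d] += vector[d] }
    }
    let ranked = vocabulary.indices.sorted {
        totals[$0] != totals[$1] ? totals[$0] > totals[$1] : vocabulary[$0] < vocabulary[$1]
    }
    return ranked.prefix(count).map { vocabulary[$0] }
}

// Invented example posts; real input would come from a data collection stage.
let posts = [
    "bus delays on the transit line",
    "vaccine clinic opens today",
    "transit bus delays again",
    "vaccine clinic booster hours",
]
let (vocabulary, vectors) = encode(posts)
let labels = kMeans(vectors, k: 2)
for c in 0..<2 {
    let terms = topTerms(cluster: c, labels: labels, vectors: vectors,
                         vocabulary: vocabulary, count: 3)
    print("cluster \(c):", terms.joined(separator: ", "))
}
```

Because each stage only consumes the previous stage's output (strings, then vectors, then labels), any one of them could in principle be swapped for a stronger implementation without touching the others, which is the kind of modularity the paragraph above describes.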

Acknowledgements

We acknowledge contributions from Jaden Geller (Geller 2024), Jan Krukowski (Krukowski 2024), and the Minish Lab (“Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World” 2024).

References

Campello, Ricardo JGB, Davoud Moulavi, and Jörg Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 160–72. Springer. https://doi.org/10.1007/978-3-642-37456-2_14.
Geller, Jaden. 2024. “Similarity Topology.” GitHub Repository. GitHub. https://github.com/JadenGeller/similarity-topology.
Krukowski, Jan. 2024. “SwiftFaiss.” GitHub Repository. GitHub. https://github.com/jkrukowski/SwiftFaiss.
McInnes, Leland, John Healy, and James Melville. 2020. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” https://doi.org/10.48550/arXiv.1802.03426.
“Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World.” 2024. https://github.com/MinishLab/model2vec.
Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (11).