Curio: A Swift package for analysis of social media data
This paper is under review on the experimental track of the Journal of Visualization and Interaction. See the reviews and issues for this paper.
Summary
Social media is a rich and influential source of information for qualitative researchers. However, while computational techniques like topic modeling are ubiquitous in the machine learning community, they are less frequently used by qualitative researchers, who often lack the programming skills required to apply them in their own work. There is therefore an opportunity to explore how computational tools can be designed to align with the existing workflows and capabilities of qualitative researchers, using visual interfaces and commodity computer hardware.
Statement of need
Curio is a research-based, unsupervised topic modeling pipeline for social media, written in Swift. It draws on available libraries to support data collection, document encoding (e.g., Core ML, Model2Vec (“Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World” 2024), Apple’s Natural Language framework), dimensionality reduction (e.g., PCA, t-SNE (Van der Maaten and Hinton 2008), UMAP (McInnes, Healy, and Melville 2020)), clustering (e.g., HDBSCAN (Campello, Moulavi, and Sander 2013), k-means), and topic modeling. Our goals are to provide a modular and efficient set of tools that work across a variety of data sources. We use modern Swift concurrency and libraries like MLX to provide performant and safe implementations that run well on commodity Mac hardware. Curio is intended to enable the development of new qualitative data analysis tools for edge devices like laptops, tablets, and smartphones.
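To make the idea of modular stages concrete, the following is a minimal sketch of how an encoding stage and a clustering stage might be composed in Swift. It is not Curio's actual API: the names `DocumentEncoder`, `VectorClusterer`, `VocabularyEncoder`, `KMeansClusterer`, and `clusterLabels` are hypothetical, the encoder is a toy term counter standing in for a real embedding model, the k-means implementation uses a naive deterministic initialization, and the dimensionality reduction stage is omitted for brevity.

```swift
// Hypothetical stage protocols; these are illustrative and not taken from Curio.
protocol DocumentEncoder {
    func encode(_ document: String) -> [Double]
}

protocol VectorClusterer {
    func cluster(_ vectors: [[Double]]) -> [Int]
}

/// Toy encoder: counts exact occurrences of each vocabulary term.
/// A real pipeline would use an embedding model here instead.
struct VocabularyEncoder: DocumentEncoder {
    let vocabulary: [String]

    func encode(_ document: String) -> [Double] {
        let tokens = document.lowercased()
            .split(whereSeparator: { !$0.isLetter })
            .map { String($0) }
        return vocabulary.map { term in
            Double(tokens.filter { $0 == term }.count)
        }
    }
}

/// Toy k-means: the first k vectors serve as the initial centroids.
struct KMeansClusterer: VectorClusterer {
    let k: Int
    let iterations: Int

    func cluster(_ vectors: [[Double]]) -> [Int] {
        precondition(k > 0 && vectors.count >= k, "need at least k vectors")
        var centroids = Array(vectors.prefix(k))
        var labels = [Int](repeating: 0, count: vectors.count)
        for _ in 0..<iterations {
            // Assignment step: attach each vector to its nearest centroid.
            for (i, vector) in vectors.enumerated() {
                var best = 0
                var bestDistance = Double.infinity
                for (c, centroid) in centroids.enumerated() {
                    let distance = squaredDistance(vector, centroid)
                    if distance < bestDistance {
                        best = c
                        bestDistance = distance
                    }
                }
                labels[i] = best
            }
            // Update step: move each centroid to the mean of its members.
            for c in 0..<k {
                let members = vectors.indices.filter { labels[$0] == c }
                guard !members.isEmpty else { continue }
                var mean = [Double](repeating: 0, count: vectors[0].count)
                for m in members {
                    for d in mean.indices { mean[d] += vectors[m][d] }
                }
                centroids[c] = mean.map { $0 / Double(members.count) }
            }
        }
        return labels
    }

    private func squaredDistance(_ a: [Double], _ b: [Double]) -> Double {
        var sum = 0.0
        for (x, y) in zip(a, b) { sum += (x - y) * (x - y) }
        return sum
    }
}

/// Composes the two stages: encode every document, then cluster the vectors.
func clusterLabels<E: DocumentEncoder, C: VectorClusterer>(
    for documents: [String], encoder: E, clusterer: C
) -> [Int] {
    clusterer.cluster(documents.map { encoder.encode($0) })
}

let documents = [
    "my cat likes my dog",
    "the stock market fell",
    "cat and dog videos",
    "stock market rally",
]
let labels = clusterLabels(
    for: documents,
    encoder: VocabularyEncoder(vocabulary: ["cat", "dog", "stock", "market"]),
    clusterer: KMeansClusterer(k: 2, iterations: 5)
)
print(labels) // prints [0, 1, 0, 1]
```

Because each stage sits behind a small protocol, swapping the toy encoder for a learned embedding model, or k-means for a density-based method such as HDBSCAN, would not require changes to the composing function. This is one plausible way to realize the modularity goal stated above, not a description of how Curio is implemented.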
Acknowledgements
We acknowledge contributions from Jaden Geller (Geller 2024), Jan Krukowski (Krukowski 2024), and the Minish Lab (“Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World” 2024).