Comprehensive Evaluation of LDA, NMF, and BERTopic's Performance on News Headline Topic Modeling
DOI:
https://doi.org/10.62411/jcta.11635
Keywords:
Coherence Evaluation, Model Comparison, News Headlines, Non-Native English, Topic Modeling
Abstract
Topic modeling is an integral component of text mining, employing diverse algorithms to uncover hidden themes within texts. This study compares the performance of prominent topic modeling techniques on news headlines, which are characterized by brevity and a specific linguistic style. Because the corpus originates from a non-native English-speaking country, the task carries an additional layer of complexity. Our research explores the feasibility of employing a committee approach for topic modeling, evaluating the efficacy and challenges of various methods in practical settings. We applied three techniques, Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and BERTopic, to create models with a fixed number of topics (n=40). These models were then tested on approximately 150,000 news headlines. To assess topic coherence, we utilized Word2Vec, human evaluators, and two large language models. Statistical tests confirmed the significance and impact of our findings. BERTopic achieved slightly higher coherence than NMF, and it consistently outperformed both NMF and LDA according to human and LLM evaluations. The notable disparity in LDA's performance relative to BERTopic and NMF underscores the importance of carefully selecting a topic modeling technique, as the choice can significantly influence the outcome of the analysis.
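As an illustration of what an embedding-based coherence score can look like, the sketch below scores a topic as the mean pairwise cosine similarity of its top words. This is one common formulation and an assumption here: the abstract does not state the exact metric used with Word2Vec. The function names, the toy three-dimensional vectors, and the example topics are all hypothetical; in practice the vectors would come from a trained Word2Vec model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_coherence(top_words, vectors):
    """Mean pairwise cosine similarity of a topic's top words.

    Words without a vector are skipped. Returns NaN when fewer than
    two words remain, because no pair can be formed.
    """
    known = [w for w in top_words if w in vectors]
    pairs = list(combinations(known, 2))
    if not pairs:
        return float("nan")
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy vectors invented for illustration only (not real Word2Vec output).
toy_vectors = {
    "election": [0.90, 0.10, 0.00],
    "senate": [0.80, 0.20, 0.10],
    "governor": [0.85, 0.15, 0.05],
    "goal": [0.10, 0.90, 0.10],
    "striker": [0.00, 0.95, 0.20],
}

# Hypothetical topics: one thematically tight, one mixing two themes.
politics_topic = ["election", "senate", "governor"]
mixed_topic = ["election", "goal", "striker"]

coherent_score = topic_coherence(politics_topic, toy_vectors)
mixed_score = topic_coherence(mixed_topic, toy_vectors)
print(f"politics: {coherent_score:.3f}  mixed: {mixed_score:.3f}")
```

Under this formulation, a topic whose top words sit close together in the embedding space scores higher than one that mixes unrelated words, which is the intuition behind comparing models by coherence.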
License
Copyright (c) 2024 Olusola Babalola, Bolanle Ojokoh, Olutayo Boyinbode
This work is licensed under a Creative Commons Attribution 4.0 International License.