Keyphrase Extraction on Covid-19 Tweets Based on Doc2Vec and YAKE
DOI: https://doi.org/10.33633/jais.v6i1.4454
Abstract
Keyword and keyphrase extraction is one of the foundations for text processing operations such as summarization and document clustering. YAKE is an unsupervised, collection-independent keyphrase extraction technique: it requires neither a corpus nor linguistic tools such as NER and POS tagging. However, applying YAKE to microblogging documents such as tweets often yields less representative keyphrases, because a single tweet contains too few words for ranking. This paper addresses the problem by using Doc2Vec to find similar tweets during keyphrase extraction, so that more words are available to the YAKE ranking process. Covid-19-related tweets are used as the dataset, since the topic is currently widely discussed on social media, to show that the proposed approach can improve keyphrase extraction performance.
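The idea described in the abstract (retrieve similar tweets, then rank terms over the enlarged text) can be illustrated with a small sketch. This is not the authors' implementation. To keep it dependency-free, bag-of-words cosine similarity stands in for Doc2Vec retrieval and a plain frequency count stands in for YAKE scoring; the function names, the stopword list, and the sample tweets are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, corpus, top_n=2):
    """Stand-in for Doc2Vec retrieval: rank other tweets by bag-of-words cosine."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus if doc != query]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only neighbours that share at least one term with the query.
    return [doc for score, doc in scored[:top_n] if score > 0]


def expand(query, corpus, top_n=2):
    """Concatenate a tweet with its nearest neighbours so the ranker sees more words."""
    return " ".join([query] + most_similar(query, corpus, top_n))


def rank_terms(text, stopwords, top=3):
    """Stand-in for YAKE: rank non-stopword terms by raw frequency."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return [term for term, _ in counts.most_common(top)]


# Hypothetical data, for illustration only.
STOPWORDS = {"the", "a", "is", "in", "of", "to", "for", "and", "are", "on"}
tweets = [
    "vaccine rollout starts in the city",
    "the vaccine rollout is delayed",
    "new vaccine rollout schedule for the city",
    "football match tonight",
]

expanded = expand(tweets[0], tweets)
print(rank_terms(expanded, STOPWORDS))
```

On a single tweet every term occurs once, so a frequency-based ranker cannot separate them; after expansion, terms shared with neighbouring tweets rise to the top. In a fuller version, the retrieval step would presumably use a trained Doc2Vec model (for example from gensim) and the ranking step would use the yake package in place of these stand-ins.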
License
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).