Keyphrase Extraction on Covid-19 Tweets Based on Doc2Vec and YAKE


  • Fahri Firdausillah, Universitas Dian Nuswantoro
  • Erika Devi Udayanti, Universitas Dian Nuswantoro



Keyword and keyphrase extraction is one of the foundations for several text processing operations, such as summarization and document clustering. YAKE is an unsupervised, corpus-independent keyphrase extraction technique: it requires neither a corpus nor linguistic tools such as NER and POS tagging. However, applying YAKE to microblogging documents such as tweets often yields less representative keyphrases, because each document contains too few words for ranking. This paper addresses the problem by using Doc2Vec to find similar tweets during keyphrase extraction, so that more words are available to the YAKE ranking process. Covid-19-related tweets are used as the dataset, as the topic is currently widely discussed on social media, to show that the proposed approach can improve keyphrase extraction performance.
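
The idea of expanding a short tweet with similar tweets before ranking can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: a bag-of-words cosine similarity stands in for Doc2Vec retrieval, and a plain term-frequency ranking stands in for YAKE scoring, so that the sketch runs with the standard library alone. All function names, the stopword list, and the toy tweets are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "is", "in", "of", "to", "and", "for", "on", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(target, corpus, k=2):
    """Return the k tweets most similar to target.

    Stand-in for Doc2Vec retrieval: similarity is computed on raw
    bag-of-words counts instead of learned document vectors.
    """
    target_vec = Counter(tokenize(target))
    scored = [
        (cosine(target_vec, Counter(tokenize(t))), t)
        for t in corpus
        if t != target
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def rank_keywords(text, top=3):
    """Rank non-stopword tokens by frequency (stand-in for YAKE scoring)."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top)]


def extract_with_expansion(target, corpus, k=2, top=3):
    """Append the k most similar tweets to the target, then rank keywords."""
    expanded = " ".join([target] + most_similar(target, corpus, k))
    return rank_keywords(expanded, top)


corpus = [
    "vaccine rollout starts today",
    "vaccine rollout delayed in the city",
    "city vaccine rollout schedule announced",
    "football match tonight",
]
keywords = extract_with_expansion(corpus[0], corpus)
print(keywords)
```

On its own, the first toy tweet gives every word a count of one, so a frequency-based ranker has nothing to separate them. After expansion, words shared with the similar tweets accumulate higher counts, which mirrors the motivation stated above. In a real pipeline the two stand-ins would be replaced by a trained Doc2Vec model and a YAKE extractor.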




