Quantifying the Impact of Text Preprocessing on IndoBERT Fine-Tuning for Indonesian Informal Culinary Sentiment Analysis

Authors

  • Rahmat Budianoor, Lambung Mangkurat University
  • Setyo Wahyu Saputro, Lambung Mangkurat University
  • Friska Abadi, Lambung Mangkurat University
  • Radityo Adi Nugroho, Lambung Mangkurat University
  • Andi Farmadi, Lambung Mangkurat University

DOI:

https://doi.org/10.62411/jcta.15980

Abstract

Indonesian culinary comments on social media platforms such as Instagram are characterized by informal spelling, regional language mixing, slang expressions, and emojis, posing substantial challenges for automated sentiment classification. While IndoBERT has demonstrated strong performance across Indonesian natural language processing tasks, the contribution of individual preprocessing components to fine-tuning performance on informal text remains underexplored, particularly in the culinary domain. This study addresses this gap by conducting a systematic preprocessing ablation study on IndoBERT-Base fine-tuning for Indonesian culinary sentiment classification, accompanied by a comparative evaluation against Naive Bayes with TF-IDF, SVM with TF-IDF, and BiLSTM as representative baselines. A dataset of 3,500 manually labeled Instagram culinary comments across three sentiment classes was used, with a stratified 80/10/10 split. Six preprocessing variants were evaluated under identical experimental conditions to isolate the contribution of each component. The results show that slang normalization is the most impactful single preprocessing step, yielding a macro F1-score gain of +0.0609 over the no-preprocessing baseline, while the full pipeline achieves an accuracy of 0.8800 and a macro F1-score of 0.8465. IndoBERT-Base with the full pipeline outperforms all baselines across all evaluation metrics. Per-class analysis reveals that the negative class achieves the lowest F1-score of 0.7600, with sarcastic expressions and Banjar regional vocabulary identified as primary sources of misclassification. These findings indicate that preprocessing decisions have a measurable and non-uniform effect on IndoBERT fine-tuning performance. In this study, slang normalization provides the most substantial individual contribution in bridging the vocabulary gap between informal user-generated text and the model’s pre-training distribution.
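As a rough illustration of two ideas named in the abstract, the sketch below shows dictionary-based slang normalization and a macro F1-score computed as the unweighted mean of per-class F1. It is a minimal sketch and not the authors' pipeline: the lexicon entries, the tokenization rule, and the names SLANG_LEXICON, normalize_slang, and macro_f1 are all assumptions introduced here for illustration.

```python
import re
from collections import Counter

# Hypothetical slang lexicon: illustrative entries only,
# not the lexicon used in the study.
SLANG_LEXICON = {
    "bgt": "banget",
    "enk": "enak",
    "gak": "tidak",
    "ga": "tidak",
    "yg": "yang",
}

# Assumed tokenization: runs of word characters, or single
# non-space symbols (punctuation, emoji code points).
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def normalize_slang(text, lexicon=SLANG_LEXICON):
    """Lowercase, tokenize, and replace tokens found in the lexicon."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(lexicon.get(tok, tok) for tok in tokens)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the macro average, a weak class (such as the negative class reported in the abstract) lowers the score regardless of how few examples it has.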

Author Biographies

Rahmat Budianoor, Lambung Mangkurat University

Department of Computer Science, Lambung Mangkurat University, Banjarbaru 70714, South Kalimantan, Indonesia

Setyo Wahyu Saputro, Lambung Mangkurat University

Department of Computer Science, Lambung Mangkurat University, Banjarbaru 70714, South Kalimantan, Indonesia

Friska Abadi, Lambung Mangkurat University

Department of Computer Science, Lambung Mangkurat University, Banjarbaru 70714, South Kalimantan, Indonesia

Radityo Adi Nugroho, Lambung Mangkurat University

Department of Computer Science, Lambung Mangkurat University, Banjarbaru 70714, South Kalimantan, Indonesia

Andi Farmadi, Lambung Mangkurat University

Department of Computer Science, Lambung Mangkurat University, Banjarbaru 70714, South Kalimantan, Indonesia

Published

2026-05-07

How to Cite

Budianoor, R., Saputro, S. W., Abadi, F., Nugroho, R. A., & Farmadi, A. (2026). Quantifying the Impact of Text Preprocessing on IndoBERT Fine-Tuning for Indonesian Informal Culinary Sentiment Analysis. Journal of Computing Theories and Applications, 3(4), 564–581. https://doi.org/10.62411/jcta.15980