Classification Email Spam using Naive Bayes Algorithm and Chi-Squared Feature Selection


  • Maylinna Rahayu Ningsih Universitas Negeri Semarang
  • Jumanto Unjung Universitas Negeri Semarang
  • Habib al Farih Universiti Tun Hussein Onn Malaysia
  • Much Aziz Muslim Universiti Tun Hussein Onn Malaysia



Spam email is a nuisance that can also harm its recipients. Machine learning is widely used to combat email spam because it can classify messages as spam or non-spam. This research combines the Naïve Bayes algorithm with Chi-Squared feature selection to classify spam emails, with the aim of increasing classification accuracy. Feature selection directs the model's attention to the features that are most strongly related to the target variable. In this study, Chi-Squared feature selection is applied with K = 2500 and the model reaches an accuracy of 98.83%, an improvement in performance over previous research. The results indicate that the Naïve Bayes model with Chi-Squared feature selection provides better classification performance.
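The combination described above can be illustrated with a minimal from-scratch sketch. It is not the authors' implementation: the toy corpus, the helper names (`chi2_scores`, `select_top_k`, `train_nb`, `predict`), the use of binary term presence for the Chi-Squared test, and the multinomial Naïve Bayes variant with Laplace smoothing are all assumptions made for illustration, and K is set to 4 here rather than the 2500 used in the study.

```python
import math
from collections import Counter

def chi2_scores(docs, labels):
    """Chi-squared statistic of each term's presence against the binary label."""
    n = len(docs)
    n_pos = sum(labels)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        # 2x2 contingency table: term present/absent vs. spam/non-spam
        a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == 1)
        b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == 0)
        c = n_pos - a
        d = (n - n_pos) - b
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with Laplace smoothing over the selected terms."""
    counts = {0: Counter(), 1: Counter()}
    class_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(t for t in doc if t in features)
    model = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        log_prior = math.log(class_docs[y] / len(docs))
        log_like = {t: math.log((counts[y][t] + 1) / (total + len(features)))
                    for t in features}
        model[y] = (log_prior, log_like)
    return model

def predict(model, doc):
    """Return the class (1 = spam, 0 = non-spam) with the highest log posterior."""
    def score(y):
        log_prior, log_like = model[y]
        # Terms outside the selected feature set are ignored
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)

# Hypothetical toy corpus (1 = spam, 0 = non-spam), invented for illustration
train_docs = [
    "win free prize now".split(),
    "free money claim prize".split(),
    "claim free offer now".split(),
    "meeting agenda for monday".split(),
    "project report for review".split(),
    "lunch meeting on monday".split(),
]
train_labels = [1, 1, 1, 0, 0, 0]

scores = chi2_scores(train_docs, train_labels)
features = select_top_k(scores, k=4)  # the study reports K = 2500
model = train_nb(train_docs, train_labels, features)

print(sorted(features))
print(predict(model, "claim your free prize".split()))
print(predict(model, "monday meeting for review".split()))
```

Restricting the classifier to the highest-scoring terms is what "directing the model's attention" amounts to in practice: terms with little association to the label never enter the likelihood computation.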

