Implementation of Feature Selection Chi-Square to Improve the Accuracy of the Classification Model Using the Random Forest Algorithm on Coronary Artery Disease

Authors

  • Ida Bagus Satya Mahendra
  • Tatik Widiharih
  • Fajar Agung Nugroho
  • Priyo Sidik Sasongko

DOI:

https://doi.org/10.33633/jais.v9i1.7858

Abstract

Coronary heart disease is a condition in which the blood vessels of the heart become blocked. Because the disease can be fatal, it is important to gather as much information about it as possible. Data mining can build a model that predicts, based on a person's symptoms, whether that person has heart disease. How well a model performs classification can be determined from its accuracy value, and this value can still be improved by performing Feature Selection. The research object used in this research is a coronary heart disease dataset obtained from the Kaggle website. The classification method used is the Random Forest algorithm, a classifier built from multiple Decision Trees, chosen because it has been shown to produce good accuracy in several previous studies. The Feature Selection method used is the Chi-Square hypothesis test, which determines whether each independent variable has an effect on the dependent variable. This research compared the accuracy of a model built without Feature Selection against that of a model built with Feature Selection. The model without Chi-Square Feature Selection produced an accuracy value of 96.05%, and the model with Chi-Square Feature Selection produced an accuracy value of 97.33%.
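The Chi-Square selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the feature names `chest_pain` and `noise`, the helper names, and the 5% significance level are all assumptions made for illustration. The sketch computes the Pearson Chi-Square statistic between each categorical feature and the class label, then keeps only the features whose statistic exceeds the critical value for their degrees of freedom.

```python
from collections import Counter

# Upper 5% critical values of the chi-square distribution, keyed by
# degrees of freedom. Only small tables are covered in this sketch.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(feature, target):
    """Return (statistic, degrees of freedom) for two categorical columns."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # contingency table counts
    row_totals = Counter(feature)
    col_totals = Counter(target)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def select_features(columns, target):
    """Keep features that reject independence from the target at the 5% level."""
    selected = []
    for name, values in columns.items():
        stat, dof = chi_square_statistic(values, target)
        if dof > 0 and stat > CRITICAL_005[dof]:
            selected.append(name)
    return selected


# Hypothetical toy data: 10 patients with the disease (1), 10 without (0).
target = [1] * 10 + [0] * 10
columns = {
    # Strongly associated with the label (assumed feature, not from the dataset).
    "chest_pain": [1] * 9 + [0] + [0] * 9 + [1],
    # Evenly split within each class, so it carries no information about the label.
    "noise": [1, 0] * 10,
}

selected = select_features(columns, target)
print(selected)
```

In this toy example the statistic for `chest_pain` is about 12.8, well above the critical value of 3.841 for one degree of freedom, while the statistic for `noise` is 0. A Random Forest would then be trained once on all features and once on the selected subset, and the two accuracy values compared.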

Published

2024-04-02