Dataset and Feature Analysis for Diabetes Mellitus Classification using Random Forest

Fachrul Mustofa; Achmad Nuruddin Safriandono; Ahmad Rofiqul Muslikh; De Rosal Ignatius Moses Setiadi

doi:10.33633/jcta.v1i1.9190

Authors

Fachrul Mustofa, Dian Nuswantoro University
Achmad Nuruddin Safriandono, Sultan Fatah University
Ahmad Rofiqul Muslikh, University of Merdeka Malang
De Rosal Ignatius Moses Setiadi, Dian Nuswantoro University, http://orcid.org/0000-0001-6615-4457

Keywords:

Classification Diabetes Types, Comprehensive analysis for diabetes types classification, Prediction for health technology, Random Forest, Feature Analysis, Abelvikas Dataset

Abstract

Diabetes Mellitus is a dangerous disease, and according to the World Health Organization (WHO), diabetes will be one of the main causes of death by 2030. One of the most popular diabetes datasets is PIMA Indians, which has been widely tested with various machine learning (ML) methods and even deep learning (DL). On average, however, ML methods have not achieved good accuracy on it. The quality of the dataset and its features is the most influential factor in this case, so a deeper investigation of this dataset is needed. This research analyzes and compares the PIMA Indians and Abelvikas datasets using the Random Forest (RF) method. Both datasets are imbalanced; the Abelvikas dataset is more imbalanced and has a larger number of classes, so it is more complex. RF was chosen because it is one of the ML methods with the best results on various diabetes datasets. Testing was done with 3, 5, 7, 10, and 15 trees, and k-fold validation was also applied to obtain valid results. The two datasets produced sharply contrasting results: Abelvikas reached 100% accuracy, precision, and recall, whereas PIMA Indians achieved only 75% accuracy, 87% precision, and 80% recall at best. This indicates that the features in the Abelvikas dataset are much better because they are supported by more complete glucose features.
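
The evaluation protocol described in the abstract (Random Forest with 3, 5, 7, 10, and 15 trees, scored with k-fold validation) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, uses synthetic imbalanced data in place of the PIMA Indians and Abelvikas datasets, and picks 5 stratified folds and macro-averaged precision and recall as illustrative choices; the function name evaluate_tree_counts is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data.
# Assumption: this is NOT the PIMA Indians or Abelvikas dataset.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=5,
    weights=[0.65, 0.35],
    random_state=0,
)


def evaluate_tree_counts(X, y, tree_counts=(3, 5, 7, 10, 15), n_splits=5):
    """Return mean cross-validated accuracy, precision, and recall per tree count."""
    # Stratified folds keep the class ratio similar in every fold,
    # which matters when the classes are imbalanced.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for n_trees in tree_counts:
        model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        scores = cross_validate(
            model,
            X,
            y,
            cv=cv,
            scoring=("accuracy", "precision_macro", "recall_macro"),
        )
        results[n_trees] = {
            "accuracy": float(scores["test_accuracy"].mean()),
            "precision": float(scores["test_precision_macro"].mean()),
            "recall": float(scores["test_recall_macro"].mean()),
        }
    return results


if __name__ == "__main__":
    for n_trees, metrics in evaluate_tree_counts(X, y).items():
        print(n_trees, {name: round(value, 3) for name, value in metrics.items()})
```

Replacing X and y with the feature matrix and labels of a real dataset would reproduce the same style of comparison. The scores printed here come from synthetic data and say nothing about the results reported in the paper.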
