Enhancing Software Defect Prediction through Hybrid Multi-Filter Feature Selection and Imbalance Handling

Authors

Muhammad Khalid Maulana, Lambung Mangkurat University
Setyo Wahyu Saputro, Lambung Mangkurat University
Mohammad Reza Faisal, Lambung Mangkurat University
Radityo Adi Nugroho, Lambung Mangkurat University
As’ary Ramadhan, Lambung Mangkurat University

DOI:

https://doi.org/10.62411/jcta.15943

Keywords:

Backward Elimination, Class Imbalance, Feature Selection, Logistic Regression, Mutual Information, Naïve Bayes, Software Defect Prediction, SMOTE-Tomek

Abstract

Software Defect Prediction (SDP) aims to identify defective modules early in the software development lifecycle, improving software quality and reducing maintenance costs. However, SDP datasets commonly suffer from high dimensionality, feature redundancy, and class imbalance, which can degrade model performance and stability. This study proposes a hybrid feature selection framework to address these challenges. The framework starts from Combined Correlation and Mutual Information (CONMI), which combines the Pearson Correlation Coefficient (PCC) and Mutual Information (MI) to capture both linear and nonlinear feature relevance. The CONMI-ranked features are then refined through Top-K selection, correlation-based filtering to reduce multicollinearity, and Backward Elimination (BE) to obtain the final feature subset. To address class imbalance, SMOTE-Tomek is applied, which combines SMOTE over-sampling with Tomek-link data cleaning. Experiments are conducted on twelve NASA MDP datasets using Logistic Regression (LR) and Naïve Bayes (NB) classifiers. The results show that the proposed framework consistently achieves the best performance, with LR combined with SMOTE-Tomek obtaining the highest average AUC of 0.7923 ± 0.0714, while NB achieves 0.7554 ± 0.0580. A paired t-test indicates that the proposed method significantly outperforms MI+SMOTE-Tomek and BE+SMOTE-Tomek for LR, whereas no significant differences are observed for NB. Beyond overall classification performance (AUC), the proposed approach also improves minority-class detection, as reflected in higher Recall and F1-score. Overall, the proposed hybrid framework provides an effective and reliable solution for software defect prediction, particularly for high-dimensional and imbalanced datasets.
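
As an illustration of the filter stages described in the abstract, the following minimal Python sketch ranks features by a combined PCC and MI score, keeps the Top-K, and then drops any feature that is highly correlated with one already selected. It is not the authors' implementation. The abstract does not state how CONMI combines the two measures, so the plain average of min-max-normalized scores, the equal-width binning used to estimate MI, the 0.9 correlation threshold, the function names, and the toy data are all assumptions made for this example. Backward Elimination, SMOTE-Tomek resampling, and classifier training are omitted.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def mutual_information(x, y, bins=4):
    """MI in nats between a feature (cut into equal-width bins) and a discrete label."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature
    xb = [min(int((v - lo) / width), bins - 1) for v in x]
    n = len(x)
    pxy, px, py = Counter(zip(xb, y)), Counter(xb), Counter(y)
    return sum(c / n * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def minmax(values):
    """Rescale a list of scores to [0, 1]; all zeros if the scores are identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def conmi_select(features, y, k=3, corr_threshold=0.9):
    """Rank by combined score, keep the top k, then drop near-duplicate features."""
    names = list(features)
    pcc = minmax([abs(pearson(features[f], y)) for f in names])
    mi = minmax([mutual_information(features[f], y) for f in names])
    # ASSUMPTION: combine the two normalized scores by a plain average.
    score = {f: (p + m) / 2 for f, p, m in zip(names, pcc, mi)}
    top_k = sorted(names, key=score.get, reverse=True)[:k]
    kept = []
    for f in top_k:  # highest score first, so the better feature of a pair survives
        if all(abs(pearson(features[f], features[g])) < corr_threshold for g in kept):
            kept.append(f)
    return kept, score


# Hypothetical toy data: 8 modules, 4 metrics, label 1 = defective.
y = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "loc": [10, 12, 11, 13, 30, 32, 31, 33],
    "loc_x2": [20, 24, 22, 26, 60, 64, 62, 66],  # redundant copy of "loc"
    "cyclomatic": [1, 2, 3, 5, 4, 6, 7, 8],
    "noise": [3, 7, 6, 2, 1, 8, 5, 4],
}

selected, scores = conmi_select(features, y, k=3, corr_threshold=0.9)
print(selected)  # ['loc', 'cyclomatic']
```

On this toy data the uninformative feature falls outside the Top-K and the redundant copy is removed by the correlation filter, leaving two features that a wrapper step such as BE could then refine further.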

Author Biographies

Muhammad Khalid Maulana, Lambung Mangkurat University

Department of Computer Science, Faculty of Mathematics and Natural Science, Lambung Mangkurat University, Banjarbaru 70714, Indonesia

Setyo Wahyu Saputro, Lambung Mangkurat University

Department of Computer Science, Faculty of Mathematics and Natural Science, Lambung Mangkurat University, Banjarbaru 70714, Indonesia

Mohammad Reza Faisal, Lambung Mangkurat University

Department of Computer Science, Faculty of Mathematics and Natural Science, Lambung Mangkurat University, Banjarbaru 70714, Indonesia

Radityo Adi Nugroho, Lambung Mangkurat University

Department of Computer Science, Faculty of Mathematics and Natural Science, Lambung Mangkurat University, Banjarbaru 70714, Indonesia

As’ary Ramadhan, Lambung Mangkurat University

Department of Computer Science, Faculty of Mathematics and Natural Science, Lambung Mangkurat University, Banjarbaru 70714, Indonesia

Published

2026-04-24

How to Cite

Maulana, M. K., Saputro, S. W., Faisal, M. R., Nugroho, R. A., & Ramadhan, A. (2026). Enhancing Software Defect Prediction through Hybrid Multi-Filter Feature Selection and Imbalance Handling. Journal of Computing Theories and Applications, 3(4), 518–534. https://doi.org/10.62411/jcta.15943

Issue

Vol. 3 No. 4 (2026): JCTA 3(4) 2026

Section

Articles

License

Copyright (c) 2026 Muhammad Khalid Maulana, Setyo Wahyu Saputro, Mohammad Reza Faisal, Radityo Adi Nugroho, As’ary Ramadhan

This work is licensed under a Creative Commons Attribution 4.0 International License.
