Integrating Hybrid Statistical and Unsupervised LSTM-Guided Feature Extraction for Breast Cancer Detection

Authors

  • De Rosal Ignatius Moses Setiadi, Universitas Dian Nuswantoro, https://orcid.org/0000-0001-6615-4457
  • Arnold Adimabua Ojugo, Federal University of Petroleum Resources Effurun
  • Octara Pribadi, STMIK TIME
  • Etika Kartikadarma, Universitas Dian Nuswantoro
  • Bimo Haryo Setyoko, Universitas Islam Negeri Salatiga
  • Suyud Widiono, Universitas Teknologi Yogyakarta
  • Robet Robet, STMIK TIME
  • Tabitha Chukwudi Aghaunor, Robert Morris University
  • Eferhire Valentine Ugbotu, University of Salford

DOI:

https://doi.org/10.62411/jcta.12698

Keywords:

Breast cancer detection, Ensemble feature selection, Feature fusion, Healthcare AI, Imbalance problem, Interpretable machine learning, Unsupervised LSTM

Abstract

Breast cancer is the most prevalent cancer among women worldwide, and early, accurate diagnosis is needed to reduce mortality. This study proposes a hybrid classification pipeline that integrates Hybrid Statistical Feature Selection (HSFS) with unsupervised LSTM-guided feature extraction for breast cancer detection using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. First, 20 features were selected using HSFS based on Mutual Information, Chi-square, and Pearson Correlation. To address class imbalance, the training set was balanced using the Synthetic Minority Over-sampling Technique (SMOTE). An LSTM encoder then extracted non-linear latent features from the selected features. A fusion strategy concatenated the statistical and latent features, followed by re-selection of the top 30 features. The final classification was performed using a Support Vector Machine (SVM) with an RBF kernel and evaluated using 5-fold cross-validation and a held-out test set. Experimental results showed that the proposed method achieved an average training accuracy of 98.13%, F1-score of 98.13%, and AUC-ROC of 99.55%. On the held-out test set, the model reached an accuracy of 99.30%, precision of 100%, F1-score of 99.05%, and AUC-ROC of 99.73%. The proposed pipeline demonstrates improved generalization and interpretability compared to existing methods such as LightGBM-PSO, DHH-GRU, and ensemble deep networks. These results highlight the effectiveness of combining statistical selection and LSTM-based latent feature encoding in a balanced classification framework.
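
The statistical-selection and classification stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rank-averaging rule inside the hypothetical `hsfs_top_k` helper, the 75/25 split, the scalers, and the random seeds are all assumptions made for illustration, and the SMOTE balancing, LSTM encoder, and fusion stages are deliberately left out so that the example needs only scikit-learn. It uses scikit-learn's bundled copy of the WDBC dataset.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC


def hsfs_top_k(X, y, k=20, seed=0):
    """Pick k features by average rank over MI, chi-square and |Pearson r|.

    Averaging the three rankings is an assumption of this sketch; the
    abstract only names the three criteria, not how they are combined.
    """
    X_pos = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative inputs
    mi = mutual_info_classif(X, y, random_state=seed)
    chi_scores, _ = chi2(X_pos, y)
    pearson = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # rankdata ranks ascending, so negate scores to give the best feature rank 1
    ranks = [rankdata(-scores) for scores in (mi, chi_scores, pearson)]
    mean_rank = np.mean(ranks, axis=0)
    return np.sort(np.argsort(mean_rank)[:k])


X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Held-out test split; the 25% size is an assumption, not taken from the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Select features on the training split only, to keep the test set unseen.
selected = hsfs_top_k(X_train, y_train, k=20)

# RBF-kernel SVM evaluated with 5-fold cross-validation, then on the test set.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train[:, selected], y_train, cv=cv)

model.fit(X_train[:, selected], y_train)
test_acc = model.score(X_test[:, selected], y_test)

print("selected feature indices:", selected)
print("mean 5-fold CV accuracy:", cv_scores.mean())
print("held-out test accuracy:", test_acc)
```

Because this sketch omits balancing and the latent LSTM features, its scores should not be expected to match the figures reported above. Selection is also performed once on the whole training split rather than inside each fold, which keeps the example short but makes the cross-validation estimate slightly optimistic.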

Author Biographies

De Rosal Ignatius Moses Setiadi, Universitas Dian Nuswantoro

Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang 50131, Indonesia

Arnold Adimabua Ojugo, Federal University of Petroleum Resources Effurun

Department of Computer Science, Federal University of Petroleum Resources Effurun, Delta State 330102, Nigeria

Octara Pribadi, STMIK TIME

Department of Informatics Engineering, STMIK TIME, Medan 20212, Indonesia

Etika Kartikadarma, Universitas Dian Nuswantoro

Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang 50131, Indonesia

Bimo Haryo Setyoko, Universitas Islam Negeri Salatiga

Pusat Teknologi Informasi dan Pangkalan Data, Universitas Islam Negeri Salatiga, Salatiga 50716, Indonesia

Suyud Widiono, Universitas Teknologi Yogyakarta

Department of Computer Engineering, Faculty of Science and Technology, Universitas Teknologi Yogyakarta, Yogyakarta 55285, Indonesia

Robet Robet, STMIK TIME

Department of Informatics Engineering, STMIK TIME, Medan 20212, Indonesia

Tabitha Chukwudi Aghaunor, Robert Morris University

Department of Computer Science, School of Data Intelligence and Technology, Robert Morris University, Pittsburgh, PA 15108, United States

Eferhire Valentine Ugbotu, University of Salford

Department of Data Science, College of Science and Engineering, University of Salford, Manchester M5 4WT, United Kingdom

Published

2025-05-05

How to Cite

Setiadi, D. R. I. M., Ojugo, A. A., Pribadi, O., Kartikadarma, E., Setyoko, B. H., Widiono, S., Robet, R., Aghaunor, T. C., & Ugbotu, E. V. (2025). Integrating Hybrid Statistical and Unsupervised LSTM-Guided Feature Extraction for Breast Cancer Detection. Journal of Computing Theories and Applications, 2(4), 536–552. https://doi.org/10.62411/jcta.12698