Big Data-Driven Health Risk Stratification: A Health Index-Based Approach Using Feature Importance and PySpark
DOI:
https://doi.org/10.62411/jcta.12327
Keywords:
Big Data Analytics, Feature Importance, Health Index, Heart Disease Risk, PySpark, Risk Stratification
Abstract
Health risk stratification is crucial for preventive healthcare, yet existing models often rely on binary classification or generalized disease prediction, neglecting personalized health indicators and graded risk levels. Many studies apply feature selection techniques such as Relief and Univariate Selection without quantifying the weighted impact of each feature. To address these gaps, this study introduces a Big Data-driven Health Index (HI) framework that uses PySpark for scalable health risk stratification. The HI is computed as a weighted sum of health-related features, with feature weights derived from SHAP Analysis, XGBoost, Random Forest, and Correlation Analysis. PySpark enables efficient processing of large-scale health data, and individuals are classified into Low Risk and High Risk categories. Optimal classification thresholds are determined using the Youden Index from the ROC curve to balance sensitivity and specificity. Personalized health recommendations are generated for each risk category to guide preventive interventions. Performance evaluation shows that Correlation Analysis achieves 100% precision and 98.90% recall, outperforming the other methods. SHAP prioritizes recall but has low precision, while XGBoost and Random Forest improve precision but struggle with recall. By leveraging Big Data techniques with PySpark, this study enhances computational efficiency, scalability, and classification accuracy, addressing limitations of prior research and providing a robust data-driven approach to personalized health monitoring.
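The two computations named in the abstract, a weighted-sum Health Index and a threshold chosen with the Youden Index, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than PySpark, and the feature names (`smoker`, `bmi`, `age`), the weights, the scaling of features to [0, 1], and the toy records are all hypothetical assumptions made for illustration. The weights are assumed to be non-negative and are normalized to sum to 1.

```python
from typing import Dict, List, Sequence, Tuple


def health_index(record: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values, with weights normalized to sum to 1.

    Assumes non-negative weights and features already scaled to [0, 1].
    """
    total = sum(weights.values())
    return sum(w / total * record[name] for name, w in weights.items())


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A record is predicted High Risk when its score is >= threshold.
    Every distinct score is tried as a candidate threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        # J = TPR - FPR, which equals sensitivity + specificity - 1
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical feature weights (for example, from a feature-importance method).
weights = {"smoker": 0.2, "bmi": 0.3, "age": 0.5}

# Hypothetical records with features scaled to [0, 1]; label 1 = heart disease.
records: List[Dict[str, float]] = [
    {"smoker": 0, "bmi": 0.2, "age": 0.1},
    {"smoker": 0, "bmi": 0.4, "age": 0.3},
    {"smoker": 1, "bmi": 0.3, "age": 0.2},
    {"smoker": 1, "bmi": 0.7, "age": 0.8},
    {"smoker": 0, "bmi": 0.9, "age": 0.9},
    {"smoker": 1, "bmi": 0.6, "age": 0.6},
]
labels = [0, 0, 0, 1, 1, 1]

scores = [health_index(r, weights) for r in records]
threshold, j = youden_threshold(scores, labels)
risk = ["High" if s >= threshold else "Low" for s in scores]
```

In this toy data the two classes are fully separable, so the Youden Index reaches its maximum of 1 at the lowest score among the positive records. On real data J is typically below 1, and the selected threshold is the trade-off point between sensitivity and specificity that the abstract refers to.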
Copyright (c) 2025 Oluwasegun Abiodun Abioye, Martins Ekata Irhebhude

This work is licensed under a Creative Commons Attribution 4.0 International License.