An Explainable Multimodal Framework for Chest X-Ray Alert Classification Using Radiology Reports and Images

Authors

  • Edy Winarno, Universitas Muhammadiyah Semarang
  • Indah Manfaati Nur, Universitas Muhammadiyah Semarang
  • Abdul Karim, Hallym University
  • Saeful Amri, Universitas Muhammadiyah Semarang
  • Ismi Elya Wirdati, Universitas Muhammadiyah Semarang
  • Prajanto Wahyu Adi, Universitas Diponegoro

DOI:

https://doi.org/10.62411/jcta.16023

Keywords:

Alert classification, Chest X-ray, Clinical NLP, Explainable artificial intelligence, Grad-CAM, Late fusion, Multimodal learning, Radiology reports

Abstract

Artificial intelligence has the potential to support radiology workflows by assisting in the identification of cases that may require additional clinical attention. However, alert-oriented medical AI systems should provide not only classification outputs but also interpretable evidence that can be reviewed and audited by clinicians. This study develops and evaluates an explainable multimodal framework for binary chest X-ray alert classification using paired radiology reports and chest X-ray images. The text branch employs TF-IDF n-gram features with a class-balanced Logistic Regression classifier, while the image branch fine-tunes a pretrained ResNet18 model. The two branches are integrated through probability-level late fusion using a validation-selected fusion weight. Explainability is implemented in a modality-specific manner: global coefficient analysis is used to identify influential textual cues, while Grad-CAM heatmaps are used to visualize salient image regions. Experiments were conducted on paired samples from the Open-i/IU X-Ray dataset using text-only, image-only, and fusion-based evaluation settings. Additional analyses include case-level complementarity analysis, bootstrap confidence intervals for ROC-AUC, shortcut-feature inspection, and qualitative Grad-CAM auditing. The results indicate that the text modality provides the dominant predictive signal under the current proxy-label setting. Late fusion produced a small descriptive improvement on the test set, increasing accuracy from 0.8533 to 0.8667, F1-score from 0.8817 to 0.8936, and ROC-AUC from 0.8936 to 0.9025 compared with the text-only baseline. However, the observed ROC-AUC improvement was not statistically conclusive based on bootstrap analysis. These findings suggest that the proposed framework is useful as a reproducible and auditable multimodal prototype, while also highlighting important limitations, including proxy-label ambiguity, potential label leakage from radiology reports, limited image-branch contribution, lack of external validation, and the need for stronger explanation and calibration assessment.
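
The probability-level late fusion and the bootstrap check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that the fusion weight is selected on validation data and that bootstrap confidence intervals were computed for ROC-AUC. The selection criterion (validation ROC-AUC), the weight grid, the paired resampling of the AUC difference, the toy probabilities, and all function names below are assumptions made for illustration.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fuse(p_text, p_image, w):
    """Late fusion of branch probabilities; w is the weight on the text branch."""
    return [w * t + (1.0 - w) * i for t, i in zip(p_text, p_image)]


def select_weight(y_val, p_text_val, p_image_val, grid=None):
    """Pick the weight that maximises validation ROC-AUC (assumed criterion and grid)."""
    if grid is None:
        grid = [k / 20 for k in range(21)]  # 0.00, 0.05, ..., 1.00
    return max(grid, key=lambda w: roc_auc(y_val, fuse(p_text_val, p_image_val, w)))


def bootstrap_auc_diff(y, p_a, p_b, n_boot=1000, seed=0):
    """Percentile 95% interval for AUC(p_a) - AUC(p_b), resampling cases in pairs."""
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # skip resamples that contain a single class
            continue
        auc_a = roc_auc(yb, [p_a[i] for i in idx])
        auc_b = roc_auc(yb, [p_b[i] for i in idx])
        diffs.append(auc_a - auc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


if __name__ == "__main__":
    # Toy validation data (hypothetical, not taken from the study).
    y_val = [1, 0, 1, 0, 1, 0]
    p_text = [0.9, 0.2, 0.4, 0.5, 0.8, 0.1]
    p_image = [0.6, 0.4, 0.7, 0.3, 0.5, 0.45]

    w = select_weight(y_val, p_text, p_image)
    fused = fuse(p_text, p_image, w)
    low, high = bootstrap_auc_diff(y_val, fused, p_text, n_boot=200)
    print("selected text weight:", w)
    print("95% interval for AUC(fused) - AUC(text):", (low, high))
```

An interval for the AUC difference that contains zero would correspond to the abstract's reading that the fusion gain is descriptive rather than statistically conclusive. In practice the weight would be chosen on the validation split and the interval computed on the held-out test split.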

Author Biographies

Edy Winarno, Universitas Muhammadiyah Semarang

Department of Information Technology, Faculty of Engineering and Computer Science, Universitas Muhammadiyah Semarang, Semarang 50273, Indonesia

Indah Manfaati Nur, Universitas Muhammadiyah Semarang

Department of Statistics, Faculty of Mathematics and Natural Sciences, Universitas Muhammadiyah Semarang, Semarang 50273, Indonesia

Abdul Karim, Hallym University

College of Information Science / AI-X, Hallym University, Chuncheon 24252, South Korea

Saeful Amri, Universitas Muhammadiyah Semarang

Department of Data Science, Faculty of Science and Agricultural Technology, Universitas Muhammadiyah Semarang, Semarang 50273, Indonesia

Ismi Elya Wirdati, Universitas Muhammadiyah Semarang

Faculty of Public Health, Universitas Muhammadiyah Semarang, Semarang 50273, Indonesia

Prajanto Wahyu Adi, Universitas Diponegoro

Department of Computer Science / Informatics, Faculty of Science and Mathematics, Universitas Diponegoro, Semarang 50275, Indonesia

Published

2026-05-23

How to Cite

Winarno, E., Nur, I. M., Karim, A., Amri, S., Wirdati, I. E., & Adi, P. W. (2026). An Explainable Multimodal Framework for Chest X-Ray Alert Classification Using Radiology Reports and Images. Journal of Computing Theories and Applications, 3(4), 647–666. https://doi.org/10.62411/jcta.16023