Evaluating Open-Source Machine Learning Project Quality Using SMOTE-Enhanced and Explainable ML/DL Models
DOI:
https://doi.org/10.62411/jcta.14793
Keywords:
Explainable AI, GitHub Projects, Machine Learning, Open-Source Software, Quality Assessment, Software Engineering, SMOTE-Tomek, Deep Neural Networks
Abstract
The rapid growth of open-source software (OSS) in machine learning (ML) has intensified the need for reliable, automated methods to assess project quality, particularly as OSS increasingly underpins critical applications in science, industry, and public infrastructure. This study evaluates the effectiveness of a diverse set of machine learning and deep learning (ML/DL) algorithms for classifying GitHub OSS ML projects as engineered or non-engineered using a SMOTE-enhanced and explainable modeling pipeline. The dataset used in this research includes both numerical and categorical attributes representing documentation, testing, architecture, community engagement, popularity, and repository activity. After handling missing values, standardizing numerical features, encoding categorical variables, and addressing the inherent class imbalance using the Synthetic Minority Oversampling Technique (SMOTE), seven different classifiers, namely K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), XGBoost (XGB), Logistic Regression (LR), Support Vector Machine (SVM), and a Deep Neural Network (DNN), were trained and evaluated. Results show that LR (84%) and DNN (85%) outperform all other models, indicating that both linear and moderately deep non-linear architectures can effectively capture key quality indicators in OSS ML projects. Additional explainability analysis using SHAP reveals consistent feature importance across models, with documentation quality, unit testing practices, architectural clarity, and repository dynamics emerging as the strongest predictors. These findings demonstrate that automated, explainable ML/DL-based quality assessment is both feasible and effective, offering a practical pathway for improving OSS sustainability, guiding contributor decisions, and enhancing trust in ML-based systems that depend on open-source components.
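The abstract names SMOTE as the step that addresses class imbalance before the classifiers are trained. The sketch below illustrates the core idea in plain Python: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. It is a minimal illustration under stated assumptions, not the authors' implementation. The function names `smote` and `balance`, the value of `k`, the binary-label setup, and the toy data are all hypothetical; the abstract does not state which library or parameters were used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature vectors
              (assumed to be already scaled and encoded).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point,
        # excluding the base point itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class of a binary problem up to the majority count."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Hypothetical toy data: 3 minority samples (label 1), 5 majority samples (label 0).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1], [1.0, 0.8]]
y = [1, 1, 1, 0, 0, 0, 0, 0]
X_bal, y_bal = balance(X, y, minority_label=1, k=2)
print(y_bal.count(0), y_bal.count(1))  # 5 5
```

Because the interpolation is numeric, one-hot encoded categorical columns can take fractional values in the synthetic rows. Oversampling is also usually applied to the training split only, so that synthetic points do not leak into the evaluation data.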
License
Copyright (c) 2025 Ali Hamza, Wahid Hussain, Hassan Iftikhar, Aziz Ahmad, Alamgir Md Shamim

This work is licensed under a Creative Commons Attribution 4.0 International License.