PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY

 

e-ISSN 2231-8526
ISSN 0128-7680

Optimizing Bagged Trees in an Ensemble Classifier for Improved Prediction of Diabetes Prevalence in Women

Jose Candia Jr., Airish Mae Adonis and Jesica Perlas

Pertanika Journal of Science & Technology, Volume 32, Issue 4, July 2024

DOI: https://doi.org/10.47836/pjst.32.4.16

Keywords: Bagged trees, diabetes prevalence, ensemble classifier, feature selection, model optimization, parameter tuning

Published on: 25 July 2024

This study aims to optimize the performance of bagged trees in an ensemble classifier for predicting diabetes prevalence in women. The dataset comprised 1,888 women described by six features: age, BMI, glucose level, insulin level, blood pressure, and pregnancy status. It was divided into training and testing sets in a 70:30 ratio, and the bagged tree ensemble classifier was trained with five-fold cross-validation. Using all features during training resulted in a training accuracy of 92.3% and a testing accuracy of 99.5%. Applying optimization techniques such as feature selection, parameter tuning, and limiting the maximum number of splits improved model performance further. Feature selection improved accuracy by 0.2%, and parameter tuning improved test accuracy by 0.2%. Decreasing the maximum number of splits from 1322 to 800 or 600 yielded an optimized model with 0.1% higher validation accuracy. Finally, the optimized bagged tree models were evaluated using performance metrics including accuracy, precision, recall, and F1 score. Model 1, which used a maximum of 800 splits and 50 learners, outperformed Model 2 in recall and F1 score, while Model 2, which used a maximum of 600 splits and 50 learners, achieved higher precision. The study concludes that optimization techniques can significantly improve the performance of bagged trees in predicting diabetes prevalence in women.
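The workflow summarized in the abstract (a 70:30 split, bagging with 50 learners, and evaluation by accuracy, precision, recall, and F1 score) can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the function names `train_stump`, `train_bagged`, and `classification_metrics` are hypothetical, and each base learner is a one-split tree (a maximum number of splits of 1) rather than the 600 or 800 splits reported in the study. Five-fold cross-validation is omitted for brevity.

```python
import random


def train_stump(rows, labels):
    """Fit a one-split tree by exhaustive search over feature, threshold, orientation."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):  # class predicted when the feature value exceeds t
                errors = sum((above if r[f] > t else 1 - above) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda r: above if r[f] > t else 1 - above


def train_bagged(rows, labels, n_learners=50, seed=0):
    """Train each learner on a bootstrap resample and predict by majority vote."""
    rng = random.Random(seed)
    n = len(rows)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        learners.append(train_stump([rows[i] for i in idx],
                                    [labels[i] for i in idx]))
    # Class 1 needs a strict majority of votes; ties resolve to class 0.
    return lambda r: int(2 * sum(m(r) for m in learners) > len(learners))


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Synthetic (glucose, BMI) rows with an assumed labeling rule; NOT the study's data.
    rng = random.Random(1)
    rows = [(rng.uniform(70, 200), rng.uniform(18, 40)) for _ in range(100)]
    labels = [int(glucose > 125) for glucose, _ in rows]
    split = int(0.7 * len(rows))  # 70:30 training/testing split
    model = train_bagged(rows[:split], labels[:split], n_learners=50)
    preds = [model(r) for r in rows[split:]]
    print(classification_metrics(labels[split:], preds))
```

Precision penalizes false positives and recall penalizes false negatives, so two models of similar accuracy can rank differently on each, which is the kind of trade-off the abstract reports between Model 1 and Model 2.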
