PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY

 

e-ISSN 2231-8526
ISSN 0128-7680

A Comprehensive Analysis of a Framework for Rebalancing Imbalanced Medical Data Using an Ensemble-based Classifier

Jafhate Edward, Marshima Mohd Rosli and Ali Seman

Pertanika Journal of Science & Technology, Volume 32, Issue 6, October 2024

DOI: https://doi.org/10.47836/pjst.32.6.12

Keywords: Ensemble classifier, ensemble learning, imbalance classification, machine learning algorithms, medical data, predictive modeling, rebalancing framework

Published on: 25 October 2024

In medical data, addressing class imbalance is essential for accurate predictive modeling. This paper examines a well-established rebalancing framework proposed in previous research. Although the framework is acknowledged for its effectiveness, its adaptability across diverse medical datasets remains unexplored. To bridge this gap, we integrate an ensemble-based classifier into the existing framework. Using seven imbalanced binary medical datasets, we run three experiments: (1) the standard baseline classifiers from the framework (original), (2) the baseline combined with an ensemble-based classifier, and (3) our novel ensemble-based classifier with the self-paced ensemble (SPE) algorithm. Our novel ensemble, composed of decision tree (DT), radial support vector machine (R.SVM), and extreme gradient boosting (XGB) classifiers, serves as the foundation for the SPE. Our primary objective is to demonstrate that integrating an ensemble can improve the overall performance of the existing framework. Experimental results show significant improvements, with our proposed ensemble classifier outperforming the original by 4.96%, 5.89%, 5.68%, 7.85%, and 6.84% in accuracy, precision, recall, F-score, and G-mean, respectively. This study offers insight into the adaptability and performance gains achievable through ensemble methods when addressing class imbalance in the medical domain.
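The five evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code. It assumes the common definitions, in particular G-mean as the geometric mean of sensitivity (recall) and specificity, and treats the minority class as the positive class. The function name `binary_metrics` and the counts in the example are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute five metrics from binary confusion-matrix counts.

    Assumption: the positive class is the minority (e.g. diseased) class,
    and G-mean is sqrt(sensitivity * specificity).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    pr_sum = precision + recall
    f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "g_mean": g_mean,
    }


# Hypothetical data: 100 samples, of which 10 are positive.
# A classifier that always predicts the majority (negative) class:
always_negative = binary_metrics(tp=0, fp=0, tn=90, fn=10)
# A classifier that finds most positives at the cost of some false alarms:
balanced_model = binary_metrics(tp=8, fp=9, tn=81, fn=2)

for name, result in [("always_negative", always_negative),
                     ("balanced_model", balanced_model)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With these hypothetical counts, the always-negative classifier has the higher accuracy, yet its recall and G-mean are zero. This illustrates why imbalanced-classification studies usually report recall, F-score, and G-mean alongside accuracy.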


Article ID: JST-4947-2023
