Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate

Hezlin Aryani Abd Rahman, Yap Bee Wah and Ong Seng Huat

Pertanika Journal of Science & Technology, Volume 29, Issue 1, January 2021

DOI: https://doi.org/10.47836/pjst.29.1.10

Keywords: Categorical covariate, imbalanced data, logistic regression, parameter estimates, predictive analytics, simulation

Published on: 22 January 2021

Logistic regression is often used to classify a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category, and they lead to biased parameter estimates and affect the classification performance of the logistic regression model. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled IR values from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the mean square error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000–2000 and 2500–3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. Results also showed that the parameter estimates were biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
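The simulation design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and several details are assumptions made only for illustration: a single binary covariate drawn with probability 0.5, a true slope `beta1 = 1.0`, an intercept solved numerically so that the marginal event rate equals the target IR, 200 replicates, and a 0.5 continuity correction applied when a cell of the 2x2 table is empty (otherwise the estimate would be infinite). All function names are hypothetical. With one binary covariate and an intercept, the maximum likelihood estimate of the slope equals the log odds ratio of the 2x2 table, so no iterative fitting is needed here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_intercept(target_ir, beta1, p_x=0.5):
    """Bisection for beta0 such that the marginal P(Y=1) equals target_ir."""
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        marginal = (1 - p_x) * sigmoid(mid) + p_x * sigmoid(mid + beta1)
        if marginal < target_ir:
            lo = mid  # event rate too low: raise the intercept
        else:
            hi = mid
    return (lo + hi) / 2.0


def estimate_beta1(xs, ys):
    """Slope MLE for one binary covariate: log odds ratio of the 2x2 table."""
    counts = [[0.0, 0.0], [0.0, 0.0]]  # counts[x][y]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    if min(counts[0] + counts[1]) == 0:
        # Assumption: add 0.5 to every cell when one is empty,
        # so the estimate stays finite.
        counts = [[c + 0.5 for c in row] for row in counts]
    return math.log((counts[1][1] * counts[0][0]) / (counts[1][0] * counts[0][1]))


def simulate_mse(n, target_ir, beta1=1.0, reps=200, seed=0):
    """Mean square error of the slope estimate over `reps` simulated datasets."""
    rng = random.Random(seed)
    beta0 = solve_intercept(target_ir, beta1)
    p0, p1 = sigmoid(beta0), sigmoid(beta0 + beta1)
    total = 0.0
    for _ in range(reps):
        xs = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
        ys = [1 if rng.random() < (p1 if x else p0) else 0 for x in xs]
        total += (estimate_beta1(xs, ys) - beta1) ** 2
    return total / reps


if __name__ == "__main__":
    for n in (100, 500, 2000):
        for ir in (0.01, 0.05, 0.30):
            print(f"n={n:5d}  IR={ir:4.0%}  MSE={simulate_mse(n, ir):.3f}")
```

Under these assumptions, the printed grid lets the MSE of the slope estimate be compared across sample sizes and imbalance ratios, which mirrors the kind of comparison the study reports. The specific thresholds in the abstract come from the authors' own design and are not expected to be reproduced by this sketch.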

ISSN 0128-7680

e-ISSN 2231-8526

Article ID: JST-1911-2020
