An Application of Feature Engineering and Generalized Linear Model for Forecasting the Number of COVID-19 New Cases
Abstract
The purpose of this research is to construct a generalized linear model (GLM) for forecasting the number of new COVID-19 cases. The data used are the open-source COVID-19 dataset published by DEVAKUMAR on www.kaggle.com, updated on January 30, 2020. The dataset contains data on COVID-19 patients collected from 187 countries and comprises 1 response variable and 12 explanatory variables. Feature engineering identified only 6 significant explanatory variables. These variables yielded 7 significant features: the number of new deaths, the number of new cases in a week, the number of recovered cases, the number of newly recovered cases, the number of confirmed cases, the number of active cases, and the product of the number of newly recovered cases and the number of active cases. The 7 features were used to build GLMs under the assumption that the response follows one of three statistical distributions: the normal distribution, the negative binomial distribution, or the Poisson distribution. The models were then refined with stepwise selection to improve their performance. The Poisson GLM performed best: using all 7 features, it achieved RMSE = 365.0387 and MAE = 803.0267. The normal GLM performed only marginally worse, with RMSE = 365.4591 and MAE = 803.0286, while using only 4 features: the number of new deaths, the number of new cases in a week, the number of newly recovered cases, and the number of active cases. These results illustrate how feature engineering can simplify the construction of generalized linear models for forecasting.
Keywords: COVID-19; generalized linear model; feature engineering
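The workflow summarized in the abstract (build an interaction feature, fit GLMs under different distributional assumptions, compare them by RMSE and MAE) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic and deterministic rather than the Kaggle dataset, the names `rec` and `act` are hypothetical stand-ins for scaled newly recovered and active case counts, and the hand-written Newton solver stands in for whatever statistical software the authors used. It omits the negative binomial model and stepwise selection, and its numbers bear no relation to those reported above.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def gram(X, w):
    """Return X^T diag(w) X."""
    p = len(X[0])
    return [[sum(wi * r[j] * r[k] for r, wi in zip(X, w)) for k in range(p)]
            for j in range(p)]

def xty(X, v):
    """Return X^T v."""
    return [sum(r[j] * vi for r, vi in zip(X, v)) for j in range(len(X[0]))]

def fit_normal(X, y):
    """Normal GLM with identity link, i.e. ordinary least squares."""
    return solve(gram(X, [1.0] * len(X)), xty(X, y))

def fit_poisson(X, y, iters=25):
    """Poisson GLM with log link, fitted by Newton's method (Fisher scoring)."""
    # Column 0 of X is assumed to be the intercept; start at log(mean(y)).
    beta = [math.log(sum(y) / len(y))] + [0.0] * (len(X[0]) - 1)
    for _ in range(iters):
        mu = [math.exp(sum(b * v for b, v in zip(beta, r))) for r in X]
        score = xty(X, [yi - mi for yi, mi in zip(y, mu)])  # X^T (y - mu)
        step = solve(gram(X, mu), score)                    # (X^T diag(mu) X)^-1 score
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def rmse(y, pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))

def mae(y, pred):
    return sum(abs(a - b) for a, b in zip(y, pred)) / len(y)

# Synthetic, deterministic data (hypothetical): two scaled predictors on a grid,
# plus their product as an engineered interaction feature.
grid = [i / 10 for i in range(11)]
X, y = [], []
for rec in grid:
    for act in grid:
        X.append([1.0, rec, act, rec * act])
        y.append(round(math.exp(1.0 + 0.8 * rec + 0.5 * act + 0.3 * rec * act)))

beta_pois = fit_poisson(X, y)
beta_norm = fit_normal(X, y)

pred_pois = [math.exp(sum(b * v for b, v in zip(beta_pois, r))) for r in X]
pred_norm = [sum(b * v for b, v in zip(beta_norm, r)) for r in X]

# At the Poisson maximum-likelihood estimate the score X^T (y - mu) vanishes.
final_score = xty(X, [yi - mi for yi, mi in zip(y, pred_pois)])

rmse_pois, mae_pois = rmse(y, pred_pois), mae(y, pred_pois)
rmse_norm, mae_norm = rmse(y, pred_norm), mae(y, pred_norm)
print("Poisson GLM: RMSE=%.4f MAE=%.4f" % (rmse_pois, mae_pois))
print("Normal GLM:  RMSE=%.4f MAE=%.4f" % (rmse_norm, mae_norm))
```

Both metrics are computed on the same predictions, so for any model the RMSE is at least as large as the MAE; the two models are compared by looking at each metric separately.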
Published: 2023-01-30
Section: Research Article