Machine Learning Model for Predicting Number of COVID-19 Cases in Countries with Low Number of Tests

Abstract

Background:

The COVID-19 pandemic has presented a series of new challenges to governments and healthcare systems. Testing is one important method for monitoring and controlling the spread of COVID-19. Yet with a serious discrepancy in the resources available between rich and poor countries, not every country is able to employ widespread testing.

Methods and Objective:

Here, we have developed machine learning models, based on multilinear regression and neural networks, for predicting the prevalence of COVID-19 cases in a country. The models are trained on data from US states and tested against the reported infections in European countries. The models are based on four features: number of tests, population, percentage of urban population, and Gini index.

Results:

The population and the number of tests have the strongest correlation with the number of infections. The models were then tested on data from European countries, for which the coefficient of determination (R²) between the actual and predicted cases was found to be 0.88 for the multilinear regression model and 0.91 for the neural network model.

Conclusion:

The model predicts that the actual prevalence of COVID-19 infection in countries where the number of tests is less than 10% of their populations is at least 26 times greater than the reported numbers.

Keywords: Machine learning, Model, COVID-19 cases, Healthcare systems, Testing, RNA viruses.

1. INTRODUCTION

The COVID-19 outbreak, caused by SARS-CoV-2, was declared a global health emergency by the WHO on 30th January, 2020. SARS-CoV-2 is a member of the coronavirus family of enveloped, positive-sense, single-stranded RNA viruses. It is thought that the virus transitioned from animal to human hosts in the Huanan seafood market in Wuhan, in the province of Hubei, China [1]. The virus spread rapidly, initially within China and then worldwide. COVID-19 was declared a pandemic by the World Health Organization on 11th March, 2020. As of April 25th, 2021, there have been almost 100 million confirmed cases worldwide. PCR (polymerase chain reaction), which can detect the genetic material of the virus, is the most accurate technique for identifying COVID-19 infection [2]. However, COVID-19 has exposed several inequalities. In the scramble to obtain medical resources, poorer countries have been left behind. Governments of low- and middle-income countries have struggled to provide sufficient funds to obtain medical resources, such as COVID-19 tests [3]. Furthermore, more geo-politically powerful countries have been accused of hoarding supplies, leaving poorer countries unable to access sufficient tests [4]. Given this disparity in the number of COVID-19 tests available, we aim to provide a prediction model based on machine learning that mitigates the reliance on clinical tests.

Machine learning has been utilized in contact tracing, as a diagnostic and prognostic tool, in vaccine and treatment development, and as a method to forecast and predict COVID-19 cases and deaths [5-11]. It has the potential to reduce the strain on healthcare systems that have been heavily burdened by the COVID-19 pandemic. For example, machine learning has been used to predict a positive COVID-19 infection in a PCR test [12]. The prediction is based on 8 binary features, including age, sex, contact with individuals known to have had COVID-19, and the appearance of five clinical symptoms. In addition, Sun et al. developed a model to predict the severity of a COVID-19 infection [13]. Furthermore, machine learning has been utilized to predict the prevalence of COVID-19 patients between one and six days in advance in 10 Brazilian states [14].

In this work, we have built multilinear regression and neural network models to predict the number of COVID-19 cases as of 15/03/2021. The models have been trained on the US States data and tested against the number of infections in the European countries. Then, both the models have been used to predict COVID-19 infection cases in countries with a low number of tests. The model was based on four features: the number of tests, population, urban population, and the Gini index. The model suggests that the actual number of infections is at least 10 times higher than the reported number of infections.

Uncertainties from different sources are not considered in this study. The first is the ML model parameter uncertainty, which requires dedicated techniques such as Bayesian Neural Networks (https://arxiv.org/abs/2107.03342). This uncertainty is not considered because the DNN used for this study does not deliver certainty estimates and may suffer from over- or under-confidence. The second is the uncertainty in the data sources, since the data used in the analysis provide neither the uncertainty in the PCR tests nor that in the population estimates. Additionally, the uncertainty of the PCR tests used in the US data, on which the models were trained, differs from that of the tests used in the other countries, for which the models were used for inference. Moreover, different COVID-19 variants significantly change the uncertainty rates, which can be a topic of future studies.

2. MATERIALS AND METHODS

The data were obtained from several official sources, including the World Bank World Development Indicators [16-19], government websites and publications [20-22], Worldometer [23], and Our World in Data [14]. The data were extracted, standardized, and compiled into a single file. Although several features were considered, only four were included in the model owing to limited data availability and low correlation with the recorded COVID-19 cases. The four features used were Population, Tests, Gini Index, and % Urban Population. As the model first needed to be trained on the US states and then tested on European countries, data for all included features had to be available for both. This considerably limited the number of features that could be incorporated into the models. Several other factors were also considered, for example, median age and the percentage of the population that always wears a face mask. However, median age was excluded from the model as it correlated poorly with the number of infections. The mask-wearing variable was excluded because the proportion of the population that always wore masks was measured differently in the training and test countries, and likely also in the other countries for which the models were used to make predictions.

Data used to train the model covered the period from the beginning of the pandemic to February 2021. Later time period data were not used owing to the vast differences among countries not only in terms of the starting date and accessibility of vaccines but also the rate of vaccination. These discrepancies would make predictions for other countries inaccurate. The data used to test the model covered the period up until March 15th, 2021. A later date was considered for the test data than for the training data as most European countries started vaccination after the US.

Although the intention was originally to train the models on data from Indian states as well as the US states, to allow for different models for developing and developed countries, India was excluded owing to the high prevalence of the new B.1.617 variant, which has increased transmissibility [24]. Although replacing India with Russia as an additional training data set was considered, the lack of data available made this unfeasible.

Some pre-processing steps had to be taken to clean the data before they could be used by the machine learning algorithms. First, the relevant features and information were extracted from the .csv file in which the data were stored; after that, all commas were removed from individual data points to make sure Python could parse them correctly. The data were then normalized via a min-max scaler, which places all data points between 0 and 1. For each data point in a feature, the MinMaxScaler subtracts the smallest value in the feature and then divides the result by the range, which is the difference between the original maximum and original minimum. The MinMaxScaler retains the original shape of the distribution, thus preserving the information embedded in the initial data set. However, it is important to note that this also means that the MinMaxScaler does not reduce the importance of outliers. Finally, the pre-processing procedure was completed by removing data samples that had missing values for some of their features. This ensures that all remaining data can be used for training the model, as missing values can cause errors and unwanted variations within the procedure.
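The cleaning steps described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the sample rows, the column order, and the helper names `parse_number` and `min_max_scale` are hypothetical, the scaling is written out by hand to mirror the formula rather than calling scikit-learn's `MinMaxScaler`, and rows with missing values are dropped before scaling for simplicity.

```python
def parse_number(text):
    """Strip thousands separators; return None for an empty (missing) cell."""
    cleaned = text.replace(",", "").strip()
    return float(cleaned) if cleaned else None


def min_max_scale(column):
    """Map values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


# Hypothetical raw rows (population, tests); the values are made up.
raw_rows = [
    ["1,000,000", "200,000"],
    ["5,000,000", ""],           # missing test count, so the row is dropped
    ["3,000,000", "1,200,000"],
    ["9,000,000", "700,000"],
]

# Step 1: remove commas and parse the numbers.
parsed = [[parse_number(cell) for cell in row] for row in raw_rows]

# Step 2: drop samples that have a missing value in any feature.
complete = [row for row in parsed if all(value is not None for value in row)]

# Step 3: scale each feature (column) independently to [0, 1].
columns = list(zip(*complete))
scaled_columns = [min_max_scale(list(column)) for column in columns]
scaled_rows = [list(row) for row in zip(*scaled_columns)]
```

Because the scaling is linear, the largest sample in each feature still maps to 1 and an extreme value compresses all other points toward 0, which is the outlier sensitivity noted in the text.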

Two different types of machine learning algorithms were used for the analysis of the data: multilinear regression and a multi-layer perceptron artificial neural network (ANN). The multiple linear regression model was built using the Scikit-learn library [16]. The neural network was constructed using the Keras API from the TensorFlow library [25]. The ANN utilizes 1 input layer, 3 dense hidden layers, and 1 output layer (Fig. 1).

All dense layers use the Rectified Linear Unit (ReLU) as an activation function, which is defined as follows:

$$f(x) = \max(0, x)$$

The slope is always 0 for negative inputs and always 1 for positive inputs. ReLU was used as it is computationally less intensive and faster than most other activation functions, such as sigmoid and tanh.

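The mean squared error (MSE) function is used to calculate the loss in the current iteration of the neural network. The mean absolute error (MAE), which takes the absolute error of all points and calculates their mean, is also reported. MAE is calculated via the following equation:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points. MAE is a commonly used metric and is relatively robust to outliers, which makes it suitable for the data used in this study.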
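The two error measures named in this section can be written out directly, which also makes their different sensitivity to outliers visible. The sketch below is illustrative only: the function names and the toy values are assumptions, not data from the study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values: three predictions are close, the last one is far off (an outlier).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 3.0, 3.0, 9.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Share of each total error that comes from the single outlying point.
outlier_share_mse = (4.0 - 9.0) ** 2 / (mse * len(y_true))
outlier_share_mae = abs(4.0 - 9.0) / (mae * len(y_true))
```

Squaring makes the single large error dominate the MSE more strongly than it dominates the MAE, which is why the MAE is usually described as the more outlier-robust of the two.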

The neural network contains a few hyperparameters that had to be set manually before the training. These hyperparameters are chosen by using a random grid search technique. The choice of the ReLU activation function, the number of hidden layers, and the number of nodes in each layer are examples of hyperparameters.
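A random search over hyperparameters can be sketched as follows. The search space, the number of samples, and the scoring function are all hypothetical placeholders: the paper does not list the ranges that were searched, and in practice the score would be the validation loss of a network trained with each configuration.

```python
import random

# Hypothetical search space; the study's actual ranges are not reported.
search_space = {
    "hidden_layers": [2, 3, 4],
    "nodes_per_layer": [32, 64, 128],
    "activation": ["relu", "tanh"],
}


def sample_configurations(space, n_samples, seed=0):
    """Draw n_samples random configurations, one option per hyperparameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_samples)
    ]


def placeholder_validation_loss(config):
    """Stand-in for 'train a model with this config and return its loss'."""
    return (
        abs(config["hidden_layers"] - 3)
        + abs(config["nodes_per_layer"] - 128) / 128
    )


candidates = sample_configurations(search_space, n_samples=10)
best = min(candidates, key=placeholder_validation_loss)
```

Compared with an exhaustive grid, random sampling evaluates only a fixed budget of configurations, which keeps the cost bounded when each evaluation requires training a network.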

Fig. (1). Artificial neural network architecture.

3. RESULTS AND DISCUSSION

Since the start of the COVID-19 pandemic, the US has conducted over 400 million COVID-19 tests, making the country a rich and reliable source of information [15]. For this reason, the data from all US states were used to train our machine learning models. To evaluate the models, they were tested against the data from the European countries. Finally, the models were used to make predictions for the number of COVID-19 cases in countries that have conducted low numbers of tests. The following countries were used as an example of low-testing countries: Nepal, Vietnam, Mongolia, Kenya, Ghana, Zambia, Iran, Paraguay, and Ecuador.

3.1. Features Analysis

The features currently utilized in the models are as follows: Population, Tests, Gini index, and % Urban population. To observe their correlation with the number of cases, the number of cases was plotted against these features for the US states (Fig. 2).

The population and the number of tests conducted both show a strong correlation with the prevalence of COVID-19 cases, with R² values of 0.95 and 0.81, respectively (Fig. 2A, B), and p-values of zero. However, a much lower correlation was obtained for the Gini index and the percentage of urban population, with R² values of 0.12 and 0.16 and p-values of 0.01 and 0.003, respectively. The features that are currently utilized in the models were selected based on their strong correlation with the number of cases. Other features, such as median age, % of people wearing a facemask outside, and number of lockdown days, were not used because low correlation was found between these features and the number of cases, and because the data were incomplete for a number of these features. Adding these features to the models would have resulted in a higher error.
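For a single feature fitted with a straight line, R² equals the squared Pearson correlation between the feature and the number of cases. The sketch below computes it by hand; the function name and the toy values are hypothetical, and p-values are not computed here.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance ** 2 / (variance_x * variance_y)


# Toy values: cases track the first feature closely, the second only loosely.
feature_strong = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_weak = [3.0, 1.0, 4.0, 1.0, 5.0]
cases = [1.1, 1.9, 3.2, 3.8, 5.0]

r2_strong = r_squared(feature_strong, cases)
r2_weak = r_squared(feature_weak, cases)
```

A value near 1 means that a straight line through the feature explains most of the variation in cases, while a value near 0 means that it explains very little.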

3.2. Multilinear Regression

A multiple linear regression model was built and trained on the US States data according to the following equation:

$$Y = A x_1 + B x_2 + C x_3 + D x_4 + K \qquad (1)$$

where Y denotes the number of cases; A, B, C and D are the regression coefficients obtained from least-squares fitting; $x_1$, $x_2$, $x_3$ and $x_4$ are the independent variables (population, number of tests, Gini index, and % of the urban population, respectively), and K is the y-intercept.
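A least-squares fit of a four-feature linear model with an intercept can be sketched with NumPy as shown below. The data are synthetic and the coefficient values are made up for illustration; the study itself used Scikit-learn, whose ordinary least-squares regression solves the same problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, already-scaled features standing in for
# population, tests, Gini index, and % urban population.
features = rng.random((50, 4))

# Made-up coefficients (A, B, C, D) and intercept K used to generate targets.
true_coefficients = np.array([0.9, 0.1, -0.02, -0.03])
true_intercept = 0.02
targets = features @ true_coefficients + true_intercept

# Append a column of ones so that the intercept K is fitted with A..D.
design_matrix = np.column_stack([features, np.ones(len(features))])
solution, *_ = np.linalg.lstsq(design_matrix, targets, rcond=None)

fitted_coefficients = solution[:4]
fitted_intercept = solution[4]
predictions = design_matrix @ solution
```

Because the synthetic targets contain no noise, the fit recovers the generating coefficients; with real data the fitted values are those that minimize the sum of squared residuals.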

The model shows a very strong correlation between the predicted and actual prevalence of COVID-19 cases for both the US States data (the training dataset) and the European data (the test dataset) (Fig. 3).

For the US data, the calculated slope is 1.00 with an intercept of zero and an R² of 0.95. For the European data, R² is 0.88, and the slope and the intercept are 1.49 and 12k, respectively, which indicates that the predicted prevalence of infections for the EU is generally higher than the reported prevalence. This could result from differences between the US and the EU in people's behavior and adherence to governmental rules.

To understand the contribution of each feature to the prediction model, we report the estimated regression coefficients for each of the four features. The calculated coefficients are 0.87, 0.13, -0.01, and -0.03, for the population, number of tests, Gini index, and % urban population, respectively.

Fig. (2). COVID-19 cases vs. (A) Population; (B) Number of tests; (C) Gini index; (D) % Urban population; (E) Median age; (F) % of population that always wears a mask. Each point represents a state.
Fig. (3). To the left, the predictions vs. observed cases for US data (Slope: 1.00; Intercept: 0; R²: 0.95). To the right, the predictions vs. observed cases for European data (Slope: 1.49; Intercept: 12K; R²: 0.88).

The ‘population’ feature has a coefficient close to one and is thus the major contributor to the prediction model. The coefficients for the % Urban population and Gini index are negative and close to zero, which suggests that these features contribute little to the regression model.

Fig. (4). (A) The predictions vs. observed cases for US data (Slope: 0.95; Intercept: 0.0; R²: 0.95); (B) the predictions vs. observed cases for European data (Slope: 1.57; Intercept: 45K; R²: 0.81).

3.3. Neural Networks

The neural network model is mainly considered to account for possible non-linearities in the Gini index and the percentage of urban population. A fully connected Deep Neural Network (DNN) is trained on the US dataset and tested on the EU dataset. The input layer of the network consists of 128 nodes and is followed by four hidden layers with 128 nodes each and an output layer with a single node, which corresponds to the predicted number of cases. Each layer has its weights and biases initialized randomly from a normal distribution, which provides the starting values needed to initiate the training procedure. The ReLU function is used as the activation function; it has become the default for many types of neural networks because such models are easy to train and often achieve good performance.
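The layer layout described above can be sketched as a plain NumPy forward pass. This is not the Keras model used in the study: it assumes that the 128-node input layer is a dense layer acting on the four features, that the single output node is linear, and that the initializer has a standard deviation of 0.05. The helper names are hypothetical and no training is performed.

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit: max(0, x) applied element-wise."""
    return np.maximum(0.0, x)


def init_layers(sizes, rng, scale=0.05):
    """Draw weights and biases for each layer from a normal distribution."""
    return [
        (rng.normal(0.0, scale, size=(n_in, n_out)),
         rng.normal(0.0, scale, size=n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(inputs, layers):
    """ReLU on every 128-node layer, linear activation on the output node."""
    *dense_layers, (w_out, b_out) = layers
    activations = inputs
    for weights, biases in dense_layers:
        activations = relu(activations @ weights + biases)
    return activations @ w_out + b_out


# 4 features -> 128-node input layer -> four 128-node hidden layers -> 1 output.
layer_sizes = [4, 128, 128, 128, 128, 128, 1]
rng = np.random.default_rng(0)
layers = init_layers(layer_sizes, rng)

batch = rng.random((3, 4))             # three hypothetical scaled samples
predictions = forward(batch, layers)   # one prediction per sample
n_parameters = sum(w.size + b.size for w, b in layers)
```

Under these assumptions the network has far more trainable parameters than there are training samples, which is relevant to the overfitting discussed for the low-testing countries.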

The DNN model is trained by minimizing an objective function (loss function). The Mean Squared Error (MSE) is used as the loss function, and the Stochastic Gradient Descent (SGD) optimizer is employed to find the best values for the DNN parameters by minimizing the loss function iteratively over the dataset. The number of iterations (epochs) is set to 100. The network is trained using data from US states and tested using data from European countries with the same set of features as in the multilinear regression, namely Population, Tests, Gini index, and percentage of urban population. The testing results are illustrated in Fig. (4), which quantifies the correlation between the predicted and recorded numbers of infections. The slopes are 0.95 and 0.80, the R² values are 0.95 and 0.91, and the mean absolute errors are 0.03 and 0.06 for the US and EU datasets, respectively. These measurements suggest that the model fits the observed data by learning the relationships between the input variables.

3.4. Prediction of COVID-19 Cases

The reported infections and their corresponding predicted values (using linear regression and NN) are shown in Table 1. Furthermore, according to the training dataset, the US has performed 361 million tests, which is equal to approximately 110% of the US population. Thus, we also report the predicted number of cases for European and other low-testing countries under the assumption that these countries had conducted tests equal to 1.1 times their respective populations (columns 6-8 of Table 1). Although the number of tests for the EU countries is increased by 30%, the slopes of the linear regression and the NN models increase by only 5% and 11%, respectively.
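The arithmetic behind the 110% testing scenario can be checked with a short sketch. The helper names are hypothetical; the US test count and the Albania figures are taken from the text and from Table 1, and the populations are the values implied by dividing by 1.1 rather than official population figures.

```python
TESTS_PER_CAPITA = 1.10  # scenario: tests equal to 110% of the population


def scenario_tests(population):
    """Number of tests a country would have conducted under the scenario."""
    return round(TESTS_PER_CAPITA * population)


def implied_population(tests_at_110_percent):
    """Population implied by a scenario test count."""
    return tests_at_110_percent / TESTS_PER_CAPITA


# US: 361 million tests corresponds to roughly 328 million people.
us_implied_population = implied_population(361_000_000)

# Albania (Table 1): scenario tests versus tests actually conducted.
albania_scenario_tests = 3_163_148
albania_actual_tests = 506_676
albania_population = implied_population(albania_scenario_tests)
albania_actual_tests_per_capita = albania_actual_tests / albania_population
```

For Albania, the tests actually conducted cover well under a fifth of the implied population, so the scenario columns correspond to a several-fold increase in testing for that country.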

Table 1. The predicted number of COVID-19 cases for test countries.
| Country | Actual No. of Tests | Reported No. of Infections | Predictions (Multilinear Regression) | Predictions (Neural Network) | Tests = 110% of Population | Predictions at Tests = 110% (Multilinear Regression) | Predictions at Tests = 110% (Neural Network) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Albania | 506676 | 117474 | 210493 | 171218 | 3163148 | 239234 | 358221 |
| Austria | 6033827 | 495464 | 713692 | 498778 | 9946648 | 770166 | 762327 |
| Belgium | 10110146 | 808283 | 906281 | 527839 | 12787487 | 957884 | 650472 |
| Bosnia and Herzegovina | 702920 | 142160 | 263250 | 223023 | 3593094 | 295191 | 392963 |
| Croatia | 1431342 | 251174 | 317183 | 273416 | 4496184 | 352885 | 437758 |
| Cyprus | 2563270 | 39651 | 56012 | 149468 | 1334864 | 44094 | 503521 |
| Czechia | 9665502 | 1402420 | 851078 | 546321 | 11795219 | 895219 | 805890 |
| Denmark | 20418687 | 220459 | 526270 | 420297 | 6387490 | 406445 | 662516 |
| Estonia | 1038888 | 86086 | 65321 | 82105 | 1459877 | 69132 | 373473 |
| Finland | 3596402 | 67334 | 375604 | 152357 | 6101438 | 409848 | 521883 |
| France | 57231533 | 4071662 | 5455817 | 3026689 | 71912361 | 5758992 | 3671764 |
| Germany | 46319641 | 2578835 | 6752524 | 3338496 | 92369061 | 7397418 | 4629496 |
| Greece | 5856618 | 221147 | 796133 | 479660 | 11425943 | 871592 | 707136 |
| Hungary | 4104415 | 524196 | 734001 | 437445 | 10607413 | 816269 | 701647 |
| Ireland | 3720861 | 225741 | 362195 | 333602 | 5473973 | 387925 | 643442 |
| Italy | 44623304 | 3223142 | 4982964 | 2681373 | 66439272 | 5340019 | 3382223 |
| Latvia | 1670193 | 93959 | 98590 | 147075 | 2058635 | 103506 | 503588 |
| Lithuania | 2218746 | 205644 | 194128 | 127608 | 2965009 | 204527 | 352468 |
| Luxembourg | 2248588 | 57877 | -6932 | 72753 | 696387 | -23384 | 217924 |
| Moldova | 771763 | 204463 | 332354 | 106253 | 4430108 | 373454 | 358590 |
| Netherlands | 6970400 | 1157192 | 1244594 | 613274 | 18877627 | 1396911 | 1250356 |
| Norway | 4115415 | 80440 | 364594 | 176433 | 5996009 | 392747 | 609492 |
| Poland | 10668987 | 1917527 | 2962379 | 1413123 | 41599229 | 3346222 | 2136684 |
| Portugal | 8480932 | 814257 | 820324 | 557283 | 11193380 | 868501 | 764967 |
| Romania | 6774562 | 862681 | 1520347 | 768489 | 21062036 | 1700082 | 1141785 |
| Serbia | 3149048 | 516277 | 680907 | 450877 | 9583870 | 760126 | 681613 |
| Slovak Republic | 2200380 | 337960 | 466872 | 341250 | 6007649 | 513081 | 374066 |
| Slovenia | 976907 | 200579 | 159814 | 210357 | 2287053 | 173893 | 374025 |
| Spain | 40292390 | 3183704 | 3879261 | 2147433 | 51444247 | 4101208 | 2654842 |
| Sweden | 6627544 | 712527 | 721451 | 364833 | 11157720 | 786554 | 892642 |
| Switzerland | 5387481 | 570645 | 661132 | 390075 | 9568835 | 719229 | 683189 |
| UK | 103053938 | 4258438 | 6078592 | 3409492 | 74949540 | 5988545 | 3816487 |
| Ukraine | 7328468 | 1467548 | 3314135 | 1525414 | 47903633 | 3803081 | 2476377 |
The negative value reported for Luxembourg is a result of the very low population and the relatively high urban population and Gini index.
Table 2. The predicted number of COVID-19 cases for countries whose total tests were equal to or less than 10% of their population.
| Country | Actual No. of Tests | Reported No. of Infections | Predictions (Multilinear Regression) | Predictions (Neural Network) | Tests = 110% of Population | Predictions at Tests = 110% (Multilinear Regression) | Predictions at Tests = 110% (Neural Network) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Afghanistan | 465731 | 55985 | 3034780 | 404687 | 41845929 | 3422007 | 509312 |
| Algeria | 230861 | 115410 | 3327210 | 349522 | 47358359 | 3768219 | 470700 |
| Chad | 119517 | 4328 | 1317840 | 158913 | 17541564 | 1480872 | 219913 |
| DRC | 159469 | 27077 | 6681106 | 779441 | 95469624 | 7572999 | 1110283 |
| Egypt | 2824316 | 191555 | 7770111 | 1100450 | 110426880 | 8777033 | 1318575 |
| Guatemala | 1411568 | 183014 | 1316936 | 194200 | 18264429 | 1474642 | 164909 |
| Honduras | 714929 | 178925 | 776194 | 98998 | 10720729 | 869826 | 85309 |
| Indonesia | 16610468 | 1430000 | 20837478 | 3404783 | 297688125 | 23467745 | 3339198 |
| Mozambique | 454528 | 64516 | 2372216 | 270953 | 33402640 | 2680538 | 387939 |
| Pakistan | 9530000 | 609964 | 16699868 | 2557415 | 238221850 | 18839920 | 2686825 |
| Papua New Guinea | 112995 | 2269 | 792772 | 127738 | 9653720 | 882052 | 176443 |
| Syria | 103566 | 16556 | 1357616 | 136954 | 18777149 | 1532360 | 158446 |
| Yemen | 62990 | 2908 | 2313264 | 271150 | 32078114 | 2612855 | 369019 |
All the numbers are reported up to 03/15/2021.
* Where no test data could be found for 03/15/2021, data up till 05/15/2021 were used.

Using the same training dataset, we predicted the number of infections in selected countries where the number of tests is less than 10% of their populations (Table 2). On average, the predicted number of infections is 26 times higher than the reported number for the linear regression model and 4 times higher for the NN. The discrepancy between the results of the multilinear regression and NN models in Table 2 is attributed to overfitting of the NN, which indicates that the generalization of the NN model is rather limited. This is due to the small training dataset, i.e., 52 entries, which is not enough for the NN model to avoid overfitting.
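One way to turn Table 2 into an average predicted-to-reported factor is to divide the column total of the predictions by the column total of the reported infections. The sketch below does this for both sets of predictions using the Table 2 values. How the averages quoted in the text were actually computed is not stated, so this aggregation is an assumption, and a mean of per-country ratios would give a different figure.

```python
# Table 2 rows: (country, reported, regression, neural network,
#                regression at 110% tests, neural network at 110% tests)
TABLE_2 = [
    ("Afghanistan", 55985, 3034780, 404687, 3422007, 509312),
    ("Algeria", 115410, 3327210, 349522, 3768219, 470700),
    ("Chad", 4328, 1317840, 158913, 1480872, 219913),
    ("DRC", 27077, 6681106, 779441, 7572999, 1110283),
    ("Egypt", 191555, 7770111, 1100450, 8777033, 1318575),
    ("Guatemala", 183014, 1316936, 194200, 1474642, 164909),
    ("Honduras", 178925, 776194, 98998, 869826, 85309),
    ("Indonesia", 1430000, 20837478, 3404783, 23467745, 3339198),
    ("Mozambique", 64516, 2372216, 270953, 2680538, 387939),
    ("Pakistan", 609964, 16699868, 2557415, 18839920, 2686825),
    ("Papua New Guinea", 2269, 792772, 127738, 882052, 176443),
    ("Syria", 16556, 1357616, 136954, 1532360, 158446),
    ("Yemen", 2908, 2313264, 271150, 2612855, 369019),
]


def ratio_of_totals(rows, prediction_index, reported_index=1):
    """Sum of predicted cases divided by sum of reported cases."""
    predicted_total = sum(row[prediction_index] for row in rows)
    reported_total = sum(row[reported_index] for row in rows)
    return predicted_total / reported_total


regression_factor = ratio_of_totals(TABLE_2, 2)
network_factor = ratio_of_totals(TABLE_2, 3)
regression_factor_110 = ratio_of_totals(TABLE_2, 4)
network_factor_110 = ratio_of_totals(TABLE_2, 5)
```

Under this aggregation the large countries (Indonesia and Pakistan) dominate the totals, whereas a per-country mean would be driven by the countries with very few reported infections.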

CONCLUSION

Both the multilinear regression and neural network models predicted the number of COVID-19 cases with a fair degree of accuracy on the European test data set. Considering Table 1, the number of cases predicted by the models was close to the number of cases reported for some countries, such as Italy, Poland and Slovakia. Yet, in most cases, the models predicted more cases than were reported. The models were trained on data from the US, a country that tested extensively. Therefore, it seems that due to limited testing in most countries, the number of cases reported was a gross underestimation of the actual number of infections. This disparity was most pronounced in countries that were not testing extensively. The predicted number of infections for these countries was 26 times higher than the reported numbers on average. Therefore, the models can be effective tools for estimating the prevalence of COVID-19 infection in countries where sufficient testing is not available or where it is suspected that governments may not be entirely transparent about the number of COVID-19 infections.

LIST OF ABBREVIATIONS

MSE = Mean Squared Error
DNN = Deep Neural Network
SGD = Stochastic Gradient Descent

ETHICS APPROVAL AND CONSENT TO PARTICIPATE

Not applicable.

HUMAN AND ANIMAL RIGHTS

Not applicable.

CONSENT FOR PUBLICATION

Not applicable.

AVAILABILITY OF DATA AND MATERIALS

The authors confirm that the data supporting the findings of this study are available within the manuscript.

FUNDING

None.

CONFLICT OF INTEREST

The authors declare no conflict of interest, financial or otherwise.

ACKNOWLEDGEMENTS

Declared none.

REFERENCES

1
Walach H, Hockertz S. Wuhan Covid19 data – more questions than answers. Toxicology 2020; 440: 152486.
2
Zitek T. The Appropriate Use of Testing for COVID-19. West J Emerg Med 2020; 21(3): 470-2.
3
Spearman P. Diagnostic testing for SARS-CoV-2/COVID19. Curr Opin Pediatr 2021; 33(1): 122-8.
4
Kavanagh MM, Erondu NA, Tomori O, et al. Access to lifesaving medical resources for African countries: COVID-19 testing and response, ethics, and politics. Lancet 2020; 395(10238): 1735-8.
5
Habib N, Rahman MM. Diagnosis of corona diseases from associated genes and X-ray images using machine learning algorithms and deep CNN. Inform Med Unlocked 2021; 24: 100621.
6
Reyana A, Kautish S. Corona virus-related Disease Pandemic: A Review on Machine Learning Approaches and Treatment Trials on Diagnosed Population for Future Clinical Decision Support 2021.
7
Magar R, Yadav P, Barati Farimani A. Potential neutralizing antibodies discovered for novel corona virus using machine learning. Sci Rep 2021; 11(1): 5261.
8
Ban Z, Yuan P, Yu F, Peng T, Zhou Q, Hu X. Machine learning predicts the functional composition of the protein corona and the cellular recognition of nanoparticles. Proc Natl Acad Sci USA 2020; 117(19): 10492-9.
9
Duan Y, Coreas R, Liu Y, et al. Prediction of protein corona on nanomaterials by machine learning using novel descriptors. NanoImpact 2020; 17: 100207.
10
Findlay MR, Freitas DN, Mobed-Miremadi M, Wheeler KE. Machine learning provides predictive analysis into silver nanoparticle protein corona formation from physicochemical properties. Environ Sci Nano 2018; 5(1): 64-71.
11
Papa E, Doucet JP, Sangion A, Doucet-Panaye A. Investigation of the influence of protein corona composition on gold nanoparticle bioactivity using machine learning approaches. SAR QSAR Environ Res 2016; 27(7): 521-38.
12
Zoabi Y, Deri-Rozov S, Shomron N. Machine learning-based prediction of COVID-19 diagnosis based on symptoms. NPJ Digit Med 2021; 4(1): 3.
13
Sun L, Song F, Shi N, et al. Combination of four clinical indicators predicts the severe/critical symptom of patients infected COVID-19. J Clin Virol 2020; 128: 104431.
14
Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review. Chaos Solitons Fractals 2020; 139: 110059.
15
Hasell J, Mathieu E, Beltekian D, et al. A cross-country database of COVID-19 testing. Sci Data 2020; 7(1): 345.
16
The World Bank. World Development Indicators, Urban Population. 2019. https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS?view=chart
17
The World Bank. World Development Indicators. 2019. https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS?view=chart
18
The World Bank. World Development Indicators, Population total. 2019. https://data.worldbank.org/indicator/SP.POP.TOTL?view=chart
19
The World Bank. World Development Indicators. 2019. https://data.worldbank.org/indicator/SI.POV.GINI?view=chart
21
22
U.S. Department of Health & Human Services. 2021. https://healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9
23
Worldometer. COVID Live Update. 2021. https://www.worldometers.info/coronavirus/#countries
24
Bansal R, Kumar A, Singh AK, Kumar S. Stochastic filtering based transmissibility estimation of novel coronavirus. Digit Signal Process 2021; 112: 103001.
25
Rampasek L, Goldenberg A. TensorFlow: Biology’s Gateway to Deep Learning? Cell Syst 2016; 2(1): 12-4.