Abstract
Objective
Coronavirus Disease 2019 (COVID-19) has spread across the world, resulting in over thirteen million confirmed cases and over five hundred thousand deaths, along with worldwide lockdowns, travel restrictions, heightened hygiene measures, physical distancing, and a complete change in daily life as we know it. Being able to analyse and predict the spread of the disease is therefore of utmost importance: it informs decisions that drastically affect countries and their people, and it supports efficient resource and utility management. However, the needs of the people and the specific conditions of the spread vary widely from country to country. Hence, this article has two objectives: (i) to conduct an in-depth statistical analysis of COVID-19 affected patients in India, and (ii) to propose a mathematical model for predicting the spread of COVID-19 cases in India.
Materials and method
There has been limited research in modeling and predicting the spread of COVID-19 in India, owing both to the ongoing nature of the pandemic and to the limited availability of data. Widely used SIR models, as well as non-SIR models based on the Gauss error function and Monte Carlo simulation, do not perform well in the context of COVID-19 spread in India. We propose a ‘change-factor’ or ‘rate-of-change’ based mathematical model to predict the spread of the pandemic in India, with data drawn from hundreds of sources.
Results
The average age of affected patients was found to be 38.54 years; 66.76% were male and 33.24% female. Most patients were in the age range of 18–40 years. Optimal parameter values for the prediction model (α = 1.35, N = 3 and T = 10) were identified through extensive experiments. Over the entire period since the outbreak started in India, the model has been 90.36% accurate in predicting the total number of cases the next day, with the actual value falling within the predicted range on 150 of the 166 days examined.
Conclusion
The proposed system showed an accuracy of 90.36% for prediction since the first COVID-19 case in India, and 96.67% accuracy over the month of April. The predicted number of cases for the next day is found to be a function of the numbers over the last 3 days, with an ‘increase’ factor influenced by the last 10 days. Males are affected more than females. In India, the number of infected adults decreases steadily from each age bucket to the next older one, so the youngest adults form the largest infected group, a departure from the world trend. The model is self-correcting: it improves its predictions every day by incorporating the previous day's data into the trend-line for the following days. This model can thus be used dynamically not only to predict the spread of COVID-19 in India, but also to check the effect of various government measures within a short span of time after they are implemented.
Keywords: COVID-19, India, Modeling, Predictive system, Statistical analysis
1. Introduction
COVID-19. Coronavirus Disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, has led to an ongoing pandemic with 13,378,853 confirmed cases and 580,045 confirmed deaths across 213 countries, areas, and territories (WHO) [1] as of July 16, 2020.
COVID-19 in India. As of July 16, 2020, India reported a total of 968,876 COVID-19 cases across 35 states and union territories (MoHFW) [2]. On January 30, 2020, the first laboratory-confirmed case of COVID-19 in India was reported in Kerala, a southwestern state (WHO) [3]. Since then, cases have been on the rise all over India, and several measures have been implemented to help curb the spread, including the suspension of commercial passenger flights from March 22, 2020, and a 21-day countrywide lockdown imposed on March 25, 2020, which was later extended until May 3, 2020. On May 1, 2020, the lockdown was extended by another fortnight. On March 28, India had 909 confirmed cases of COVID-19; a month later, on April 28, the figure was 29,974, almost 33 times higher. By July 16, it had grown by almost another factor of 33.
The present article examines the spread of COVID-19 in India in more detail, conducting a statistical analysis of the affected patients and proposing a new mathematical model to predict the spread of the disease in the country. Section 2 outlines how the data was collected; Section 3 discusses the patient-level statistical analysis and compares it with world trends; and Section 4 describes the modeling and prediction system. The results are discussed and analysed in Sections 5 and 6. Concluding remarks and areas for future work are presented in Section 7.
2. Data collection
The data analysed in this paper includes the official counts released by MoHFW, hundreds of news reports, the COVID-19 India API, and volunteer-collected, de-identified open source data, all up to July 16, 2020.
3. Patient-level analysis
In this section, we analyse various demographic and other factors of the COVID-19 affected patients in India, and contrast and compare those with the world trends.
Age counts of 2339 patients were received, and the mean age was found to be 38.54 years, with a standard deviation of 17.22 years.
Out of the 5313 affected people whose sex was identified, there were 3547 males (66.76%) and 1766 females (33.24%). This is somewhat skewed compared with around 30 other countries, most of which report that males account for only slightly more than 50% of cases on average (Global Health 5050 [4], WHO [1]).
As of July 16, 2020, there were 968,876 cases officially confirmed in India, with 612,814 (63.24%) of them having recovered, 331,146 (34.17%) active cases and 24,915 (2.57%) deceased. Of the cases with an outcome, 96.09% have recovered, somewhat higher than the current global average of 93% (Worldometers) [5]. On April 26, 2020, there were 26,917 cases officially confirmed in India, with 5914 (21.97%) of them having recovered, 20,177 (74.96%) active cases and 826 (3.07%) deceased. Of the cases with an outcome, 87.74% had recovered, somewhat higher than the then global average of 81% (Worldometers) [5].
Of the affected patients whose status was tracked, an outcome was obtained in a mean time of 10.69 days, with great variance: the mean recovery time (time between hospitalization and official status of recovery) was found to be 14.47 days, while the mean time to death (time between hospitalization and death) was found to be a mere 3.18 days. The recovery time is consistent with CDC guidelines and WHO [1] statements that it may take the body up to 2 weeks to recover from the illness, and up to 6 weeks for severe or critical cases. However, the mean time to death is drastically lower than the WHO figure of 2–8 weeks.
We obtained notes regarding transmission for 3273 patients (mostly those infected towards the initial days of spread in India, after which tracking became much more complicated), out of whom 712 patients had travel history, 194 patients were family member transmissions, and an additional 578 patients were direct contacts of another infected person. Furthermore, 911 patients had attended mass gatherings with other infected people.
The age distribution of the 2339 patients whose age data was received is shown in Table 1. This distribution contrasts sharply with that of other countries and regions of the world. In Italy, about 71% of individuals affected with COVID-19 were over 50 years old (Statista) [6], but in India that figure is only 27.53%. Only 8.1% of COVID-19 affected patients in China were in their 20s, versus around 23% of Indians. We also notice that among adults with confirmed COVID-19 in India, the number of cases decreases steadily from each age bucket to the next older one, so the youngest adults form the largest infected group, a departure from the world trend.
Table 1.
Age distribution among infected patients.
Age range | Number of patients | Fraction of total |
---|---|---|
≤18 | 217 | 9.28% |
18–29 | 558 | 23.86% |
30–39 | 534 | 22.83% |
40–49 | 386 | 16.50% |
50–59 | 311 | 13.30% |
60–69 | 237 | 10.13% |
70–79 | 73 | 3.12% |
80–89 | 18 | 0.77% |
≥90 | 5 | 0.21% |
4. Modeling and predicting the number of daily cases
A very important aspect of attempting to control this ongoing pandemic and save lives is to model the spread that has occurred so far, and to be able to use that to predict future spread and number of cases. There has been extensive ongoing research in this aspect, catering to the needs of various countries. In the rest of this article, we propose a model and prediction system that is developed very specifically with India in mind, and currently offers excellent predictions.
Background research. Multiple models currently attempt to predict the spread of coronavirus in the world. These notably include the UPenn CHIME model [7], a discrete-time SIR (Susceptible-Infected-Recovered) model of infections and recovery; as shown in the following paragraph, SIR-based modeling fails to predict the spread of COVID-19 in India. Another widely used SIR-based model is Health Catalyst [8]. Non-SIR models performing statistical or mathematical modeling are also evolving, such as a Gauss error function and Monte Carlo simulation [9] for Italy and other statistical models [10] in the USA. Nonetheless, none of them satisfactorily predicts the spread yet, especially when applied to India.
Background research for India. There has been limited research in modeling and predicting the spread of COVID-19 in India so far, owing both to the ongoing nature of the pandemic and to the limited availability of data. An article published in early April 2020 [11] uses the widely known SIR model and predicts that India will reach a final epidemic size of around 13,000 by the end of May 2020. This differs drastically from the actual state of affairs: India entered May 2020 with 35,403 cases, almost thrice the value predicted for the end of May. A different model [12] predicted that 54 days of total lockdown would yield around 5000 cases. On the 40th day of lockdown there were seven times as many cases, and currently there are almost 200 times as many, so this model is not offering good predictions either. Yet another mathematical model [13] estimates that with lockdown from 25th March onwards (as happened), there would be just under 10,000 cases by the end of April, which, once again, is about one-third of the actual total at the end of April 2020; by the end of April 25, 2020, the actual number of cases was already about 2.7 times the predicted figure.
In the present article, we propose a mathematical model with a different approach, which has been yielding consistent results in predicting the spread since the beginning of the outbreak in India. Moreover, while there exist data-driven approaches to look at the outbreak of COVID-19 [14], [15], there have been no such studies regarding India yet, and this article attempts to fill in that gap as well.
4.1. Change factor
We note that even though the curve of the spread looks exponential, daily changes depend more on the local context, that is, the days immediately preceding. A more global context, namely the last couple of weeks and the pattern of change over them, carries somewhat less weight but is still important. Moreover, once the curve has flattened out and the number of cases has started decreasing, the model needs a measure that automatically reflects this and adjusts predictions accordingly. To account for all of this, we first define a ‘change factor’: the ratio of a day's count to the average count over the preceding N days. We start looking at the cases from January 30, 2020, the day the first COVID-19 case was confirmed in India. The pseudocode for the change factor calculation is given in Algorithm 1.
Algorithm 1
Calculate the change factor for the last N days, given the list of daily cases L
sum_change ⟵ 0
for past_day_number = 1 … N do
    change ⟵ L[day − past_day_number]
    sum_change ⟵ sum_change + change
end for
average_change ⟵ sum_change / N
actual_change ⟵ L[day]
change_factor ⟵ actual_change / average_change
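Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the argument names, and in particular the handling of a zero average (returning a neutral factor of 1.0) are assumptions, since the pseudocode does not say what happens when the preceding N days all report zero cases.

```python
from typing import Sequence


def change_factor(daily_cases: Sequence[float], day: int, n: int = 3) -> float:
    """Ratio of the count on `day` to the mean count of the n preceding days."""
    if day < n:
        raise ValueError("need at least n days of history before `day`")
    # Counts for the n days immediately before `day` (L[day-1] ... L[day-n]).
    window = daily_cases[day - n:day]
    average_change = sum(window) / n
    if average_change == 0:
        # Assumption (not in the paper): treat an all-zero history as "no change".
        return 1.0
    return daily_cases[day] / average_change


# Example: 40 cases today against an average of 20 over the previous 3 days.
print(change_factor([10, 20, 30, 40], day=3, n=3))
```

A factor above 1 means the day's count exceeded the recent average, and a factor below 1 means it fell short of it.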
4.2. Prediction based on raw data
For the raw data, we have analysed how useful three quantities are in predicting the total number of cases for the following day(s), for confirmed, recovered, and deceased cases alike: the pattern in the change factor, the first derivative of cases (i.e., the daily increase), and the second derivative (i.e., the increase in daily increases).
Based on these studies, we came up with Algorithm 2 for predicting the number of daily cases. Here, α is a factor we can tune as necessary.
Algorithm 2
Predict the number of cases for the next day based on the last N days’ cases and the last T days’ change factors, given the list of daily cases L
trendline ⟵ change factors for the last T days
predicted_factor ⟵ sum(trendline) / T
base_cases_considered ⟵ sum(cases in the last N days)
base_average ⟵ base_cases_considered / N
predicted_value ⟵ α × predicted_factor × base_average
Extensive experimentation with various values of α, N and T has shown that α = 1.35, N = 3 and T = 10 currently give the best results. This implies that the predicted number of cases for the next day is a function of the numbers over the last 3 days, but with an ‘increase’ factor influenced by the last 10 days. Such a short memory is consistent with the highly dynamic nature of the epidemic.
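The prediction step of Algorithm 2 can be sketched in Python as below. It is an illustrative reading of the pseudocode under stated assumptions: the function and argument names are hypothetical, the change factor is recomputed locally so that the sketch is self-contained, and a zero average is mapped to a neutral factor of 1.0, which the paper does not specify.

```python
from typing import Sequence


def change_factor(daily_cases: Sequence[float], day: int, n: int) -> float:
    """Count on `day` divided by the mean count of the n preceding days."""
    average_change = sum(daily_cases[day - n:day]) / n
    # Assumption (not in the paper): neutral factor when the history is all zero.
    return 1.0 if average_change == 0 else daily_cases[day] / average_change


def predict_next_day(daily_cases: Sequence[float], alpha: float = 1.35,
                     n: int = 3, t: int = 10) -> float:
    """Predict tomorrow's count from the last n counts and last t change factors."""
    if len(daily_cases) < n + t:
        raise ValueError("need at least n + t days of data")
    last = len(daily_cases) - 1
    # Change factors for the most recent t days, each using its own n-day history.
    trendline = [change_factor(daily_cases, d, n)
                 for d in range(last - t + 1, last + 1)]
    predicted_factor = sum(trendline) / t
    base_average = sum(daily_cases[-n:]) / n
    return alpha * predicted_factor * base_average


# A flat series has change factors of 1, so the prediction is alpha times the level.
print(predict_next_day([100.0] * 13))
```

Because the trendline is recomputed from the most recent T days each time, appending a new day's count automatically shifts the next prediction, which is the self-correcting behaviour described in the text.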
4.3. Prediction based on moving averaged data
We notice sharp spikes and troughs in the number of daily cases, which can arise for a variety of reasons, including staggered testing, staggered collection of results, holidays, one-time breakout events, mass gatherings, new testing developments, new medical guidelines, and so on. For this reason, we also analyse the number of cases (confirmed, recovered and deceased) using a simple moving average of the data over a window of the last K days. We have experimented with various values of K and have settled on K = 3 as the most balanced choice: it does not dilute the effect of each day's changes much, yet it still accounts to some extent for day-to-day changes in testing and other factors that are unrelated to the ‘true’ spread of the disease.
With the window-averaged list for K = 3 days, we use the same prediction system as in Algorithm 2. For the averaged data, the best results were found with α = 0.65 and the same N = 3 and T = 10 as for the prediction with raw (original, unaveraged) data.
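The smoothing step described above can be sketched as a trailing simple moving average. The text does not state how the first K − 1 days are treated, so this sketch assumes a shortened window at the start of the series; the function name is hypothetical.

```python
from typing import List, Sequence


def moving_average(daily_cases: Sequence[float], k: int = 3) -> List[float]:
    """Trailing simple moving average over the last k days (fewer at the start)."""
    smoothed = []
    for i in range(len(daily_cases)):
        # Assumption: early days average over however many days are available.
        window = daily_cases[max(0, i - k + 1):i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


print(moving_average([3, 6, 9, 12], k=3))  # [3.0, 4.5, 6.0, 9.0]
```

The smoothed list would then be passed to the same predictor as the raw data, with α = 0.65 instead of 1.35.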
4.4. Final prediction
Because the ‘true’ spread of the epidemic is as hard to estimate as the number of cases (for total confirmed, recovered and deceased) per day, we have decided to instead offer a range of predicted values, with one end being the prediction obtained from the raw data and the other end being that obtained from moving averaged data. Over the entire course of time since the outbreak started in India, the model has been 90.36% accurate in predicting the total number of cases the next day, correctly predicting the range in 150 out of the 166 days looked at. Fig. 1 depicts the plot of the actual versus predicted number of daily cases so far, as of July 16, 2020.
Fig. 1.
Comparing the actual and predicted number of daily confirmed cases.
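The range-based scoring described in Section 4.4 can be sketched as follows: each day's prediction is an interval whose ends are the raw-data and moving-averaged predictions, and a day counts as correct when the actual value falls inside that interval. The function names and the inclusive treatment of the interval ends are assumptions.

```python
from typing import Sequence, Tuple


def prediction_range(raw_prediction: float,
                     averaged_prediction: float) -> Tuple[float, float]:
    """Order the two predictions into a (low, high) interval."""
    return (min(raw_prediction, averaged_prediction),
            max(raw_prediction, averaged_prediction))


def range_accuracy(actuals: Sequence[float],
                   ranges: Sequence[Tuple[float, float]]) -> float:
    """Fraction of days whose actual value lies inside the predicted range."""
    # Assumption: the interval ends are inclusive.
    hits = sum(1 for actual, (low, high) in zip(actuals, ranges)
               if low <= actual <= high)
    return hits / len(actuals)


ranges = [prediction_range(135.0, 65.0), (100.0, 150.0), (40.0, 60.0)]
print(range_accuracy([100.0, 200.0, 50.0], ranges))  # 2 of 3 days fall in range
print(round(100 * 150 / 166, 2))  # 90.36, the accuracy reported in the text
```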
5. Results and analysis
We observe that the prediction range bounded by the raw and moving averaged predictions offers a high accuracy of 90.36% up to July 16, 2020. We further note that there is a tradeoff in selecting the values of α: a value much further from 1 will spread out the range and thus have a greater chance of being ‘correct’, but a very wide range prevents the model from being precise, even if it is accurate. Hence, we have balanced this tradeoff by predicting a range whose midpoint, on average, deviates by only 61 cases from the true value as of May 1, 2020, and by 429 cases as of July 16, 2020. The former is a mere 3.21% of the maximum daily cases reported in the first 90 days of the pandemic, and the latter is just 1.31% of the cases reported on the latest day, July 16, 2020 (Fig. 2). Moreover, there was only one mis-predicted day over the month of April, giving the model 96.67% accuracy over April, and 98.90% accuracy over the last 3 months. Most of the prediction errors are around days 30–42. The RMSE score for the model is 778.23, which is just 2.38% of the confirmed cases reported on the latest day. The MAPE value is also as low as 35.1, just 0.1% of the latest number of daily cases.
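For reference, the error measures quoted above can be computed as in the sketch below, using the standard definitions of RMSE and MAPE. Which predicted series the paper scored is not stated; scoring the midpoint of each range is an assumption here, as are the function names and the choice to skip zero-count days in MAPE.

```python
import math
from typing import Sequence


def rmse(actuals: Sequence[float], predictions: Sequence[float]) -> float:
    """Root mean squared error between actual and predicted counts."""
    squared = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actuals: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    # Assumption: days with zero actual cases are skipped to avoid division by zero.
    ratios = [abs(a - p) / abs(a) for a, p in zip(actuals, predictions) if a != 0]
    return 100.0 * sum(ratios) / len(ratios)


def mean_midpoint_deviation(actuals, lows, highs) -> float:
    """Average absolute gap between the range midpoint and the actual value."""
    gaps = [abs(a - (lo + hi) / 2) for a, lo, hi in zip(actuals, lows, highs)]
    return sum(gaps) / len(gaps)


print(rmse([10, 20], [13, 16]))       # sqrt(12.5)
print(mape([100, 200], [110, 180]))   # about 10 percent
```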
The model is self-correcting as it improves its predictions every day, by incorporating the previous day's data into the trendline for the following days. This model can thus be used dynamically not only to predict the spread of COVID-19 in India, but also to check the effect of various government measures in a short span of time after they are implemented.
6. Discussion
The prediction range bounded by the raw and moving averaged predictions, evaluated over the 166 days since the start of the pandemic in India, offers an accuracy of 90.36% up to July 16, 2020. On average, the midpoint of the prediction range deviates from the true value by only 429 cases as of the latest day, which is a mere 1.31%. The RMSE and MAPE scores are 778.23 and 35.1, just 2.38% and 0.11% of the last reported day's cases, respectively. This is a self-correcting model which adapts to new data to yield better future predictions.
In comparison with other contemporary models, an article published in early April 2020 [11], using the widely known SIR model, predicted that India would reach a final epidemic size of around 13,000 by the end of May 2020, while the real figure was 190,609, almost 14 times as much as predicted. The proposed model had 86.67% accuracy by the same day, predicting 8342 cases on May 30 (with the actual number being as close as 8336). A different model [12] predicted that 54 days of total lockdown would yield around 5000 cases. On the 40th day of lockdown, there were seven times as many cases. Another data-driven mathematical model [13] estimated that with lockdown from 25th March onwards (as happened), there would be a little less than 10,000 cases by the end of April, which, once again, is about one-third of the actual total at the end of April 2020; by the end of April 25, 2020, the actual number of cases was about 2.7 times the predicted figure, while the current model was 83% accurate with data obtained up to the same time-frame.
The limitations of this paper include the lack of statewise data analysis and of specific analysis for the spread of COVID-19 in ‘red’ and ‘green’ zones (the infection hotspots and safe areas as categorized by the Indian government) and in major cities. Other possibly relevant factors in the spread of COVID-19 in India, notably social, cultural and economic ones, have also not been analysed, for want of sufficient data to draw conclusions from. We aim to address all these issues in the future.
7. Conclusion and future work
We can see an alarming growth in the number of cases in India over the last few months, but also a silver lining in the ‘stabilizing’ nature of the trend factors around April, possibly owing to the lockdown measures implemented, along with widespread testing, contact tracing, and general hygiene measures. Moreover, we notice that many of the sharp spikes are associated with mass gathering events, and many of the new infections with close contacts of the infected, which highlights even more the need for physical distancing and good hygiene. The age distribution among the known confirmed cases in India shows a tilt towards the younger population, a departure from the world trend. The quick turnaround of recovery in infected patients in India is another beacon of hope.
Further work is being done to incorporate more factors into the model and to break it down statewise, in order to gain more insight into the status of the spread in the various states of India and to fine-tune the model to the needs of each individual state. We aim to model the spread in local “hotspots”, red zones, and major cities as well. We also hope to adapt the present model to the needs of different countries in the future. Additional work includes the analysis of the change factors. A brief investigation of the list of change factors shows that a factor that stays around 1 indicates a ‘stabilization’ in the daily number of confirmed cases.
Summary points
1. The present article looks into the specifics of the spread of COVID-19 in India and conducts a statistical analysis of the affected patients.
2. Existing models have failed to predict the spread of COVID-19 in India.
3. The present article proposes a new mathematical model to predict the spread of the disease in the country, which has succeeded with 90.36% accuracy ever since the first COVID-19 case was declared in India.
Conflict of interest
None declared.
Footnotes
Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.ijmedinf.2020.104262.
References
- 1. https://www.who.int/emergencies/diseases/novel-coronavirus-2019 (Accessed 16 July 2020).
- 2. https://www.mohfw.gov.in/ (Accessed 16 July 2020).
- 3. https://www.who.int/docs/default-source/wrindia/india-situation-report-1.pdf?sfvrsn=5ca2a672_0 (Accessed 28 April 2020).
- 4. https://globalhealth5050.org/covid19/age-and-sex-data/ (Accessed 28 April 2020).
- 5. https://www.worldometers.info/coronavirus/ (Accessed 16 July 2020).
- 6. https://www.statista.com/statistics/1103023/coronavirus-cases-distribution-by-age-group-italy/ (Accessed 28 April 2020).
- 7. http://predictivehealthcare.pennmedicine.org/2020/03/14/accouncing-chime.html (Accessed 01 May 2020).
- 8. https://hcdatascienceservices.com/COVID19CapPlan/?visit=Content-Mask-1.7.0.11#application_guidance (Accessed 01 May 2020).
- 9. Ciufolini I., Paolozzi A. Mathematical prediction of the time evolution of the covid-19 pandemic in Italy by a Gauss error function and Monte Carlo simulations. Eur. Phys. J. Plus. 2020. doi: 10.1140/epjp/s13360-020-00383-y.
- 10. https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html (Accessed 01 May 2020).
- 11. Ranjan R. Predictions for covid-19 outbreak in India using epidemiological models. medRxiv. 2020. doi: 10.1101/2020.04.02.20051466.
- 12. AMK. 2020. Modeling and Predictions for Covid 19 Spread in India.
- 13. Roy S., Roy Bhattacharya K. 2020. Spread of Covid-19 in India: A Mathematical Model.
- 14. Fang Y., Nie Y., Penny M. Transmission dynamics of the covid-19 outbreak and effectiveness of government interventions: a data-driven analysis. J. Med. Virol. 2020;92(6):645–659. doi: 10.1002/jmv.25750.
- 15. Dey S.K., Rahman M.M., Siddiqi U.R., Howlader A. Analyzing the epidemiological outbreak of covid-19: a visual exploratory data analysis approach. J. Med. Virol. 2020;92(6):632–638. doi: 10.1002/jmv.25743.