
An R package with over 50 highly cited, ready-to-use, up-to-date COVID-19 pandemic data resources

License

MIT (LICENSE.md); the license in the separate LICENSE file was not recognized.

seandavi/sars2pack


sars2pack


Overview

The sars2pack R package provides one-line access to over 40 COVID-related datasets. Datasets are accessed in real time directly from their sources and then transformed to tidy-data form where possible and applicable. The result of each dataset accessor is a ready-to-use R dataset, often a dataframe. Documentation includes dataset descriptions, sources and references, and examples. Online documentation is available in two locations:

Questions addressed by sars2pack

  • What are the current and historical counts of total cases, new cases, and deaths from COVID-19 at the city, county, state, national, and international levels?
  • How do changes in infection rates differ across locations?
  • What are the non-pharmaceutical interventions in place at the local and national levels?
  • In the United States, what is the geographical distribution of healthcare capacity (ICU beds, total beds, doctors, etc.)?
  • What are the published values of key epidemic parameters, as curated from the literature?

Installation

# Install BiocManager only if it is not already available:
if (!requireNamespace('BiocManager', quietly = TRUE)) {
    install.packages('BiocManager')
}

# Then install sars2pack from GitHub (needed only once):
BiocManager::install('seandavi/sars2pack')

After the one-time installation, load the package to get started.

library(sars2pack)

Available datasets

| name | accessor | data_type | geographical | geospatial | region | resolution | url |
|---|---|---|---|---|---|---|---|
| United States county-level geographic details | `us_county_geo_details` | demographics, geographic | TRUE | TRUE | United States | admin2 | [LINK](https://github.com/josh-byster/fips_lat_long) |
| OECD International Unemployment Data | `oecd_unemployment_data` | economics, time series | TRUE | FALSE | World | admin0 | [LINK](https://oecd.org) |
| healthdata.org COVID-19 Mobility Observations and Projections | `healthdata_mobility_data` | mobility, time series, projections | TRUE | FALSE | International | admin0, admin1 | [LINK](https://covid19.healthdata.org/projections) |
| healthdata.org COVID-19 Testing Observations and Projections | `healthdata_testing_data` | testing, time series, projections | TRUE | FALSE | International | admin0, admin1 | [LINK](https://covid19.healthdata.org/projections) |
| Our World In Data testing and cases reporting | `owid_data` | time series, cases, deaths, testing | TRUE | FALSE | World | admin0 | [LINK](https://ourworldindata.org/coronavirus) |
| CovidTracker data | `covidtracker_data` | time series, cases, deaths, testing | TRUE | FALSE | United States | admin1 | [LINK](https://covidtracking.com/) |
| European CDC world tracking | `ecdc_data` | time series, cases, deaths | TRUE | FALSE | World | admin0 | [LINK](https://www.ecdc.europa.eu/en/covid-19) |
| EU data GitHub aggregator | `eu_data_cache_data` | time series, cases, deaths | TRUE | FALSE | Europe | admin0, admin1 | [LINK](https://github.com/covid19-eu-zh/covid19-eu-data) |
| USA Facts | `usa_facts_data` | time series, cases, deaths | TRUE | FALSE | United States | admin1 | [LINK](https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/) |
| Johns Hopkins dataset | `jhu_data` | time series, cases, deaths | TRUE | FALSE | World | admin0 | [LINK](https://github.com/CSSEGISandData/COVID-19) |
| Johns Hopkins US-centric data | `jhu_us_data` | time series, cases, deaths | TRUE | FALSE | United States | admin1, admin2 | [LINK](https://github.com/CSSEGISandData/COVID-19) |
| New York Times county level data | `nytimes_county_data` | time series, cases, deaths | TRUE | FALSE | United States | admin2 | [LINK](https://raw.githubusercontent.com/nytimes/covid-19-data) |
| New York Times state level data | `nytimes_state_data` | time series, cases, deaths | TRUE | FALSE | United States | admin1 | [LINK](https://raw.githubusercontent.com/nytimes/covid-19-data) |
| The Economist: Excess deaths during COVID pandemic | `economist_excess_deaths` | time series, deaths, excess deaths | TRUE | FALSE | International | admin0, admin1 | [LINK](https://github.com/TheEconomist/covid-19-excess-deaths-tracker) |
| The Financial Times: Excess deaths during COVID pandemic | `financial_times_excess_deaths` | time series, deaths, excess deaths | TRUE | FALSE | International | admin0, admin1 | [LINK](https://github.com/Financial-Times/coronavirus-excess-mortality-data) |
| US CDC excess deaths dataset | `cdc_excess_deaths` | time series, deaths, excess deaths | TRUE | FALSE | United States | admin1 | [LINK](https://www.cdc.gov/nchs/nvss/vsrr/covid19/excess_deaths.html) |
| Descartes Labs Mobility Data | `descartes_mobility_data` | time series, mobility | TRUE | FALSE | United States | admin1 | [LINK](https://raw.githubusercontent.com/descarteslabs/DL-COVID-19) |
| Apple mobility data from maps | `apple_mobility_data` | time series, mobility | TRUE | FALSE | World | admin0, admin1, admin2, admin3 | [LINK](https://www.apple.com/covid19/mobility) |
| Healthdata.org projections of hospital utilization and deaths | `healthdata_projections_data` | time series, projections, cases, deaths | TRUE | FALSE | United States, World | admin1, admin2 | [LINK](http://www.healthdata.org/covid) |
| Healthdata.org mobility data | `healthdata_mobility_data` | time series, projections, mobility | TRUE | FALSE | United States, World | admin1, admin2 | [LINK](http://www.healthdata.org/covid) |
| United States CDC Social Vulnerability Index | `cdc_social_vulnerability_index` | demographics | TRUE | FALSE | United States | admin2 | [LINK](https://svi.cdc.gov/) |
| US county health rankings from ‘’ | `us_county_health_rankings` | demographics | TRUE | FALSE | United States | admin0, admin1, admin2 | [LINK](https://www.countyhealthrankings.org) |
| Country metadata from restcountries.eu | `country_metadata` | demographics | TRUE | FALSE | World | admin0 | [LINK](https://restcountries.eu) |
| Extensive United States hospital capabilities | `us_hospital_details` | healthcare capacity | TRUE | TRUE | United States | individual hospital | [LINK](https://hub.arcgis.com/datasets/geoplatform::hospitals) |
| Kaiser Family Foundation ICU bed data | `kff_icu_beds` | healthcare capacity | TRUE | TRUE | United States | individual hospital | [LINK](https://khn.org/news/as-coronavirus-spreads-widely-millions-of-older-americans-live-in-counties-with-no-icu-beds) |
| CovidCare United States Healthcare Capacity | `us_healthcare_capacity` | healthcare capacity | TRUE | TRUE | United States | individual hospital | [LINK](https://github.com/covidcaremap/covid19-healthsystemcapacity) |
| GISAID metadata from thousands of SARS-CoV-2 sequences | `cov_glue_lineage_data` | line list | TRUE | FALSE | World | multiple | [LINK](https://github.com/hCoV-2019/lineages) |
| beoutbreakprepared | `beoutbreakprepared_data` | line list | TRUE | FALSE | World | patient | [LINK](https://github.com/beoutbreakprepared/nCoV2019) |
| Published epidemic parameters for COVID-19 | `param_estimates_published` | miscellaneous | FALSE | FALSE | | | [LINK](https://github.com/midas-network/COVID-19/blob/master/parameter_estimates/2019_novel_coronavirus/estimates.csv) |
| Google mobility data | `google_mobility_data` | mobility | TRUE | FALSE | World | admin0, admin1, admin2 | [LINK](https://www.google.com/covid19/mobility/) |
| Newick tree from thousands of SARS-CoV-2 sequences | `cov_glue_newick_data` | phylogenetic | FALSE | FALSE | World | multiple | [LINK](https://github.com/hCoV-2019/lineages) |
| Aggregated projections from US CDC | `cdc_aggregated_projections` | projections | TRUE | FALSE | | admin0, admin1 | [LINK](https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html) |
| CoronaNet government response database | `coronanet_government_response_data` | public policy | TRUE | FALSE | World | admin0, admin1 | [LINK](https://coronanet-project.org/index.html) |
| Oxford Government Policy Intervention time series | `government_policy_timeline` | public policy | TRUE | FALSE | World | admin0 | [LINK](https://www.bsg.ox.ac.uk/research/research-projects/oxford-covid-19-government-response-tracker) |
| United States social distancing policies | `us_state_distancing_policy` | public policy | TRUE | FALSE | United States | admin1 | [LINK](https://github.com/COVID19StatePolicy/SocialDistancing/) |

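
The region and resolution columns make it possible to narrow the list to the accessors that suit a given question. The sketch below is only an illustration: it hand-copies a few rows of the table into a small data frame (one row per accessor and resolution level) rather than querying the package, and it assumes that `admin2` corresponds to the county level in the United States, as the county-level entries in the table suggest.

```r
# Hand-copied excerpt of the dataset table, one row per accessor/resolution pair.
# This is an illustrative stand-in, not something fetched from sars2pack.
catalog <- data.frame(
  accessor = c("jhu_data", "jhu_us_data", "jhu_us_data",
               "nytimes_county_data", "nytimes_state_data", "ecdc_data"),
  region = c("World", "United States", "United States",
             "United States", "United States", "World"),
  resolution = c("admin0", "admin1", "admin2",
                 "admin2", "admin1", "admin0"),
  stringsAsFactors = FALSE
)

# Keep the US accessors that report at the county (admin2) level.
is_us_county <- catalog$region == "United States" & catalog$resolution == "admin2"
county_level <- catalog$accessor[is_us_county]
print(county_level)
```

Storing one row per resolution level keeps the filter a plain logical comparison instead of a search inside a multi-valued cell.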

Case tracking

Up-to-date tracking of confirmed cases, deaths, and testing at the city, county, state, national, and international levels is critical to driving policy, implementing interventions, and measuring their effectiveness. Case-tracking datasets include a date, a count of cases, and usually further fields such as the reporting location.

Each case-tracking dataset is typically accessed with its own function. The example here uses data from the European Centre for Disease Prevention and Control (ECDC).

ecdc = ecdc_data()

Get a quick overview of the dataset.

head(ecdc)

## # A tibble: 6 x 8
## # Groups:   location_name, subset [6]
##   date       location_name iso2c iso3c population_2019 continent subset    count
##   <date>     <chr>         <chr> <chr>           <dbl> <chr>     <chr>     <dbl>
## 1 2019-12-31 Afghanistan   AF    AFG          38041757 Asia      confirmed     0
## 2 2019-12-31 Afghanistan   AF    AFG          38041757 Asia      deaths        0
## 3 2019-12-31 Algeria       DZ    DZA          43053054 Africa    confirmed     0
## 4 2019-12-31 Algeria       DZ    DZA          43053054 Africa    deaths        0
## 5 2019-12-31 Armenia       AM    ARM           2957728 Europe    confirmed     0
## 6 2019-12-31 Armenia       AM    ARM           2957728 Europe    deaths        0
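
Because the counts are cumulative and stored in long format (one row per date, location, and subset), daily new counts can be derived by differencing within each location and subset. The following is a minimal base-R sketch on made-up toy data that only borrows the column names shown above; `new_count` is a name introduced here for illustration and is not a column of the package's output.

```r
# Toy long-format data with the same column names as the ecdc tibble.
# The numbers are made up for illustration.
toy <- data.frame(
  date = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), times = 2)),
  location_name = rep(c("A", "B"), each = 3),
  subset = "confirmed",
  count = c(10, 15, 27, 0, 4, 4),
  stringsAsFactors = FALSE
)

# Sort by location and date so that differences run forward in time.
toy <- toy[order(toy$location_name, toy$subset, toy$date), ]

# Daily increments within each location/subset pair:
# the first day keeps its count, later days are day-over-day differences.
toy$new_count <- ave(toy$count, toy$location_name, toy$subset,
                     FUN = function(x) c(x[1], diff(x)))
print(toy)
```

Grouping by both `location_name` and `subset` matters on the real data, where confirmed cases and deaths for a country are interleaved in the same table.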

The ecdc dataset is just a data.frame (actually, a tibble), so standard R or tidyverse functions can answer basic questions with little code. The next code block builds top10, the countries with the most deaths recorded to date. Note that if you run this on your own computer, the results will reflect the data available that day.

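library(dplyr)
# For each country, keep the row(s) where cumulative deaths are at their maximum.
# A country whose count sits at its maximum on several dates matches more than
# once, which is why France fills three rows in the table below.
top10 = ecdc %>% filter(subset=='deaths') %>% 
    group_by(location_name) %>%
    filter(count==max(count)) %>%
    # Sort by deaths, keep the first ten rows, and drop columns not needed for display.
    arrange(desc(count)) %>%
    head(10) %>% select(-starts_with('iso'),-continent,-subset) %>%
    # Deaths per 100,000 population.
    mutate(rate_per_100k = 1e5*count/population_2019)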

Finally, present a nice table of those countries:

knitr::kable(
    top10,
    caption = "Reported COVID-19-related deaths in ten most affected countries.",
    format = 'pandoc')

Reported COVID-19-related deaths in ten most affected countries.

| date | location_name | population_2019 | count | rate_per_100k |
|---|---|---|---|---|
| 2020-07-06 | United_States_of_America | 329064917 | 129947 | 39.489776 |
| 2020-07-06 | Brazil | 211049519 | 64867 | 30.735441 |
| 2020-07-06 | United_Kingdom | 66647112 | 44220 | 66.349462 |
| 2020-07-06 | Italy | 60359546 | 34861 | 57.755570 |
| 2020-07-06 | Mexico | 127575529 | 30639 | 24.016361 |
| 2020-07-04 | France | 67012883 | 29893 | 44.607841 |
| 2020-07-05 | France | 67012883 | 29893 | 44.607841 |
| 2020-07-06 | France | 67012883 | 29893 | 44.607841 |
| 2020-05-24 | Spain | 46937060 | 28752 | 61.256500 |
| 2020-07-06 | India | 1366417756 | 19693 | 1.441214 |
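
France appears three times in the table above because its cumulative count was at its maximum on three consecutive dates, so the ten rows cover only eight distinct countries. One possible way to keep a single row per country is `slice_max()` with `with_ties = FALSE`, available in dplyr 1.0.0 and later. This is a sketch on made-up toy data, not the code used to produce the table; `toy` and `top_unique` are names introduced here for illustration.

```r
library(dplyr)

# Toy stand-in for the deaths rows of the ecdc tibble (made-up numbers).
# Country "A" sits at its maximum on two dates, mimicking the tie seen for France.
toy <- data.frame(
  date = as.Date(c("2020-07-04", "2020-07-05", "2020-07-05", "2020-07-06")),
  location_name = c("A", "A", "B", "B"),
  population_2019 = c(1e6, 1e6, 5e5, 5e5),
  count = c(300, 300, 90, 100),
  stringsAsFactors = FALSE
)

top_unique <- toy %>%
  group_by(location_name) %>%
  # Keep exactly one row per country, even when the maximum is tied.
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(desc(count)) %>%
  head(10) %>%
  # Deaths per 100,000 population.
  mutate(rate_per_100k = 1e5 * count / population_2019)

print(top_unique)
```

Calling `ungroup()` before `head(10)` makes it explicit that the ten rows are taken from the whole table rather than per country.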

Examine the spread of the pandemic around the world by plotting the cumulative deaths reported for the countries in the table above.

ecdc_top10 = ecdc %>% filter(location_name %in% top10$location_name & subset=='deaths')
plot_epicurve(ecdc_top10,
              filter_expression = count > 10, 
              color='location_name')

Comparing the features of disease spread is easiest if all curves are shifted to “start” at the same absolute level. In this case, shift the origin for each country to the first time point at which more than 100 cumulative deaths had been reported (ecdc_top10 contains only the deaths subset). Note how some curves cross others, which suggests less infection control at the same relative time in the pandemic for that country (e.g., Brazil).

ecdc_top10 %>% align_to_baseline(count>100,group_vars=c('location_name')) %>%
    plot_epicurve(date_column = 'index',color='location_name')
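
To make the alignment idea concrete, here is a minimal base-R sketch of the underlying computation on made-up toy data: drop observations at or below the threshold, then count days from each location's first remaining date. `align_sketch` is a hypothetical helper written for this illustration and should not be read as the package's `align_to_baseline()` implementation; the `index` column name simply mirrors the `date_column = 'index'` argument used above.

```r
# Hypothetical helper for illustration only; not the sars2pack implementation.
align_sketch <- function(df, threshold) {
  # Keep only the observations above the threshold.
  kept <- df[df$count > threshold, ]
  # Day 0 for each location is its first remaining date.
  first_day <- ave(as.numeric(kept$date), kept$location_name, FUN = min)
  kept$index <- as.numeric(kept$date) - first_day
  kept
}

# Toy cumulative counts for two locations whose outbreaks start on different dates.
toy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                   "2020-03-10", "2020-03-11", "2020-03-12")),
  location_name = rep(c("A", "B"), each = 3),
  count = c(80, 120, 200, 90, 150, 400),
  stringsAsFactors = FALSE
)

aligned <- align_sketch(toy, threshold = 100)
print(aligned)
```

After alignment both locations start at index 0 even though their calendar dates differ by more than a week, which is what lets the curves be compared on a common time axis.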

Contributions

Pull requests are gladly accepted on GitHub.

Adding new datasets

See the Adding new datasets vignette.

Similar work