Benchmark datasets for SARS-CoV-2 surveillance bioinformatics

PeerJ. 2022 Sep 5:10:e13821. doi: 10.7717/peerj.13821. eCollection 2022.

Lingzi Xiaoli^#¹, Jill V Hagey^#¹, Daniel J Park², Christopher A Gulvik¹, Erin L Young³, Nabil-Fareed Alikhan⁴, Adrian Lawsin¹, Norman Hassell¹, Kristen Knipe¹, Kelly F Oakeson³, Adam C Retchless¹, Migun Shakya⁵, Chien-Chi Lo⁵, Patrick Chain⁵, Andrew J Page⁴, Benjamin J Metcalf¹, Michelle Su¹, Jessica Rowell⁶, Eshaw Vidyaprakash⁶, Clinton R Paden¹, Andrew D Huang⁶, Dawn Roellig¹, Ketan Patel¹, Kathryn Winglee¹, Michael R Weigand¹, Lee S Katz¹

Affiliations

¹ Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.
² Broad Institute of MIT and Harvard, Cambridge, MA, United States of America.
³ Utah Public Health Laboratory, Salt Lake City, UT, United States of America.
⁴ Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom.
⁵ Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
⁶ SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.

^# Contributed equally.

PMID: 36093336
PMCID: PMC9454940
DOI: 10.7717/peerj.13821

Lingzi Xiaoli et al. PeerJ. 2022.

Abstract

Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled through an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatics tools are the means to major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable identification and classification. Additionally, the pandemic has required public health laboratories to rapidly reach high-throughput proficiency in sequencing library preparation and downstream data analysis. However, both processes can be limited by the lack of a standardized sequence dataset.

Methods: We identified six SARS-CoV-2 sequence datasets from recent publications, public databases and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we used a previously published dataset format, which records accession information and whole-dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study.
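The verification step mentioned above can be illustrated with a checksum comparison. The following is a minimal sketch only, assuming that each downloaded file is checked against a SHA-256 digest recorded alongside its accession (the keyword list includes sha256); the function names `sha256_of_file` and `verify_download` are hypothetical and are not taken from the authors' script.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large read files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True when the file's digest matches the expected checksum.

    The expected value is normalized (whitespace stripped, lowercased)
    because checksum tables are not always consistent about case.
    """
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would pass the path of a downloaded file and the digest listed for it, and treat a `False` result as a failed or corrupted download.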

Results: The benchmark datasets focus on the two most widely used sequencing platforms: long read sequencing data from the Oxford Nanopore Technologies platform and short read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived from data mining public databases to answer common questions not covered by published datasets; one unique dataset representing common sequence failures was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2.

Discussion: The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.

Keywords: Benchmarking; COVID-19; Standardization; WGS; sha256.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

**Figure 1. Automated workflow for identifying representative sequences for datasets.**
Sequences go through several quality checks before being considered as part of a benchmark dataset. These checks include lineage agreement with Pangolin, a minimum Phred score, a minimum depth of coverage, a check with the software TheiaCov, a check of the amplicon strategy, a minimization of the count of SNPs relative to a reference genome, and a check against the spike region’s mutations. Asterisks denote steps taken with in-house Python scripts.
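The filtering logic in the Figure 1 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' in-house scripts: it covers only the lineage agreement, minimum Phred score, minimum depth and SNP-minimization steps, and it omits the TheiaCov, amplicon strategy and spike mutation checks. The `Candidate` fields, the function name `pick_representative` and the default thresholds are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    accession: str          # hypothetical identifier for a candidate genome
    database_lineage: str   # lineage label recorded in the public database
    pangolin_lineage: str   # lineage assigned by a fresh Pangolin run
    mean_phred: float       # mean base quality of the reads
    mean_depth: float       # mean depth of coverage
    snp_count: int          # SNPs relative to the reference genome


def pick_representative(candidates, lineage, min_phred=30.0, min_depth=100.0):
    """Return the passing candidate with the fewest SNPs, or None.

    The thresholds are placeholder values, not those used in the study.
    """
    passing = [
        c for c in candidates
        if c.database_lineage == lineage
        and c.pangolin_lineage == lineage   # lineage agreement check
        and c.mean_phred >= min_phred       # minimum Phred score
        and c.mean_depth >= min_depth       # minimum depth of coverage
    ]
    if not passing:
        return None
    # Minimize the SNP count; break ties by accession so the result is stable.
    return min(passing, key=lambda c: (c.snp_count, c.accession))
```

Applying the cheap pass/fail filters first and minimizing the SNP count last means the SNP comparison is made only among genomes that already meet every threshold.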

Cited by

Genome-wide identification and molecular evolution of elongation family of very long chain fatty acids proteins in Cyrtotrachelus buqueti.
Fu C, Yang T, Liao H, Huang Y, Wang H, Long W, Jiang N, Yang Y. BMC Genomics. 2024 Aug 2;25(1):758. doi: 10.1186/s12864-024-10658-8. PMID: 39095734
Lessons learned: overcoming common challenges in reconstructing the SARS-CoV-2 genome from short-read sequencing data via CoVpipe2.
Lataretu M, Drechsel O, Kmiecinski R, Trappe K, Hölzer M, Fuchs S. F1000Res. 2024 Apr 16;12:1091. doi: 10.12688/f1000research.136683.1. eCollection 2023. PMID: 38716230
PHA4GE quality control contextual data tags: standardized annotations for sharing public health sequence datasets with known quality issues to facilitate testing and training.
Griffiths EJ, Mendes I, Maguire F, Guthrie JL, Wee BA, Schmedes S, Holt K, Yadav C, Cameron R, Barclay C, Dooley D, MacCannell D, Chindelevitch L, Karsch-Mizrachi I, Waheed Z, Katz L, Petit Iii R, Dave M, Oluniyi P, Nasar MI, Raphenya A, Hsiao WWL, Timme RE. Microb Genom. 2024 Jun;10(6):001260. doi: 10.1099/mgen.0.001260. PMID: 38860884
Bioinformatic investigation of discordant sequence data for SARS-CoV-2: insights for robust genomic analysis during pandemic surveillance.
Zufan SE, Lau KA, Donald A, Hoang T, Foster CSP, Sikazwe C, Theis T, Rawlinson WD, Ballard SA, Stinear TP, Howden BP, Jennison AV, Seemann T. Microb Genom. 2023 Nov;9(11):001146. doi: 10.1099/mgen.0.001146. PMID: 38019123
