Cell Genom. 2022 May;2(5):100128.
doi: 10.1016/j.xgen.2022.100128.

Benchmarking challenging small variants with linked and long reads

Justin Wagner et al. Cell Genom. 2022 May.

Abstract

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly, up from 85% in the previous version, while excluding regions that are problematic for benchmarking small variants, such as copy number variants, and that should not have been in the previous version. The new benchmark identifies eight times more false negatives in a short read variant call set than our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.

Conflict of interest statement

A.M.W. and W.J.R. are employees and shareholders of Pacific Biosciences. A.M.B. and I.T.F. were employees and shareholders of 10X Genomics. F.J.S. has received sponsored travel from Oxford Nanopore and Pacific Biosciences and a 2018 sequencing grant from Pacific Biosciences. A.S. and V.K. are employees of Seven Bridges. A.C. is an employee of Google Inc. and a former employee of DNAnexus. A.F. and C.-S.C. are employees of DNAnexus.

Figures

Figure 1
The new benchmark set includes more of the reference genome and more variants. (A) Percentage of the genomic region that is included by HG002 v.3.3.2 and v.4.2.1 for all non-gap, autosomal GRCh38 bases; the MHC; low-mappability regions and segmental duplications; and 159 difficult-to-map, medically relevant genes described previously. (B) The number of unique SNVs by genomic context. Circle size indicates the total number of SNVs in the union of v.3.3.2 and v.4.2.1. Circles above the diagonal indicate a net gain of SNVs in the newer benchmark, and circles below the diagonal indicate a net loss of SNVs in the newer benchmark.
Figure 2
v.4.2.1 includes many more difficult-to-map, medically relevant genes. (A) Cumulative distribution of the percentage of each gene included in HG002 v.4.2.1 benchmark regions for 159 autosomal difficult-to-map, medically relevant genes. Dashed lines indicate that the number of genes that are more than 90% included increased from 19 in v.3.3.2 to 110 in v.4.2.1. (B) Pairwise comparison of difficult-to-map, medically relevant gene inclusion in the benchmark set. Genes falling on the dashed line are similarly included by both benchmark sets, whereas genes above (red fill) or below (blue fill) the dashed line are included more by the v.4.2.1 or v.3.3.2 benchmark set, respectively. The genes included more by v.4.2.1 tend to be in segmental duplications, and the smaller number of genes included more by v.3.3.2 are mostly genes that are duplicated in HG002 relative to GRCh38 and should be excluded.
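The per-gene inclusion percentage described in this caption can be illustrated with a minimal sketch. It assumes that gene and benchmark regions are given as 0-based, half-open intervals on a single chromosome (as in BED files); the function names and the toy coordinates are hypothetical and are not taken from the paper or its pipeline.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double counting bases.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def percent_included(gene_start, gene_end, benchmark_regions):
    """Percentage of the gene's bases that fall inside benchmark regions."""
    gene_length = gene_end - gene_start
    if gene_length <= 0:
        raise ValueError("gene interval must have positive length")
    covered = 0
    for start, end in merge_intervals(benchmark_regions):
        overlap = min(end, gene_end) - max(start, gene_start)
        if overlap > 0:
            covered += overlap
    return 100.0 * covered / gene_length


# Toy coordinates (hypothetical, not real benchmark regions).
toy_regions = [(100, 300), (250, 400), (900, 1000)]
print(percent_included(0, 1000, toy_regions))  # 400 of 1000 bases covered
```

Merging the regions first keeps overlapping benchmark intervals from inflating the percentage; a gene would count toward the "more than 90% included" group when this value exceeds 90.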
Figure 3
Genes in the KIR locus are excluded in v.4.2.1 because of duplication in HG002. Medically relevant genes in the KIR locus, such as KIR2DL1, were partially included in v.3.3.2 with many erroneous variants but are correctly excluded by v.4.2.1 because of a likely duplication and other structural variation. Thick blue bars indicate regions included by each benchmark, and orange and light blue lines indicate positions of homozygous and heterozygous benchmark variants, respectively. A duplication of part of this region, which is common in the population, is supported by higher-than-normal coverage and high variant density across all technologies as well as alignment of multiple contigs from the maternal trio-based HG002 Hifiasm assembly (Hifiasm-maternal). The region is very challenging to characterize and assemble accurately because of high variability and copy number polymorphisms in the population as well as segmental duplications (shaded regions).
Figure 4
The difficult-to-map, medically relevant gene PMS2 is better included in v.4.2.1. The medically relevant gene PMS2 is 85.6% included in the v.4.2.1 benchmark regions, whereas it is 25.9% included in v.3.3.2 because segmental duplications (shaded regions) were largely excluded in previous benchmark versions. Thick blue bars indicate regions included by each benchmark, and orange and light blue lines indicate positions of homozygous and heterozygous benchmark variants, respectively. This region is challenging for assembly-based approaches, and an extra contig from the maternal trio-based HG002 Hifiasm assembly (Hifiasm-maternal) aligned to the left half of the gene because of misalignment or misassembly.
Figure 5
Summary of manual curations from the evaluation of the v.4.1 benchmark, demonstrating that it reliably identifies FPs and FNs in 13 call sets from different technologies and variant callers. (A) For each call set, we curated 20 FPs and 20 FNs; the panel shows the proportion of curated FP and FN variants for which the benchmark set was correct and the query call set was incorrect. The dashed black line indicates the desired majority threshold, 50%. Half of the curated variants were from GRCh37, and half were from GRCh38. (B) Breakdown of the total number of variants by category determined during manual curation, where the benchmark curation bar indicates whether the benchmark variant and genotype were determined to be correct and the query curation color indicates whether the query variant and genotype were determined to be correct. (A and B) Panel (B) excludes the variants from (A) for which the benchmark was deemed correct and the query incorrect, and it shows that most of the remaining sites were difficult to curate. (C) Benchmark-unsure variants by call set. ONT, Oxford Nanopore; PB, PacBio HiFi; Ill, Illumina PCR-free; 10X, 10X Genomics.
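As a rough illustration of the proportion plotted in panel (A), the sketch below counts curated sites where the benchmark was judged correct and the query incorrect, then compares the fraction with the 50% majority threshold. The record layout (pairs of `True`, `False`, or `None` for unsure), the function name, and the toy data are assumptions made for this example; they do not reproduce the paper's curation data.

```python
MAJORITY_THRESHOLD = 0.5  # the 50% line described in the caption


def benchmark_correct_fraction(curations):
    """Fraction of curated sites where the benchmark was correct and the query was not.

    Each curation is a (benchmark_correct, query_correct) pair whose entries are
    True, False, or None when the curator was unsure.
    """
    if not curations:
        raise ValueError("no curated variants supplied")
    wins = sum(1 for bench, query in curations if bench is True and query is False)
    return wins / len(curations)


# Toy curation results for one call set (hypothetical, 20 curated sites).
toy_curations = (
    [(True, False)] * 17    # benchmark correct, query incorrect
    + [(None, None)] * 2    # unsure sites
    + [(False, True)] * 1   # benchmark incorrect, query correct
)
fraction = benchmark_correct_fraction(toy_curations)
print(fraction, fraction > MAJORITY_THRESHOLD)
```

Unsure sites stay in the denominator here, so they lower the fraction rather than being silently dropped; a different convention would need to be stated explicitly.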

Similar articles

  • Curated variation benchmarks for challenging medically relevant autosomal genes.
    Wagner J, Olson ND, Harris L, McDaniel J, Cheng H, Fungtammasan A, Hwang YC, Gupta R, Wenger AM, Rowell WJ, Khan ZM, Farek J, Zhu Y, Pisupati A, Mahmoud M, Xiao C, Yoo B, Sahraeian SME, Miller DE, Jáspez D, Lorenzo-Salazar JM, Muñoz-Barrera A, Rubio-Rodríguez LA, Flores C, Narzisi G, Evani US, Clarke WE, Lee J, Mason CE, Lincoln SE, Miga KH, Ebbert MTW, Shumate A, Li H, Chin CS, Zook JM, Sedlazeck FJ. Nat Biotechnol. 2022 May;40(5):672-680. doi: 10.1038/s41587-021-01158-1. Epub 2022 Feb 7. PMID: 35132260 Free PMC article.
  • PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions.
    Olson ND, Wagner J, McDaniel J, Stephens SH, Westreich ST, Prasanna AG, Johanson E, Boja E, Maier EJ, Serang O, Jáspez D, Lorenzo-Salazar JM, Muñoz-Barrera A, Rubio-Rodríguez LA, Flores C, Kyriakidis K, Malousi A, Shafin K, Pesout T, Jain M, Paten B, Chang PC, Kolesnikov A, Nattestad M, Baid G, Goel S, Yang H, Carroll A, Eveleigh R, Bourgey M, Bourque G, Li G, Ma C, Tang L, Du Y, Zhang S, Morata J, Tonda R, Parra G, Trotta JR, Brueffer C, Demirkaya-Budak S, Kabakci-Zorlu D, Turgut D, Kalay Ö, Budak G, Narcı K, Arslan E, Brown R, Johnson IJ, Dolgoborodov A, Semenyuk V, Jain A, Tetikol HS, Jain V, Ruehle M, Lajoie B, Roddey C, Catreux S, Mehio R, Ahsan MU, Liu Q, Wang K, Sahraeian SME, Fang LT, Mohiyuddin M, Hung C, Jain C, Feng H, Li Z, Chen L, Sedlazeck FJ, Zook JM. Cell Genom. 2022 May 11;2(5):100129. doi: 10.1016/j.xgen.2022.100129. Epub 2022 Apr 27. PMID: 35720974 Free PMC article.
  • Benchmarking datasets for assembly-based variant calling using high-fidelity long reads.
    Lee H, Kim J, Lee J. BMC Genomics. 2023 Mar 27;24(1):148. doi: 10.1186/s12864-023-09255-y. PMID: 36973656 Free PMC article.
  • A diploid assembly-based benchmark for variants in the major histocompatibility complex.
    Chin CS, Wagner J, Zeng Q, Garrison E, Garg S, Fungtammasan A, Rautiainen M, Aganezov S, Kirsche M, Zarate S, Schatz MC, Xiao C, Rowell WJ, Markello C, Farek J, Sedlazeck FJ, Bansal V, Yoo B, Miller N, Zhou X, Carroll A, Barrio AM, Salit M, Marschall T, Dilthey AT, Zook JM. Nat Commun. 2020 Sep 22;11(1):4794. doi: 10.1038/s41467-020-18564-9. PMID: 32963235 Free PMC article.
  • Haplotyping-Assisted Diploid Assembly and Variant Detection with Linked Reads.
    Hu Y, Yang C, Zhang L, Zhou X. Methods Mol Biol. 2023;2590:161-182. doi: 10.1007/978-1-0716-2819-5_11. PMID: 36335499 Review.
