Genome Res. 2017 Jan;27(1):157-164.
doi: 10.1101/gr.210500.116. Epub 2016 Nov 30.

A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree

Michael A Eberle et al. Genome Res. 2017 Jan.

Abstract

Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
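The idea of keeping only genotypes "consistent with the pattern of inheritance" can be illustrated with a much simpler per-site check. The sketch below is an assumption-laden simplification, not the haplotype-transmission method used in the study: it only asks whether every child's genotype could be formed from one allele of each parent. The function names and the tuple-of-alleles genotype representation are hypothetical.

```python
from itertools import product


def possible_child_genotypes(father, mother):
    """Return every unordered genotype a child could inherit,
    taking one allele from each parent (simple diploid model)."""
    return {tuple(sorted((a, b))) for a, b in product(father, mother)}


def consistent_with_inheritance(father, mother, children):
    """True if all child genotypes are compatible with the parental
    genotypes at this site. Genotypes are 2-tuples of alleles,
    e.g. ("A", "G"). This is a per-site Mendelian check only; it
    does not use haplotype phase across neighboring sites."""
    allowed = possible_child_genotypes(father, mother)
    return all(tuple(sorted(child)) in allowed for child in children)


# Example: a heterozygous father and a homozygous mother.
father = ("A", "G")
mother = ("A", "A")
print(consistent_with_inheritance(father, mother, [("A", "A"), ("G", "A")]))
print(consistent_with_inheritance(father, mother, [("G", "G")]))
```

A check of this kind flags a site as soon as one child carries a genotype the parents could not have produced. A haplotype-based approach across 11 children is stricter, because it also requires the pattern of transmitted alleles to agree with neighboring sites.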

Figures

Figure 1.
Pedigree of the family sequenced for this study (CEPH pedigree 1463). The Coriell ID for each sample is defined by adding the prefix NA128 to each numbered individual: e.g., 77 = NA12877. Samples filled with dark orange are used in this analysis, but the founder generations (blue) were also sequenced and used as further validation of the haplotypes generated in this study. The trio, 91-92-78, was also sequenced during Phase I of the 1000 Genomes Project (The 1000 Genomes Project Consortium 2010).
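The naming rule in the Figure 1 caption can be written as a one-line helper. This is a minimal sketch; the function name `coriell_id` is hypothetical and the two-digit restriction is an assumption based on the examples given in the caption (77, 78, 91, 92).

```python
def coriell_id(individual: int) -> str:
    """Build a Coriell sample ID by prefixing the two-digit
    pedigree number with "NA128", as described in the caption."""
    if not 0 <= individual <= 99:
        raise ValueError("expected a two-digit pedigree number")
    return f"NA128{individual:02d}"


# The trio named in the caption (91-92-78).
print([coriell_id(n) for n in (91, 92, 78)])
```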
Figure 2.
Structural abnormalities account for most of the inconsistencies in detection of the platinum haplotypes in the founders. (A) A 20-Mb structural abnormality identified by 177 variants in nine clusters that failed the founder haplotype validation test in NA12889. Based on the marked skew in frequency of the SNV alleles within the rearrangement (4:1) compared to the proximal flanking region (1:1), this event is likely a mosaic on the distal end of Chromosome 11, specific to NA12889. Each point represents the allele count at a SNV location expected to be heterozygous in NA12889 based on seeing the nontransmitted allele at least six times. Red points show the allele counts (average n = 11 within the mosaic) of the haplotype transmitted to NA12877, and blue points show the allele counts (average n = 43 in the mosaic) of the nontransmitted haplotype. The black line shows the average total depth in windows of 100 SNVs, highlighting that this mosaic is not associated with a change in copy number. (B) Allele counts for SNV positions in a possible cell-line somatic deletion in NA12891 identified by a cluster of 174 k-mer failures. Points are colored the same as in A. Within this deletion, there are virtually no reads corresponding to the transmitted haplotype, whereas a relatively constant read depth was observed across the region for the nontransmitted haplotype.
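The allele skew quoted in panel A can be checked from the average allele counts given in the caption (11 for the transmitted haplotype, 43 for the nontransmitted one). The sketch below is illustrative only; the helper names are hypothetical and the inputs are the caption's averages, not per-site data.

```python
def allele_fraction(transmitted: float, nontransmitted: float) -> float:
    """Fraction of reads supporting the transmitted allele."""
    total = transmitted + nontransmitted
    if total == 0:
        raise ValueError("no reads at this position")
    return transmitted / total


def skew_ratio(transmitted: float, nontransmitted: float) -> float:
    """Ratio of the more frequent allele to the less frequent one;
    a balanced heterozygous site gives a value near 1."""
    minor, major = sorted((transmitted, nontransmitted))
    if minor == 0:
        return float("inf")
    return major / minor


# Average counts inside the mosaic region, taken from the caption.
print(round(skew_ratio(11, 43), 1))       # 3.9, i.e. roughly 4:1
print(round(allele_fraction(11, 43), 2))  # 0.2
```

Because the total depth (11 + 43 = 54 on average) is not reported as changing across the region, the skew points to a copy-neutral event rather than a deletion or duplication, which matches the caption's interpretation.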
Figure 3.
CNVs in this pedigree identified from nonplatinum variants. (A) Duplication on Chromosome 1 containing 242 Category 1 nonplatinum SNVs in a region of elevated read depth. Colored lines show the depths for the parents (NA12877 in purple and NA12878 in orange), and the gray lines show the depths for each of the children. Points along the bottom highlight the platinum variants (blue) and the nonplatinum variants (red). Note that there are 16 platinum variants in this duplication, because the presence of duplicated sequence can still produce genotypes that are consistent with those predicted by a diploid model (Supplemental Table S15). (B) Deletion on Chromosome 4 identified by 176 Category 2 nonplatinum SNVs that were consistent with the presence of a large hemizygous deletion. In addition to the large deletion, the depth supports the presence of several segmental duplications both overlapping and flanking the deletion. Lines and points are colored as in A. (C) Cell line or somatic deletions in multiple members of the pedigree on Chromosome 22 identified by 926 Category 4 nonplatinum SNVs. Lines are colored as in A, except for the addition of a black line that shows the depth for NA12893 (child). Although the other children (gray lines) do not appear to be deleted in any part of this region, the depths are highly variable, which may indicate somatic instability and mosaicism. The variability of this region within different cell line passages is evident when we compared these sequence data for NA12878 against corresponding data from the 1000 Genomes Project (1000 Genomes Project Consortium 2010) and NIST (Zook et al. 2014) (Supplemental Fig. S11).
Figure 4.
Precision versus recall in NA12878 evaluated against the Platinum catalog data set. Triangles, circles, and squares, respectively, represent the results from 30×, 40×, and 50× sequencing depth for Platypus (red), FreeBayes (blue), GATK3 (green), and Strelka (black). Excluding Strelka, all callers are run in joint calling mode incorporating the parents. (A) Indel (large symbols) and SNV (small symbols) results plotted on the same axes. (B) Expansion of SNV results, also showing ROC curves for GATK3 and Strelka that reflect the trade-off of recall versus precision that is obtained by altering specific variable parameters when using the algorithms.
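For readers unfamiliar with the two axes, precision and recall can be computed from counts of true positives, false positives, and false negatives obtained by comparing a caller's output with a truth catalog. This is a minimal sketch using the standard definitions; the counts in the example are made up for illustration and are not results from the study, and the excerpt does not describe the exact comparison procedure the authors used.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard definitions:
    precision = TP / (TP + FP)  (fraction of calls that are correct)
    recall    = TP / (TP + FN)  (fraction of true variants that are called)
    Returns 0.0 for a metric whose denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)  # 0.9 0.75
```

Tightening a caller's filtering threshold usually removes false positives at the cost of some true positives, which is the trade-off the curves in panel B trace out.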

References

1000 Genomes Project Consortium. 2010. A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74.
Abecasis GR, Cherny SS, Cookson WO, Cardon LR. 2002. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30: 97–101.
Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH. 2011. Accurate and comprehensive sequencing of personal genomes. Genome Res 21: 1498–1505.