Accurate SNP and mutation detection by targeted custom microarray-based genomic enrichment of short-fragment sequencing libraries

Nucleic Acids Res. 2010 Jun;38(10):e116. Epub 2010 Feb 17.

Authors

Michal Mokry¹, Harma Feitsma, Isaac J Nijman, Ewart de Bruijn, Pieter J van der Zaag, Victor Guryev, Edwin Cuppen

Affiliation

¹ Hubrecht Institute and University Medical Center Utrecht, KNAW, Uppsalalaan 8, 3584 CT Utrecht, The Netherlands.

PMID: 20164091
PMCID: PMC2879533
DOI: 10.1093/nar/gkq072

Abstract

Microarray-based enrichment of selected genomic loci is a powerful method for genome complexity reduction for next-generation sequencing. Since the vast majority of exons in vertebrate genomes are smaller than 150 nt, we explored the use of short-fragment libraries (85-110 bp) to achieve higher enrichment specificity by reducing carryover and adverse effects of flanking intronic sequences. High enrichment specificity (60-75%) was obtained with relatively even base coverage. Up to 98% of the target sequence was covered more than 20x at an average coverage depth of about 200x. To verify the accuracy of SNP/mutation detection, we evaluated 384 known non-reference SNPs in the targeted regions. At approximately 200x average sequence coverage, we were able to survey 96.4% of 1.69 Mb of genomic sequence with only 4.2% false negative calls, mostly due to low coverage. Using the same settings, a total of 1197 novel candidate variants were detected. Verification experiments revealed only eight false positive calls, indicating an overall false positive rate of less than 1 per approximately 200,000 bp. Taken together, short-fragment libraries provide highly efficient and flexible enrichment of exonic targets and yield relatively even base coverage, which facilitates accurate SNP and mutation detection. Raw sequencing data, alignment files and called SNPs have been submitted to the GEO database (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE18542.
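The quoted false positive rate can be checked with a few lines of arithmetic. The sketch below assumes that the denominator is the surveyed part of the target (96.4% of 1.69 Mb); the abstract does not state the denominator explicitly, and the function and variable names are illustrative only.

```python
# Figures taken from the abstract; the choice of denominator is an assumption.
TARGET_BP = 1_690_000        # 1.69 Mb of targeted genomic sequence
SURVEYED_FRACTION = 0.964    # 96.4% of the target could be surveyed
FALSE_POSITIVES = 8          # false positive calls found during verification


def bp_per_false_positive(target_bp, surveyed_fraction, false_positives):
    """Return the number of surveyed bases per false positive call."""
    surveyed_bp = target_bp * surveyed_fraction
    return surveyed_bp / false_positives


rate = bp_per_false_positive(TARGET_BP, SURVEYED_FRACTION, FALSE_POSITIVES)
print(f"about 1 false positive per {rate:,.0f} bp")
```

Under this assumption the result is just over 200,000 bp per false positive call, in line with the rate stated in the abstract.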


Figures

**Figure 1.**
Size distribution of human exons. The median size of human exons is only 120 bp (with 70% of all exons shorter than 200 bp). Therefore, much of the specifically captured DNA from long-fragment libraries will consist of sequences derived from introns flanking the exons of interest, which decreases the effective sequencing yield.

**Figure 2.**
Comparison of sequence coverage evenness after enrichment. The fraction of target positions with at least a given coverage is plotted against that coverage expressed relative to the average coverage. (A) Comparison of various enrichment (washing temperature and input DNA) and sequencing (35- versus 50-mer) conditions. Library 1 sequenced with 50-mer reads gives the most even coverage of all libraries. The brown curve depicts the best possible evenness for an ideal evenly enriched sample with 100× average coverage, where the unevenness is caused purely by statistical randomness in the coverage, assuming a Poisson distribution of the sequencing reads. (B) The evenness score, E, represents the fraction of the whole sequencing throughput that is correctly distributed (marked area below the curve). Consequently, 1-E represents the fraction of the whole sequencing output that would have to be redistributed (by better enrichment) from positions with above-average coverage to positions with below-average coverage to obtain ideal even coverage over all targeted positions. The more even the coverage, the higher the evenness score. (C) Correlation of the evenness score E for randomized sets with sequencing depth. In this simulation, the unevenness of the datasets is caused purely by the random distribution of reads and fits a Poisson distribution of sequence coverage. Once the discrete character of the data is reduced by sufficient depth of coverage, E changes only slightly with increasing average coverage and can therefore be regarded as relatively independent of sequencing depth. (D) Comparison between + and – strand coverage for 35- and 50-mer reads. With 50-mer reads, the coverage is more even, with fewer positions covered by extremely low (or high) numbers of sequencing tags. This difference is more prominent when the coverage is determined separately for the positive or negative strand. Independent strand coverage is better for 50-mer than for 35-mer sequencing.
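The evenness score described for panels B and C can be sketched in code. The version below is one reading of the caption, not the authors' implementation: it takes 1-E to be the share of all sequenced bases that lies above the average coverage, so that E = 1 - (excess above the mean) / (total coverage). The function names and the toy coverage values are hypothetical. The second function gives the expected E for an ideal sample whose per-base coverage is Poisson distributed, assuming the sample mean equals the Poisson mean.

```python
import math


def evenness_score(coverage):
    """Evenness E for a list of per-base coverage values.

    Assumed reading of panel B: 1 - E is the share of all sequenced bases
    that sits above the average coverage and would have to be moved to
    positions below the average.
    """
    total = sum(coverage)
    if not coverage or total == 0:
        raise ValueError("need at least one position with non-zero coverage")
    mean = total / len(coverage)
    excess = sum(c - mean for c in coverage if c > mean)
    return 1.0 - excess / total


def poisson_evenness(mean_coverage):
    """Expected E when per-base coverage is Poisson with the given mean.

    Uses E = E[min(X, mean)] / mean, summing the Poisson probability mass
    function up to a point far in the upper tail.
    """
    lam = float(mean_coverage)
    if lam <= 0:
        raise ValueError("mean coverage must be positive")
    upper = int(lam + 12 * math.sqrt(lam) + 30)
    expected_min = 0.0
    for k in range(upper + 1):
        # Work in log space so large factorials do not overflow.
        log_pmf = -lam + k * math.log(lam) - math.lgamma(k + 1)
        expected_min += min(k, lam) * math.exp(log_pmf)
    return expected_min / lam
```

For example, `evenness_score([0, 0, 20, 20])` returns 0.5, since half of the sequenced bases would have to move to the uncovered positions. `poisson_evenness` rises with depth and flattens out (roughly 0.96 at a mean of 100), which matches the behaviour described for panel C.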

**Figure 3.**
Representative example of target coverage after enrichment. Sequencing results of Library 1 are shown for 35-mer (green) and 50-mer (purple) sequencing. Total, positive-strand and negative-strand reads are shown separately. With the longer sequencing reads, coverage is more even and both strands are better represented.

**Figure 4.**
Correlation of probe density and sequencing coverage. Each genomic region was represented on the array with a variable number of capture probes throughout the region. The sequencing coverage per base (blue line) linearly correlates with probe density (red line).

**Figure 5.**
Distribution of non-reference allele reads. The percentage of non-reference allele reads was calculated for every heterozygous and homozygous non-reference allele position in the targeted region (n = 1197) and is represented in bins of 5%. For heterozygous calls, the distribution is skewed towards reference allele reads.
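The binning behind this kind of histogram can be sketched as follows. This is a minimal illustration that assumes per-position counts of reference and non-reference reads are available; the function names and the example read counts are hypothetical and are not data from the paper.

```python
def nonref_percentage(ref_reads, nonref_reads):
    """Percentage of reads at one position that carry the non-reference allele."""
    total = ref_reads + nonref_reads
    if total == 0:
        raise ValueError("position has no reads")
    return 100.0 * nonref_reads / total


def bin_counts(percentages, width=5):
    """Count values per bin of the given width (assumed to divide 100).

    Bins are [0, 5), [5, 10), ..., with 100% placed in the last bin.
    """
    n_bins = 100 // width
    counts = [0] * n_bins
    for p in percentages:
        if not 0 <= p <= 100:
            raise ValueError(f"percentage out of range: {p}")
        index = min(int(p // width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical (reference, non-reference) read counts, not data from the paper.
example_positions = [(110, 90), (0, 180), (120, 80), (95, 105)]
percentages = [nonref_percentage(r, n) for r, n in example_positions]
histogram = bin_counts(percentages)
```

In an unbiased experiment heterozygous positions would cluster around 50%. A skew towards the reference allele, as reported in the caption, shows up as more counts in the bins just below 50% than in those just above.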

**Figure 6.**
Sequencing coverage and percentage of non-reference allele reads for validated and non-validated SNPs. All polymorphic and non-reference positions identified by the SNP detection pipeline are plotted by total base coverage against non-reference read frequency. Validated SNPs (either by their presence in dbSNP or by resequencing) are shown in blue, non-validated SNPs in red, and positions for which no working validation assay could be designed in green. False-positive SNPs tend to have a lower percentage of non-reference allele reads and/or low overall coverage.


Cited by

Next generation diagnostics in inherited arrhythmia syndromes: a comparison of two approaches.
Ware JS, John S, Roberts AM, Buchan R, Gong S, Peters NS, Robinson DO, Lucassen A, Behr ER, Cook SA. J Cardiovasc Transl Res. 2013 Feb;6(1):94-103. doi: 10.1007/s12265-012-9401-8. Epub 2012 Sep 7. PMID: 22956155. Free PMC article.
Using a priori knowledge to align sequencing reads to their exact genomic position.
Böttcher R, Amberg R, Ruzius FP, Guryev V, Verhaegh WF, Beyerlein P, van der Zaag PJ. Nucleic Acids Res. 2012 Sep;40(16):e125. doi: 10.1093/nar/gks393. Epub 2012 May 11. PMID: 22581774. Free PMC article.
Next generation sequence analysis and computational genomics using graphical pipeline workflows.
Torri F, Dinov ID, Zamanyan A, Hobel S, Genco A, Petrosyan P, Clark AP, Liu Z, Eggert P, Pierce J, Knowles JA, Ames J, Kesselman C, Toga AW, Potkin SG, Vawter MP, Macciardi F. Genes (Basel). 2012 Aug 30;3(3):545-75. doi: 10.3390/genes3030545. PMID: 23139896. Free PMC article.
Genomic DNA pooling strategy for next-generation sequencing-based rare variant discovery in abdominal aortic aneurysm regions of interest-challenges and limitations.
Harakalova M, Nijman IJ, Medic J, Mokry M, Renkens I, Blankensteijn JD, Kloosterman W, Baas AF, Cuppen E. J Cardiovasc Transl Res. 2011 Jun;4(3):271-80. doi: 10.1007/s12265-011-9263-5. Epub 2011 Mar 1. PMID: 21360310. Free PMC article.
Chromothripsis is a common mechanism driving genomic rearrangements in primary and metastatic colorectal cancer.
Kloosterman WP, Hoogstraat M, Paling O, Tavakoli-Yaraki M, Renkens I, Vermaat JS, van Roosmalen MJ, van Lieshout S, Nijman IJ, Roessingh W, van 't Slot R, van de Belt J, Guryev V, Koudijs M, Voest E, Cuppen E. Genome Biol. 2011 Oct 19;12(10):R103. doi: 10.1186/gb-2011-12-10-r103. PMID: 22014273. Free PMC article.


