Fast and SNP-tolerant detection of complex variants and splicing in short reads

Bioinformatics. 2010 Apr 1;26(7):873-81. doi: 10.1093/bioinformatics/btq057. Epub 2010 Feb 10.

Thomas D Wu¹, Serban Nacu

PMID: 20147302
PMCID: PMC2844994
DOI: 10.1093/bioinformatics/btq057

Thomas D Wu et al. Bioinformatics. 2010.

Affiliation

¹ Department of Bioinformatics, Genentech, Inc., 1 DNA Way, South San Francisco, CA, USA. twu@gene.com

Abstract

Motivation: Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.

Results: In comparison testing, GSNAP has speeds comparable to existing programs, especially in reads of ≥70 nt, and is fastest in detecting complex variants with four or more mismatches or insertions of 1-9 nt and deletions of 1-30 nt. Although SNP tolerance does not increase alignment yield substantially, it affects alignment results in 7-8% of transcriptional reads, typically by revealing alternate genomic mappings for a read. Simulations of bisulfite-converted DNA show a decrease in identifying genomic positions uniquely in 6% of 36 nt reads and 3% of 70 nt reads.

Availability: Source code in C and utility programs in Perl are freely available for download as part of the GMAP package at http://share.gene.com/gmap.

Figures

**Fig. 1.**
Examples of complex variants detected by GSNAP. (A) A 17 nt indel with mismatches in reads (below) relative to a genomic region (above), matching a known polymorphism in dbSNP. (B) Splicing found using probabilistic models reveals an intron within exon 1 of HOXA9, supported experimentally (Dintilhac *et al.*, 2004). (C) Splicing found using known splice sites, despite low probabilistic model scores. Ellipses indicate ‘half intron’ alignments, where reads have insufficient sequence to determine the distal exon. (D) Interchromosomal splicing between BCAS4 and BCAS3 found in the MAQC universal human reference RNA sample and observed in MCF7 cell lines (Hampton *et al.*, 2009). (E) SNP-tolerant alignment near a splice site allows both genotypes to align equally well.

**Fig. 2.**
Representing a reference sequence and a reference space for genomic alignment. (A) A hash table consists of an offset file of possible 12mers and a position file containing a sorted list of genomic positions for each 12mer. (B) SNPs in a genomic 12mer are represented by duplicating the position in the lists for all combinations of major and minor alleles within the 12mer. (C) Major alleles are represented in one compressed genome, while minor alleles are represented in another compressed genome.
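
The layout described in Fig. 2A and 2B can be illustrated with a minimal sketch. This is not GSNAP's implementation (which is written in C): the function names, the `snps` dictionary format, and the in-memory lists standing in for the offset and position files are assumptions made for illustration. The compressed genomes of Fig. 2C are not modeled, and the sketch assumes a genome containing only A, C, G and T.

```python
from itertools import product

BASES = "ACGT"

def kmer_code(kmer):
    """Encode a k-mer over ACGT as a base-4 integer."""
    code = 0
    for base in kmer:
        code = code * 4 + BASES.index(base)
    return code

def build_index(genome, k=12, snps=None):
    """Build (offsets, positions) for all k-mers of `genome`.

    `snps` is a hypothetical input format: {genomic position: minor allele}.
    A k-mer window covering SNPs is entered once for every combination of
    major and minor alleles, so the same position appears in several lists.
    """
    snps = snps or {}
    buckets = {}
    for pos in range(len(genome) - k + 1):
        # Allowed alleles at each base of this window (major, then minor).
        choices = []
        for i in range(pos, pos + k):
            alleles = genome[i]
            if i in snps:
                alleles += snps[i]
            choices.append(alleles)
        for combo in product(*choices):
            buckets.setdefault(kmer_code(combo), []).append(pos)

    # Flatten into an offset table (one entry per possible k-mer, plus one)
    # and a single position list; each k-mer's positions stay sorted.
    offsets = [0]
    positions = []
    for code in range(4 ** k):
        positions.extend(buckets.get(code, []))
        offsets.append(len(positions))
    return offsets, positions

def lookup(offsets, positions, kmer):
    """Return the sorted genomic positions recorded for `kmer`."""
    code = kmer_code(kmer)
    return positions[offsets[code]:offsets[code + 1]]
```

With a small `k` for demonstration, `build_index("ACGTACGT", k=3, snps={2: "A"})` records position 0 under both `ACG` and `ACA`, so a read carrying either allele finds the same genomic position. The offset table has 4^k + 1 entries, so the default of `k=12` is slow to build in pure Python.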

**Fig. 3.**
Spanning set method for generating and filtering mismatch candidates. (A) A read of length 62 nt is analyzed at shifts of 0, 1 and 2 nt, with spanning sets each consisting of five elements. Elements at the ends may overhang the ends by 1 or 2 nt. Spanning set elements are shown in detail for the shift of 2 nt. (B) An overhanging 12mer is represented by a union of lists obtained from hash table lookups of all possible substitutions for the overhang. (C) Overlapping 12mers are represented by taking the intersection of their position lists. (D) Elements used for generating candidates (gray). (E) Elements used for filtering candidates (white). A candidate region (black) is supported by two of the generating elements, and is checked for support in the remaining filtering elements.
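
The two list operations named in Fig. 3B and 3C, a union over overhang substitutions and an intersection of position lists, can be sketched as follows. The function names and the `lookup` callable (any function mapping a k-mer string to a sorted list of genomic positions) are assumptions for illustration. The sketch also assumes that lists from overlapping k-mers are compared after adjusting for their offset within the read, which the caption does not spell out.

```python
from itertools import product

def overhang_positions(lookup, known, overhang, at_start):
    """Union of position lists over every filling of the overhanging bases.

    `known` is the part of the k-mer that lies inside the read; `overhang`
    is the number of bases that fall outside it (1 or 2 in Fig. 3A).
    Returned values are start positions of the full k-mer.
    """
    hits = set()
    for fill in product("ACGT", repeat=overhang):
        pad = "".join(fill)
        kmer = pad + known if at_start else known + pad
        hits.update(lookup(kmer))
    return sorted(hits)

def intersect_shifted(list_a, list_b, shift):
    """Positions p in list_a for which p + shift occurs in list_b.

    Both lists must be sorted; `shift` is the distance in the read between
    the two k-mers. Runs as a linear two-pointer merge.
    """
    out = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        target = list_a[i] + shift
        if list_b[j] < target:
            j += 1
        elif list_b[j] > target:
            i += 1
        else:
            out.append(list_a[i])
            i += 1
            j += 1
    return out
```

An overhang of 2 nt costs 16 lookups rather than one, which is presumably why overhangs are limited to the read ends.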

**Fig. 4.**
Complete set method for generating and filtering mismatch candidates. (A) Patterns of supporting (gray) and non-supporting (white) 12mers induced by a single mismatch, by two close mismatches, and by two distant mismatches (crosses). These patterns indicate a lower bound of ⌊(Δp+6)/12⌋ mismatches, where Δp is the distance between start locations of consecutive supporting 12mers. (B) Pattern-based lower bound calculation for a read of 51 nt, shown on top with actual mismatches. Supporting 12mers (gray) start at read locations 5, 8, 11 and 29, with end locations at −3 and L−9=42. The lower bound formula is summed over successive supporting 12mers to give a total lower bound of four mismatches.
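
The worked example in Fig. 4B can be reproduced with a short sketch. The function name is an assumption; the end locations of −3 and L−9 are taken directly from the caption and apply to 12mers only, so the sketch does not try to generalize to other k-mer sizes.

```python
def mismatch_lower_bound(supporting_starts, read_length):
    """Lower bound on mismatches implied by the supporting 12mers of a read.

    Sums floor((dp + 6) / 12) over consecutive supporting start locations,
    with end locations -3 and read_length - 9 added as in Fig. 4B.
    """
    points = [-3] + sorted(supporting_starts) + [read_length - 9]
    return sum((b - a + 6) // 12 for a, b in zip(points, points[1:]))

# Fig. 4B: 51 nt read, supporting 12mers starting at 5, 8, 11 and 29.
# Gaps are 8, 3, 3, 18 and 13, giving 1 + 0 + 0 + 2 + 1 = 4.
print(mismatch_lower_bound([5, 8, 11, 29], 51))  # prints 4
```

A read in which every 12mer is supporting (start locations 0 through L−12) yields a bound of zero, as expected for an exact match.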

**Fig. 5.**
Efficient detection of indels and splice pairs. (A) The complete set method generates candidate regions with supporting 12mers (gray). (B) Pairs of candidates within the allowed distance are tested for middle indels and short-distance splice pairs. The constraint on number of mismatches (shown for the value 1) determines a range of crossover points. (C) End indels are tested in the distal 14 nt when the long region of the read has a sufficiently low number of mismatches. (D) Long-distance splicing is detected by identifying known or novel splice sites in single candidate regions within areas defined by constraints on number of mismatches. Candidate regions with donor and acceptor splice sites are then paired to reveal splice junctions.
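
One way to picture the crossover range in Fig. 5B is the sketch below. It is an illustrative assumption, not the procedure from the paper: the caller is assumed to have already extracted, for each of the two candidate regions, the genomic bases aligned to every read position, and the function name and argument layout are hypothetical.

```python
def crossover_points(read, left_ref, right_ref, max_mismatches):
    """Crossover points c where read[:c] follows the left candidate and
    read[c:] follows the right candidate within the mismatch limit.

    `left_ref` and `right_ref` are genomic bases aligned to each read
    position under the two candidates (same length as `read`).
    """
    n = len(read)
    if len(left_ref) != n or len(right_ref) != n:
        raise ValueError("reference segments must match the read length")

    # prefix[c]: mismatches of read[:c] against the left candidate.
    prefix = [0]
    for r, g in zip(read, left_ref):
        prefix.append(prefix[-1] + (r != g))

    # suffix[c]: mismatches of read[c:] against the right candidate.
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + (read[i] != right_ref[i])

    # Only interior points are crossovers; both sides must be non-empty.
    return [c for c in range(1, n) if prefix[c] + suffix[c] <= max_mismatches]
```

For the read `AAAACCCC` with left segment `AAAAGGGG` and right segment `TTTTCCCC`, a limit of 0 leaves only crossover point 4, while a limit of 1 widens the range to 3 through 5.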

Cited by

A thesaurus of genetic variation for interrogation of repetitive genomic regions.
Kerzendorfer C, Konopka T, Nijman SM. Nucleic Acids Res. 2015 May 26;43(10):e68. doi: 10.1093/nar/gkv178. Epub 2015 Mar 27. PMID: 25820428.
Next generation sequencing in cancer research and clinical application.
Shyr D, Liu Q. Biol Proced Online. 2013 Feb 13;15(1):4. doi: 10.1186/1480-9222-15-4. PMID: 23406336.
PrfA-like transcription factor gene lmo0753 contributes to L-rhamnose utilization in Listeria monocytogenes strains associated with human food-borne infections.
Salazar JK, Wu Z, McMullen PD, Luo Q, Freitag NE, Tortorello ML, Hu S, Zhang W. Appl Environ Microbiol. 2013 Sep;79(18):5584-92. doi: 10.1128/AEM.01812-13. Epub 2013 Jul 8. PMID: 23835178.
Brain Transcriptomics of Wild and Domestic Rabbits Suggests That Changes in Dopamine Signaling and Ciliary Function Contributed to Evolution of Tameness.
Sato DX, Rafati N, Ring H, Younis S, Feng C, Blanco-Aguiar JA, Rubin CJ, Villafuerte R, Hallböök F, Carneiro M, Andersson L. Genome Biol Evol. 2020 Oct 1;12(10):1918-1928. doi: 10.1093/gbe/evaa158. PMID: 32835359.
Genome sequencing and mapping reveal loss of heterozygosity as a mechanism for rapid adaptation in the vegetable pathogen Phytophthora capsici.
Lamour KH, Mudge J, Gobena D, Hurtado-Gonzales OP, Schmutz J, Kuo A, Miller NA, Rice BJ, Raffaele S, Cano LM, Bharti AK, Donahoo RS, Finley S, Huitema E, Hulvey J, Platt D, Salamov A, Savidor A, Sharma R, Stam R, Storey D, Thines M, Win J, Haas BJ, Dinwiddie DL, Jenkins J, Knight JR, Affourtit JP, Han CS, Chertkov O, Lindquist EA, Detter C, Grigoriev IV, Kamoun S, Kingsmore SF. Mol Plant Microbe Interact. 2012 Oct;25(10):1350-60. doi: 10.1094/MPMI-02-12-0028-R. PMID: 22712506.
