Optimal spliced alignments of short sequence reads
- PMID: 18689821
- DOI: 10.1093/bioinformatics/btn300
Abstract
Motivation: Next-generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. Although reads produced by these technologies are relatively short and error-prone compared with those from the Sanger method, their throughput is several orders of magnitude higher. To use such reads for transcriptome sequencing and gene structure identification, one needs to align them accurately across intron boundaries. This is a significant challenge given their short length and inherently high error rate.
Results: We present a novel approach, called QPALMA, for computing accurate spliced alignments that takes advantage of the reads' quality information as well as computational splice site predictions. Our method uses a training set of spliced reads with quality information and known alignments. It uses a large margin approach similar to support vector machines to estimate parameters that maximize alignment accuracy. In computational experiments, we show that both the quality information and the splice site predictions help to improve alignment quality. Finally, to facilitate mapping of the massive amounts of sequencing data typically generated by the new technologies, we have combined our method with a fast mapping pipeline based on enhanced suffix arrays. Our algorithms were optimized and tested using reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana.
Availability: Datasets for training and evaluation, additional results and a stand-alone alignment tool implemented in C++ and Python are available at http://www.fml.mpg.de/raetsch/projects/qpalma.
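Illustrative sketch: the idea of combining read quality with splice site predictions in a spliced alignment can be shown with a small dynamic program. This is not the QPALMA implementation. It assumes an ungapped alignment (no insertions or deletions), hand-set scores instead of parameters learned with a large margin method, and a simple quality weighting in which a base contributes in proportion to its Phred-derived confidence. The names `base_score`, `spliced_align`, `donor`, `acceptor`, `min_intron` and `intron_penalty`, and the toy sequences, are all hypothetical.

```python
def base_score(read_base, genome_base, phred):
    """Quality-weighted substitution score (assumed scheme, not learned)."""
    confidence = 1.0 - 10 ** (-phred / 10.0)  # 1 - P(base call is wrong)
    return confidence if read_base == genome_base else -confidence


def spliced_align(read, quals, genome, donor, acceptor,
                  min_intron=4, intron_penalty=1.0):
    """Ungapped spliced alignment of a read to a genomic sequence.

    donor[p] scores an intron starting at genome position p,
    acceptor[p] scores an intron ending at genome position p.
    Returns (score, exon blocks as inclusive genome coordinates).
    """
    m, n = len(read), len(genome)
    neg = float("-inf")
    # S[i][j]: best score with read[i] aligned to genome[j]
    S = [[neg] * n for _ in range(m)]
    back = [[None] * n for _ in range(m)]
    for j in range(n):
        S[0][j] = base_score(read[0], genome[j], quals[0])
    for i in range(1, m):
        for j in range(1, n):
            best, prev = S[i - 1][j - 1], j - 1  # continue the same exon
            # or skip an intron covering genome[k+1 .. j-1]
            for k in range(0, j - min_intron):
                cand = (S[i - 1][k] + donor[k + 1] + acceptor[j - 1]
                        - intron_penalty)
                if cand > best:
                    best, prev = cand, k
            if best > neg:
                S[i][j] = best + base_score(read[i], genome[j], quals[i])
                back[i][j] = prev
    end = max(range(n), key=lambda j: S[m - 1][j])
    score = S[m - 1][end]
    if score == neg:
        raise ValueError("read cannot be placed on this genomic sequence")
    # Trace back and collect exon blocks
    blocks, j, block_end = [], end, end
    for i in range(m - 1, 0, -1):
        p = back[i][j]
        if p != j - 1:  # an intron was skipped here
            blocks.append((j, block_end))
            block_end = p
        j = p
    blocks.append((j, block_end))
    blocks.reverse()
    return score, blocks


# Toy example: exon (0-5), intron (6-17), exon (18-22)
genome = "ACGTAC" + "GTAAGTTTTCAG" + "TTGCA"
read = "GTACTTGC"            # spans the exon-exon junction
quals = [30] * len(read)     # Phred qualities, all high
donor = [-5.0] * len(genome)     # strongly negative where no site is predicted
acceptor = [-5.0] * len(genome)
donor[6] = 2.0               # assumed predictor output at the true donor
acceptor[17] = 2.0           # assumed predictor output at the true acceptor

score, blocks = spliced_align(read, quals, genome, donor, acceptor)
print(blocks)  # [(2, 5), (18, 21)]
```

In this sketch a mismatch at a low-quality base costs less than one at a high-quality base, and an intron is only worth opening where the splice site scores outweigh the intron penalty. The triple loop is cubic in the worst case and is meant for short toy inputs only.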