Adjust quality scores from alignment and improve sequencing accuracy

doi:10.1093/nar/gkh850

Comparative Study

Nucleic Acids Res. 2004 Sep 30;32(17):5183-91. Print 2004.

Ming Li¹, Magnus Nordborg, Lei M Li

PMID: 15459287
PMCID: PMC521663
DOI: 10.1093/nar/gkh850

Ming Li et al. Nucleic Acids Res. 2004.

Affiliation

¹ Computational Biology, University of Southern California, Los Angeles, CA, USA.

Abstract

In shotgun sequencing, statistical reconstruction of a consensus from alignment requires a model of measurement error. Churchill and Waterman proposed one such model and an expectation-maximization (EM) algorithm to estimate sequencing error rates for each assembly matrix. Ewing and Green defined Phred quality scores for base-calling from sequencing traces by training a model on a large amount of data. However, sample preparations and sequencing machines may work under different conditions in practice and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values. We develop models of measurement error for shotgun sequencing by combining the two perspectives above. We propose a logistic model taking quality scores as covariates. The model is trained by a procedure combining an EM algorithm and model selection techniques. The training results in calibration of quality values and leads to a more accurate construction of consensus. Besides Phred scores obtained from ABI sequencers, we apply the same technique to calibrate quality values that come along with Beckman sequencers.
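
The calibration idea in the abstract can be illustrated with a minimal sketch. A Phred-style score q nominally encodes an error probability p = 10^(-q/10); a logistic model with the score as covariate, logit(p) = b0 + b1·q, lets the observed data correct that nominal value. The sketch below is not the authors' procedure: it assumes that error counts per score are already known (the paper instead estimates them with an EM algorithm from the alignment, because the true bases are unobserved), it models only a binary error/no-error outcome rather than base-specific error patterns, and all function names and the tallies in the demo are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Nominal error probability implied by a Phred-style score."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Inverse mapping: express an error probability as a Phred-style score."""
    return -10.0 * math.log10(p)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def log_likelihood(b0, b1, scores, n_bases, n_errors):
    """Binomial log-likelihood (up to a constant) of logit(p) = b0 + b1*q."""
    total = 0.0
    for q, n, k in zip(scores, n_bases, n_errors):
        z = b0 + b1 * q
        total += -k * log1pexp(-z) - (n - k) * log1pexp(z)
    return total


def fit_logistic(scores, n_bases, n_errors, iterations=50, tol=1e-9):
    """Fit (b0, b1) by Newton's method with step halving.

    Assumes error counts are observed directly; this replaces the EM step
    of the paper with a plain supervised fit for illustration only.
    """
    b0, b1 = 0.0, 0.0
    ll = log_likelihood(b0, b1, scores, n_bases, n_errors)
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for q, n, k in zip(scores, n_bases, n_errors):
            p = sigmoid(b0 + b1 * q)
            r = k - n * p            # residual: observed minus expected errors
            w = n * p * (1.0 - p)    # binomial variance weight
            g0 += r
            g1 += r * q
            h00 += w
            h01 += w * q
            h11 += w * q * q
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton direction: solve H d = g
        d1 = (h00 * g1 - h01 * g0) / det
        if max(abs(d0), abs(d1)) < tol:
            break
        step = 1.0
        new_ll = log_likelihood(b0 + d0, b1 + d1, scores, n_bases, n_errors)
        while new_ll < ll and step > 1e-8:  # halve the step until no decrease
            step /= 2.0
            new_ll = log_likelihood(b0 + step * d0, b1 + step * d1,
                                    scores, n_bases, n_errors)
        b0 += step * d0
        b1 += step * d1
        ll = new_ll
    return b0, b1


def calibrated_score(q, b0, b1):
    """Phred-style score implied by the fitted error probability at score q."""
    return error_prob_to_phred(sigmoid(b0 + b1 * q))


# Hypothetical tallies: bases called at each score and how many were wrong.
scores = [10, 20, 30, 40]
n_bases = [50000, 80000, 120000, 200000]
n_errors = [6100, 1300, 260, 70]

b0, b1 = fit_logistic(scores, n_bases, n_errors)
for q in scores:
    print(q, phred_to_error_prob(q), calibrated_score(q, b0, b1))
```

In this toy setting the fitted curve plays the role of the calibration: where the observed error rate at a given score exceeds the nominal 10^(-q/10), the calibrated score comes out lower than the reported one.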

Figures

**Figure 1**
An illustrative example of the problem. The bases with a ∼ sign represent their complementary bases.

**Figure 2**
Observed sequencing error rates versus predicted error rates by *Phred* quality score.

**Figure 3**
Observed sequencing error rates versus corrected error rates by a logistic model.

**Figure 4**
Observed sequencing error rates versus predicted error rates by CEQ Quality Score.

**Figure 5**
Observed sequencing error rates versus predicted error rates by a logistic model.

**Figure 6**
Observed score-wise conditional error rates. The true base is A.

**Figure 7**
Conditional error rates predicted from a logistic model. The true base is A.

Cited by

RIG: Recalibration and interrelation of genomic sequence data with the GATK.
McCormick RF, Truong SK, Mullet JE. G3 (Bethesda). 2015 Feb 13;5(4):655-65. doi: 10.1534/g3.115.017012. PMID: 25681258.
SEME: a fast mapper of Illumina sequencing reads with statistical evaluation.
Chen S, Wang A, Li LM. J Comput Biol. 2013 Nov;20(11):847-60. doi: 10.1089/cmb.2013.0111. PMID: 24195707.
Next generation sequence analysis and computational genomics using graphical pipeline workflows.
Torri F, Dinov ID, Zamanyan A, Hobel S, Genco A, Petrosyan P, Clark AP, Liu Z, Eggert P, Pierce J, Knowles JA, Ames J, Kesselman C, Toga AW, Potkin SG, Vawter MP, Macciardi F. Genes (Basel). 2012 Aug 30;3(3):545-75. doi: 10.3390/genes3030545. PMID: 23139896.
A framework for variation discovery and genotyping using next-generation DNA sequencing data.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. Nat Genet. 2011 May;43(5):491-8. doi: 10.1038/ng.806. Epub 2011 Apr 10. PMID: 21478889.
Assessing the necessity of confirmatory testing for exome-sequencing results in a clinical molecular diagnostic laboratory.
Strom SP, Lee H, Das K, Vilain E, Nelson SF, Grody WW, Deignan JL. Genet Med. 2014 Jul;16(7):510-5. doi: 10.1038/gim.2013.183. Epub 2014 Jan 9. PMID: 24406459.

References

1. Adams,M.D., Fields,C. and Venter,J.C. (eds) (1994) Automated DNA Sequencing and Analysis. Academic Press, London, San Diego.
2. Ewing,B. and Green,P. (1998) Base-calling of automated sequencer traces using phred. 2. Error probabilities. Genome Res., 8, 186–194.
3. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using phred. 1. Accuracy assessment. Genome Res., 8, 175–185.
4. Churchill,G.A. and Waterman,M.S. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89–98.
5. Parkhill,J., Wren,B.W., Mungall,K., Ketley,J.M., Churcher,C., Basham,D., Chillingworth,T., Davies,R.M., Feltwell,T., Holroyd,S., et al. (2000) The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature, 403, 665–668.
