Comparative Study
Nucleic Acids Res. 2004 Sep 30;32(17):5183-91.
doi: 10.1093/nar/gkh850. Print 2004.

Adjust quality scores from alignment and improve sequencing accuracy

Ming Li et al. Nucleic Acids Res.

Abstract

In shotgun sequencing, statistical reconstruction of a consensus from an alignment requires a model of measurement error. Churchill and Waterman proposed one such model, together with an expectation-maximization (EM) algorithm that estimates sequencing error rates for each assembly matrix. Ewing and Green defined Phred quality scores for base-calling from sequencing traces by training a model on a large amount of data. In practice, however, sample preparations and sequencing machines may operate under different conditions, so quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete, in the sense that they do not describe error patterns. We observe that each nucleotide base has its own specific error pattern, which varies across the range of quality values. We develop models of measurement error for shotgun sequencing by combining the two perspectives above. We propose a logistic model that takes quality scores as covariates. The model is trained by a procedure that combines an EM algorithm with model selection techniques. The training calibrates the quality values and leads to a more accurate construction of the consensus. In addition to Phred scores obtained from ABI sequencers, we apply the same technique to calibrate the quality values reported by Beckman sequencers.
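
As a rough illustration of what calibrating quality scores with a logistic model can look like, the sketch below converts a Phred score to its nominal error probability (p = 10^(-Q/10), the standard Phred definition) and fits a plain two-parameter logistic curve of error rate against quality score. It is a simplified sketch under stated assumptions, not the procedure of the paper: it assumes the errors at each score have already been counted against a known reference, whereas the abstract describes an EM algorithm with model selection and base-specific error patterns. The function names and the counts in the demo are hypothetical and made up for illustration.

```python
import math


def phred_to_error_prob(q):
    """Nominal error probability implied by a Phred score: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def fit_logistic(counts, iters=50):
    """Fit P(error | Q) = 1 / (1 + exp(-(b0 + b1 * Q))) by Newton's method.

    counts: iterable of (quality_score, n_bases, n_errors) triples.
    Assumption: errors were counted against a known reference sequence.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for q, n, k in counts:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * q)))
            w = n * p * (1.0 - p)
            # Gradient of the binomial log-likelihood.
            g0 += k - n * p
            g1 += (k - n * p) * q
            # Entries of the 2x2 information matrix (negative Hessian).
            h00 += w
            h01 += w * q
            h11 += w * q * q
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def calibrated_error_prob(q, b0, b1):
    """Error probability predicted by the fitted logistic curve."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * q)))


def calibrated_phred(q, b0, b1):
    """Re-express the calibrated error probability on the Phred scale."""
    return -10.0 * math.log10(calibrated_error_prob(q, b0, b1))


if __name__ == "__main__":
    # Hypothetical counts (score, bases, errors); NOT data from the paper.
    demo = [(10, 1000, 200), (20, 1000, 30), (30, 1000, 5), (40, 1000, 1)]
    b0, b1 = fit_logistic(demo)
    for q, _, _ in demo:
        print(q, phred_to_error_prob(q), calibrated_error_prob(q, b0, b1))
```

In this made-up demo the observed error rate at each score is higher than the nominal Phred value, so the fitted curve assigns a lower calibrated score than the one reported. A model in the spirit of the abstract would also distinguish which base was miscalled as which, something a single binary error indicator cannot express.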

Figures

Figure 1. An illustrative example of the problem. The bases with a ∼ sign represent their complementary bases.
Figure 2. Observed sequencing error rates versus predicted error rates by Phred quality score.
Figure 3. Observed sequencing error rates versus corrected error rates by a logistic model.
Figure 4. Observed sequencing error rates versus predicted error rates by CEQ Quality Score.
Figure 5. Observed sequencing error rates versus predicted error rates by a logistic model.
Figure 6. Observed score-wise conditional error rates. The true base is A.
Figure 7. Conditional error rates predicted from a logistic model. The true base is A.

References

    1. Adams M.D., Fields,C. and Venter,J.C. (eds) (1994) Automated DNA Sequencing and Analysis. Academic Press, London, San Diego.
    2. Ewing B. and Green,P. (1998) Base-calling of automated sequencer traces using phred. 2. Error probabilities. Genome Res., 8, 186–194.
    3. Ewing B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using phred. 1. Accuracy assessment. Genome Res., 8, 175–185.
    4. Churchill G.A. and Waterman,M.S. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89–98.
    5. Parkhill J., Wren,B.W., Mungall,K., Ketley,J.M., Churcher,C., Basham,D., Chillingworth,T., Davies,R.M., Feltwell,T., Holroyd,S., et al. (2000) The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature, 403, 665–668.
