Nat Biotechnol. 2012 Jul 1;30(7):693-700.
doi: 10.1038/nbt.2280.

Hybrid error correction and de novo assembly of single-molecule sequencing reads

Sergey Koren et al. Nat Biotechnol. 2012.

Abstract

Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.

Conflict of interest statement

Competing Interests: The authors declare that they have no competing financial interests.

Figures

Figure 1. The PBcR single-molecule read correction and assembly pipeline
a) The high error rate of single-pass PacBio RS sequences, indicated by black vertical bars, obscures overlaps. b) Given a high-accuracy sequence (~99% identical to the truth), the error between it and a PacBio RS sequence is half the error between two PacBio RS sequences. Therefore, accurate alignments can be computed. In this example, black bars in the short reads indicate “mapping errors” that are a combination of the sequencing error in both the long and short reads. In addition, a two-copy inexact repeat is present (outlined in gray), leading to “pileups” of reads at each copy. To avoid mapping reads to the wrong repeat copy, the pipeline selects a cutoff, C, and only the top C hits for each short read are used. The spurious mappings (in white) are discarded. c) The remaining alignments are used to generate a new consensus sequence, trimming and splitting long reads whenever there is a gap in the short-read tiling. Sequencing errors, indicated in black, may propagate to the PBcR read in rare cases where sequencing errors co-occur. d) After correction, overlaps between long PBcR sequences can be easily detected. e) The resulting assembly is able to span repeats that are unresolvable using only the short reads.
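The filtering and splitting steps in panels b and c can be illustrated with a minimal Python sketch. It is not the PBcR implementation. The tuple layout, the function names `keep_top_hits` and `covered_segments`, and the ranking of hits by a numeric alignment score are all assumptions made for illustration. The legend states only that the top C hits per short read are kept and that long reads are trimmed or split at gaps in the short-read tiling.

```python
from collections import defaultdict


def keep_top_hits(alignments, c):
    """Keep at most the c best-scoring alignments for each short read.

    alignments: iterable of (short_id, long_id, start, end, score) tuples,
    with half-open [start, end) coordinates on the long read.
    Ranking by `score` is an assumption of this sketch.
    """
    by_short = defaultdict(list)
    for aln in alignments:
        by_short[aln[0]].append(aln)
    kept = []
    for hits in by_short.values():
        hits.sort(key=lambda a: a[4], reverse=True)
        kept.extend(hits[:c])
    return kept


def covered_segments(alignments, long_id):
    """Return the maximal intervals of one long read tiled by short reads.

    Each interval would become one trimmed or split corrected read;
    the uncovered stretches between intervals are gaps in the tiling.
    """
    spans = sorted((a[2], a[3]) for a in alignments if a[1] == long_id)
    segments = []
    for start, end in spans:
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            segments[-1][1] = max(segments[-1][1], end)
        else:
            segments.append([start, end])
    return [tuple(s) for s in segments]


# Hypothetical toy data: short read s1 also hits a repeat copy on long read L2.
example = [
    ("s1", "L1", 0, 100, 95),
    ("s1", "L2", 40, 140, 80),   # lower-scoring hit, dropped when c = 1
    ("s2", "L1", 80, 180, 90),
    ("s3", "L1", 300, 400, 92),
]
kept = keep_top_hits(example, c=1)
l1_segments = covered_segments(kept, "L1")
l2_segments = covered_segments(kept, "L2")
```

With `c=1`, the lower-scoring hit of `s1` on `L2` is discarded, so `L2` has no tiling at all. `L1` is split into two segments because nothing covers positions 180 to 300.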
Figure 2. Long reads yield assembly improvements, even at low coverage
a) Effect of PacBio corrected read length (PBcR) on contig size is measured for the OLC assembler Celera Assembler and the de Bruijn assembler SOAPdenovo. Contig size, after breaking contigs at mis-joins, is measured using the standard N50 metric (N such that 50% of the genome is contained in contigs ≥ N). The baseline SOAPdenovo assembly (purple star) represents an assembly of 50X of real 76 bp Illumina paired-end (300 bp) reads from S. cerevisiae S228c. The effect of increasing PBcR read length was tested using 10X of simulated, error-free reads sampled from the S. cerevisiae genome. Read length was randomly sampled from actual length distributions of PBcR reads (from other genomes) to represent: the pre-release PacBio instrument (Q1, 2011), the first publicly available instrument (Q2, 2011), and the latest “C2” chemistry upgrade (Q1, 2012). b) Effect of PBcR coverage is measured for Escherichia coli, sequenced with a combination of PacBio and second-generation sequencing. The benefit of the PBcR sequences is visible even below 5X, which leads to a 50%–100% increase in N50. Maximum contig N50 is reached by ~10X, where adding 10X of PBcR increases the N50 by as much as 3.5-fold (250%). The larger gain versus the 454-only assembly is due to the longer PBcR sequences available for E. coli JM221. The variation in N50 is due to random subsampling of sequencing data.
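The N50 definition in panel a, and the relationship between a fold change and a percent increase in panel b, can be made concrete with a short Python sketch. The function names `n50` and `percent_increase` are hypothetical. Using the summed contig length in place of the genome size when no size is given is an assumption of this sketch, not something the legend specifies.

```python
def n50(contig_lengths, genome_size=None):
    """Largest N such that contigs of length >= N hold at least half the total.

    If genome_size is None, the sum of the contig lengths is used as the
    total (an assumption of this sketch).
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # contigs cover less than half of the stated genome size


def percent_increase(fold):
    """Convert a fold change to a percent increase, e.g. 3.5-fold to 250%."""
    return (fold - 1.0) * 100.0


# Hypothetical contig lengths, in arbitrary units.
toy_n50 = n50([100, 80, 60, 40, 20])
```

For the toy contigs the total is 300. The two largest contigs (100 and 80) together reach 180, which is at least half, so the N50 is 80. Likewise, a 3.5-fold N50 corresponds to a 250% increase, matching the figures quoted in panel b.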
Figure 3. Contig sizes for various combinations of sequencing technologies
Assemblies are for E. coli C227-11 (assemblies including Illumina and PacBio CCS) and E. coli JM221 (assemblies including 454). Both genomes have similar repeat content, PacBio read length, and coverage. Assemblies of only second-generation data are comparable and average N50 ≈ 100 Kbp. By comparison, adding 25X or 50X of PBcR to these data sets increases N50 as much as 5-fold and pushes the maximum contig size above 1 Mbp (for the PBcR/CCS combination).
Figure 4. Error correction of RNA-Seq data provides more accurate mapping of transcripts
A genome browser view of cDNA alignments using uncorrected (purple) and Illumina-corrected (green) PacBio reads generated from Zea mays B73 cDNAs. The splice-aware aligner, BLAT, was used for aligning PacBio reads to the genome. Long gaps in the alignment correspond to introns, which are present in the reference genome but absent from the cDNA-derived PacBio reads, and short gaps (only visible in the pre-corrected PacBio reads) are putative indel errors. The read coverage of the Illumina reads used for correction is also shown, along with the current reference gene annotation for this locus. The corrected PBcR sequences match the reference annotations end-to-end and include two isoforms. The colored bars in read coverage are an artifact of the aligner, indicating reads that have overhangs across exon junctions. Genome coordinates for chr6 are shown from the RefGen v2 genome assembly (http://maizesequence.org/).

