Sci Rep. 2015 Jun 3;5:10814.
doi: 10.1038/srep10814.

Illumina Synthetic Long Read Sequencing Allows Recovery of Missing Sequences even in the "Finished" C. elegans Genome

Runsheng Li et al. Sci Rep. 2015.

Abstract

Most next-generation sequencing platforms permit acquisition of high-throughput DNA sequences, but their relatively short read length limits their use in genome assembly or finishing. Illumina has recently released a technology called Synthetic Long-Read Sequencing that can produce reads of unusual length, i.e., predominantly around 10 Kb. However, a systematic assessment of their use in genome finishing and assembly is still lacking. We evaluate the promise and deficiencies of the long reads in these respects using the isogenic, gap-free C. elegans genome. First, the reads are highly accurate and capable of recovering most types of repetitive sequences. However, the presence of tandem repetitive sequences prevents pre-assembly of long reads in the relevant genomic regions. Second, the reads are able to reliably detect missing but not extra sequences in the C. elegans genome. Third, shorter reads are more capable of recovering repetitive sequences than longer ones. Fourth, at least 40 Kbp of missing genomic sequence is recovered in the C. elegans genome using the long reads. Finally, an N50 contig size of at least 86 Kbp can be achieved with 24× reads, but with substantial mis-assembly errors, highlighting the need for novel assembly algorithms for the long reads.
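For readers unfamiliar with the N50 contig size quoted in the abstract, the following is a minimal sketch of the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length. The function name `n50` and the example contig lengths are illustrative assumptions and do not come from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Uses the common definition: sort contigs from longest to shortest
    and return the length of the contig at which the cumulative sum
    first reaches at least half of the total assembled length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running against total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
example_n50 = n50(example_lengths)
print(example_n50)
```

Because N50 depends only on contig lengths, it says nothing about correctness: a mis-assembled contig that wrongly joins two regions raises N50, which is why the abstract reports the figure together with the mis-assembly caveat.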

Figures

Figure 1. Read quality and mappability across its lengths.
(A) Read quality score (mean ± SD). (B) SNV count. (C and D) Counts of deletions (red) and insertions (green), respectively. (E and F) Counts of long (≥ 9 bp) deletions (red) and insertions (green), respectively (see Methods). The horizontal axis is normalized to 100% of the read length. The window size of the plot is 1% of the read length.
Figure 2. Read count and mappability across its lengths.
(A, B and C) Count, yield and genomic coverage of long read across its lengths respectively with a window size of 100 bp. (D, E, F and G) Accumulative read yield and its coverage of genome, protein-coding genes and repetitive sequences across read lengths respectively. Blue and red lines represent accumulative yield or percentage respectively using the reads shorter or longer than a given read length.
Figure 3. Capacity of long reads in recovering various repetitive sequences.
(A) Effect of read coverage on the recovery rate of various types of repetitive sequence as color coded. (B and C) Relative fold enrichment and composition of various repetitive sequences within the gap regions (the genomic region not covered by any read) respectively. Same color coding scheme is used in (A, B and C).
Figure 4. Impact of the arrangement of repetitive sequences in the reference genome on its coverage by the long reads.
(A) An example showing the full recovery of a 10 Kbp genomic region containing various types of repetitive sequences indicated by arrows and differentially color coded by long reads. One satellite repeat within the region is highlighted for its tandem-arranged unit at the bottom. Note, certain reads cannot cover the repetitive region while the others can. (B and C) Examples of the uncovered region (gap) containing simple and RC type of repeat respectively. Note that both repetitive sequences consist of clusters formed by identical tandem units as indicated and the clusters are longer than 500 bp in size while all the repeat clusters in panel (A) are much smaller than this size. Tandem units are highlighted in the bottom. Same color coding scheme is used in (A, B and C). Chromosome coordinates are indicated on the top.
Figure 5. Identification and validation of the long indels (≥ 9 bp) by PCR and/or NGS data.
(A-D) Shown are the definitions of four possible ways of detecting an insertion (insertion position indicated with an inverted triangle). (A) A “Within read” insertion is defined as an unalignable region within a read against the N2 genome while both arms of the read can be successfully aligned with the same reference genome. (B) A “Self-mapped” insertion is defined as in (A) except that the second arm can only be aligned with the N2 genome using more permissive alignment parameters (see Methods). (C) A “Cross-mapped” insertion is defined as an insertion recovered from the overlapping unalignable ends of two independent reads located adjacently in the reference genome. Both unalignable ends can be partially aligned against the N2 genome at positions adjacent to each other. (D) A “Close breakpoint” insertion is defined as in (C) except that the recovered insertion contains a gap of unknown length because the two unalignable ends do not overlap. (E-H) Genomic distribution of each type of insertion identified in (A-D). Genomic regions are classified based on their coding potential and differentially colored on the top. The number of insertions located inside coding exons is indicated. (I) PCR validation results for 51 insertions identified above and 19 deletions identified within long reads. Shown are the percentages of validated indels that are detected using three different sources of DNA samples, as color coded. The numbers of validated events are indicated above each bar. N2_1 and N2_2 are two N2 C. elegans strains that have been independently maintained for over 20 years. CB4856 is the Hawaiian isolate of C. elegans. (J) Validation results for the three types of insertion identified in (A, B and C) by NGS data (see Methods). The number of validated events is indicated above each bar. NA, not applicable.
Figure 6. A revised gene model of rod-1 and its validation based on an insertion located within its coding exon.
(A) A revised gene model of rod-1 with an insertion that alters its last exon. Pink, existing exon; yellow, newly added exon. (B) Validation of the insertion by PCR with the three sources of DNA indicated in Figure 5(I). Also shown are the observed and expected sizes of the PCR products. (C) A magnified view of the affected exon, which is confirmed by RNA-seq data whose mapped reads are shown below the corrected gene model. (D) Partial multiple alignment of the C. elegans ROD-1 protein sequence and its orthologs in C. briggsae (CBR), C. remanei (CRE) and C. japonica (CJP). Discrepancies in the alignment are shaded in grey.
Figure 7. Effect of sequencing depth on de novo genome assembly using two genome assemblers.
(A) Plotting of contig N50 length against the sequencing depth. (B, C and D) Plotting of read (blue) and contig (red and green) coverage of genome, protein-coding genes and repetitive sequences against the sequencing depth respectively. Contigs were assembled using Celera (red) or MIRA (green) assembler with 3×, 6×, 12× and 24× reads.
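Several captions above refer to genomic coverage and to gap regions, defined in Figure 3 as the genomic region not covered by any read. As a rough illustration of that idea only, and not a reconstruction of the authors' pipeline, the sketch below finds uncovered intervals on one chromosome from a list of read alignment intervals. The function names, the 0-based half-open coordinate convention, and the example coordinates are all assumptions made for this example; intervals are assumed to lie within the chromosome.

```python
def uncovered_regions(chrom_length, alignments):
    """Return gaps as (start, end) intervals not covered by any alignment.

    `alignments` is an iterable of (start, end) pairs in 0-based,
    half-open coordinates, assumed to lie within [0, chrom_length].
    """
    gaps = []
    cursor = 0  # right-most position covered so far
    for start, end in sorted(alignments):
        if start > cursor:
            # Nothing aligned between the cursor and this read's start.
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        # Trailing uncovered region at the end of the chromosome.
        gaps.append((cursor, chrom_length))
    return gaps


def covered_fraction(chrom_length, alignments):
    """Fraction of the chromosome covered by at least one alignment."""
    gap_total = sum(end - start
                    for start, end in uncovered_regions(chrom_length, alignments))
    return 1 - gap_total / chrom_length


# Hypothetical alignments on a 100 bp toy chromosome.
toy_alignments = [(10, 30), (20, 50), (70, 90)]
toy_gaps = uncovered_regions(100, toy_alignments)
print(toy_gaps)
print(covered_fraction(100, toy_alignments))
```

Sorting the intervals first lets a single left-to-right sweep merge overlapping reads implicitly, so no separate interval-merging step is needed.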
