Illumina Synthetic Long Read Sequencing Allows Recovery of Missing Sequences even in the "Finished" C. elegans Genome

doi:10.1038/srep10814

Sci Rep. 2015 Jun 3;5:10814.

Runsheng Li¹, Chia-Ling Hsieh², Amanda Young², Zhihong Zhang², Xiaoliang Ren¹, Zhongying Zhao³

Affiliations

¹ Department of Biology, Hong Kong Baptist University, 224 Waterloo Road, Kowloon Tong, Hong Kong, China.
² Illumina Inc., 5200 Illumina Way, San Diego, 92122, USA.
³ 1] Department of Biology, Hong Kong Baptist University, 224 Waterloo Road, Kowloon Tong, Hong Kong, China [2] State Key Laboratory of Environmental and Biological Analysis, Hong Kong Baptist University, Hong Kong, China.

PMID: 26039588
PMCID: PMC4650653
DOI: 10.1038/srep10814

Runsheng Li et al. Sci Rep. 2015.

Abstract

Most next-generation sequencing platforms permit acquisition of high-throughput DNA sequences, but their relatively short read length limits their use in genome assembly or finishing. Illumina has recently released a technology called Synthetic Long-Read Sequencing that can produce reads of unusual length, i.e., predominantly around 10 Kbp. However, a systematic assessment of their use in genome finishing and assembly is still lacking. We evaluate the promise and deficiencies of the long reads in these respects using the isogenic, gap-free C. elegans genome. First, the reads are highly accurate and capable of recovering most types of repetitive sequences. However, the presence of tandem repetitive sequences prevents pre-assembly of long reads in the relevant genomic regions. Second, the reads reliably detect missing, but not extra, sequences in the C. elegans genome. Third, smaller reads are more capable of recovering repetitive sequences than larger ones. Fourth, at least 40 Kbp of missing genomic sequence is recovered in the C. elegans genome using the long reads. Finally, an N50 contig size of at least 86 Kbp can be achieved with 24× read coverage, but with substantial mis-assembly errors, highlighting the need for novel assembly algorithms for the long reads.
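
The N50 contig size quoted in the abstract is a standard assembly statistic: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that definition, not the authors' pipeline; the function name `n50` and the toy contig lengths are assumptions made for illustration only.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, not data from the paper).
toy_contigs = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_contigs)
```

Because N50 only summarizes contig lengths, a high value says nothing about whether the contigs are joined correctly, which is why the abstract reports the mis-assembly errors alongside it.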

Figures

**Figure 1. Read quality and mappability across its lengths.**
(A) Read quality score (mean ± SD). (B) SNV count. (C and D) Counts of deletions (red) and insertions (green), respectively. (E and F) Counts of long (≥ 9 bp) deletions (red) and insertions (green), respectively (see Methods). The horizontal axis is normalized to 100% of the read length. The window size of the plot is 1% of the read length.
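
The normalization described in this caption (positions rescaled to 100% of the read length, with 1% windows) can be sketched as follows. This is an illustrative reading of the caption, assuming 0-based positions and integer binning; the function names and the example events are hypothetical and do not come from the paper.

```python
def normalized_window(position, read_length, n_windows=100):
    """Map a 0-based position within a read to a window index.

    With n_windows=100, each window spans 1% of the read length, so reads
    of different lengths share the same horizontal axis.
    """
    if not 0 <= position < read_length:
        raise ValueError("position must lie within the read")
    return position * n_windows // read_length


def count_by_window(events, n_windows=100):
    """Count events (e.g. SNVs or indels) per normalized window.

    `events` is an iterable of (position, read_length) pairs.
    """
    counts = [0] * n_windows
    for position, read_length in events:
        counts[normalized_window(position, read_length, n_windows)] += 1
    return counts


# Hypothetical events: two at the midpoint of reads of different lengths,
# and one at the very start of a read.
toy_events = [(0, 10_000), (5_000, 10_000), (250, 500)]
toy_counts = count_by_window(toy_events)
```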

**Figure 2. Read count and mappability across its lengths.**
(A, B and C) Count, yield and genomic coverage of long reads across read lengths, respectively, with a window size of 100 bp. (D, E, F and G) Cumulative read yield and its coverage of the genome, protein-coding genes and repetitive sequences across read lengths, respectively. Blue and red lines represent the cumulative yield or percentage obtained using reads shorter or longer than a given read length, respectively.

**Figure 3. Capacity of long reads in recovering various repetitive sequences.**
(A) Effect of read coverage on the recovery rate of various types of repetitive sequence as color coded. (B and C) Relative fold enrichment and composition of various repetitive sequences within the gap regions (the genomic region not covered by any read) respectively. Same color coding scheme is used in (A, B and C).
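
The caption defines a gap region as a genomic region not covered by any read. A minimal sketch of how such regions could be derived from read alignments on one chromosome is shown below. It assumes half-open `(start, end)` intervals that lie within the chromosome; the function name and the toy coordinates are hypothetical, and the paper's own procedure is described in its Methods.

```python
def uncovered_regions(alignments, chrom_length):
    """Return half-open (start, end) intervals not covered by any alignment.

    `alignments` is an iterable of half-open (start, end) intervals on a
    single chromosome, assumed to lie within [0, chrom_length].
    """
    gaps = []
    covered_until = 0  # everything left of this coordinate is accounted for
    for start, end in sorted(alignments):
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    if covered_until < chrom_length:
        gaps.append((covered_until, chrom_length))
    return gaps


def covered_fraction(alignments, chrom_length):
    """Fraction of the chromosome covered by at least one alignment."""
    gap_total = sum(end - start for start, end in uncovered_regions(alignments, chrom_length))
    return 1 - gap_total / chrom_length


# Toy alignments on a hypothetical 500 bp chromosome.
toy_alignments = [(0, 100), (50, 200), (300, 400)]
toy_gaps = uncovered_regions(toy_alignments, 500)
```

Intersecting the resulting gaps with repeat annotations would then give the per-class composition that panels (B) and (C) summarize.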

**Figure 4. Impact of the arrangement of repetitive sequences in the reference genome on its coverage by the long reads.**
(A) An example showing the full recovery of a 10 Kbp genomic region containing various types of repetitive sequences indicated by arrows and differentially color coded by long reads. One satellite repeat within the region is highlighted for its tandem-arranged unit at the bottom. Note, certain reads cannot cover the repetitive region while the others can. (B and C) Examples of the uncovered region (gap) containing simple and RC type of repeat respectively. Note that both repetitive sequences consist of clusters formed by identical tandem units as indicated and the clusters are longer than 500 bp in size while all the repeat clusters in panel (A) are much smaller than this size. Tandem units are highlighted in the bottom. Same color coding scheme is used in (A, B and C). Chromosome coordinates are indicated on the top.

**Figure 5. Identification and validation of the long indels (≥ 9 bp) by PCR and/or NGS data.**
(A-D) Definitions of the four possible ways of detecting an insertion (insertion position indicated with an inverted triangle). (A) A “within read” insertion is defined as an unalignable region within a read against the N2 genome, while both arms of the read can be successfully aligned with the same reference genome. (B) A “self-mapped” insertion is defined as in (A), except that the second arm can only be aligned with the N2 genome using more permissive alignment parameters (see Methods). (C) A “cross-mapped” insertion is defined as an insertion recovered from the overlapping unalignable ends of two independent reads located adjacently in the reference genome. Both reads with unalignable ends can be partially aligned against the N2 genome at positions that are adjacent to each other. (D) A “close breakpoint” insertion is defined as in (C), except that the recovered insertion contains a gap of unknown length because the two unalignable ends do not overlap. (E-H) Genomic distribution of each type of insertion identified in (A-D). Genomic regions are classified based on their coding potential and differentially colored on the top. The number of insertions located inside coding exons is indicated. (I) PCR validation results for 51 insertions identified above and 19 deletions identified within long reads. Shown are the percentages of validated indels that are detected using three different sources of DNA samples, as color coded. The numbers of validated events are indicated above each bar. N2_1 and N2_2 are two N2 *C. elegans* strains that have been independently maintained for over 20 years. CB4856 is the Hawaii isolate of *C. elegans*. (J) Validation results for the three types of insertion identified in (A, B and C) by NGS data (see Methods). The number of validated events is indicated above each bar. NA, not applicable.

**Figure 6. A revised gene model of *rod-1* and its validation based on an insertion located within its coding exon.**
(A) A revised gene model of *rod-1* with an insertion that alters its last exon. Pink, existing exon; yellow, newly added exon. (B) Validation of the insertion by PCR with the three sources of DNA indicated in Figure 5(I). Also shown are the observed and expected sizes of the PCR products. (C) A magnified view of the affected exon, which is confirmed by RNA-seq data whose mapped reads are shown below the corrected gene model. (D) Partial multiple alignment of the *C. elegans* ROD-1 protein sequence and its orthologs in *C. briggsae* (CBR), *C. remanei* (CRE) and *C. japonica* (CJP). Discrepancies in the alignment are shaded in grey.

**Figure 7. Effect of sequencing depth on *de novo* genome assembly using two genome assemblers.**
(A) Plotting of contig N50 length against the sequencing depth. (B, C and D) Plotting of read (blue) and contig (red and green) coverage of genome, protein-coding genes and repetitive sequences against the sequencing depth respectively. Contigs were assembled using Celera (red) or MIRA (green) assembler with 3×, 6×, 12× and 24× reads.
