Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes

doi:10.1038/s41587-020-0503-6

Nat Biotechnol. 2020 Sep;38(9):1044-1053. Epub 2020 May 4.

Kishwar Shafin^#¹, Trevor Pesout^#¹, Ryan Lorig-Roach^#¹, Marina Haukness^#¹, Hugh E Olsen^#¹, Colleen Bosworth¹, Joel Armstrong¹, Kristof Tigyi¹ ², Nicholas Maurer¹, Sergey Koren³, Fritz J Sedlazeck⁴, Tobias Marschall⁵, Simon Mayes⁶, Vania Costa⁶, Justin M Zook⁷, Kelvin J Liu⁸, Duncan Kilburn⁸, Melanie Sorensen⁹, Katy M Munson⁹, Mitchell R Vollger⁹, Jean Monlong¹, Erik Garrison¹, Evan E Eichler² ⁹, Sofie Salama¹ ², David Haussler¹ ², Richard E Green¹, Mark Akeson¹, Adam Phillippy³, Karen H Miga¹, Paolo Carnevali¹⁰, Miten Jain¹¹, Benedict Paten¹²

Affiliations

¹ UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA.
² Howard Hughes Medical Institute, University of California, Santa Cruz, CA, USA.
³ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA.
⁴ Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA.
⁵ Max Planck Institute for Informatics, Saarbrücken, Germany.
⁶ Oxford Nanopore Technologies, Oxford, UK.
⁷ National Institute of Standards and Technology, Gaithersburg, MD, USA.
⁸ Circulomics Inc., Baltimore, MD, USA.
⁹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
¹⁰ Chan Zuckerberg Initiative, Redwood City, CA, USA. paolo@chanzuckerberg.com.
¹¹ UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA. miten@soe.ucsc.edu.
¹² UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA. bpaten@ucsc.edu.

^# Contributed equally.

PMID: 32686750
PMCID: PMC7483855
DOI: 10.1038/s41587-020-0503-6

Kishwar Shafin et al. Nat Biotechnol. 2020 Sep.

Abstract

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.
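The correspondence between 99.9% identity and QV = 30 quoted above follows from the standard Phred definition, QV = -10 log10(error rate). The sketch below only illustrates that arithmetic; the function names are hypothetical and it is not code from the Shasta toolkit.

```python
import math


def identity_to_qv(identity: float) -> float:
    """Convert a fractional identity (0 < identity < 1) to a Phred-scaled QV."""
    error_rate = 1.0 - identity
    if not 0.0 < error_rate < 1.0:
        raise ValueError("identity must be strictly between 0 and 1")
    return -10.0 * math.log10(error_rate)


def qv_to_identity(qv: float) -> float:
    """Convert a Phred-scaled QV back to a fractional identity."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # 99.9% identity is one error per 1,000 bases.
    print(f"QV for 99.9% identity: {identity_to_qv(0.999):.2f}")
    print(f"Identity for QV 30: {qv_to_identity(30.0):.4f}")
```

Each additional 10 QV points corresponds to a tenfold reduction in the error rate, so QV = 40 would mean 99.99% identity.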

Conflict of interest statement

M.A. is a paid consultant to ONT. V.C. and S.M. are employees of ONT.

Figures

**Fig. 1. Nanopore sequencing data.**
a, Throughput in gigabases from each of three flow cells for 11 samples, with total throughput at top. Each point is a flow cell. b, Read N50 values for each flow cell. Each point is a flow cell. c, Alignment identities against GRCh38. Medians in a–c are shown by dashed lines; the dotted line in c is the mode. Each line is a single sample comprising three flow cells. d, Genome coverage as a function of read length. Dashed lines indicate coverage at 10 and 100 kb. HG00733 is accentuated in dark blue as an example. Each line is a single sample comprising three flow cells. e, Alignment identity for standard and RLE reads. Data for HG00733 chromosome 1 flow cell 1 are shown (4.6 Gb raw sequence). Dashed lines denote quartiles.
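Panels b and d rest on two simple read-length statistics: the read N50 and the genome coverage contributed by reads at or above a length threshold. A minimal sketch of both follows, using the usual definition of N50 (the length L such that reads of length at least L contain at least half of all sequenced bases). The function names and the toy numbers are illustrative assumptions, not the authors' analysis code.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first read that brings the running sum to half the bases.
        if 2 * running >= total:
            return length
    raise ValueError("read_n50 requires at least one read")


def coverage_at_least(lengths, min_length, genome_size):
    """Genome coverage contributed only by reads of length >= min_length."""
    return sum(length for length in lengths if length >= min_length) / genome_size


if __name__ == "__main__":
    toy_reads = [100, 80, 60, 40, 20]  # toy lengths, not real data
    print(read_n50(toy_reads))
    print(coverage_at_least(toy_reads, min_length=60, genome_size=100))
```

Evaluating `coverage_at_least` over a range of thresholds gives a curve of the kind shown in panel d.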

**Fig. 2. Assembly metrics for Shasta, Wtdbg2, Flye and Canu before polishing.**
a, NGx plot showing contig length distribution. The intersection of each line with the dashed line is the NG50 for that assembly. b, NGAx plot showing the distribution of aligned contig lengths. Each horizontal line represents an aligned segment of the assembly unbroken by a disagreement or unmappable sequence with respect to GRCh38. The intersection of each line with the dashed line is the aligned NGA50 for that assembly. c, Assembly disagreement counts for regions outside centromeres, segmental duplications and, for HG002, known SVs. d, Total generated sequence length versus total aligned sequence length (against GRCh38). e, Balanced base-level error rates for assembled sequences. f, Average runtime and cost for assemblers (Canu not shown).
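NGx differs from Nx in that the target is a fraction of an assumed genome size rather than of the assembly's own total length. The sketch below shows the standard calculation. The function name and the choice to return 0 when the assembly is too short are assumptions made for illustration.

```python
def ngx(contig_lengths, genome_size, x=50):
    """Return NGx: the length of the contig at which the cumulative sum of
    contig lengths, sorted longest first, reaches x% of genome_size.

    Returns 0 if the assembly never reaches that fraction of the genome.
    """
    target = genome_size * x / 100
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


if __name__ == "__main__":
    toy_contigs = [50, 30, 20, 10]  # toy lengths, not real data
    for x in (25, 50, 75):
        print(x, ngx(toy_contigs, genome_size=200, x=x))
```

Applying the same function to aligned segment lengths instead of contig lengths gives the NGAx values of panel b.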

**Fig. 3. Shasta MHC assemblies compared with the reference human genome.**
Unpolished Shasta assembly for CHM13 and HG00733, including HG00733 trio-binned maternal and paternal assemblies. Shaded gray areas are regions in which coverage (as aligned to GRCh38) drops below 20. Horizontal black lines indicate contig breaks. Blue and green describe unique alignments (aligning forward and reverse, respectively) and orange describes multiple alignments.

**Fig. 4. Polishing assembled genomes.**
a, Balanced error rates for the four methods on HG00733 and CHM13. b, Row-normalized heatmaps describing the predicted run lengths (x axis) given true run lengths (y axis) for four steps of the pipeline on HG00733. Guppy v.2.3.3 was generated from 3.7 Gb of RLE sequence. Shasta, MarginPolish and HELEN were generated from whole assemblies aligned to their respective truth sequences. c, Error rates for MarginPolish and HELEN on four assemblies. d, Average runtime and cost.
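A row-normalized heatmap of the kind described in panel b can be built from (true run length, predicted run length) pairs by counting them in a matrix and dividing each row by its sum, so that every row reads as an estimate of P(predicted | true). The sketch below is a hypothetical illustration with toy pairs. It is not the plotting code used for the figure.

```python
def confusion_counts(pairs, max_length):
    """Count (true, predicted) run-length pairs in a square matrix.

    Rows index the true run length and columns the predicted run length,
    both 1-based in the data and 0-based in the matrix.
    """
    counts = [[0] * max_length for _ in range(max_length)]
    for true_len, predicted_len in pairs:
        if 1 <= true_len <= max_length and 1 <= predicted_len <= max_length:
            counts[true_len - 1][predicted_len - 1] += 1
    return counts


def row_normalize(counts):
    """Divide each row by its sum; all-zero rows are left as zeros."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


if __name__ == "__main__":
    toy_pairs = [(1, 1), (1, 1), (1, 2), (2, 2)]  # toy data
    for row in row_normalize(confusion_counts(toy_pairs, max_length=2)):
        print(row)
```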

**Fig. 5. HiRise scaffolding for 11 genomes.**
a, NGx plots for each of the 11 genomes, before (dashed) and after (solid) scaffolding with HiC sequencing reads; GRCh38 minus alternate sequences is shown for comparison. b, Dot plot showing alignments between the scaffolded HG00733 Shasta assembly and GRCh38 chromosome scaffolds. Blue indicates forward aligning segments, green indicates reverse, with both indicating unique alignments.

**Extended Data Fig. 1. Read Markers.**
Markers aligned to a run length encoded read.
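Run-length encoding (RLE) collapses each homopolymer run into a single base plus a repeat count. The sketch below encodes and decodes a sequence and then locates markers in the encoded sequence. It assumes that markers are occurrences of k-mers drawn from a predetermined set. The marker set, the value of k and all function names here are hypothetical choices for illustration and are not taken from the Shasta source.

```python
from itertools import groupby


def rle_encode(sequence):
    """Return (collapsed_bases, run_lengths) for a nucleotide string."""
    bases, runs = [], []
    for base, group in groupby(sequence):
        bases.append(base)
        runs.append(sum(1 for _ in group))
    return "".join(bases), runs


def rle_decode(bases, runs):
    """Invert rle_encode."""
    return "".join(base * count for base, count in zip(bases, runs))


def find_markers(rle_bases, marker_kmers, k):
    """List (position, k-mer) for every k-mer of rle_bases in marker_kmers."""
    return [
        (i, rle_bases[i:i + k])
        for i in range(len(rle_bases) - k + 1)
        if rle_bases[i:i + k] in marker_kmers
    ]


if __name__ == "__main__":
    bases, runs = rle_encode("GATTTACCA")
    print(bases, runs)
    print(find_markers(bases, {"ATA", "ACA"}, k=3))  # hypothetical marker set
```

Because adjacent bases in an RLE sequence always differ, homopolymer length errors do not change the collapsed base string, only the run-length vector.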

**Extended Data Fig. 2. Marker Alignment.**
A marker alignment represented as a dot-plot. Elements that are identical between the two sequences are displayed in green or red; those in green are part of the optimal alignment computed by the Shasta assembler. Because of the much larger alphabet, matrix elements that are identical between the sequences but are not part of the optimal alignment are infrequent. Each alignment matrix element here corresponds on average to a 13 × 13 block in the alignment matrix in raw base sequence.

**Extended Data Fig. 3. Read Graph.**
An example of a portion of the read graph (A) as displayed by the Shasta http server, and (B) showing obviously incorrect connections.

**Extended Data Fig. 4. Marker Graph.**
An illustration of marker graph construction for two sequences.

**Extended Data Fig. 5. Assembly Graph.**
(A) A marker graph with linear sequence of edges colored. (B) The corresponding assembly graph. Colors were chosen to indicate the correspondence to marker graph edges.

**Extended Data Fig. 6. Bubbles.**
(A) A simple bubble. (B) A superbubble.

**Extended Data Fig. 7. POA Example.**
(A) An example POA, assuming approximately 30x read coverage. The backbone is shown in red. Each non-source/sink node has a vector of weights, one for each possible base. Deletion edges are shown in teal; each also has a weight. Finally, insertion nodes are shown in brown; each also has a weight. (B) A pruned POA, removing deletions and insertions that have less than a threshold weight and highlighting plausible bases in bold. There are six plausible nucleotide sequences represented by paths through the POA and selections of plausible base labels: G;AT;A;T;A;C:A, G;AT;A;T;A;C:G, G;A;T;A;C:A, G;A;T;A;C:G, G;A;C:A, G;A;C:G. To avoid the combinatorial explosion of such enumeration we identify subgraphs (C) and locally enumerate the possible subsequences in these regions independently (dotted rectangles identify subgraphs selected). In each subgraph there is a source and sink node that does not overlap any proposed edit.

**Extended Data Fig. 8. RLE Inference Distributions.**
Visual representation of run length inference. This diagram shows how a consensus run length is inferred for a set of aligned lengths (X) that pertain to a single position. The lengths are factored and then iterated over, and log likelihood is calculated for every possible true length up to a predefined limit. Note that in this example, the most frequent observation (4bp) is not the most likely true length (5bp) given the model.
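The inference described in this caption can be sketched as a maximum-likelihood search: count the observed lengths, then score every candidate true length up to a limit by summing log P(observed | true) over the observations. The probability table below is an invented toy model chosen so that the most frequent observation is not the most likely true length. It is an assumption for illustration, not the trained model used by the toolkit, and the probability floor is likewise an arbitrary choice.

```python
import math
from collections import Counter

# Hypothetical model: TOY_MODEL[true_length][observed_length] = probability.
# Longer runs are assumed to be under-called more often than over-called.
TOY_MODEL = {
    1: {1: 0.90, 2: 0.10},
    2: {1: 0.10, 2: 0.80, 3: 0.10},
    3: {2: 0.15, 3: 0.70, 4: 0.15},
    4: {3: 0.20, 4: 0.70, 5: 0.10},
    5: {3: 0.05, 4: 0.45, 5: 0.40, 6: 0.10},
    6: {4: 0.10, 5: 0.40, 6: 0.50},
}
FLOOR = 1e-6  # assumed floor so unseen observations do not give log(0)


def infer_run_length(observed, model, max_length):
    """Return (best_true_length, log_likelihood) for the observed lengths."""
    counts = Counter(observed)  # factor the observations into (length, count)
    best_length, best_ll = None, -math.inf
    for true_length in range(1, max_length + 1):
        row = model.get(true_length, {})
        ll = sum(
            n * math.log(max(row.get(obs, 0.0), FLOOR))
            for obs, n in counts.items()
        )
        if ll > best_ll:
            best_length, best_ll = true_length, ll
    return best_length, best_ll


if __name__ == "__main__":
    observations = [4, 4, 4, 5, 5]
    print("most frequent:", Counter(observations).most_common(1)[0][0])
    print("most likely true length:", infer_run_length(observations, TOY_MODEL, 6)[0])
```

With these toy numbers the mode of the observations is 4 while the maximum-likelihood true length is 5, mirroring the situation the caption describes.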

**Extended Data Fig. 9. MarginPolish HELEN Image Generation.**
A graphical representation of images from two labeled regions selected to demonstrate: the encoding of a single POA node into two run-length blocks (i), a true deletion (i), and a true insert (ii). (a) shows the alignment in raw and run-length space, (b) shows the features as they are exported to HELEN. The y-axis shows truth labels for nucleotides and run-lengths, the x-axis describes features in the images, and colors show associated weights.

**Extended Data Fig. 10. HELEN Model.**
The sequence-to-sequence model implemented in HELEN.

Cited by

TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing.
Qi J, Li Z, Zhang YZ, Li G, Gao X, Han R. Genome Biol. 2024 Nov 4;25(1):285. doi: 10.1186/s13059-024-03423-3. PMID: 39497190
Optical genome mapping of structural variants in Parkinson's disease-related induced pluripotent stem cells.
Trinh J, Schaake S, Gabbert C, Lüth T, Cowley SA, Fienemann A, Ullrich KK, Klein C, Seibler P. BMC Genomics. 2024 Oct 19;25(1):980. doi: 10.1186/s12864-024-10902-1. PMID: 39425080 Free PMC article.
SpLitteR: diploid genome assembly using TELL-Seq linked-reads and assembly graphs.
Tolstoganov I, Chen Z, Pevzner P, Korobeynikov A. PeerJ. 2024 Sep 27;12:e18050. doi: 10.7717/peerj.18050. eCollection 2024. PMID: 39351368 Free PMC article.
Performance of somatic structural variant calling in lung cancer using Oxford Nanopore sequencing technology.
Liu L, Zhang J, Wood S, Newell F, Leonard C, Koufariotis LT, Nones K, Dalley AJ, Chittoory H, Bashirzadeh F, Son JH, Steinfort D, Williamson JP, Bint M, Pahoff C, Nguyen PT, Twaddell S, Arnold D, Grainge C, Simpson PT, Fielding D, Waddell N, Pearson JV. BMC Genomics. 2024 Sep 30;25(1):898. doi: 10.1186/s12864-024-10792-3. PMID: 39350042 Free PMC article.
Highly accurate assembly polishing with DeepPolisher.
Mastoras M, Asri M, Brambrink L, Hebbar P, Kolesnikov A, Cook DE, Nattestad M, Lucas J, Won TS, Chang PC, Carroll A, Paten B, Shafin K. bioRxiv [Preprint]. 2024 Sep 19:2024.09.17.613505. doi: 10.1101/2024.09.17.613505. PMID: 39345401 Free PMC article.
