[Preprint]. 2023 Feb 22:2023.02.21.529152.
doi: 10.1101/2023.02.21.529152.

Phased nanopore assembly with Shasta and modular graph phasing with GFAse

Ryan Lorig-Roach et al. bioRxiv. 2023.

Update in

  • Phased nanopore assembly with Shasta and modular graph phasing with GFAse.
    Lorig-Roach R, Meredith M, Monlong J, Jain M, Olsen HE, McNulty B, Porubsky D, Montague TG, Lucas JK, Condon C, Eizenga JM, Juul S, McKenzie SK, Simmonds SE, Park J, Asri M, Koren S, Eichler EE, Axel R, Martin B, Carnevali P, Miga KH, Paten B. Genome Res. 2024 Apr 25;34(3):454-468. doi: 10.1101/gr.278268.123. PMID: 38627094.

Abstract

As a step towards simplifying and reducing the cost of haplotype-resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler, and a modular tool called GFAse for extending phasing to the chromosome scale. We test these methods using new variants of Oxford Nanopore Technologies' (ONT) PromethION sequencing, including those using proximity ligation, and show that newer, higher-accuracy ONT reads substantially improve assembly quality.

Figures

Figure 1:
Summary of the de novo phasing pipeline using Shasta and GFAse. Shasta performs de novo assembly and phases to the extent that is supported by informative variants in the nanopore reads. GFAse then takes a partially phased assembly GFA and extends phasing using orthogonal phasing information. GFAse can perform phasing based on any alignable data type (HiC, Pore-C, etc.). For Shasta graphs, GFAse can also use parental sequencing. Pathways with bold arrows and blue fill are the previously undescribed methods.
Figure 2:
a-b) Identity and length metrics for nanopore read sets used in the HG002 evaluation. c-f) Pore-C and Hi-C metrics for contacts and signal ratio, measured on a per-library basis. “Alignment length” and “alignments per read” are proxies for subread statistics. Only mappings that are usable for phasing are shown, i.e., those with mapping quality (mapQ) > 0 in a diploid reference. Signal ratio is computed using a high-quality trio-phased assembly to indicate the number of consistent and inconsistent contacts (see methods).
Figure 3:
Phasing metrics for HG002 assemblies, as evaluated using the GIAB v4.2.1 benchmark VCF, phased with StrandSeq using WhatsHap (see methods). All Shasta assemblies are unpolished. Assemblies not phased with GFAse are shaded gray. Each dot represents a per-chromosome error rate generated by WhatsHap compare. Native Hifiasm HiC uses 30x coverage. Each pair of HiC is ~17x. PoreC flowcells have ~30x yield.
Figure 4:
Structural variant, base-level, and gene-level accuracy metrics for HG002 assemblies. a) Base accuracy evaluated using yak with Illumina NovaSeq. b) Gene completeness measured by asmgene using human transcript sequences. “Full single copy” genes include only unfragmented, non-duplicated genes, matching transcripts by 99% or greater coverage and stratified by >97% (translucent) or >99% identity (opaque). Multicopy genes are similarly stratified. c) SVs evaluated using the GIAB Tier1 benchmark VCF with Truvari. d-e) NGx plots for Shasta haplotypes, before and after unzipping bubble chains with GFAse. For comparison, the phased portion of the un-chained Verkko ‘production’ assembly is shown. The vertical line indicates the NG50 for each assembly.
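The NGx statistic plotted in panels d-e can be computed from a list of contig lengths and an assumed genome size. The following is a minimal sketch using the standard definition of NGx; the function name `ngx` and the toy lengths are illustrative and are not taken from Shasta or GFAse.

```python
def ngx(contig_lengths, genome_size, x=50):
    """Return NGx: the length of the contig at which the cumulative
    length of contigs, taken longest first, first reaches x% of
    genome_size. Returns 0 if the assembly never reaches that fraction.

    genome_size is an assumed (estimated) genome length, which is what
    distinguishes NGx from Nx (Nx uses the total assembly length).
    """
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= genome_size * x:
            return length
    return 0


# Toy example: five contigs against an assumed genome size of 200.
lengths = [50, 40, 30, 20, 10]
ng50 = ngx(lengths, genome_size=200, x=50)
```

Sweeping `x` from 0 to 100 and plotting `ngx(lengths, genome_size, x)` against `x` gives an NGx curve; NG50 is the value at `x=50`.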
Figure 4:
The two types of Shasta output graphs, visualized as a 2D layout in Bandage at two different scales. a) A subregion of the “Assembly-Detailed.gfa”, showing near-variant-scale nodes in a bubble chain, with their phasing indicated by colors produced by Shasta. b) A subregion of the “Assembly-Phased.gfa” showing a phased portion of chr11 from HG002 which terminates at two tangles, presumably caused by telomeric and centromeric sequences.
Figure 5:
Diagram of sampling method for optimizing proximity linkages in an assembly graph. Edge weights in the contact graph are represented by teal curves. For each inner iteration, a greedily converged phase state is used to update a distribution of orientations among bubbles. Bubbles with the strongest signal at the end of sampling are merged for successive iterations. By the end of each round r of merging, the largest possible bubble set is 2^r in size.
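The greedy convergence and sampling steps in this caption can be illustrated with a toy model. This is a sketch under stated assumptions, not GFAse's implementation: each bubble gets an orientation of +1 or -1, each contact-graph edge carries a hypothetical signed integer weight (positive when contacts favor the same orientation, negative when they favor opposite orientations), and the function names are invented for the example. The merging step is omitted; the sketch only accumulates how consistently each pair of bubbles agrees across samples.

```python
import random
from itertools import combinations


def greedy_converge(weights, n, rng):
    """Start from a random phase state and flip single bubbles while any
    flip increases the score sum(w_ij * s_i * s_j)."""
    state = [rng.choice((-1, 1)) for _ in range(n)]
    improved = True
    while improved:
        improved = False
        for i in range(n):
            # Net pull on bubble i from its neighbors in the contact graph.
            field = sum(w * state[b if a == i else a]
                        for (a, b), w in weights.items() if i in (a, b))
            # Flipping i changes the score by -2 * state[i] * field.
            if state[i] * field < 0:
                state[i] = -state[i]
                improved = True
    return state


def sample_agreement(weights, n, samples=50, seed=0):
    """Accumulate, over many greedy runs, how often each bubble pair ends
    up in the same (+1) or opposite (-1) orientation."""
    rng = random.Random(seed)
    agree = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(samples):
        state = greedy_converge(weights, n, rng)
        for i, j in agree:
            agree[(i, j)] += state[i] * state[j]
    return agree


# Hypothetical contact weights for three bubbles (assumed values).
weights = {(0, 1): 5, (1, 2): -4, (0, 2): -3}
agreement = sample_agreement(weights, n=3)
# The pair with the largest absolute agreement has the strongest signal
# and would be the first candidate for merging.
strongest = max(agreement, key=lambda pair: abs(agreement[pair]))
```

In a scheme where at most two bubble sets are merged into one per round, the largest set can at most double each round, which is consistent with the 2^r bound stated in the caption.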
