Genome Biol. 2024 Oct 10;25(1):265.
doi: 10.1186/s13059-024-03409-1.

Graphasing: phasing diploid genome assembly graphs with single-cell strand sequencing

Mir Henglin et al. Genome Biol. 2024.

Abstract

Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.

Keywords: Assembly graph; De novo assembly; Haplotype; Hi-C; Hifiasm; Phasing; Strand-seq; Trio; Verkko.

Conflict of interest statement

E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. S.K. has received travel funding for speaking at events hosted by ONT.

Figures

Fig. 1
Pipeline overview. (A) Reads from Strand-seq libraries are aligned to graph unitigs (gray circles) using “bwa mem” and “bwa fastmap.” “bwa fastmap” alignments are used to identify haplotype-informative reads, which are used for step D. (B) Unitigs (gray points) are clustered using a cosine-similarity-based agglomerative clustering strategy. (C) Unitigs (solid outline) and their flipped inverses (dotted outline) are used to correct misoriented unitigs. Unitigs in opposite orientation form a bisected structure that is captured with cosine-similarity clustering. (D) The vector capturing the haplotype-informative libraries (left) is used to pool Strand-seq libraries and produce a haplotype shading of the assembly (right, middle). Rukki is run on the shaded graph to produce haplotype calls and scaffolds (right, bottom). Tangles and gaps are bridged, as indicated by the dotted line in the red haplotype
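The orientation idea in panel C can be illustrated with a small sketch. It assumes, purely for illustration, that each unitig is summarized by a vector with one strand-state value per Strand-seq library (positive for one strand, negative for the other); the unitig names, the vectors, and the helper `orient_against` are hypothetical and are not taken from the Graphasing code. Flipping a unitig negates its vector, so a unitig and its flipped inverse have cosine similarity -1, which is why oppositely oriented unitigs separate into two groups.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no signal: treat as uninformative
    return dot / (norm_u * norm_v)


def orient_against(reference, vectors):
    """Label each unitig +1 (same orientation as the reference) or -1 (flipped).

    Hypothetical helper: a stand-in for the clustering step, not the
    agglomerative procedure used in the paper.
    """
    return {
        name: 1 if cosine_similarity(reference, vec) >= 0 else -1
        for name, vec in vectors.items()
    }


# Toy strand-state vectors (assumed values), one entry per library.
unitigs = {
    "utig_a": [1.0, -1.0, 0.0, 1.0],
    "utig_b": [0.9, -0.8, 0.1, 1.0],
    "utig_c": [-1.0, 0.9, 0.0, -0.9],  # looks like a flipped copy of utig_a
}

orientations = orient_against(unitigs["utig_a"], unitigs)
flipped = [-x for x in unitigs["utig_a"]]
similarity_to_inverse = cosine_similarity(unitigs["utig_a"], flipped)
```

Here `utig_c` is labeled -1 because its signal is anticorrelated with the reference unitig, while `utig_b` keeps the reference orientation.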
Fig. 2
Nx curves. The dotted black line in each facet corresponds to the reference standards, which are the Q100 v1.0 assembly for NA24385 and the CHM13 v2.0 assembly for HG00733
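For readers unfamiliar with Nx curves, the following is a minimal sketch of the usual definition: Nx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the total. The function names and the toy contig lengths are assumptions made for illustration, and the optional `total` argument (for example, a reference genome size) is likewise an assumption about how one might normalize, not a description of the paper's plots.

```python
def nx(lengths, x, total=None):
    """Return Nx for a list of contig lengths, with x given in percent."""
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    if total is None:
        total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return 0  # contigs do not reach x% of the requested total


def nx_curve(lengths, total=None):
    """Return (x, Nx) pairs for x = 1..100, the points of an Nx curve."""
    return [(x, nx(lengths, x, total)) for x in range(1, 101)]


# Toy contig lengths (assumed values).
contigs = [50, 30, 10, 10]
n50 = nx(contigs, 50)
n80 = nx(contigs, 80)
n90 = nx(contigs, 90)
curve = nx_curve(contigs)
```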
Fig. 3
Haplotype error rate scatter. The X-coordinate of each point is the estimated switch error rate for a haplotype, and the Y-coordinate is the estimated Hamming error rate. Points are colored by phasing data
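The two axes measure different things, which a toy example makes concrete. The sketch below uses the common textbook definitions over a sequence of heterozygous sites with known true phase; it is an assumption that these definitions match the estimators used for the figure, and the site vectors are invented. A single switch in the middle of a haplotype gives a low switch error rate but a high Hamming error rate.

```python
def _check(truth, predicted):
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")


def switch_error_rate(truth, predicted):
    """Fraction of adjacent site pairs where agreement with the truth changes."""
    _check(truth, predicted)
    agree = [t == p for t, p in zip(truth, predicted)]
    if len(agree) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


def hamming_error_rate(truth, predicted):
    """Fraction of sites assigned to the wrong haplotype.

    The minimum over the two possible labelings is taken because the
    haplotype labels themselves are arbitrary.
    """
    _check(truth, predicted)
    n = len(truth)
    if n == 0:
        return 0.0
    mismatches = sum(1 for t, p in zip(truth, predicted) if t != p)
    return min(mismatches, n - mismatches) / n


# Toy phase vectors (assumed values): the prediction switches haplotype
# once, after the fourth site, and stays switched.
truth = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 0, 1, 1, 1, 0, 1, 1]

ser = switch_error_rate(truth, predicted)
her = hamming_error_rate(truth, predicted)
```

In this example there is one switch among seven adjacent pairs, yet half of the sites end up on the wrong haplotype.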
Fig. 4
Hap-mer blob plots. For the NA24385 assemblies, only contigs aligning to autosomal chromosomes are plotted. The X- and Y-coordinates of each point are the numbers of hap-mers from each parent occurring on the contig, and the size of each point corresponds to contig length. Green points correspond to the Strand-seq and Hi-C HaPUs, while orange points correspond to the trio maternal haplotype, and blue points to the trio paternal haplotype. The gray line is the line of equality, where the number of hap-mers from either parent occurring on a contig is equal. The greater the phasing accuracy, the more closely the blobs align with the axes
Fig. 5
Assembly QV. Points are colored by phasing method
Fig. 6
paftools.js misjoin statistics. Three event categories are plotted: gaps, interchromosomal misjoins, and inversions. The blue portion of each bar shows the fraction of events of that type occurring entirely on acrocentric chromosomes (chromosomes 13, 14, 15, 21, 22)
Fig. 7
The fraction of missing multi-copy genes (MMC) and missing single-copy genes (MSC) calculated from paftools.js asmgene statistics
Fig. 8
Disagreement between titrated and reference assemblies for NA24385. For each titrated Strand-seq library set, the haplotypes called by Rukki were compared to the reference haplotypes from the HG002 v1.0 assembly. Each color corresponds to a different fraction of high-quality libraries sampled for the titrated library set, and shape corresponds to the inclusion or exclusion of unitigs aligning to the acrocentric chromosomes. Disagreement is quantified as the percent of the total length of the assembly for which haplotype calls disagree with the reference calls, calculated using unitigs longer than 50 kbp
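The disagreement metric described in this caption amounts to a length-weighted percentage. A minimal sketch follows, assuming each unitig is represented as a (length, called haplotype, reference haplotype) tuple; that record layout, the function name, and the toy values are hypothetical, and the sketch assumes the called and reference haplotype labels are already on a common naming.

```python
MIN_UNITIG_LENGTH = 50_000  # the caption's cutoff: unitigs longer than 50 kbp


def percent_disagreement(unitigs, min_length=MIN_UNITIG_LENGTH):
    """Percent of total considered length whose haplotype call differs from the reference.

    `unitigs` is an iterable of (length_bp, called_hap, reference_hap) tuples
    (a hypothetical layout chosen for this sketch).
    """
    considered = [u for u in unitigs if u[0] > min_length]
    total = sum(length for length, _, _ in considered)
    if total == 0:
        return 0.0
    disagreeing = sum(
        length for length, called, reference in considered if called != reference
    )
    return 100 * disagreeing / total


# Toy unitigs (assumed values).
example = [
    (100_000, "hap1", "hap1"),
    (200_000, "hap2", "hap1"),  # disagrees with the reference call
    (40_000, "hap1", "hap2"),   # ignored: not longer than 50 kbp
    (700_000, "hap2", "hap2"),
]

disagreement = percent_disagreement(example)
```

Weighting by length means that a miscalled long unitig counts for more than several miscalled short ones.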

Cited by

  • Complex genetic variation in nearly complete human genomes.
    Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Scholz S, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. bioRxiv [Preprint]. 2024 Sep 25:2024.09.24.614721. doi: 10.1101/2024.09.24.614721. PMID: 39372794. Free PMC article. Preprint.
  • A familial, telomere-to-telomere reference for human de novo mutation and recombination from a four-generation pedigree.
    Porubsky D, Dashnow H, Sasani TA, Logsdon GA, Hallast P, Noyes MD, Kronenberg ZN, Mokveld T, Koundinya N, Nolan C, Steely CJ, Guarracino A, Dolzhenko E, Harvey WT, Rowell WJ, Grigorev K, Nicholas TJ, Oshima KK, Lin J, Ebert P, Watkins WS, Leung TY, Hanlon VCT, McGee S, Pedersen BS, Goldberg ME, Happ HC, Jeong H, Munson KM, Hoekzema K, Chan DD, Wang Y, Knuth J, Garcia GH, Fanslow C, Lambert C, Lee C, Smith JD, Levy S, Mason CE, Garrison E, Lansdorp PM, Neklason DW, Jorde LB, Quinlan AR, Eberle MA, Eichler EE. bioRxiv [Preprint]. 2024 Aug 5:2024.08.05.606142. doi: 10.1101/2024.08.05.606142. PMID: 39149261. Free PMC article. Preprint.
