2022 May 20;50(9):e53.
doi: 10.1093/nar/gkac039.

Dysgu: efficient structural variant calling using short or long reads

Kez Cleal et al. Nucleic Acids Res.

Abstract

Structural variation (SV) plays a fundamental role in genome evolution and can underlie inherited or acquired diseases such as cancer. Long-read sequencing technologies have improved the characterization of structural variants (SVs), although paired-end sequencing offers better scalability. Here, we present dysgu, which calls SVs or indels using paired-end or long reads. Dysgu detects signals from alignment gaps and from discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning. Additional SVs are identified by remapping anomalous sequences. Dysgu outperforms existing state-of-the-art tools using paired-end or long reads, offering high sensitivity and precision whilst being among the fastest tools to run. We find that combining low-coverage paired-end and long reads is competitive in terms of performance with long reads at higher coverage values.

Figures

Figure 1.
Overview of dysgu pipeline. Dysgu analyses long-read (top pane, A) or paired-end data (bottom pane, A), assessing alignment gaps, discordant and split-reads, and soft-clipped alignments. An example deletion is shown for the different read types, with numbers representing SV signatures assessed by dysgu. Candidate SV signatures are clustered by their proximity on the reference genome (span-position-distance based) and by using a minimizer sketch of the soft-clipped portion of reads (B). SV evidence is clustered on a graph G, using black edges to link matching signatures from B, and grey edges to link alignments from the same read (C). The top and bottom panes in (C) and (D) depict example graphs from analysing LR or PE data, respectively. The graph G is simplified to a quotient graph Q, where nodes represent partitions of G that are linked by black edges, and edges in Q are induced by grey edges between partitions. Connected components in Q are then further analysed. A fast consensus sequence algorithm is employed to generate break-site sequences (E), which may be re-aligned to the reference genome to discover additional small SVs. Finally, up to 42 features are calculated for each candidate SV, depending on read type, and scored using a gradient boosting machine classifier (F). Output VCF files may then optionally be merged/unified with other samples or read types.
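The reduction from G to the quotient graph Q described in panels C and D can be illustrated with a minimal Python sketch. This is not dysgu's implementation: the function name `quotient_components`, the node labels, and the use of a union-find structure are all assumptions made for illustration. Nodes joined by black edges are merged into one partition, grey edges between different partitions become edges of Q, and the connected components of Q are returned.

```python
from collections import defaultdict


def quotient_components(nodes, black_edges, grey_edges):
    """Collapse G into Q (hypothetical helper, not part of dysgu).

    Returns (partition_of, q_edges, components) where partition_of maps
    each node of G to a representative node of its partition.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Black edges link matching signatures: merge their endpoints.
    for u, v in black_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    partition_of = {n: find(n) for n in nodes}

    # Grey edges link alignments from the same read; those that cross
    # two partitions induce an (undirected) edge in Q.
    q_edges = set()
    for u, v in grey_edges:
        pu, pv = partition_of[u], partition_of[v]
        if pu != pv:
            q_edges.add(frozenset((pu, pv)))

    adjacency = defaultdict(set)
    for edge in q_edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Depth-first search over Q to collect connected components.
    components, seen = [], set()
    for start in sorted(set(partition_of.values())):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(adjacency[current] - seen)
        components.append(component)
    return partition_of, q_edges, components


# Toy input: reads "a" and "b" each have two alignments (suffix 1 and 2);
# "c1" is an unrelated alignment with no supporting partner.
nodes = ["a1", "a2", "b1", "b2", "c1"]
black = [("a1", "b1"), ("a2", "b2")]
grey = [("a1", "a2"), ("b1", "b2")]
partition_of, q_edges, components = quotient_components(nodes, black, grey)
```

In this toy input the two reads support both sides of the same event, so Q contains a single edge between two partitions, and the unrelated alignment forms its own component.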
Figure 2.
Performance of dysgu using 20× PE reads. Dysgu was compared to SV callers manta, delly and lumpy, and indel callers strelka and gatk, using the HG002 benchmark. Precision-recall curves are shown for all genomic regions (A, B), as well as high-confidence Tier 1 regions (C, D). The secondary y-axis indicates duplicate true-positives (TP) as a fraction of true-positive calls. Intersections and aggregates of intersections of SV calls for the all-regions benchmark are displayed using an upset plot (E, F). To investigate combinations of SV callers, the union of true-positives between callers (labelled concordant) was plotted against the sum of false-positives (labelled non-concordant) (G, H). The 5 and 10% non-concordance (NC) levels are also illustrated as a solid or dashed line, respectively.
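The two quantities plotted in panels G and H can be sketched as follows. This is only an illustration under stated assumptions: the function name `concordance_points`, the caller names, and the idea of identifying each true positive by a benchmark variant ID are hypothetical, and the counts are toy values rather than results from the paper.

```python
from itertools import combinations


def concordance_points(true_positives, false_positive_counts):
    """For every combination of callers, return a pair
    (size of the union of true positives, sum of false positives).

    true_positives: dict mapping caller name -> set of matched benchmark IDs
    false_positive_counts: dict mapping caller name -> number of false positives
    """
    callers = sorted(true_positives)
    points = {}
    for size in range(1, len(callers) + 1):
        for combo in combinations(callers, size):
            # Union: a benchmark variant found by several callers counts once.
            concordant = set().union(*(true_positives[c] for c in combo))
            # Sum: false positives are added up across the combined callers.
            non_concordant = sum(false_positive_counts[c] for c in combo)
            points[combo] = (len(concordant), non_concordant)
    return points


# Toy data with made-up caller names and variant IDs.
toy_tp = {"caller_a": {"sv1", "sv2", "sv3"}, "caller_b": {"sv2", "sv4"}}
toy_fp = {"caller_a": 2, "caller_b": 1}
points = concordance_points(toy_tp, toy_fp)
```

Combining callers can only grow the union of true positives, but the false positives accumulate, which is the trade-off the panels visualise.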
Figure 3.
Performance of dysgu using PacBio reads. Precision-recall curves are shown for all genomic regions (A, B), as well as high-confidence Tier 1 regions (C, D). Intersections and aggregates of intersections of SV calls for the all-regions benchmark are displayed using an upset plot (E, F). Combinations of SV callers were assessed by plotting the union of true-positives (labelled concordant) against the sum of false-positives (labelled non-concordant) (G, H). The 5 and 10% non-concordance (NC) levels are shown as a solid or dashed line, respectively.
