Genome Res. 2018 Apr;28(4):581-591.
doi: 10.1101/gr.221028.117. Epub 2018 Mar 13.

SvABA: genome-wide detection of structural variants and indels by local assembly

Jeremiah A Wala et al. Genome Res. 2018 Apr.

Abstract

Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA's performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ∼4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50-300 bp) SVs.

Figures

Figure 1.
Overview of the SvABA structural variation detection tool. (A, left) SvABA uses String Graph Assembler (SGA) to assemble aberrantly aligned sequence reads that may reflect an indel or SV. Such reads include gapped alignments (for indels), clipped alignments (for medium and large SVs), and discordant read pairs (for large SVs). In addition to detecting indels and SVs, SvABA can identify complex rearrangement junctions (middle) and sites of viral integration (right). (B) The workflow for the SvABA pipeline: (1) reads within a small window are extracted from one or multiple BAM files and discordant reads are clustered; (2) discordant reads are realigned to the reference to remove pairs that have a candidate nondiscordant alignment; (3) the discordant read clusters are used to identify additional regions where reads should be extracted; (4) the sequences are error-corrected with BFC and assembled with SGA into contigs, and contigs are immediately aligned to the reference with BWA-MEM; (5) contigs with multipart alignments or gapped alignments are parsed to extract candidate variants; and (6) sequence reads are aligned to the contig and to the reference to establish read support for the reference and alternative haplotypes.
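As a rough illustration of the three evidence types named in panel A, the sketch below sorts a read into gapped, clipped, and discordant categories from its CIGAR string and proper-pair flag. It is not SvABA's code: `classify_read` and its arguments are hypothetical names, and a production caller would typically add filters (mapping quality, clip length, insert-size thresholds) that are omitted here.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def classify_read(cigar, is_proper_pair):
    """Return the set of evidence types a single read alignment carries.

    Hypothetical helper for illustration only.
    cigar          -- CIGAR string from the alignment, e.g. "70M5D81M"
    is_proper_pair -- True if the aligner flagged the pair as properly paired
    """
    ops = {op for _, op in CIGAR_OP.findall(cigar)}
    evidence = set()
    if ops & {"I", "D"}:   # insertion or deletion inside the alignment
        evidence.add("gapped")
    if ops & {"S", "H"}:   # soft- or hard-clipped bases at a read end
        evidence.add("clipped")
    if not is_proper_pair:  # unexpected orientation or insert size
        evidence.add("discordant")
    return evidence
```

For example, `classify_read("70M5D81M", True)` reports only gapped evidence, while a read with CIGAR `90M61S` in an improper pair carries both clipped and discordant evidence.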
Figure 2.
Detection of SVs and indels in the NA12878 human genome. (A) The number of SV events and the types of supporting evidence used by SvABA for detecting SV events of different lengths (indel variants not shown). SVs are detected through realignment of assembly contigs (purple), discordant read clusters (orange), or a combination of both (green). SVs with shorter lengths than the average size of the sequencing fragments are identified almost exclusively through assembly and realignment. (B) The length distributions of indels and small SVs in NA12878 determined from different sequencing and analytical technologies: 151-base paired-end Illumina sequencing by SvABA (red), HySA calls from PacBio sequencing data (blue), and the indel call set of the Genome in a Bottle consortium (green). (C) Comparison of detection accuracy of SvABA, LUMPY, DELLY, and Pindel for deletions (left) and for insertions/duplications (right) across three different length regimes in NA12878. The F1 score is a combined measure of precision and recall and was calculated using the PacBio assemblies and Genome in a Bottle (GIAB) as a truth set. (D) Total CPU and peak memory usage for several indel and SV detection tools applied to a single 33× human genome. SGA CPU and memory usage were estimated using published data (Simpson and Durbin 2012).
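The F1 score referred to in panel C is the harmonic mean of precision and recall. A minimal sketch of the calculation follows; the function name and the counts in the usage note are illustrative assumptions, not values from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Compute F1 from counts of calls compared against a truth set.

    precision = TP / (TP + FP): fraction of calls that are in the truth set
    recall    = TP / (TP + FN): fraction of truth-set variants that were called
    F1        = 2 * precision * recall / (precision + recall)
    """
    if true_pos == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

With made-up counts of 80 true positives, 20 false positives, and 20 false negatives, precision and recall are both 0.8, so the F1 score is also 0.8.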
Figure 3.
Benchmarking somatic variants with an in silico tumor. (A) True positive counts for indel calling (y-axis) as a function of variant size (x-axis) for SvABA (red), Pindel (blue), FreeBayes (orange), and Strelka (light blue). All callers achieved similar sensitivities for small somatic indels, while SvABA maintained high sensitivity for larger (>10 bp) indels. (B) Stacked bar chart of the number of SVs detected across all SV types (y-axis) as a function of variant size (x-axis). SvABA maintained sensitivity across variants of all sizes. novoBreak had the second highest sensitivity for medium and large variants after SvABA. Combining calls from a dedicated indel and SV caller (LUMPY and FreeBayes or DELLY and Strelka) improved overall sensitivity, but still left a gap for medium-sized SVs.
Figure 4.
Somatic variant detection in the HCC1143 breast cancer cell line using different sequencing and informatics approaches. (A) Comparison of combined somatic SV and indel detection in HCC1143 using: local assembly using SvABA with 101-base paired-end reads (purple), SvABA with 250-base paired-end PCR-free reads (orange), or global assembly using DISCOVAR de novo on 250-base paired-end PCR-free reads and SVlib to extract variants (green). (B) Somatic variant counts (y-axis) for DISCOVAR de novo (250-base PCR-free reads; green) and SvABA using 101-base (purple) or 250-base PCR-free reads (orange), as a function of variant size (x-axis). All methods have similar sensitivities across different sizes, except DISCOVAR de novo was more sensitive to short indels in simple repeats.
Figure 5.
SvABA identifies rearrangements with short templated-sequence insertions (STSI) derived from distant genomic loci. (A) Somatic rearrangement between Chr X and Chr 11 in HCC1143 containing a 38-bp fragment of Chr 8. STSI rearrangements are identified by assembly contigs that have multiple non-overlapping alignments to the reference. The direction of the arrows represents the strand that the contig fragment was aligned to (right-facing is forward strand). (B) Partial view of the contig from A showing the multiple alignments of the contig to the reference and the read-to-contig alignments. The top three lines indicate which bp of the contig each of the three BWA-MEM alignments covers (> is forward strand alignment; < is reverse strand alignment). The first two alignments indicate an insertion of 5 bp of novel sequence at the first junction (left), and the second two indicate 4 bp of microhomology at the second junction (right). The middle alignment supports the STSI fragment. These plots are automatically generated by SvABA for each variant (in the *.alignments.txt.gz file). (C) STSI fragment lengths from somatic rearrangements across 344 cancer genomes (mean 86 bp). (D) Prevalence of STSI rearrangements (x-axis) across 11 tumor types (y-axis). (ESAD) esophageal cancer; (BRCA) breast cancer; (LUSC) lung squamous cell carcinoma; (SKCM) melanoma; (GBM) glioblastoma; (HNSC) head and neck squamous cell carcinoma; (LUAD) lung adenocarcinoma; (COAD) colorectal adenocarcinoma; (THCA) thyroid carcinoma; (PRAD) prostate adenocarcinoma; (LIHC) hepatocellular carcinoma. (E) Bases of breakpoint microhomology (x-axis) for different cancer types (y-axis) for somatic STSI rearrangements (green) and somatic non-STSI rearrangements (purple). The STSI rearrangements have a significantly higher degree of breakpoint microhomology than their non-STSI counterparts across all tumor types.
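The junction reading described in panel B can be sketched as follows: order a contig's alignments by their position on the contig, then treat a gap between neighbouring alignments as novel inserted sequence and an overlap as microhomology. This is an assumed simplification rather than SvABA's implementation; the `ContigAlignment` class, both function names, and the contig coordinates are hypothetical, chosen only so that the example reproduces the 5-bp insertion, 4-bp microhomology, and 38-bp Chr 8 fragment mentioned in the caption.

```python
from dataclasses import dataclass


@dataclass
class ContigAlignment:
    """One alignment of part of a contig to the reference (hypothetical type)."""
    contig_start: int  # first contig base covered, 0-based inclusive
    contig_end: int    # one past the last contig base covered
    chrom: str
    strand: str        # "+" or "-"


def junctions(alignments):
    """Classify each junction between neighbouring alignments on the contig."""
    ordered = sorted(alignments, key=lambda a: a.contig_start)
    result = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.contig_start - left.contig_end
        if gap > 0:
            result.append(("insertion", gap))        # unaligned contig bases
        elif gap < 0:
            result.append(("microhomology", -gap))   # bases shared by both sides
        else:
            result.append(("blunt", 0))
    return result


def templated_insertions(alignments, max_len=1000):
    """Return interior alignments shorter than max_len as candidate STSI fragments."""
    ordered = sorted(alignments, key=lambda a: a.contig_start)
    return [a for a in ordered[1:-1] if a.contig_end - a.contig_start < max_len]


# Illustrative coordinates only; strands are arbitrary.
example = [
    ContigAlignment(0, 120, "chrX", "+"),
    ContigAlignment(125, 163, "chr8", "-"),   # 38-bp middle fragment
    ContigAlignment(159, 300, "chr11", "+"),
]
```

On this example, `junctions(example)` yields an insertion of 5 bp at the first junction and 4 bp of microhomology at the second, and `templated_insertions(example)` returns the single Chr 8 fragment.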
