PLoS Comput Biol. 2015 Dec 1;11(12):e1004572.
doi: 10.1371/journal.pcbi.1004572. eCollection 2015 Dec.

Wham: Identifying Structural Variants of Biological Consequence

Zev N Kronenberg et al. PLoS Comput Biol. 2015.

Abstract

Existing methods for identifying structural variants (SVs) from short read datasets are inaccurate. This complicates disease-gene identification and efforts to understand the consequences of genetic variation. In response, we have created Wham (Whole-genome Alignment Metrics) to provide a single, integrated framework for both structural variant calling and association testing, thereby bypassing many of the difficulties that currently frustrate attempts to employ SVs in association testing. Here we describe Wham, benchmark it against three other widely used SV identification tools (Lumpy, Delly and SoftSearch), and demonstrate Wham's ability to identify and associate SVs with phenotypes using data from humans, domestic pigeons, and vaccinia virus. Wham and all associated software are covered under the MIT License and can be freely downloaded from GitHub (https://github.com/zeeev/wham), with documentation on a wiki (http://zeeev.github.io/wham/). For community support, please post questions to https://www.biostars.org/.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Sensitivity and false discovery rates (FDR) for simulated data.
The sensitivity and FDR of Delly, Lumpy, SoftSearch and Wham for simulated deletions, duplications, insertions and inversions. The sensitivity is measured for each category at depths of 10x and 50x. SVs ranging from 50 bp to 1 Mb are grouped into four left-closed size intervals. A) The sensitivity of the three tools is faceted on size, depth and SV type. At 10x Wham has noticeably better sensitivity for deletions and duplications in the smallest size class. Wham’s sensitivity is higher than that of Delly and Lumpy for insertions at 10x and gains sensitivity at 50x. B) The FDR for each type of SV, faceted by depth and the amount of slop added to each confidence interval. In the 25 bp slop category, each confidence interval was extended in both directions by 25 bp. At 10x depth Wham has the highest FDR across all SV classes and Lumpy has the lowest. At 50x Delly has heightened FDR for deletions and Lumpy has a much higher FDR for insertions. Shrinking the confidence intervals increases the FDR for Delly and Lumpy, but not Wham. C) Breakpoint sensitivity for deletions. The confidence intervals provided by the three tools are ignored and slop is incrementally added to the predicted breakpoints. Wham has the highest sensitivity when 1–10 bp of slop is added. D) Genotype sensitivity for the homozygous non-reference simulated SVs. Delly and Wham have similar sensitivity for deletions and duplications while both tools fail to correctly genotype duplications.
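The slop-based scoring described in the Fig 1 caption can be illustrated with a minimal Python sketch. This is not the authors' benchmarking code: the `Call` and `Truth` records, the matching rule (both true breakpoints must fall inside the slop-extended confidence intervals), and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Truth:
    """A simulated SV with exact breakpoints (hypothetical record)."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Call:
    """A predicted SV with a confidence interval per breakpoint (hypothetical record)."""
    chrom: str
    start_ci: Tuple[int, int]
    end_ci: Tuple[int, int]


def within(pos: int, ci: Tuple[int, int], slop: int) -> bool:
    # Extend the confidence interval by `slop` bp in both directions.
    low, high = ci
    return low - slop <= pos <= high + slop


def matches(call: Call, truth: Truth, slop: int) -> bool:
    # Assumed rule: both true breakpoints must fall in the extended intervals.
    return (
        call.chrom == truth.chrom
        and within(truth.start, call.start_ci, slop)
        and within(truth.end, call.end_ci, slop)
    )


def sensitivity_and_fdr(calls: List[Call], truths: List[Truth], slop: int):
    # Sensitivity: fraction of true SVs recovered by at least one call.
    found = sum(any(matches(c, t, slop) for c in calls) for t in truths)
    # FDR: fraction of calls that match no true SV.
    false_pos = sum(not any(matches(c, t, slop) for t in truths) for c in calls)
    sensitivity = found / len(truths) if truths else 0.0
    fdr = false_pos / len(calls) if calls else 0.0
    return sensitivity, fdr


# Made-up example data, not taken from the paper.
truths = [Truth("chr1", 1000, 2000), Truth("chr1", 5000, 5600)]
calls = [
    Call("chr1", (990, 995), (2000, 2010)),
    Call("chr1", (8000, 8010), (9000, 9010)),
]

for slop in (0, 25):
    print(slop, sensitivity_and_fdr(calls, truths, slop))
```

In this toy example the first call misses the true start by 5 bp, so it counts as a false positive with no slop and as a true positive once 25 bp of slop is added.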
Fig 2. Benchmarking Delly, Lumpy, SoftSearch and Wham against NA12878 and CHM1 datasets.
A) The sensitivity and FDR for filtered NA12878 Phase III deletion calls across four size intervals. The number of true positives and the number of NA12878 calls are listed above sensitivity, while the total number of false positives and total calls for each tool are listed above FDR. Most true positives and false positives are within the 150–1,000 bp interval. B) The sensitivity and FDR for CHM1 deletions. C) The size distribution of the true positive calls that overlap the CHM1 deletions. One thousand true positives were randomly sampled from each tool and the truth set (CHM1-DEL).
Fig 3. Genotyping assays.
A) Comparison of Genome STRiP (GS) genotypes vs. Delly, Lumpy and Wham. The x-axis lists the GS genotype. Different colors denote the zygosity of the Delly, Lumpy, and Wham genotypes. B) The fraction of Chromosome 1 deletions for the NA12878, NA12877, and NA12882 trio that conform to Mendelian inheritance patterns. C) The CEPH/Utah Pedigree 1463 allele frequency (AF) spectrum represented as an empirical cumulative distribution function (ECDF). This curve is derived from Chromosome 1 deletions. FB, Freebayes; UG, Unified Genotyper [49].
Fig 4. Identification of the e1 allele using Wham’s LRT.
A) Wham’s LRT interrogates allele frequency differences between recessive red and wild type birds. Genomic scaffolds are denoted by different colors and are sorted by size in increasing order. The highest LRT score (dashed vertical line) is a 7.5-kb deletion upstream of the Sox10 gene, which encodes a transcription factor that is critical to the melanin synthesis pathway. Only LRT values above 1.5 are shown in A. B) The quantile-quantile plot after converting Wham’s likelihood ratio values to p-values. C) Scaffold974 association tests from SNPs, Delly SV calls and Wham SV calls.
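The conversion from likelihood ratio values to p-values mentioned for panel B can be sketched as follows. The caption does not state the reference distribution, so this sketch assumes the LRT statistic follows a chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x / 2)). The function names and the (i + 0.5) / n plotting positions are illustrative choices, not Wham's implementation.

```python
import math
import sys
from typing import List, Tuple


def lrt_to_p_value(lrt_statistic: float) -> float:
    """Upper-tail probability of a chi-square(1) variable (assumed null)."""
    if lrt_statistic < 0:
        raise ValueError("LRT statistic must be non-negative")
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt_statistic / 2.0))


def qq_points(p_values: List[float]) -> List[Tuple[float, float]]:
    """Pair expected and observed -log10(p) values for a Q-Q plot."""
    n = len(p_values)
    # Clamp to the smallest positive float so log10 never receives zero.
    observed = sorted(max(p, sys.float_info.min) for p in p_values)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]


# Made-up LRT scores, not taken from the paper.
scores = [0.2, 1.5, 3.84, 12.0]
p_values = [lrt_to_p_value(s) for s in scores]
for expected, seen in qq_points(p_values):
    print(f"expected={expected:.3f} observed={seen:.3f}")
```

Points that rise well above the diagonal in such a plot correspond to variants whose association signal exceeds what the assumed null distribution predicts.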
Fig 5. Wham detects structural variation in vaccinia virus populations.
A) Read depth normalized within each sample is plotted across the ~200 kb vaccinia genome (excluding inverted terminal repeats) for either the parental strain (top panel) or an adapted strain (middle and bottom panels, called by Wham or Lumpy, respectively). Arrows highlight the positions of the K3L CNV and the E3L deletion. The black lines represent the breakpoints of every SV call after filtering (see Supporting Information). B) Wham calls in the adapted strain near the K3L duplication breakpoint are shown as black triangles above the viral genes in colored boxes. The height of the triangle represents the split-read (SR) count supporting the call. Sanger sequencing positions relative to the reference sequence are listed below. Asterisks (*) indicate Wham calls that match the exact breakpoint determined by Sanger sequencing (see S3 Table for Wham and Lumpy breakpoints). C) Wham calls in the adapted strain near the E3L deletion are shown above the genes, and Sanger sequence confirmed positions below, as in B. The arrow indicates the position of the 11K promoter driving β-gal expression. For breakpoints in grey, the height of the triangle indicates the relative mate-pair count from Wham, as these positions do not have SR support.

References

    1. Gemayel R, Vinces MD, Legendre M, Verstrepen KJ (2010) Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu Rev Genet 44: 445–477. doi: 10.1146/annurev-genet-072610-155046
    2. Chan YF, Marks ME, Jones FC, Villarreal G, Shapiro MD, et al. (2010) Adaptive evolution of pelvic reduction in sticklebacks by recurrent deletion of a Pitx1 enhancer. Science 327: 302–305. doi: 10.1126/science.1182213
    3. Perry G, Yang F, Marques-Bonet T (2008) Copy number variation and evolution in humans and chimpanzees. Genome Res 18: 1698–1710. doi: 10.1101/gr.082016.108
    4. Axelsson E, Ratnakumar A, Arendt M-L, Maqbool K, Webster MT, et al. (2013) The genomic signature of dog domestication reveals adaptation to a starch-rich diet. Nature 495: 360–364. doi: 10.1038/nature11837
    5. McCarroll S, Altshuler DM (2007) Copy-number variation and association studies of human disease. Nat Genet 39: S37–S42.
