Nat Commun. 2024 Aug 13;15(1):6956.
doi: 10.1038/s41467-024-51282-0.

VolcanoSV enables accurate and robust structural variant calling in diploid genomes from single-molecule long read sequencing


Can Luo et al. Nat Commun. 2024.

Abstract

Structural variants (SVs) significantly contribute to human genome diversity and play a crucial role in precision medicine. Although advancements in single-molecule long-read sequencing offer a groundbreaking resource for SV detection, identifying SV breakpoints and sequences accurately and robustly remains challenging. We introduce VolcanoSV, an innovative hybrid SV detection pipeline that utilizes both a reference genome and local de novo assembly to generate a phased diploid assembly. VolcanoSV uses phased SNPs and unique k-mer similarity analysis, enabling precise haplotype-resolved SV discovery. VolcanoSV is adept at constructing comprehensive genetic maps encompassing SNPs, small indels, and all types of SVs, making it well-suited for human genomics studies. Our extensive experiments demonstrate that VolcanoSV surpasses state-of-the-art assembly-based tools in the detection of insertion and deletion SVs, exhibiting superior recall, precision, F1 scores, and genotype accuracy across a diverse range of datasets, including low-coverage (10x) datasets. VolcanoSV outperforms assembly-based tools in the identification of complex SVs, including translocations, duplications, and inversions, in both simulated and real cancer data. Moreover, VolcanoSV is robust to various evaluation parameters and accurately identifies breakpoints and SV sequences.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. VolcanoSV overall workflow.
The main workflow of VolcanoSV consists of two key components, VolcanoSV-asm and VolcanoSV-vc. VolcanoSV-asm (left, blue square) comprises three conceptual modules to perform diploid assembly (partitioning reads into corresponding haplotypes, assigning unphased reads, and performing a haplotype-aware local assembly). The output of this component is processed by the VolcanoSV-vc component (center, red rectangle) to perform variant detection. Further details are provided in the Methods section.
Fig. 2. VolcanoSV-vc workflow.
VolcanoSV-vc includes three main modules: a large indel SV detection, b complex SV detection, and c small indel detection. The output of this component is a phased VCF file. Further details are provided in the Methods section.
Fig. 3. Cross-dataset evaluation against the GIAB HG002 benchmark.
a, b F1 heatmaps for deletions (DEL) and insertions (INS) by four assembly-based tools. c, d Recall bar plots for deletions (DEL) and insertions (INS) by four assembly-based tools. e, f Precision bar plots for deletions (DEL) and insertions (INS) by four assembly-based tools. g, h Genotype accuracy (represented by GT_concordance) bar plots for deletions (DEL) and insertions (INS) by four assembly-based tools. Source data are provided as a Source Data file.
Fig. 4. Overlapping calls, size distribution, and accuracy for SV discovery and complex SV analysis.
a–c UpSet plots for analysis of shared and unique true positive (TP) calls between different assembly-based tools. d–f F1 accuracy of SV detection at different size ranges. The negative size range represents deletions and the positive size range represents insertions. The bar plot shows benchmark SV distribution at different size ranges. The line plot shows the F1 score of four different methods. g F1 and GT_F1 heatmap for complex SV detection on simulated data. h The recall heatmap for complex somatic SV detection on real data. Source data are provided as a Source Data file.
Fig. 5. Subsampling effect for different methods.
a Recall-precision-F1 curves show the subsampling effect on deletion and insertion by different tools on Hifi_L1. b Recall-precision-F1 curves show the subsampling effect on deletion and insertion by different tools on CLR_L1. c Recall-precision-F1 curves show the subsampling effect on deletion and insertion by different tools on ONT_L1. The coverage depth varies from 5×, 10×, 20×, 30×, 40× to 50×. Solid lines with markers are for different coverage depths, and corresponding dashed lines are for genotyping (gt) accuracy. For both insertions and deletions, we zoom in on the top right part of the plot to demonstrate the curves more clearly. Source data are provided as a Source Data file.
Fig. 6. F1 accuracy by tuning different evaluation parameters and distribution of breakpoint shift and alternate allele sequence similarity for SVs on Hifi_L1.
a Grid search heatmap of F1 values for deletions by different assembly-based tools. b Distribution of breakpoint shift for deletions by assembly-based tools. c Distribution of alternate sequence similarity for deletions by assembly-based tools. d–f Equivalent visual representations as shown in a–c for insertions. Source data are provided as a Source Data file.
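
The recall, precision, and F1 values shown in the figures above follow the standard definitions over true-positive (TP), false-positive (FP), and false-negative (FN) call counts. A minimal Python sketch of that arithmetic is given here; the function name `sv_metrics` and the example counts are hypothetical and are not taken from the paper or from any benchmarking tool.

```python
def sv_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, recall, and F1 from benchmark call counts.

    tp: calls matching a benchmark SV
    fp: calls with no matching benchmark SV
    fn: benchmark SVs with no matching call
    """
    # Guard against empty call sets or empty benchmarks.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = sv_metrics(tp=90, fp=10, fn=30)
```

Because F1 is a harmonic mean, it falls when either precision or recall falls, which is why it is commonly used as a single summary score when comparing callers.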
