Brief Bioinform. 2024 May 23;25(4):bbae336.
doi: 10.1093/bib/bbae336.

SVDF: enhancing structural variation detect from long-read sequencing via automatic filtering strategies

Heng Hu et al. Brief Bioinform.

Abstract

Structural variation (SV) is an important form of genomic variation that influences gene function and expression by altering the structure of the genome. Although long-read data have been shown to characterize SVs better, SVs detected from noisy long-read data still include a considerable portion of false-positive calls. To detect SVs accurately in long-read data, we present SVDF, a method that combines a learning-based noise filtering strategy with an SV signature-adaptive clustering algorithm to reduce the likelihood of false-positive events. Benchmarking results from multiple orthogonal experiments demonstrate that, across different sequencing platforms and depths, SVDF achieves higher calling accuracy for each sample than several existing general SV calling tools. We believe that, with its meticulous and sensitive SV detection capability, SVDF can bring new opportunities and advancements to cutting-edge genomic research.

Keywords: deep learning; false-positives; long-read sequencing; structural variation detection.

Figures

Figure 1
The overall workflow of SVDF. (1) SV signature collection: SVDF collects and records various types of SV signatures from intra- and inter-alignments. (2) Signature clustering: a two-step adaptive clustering algorithm is employed to enhance the signals of true SV events. (3) SV filtering: a CNN is employed to filter false-positive SVs from the clustering results. (4) SV calling and genotyping: SVDF generates the SV call set and assigns genotypes.
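Steps (1) and (2) of the workflow can be illustrated with a generic position-based clustering of SV signatures. This is a minimal sketch and not SVDF's two-step adaptive algorithm, whose details are not given here. The `Signature` record, the `cluster_signatures` function, and the `max_gap` and `min_support` parameters are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """One SV signature observed in a read (hypothetical record layout)."""
    chrom: str
    pos: int
    svtype: str
    length: int
    read: str


def cluster_signatures(signatures, max_gap=200, min_support=2):
    """Greedy clustering: chain signatures of the same chromosome and type
    whose start positions lie within max_gap of the previous signature,
    then keep clusters supported by at least min_support distinct reads."""
    ordered = sorted(signatures, key=lambda s: (s.chrom, s.svtype, s.pos))
    clusters, current = [], []
    for sig in ordered:
        same_group = (
            current
            and (sig.chrom, sig.svtype) == (current[-1].chrom, current[-1].svtype)
            and sig.pos - current[-1].pos <= max_gap
        )
        if same_group:
            current.append(sig)
        else:
            if current:
                clusters.append(current)
            current = [sig]
    if current:
        clusters.append(current)
    # Discard weakly supported clusters (assumed threshold, not SVDF's).
    return [c for c in clusters if len({s.read for s in c}) >= min_support]


# Toy input: three nearby deletions, one isolated deletion, one insertion.
sigs = [
    Signature("chr1", 1000, "DEL", 300, "r1"),
    Signature("chr1", 1050, "DEL", 310, "r2"),
    Signature("chr1", 1100, "DEL", 295, "r3"),
    Signature("chr1", 5000, "DEL", 120, "r5"),
    Signature("chr1", 1020, "INS", 80, "r4"),
]
clusters = cluster_signatures(sigs)
```

In this toy input only the three nearby deletions survive, because the isolated deletion and the insertion are each supported by a single read. A real caller would also compare SV lengths and apply a further filter, such as the CNN in step (3).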
Figure 2
SV detection in the simulated sample. (a) and (b) F1 scores of all tools for each SV type, with CLR data on the left and ONT data on the right. (c) Comparative performance of all tools under different minimum supporting read thresholds. (d) Trends in the precision and recall of SVDF's SV calling under different minimum supporting read thresholds.
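The precision, recall, and F1 metrics in these panels follow their standard definitions. The sketch below recomputes them while sweeping a minimum supporting read threshold. The `calls` list and the `n_truth` count are invented toy data, not results from the paper, and the sketch assumes each true call matches a distinct benchmark SV.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def sweep_min_support(calls, n_truth, thresholds):
    """calls: list of (supporting_reads, matches_benchmark) pairs.
    n_truth: number of SVs in the benchmark set.
    Returns {threshold: (precision, recall, f1)}."""
    results = {}
    for t in thresholds:
        kept = [matched for support, matched in calls if support >= t]
        tp = sum(kept)          # kept calls that match the benchmark
        fp = len(kept) - tp     # kept calls that do not
        fn = n_truth - tp       # benchmark SVs left uncalled
        results[t] = precision_recall_f1(tp, fp, fn)
    return results


# Toy data (hypothetical): five calls against a benchmark of four SVs.
calls = [(10, True), (8, True), (3, True), (2, False), (1, False)]
metrics = sweep_min_support(calls, n_truth=4, thresholds=[1, 3, 5])
```

With this toy data, raising the threshold from 1 to 3 removes both false positives without losing a true call, while raising it to 5 begins to cost recall.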
Figure 3
SV detection in the HG002 sample. (a) and (b) Comparison of overall F1 scores on the Tier1 GIAB genome-wide SV benchmark (left) and the CMRG benchmark (right). (c) Radar chart of MDR results for the Ashkenazi trio, including MDR values overall and for different SV types. (d) and (e) The line plots show the F1 scores of all tools across SV sizes in CLR and ONT data, while the bar plots indicate the number of SVs called by SVDF in each SV size range. (f–h) Comparison of F1 scores on down-sampled datasets, shown from left to right for CLR, CCS (HiFi), and ONT data.
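The caption does not expand MDR. Assuming it denotes the Mendelian discordance rate commonly reported for trio call sets, the measure can be sketched as the fraction of SV sites where the child's genotype cannot be formed from one paternal and one maternal allele. The function names and the tuple genotype encoding are hypothetical, and sites with missing genotypes are assumed to have been excluded beforehand.

```python
def is_mendelian_consistent(child, father, mother):
    """Genotypes are (allele, allele) tuples with 0 = reference, 1 = SV.
    The child is consistent if one allele can come from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mendelian_discordance_rate(trio_genotypes):
    """trio_genotypes: iterable of (child, father, mother) genotype tuples.
    Returns discordant sites / total sites, or 0.0 for an empty input."""
    total = discordant = 0
    for child, father, mother in trio_genotypes:
        total += 1
        if not is_mendelian_consistent(child, father, mother):
            discordant += 1
    return discordant / total if total else 0.0


# Toy trio genotypes (hypothetical), ordered child, father, mother.
sites = [
    ((0, 1), (0, 0), (0, 1)),  # consistent: 0 from father, 1 from mother
    ((1, 1), (0, 0), (1, 1)),  # discordant: father carries no SV allele
    ((0, 0), (1, 1), (0, 0)),  # discordant: father can only pass the SV allele
    ((0, 1), (1, 1), (0, 0)),  # consistent: 1 from father, 0 from mother
]
mdr = mendelian_discordance_rate(sites)
```

A lower rate suggests more consistent genotyping across the trio, although some discordant sites can reflect genuine de novo events rather than calling errors.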
Figure 4
SV detection in the CHM13 sample. (a) and (b) Precision–recall graphs of all tools for SV calling and genotyping in the CHM13 sample, compared with the assembly-based method (SVIM is hidden in the genotyping panel because of its low performance). (c) The distribution of breakpoint deviation for SVs successfully called by all tools in the CHM13 sample, with the horizontal axis representing the number of SV calls and the vertical axis representing the range of breakpoint deviation.
Figure 5
SV detection in the HCC1395 cancer sample. (a) The number of SVs of each type successfully called by each tool, evaluated against the HCC1395 cancer-specific SV benchmark set. (b) A Venn diagram showing the overlap of validated SVs called in the HCC1395 cancer sample by SVDF and the comparative tools, with numbers giving the size of each overlap.
