Brief Bioinform. 2024 Jan 22;25(2):bbae049.
doi: 10.1093/bib/bbae049.

Kled: an ultra-fast and sensitive structural variant detection tool for long-read sequencing data

Zhendong Zhang et al. Brief Bioinform. 2024.

Abstract

Structural variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes, so identifying SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data. Its specially designed approach combines a novel signature-merging algorithm, custom refinement strategies and a high-performance program structure. The evaluation results demonstrate that kled achieves the best SV calling performance among several state-of-the-art methods on simulated and real long-read data across different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple Central Processing Unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.

Keywords: long-read sequencing; structural variation; variant calling.

Figures

Figure 1
Schematic diagram of the overall design of kled. Step 1: Signatures are collected from CIGAR strings and split read information, and the read depth information is collected. Step 2: CIGAR signatures are first merged by an innovative algorithm called OMA, and the signatures are separated by SV types and areas and subsequently clustered by spans and lengths. Step 3: The clusters are refined by the supporting reads and lengths using self-adaptive criteria, after which a genotyping phase based on the supporting reads is conducted.
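Step 1 of the caption above, collecting signatures from CIGAR strings, can be illustrated with a minimal Python sketch. This is not kled's code: the `Signature` type, the `cigar_signatures` function and the 30-base `min_len` threshold are assumptions made for illustration, and the OMA merging step and split-read handling are not reproduced here.

```python
import re
from typing import List, NamedTuple


class Signature(NamedTuple):
    """A candidate SV signature found inside one alignment (hypothetical type)."""
    sv_type: str    # "DEL" or "INS"
    ref_start: int  # 0-based reference position where the event begins
    length: int     # event length in bases


# One CIGAR operation: a run length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_signatures(cigar: str, ref_start: int, min_len: int = 30) -> List[Signature]:
    """Walk a CIGAR string and report long insertions and deletions.

    min_len is an assumed cutoff for illustration, not a value taken from kled.
    """
    signatures = []
    pos = ref_start
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op == "D" and n >= min_len:
            signatures.append(Signature("DEL", pos, n))
        elif op == "I" and n >= min_len:
            signatures.append(Signature("INS", pos, n))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            pos += n
    return signatures


example = cigar_signatures("100M50D20M5I30M40I10M", ref_start=1000)
for sig in example:
    print(sig)
```

In this example the 5-base insertion falls below the assumed threshold and is skipped, while the 50-base deletion and the 40-base insertion are reported with their reference coordinates.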
Figure 2
F1 score comparisons on simulated data. The vertical axes denote the F1 scores for presence or genotype; the horizontal axes represent the sequencing depths of the data. The sub-figures include (A) overall presence F1 score benchmarking; presence F1 score benchmarking for (B) deletions (with an enlarged view for 20× and 30× sequencing depths), (C) insertions (with an enlarged view for 20× and 30× sequencing depths), (D) duplications and (E) inversions; (F) overall genotype F1 score benchmarking; genotype F1 score benchmarking for (G) deletions (with an enlarged view for 20× and 30× sequencing depths), (H) insertions (with an enlarged view for 20× and 30× sequencing depths), (I) duplications and (J) inversions.
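For readers unfamiliar with the metric plotted here, the F1 score is the harmonic mean of precision and recall. The sketch below uses the standard definition; the function name and the example counts are made up for illustration and are not results from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from call counts."""
    if true_pos == 0:
        return 0.0  # no correct calls: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 90 correct calls, 10 spurious calls, 30 missed SVs.
example_f1 = f1_score(true_pos=90, false_pos=10, false_neg=30)
print(round(example_f1, 4))
```

A genotype F1 score follows the same formula, with a call counted as a true positive only when its genotype also matches the truth set.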
Figure 3
Benchmarking results on the Genome in a Bottle consortium (GIAB) SV dataset for HG002. The vertical axes denote the F1 scores or genotype F1 scores, and the horizontal axes represent the sequencing depths of the data. The sub-figures include presence F1 score benchmarking on (A) PacBio HiFi data, (B) PacBio CLR data, and (C) ONT data; genotype F1 score benchmarking on (D) PacBio HiFi data, (E) PacBio CLR data and (F) ONT data.
Figure 4
MDR comparisons on Ashkenazi trio family data. The vertical axes represent the MDR value, where lower values are preferable; the horizontal axes and colors mark different tools. In the SV-type specific diagrams, the sizes of the bubbles indicate the counts of consistent SV loci, and larger bubbles are more desirable. The sub-figures include (A) Histogram of the MDR values; MDR values and consistent SV loci counts for (B) deletions, (C) insertions, (D) duplications and (E) inversions.
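The caption does not expand MDR. Assuming it denotes the Mendelian discordance rate, a minimal sketch of the idea is the fraction of trio loci where the child's genotype cannot be formed from one allele of each parent. The function names, the genotype encoding and the example trios are hypothetical, and the paper's exact rules for which loci are counted may differ.

```python
from itertools import product
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # e.g. (0, 1) for a heterozygous call


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if the child genotype can be built from one paternal and one maternal allele."""
    target = sorted(child)
    return any(sorted((pat, mat)) == target for pat, mat in product(father, mother))


def mendelian_discordance_rate(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, father, mother) genotype triples that violate Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        return 0.0
    discordant = sum(not is_mendelian_consistent(c, f, m) for c, f, m in trios)
    return discordant / len(trios)


# Made-up genotypes at four loci, ordered as (child, father, mother).
example_trios = [
    ((0, 1), (0, 0), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # discordant: father has no alternate allele
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (1, 1), (1, 1)),  # discordant: neither parent has a reference allele
]
example_mdr = mendelian_discordance_rate(example_trios)
print(example_mdr)
```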
Figure 5
Line charts of the time (s) and maximum physical memory (MB) consumed to complete SV detection on 30× data from the PacBio HiFi, PacBio CLR and ONT platforms. The vertical axes represent the time consumed or the maximum physical memory occupied; the horizontal axes denote the thread counts used for detection. Sub-figures: time consumption on (A) 30× PacBio HiFi data, (B) 30× PacBio CLR data and (C) 30× ONT data; maximum physical memory requirements on (D) 30× PacBio HiFi data, (E) 30× PacBio CLR data and (F) 30× ONT data.
