PLoS Comput Biol. 2022 Feb 17;18(2):e1009269.
doi: 10.1371/journal.pcbi.1009269. eCollection 2022 Feb.

Tool evaluation for the detection of variably sized indels from next generation whole genome and targeted sequencing data

Ning Wang et al.

Abstract

Insertions and deletions (indels) in human genomes are associated with a wide range of phenotypes, including various clinical disorders. High-throughput, next generation sequencing (NGS) technologies enable the detection of short genetic variants, such as single nucleotide variants (SNVs) and indels. However, the variant calling accuracy for indels remains considerably lower than for SNVs. Here we present a comparative study of the performance of variant calling tools for indel calling, evaluated with a wide repertoire of NGS datasets. While there is no single optimal tool to suit all circumstances, our results demonstrate that the choice of variant calling tool greatly impacts the precision and recall of indel calling. Furthermore, to reliably detect indels, it is essential to choose NGS technologies that offer a long read length and high coverage coupled with specific variant calling tools.
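The precision, recall and F1 score used to compare the tools can be illustrated with a minimal sketch. It assumes the standard definitions in terms of true positives, false positives and false negatives. The `CallCounts` class, the function names and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    """Counts from comparing a tool's indel calls against a truth set."""
    tp: int  # calls that match a truth-set indel
    fp: int  # calls with no match in the truth set
    fn: int  # truth-set indels the tool did not call


def precision(c: CallCounts) -> float:
    """Fraction of the tool's calls that are correct."""
    called = c.tp + c.fp
    return c.tp / called if called else 0.0


def recall(c: CallCounts) -> float:
    """Fraction of truth-set indels that the tool recovered."""
    truth = c.tp + c.fn
    return c.tp / truth if truth else 0.0


def f1_score(c: CallCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for illustration only; not results from the study.
example = CallCounts(tp=80, fp=20, fn=20)
print(precision(example), recall(example), f1_score(example))
```

Returning 0.0 when a denominator is zero is a convenience choice in this sketch; an evaluation pipeline might instead report such cases as undefined.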

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Precision rates, recall rates and F1 scores for tools’ deletion calling using the semi-simulated datasets.
(A) 5× coverage, 100bp read length sequencing data. (B) 30× coverage, 100bp read length sequencing data. (C) 60× coverage, 100bp read length sequencing data. (D) 30× coverage, 250bp read length sequencing data.
Fig 2. Precision rates, recall rates and F1 scores for tools’ insertion calling using the semi-simulated datasets.
(A) 5× coverage, 100bp read length sequencing data. (B) 30× coverage, 100bp read length sequencing data. (C) 60× coverage, 100bp read length sequencing data. (D) 30× coverage, 250bp read length sequencing data.
Fig 3. Precision rates, recall rates and F1 scores of the tools for calling deletions < 50bp using the semi-simulated datasets.
(A) 5× coverage, 100bp read length sequencing data. (B) 30× coverage, 100bp read length sequencing data. (C) 60× coverage, 100bp read length sequencing data. (D) 30× coverage, 250bp read length sequencing data.
Fig 4. Precision rates, recall rates and F1 scores of the tools for calling insertions < 50bp using the semi-simulated datasets.
(A) 5× coverage, 100bp read length sequencing data. (B) 30× coverage, 100bp read length sequencing data. (C) 60× coverage, 100bp read length sequencing data. (D) 30× coverage, 250bp read length sequencing data.
Fig 5. Homozygous and heterozygous precisions of variant calling tools using the semi-simulated dataset.
(A) 5× coverage, 100bp read length sequencing data. (B) 30× coverage, 100bp read length sequencing data. (C) 60× coverage, 100bp read length sequencing data. (D) 30× coverage, 250bp read length sequencing data.
Fig 6. Proportions of FP indels annotated to SR regions for each tool, using the semi-simulated dataset.
(A) 5× coverage, 100bp read length sequencing data. (B) 30× coverage, 100bp read length sequencing data. (C) 60× coverage, 100bp read length sequencing data. (D) 30× coverage, 250bp read length sequencing data. The numbers on the top of each bar are the total numbers of FP results called by each tool with corresponding sequencing data.
Fig 7. Indel calling evaluation results for variant calling tools with GIAB NA24385 WES data.
The evaluation results were calculated using hap.py. The table below shows the values of the precision rates, recall rates, and F1 scores of each tool.
Fig 8. Indel calling results for variant calling tools with CHM1 cell line sequencing data.
(A) The FDRs and sensitivities of deletion calls in the size range 50bp – 200bp. (B) The FDRs and sensitivities of insertion calls in the size range 50bp – 200bp. (C) The FDRs and sensitivities of deletion calls in the size range 200bp – 500bp. (D) The FDRs and sensitivities of insertion calls in the size range 200bp – 500bp. (E) The FDRs and sensitivities of deletion calls in the size range ≥ 500bp. (F) The FDRs and sensitivities of insertion calls in the size range ≥ 500bp. The number of FPs and the number of tool-detected indels are listed above the FDRs. The number of TPs and the number of indels in the truth set are listed above the sensitivities.
Fig 9. The running times and the maximum memory usages of variant calling tools.
(A) Total CPU time of each variant calling tool with the 30× coverage, 250bp read length semi-simulated sequencing data. The pre-processing CPU time covers aligning the sequencing reads into a BAM file. The tool CPU time covers processing the input BAM file into the output VCF result. (B) Maximum memory usage of each variant calling tool with the 30× coverage, 250bp read length semi-simulated sequencing data. The pre-processing maximum memory covers aligning the sequencing reads into a BAM file. The tool maximum memory covers processing the input BAM file into the output VCF result. Because FermiKit is a de novo assembly-based variant calling tool that takes sequencing reads as input, it did not require pre-processing.
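The Fig 8 caption pairs each FDR with the number of FPs and tool-detected indels, and each sensitivity with the number of TPs and truth-set indels. A minimal sketch of that arithmetic, grouped by the three size ranges named in the caption, could look as follows. The function names are hypothetical, and treating each range as half-open (lower bound included, upper bound excluded) is an assumption, since the caption does not say how boundary sizes such as exactly 200bp were assigned.

```python
from typing import Optional

# Size ranges named in the Fig 8 caption. The half-open boundaries
# (low <= length < high) are an assumption of this sketch.
SIZE_BINS = [
    ("50-200bp", 50, 200),
    ("200-500bp", 200, 500),
    (">=500bp", 500, None),
]


def size_bin(length_bp: int) -> Optional[str]:
    """Return the label of the bin containing length_bp, or None if below 50bp."""
    for label, low, high in SIZE_BINS:
        if length_bp >= low and (high is None or length_bp < high):
            return label
    return None


def fdr(fp: int, detected: int) -> float:
    """False discovery rate: false positives over all indels the tool called."""
    return fp / detected if detected else 0.0


def sensitivity(tp: int, truth: int) -> float:
    """Sensitivity: true positives over all indels in the truth set."""
    return tp / truth if truth else 0.0


# Made-up counts for illustration only; not results from the study.
print(size_bin(120), fdr(fp=5, detected=50), sensitivity(tp=30, truth=40))
```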

Grants and funding

N.W. has received funding from the Turku University Foundation (https://www.yliopistosaatio.fi/en/). K.O. has received funding from State Research Funding from the Turku University Hospital (https://www.vsshp.fi/en/tutkijoille/rahoitus/Pages/default.aspx). L.L.E. reports grants from the European Research Council ERC (677943) (https://erc.europa.eu/), Academy of Finland (296801, 310561, 314443, 329278, 335434, 335611) (https://www.aka.fi/en/), and Sigrid Juselius Foundation (https://www.sigridjuselius.fi/en/), during the conduct of the study. Our research is also supported by University of Turku Graduate School (UTUGS), Biocenter Finland, and ELIXIR Finland. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.