Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Nov 1;35(22):4716-4723.
doi: 10.1093/bioinformatics/btz275.

NGSEP3: accurate variant calling across species and sequencing protocols

Affiliations

NGSEP3: accurate variant calling across species and sequencing protocols

Daniel Tello et al. Bioinformatics. .

Abstract

Motivation: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features.

Results: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparative accuracy and better efficiency compared to the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture.

Availability and implementation: NGSEP is available as open source software at http://ngsep.sf.net.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Schematic procedure for indel realignment and haplotype clustering. (A) Flowchart with the overview of the process. (B) Voting mechanism to select the most likely start of a given indel. Blue bars represent aligned reads. The numbers on top represent the reads supporting each possible start position. (C) Realignment of reads supporting indel events according to the results of the voting procedure and clustering of candidate haplotypes to discover and genotype small indels. Reads that do not span across the indel (marked with X) are soft-clipped
Fig. 2.
Fig. 2.
Comparison of different tools for variants discovery from reads taken from an F1 pool of segregants derived from two yeast haploid strains. Results are discriminated by variant type (SNVs, Indels and STRs) and gold standard genotype (homozygous variant or heterozygous). The proportion of FPPM is used as a measure of specificity. Curves are obtained varying the filter of minimum genotype quality (GQ field in the VCF file) from 0 to 90
Fig. 3.
Fig. 3.
(A) Comparison of different tools for variants discovery from reads taken from real WGS data of the Hapmap human individual NA12878. Results are discriminated by region type (single copy or repetitive), variant type (SNVs, Indels and STRs) and genotype in the PlatGen gold standard (homozygous or heterozygous). The proportion of FPPM is used as a measure of specificity. Curves are obtained varying the filter of minimum genotype quality (GQ field in the VCF file) from 0 to 90. (B) Runtime in minutes taken by the six tools to analyze the evaluated human samples. WGS, WES1 and WES2 correspond to one WGS and two WES datasets taken from the human Hapmap individual NA12878. SynDip corresponds to the synthetic diploid individual developed by Li et al. (2018)
Fig. 4.
Fig. 4.
Comparison of different tools for population variants discovery and genotyping using reads taken from real GBS experiments on two different biparental populations of cassava (left panels) and rice (right panels). In absence of a gold standard, the total number of genotype calls is used as a measure of sensitivity and the number of errors inferred from the population structure is used as a measure of specificity. Curves are obtained varying the filter of minimum genotype quality (GQ field in the VCF file) according to the observed distribution of GQ values for each tool

Similar articles

Cited by

References

    1. Bao R. et al. (2014) Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Inform., 13, 67–82. - PMC - PubMed
    1. Benson G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580. - PMC - PubMed
    1. Crossa J. et al. (2013) Genomic prediction in maize breeding populations with genotyping-by-sequencing. G3, 3, 1903–1926. - PMC - PubMed
    1. Duitama J. et al. (2014) An integrated framework for discovery and genotyping of genomic variants from high-throughput sequencing experiments. Nucleic Acids Res., 42, e44.. - PMC - PubMed
    1. Eberle M.A. et al. (2017) A reference dataset of 5. 4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res., 27, 1–31. - PMC - PubMed

Publication types