Nat Commun. 2019 Mar 1;10(1):998.
doi: 10.1038/s41467-019-09025-z.

A multi-task convolutional deep neural network for variant calling in single molecule sequencing

Ruibang Luo et al. Nat Commun.

Abstract

The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5-15%. To address this, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele, and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves F1-scores of 99.67%, 95.78%, and 90.53% on 1KP common variants, and 98.65%, 92.57%, and 87.26% for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows that Clairvoyante is sample agnostic and finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize, and visualize the model.
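The F1-scores quoted in the abstract summarize precision and recall in a single number. The abstract does not say how the counts were obtained, so the following is only a minimal sketch of the standard F1 definition; the function name and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score: the harmonic mean of precision and recall.

    Standard textbook definition; the count arguments are illustrative
    and not drawn from the Clairvoyante evaluation.
    """
    if true_positives == 0:
        # With no true positives, precision and recall are both zero
        # (or undefined), so the score is reported as 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct calls, 10 spurious calls, 10 missed variants.
example = f1_score(90, 10, 10)
print(f"F1 = {example:.2%}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than the arithmetic mean of the two.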

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
IGV screen captures of the selected variants. a A heterozygous SNP from T to G at chromosome 11, position 98,146,409, called only in the PacBio and ONT data, b a heterozygous deletion AA at chromosome 20, position 3,200,689, not called in any of the three technologies, c a heterozygous insertion ATCCTTCCT at chromosome 1, position 184,999,851, called only in the Illumina data, and d a heterozygous deletion G at chromosome 1, position 5,072,694, called in all three technologies. The tracks from top to bottom show the alignments of the Illumina, PacBio, and ONT reads from HG001 to the human reference GRCh37.
Fig. 2
A Venn diagram showing the number of known variants left undetected by each sequencing technology or combination of technologies.
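The region counts in such a Venn diagram can be derived from one set of missed variants per technology. The sketch below is only an illustration of that bookkeeping, not the authors' procedure; the function name `venn_region_counts` and the variant identifiers in the usage lines are hypothetical.

```python
from itertools import combinations


def venn_region_counts(missed: dict[str, set]) -> dict[frozenset, int]:
    """Count the items that fall in each exclusive region of a Venn diagram.

    `missed` maps a technology name to the set of variants it failed to
    detect. A region is keyed by the frozenset of technologies that all
    missed the variant, while every other technology detected it.
    """
    names = list(missed)
    counts = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Variants missed by every technology in this combination ...
            inside = set.intersection(*(missed[n] for n in combo))
            # ... minus those also missed by any technology outside it.
            outside = set().union(*(missed[n] for n in names if n not in combo))
            counts[frozenset(combo)] = len(inside - outside)
    return counts


# Hypothetical variant identifiers, for illustration only.
missed_by = {
    "Illumina": {"v1", "v2", "v3"},
    "PacBio": {"v2", "v4"},
    "ONT": {"v2", "v3", "v5"},
}
regions = venn_region_counts(missed_by)
for technologies, count in regions.items():
    print(sorted(technologies), count)
```

Since the regions are mutually exclusive, their counts sum to the number of distinct variants missed by at least one technology.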
Fig. 3
Clairvoyante network architecture and layer details. The descriptions under each layer include (1) the layer's function; (2) the activation function used; (3) the dimensions of the layer in parentheses (input layer: height × width × arrays, convolution layer: height × width × filters, fully connected layer: nodes); and (4) the kernel size in brackets (height × width).
Fig. 4
Selected illustrations of how Clairvoyante represents the three common types of small variants, and a nonvariant. The figure includes: (top left) a C > G SNP, (top right) a 9-bp insertion, (bottom left) a 4-bp deletion, and (bottom right) a nonvariant with a reference allele. The color intensity represents the strength of a given variant signal. The SNP, insertion, and deletion examples are ideal, with almost zero background noise. The nonvariant example illustrates what the background noise looks like when it is not mingled with any variant signal.
