Bioinformatics. 2012 Apr 1;28(7):907-13.
doi: 10.1093/bioinformatics/bts053. Epub 2012 Jan 27.

JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data

Andrew Roth et al. Bioinformatics.

Abstract

Motivation: Identification of somatic single nucleotide variants (SNVs) in tumour genomes is a necessary step in defining the mutational landscapes of cancers. Experimental designs for genome-wide ascertainment of somatic mutations now routinely include next-generation sequencing (NGS) of tumour DNA and matched constitutional DNA from the same individual. This allows investigators to control for germline polymorphisms and distinguish somatic mutations that are unique to the tumour, thus reducing the burden of labour-intensive and expensive downstream experiments needed to verify initial predictions. In order to make full use of such paired datasets, computational tools for simultaneous analysis of tumour-normal paired sequence data are required, but are currently under-developed and under-represented in the bioinformatics literature.

Results: In this contribution, we introduce two novel probabilistic graphical models called JointSNVMix1 and JointSNVMix2 for jointly analysing paired tumour-normal digital allelic count data from NGS experiments. In contrast to independent analysis of the tumour and normal data, our method allows statistical strength to be borrowed across the samples and therefore amplifies the statistical power to identify and distinguish both germline and somatic events in a unified probabilistic framework.

Availability: The JointSNVMix models and four other models discussed in the article are part of the JointSNVMix software package available for download at http://compbio.bccrc.ca

Contact: sshah@bccrc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Hypothetical example of the JointSNVMix analysis process. Reads are first aligned to the reference genome (green). Next, the allelic counts, which are the number of matches and the depth of reads at each position, are tabulated. Allelic count information can then be used to identify germline (blue) and somatic (red) positions. At the bottom of the figure, we show the hypothetical probabilities of the nine joint genotypes based on the count data for the somatic position (AA, AB).
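The step from allelic counts to probabilities over the nine joint genotypes can be illustrated with a minimal sketch. It is not the JointSNVMix implementation: the binomial likelihood, the per-genotype match probabilities in MATCH_PROB, the uniform prior and the function names are all assumptions made for illustration.

```python
from itertools import product
from math import comb

GENOTYPES = ("AA", "AB", "BB")

# Assumed (illustrative) probability that a read matches the reference
# allele A under each genotype; these are not fitted values from the paper.
MATCH_PROB = {"AA": 0.99, "AB": 0.5, "BB": 0.01}


def binom_pmf(k, n, p):
    """Probability of k reference matches among n reads."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def joint_posterior(normal_matches, normal_depth, tumour_matches, tumour_depth):
    """Posterior over the nine (normal, tumour) genotype pairs, uniform prior."""
    weights = {}
    for g_normal, g_tumour in product(GENOTYPES, repeat=2):
        weights[(g_normal, g_tumour)] = (
            binom_pmf(normal_matches, normal_depth, MATCH_PROB[g_normal])
            * binom_pmf(tumour_matches, tumour_depth, MATCH_PROB[g_tumour])
        )
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}


# Hypothetical counts: every normal read matches, half the tumour reads match.
posterior = joint_posterior(30, 30, 14, 28)
best = max(posterior, key=posterior.get)
```

With these hypothetical counts the most probable pair is (AA, AB), the somatic case named in the caption. Under a uniform prior the joint posterior factorises into two independent per-sample posteriors; a non-uniform prior over the nine pairs is where a joint model would depart from independent analysis.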
Fig. 2.
Probabilistic graphical models representing (a) JointSNVMix1 and (b) JointSNVMix2. Shaded nodes represent observed or fixed values, while the values of unshaded nodes are learned using EM. Only the distributions for the normal are shown below; the tumour distributions are the same. We have defined f(q|a, z) = z[qa + (1 − q)(1 − a)] + 0.5(1 − z) and g(r|z) = zr + (1 − z)(1 − r). A description of all random variables is given in Table 2.
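The two functions defined in the caption can be written out directly. The sketch below transcribes only the formulas; Table 2 is not reproduced here, so the code treats q, a, z and r as plain numbers and does not assign them any meaning.

```python
def f(q, a, z):
    """f(q | a, z) = z[qa + (1 - q)(1 - a)] + 0.5(1 - z), as in the caption."""
    return z * (q * a + (1 - q) * (1 - a)) + 0.5 * (1 - z)


def g(r, z):
    """g(r | z) = zr + (1 - z)(1 - r), as in the caption."""
    return z * r + (1 - z) * (1 - r)


# Corner cases for binary a and z (illustrative inputs only).
f_a1_z1 = f(0.9, 1, 1)  # reduces to q
f_a0_z1 = f(0.9, 0, 1)  # reduces to 1 - q
f_z0 = f(0.9, 1, 0)     # reduces to 0.5, independent of q and a
g_z1 = g(0.8, 1)        # reduces to r
g_z0 = g(0.8, 0)        # reduces to 1 - r
```

When z = 0 the first function returns 0.5 whatever q and a are, so that term carries no information about a.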
Fig. 3.
Concordance analysis of the 12 DLBCL datasets. The Somatic column represents concordance with the merged COSMIC and ground truth set. The Germline column represents concordance with the 1000 Genomes positions with the COSMIC positions removed. The horizontal axis shows the number of somatic predictions made and the vertical axis shows the fraction of those predictions found to be in the respective set. Lines are drawn by computing concordance as the threshold for classification is lowered. Lines do not always start from the left side because multiple positions may have ℙ(Somatic)=1. Circles at the start of lines indicate these positions; these points are also labelled with the number of somatic predictions (in 1000s) and the concordance.
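One way to compute such a concordance curve is sketched below. The function name, the inputs and the handling of ties are assumptions for illustration, not the authors' procedure: positions are ranked by ℙ(Somatic), and each distinct threshold yields one point (number of predictions, fraction found in the reference set).

```python
def concordance_curve(probabilities, in_reference_set):
    """Return (n_predictions, concordance) points as the threshold is lowered.

    probabilities: P(Somatic) for each position.
    in_reference_set: True where the position is in the reference set.
    Positions sharing the same probability are grouped into a single point.
    """
    ranked = sorted(
        zip(probabilities, in_reference_set),
        key=lambda pair: pair[0],
        reverse=True,
    )
    curve = []
    hits = 0
    for count, (prob, hit) in enumerate(ranked, start=1):
        hits += int(hit)
        # Emit a point only once every position tied at this threshold is included.
        last_at_threshold = count == len(ranked) or ranked[count][0] < prob
        if last_at_threshold:
            curve.append((count, hits / count))
    return curve


# Hypothetical data: two positions tied at P(Somatic) = 1.
curve = concordance_curve([1.0, 1.0, 0.9, 0.5], [True, False, True, False])
```

In this toy input the first point sits at two predictions, since both positions with probability 1 enter the prediction set at the same threshold.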
