PLoS One. 2015 Feb 3;10(2):e0117380.
doi: 10.1371/journal.pone.0117380. eCollection 2015.

PON-P2: prediction method for fast and reliable identification of harmful variants

Abhishek Niroula et al. PLoS One. 2015.

Abstract

More reliable and faster prediction methods are needed to interpret the enormous amounts of data generated by sequencing and genome projects. We have developed a new computational tool, PON-P2, for classification of amino acid substitutions in human proteins. The method is a machine learning-based classifier that groups variants into pathogenic, neutral and unknown classes on the basis of a random forest probability score. PON-P2 is trained using pathogenic and neutral variants obtained from VariBench, a database of benchmark variation datasets. PON-P2 utilizes information about evolutionary conservation of sequences, physical and biochemical properties of amino acids, GO annotations and, if available, functional annotations of variation sites. Extensive feature selection was performed to identify 8 informative features among 622 features in total. PON-P2 consistently showed superior performance in comparison to existing state-of-the-art tools. In the 10-fold cross-validation test, its accuracy and MCC are 0.90 and 0.80, respectively, and in the independent test, they are 0.86 and 0.71, respectively. The coverage of PON-P2 is 61.7% in the 10-fold cross-validation and 62.1% in the test dataset. PON-P2 is a powerful tool for screening harmful variants and for ranking and prioritizing variants for experimental characterization. It is very fast, making it capable of analyzing large variant datasets. PON-P2 is freely available at http://structure.bmc.lu.se/PON-P2/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of PON-P2 architecture and implementation.
PON-P2 uses pre-calculated feature vectors and a bootstrap random forest for prediction. In addition, it makes use of functional and/or structural annotations, when available, identifies reliably predicted variations, and groups them as either pathogenic or neutral.
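
The three-class grouping can be illustrated with a minimal sketch. It is not the PON-P2 implementation: it assumes each variant comes with a list of pathogenicity probabilities from bootstrap-trained forests, and it assumes a normal-approximation interval around their mean is compared with a 0.5 cutoff. The function names `classify_variant` and `coverage`, the cutoff, and the example numbers are all hypothetical.

```python
import statistics


def classify_variant(bootstrap_probs, z=1.96, cutoff=0.5):
    """Label a variant from bootstrap probability estimates.

    Assumption (not taken from the source): the variant is called
    reliable only when the whole interval mean +/- z * spread lies on
    one side of the cutoff. z=1.96 corresponds to a two-sided 95%
    normal interval.
    """
    mean = statistics.fmean(bootstrap_probs)
    # Spread of the bootstrap estimates; needs at least two values.
    spread = statistics.stdev(bootstrap_probs)
    lower = mean - z * spread
    upper = mean + z * spread
    if lower > cutoff:
        return "pathogenic"
    if upper < cutoff:
        return "neutral"
    return "unknown"


def coverage(labels):
    """Fraction of variants that received a pathogenic or neutral label."""
    if not labels:
        return 0.0
    classified = sum(1 for label in labels if label != "unknown")
    return classified / len(labels)


if __name__ == "__main__":
    # Invented probabilities for three hypothetical variants.
    variants = [
        [0.90, 0.92, 0.88, 0.91],
        [0.10, 0.12, 0.08],
        [0.40, 0.60, 0.50, 0.55, 0.45],
    ]
    labels = [classify_variant(probs) for probs in variants]
    print(labels, coverage(labels))
```

Under these assumptions, coverage is the share of variants that do not fall into the unknown class, which is the sense in which the abstract reports coverage figures.
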
Fig 2. Distribution of variations at functional and structural sites.
Pathogenic variations are represented by white bars and neutral variations by grey bars. The functional and structural annotation sites were obtained from Swiss-Prot and PDB. Binding, binding site; Metal, metal binding site; Active, active site; IM, intramembrane region; Site, catalytic, co-factor, anti-codon, regulatory or other essential site surrounding ligands in the structure.
Fig 3. Performance cuboids for PON-P2 and other methods.
Six performance measures, PPV, NPV, sensitivity, specificity, acc (accuracy) and normalized MCC (nMCC = MCC×0.5+0.5), are represented for each method by the distances of the six faces of the cuboid from the origin. (A) Performance cuboids for different feature subsets used in PON-P2. Seq prof, proportions of reference and altered amino acids and number of sequences in the multiple sequence alignment; Sel pres + Seq prof, evolutionary features; Sel pres + Seq prof + GO, evolutionary features and GO annotations. (B) Performance cuboids for PolyPhen-2, PON-P, PON-P2 and SIFT for all variations predicted by each method on the independent test dataset. The performance scores for PON-P and PON-P2 are for predictions at the 0.95 confidence level. OPMs for PolyPhen-2, PON-P, PON-P2 and SIFT are 0.41, 0.61, 0.63 and 0.40, respectively. (C) Performance cuboids for predictors using the c95-test set. OPMs for PolyPhen-2, PON-P, PON-P2 and SIFT are 0.47, 0.61, 0.63 and 0.48, respectively.
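
The six measures named in the caption can be computed from a binary confusion matrix with their standard definitions, as sketched below. The function name `performance_measures` and the example counts are invented for illustration and are not data from the paper; pathogenic is treated as the positive class, which is an assumption. OPM is left out because the text shown here does not define it.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Return the six cuboid measures plus raw MCC for one confusion matrix.

    tp/fn count pathogenic variants predicted correctly/incorrectly,
    tn/fp count neutral variants predicted correctly/incorrectly
    (treating pathogenic as positive is an assumption of this sketch).
    """
    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    mcc = ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": mcc,
        # Rescales MCC from [-1, 1] to [0, 1], as given in the caption.
        "nMCC": mcc * 0.5 + 0.5,
    }


if __name__ == "__main__":
    # Invented counts, chosen only to make the arithmetic easy to follow.
    for name, value in performance_measures(tp=45, tn=45, fp=5, fn=5).items():
        print(f"{name}: {value:.2f}")
```

Rescaling MCC to nMCC puts all six measures on the same 0 to 1 range, so each can serve as a face distance of the cuboid.
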

Grants and funding

MV received funding from Faculty of Medicine, Lund University (http://www.med.lu.se/english). MV received funding from Vetenskapsrådet (http://www.vr.se/inenglish.4.12fff4451215cbd83e4800015152.html). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.