BMC Bioinformatics. 2014 Jun 23;15:213.
doi: 10.1186/1471-2105-15-213.

Homology-based prediction of interactions between proteins using Averaged One-Dependence Estimators

Yoichi Murakami et al. BMC Bioinformatics. 2014.

Abstract

Background: Identification of protein-protein interactions (PPIs) is essential for a better understanding of biological processes, pathways and functions. However, experimental identification of the complete set of PPIs in a cell/organism ("an interactome") is still a difficult task. To circumvent limitations of current high-throughput experimental techniques, it is necessary to develop high-performance computational methods for predicting PPIs.

Results: In this article, we propose a new computational method to predict interaction between a given pair of protein sequences using features derived from known homologous PPIs. The proposed method is capable of predicting interaction between two proteins (of unknown structure) using Averaged One-Dependence Estimators (AODE) and three features calculated for the protein pair: (a) sequence similarities to a known interacting protein pair (FSeq), (b) statistical propensities of domain pairs observed in interacting proteins (FDom) and (c) a sum of edge weights along the shortest path between homologous proteins in a PPI network (FNet). Feature vectors were defined to lie in a half-space of the symmetrical high-dimensional feature space to make them independent of the protein order. The predictability of the method was assessed by a 10-fold cross validation on a recently created human PPI dataset with randomly sampled negative data, and the best model achieved an Area Under the Curve of 0.79 (pAUC0.5% = 0.16). In addition, the AODE trained on all three features (named PSOPIA) showed better prediction performance on a separate independent data set than a recently reported homology-based method.

Conclusions: Our results suggest that FNet, a feature representing proximity in a known PPI network between two proteins that are homologous to a target protein pair, contributes to the prediction of whether the target proteins interact or not. PSOPIA will help identify novel PPIs and estimate complete PPI networks. The method proposed in this article is freely available on the web at http://mizuguchilab.org/PSOPIA.


Figures

Figure 1
Three homology-based features used for training AODEs. (a) A feature set of sequence similarities to known interacting proteins (FSeq = {e-valueA, mincovA, e-valueB, mincovB}). For a target pair (SA, SB), the interacting pair (TA, TB) with the smallest value of √(e-valueA² + e-valueB²) is selected, where e-valuex is the BLAST e-value between Sx and Tx and x is either A or B. The minimum coverage (mincov) for Sx and Tx is then calculated as (the number of positive matches)/(the length of the longer sequence). If no known homologous interacting pair is found, an e-value of 10² and a mincov of 0 are assigned to FSeq. (b) Statistical propensities of domain pairs observed in interacting proteins (FDom). The sum of the interaction propensities for all possible Pfam domain pairs (dA, dB) appearing in Sx and Tx is calculated (see the text for details). If no Pfam domain is found, an FDom value of 0 is given to the target pair. (c) The sum of edge weights along the shortest path between homologous proteins (PA, PB) in the PPI network (FNet). In this study, the default edge weight was set to 1.0. If no path is found, an FNet of -1 is given to the target pair.
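The FSeq selection rule in panel (a) can be illustrated with a minimal Python sketch. The `Hit` record, the `f_seq` function and the way known interacting pairs are stored are hypothetical names introduced here for illustration; they are not taken from PSOPIA. The sketch also assumes that the default e-value for a pair without homologs is 10^2, and that BLAST hits have already been parsed into simple records.

```python
import math
from typing import NamedTuple

class Hit(NamedTuple):
    """One BLAST hit of a target sequence S_x against a known protein T_x (hypothetical record)."""
    template: str       # identifier of the known protein T_x
    evalue: float       # BLAST e-value between S_x and T_x
    positives: int      # number of positive matches in the alignment
    query_len: int      # length of S_x
    template_len: int   # length of T_x

# Assumed default when no homologous interacting pair exists: e-value 10^2, mincov 0.
NO_HOMOLOG = (1e2, 0.0, 1e2, 0.0)

def mincov(hit):
    """Positive matches divided by the length of the longer sequence."""
    return hit.positives / max(hit.query_len, hit.template_len)

def f_seq(hits_a, hits_b, known_pairs):
    """Return (e-valueA, mincovA, e-valueB, mincovB) for the best known interacting pair."""
    best, best_score = None, math.inf
    for ha in hits_a:
        for hb in hits_b:
            pair = (ha.template, hb.template)
            if pair not in known_pairs and pair[::-1] not in known_pairs:
                continue  # (T_A, T_B) is not a known interaction
            score = math.hypot(ha.evalue, hb.evalue)  # sqrt(eA^2 + eB^2)
            if score < best_score:
                best_score = score
                best = (ha.evalue, mincov(ha), hb.evalue, mincov(hb))
    return best if best is not None else NO_HOMOLOG

# Toy data, invented for illustration only.
hits_a = [Hit("T1", 1e-30, 80, 100, 120), Hit("T3", 1e-5, 50, 100, 100)]
hits_b = [Hit("T2", 1e-20, 90, 150, 100), Hit("T4", 1e-50, 100, 150, 150)]
known = {("T1", "T2"), ("T4", "T3")}
fv = f_seq(hits_a, hits_b, known)
```

With the toy data, the pair (T1, T2) is chosen because its combined e-value is smaller than that of (T3, T4). A target pair without any homologous interacting pair falls back to the default values.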
Figure 2
Selection of feature vectors in the half-space defined by X1. To define a FV uniquely, an arbitrary pair of proteins whose FSeq and FSeq′ values correspond to points X1 and X2 in the high-dimensional feature space is first chosen. For a given pair of proteins, its FV, FSeq′′, is defined by choosing point P1 such that cos θ = (rn · V)/(|rn||V|) > 0, where rn is a reference normal vector from the midpoint between X1 and X2 (rp) to X1, V is a vector from rp to P1, rn · V is the inner product of rn and V, and |rn| and |V| are the lengths of rn and V, respectively. If cos θ = 0, one of the two possibilities is arbitrarily selected.
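The half-space rule can be sketched in a few lines of Python. The sketch assumes a four-element FSeq vector laid out as (e-valueA, mincovA, e-valueB, mincovB), so that exchanging the two proteins swaps the first and second halves of the vector. The function names `swap_order`, `cos_theta` and `canonical_fv` are hypothetical, and the tie at cos θ = 0 is resolved by keeping the given order, which is only one of the arbitrary choices the caption allows.

```python
import math

def swap_order(fv):
    """Feature vector of the same pair with proteins A and B exchanged."""
    return fv[2:] + fv[:2]

def cos_theta(p, x1, x2):
    """Cosine of the angle between rn (midpoint -> X1) and V (midpoint -> P)."""
    rp = [(a + b) / 2 for a, b in zip(x1, x2)]   # midpoint between X1 and X2
    rn = [a - m for a, m in zip(x1, rp)]         # reference normal vector
    v = [a - m for a, m in zip(p, rp)]           # vector from midpoint to P
    dot = sum(a * b for a, b in zip(rn, v))
    norm = math.sqrt(sum(a * a for a in rn)) * math.sqrt(sum(a * a for a in v))
    return dot / norm if norm > 0 else 0.0

def canonical_fv(fv, x1):
    """Pick the ordering of fv that lies in the half-space containing X1."""
    x2 = swap_order(x1)
    if cos_theta(fv, x1, x2) >= 0:   # tie (cos = 0): keep the given order
        return fv
    return swap_order(fv)

# Toy values, invented for illustration only.
x1 = (1e-10, 0.9, 1e-3, 0.5)          # reference pair in one order
fv = (1e-3, 0.4, 1e-20, 0.8)          # target pair in an arbitrary order
chosen = canonical_fv(fv, x1)
```

Because swapping the protein order reflects V through the hyperplane whose normal is rn, the sign of cos θ flips, so both orderings of a pair map to the same feature vector whenever cos θ is nonzero.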
Figure 3
A probabilistic graphical model of the AODE. The AODE weakens the independence assumption of the naive Bayes classifier (NBC), whose dependence structure is shown in (a), by allowing a one-dependence as shown in (b). P̂(c) is the probability of each class label c ∈ {interaction, non-interaction}; P̂(fi | c), such as P̂(FNet | c), is the conditional probability of the i-th feature (fi) given c; P̂(c, fi), such as P̂(c, FNet), is the joint probability of c and fi; and P̂(fj | c, fi), such as P̂(FNet | c, FDom), is the conditional probability of the j-th feature (fj) given c and fi.
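As an illustration of the one-dependence structure in (b), the following minimal Python sketch scores a class c by summing, over each feature fi acting as the parent, the product P̂(c, fi) · Π P̂(fj | c, fi) over the other features fj. It is a generic textbook-style AODE on already discretized features, not the authors' implementation. The class name `TinyAODE`, the Laplace smoothing and the omission of a minimum-frequency limit on parent features are all assumptions made for brevity.

```python
from collections import Counter

class TinyAODE:
    """Minimal AODE for discrete features (illustrative sketch only)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.n = len(y)
        # Distinct values seen per feature, used for Laplace smoothing.
        self.values = [sorted({row[i] for row in X}) for i in range(len(X[0]))]
        self.joint = Counter()    # counts of (c, i, f_i)
        self.triple = Counter()   # counts of (c, i, f_i, j, f_j)
        for row, c in zip(X, y):
            for i, vi in enumerate(row):
                self.joint[(c, i, vi)] += 1
                for j, vj in enumerate(row):
                    self.triple[(c, i, vi, j, vj)] += 1
        return self

    def predict_proba(self, row):
        scores = {}
        for c in self.classes:
            total = 0.0
            for i, vi in enumerate(row):
                n_ci = self.joint[(c, i, vi)]
                # Smoothed estimate of the joint probability P(c, f_i).
                p = (n_ci + 1) / (self.n + len(self.classes) * len(self.values[i]))
                for j, vj in enumerate(row):
                    if j == i:
                        continue
                    # Smoothed estimate of P(f_j | c, f_i).
                    p *= (self.triple[(c, i, vi, j, vj)] + 1) / (n_ci + len(self.values[j]))
                total += p
            scores[c] = total
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

# Toy discretized feature vectors (invented): class 1 = interaction, 0 = non-interaction.
X = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
y = [0, 0, 1, 1]
model = TinyAODE().fit(X, y)
proba = model.predict_proba((1, 1, 1))
```

Averaging over the parent features divides every class score by the same constant, so the sketch skips it and normalizes the summed scores instead.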
Figure 4
A schematic diagram to represent the procedure for the proposed method. (i) Three different homology-based features (FSeq, FDom and FNet) for a protein sequence pair are calculated and converted to a FV. (ii) Only the FV that lies in the half-space defined by a reference feature vector is selected (see Figure 2). (iii) A feature value in the i-th element of the selected FV is discretized using the entropy-based discretization method [36]. (iv) The probability of being in the positive class for the target pair is estimated using the AODE, and if the probability is greater than or equal to a threshold, the target pair is predicted to be interacting, otherwise non-interacting.
Figure 5
The percentage of protein pairs with different SPWs in the PPI network generated from Dset1. An SPW of 0 means a known interaction of a homologous protein with itself and that of 1.0 means a known interaction between a homologous protein pair. Also, a homologous protein pair indirectly linked by n proteins has an SPW of n + 1.0. NP (No Path) indicates that there was no path between two homologous proteins.
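The shortest path weight (SPW) described here can be computed with Dijkstra's algorithm. The sketch below is a minimal Python version under stated assumptions: the network is an undirected edge list with a default weight of 1.0, the helper names `build_graph` and `spw` are hypothetical, identical source and target proteins are given an SPW of 0 without checking that a self-interaction is recorded, and -1 stands for NP (no path).

```python
import heapq
import math

def build_graph(edges, default_weight=1.0):
    """Undirected weighted graph as {node: {neighbor: weight}}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, {})[v] = default_weight
        graph.setdefault(v, {})[u] = default_weight
    return graph

def spw(graph, source, target, no_path=-1.0):
    """Sum of edge weights along the shortest path, or no_path if unreachable."""
    if source not in graph or target not in graph:
        return no_path
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return no_path

# Toy network, invented for illustration only.
g = build_graph([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
direct = spw(g, "P1", "P2")      # known interaction between the homologs
indirect = spw(g, "P1", "P3")    # linked through one intermediate protein
missing = spw(g, "P1", "P4")     # no path (NP)
```

With unit edge weights, a pair linked through n intermediate proteins receives an SPW of n + 1.0, matching the caption.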
Figure 6
ROC curves for the AODEs trained with all three features in 10-fold CV. The ROC curves for the AODE-VII are shown, as well as the averaged ROC curve (in red). This model achieved an AUC of 0.79 and a pAUC0.5% of 0.16% on average, in each round of 10-fold CV on each of the three independent data sets in Dset1.
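For readers unfamiliar with the two measures, the following Python sketch computes the area under a ROC curve and a partial area up to a false positive rate cutoff using the trapezoidal rule. It assumes that pAUC0.5% denotes the area under the curve for false positive rates up to 0.5%; the page does not state how that value is scaled, so the sketch returns the raw, unnormalized area. The function names `roc_points` and `partial_auc` are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) from prediction scores and binary labels (1 = positive)."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all predictions sharing this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy scores and labels, invented for illustration only.
pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
auc = partial_auc(pts)                   # full AUC
pauc = partial_auc(pts, max_fpr=0.25)    # area up to FPR = 0.25
```

A call such as `partial_auc(pts, max_fpr=0.005)` would give the raw area for a 0.5% cutoff under the assumption stated above.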


References

    1. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002;417(6887):399–403.
    2. Han JD, Dupuy D, Bertin N, Cusick ME, Vidal M. Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol. 2005;23(7):839–844.
    3. Bock JR, Gough DA. Predicting protein–protein interactions from primary structure. Bioinformatics. 2001;17(5):455–460.
    4. Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol. 2001;311(4):681–692.
    5. Gomez SM, Noble WS, Rzhetsky A. Learning to predict protein-protein interactions from protein sequences. Bioinformatics. 2003;19(15):1875–1881.