Genes (Basel). 2019 Nov 12;10(11):924.
doi: 10.3390/genes10110924.

Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model

Zhan-Heng Chen et al. Genes (Basel).

Abstract

Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experimental methods for predicting SIPs have been developed in the past few years. However, these methods are costly, time-consuming, and inefficient, which often limits their use for predicting SIPs. The development of computational methods has therefore become necessary. In this paper, we propose for the first time a novel deep learning model that incorporates a natural language processing (NLP) method to predict potential SIPs from protein sequence information. More specifically, the protein sequence is de novo assembled by k-mers. Then, we obtain the global vectors representation of each protein sequence using the NLP technique. Finally, based on known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments were performed on yeast and human datasets, yielding accuracy rates of 91.45% and 93.12%, respectively. Our evaluations show that the use of amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work may have potential applications in various biological classification problems.
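The k-mer step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the overlapping stride of 1, the co-occurrence window size, and the example sequence are all assumptions made for illustration. The sketch splits a sequence into k-mer "words" and counts how often pairs of words occur near each other, which is the kind of co-occurrence statistic that global vectors (GloVe-style) embeddings are fitted to.

```python
from collections import Counter


def kmer_tokens(sequence, k=3, stride=1):
    """Split a protein sequence into k-mers.

    stride=1 yields overlapping k-mers (an assumption; the abstract
    does not state whether the k-mers overlap).
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences of tokens within `window` positions.

    Plain integer counts are used here for simplicity; the window size
    is a hypothetical choice.
    """
    counts = Counter()
    for i, center in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[(center, tokens[j])] += 1
            counts[(tokens[j], center)] += 1
    return counts


# Hypothetical example sequence, not taken from the paper's datasets.
tokens = kmer_tokens("MKVLAAG", k=3)
pairs = cooccurrence_counts(tokens, window=2)
```

A co-occurrence table such as `pairs`, accumulated over all sequences in a dataset, would then be passed to an embedding method to obtain one vector per k-mer; how the per-k-mer vectors are combined into a per-protein representation is not specified in the abstract.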

Keywords: de novo protein sequence; global vector representation; multi-grained cascade forest; self-interacting proteins.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. De novo assembled protein sequences by 3-mer.
Figure 2. Process of multi-grained scanning.
Figure 3. Cascade forest model.
Figure 4. The receiver operating characteristic (ROC) curve of the proposed model on the yeast dataset.
Figure 5. The ROC curve of the proposed model on the human dataset.
