Review
J R Soc Interface. 2015 Nov 6;12(112):20150571.
doi: 10.1098/rsif.2015.0571.

Methods for biological data integration: perspectives and challenges

Vladimir Gligorijević et al. J R Soc Interface. 2015.

Abstract

Rapid technological advances have led to the production of different types of biological data and enabled the construction of complex networks with various types of interactions between diverse biological entities. Standard network data analysis methods were shown to be limited in dealing with such heterogeneous networked data; consequently, new methods for integrative data analysis have been proposed. These integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights. We survey recent methods for collective mining (integration) of various types of networked biological data. We compare different state-of-the-art methods for data integration and highlight their advantages and disadvantages in addressing important biological problems. We identify the important computational challenges of these methods and provide a general guideline for which methods are suited to specific biological problems or specific data types. Moreover, we propose that recent non-negative matrix factorization-based approaches may become the integration methodology of choice, as they are well suited to and accurate in dealing with heterogeneous data and have many opportunities for further development.

Keywords: biological networks; data fusion; heterogeneous data integration; non-negative matrix factorization; omics data; systems biology.

Figures

Figure 1.
A schematic illustration of the molecular information layers of a cell.
Figure 2.
(a) An illustration of a heterogeneous network composed of a gene–gene interaction network (blue), a disease–disease association network (red) and a gene–disease association network (black edges). A simple integrated network is obtained via either the gene or the disease projection method (see details in §3). The thickness of an edge in a projected network illustrates its weight. (b) An illustration of homogeneous gene–gene interaction networks. An integrated network is constructed by using a simple data merging method (see text in §3 for details).
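The projection and merging ideas in this caption can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure from §3: it assumes the standard one-mode projection of a bipartite network, where the weight of a projected edge is the number of shared neighbours, and it assumes that "simple merging" means summing adjacency matrices defined over the same gene set. The function names `project` and `merge` and the toy matrices are hypothetical.

```python
import numpy as np

def project(bipartite, onto="gene"):
    """One-mode projection of a genes x diseases association matrix.

    Assumption: the weight of a projected edge is the number of shared
    neighbours (shared diseases for a gene projection, shared genes for
    a disease projection).
    """
    b = np.asarray(bipartite, dtype=float)
    w = b @ b.T if onto == "gene" else b.T @ b
    np.fill_diagonal(w, 0.0)  # drop self-loops
    return w

def merge(adjacencies):
    """Merge homogeneous networks over the same node set by summing
    their adjacency matrices (an assumed reading of 'simple merging')."""
    return np.sum([np.asarray(a, dtype=float) for a in adjacencies], axis=0)

# Toy gene-disease associations: rows are genes, columns are diseases.
B = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1]]
gene_net = project(B, onto="gene")
disease_net = project(B, onto="disease")

# Two toy gene-gene networks that both contain the same edge.
merged = merge([[[0, 1], [1, 0]], [[0, 1], [1, 0]]])
```

In this reading, a thicker edge in the projected network corresponds to a larger entry in `gene_net` or `disease_net`.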
Figure 3.
(a) A schematic illustration of a gene regulatory network modelled by a BN. Genes are represented by nodes, whereas regulatory relations between genes are represented by directed edges. Gene g1 regulates the expression of genes g2, g3 and g4, and genes g3 and g4 regulate the expression of gene g5. Gene g1 is called a parent of g2, g3 and g4, whereas genes g2, g3 and g4 are called children of gene g1 (the same holds for the other relations). A sparse representation implies that the expression level of a gene depends only on the expression levels of its regulators (parents in the network). The JPD of the system is P(g1, g2, g3, g4, g5) = P(g1) P(g2 | g1) P(g3 | g1) P(g4 | g1) P(g5 | g3, g4). (b) An example of a naive BN with a class node y being the parent of independent nodes x1, x2, …, xN.
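The factorization implied by panel (a) can be checked numerically. The sketch below assumes binary gene states (0 = off, 1 = on) and uses made-up conditional probabilities purely for illustration; only the parent structure is taken from the caption. The names `PARENTS`, `P_ON` and `joint_probability` are hypothetical.

```python
from itertools import product

# Parent sets read off the caption: g1 -> g2, g3, g4 and g3, g4 -> g5.
PARENTS = {
    "g1": (),
    "g2": ("g1",),
    "g3": ("g1",),
    "g4": ("g1",),
    "g5": ("g3", "g4"),
}

# P(gene = 1 | parent states). These numbers are illustrative only.
P_ON = {
    "g1": {(): 0.3},
    "g2": {(0,): 0.1, (1,): 0.8},
    "g3": {(0,): 0.2, (1,): 0.7},
    "g4": {(0,): 0.4, (1,): 0.6},
    "g5": {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9},
}

def joint_probability(state):
    """Multiply one conditional term per gene, each conditioned only on
    that gene's parents."""
    p = 1.0
    for gene, parents in PARENTS.items():
        parent_states = tuple(state[q] for q in parents)
        p_on = P_ON[gene][parent_states]
        p *= p_on if state[gene] == 1 else 1.0 - p_on
    return p

# Summing over all 2^5 joint states should give 1 for a valid JPD.
total = sum(
    joint_probability(dict(zip(PARENTS, bits)))
    for bits in product((0, 1), repeat=len(PARENTS))
)
all_on = joint_probability({g: 1 for g in PARENTS})
```

Under these assumptions the sparse model needs 1 + 2 + 2 + 2 + 4 = 11 conditional probabilities, rather than the 31 free parameters of an unconstrained joint distribution over five binary genes.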
Figure 4.
(a) Heterogeneous networks of genes (PPI, GI and MI) and drugs (chemical similarities) and links between drugs and genes (DTI). Intertype relations are represented by the drug–target interaction (DTI) network, whereas intratype connections are represented by four networks: the protein–protein interaction (PPI), genetic interaction (GI) and metabolic interaction (MI) molecular networks of genes, and the chemical similarity network of drugs (see §2 for further details about these networks and their construction). (b) An illustration of a KB data integration method for drug clustering. All kernel matrices are expressed in the drug similarity feature space, based on the closeness between the drugs' targets (proteins) in each molecular network (K1, K2 and K3) and on the similarity between the drugs' chemical structures (K4). All kernel matrices are linearly combined into a resulting kernel matrix K, on which the drug clustering is performed by using KB clustering methods. (c) An illustration of an NMTF-based data integration method for drug clustering: factorization of the DTI relation matrix under the guidance of molecular and chemical connectivity constraints represented by the constraint matrices. Drugs are assigned to clusters based on the entries in the obtained G2 cluster indicator matrix.
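The tri-factorization in panel (c) can be illustrated with a stripped-down sketch. It is not the method shown in the figure: the constraint matrices are omitted entirely, and plain multiplicative updates for minimizing the Frobenius error of R ≈ G1 S G2^T are used instead. The orientation of the relation matrix (targets in rows, drugs in columns), the function name `nmtf`, the toy matrix and all parameter values are assumptions.

```python
import numpy as np

def nmtf(r, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Unconstrained non-negative matrix tri-factorization R ~ G1 S G2^T
    with multiplicative updates (no network constraints, unlike panel c)."""
    rng = np.random.default_rng(seed)
    r = np.asarray(r, dtype=float)
    n, m = r.shape
    g1 = rng.random((n, k1)) + eps   # row (target) cluster indicators
    s = rng.random((k1, k2)) + eps   # compressed relation matrix
    g2 = rng.random((m, k2)) + eps   # column (drug) cluster indicators
    errors = []
    for _ in range(n_iter):
        g1 *= (r @ g2 @ s.T) / (g1 @ s @ g2.T @ g2 @ s.T + eps)
        g2 *= (r.T @ g1 @ s) / (g2 @ s.T @ g1.T @ g1 @ s + eps)
        s *= (g1.T @ r @ g2) / (g1.T @ g1 @ s @ g2.T @ g2 + eps)
        errors.append(np.linalg.norm(r - g1 @ s @ g2.T))
    return g1, s, g2, errors

# Toy targets x drugs relation matrix with two visible blocks.
R = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]], dtype=float)
G1, S, G2, errors = nmtf(R, k1=2, k2=2)

# Each drug goes to the cluster with the largest entry in its row of G2.
drug_clusters = np.argmax(G2, axis=1)
```

A constrained variant would add penalty terms built from the PPI, GI, MI and chemical similarity networks to the objective, which changes the update rules; that part is deliberately left out here.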
