Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Feb 15;12 Suppl 1(Suppl 1):S44.
doi: 10.1186/1471-2105-12-S1-S44.

Resolving the structure of interactomes with hierarchical agglomerative clustering

Affiliations

Resolving the structure of interactomes with hierarchical agglomerative clustering

Yongjin Park et al. BMC Bioinformatics. .

Abstract

Background: Graphs provide a natural framework for visualizing and analyzing networks of many types, including biological networks. Network clustering is a valuable approach for summarizing the structure in large networks, for predicting unobserved interactions, and for predicting functional annotations. Many current clustering algorithms suffer from a common set of limitations: poor resolution of top-level clusters; over-splitting of bottom-level clusters; requirements to pre-define the number of clusters prior to analysis; and an inability to jointly cluster over multiple interaction types.

Results: A new algorithm, Hierarchical Agglomerative Clustering (HAC), is developed for fast clustering of heterogeneous interaction networks. This algorithm uses maximum likelihood to drive the inference of a hierarchical stochastic block model for network structure. Bayesian model selection provides a principled method for collapsing the fine-structure within the smallest groups, and for identifying the top-level groups within a network. Model scores are additive over independent interaction types, providing a direct route for simultaneous analysis of multiple interaction types. In addition to inferring network structure, this algorithm generates link predictions that with cross-validation provide a quantitative assessment of performance for real-world examples.

Conclusions: When applied to genome-scale data sets representing several organisms and interaction types, HAC provides the overall best performance in link prediction when compared with other clustering methods and with model-free graph diffusion kernels. Investigation of performance on genome-scale yeast protein interactions reveals roughly 100 top-level clusters, with a long-tailed distribution of cluster sizes. These are in turn partitioned into 1000 fine-level clusters containing 5 proteins on average, again with a long-tailed size distribution. Top-level clusters correspond to broad biological processes, whereas fine-level clusters correspond to discrete complexes. Surprisingly, link prediction based on joint clustering of physical and genetic interactions performs worse than predictions based on individual data sets, suggesting a lack of synergy in current high-throughput data.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Link prediction performance of Yeast data sets. A: Precision Recall (PR) curve of 80/20 cross-validation experiment (CV) in YEAST-PPI dataset (10% missing links); B: F1 scores over different fractions of missing links in YEAST-PPI dataset from 1.5% to 90%; C: Area under ROC curve (AUC) scores over different fractions of missing links in YEAST-PPI dataset; D: PR curve of a 80/20 CV in YEAST-GEN dataset; E: F1 scores in YEAST-GEN dataset; F: AUC scores in YEAST-GEN dataset.
Figure 2
Figure 2
Cluster size distribution. Black closed circles: Counts of top-level clusters; Black solid line: Maximum likelihood power-law fit; Red open squares: Counts of low-level clusters; Red dashed line: Maximum likelihood power-law fit; A, B, C: Each panel respectively corresponds to the result of YEAST-PPI, YEAST-GEN, and YEAST-SGA datasets.
Figure 3
Figure 3
Interaction enrichment within clusters. Black solid lines: Edge-density distribution of the top-level clusters; Red dashed lines: Edge-density distribution of the bottom-level clusters. A, B, C: Each panel respectively corresponds to the result of YEAST-PPI, YEAST-GEN, and YEAST-SGA datasets.
Figure 4
Figure 4
Protein transport complex. Bottom level clusters: Different shapes and colors in the topmost and leftmost panel indicate different bottom-level clusters; Other panels: Each box indicates one GO keyword and its enrichment within the subnetwork, and vertices belonging to this GO category are highlighted by non-gray colors.

Similar articles

Cited by

References

    1. Bader GD, Hogue CWV. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4:2. doi: 10.1186/1471-2105-4-2. - DOI - PMC - PubMed
    1. Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA. 2003;100(21):12123–12128. doi: 10.1073/pnas.2032324100. - DOI - PMC - PubMed
    1. Zachary WW. An Information Flow Model for Conflict and Fission in Small Groups. Journal of Anthropological Research. 1977;33(4):452–473.
    1. Huang H, Bader JS. Precision and recall estimates for two-hybrid screens. Bioinformatics. 2009;25(3):372–8. doi: 10.1093/bioinformatics/btn640. - DOI - PMC - PubMed
    1. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot A, Vazquez A, Murray RR, Simon C, Tardivo L, Tam S, Svrzikapa N, Fan C, de Smet AS, Motyl A, Hudson ME, Park J, Xin X, Cusick ME, Moore T, Boone C, Snyder M, Roth FP, Barabasi AL, Tavernier J, Hill DE, Vidal M. High-Quality Binary Protein Interaction Map of the Yeast Interactome Network. Science. 2008;322(5898):104–110. doi: 10.1126/science.1158684. - DOI - PMC - PubMed

Publication types

LinkOut - more resources