HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree

Review

Cancer Inform. 2015 Mar 22:14:1-19. eCollection 2015.

Authors

Askar Obulkasim¹, Mark A van de Wiel²

Affiliations

¹ Department of Epidemiology and Biostatistics, Vrije Universiteit Medical Center, Amsterdam, The Netherlands.
² Department of Epidemiology and Biostatistics, Vrije Universiteit Medical Center, Amsterdam, The Netherlands; Department of Mathematics, Vrije Universiteit, Amsterdam, The Netherlands.

PMID: 25861213
PMCID: PMC4372030
DOI: 10.4137/CIN.S22080

Abstract

Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, a fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process uses, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that "haunt" high-dimensional genomics data. Since the clustering process is guided by the background information, the clusters are easy to interpret. Unlike existing packages, HCsnip places no constraint on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic and can be combined with other algorithms that operate on detected clusters.
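The difference between a fixed-height cut and snipping at variable heights can be illustrated with a small sketch. The code below is not HCsnip's API (the package is written in R); the `Node` class, the function names, and the five-leaf toy tree are all hypothetical and serve only to show that piecewise snipping reaches partitions that no single cut height can produce. The semi-supervised step, in which background information is used to choose among the candidate partitions, is not shown.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass
class Node:
    """A node of a binary HC tree (hypothetical toy representation)."""
    height: float                      # merge height; 0.0 for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # set only for leaves

    def leaves(self) -> List[str]:
        if self.label is not None:
            return [self.label]
        return self.left.leaves() + self.right.leaves()


def fixed_height_cut(node: Node, h: float) -> List[List[str]]:
    """Clusters obtained by cutting every branch at the same height h."""
    if node.label is not None or node.height <= h:
        return [node.leaves()]
    return fixed_height_cut(node.left, h) + fixed_height_cut(node.right, h)


def all_snippings(node: Node) -> List[List[List[str]]]:
    """Every partition reachable by snipping branches at possibly different heights."""
    partitions = [[node.leaves()]]     # option 1: keep this subtree as one cluster
    if node.label is None:             # option 2: split, then snip each child independently
        for lp, rp in product(all_snippings(node.left), all_snippings(node.right)):
            partitions.append(lp + rp)
    return partitions


def leaf(name: str) -> Node:
    return Node(height=0.0, label=name)


# Toy tree: ((a, b) merged at 1.0, then c at 3.0) and (d, e) merged at 2.0; root at 5.0.
ab = Node(1.0, leaf("a"), leaf("b"))
abc = Node(3.0, ab, leaf("c"))
de = Node(2.0, leaf("d"), leaf("e"))
root = Node(5.0, abc, de)

# Partitions reachable with one global cut height (one representative per height interval).
fixed = [fixed_height_cut(root, h) for h in (0.5, 1.5, 2.5, 3.5, 5.5)]
snipped = all_snippings(root)
only_by_snipping = [p for p in snipped if p not in fixed]
```

In this toy tree, `fixed` holds five partitions and `snipped` holds seven; the two extra ones, such as keeping `{a, b, c}` whole while separating `d` from `e`, require snipping the two main branches at different heights.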

Keywords: R package; data integration; hierarchical clustering; high-dimensional data; semi-supervised clustering.

Figures

**Figure 2**
Schemata of the optimal partition selection procedure.

**Figure 3**
Illustration of the indirect projection principle.

**Figure 4**
Illustration of the three scenarios considered in the Concordance index calculation.

**Figure 5**
Schemata of group assignment by the *TwoHC_assign* function for a new sample.

**Figure 6**
The HC tree that is derived from the first 30 samples’ expression profiles in the Leukemia data set. Partitions induced by the piecewise snipping are displayed at the bottom of the HC tree.

**Figure 7**
Kaplan–Meier survival analysis by clusters that were induced on the new samples.

**Figure 8**
The gene expression entropy distributions of the two clusters that were induced on the 50 new samples. The shape of each violin plot shows the distribution of the gene expression entropy values within that cluster, and the white dot in each denotes the mean entropy level. The distributions differ between cluster1 and cluster2, with a larger mean gene expression entropy value observed in cluster1.

**Figure 9**
PCA plot. The 50 samples are shown in the 2D plane spanned by their first two principal components. This type of plot is useful for visualizing cluster homogeneity: the smaller the connected area, the more similar the samples within the cluster. The scatter plots show that there is less within-cluster variation in cluster2 (right), which indicates a higher level of within-cluster homogeneity resulting from low within-cluster gene expression entropy. In cluster1 (left), the samples are more spread out, which indicates a higher degree of within-cluster variation resulting from a higher within-cluster gene expression entropy.

**Figure 10**
(A) The density of the observed relative risk ratios (rr_obs). (B) The distribution of the test statistics from the null model obtained by permutation (T_perm). **Note:** The black broken line denotes the location of the observed test statistics.

**Figure 11**
The HC tree that has been derived by using the DNA copy number profiles in the GBM data set. The optimal partition extracted by the two approaches is displayed at the bottom.

**Figure 12**
Kaplan–Meier survival analysis by clusters displayed in Figure 11. For visualization purposes, clusters in each branch are plotted separately.

**Figure 13**
Gene expression entropy distribution by clusters displayed in Figure 11. The shape of the violin plot shows the distribution of the gene expression entropy values within each cluster, and the white dot in each denotes the mean entropy level.

**Figure 14**
(A) The empirical distribution of rr_obs. (B) The distribution of the test statistics from the null model obtained by permutation T_perm. **Note:** The black broken line denotes the location of the observed test statistics.
