simplifyEnrichment: A Bioconductor Package for Clustering and Visualizing Functional Enrichment Results

Genomics Proteomics Bioinformatics. 2023 Feb;21(1):190-202. doi: 10.1016/j.gpb.2022.04.008. Epub 2022 Jun 6.

Zuguang Gu¹, Daniel Hübschmann²

Affiliations

¹ Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT) Heidelberg, D-69120 Heidelberg, Germany. Electronic address: z.gu@dkfz.de.
² Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT) Heidelberg, D-69120 Heidelberg, Germany; Heidelberg Institute of Stem Cell Technology and Experimental Medicine (HI-STEM), D-69120 Heidelberg, Germany; German Cancer Consortium (DKTK), D-69120 Heidelberg, Germany; Department of Pediatric Immunology, Hematology and Oncology, University Hospital Heidelberg, D-69120 Heidelberg, Germany. Electronic address: d.huebschmann@dkfz.de.

PMID: 35680096
PMCID: PMC10373083
DOI: 10.1016/j.gpb.2022.04.008

Zuguang Gu et al. Genomics Proteomics Bioinformatics. 2023 Feb.

Abstract

Functional enrichment analysis or gene set enrichment analysis is a basic bioinformatics method that evaluates the biological importance of a list of genes of interest. However, it may produce a long list of significant terms with highly redundant information that is difficult to summarize. Current tools to simplify enrichment results by clustering them into groups either still produce redundancy between clusters or do not retain consistent term similarities within clusters. We propose a new method named binary cut for clustering similarity matrices of functional terms. Through comprehensive benchmarks on both simulated and real-world datasets, we demonstrated that binary cut could efficiently cluster functional terms into groups where terms showed consistent similarities within groups and were mutually exclusive between groups. We compared binary cut clustering on the similarity matrices obtained from different similarity measures and found that semantic similarity worked well with binary cut, while similarity matrices based on gene overlap showed less consistent patterns. We implemented the binary cut algorithm in the R package simplifyEnrichment, which additionally provides functionalities for visualizing, summarizing, and comparing the clustering. The simplifyEnrichment package and the documentation are available at https://bioconductor.org/packages/simplifyEnrichment/.

Keywords: Clustering; Functional enrichment; R/Bioconductor; Simplify enrichment; Software; Visualization.
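
The idea of recursively cutting a similarity matrix into two groups, as described in the abstract, can be illustrated with a small language-independent sketch (written here in Python, not taken from the R package). The abstract does not spell out the scoring formula, the two-group partitioning step, or the stopping threshold, so all three are assumptions chosen for illustration: the score is the share of median similarity that falls in the two diagonal blocks (M₁₁ and M₂₂ in Figure 1), the partitioning is a brute-force two-medoid split, and the cutoff of 0.85 is a hypothetical default. The function names `block_score`, `two_medoid_split`, and `binary_cut` are likewise hypothetical.

```python
from statistics import median


def block_score(sim, g1, g2):
    """Score a two-way split from the medians of the blocks M11, M12, M21, M22."""
    def block_median(rows, cols):
        return median(sim[i][j] for i in rows for j in cols)

    s11, s12 = block_median(g1, g1), block_median(g1, g2)
    s21, s22 = block_median(g2, g1), block_median(g2, g2)
    total = s11 + s12 + s21 + s22
    # Assumed score: fraction of the median similarity found in the diagonal blocks.
    # With no similarity signal at all, this sketch chooses not to split.
    return (s11 + s22) / total if total > 0 else 0.0


def two_medoid_split(sim, items):
    """Split items around the best pair of medoids (brute force, small inputs only)."""
    best = None
    for pos, a in enumerate(items):
        for b in items[pos + 1:]:
            # Cost of sending each item to its nearer medoid; distance = 1 - similarity.
            cost = sum(min(1 - sim[i][a], 1 - sim[i][b]) for i in items)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    _, a, b = best
    g1 = [i for i in items if sim[i][a] >= sim[i][b]]
    g2 = [i for i in items if sim[i][a] < sim[i][b]]
    return g1, g2


def binary_cut(sim, items=None, cutoff=0.85):
    """Recursively bisect the items; keep a split only if its score reaches the cutoff."""
    if items is None:
        items = list(range(len(sim)))
    if len(items) < 2:
        return [items]
    g1, g2 = two_medoid_split(sim, items)
    if not g1 or not g2 or block_score(sim, g1, g2) < cutoff:
        return [items]  # the two halves are not distinct enough: stop here
    return binary_cut(sim, g1, cutoff) + binary_cut(sim, g2, cutoff)


# Toy similarity matrix: terms 0-2 and 3-5 form two tight, well-separated groups.
sim = [[1.0 if i == j else (0.9 if (i < 3) == (j < 3) else 0.1) for j in range(6)]
       for i in range(6)]
print(binary_cut(sim))  # [[0, 1, 2], [3, 4, 5]]
```

In this toy case the first split separates the two blocks with a score of 0.9, while any further split inside a block scores about 0.52 and is rejected, so terms within a group stay together and the groups remain mutually exclusive. For real enrichment results, the package function `simplifyGO()` mentioned in the figure captions is the intended entry point.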

Conflict of interest statement

The authors have declared no competing interests.

Figures

**Figure 1**
**Examples of similarity matrices for two sets of GO terms** The two matrices are denoted as M_a and M_b. Both matrices are split into two groups in the two dimensions, where submatrices are labeled as M₁₁, M₁₂, M₂₁, and M₂₂. GO, Gene Ontology.

**Figure 2**
**A demonstration of the *binary cut* clustering with 500 random GO terms** A. The clustering in the first iteration. B. The clustering in the second iteration. C. The clustering in the third iteration. D. The complete dendrogram from *binary cut* clustering. The colors of the dendrogram segments correspond to the scores s assigned to the nodes. Nodes to split are marked with crosses.

**Figure 3**
**Example of a similarity heatmap from 500 random GO terms that have been clustered and annotated with word clouds** The bottom right cluster with no word cloud annotation contains all the small clusters with fewer than 5 terms. The plot was made by the function *simplifyGO()*.

**Figure 4**
**Comparison of different clustering methods** A. Clustering by *binary cut*. B. Clustering by *kmeans*. C. Clustering by PAM. D. Clustering by dynamicTreeCut. E. Clustering by mclust. F. Clustering by apcluster. G. Clustering by *hdbscan*. H. Clustering by the fast greedy graph community method. I. Clustering by the louvain graph community method. J. Clustering by the walktrap graph community method. K. Clustering by the MCL graph community method. L. Numbers of all clusters and numbers of large clusters with size ≥ 5. For some methods, the small clusters (size < 5) were put into one single cluster on the bottom right of the heatmap and were marked by green lines. All the methods were applied to the same GO semantic similarity matrix from 500 random GO terms from the biological process ontology. The plots were generated by the function *compare_clustering_methods()*. PAM, partitioning around medoids.

**Figure 5**
**Benchmarks of different clustering methods** A. Difference scores. B. Numbers of clusters. C. Block mean values. The analyses in A−C were applied to 100 random GO lists of 500 BP terms. D. Difference scores. E. Numbers of clusters. F. Block mean values. The analyses in D−F were applied to the functional enrichment results from 485 Expression Atlas datasets. BP, Biological Process.

**Figure 6**
**Comparison of clusterings on similarity matrices from different similarity measures** A. Numbers of clusters. B. Average numbers of terms per cluster. Y-axes in A and B are on log₁₀ scale. C. Mean concordance of the clusterings. The definition of concordance can be found in File S13. The analyses in A−C were applied to 100 random GO lists with 500 BP terms. D and E. Examples of *binary cut* clustering on similarity matrices from the semantic and kappa measures. The two similarity matrices correspond to the same list of random GO terms. F−J. Analogous to A−E, but on the functional enrichment results from 485 Expression Atlas datasets. I and J are based on the same Expression Atlas dataset. K. Distributions of cluster sizes on similarity matrices from different similarity measures. L. Distributions of the fraction of the largest cluster. K and L are based on Expression Atlas datasets.

**Figure 7**
**Comparison of enrichment results from three gene lists** A. Heatmap of the expression of signature genes from a three-group classification of the Golub leukemia dataset. The three gene lists were generated by applying k-means clustering to the rows of the expression matrix. The z-score standardization was applied to matrix rows. B. GO terms that were significant in the enrichment results of any of the three gene lists were clustered, and their similarities were visualized as a heatmap. The left heatmap shows whether each GO term was significant in the corresponding gene list. The labels “1” and “2” on word clouds are explained in the main text.
