Genomics Proteomics Bioinformatics. 2023 Feb;21(1):190-202.
doi: 10.1016/j.gpb.2022.04.008. Epub 2022 Jun 6.

simplifyEnrichment: A Bioconductor Package for Clustering and Visualizing Functional Enrichment Results

Zuguang Gu et al. Genomics Proteomics Bioinformatics. 2023 Feb.

Abstract

Functional enrichment analysis or gene set enrichment analysis is a basic bioinformatics method that evaluates the biological importance of a list of genes of interest. However, it may produce a long list of significant terms with highly redundant information that is difficult to summarize. Current tools to simplify enrichment results by clustering them into groups either still produce redundancy between clusters or do not retain consistent term similarities within clusters. We propose a new method named binary cut for clustering similarity matrices of functional terms. Through comprehensive benchmarks on both simulated and real-world datasets, we demonstrated that binary cut could efficiently cluster functional terms into groups where terms showed consistent similarities within groups and were mutually exclusive between groups. We compared binary cut clustering on the similarity matrices obtained from different similarity measures and found that semantic similarity worked well with binary cut, while similarity matrices based on gene overlap showed less consistent patterns. We implemented the binary cut algorithm in the R package simplifyEnrichment, which additionally provides functionalities for visualizing, summarizing, and comparing the clustering. The simplifyEnrichment package and the documentation are available at https://bioconductor.org/packages/simplifyEnrichment/.

Keywords: Clustering; Functional enrichment; R/Bioconductor; Simplify enrichment; Software; Visualization.

Conflict of interest statement

The authors have declared no competing interests.

Figures

Figure 1
Examples of similarity matrices for two sets of GO terms. The two matrices are denoted as Ma and Mb. Both matrices are split into two groups in the two dimensions, where submatrices are labeled as M11, M12, M21, and M22. GO, Gene Ontology.
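
The four-block split described in this caption can be turned into a single number that says how cleanly two groups separate. The following is a minimal sketch, not the package's implementation: the function name `split_score`, the use of the median as the block summary, and the ratio itself are assumptions made for illustration, since this text only states that each node receives a score s.

```python
from statistics import median


def split_score(sim, group_a, group_b):
    """Score a two-way split of a square similarity matrix.

    Assumed formula (hypothetical, not given in this text): with s11, s12,
    s21, s22 the medians of submatrices M11, M12, M21, M22,
    score = (s11 + s22) / (s11 + s12 + s21 + s22).
    """
    def block_median(rows, cols):
        # Median of all entries in the submatrix rows x cols.
        return median(sim[i][j] for i in rows for j in cols)

    s11 = block_median(group_a, group_a)
    s12 = block_median(group_a, group_b)
    s21 = block_median(group_b, group_a)
    s22 = block_median(group_b, group_b)
    total = s11 + s12 + s21 + s22
    # An all-zero matrix carries no evidence for a split.
    return (s11 + s22) / total if total > 0 else 0.0


# Toy matrix with two clear blocks: high similarity within, low between.
blocky = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
# Toy matrix with no structure: every pair is equally similar.
flat = [[0.5] * 4 for _ in range(4)]

blocky_score = split_score(blocky, [0, 1], [2, 3])
flat_score = split_score(flat, [0, 1], [2, 3])
```

Under this assumed definition, a score near 1 means the diagonal blocks dominate and the split is worth keeping, while a score near 0.5 means the two groups are no more similar internally than to each other.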
Figure 2
A demonstration of the binary cut clustering with 500 random GO terms. A. The clustering in the first iteration. B. The clustering in the second iteration. C. The clustering in the third iteration. D. The complete dendrogram from binary cut clustering. The colors of the dendrogram segments correspond to the scores s assigned to the nodes. Nodes to split are marked with crosses.
Figure 3
Example of a similarity heatmap from 500 random GO terms that have been clustered and annotated with word clouds. The bottom right cluster, which has no word cloud annotation, collects all small clusters with fewer than 5 terms. The plot was made by the function simplifyGO().
Figure 4
Comparison of different clustering methods. A. Clustering by binary cut. B. Clustering by kmeans. C. Clustering by PAM. D. Clustering by dynamicTreeCut. E. Clustering by mclust. F. Clustering by apcluster. G. Clustering by hdbscan. H. Clustering by the fast greedy graph community method. I. Clustering by the louvain graph community method. J. Clustering by the walktrap graph community method. K. Clustering by the MCL graph community method. L. Numbers of all clusters and numbers of large clusters with size ≥ 5. For some methods, the small clusters (size < 5) were put into one single cluster on the bottom right of the heatmap and were marked by green lines. All the methods were applied to the same GO semantic similarity matrix from 500 random GO terms from the biological process ontology. The plots were generated by the function compare_clustering_methods(). PAM, partitioning around medoids.
Figure 5
Benchmarks of different clustering methods. A. Difference scores. B. Numbers of clusters. C. Block mean values. The analysis in A−C was applied to 100 random GO lists of 500 BP terms. D. Difference scores. E. Numbers of clusters. F. Block mean values. The analyses in E and F were applied to the functional enrichment results from 485 Expression Atlas datasets. BP, Biological Process.
Figure 6
Comparison of clusterings on similarity matrices from different similarity measures. A. Numbers of clusters. B. Average numbers of terms per cluster. Y-axes in A and B are on log10 scale. C. Mean concordance of the clusterings. The definition of concordance can be found in File S13. The analysis was applied to 100 random GO lists with 500 BP terms. D and E. Examples of binary cut clustering on similarity matrices from the semantic and kappa measures. The two similarity matrices correspond to the same list of random GO terms. F−J. Analogous to A−E, but on the functional enrichment results from 485 Expression Atlas datasets. I and J are based on the same Expression Atlas dataset. K. Distributions of cluster sizes on similarity matrices from different similarity measures. L. Distributions of the fraction of the largest cluster. K and L are based on Expression Atlas datasets.
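
For readers unfamiliar with the kappa measure named in this caption, the sketch below shows one common way to turn gene overlap between two terms into a similarity: Cohen's kappa computed on gene membership over a fixed gene universe. It is an illustration under stated assumptions, not the package's code. The function name `kappa_similarity`, the gene identifiers, and the universe size are hypothetical.

```python
def kappa_similarity(genes_a, genes_b, universe_size):
    """Cohen's kappa between two gene sets over a universe of given size.

    Each gene in the universe is treated as an item that is either in or
    out of a set; kappa compares the observed agreement between the two
    sets with the agreement expected by chance.
    """
    a, b = set(genes_a), set(genes_b)
    both = len(a & b)                      # genes in both sets
    neither = universe_size - len(a | b)   # genes in neither set
    observed = (both + neither) / universe_size

    p_a = len(a) / universe_size
    p_b = len(b) / universe_size
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if expected == 1.0:
        # Degenerate case: both sets are empty or both cover the universe.
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical gene sets in a hypothetical universe of 10 genes.
term_1 = {"g1", "g2", "g3", "g4"}
term_2 = {"g3", "g4", "g5", "g6"}

partial = kappa_similarity(term_1, term_2, universe_size=10)
identical = kappa_similarity(term_1, term_1, universe_size=10)
```

Unlike semantic similarity, which uses the ontology structure, a measure of this kind depends only on which genes are annotated to each term.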
Figure 7
Comparison of enrichment results from three gene lists. A. Heatmap of the expression of signature genes from a three-group classification of the Golub leukemia dataset. The three gene lists were generated by applying k-means clustering on rows of the expression matrix. The z-score standardization was applied to matrix rows. B. GO terms that were significant in any enrichment results of the three gene lists were clustered, and their similarities were visualized as a heatmap. The left heatmap shows whether the GO terms were significant in the corresponding gene list. The labels “1” and “2” on word clouds are explained in the main text.
