Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline

Brief Bioinform. 2016 May;17(3):393-407.

doi: 10.1093/bib/bbv069. Epub 2015 Sep 4.

Yasir Rahmatallah, Frank Emmert-Streib, Galina Glazko

PMID: 26342128
PMCID: PMC4870397
DOI: 10.1093/bib/bbv069

Abstract

Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in the form of functionally related gene sets) into the analysis. Here we provide a brief review of several statistically different GSA approaches (competitive and self-contained) that can be adapted from microarray practice, as well as those specifically designed for RNA-seq. We evaluate their performance (in terms of Type I error rate, power, robustness to sample size and heterogeneity, and sensitivity to different types of selection biases) on simulated and real RNA-seq data. Not surprisingly, the performance of the various GSA approaches depends only on the statistical hypothesis they test and does not depend on whether the test was developed for microarray or RNA-seq data. Interestingly, we found that competitive methods have lower power and lower robustness to sample heterogeneity than self-contained methods, leading to poor reproducibility of results. We also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in the tested gene sets. These properties of competitive methods have been overlooked before. Our evaluation provides a concise guideline for selecting the GSA approaches that perform best under particular experimental settings in the context of RNA-seq.

Keywords: RNA-seq; competitive; gene set analysis; robustness; self-contained.
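
The distinction the abstract draws between self-contained and competitive approaches can be illustrated with a small permutation sketch. This is not any of the methods evaluated in the paper: the per-gene score (absolute difference in group means), the set statistic (mean score), the Gaussian toy data and all function names are assumptions chosen for brevity. The set size of 16 and the 20 samples merely echo the settings named in the Figure 3 caption. A self-contained test asks whether any gene in the set is associated with the phenotype and permutes sample labels; a competitive test asks whether the set is more associated than other genes and resamples gene sets.

```python
import random
import statistics


def gene_scores(expr, labels):
    """Per-gene score: absolute difference between the two group means."""
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = abs(statistics.fmean(group1) - statistics.fmean(group0))
    return scores


def set_statistic(scores, gene_set):
    """Gene set statistic: mean per-gene score over the set."""
    return statistics.fmean([scores[g] for g in gene_set])


def self_contained_p(expr, labels, gene_set, n_perm=500, seed=0):
    """Self-contained test: permute sample labels, using only genes in the set."""
    rng = random.Random(seed)
    subset = {g: expr[g] for g in gene_set}
    observed = set_statistic(gene_scores(subset, labels), gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_statistic(gene_scores(subset, shuffled), gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def competitive_p(expr, labels, gene_set, n_perm=500, seed=0):
    """Competitive test: compare the set with random gene sets of equal size."""
    rng = random.Random(seed)
    scores = gene_scores(expr, labels)
    observed = set_statistic(scores, gene_set)
    genes = list(scores)
    hits = 0
    for _ in range(n_perm):
        if set_statistic(scores, rng.sample(genes, len(gene_set))) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def simulate(n_genes=200, n_per_group=10, set_size=16, shift=2.0, seed=1):
    """Toy Gaussian data (not RNA-seq counts): the first set_size genes are
    shifted upward in group 1; all other genes have no group effect."""
    rng = random.Random(seed)
    labels = [0] * n_per_group + [1] * n_per_group
    gene_set = [f"g{i}" for i in range(set_size)]
    expr = {}
    for i in range(n_genes):
        delta = shift if i < set_size else 0.0
        expr[f"g{i}"] = [rng.gauss(delta * lab, 1.0) for lab in labels]
    return expr, labels, gene_set


if __name__ == "__main__":
    expr, labels, gene_set = simulate()
    print("self-contained p:", self_contained_p(expr, labels, gene_set))
    print("competitive p:   ", competitive_p(expr, labels, gene_set))
```

In this sketch the self-contained p-value depends only on the genes inside the set, whereas the competitive p-value also depends on how strongly the remaining genes differ between groups, so the two can disagree when many background genes are differentially expressed.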


Figures

**Figure 1.**
Schematic overview illustrating the breakdown of the GSA methods that can be adapted from microarray practice to fit RNA-seq data (boxes with dots) as well as those specifically designed for RNA-seq (boxes with diagonal stripes), based on the different null hypotheses they test.

**Figure 2.**
Illustrative histograms and corresponding CCDS curves obtained using commonly detected C2 gene sets at a significance level of 0.05 in 100 subsets of the Nigerian data set with sample size 28. (A) Histogram of the number of commonly detected C2 gene sets by N-statistic in b subsets out of 100; (B) histogram of the number of commonly detected C2 gene sets by GSVA in b subsets out of 100; (C) CCDS curve showing the CCDS for N-statistic; (D) CCDS curve showing the CCDS for GSVA.

**Figure 3.**
The power of different tests to detect differences between two groups of samples when the alternative hypothesis (H₁) holds true with different settings (values of β, γ and FC). The gene set size is p = 16 and the sample size in each group is N/2 (N = 20). (A) β = 0.05, γ = 0.125; (B) β = 0.05, γ = 0.25; (C) β = 0.05, γ = 0.5; (D) β = 0.25, γ = 0.125; (E) β = 0.25, γ = 0.25; (F) β = 0.25, γ = 0.5. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

**Figure 4.**
The estimated TPR (A), FPR (B) and the number of detected gene sets (C) by different GSA approaches. For each sample size, the results are averaged over 100 subsets composed of subsamples from the full Nigerian data set. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

**Figure 5.**
CCDS curves for different GSA approaches when 100 subsets composed of subsamples from the full Nigerian data set (58 samples) are considered with different sample sizes. (A) Sample size = 48; (B) sample size = 38; (C) sample size = 28; (D) sample size = 18. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

**Figure 6.**
A dendrogram showing the similarity between different GSA approaches in terms of detected C2 gene sets at a significance level of 0.05. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

**Figure 7.**
Boxplots comparing (A) the number of genes in gene sets (gene set size), (B) the proportion of DE genes in gene sets and (C) the average gene length per gene set in detected C2 gene sets (among 3890 C2 gene sets, α = 0.05) found by different GSA approaches.


Cited by

Discovery of Selenocysteine as a Potential Nanomedicine Promotes Cartilage Regeneration With Enhanced Immune Response by Text Mining and Biomedical Databases.
Ye J, Xu B, Fan B, Zhang J, Yuan F, Chen Y, Sun Z, Yan X, Song Y, Song S, Yang M, Yu JK. Front Pharmacol. 2020 Jul 24;11:1138. doi: 10.3389/fphar.2020.01138. eCollection 2020. PMID: 32792959. Free PMC article.
Longitudinal linear combination test for gene set analysis.
Khodayari Moez E, Hajihosseini M, Andrews JL, Dinu I. BMC Bioinformatics. 2019 Dec 10;20(1):650. doi: 10.1186/s12859-019-3221-7. PMID: 31822265. Free PMC article.
GSAR: Bioconductor package for Gene Set analysis in R.
Rahmatallah Y, Zybailov B, Emmert-Streib F, Glazko G. BMC Bioinformatics. 2017 Jan 24;18(1):61. doi: 10.1186/s12859-017-1482-6. PMID: 28118818. Free PMC article.
Roastgsa: a comparison of rotation-based scores for gene set enrichment analysis.
Caballé-Mestres A, Berenguer-Llergo A, Stephan-Otto Attolini C. BMC Bioinformatics. 2023 Oct 30;24(1):408. doi: 10.1186/s12859-023-05510-x. PMID: 37904108. Free PMC article.
Resolving host-pathogen interactions by dual RNA-seq.
Westermann AJ, Barquist L, Vogel J. PLoS Pathog. 2017 Feb 16;13(2):e1006033. doi: 10.1371/journal.ppat.1006033. eCollection 2017 Feb. PMID: 28207848. Free PMC article. Review.

