Comparative Study
BMC Bioinformatics. 2014 Dec 5;15(1):397.
doi: 10.1186/s12859-014-0397-8.

Comparative evaluation of gene set analysis approaches for RNA-Seq data

Yasir Rahmatallah et al. BMC Bioinformatics.

Abstract

Background: Over the last few years, transcriptome sequencing (RNA-Seq) has almost completely replaced microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes that are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in interpreting the results of RNA-Seq experiments, the limitations of GSA methods developed for microarrays, when applied to RNA-Seq data, are not well understood.

Results: We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach uses univariate tests designed for the analysis of RNA-Seq data to find gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways without any bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the method used to combine P-values, and all of the aforementioned biases are present in the detected pathways.
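
The gene-level approach described above reduces a set of gene-specific P-values to a single pathway P-value. As a minimal sketch of that combining step, assuming independent gene-level P-values, two classical combiners (Fisher's and Stouffer's methods, which the FM and SM labels in the figure captions presumably denote) might look like the following. The function names and the clipping bound are illustrative assumptions, not taken from the paper, and the paper's own variants (including the transformation with STT = 0.05) are not reproduced here.

```python
import math
from statistics import NormalDist

_EPS = 1e-12  # assumed clipping bound; keeps log() and inv_cdf() finite


def _clip(p):
    """Clip a P-value into the open interval (0, 1)."""
    return min(max(p, _EPS), 1.0 - _EPS)


def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k df under H0."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    half_x = -sum(math.log(_clip(p)) for p in pvalues)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


def stouffer_combine(pvalues):
    """Stouffer's method: average the normal scores of the P-values."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    norm = NormalDist()
    # z_i = -Phi^{-1}(p_i); small P-values give large positive scores
    z = sum(-norm.inv_cdf(_clip(p)) for p in pvalues) / math.sqrt(k)
    return norm.cdf(-z)  # upper-tail probability of the combined score


if __name__ == "__main__":
    # Hypothetical gene-level P-values for one pathway (made-up numbers)
    gene_pvalues = [0.01, 0.20, 0.03, 0.60]
    print("Fisher:  ", fisher_combine(gene_pvalues))
    print("Stouffer:", stouffer_combine(gene_pvalues))
```

Both combiners assume independence between genes, which co-expressed genes in a pathway generally violate. This is one plausible reason why the choice of combining method can affect the Type I error rate so strongly.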

Conclusions: Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways in RNA-Seq data, and warn against applying gene-level GSA tests, especially because of their high Type I error rates for both simulated and real data.

Figures

Figure 1
The functional relationship between the transformed and the original P-values for different transformation functions H (used by FM, SM and GM with STT = 0.05).
Figure 2
The power curves of multivariate tests with different normalizations when the shift alternative hypothesis (H1) holds true and the number of genes in pathways is p = 16 (N = 20).
Figure 3
The power curves of gene-level GSA methods when the shift alternative hypothesis (H1) holds true and the number of genes in pathways is p = 16 (N = 20).
Figure 4
Venn diagrams showing the number of common pathways detected in the processed Nigerian dataset by multivariate tests with normalizations and univariate tests with combined P-values for gene-level GSA methods (α = 0.05). (a) N-statistic with different normalizations and ROAST; (b) WW with different normalizations and ROAST; (c) KS with different normalizations and ROAST; (d) edgeR with different P-value combining methods; (e) DESeq with different P-value combining methods; (f) eBayes with different P-value combining methods; (g) univariate tests with FM; (h) univariate tests with SM; (i) univariate tests with GM.
Figure 5
The percentage of DE genes, number of genes and average gene length in pathways detected in the processed Nigerian dataset by different methods.
