Power analysis and sample size estimation for RNA-Seq differential expression

Travers Ching¹, Sijia Huang¹, Lana X Garmire²

RNA. 2014 Nov;20(11):1684-96. doi: 10.1261/rna.046011.114. Epub 2014 Sep 22.

Affiliations

¹ University of Hawaii Cancer Center, Honolulu, Hawaii 96813, USA; Graduate Program of Molecular Biosciences and Bioengineering, University of Hawaii-Manoa, Honolulu, Hawaii 96822, USA.
² University of Hawaii Cancer Center, Honolulu, Hawaii 96813, USA; Graduate Program of Molecular Biosciences and Bioengineering, University of Hawaii-Manoa, Honolulu, Hawaii 96822, USA. lgarmire@cc.hawaii.edu

PMID: 25246651
PMCID: PMC4201821
DOI: 10.1261/rna.046011.114

Abstract

It is crucial for researchers to optimize RNA-Seq experimental designs for differential expression detection. Currently, the field lacks general methods to estimate power and sample size for RNA-Seq in complex experimental designs under the assumption of the negative binomial distribution. We simulate RNA-Seq count data based on parameters estimated from six widely different public data sets (including cell line comparison, tissue comparison, and cancer data sets) and calculate the statistical power in paired and unpaired sample experiments. We comprehensively compare five differential expression analysis packages (DESeq, edgeR, DESeq2, sSeq, and EBSeq) and evaluate their performance by power, receiver operating characteristic (ROC) curves, and other metrics including area under the curve (AUC), Matthews correlation coefficient (MCC), and F-measure. DESeq2 and edgeR tend to give the best performance in general. Increasing sample size or sequencing depth increases power; however, increasing sample size is more effective than increasing sequencing depth, especially once the sequencing depth reaches 20 million reads. Long intergenic noncoding RNAs (lincRNAs) yield lower power than protein-coding mRNAs, given their lower expression levels in the same RNA-Seq experiment. On the other hand, paired-sample RNA-Seq significantly enhances the statistical power, confirming the importance of considering the multifactor experimental design. Finally, a locally optimal power is achievable for a given budget constraint, and the dominant contributing factor is sample size rather than sequencing depth. In conclusion, we provide a power analysis tool (http://www2.hawaii.edu/~lgarmire/RNASeqPowerCalculator.htm) that captures the dispersion in the data and can serve as a practical reference under the budget constraint of RNA-Seq experiments.

Keywords: RNA-Seq; bioinformatics; power analysis; sample size; simulation.
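
The simulation idea in the abstract can be illustrated for a single gene. The following is a minimal sketch, not the authors' pipeline: it draws negative binomial counts through a gamma-Poisson mixture (variance = mean + dispersion × mean²) and, as a stand-in for the five packages' tests, uses a Welch t statistic on log2(count + 1) whose critical value is calibrated by simulating the null. The function names and the default mean, fold change, and dispersion are all illustrative assumptions.

```python
import math
import random
import statistics


def sample_nb(mean, dispersion, rng):
    """Draw one negative binomial count via a gamma-Poisson mixture.

    The resulting variance is mean + dispersion * mean**2.
    """
    # Gamma with shape 1/dispersion and scale mean*dispersion has mean `mean`.
    lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    # Knuth's Poisson sampler; adequate for the modest means used here.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def welch_t(a, b):
    """Welch t statistic for two samples (each needs at least 2 values)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return 0.0 if se == 0 else (statistics.fmean(a) - statistics.fmean(b)) / se


def simulate_stat(n, mean, fold, dispersion, rng):
    """Absolute test statistic for one simulated two-group experiment."""
    ctrl = [math.log2(sample_nb(mean, dispersion, rng) + 1) for _ in range(n)]
    trt = [math.log2(sample_nb(mean * fold, dispersion, rng) + 1) for _ in range(n)]
    return abs(welch_t(trt, ctrl))


def estimate_power(n, mean=50.0, fold=2.0, dispersion=0.1,
                   alpha=0.05, n_sim=400, seed=1):
    """Estimate power for n samples per condition (hypothetical defaults)."""
    rng = random.Random(seed)
    # Calibrate the critical value from simulations with no true difference.
    null = sorted(simulate_stat(n, mean, 1.0, dispersion, rng) for _ in range(n_sim))
    critical = null[min(n_sim - 1, math.ceil((1 - alpha) * n_sim))]
    # Power is the fraction of simulations under the alternative that reject.
    hits = sum(simulate_stat(n, mean, fold, dispersion, rng) > critical
               for _ in range(n_sim))
    return hits / n_sim


if __name__ == "__main__":
    for n in (3, 6, 12):
        print(n, estimate_power(n))
```

Calibrating the critical value by simulation avoids relying on a t-distribution approximation at small sample sizes; a real analysis would instead run the chosen package over whole simulated count matrices with per-gene dispersions.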

Figures

**FIGURE 1.**
Power curves based on the number of samples per condition for the six public data sets and five RNA-Seq differential expression analysis packages. Library sizes were estimated from the gene counts of the real data sets. Per-gene dispersion was estimated through the Cox–Reid adjusted profile likelihood. (A) Power curves relative to sample size and differential expression methods in six public data sets. The four unpaired-sample data sets (Bottomly, Bullard, Huang, M–P) were analyzed with edgeR, DESeq, DESeq2, EBSeq, and sSeq. The paired-sample data sets (Tuch and Qian) were analyzed with edgeR, DESeq, DESeq2, and sSeq. Note that *EBSeq* is not included as it is currently not adapted to analyzing paired-sample data. (B) Heatmap of averaged power over the differential expression methods in six public data sets.

**FIGURE 2.**
Performance comparison with receiver operating characteristic (ROC) curves and other metrics for the six public data sets and five RNA-Seq differential expression analysis packages. Sensitivity and 1 − specificity were estimated in each simulation for n = 4 samples per condition. The simulations were conducted as in Figure 1. (A) ROC curve comparison. True positive rate (TPR) versus false positive rate (FPR) was plotted. (B) Other performance metrics. Area under the curve (AUC) was measured up to FPR = 0.5 of the ROC curves in A. Matthews correlation coefficient (MCC) and F-measure were measured at the threshold of α = 0.05.
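
The metrics named in this caption follow standard definitions, which can be sketched as below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the F-measure is assumed to be the balanced F1 (beta = 1) unless another beta is passed, and the partial AUC is assumed to be the unnormalized trapezoidal area from FPR = 0 to the cutoff.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure (F1 by default) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def partial_auc(fpr, tpr, max_fpr=0.5):
    """Trapezoidal area under ROC points from FPR = 0 up to max_fpr.

    `fpr` and `tpr` are lists of equal length sorted by increasing FPR.
    """
    area = 0.0
    points = list(zip(fpr, tpr))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the curve linearly at the cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

For example, with the genes called at α = 0.05 tallied against the simulated truth, `mcc(tp, fp, tn, fn)` and `f_measure(tp, fp, fn)` give the two threshold-based scores, while `partial_auc` takes the ROC points directly.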

**FIGURE 3.**
Paired versus single-factor power analysis of paired-sample data sets (Qian and Tuch). The data sets were evaluated with pairing information (paired analysis, solid line) or without pairing information (single-factor analysis, dashed line), using the standard analysis pipelines for the respective packages as in Figure 1. Note that *EBSeq* is not included as it is currently not adapted to analyzing paired-sample data.

**FIGURE 4.**
Power of protein-coding genes versus long intergenic noncoding RNA (lincRNA) transcripts. The comparison was made using the Huang data set, which used ribosomal RNA removal for RNA library construction. The transcriptome was separated into protein-coding gene (solid line) and lincRNA (dashed line) categories. Power was estimated in each simulation for these two categories, using the standard analysis pipelines for the respective packages as in Figure 1.

**FIGURE 5.**
Optimization of power given a budget constraint. The cost of RNA-Seq per sample is dependent on the cost of constructing the RNA-Seq library and the cost of single-end sequencing under the multiplex arrangement, where multiple samples could be barcoded to share one lane of the HiSeq flow cell. Both sequencing depth and sample size are variables under the budget constraint. (A) Power curves relative to samples, exemplified by increasing budgets of $3000, $5000, and $10,000 among five RNA-Seq differential expression analysis packages. (B) Optimal powers achieved for given budget constraints. (C) Biological replicates required to obtain optimal powers for given budget constraints. (D) Sequencing depths required to obtain optimal powers for given budget constraints.
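
The trade-off described in this caption can be made concrete by enumerating the designs a budget allows. The sketch below is only an illustration of the bookkeeping: the library cost, lane cost, and reads per lane are hypothetical placeholder values, not prices from the paper, and the function name is an assumption. It assumes two conditions, whole lanes purchased after library costs are paid, and reads split evenly across multiplexed samples.

```python
def feasible_designs(budget, library_cost=250.0, lane_cost=1500.0,
                     reads_per_lane=150e6, conditions=2, min_depth=1e6):
    """List (samples per condition, lanes, reads per sample) within a budget.

    All prices and the lane yield are hypothetical defaults.
    """
    designs = []
    n = 2  # start at two replicates per condition
    while True:
        samples = conditions * n
        remaining = budget - samples * library_cost
        # Whatever is left after library construction buys whole lanes.
        lanes = int(remaining // lane_cost) if remaining > 0 else 0
        if lanes < 1:
            break  # more replicates only make this worse, so stop here
        depth = lanes * reads_per_lane / samples
        if depth >= min_depth:
            designs.append((n, lanes, depth))
        n += 1
    return designs


if __name__ == "__main__":
    for n, lanes, depth in feasible_designs(5000):
        print(f"n={n} per condition, lanes={lanes}, depth={depth / 1e6:.1f}M reads")
```

Each feasible pair of sample size and depth would then be scored with a power estimate, and the design with the highest estimated power taken as the optimum for that budget.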

Cited by

Unraveling the molecular heterogeneity in type 2 diabetes: a potential subtype discovery followed by metabolic modeling.
Khoshnejat M, Kavousi K, Banaei-Moghaddam AM, Moosavi-Movahedi AA. BMC Med Genomics. 2020 Aug 24;13(1):119. doi: 10.1186/s12920-020-00767-0. PMID: 32831068. Free PMC article.
Interferon-β corrects massive gene dysregulation in multiple sclerosis: Short-term and long-term effects on immune regulation and neuroprotection.
Feng X, Bao R, Li L, Deisenhammer F, Arnason BGW, Reder AT. EBioMedicine. 2019 Nov;49:269-283. doi: 10.1016/j.ebiom.2019.09.059. Epub 2019 Oct 21. PMID: 31648992. Free PMC article.
Linking environmental risk factors with epigenetic mechanisms in Parkinson's disease.
Tsalenchuk M, Gentleman SM, Marzi SJ. NPJ Parkinsons Dis. 2023 Aug 25;9(1):123. doi: 10.1038/s41531-023-00568-z. PMID: 37626097. Free PMC article. Review.
Power and sample size calculations for high-throughput sequencing-based experiments.
Li CI, Samuels DC, Zhao YY, Shyr Y, Guo Y. Brief Bioinform. 2018 Nov 27;19(6):1247-1255. doi: 10.1093/bib/bbx061. PMID: 28605403. Free PMC article.
Sample size calculation while controlling false discovery rate for differential expression analysis with RNA-sequencing experiments.
Bi R, Liu P. BMC Bioinformatics. 2016 Mar 31;17:146. doi: 10.1186/s12859-016-0994-9. PMID: 27029470. Free PMC article.

References

1. Aban IB, Cutter GR, Mavinga N. 2008. Inferences and power analysis concerning two negative binomial distributions with an application to MRI lesion counts data. Comput Stat Data Anal 53: 820–833.
2. Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biol 11: R106.
3. Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R. 2011. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One 6: e17820.
4. Bullard JH, Purdom E, Hansen KD, Dudoit S. 2010. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.
5. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. 2013. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics 29: 656–657.
