Dispersion estimation and its effect on test performance in RNA-seq data analysis: a simulation-based comparison of methods

PLoS One. 2013 Dec 9;8(12):e81415. doi: 10.1371/journal.pone.0081415. eCollection 2013.

William Michael Landau¹, Peng Liu¹

PMID: 24349066
PMCID: PMC3857202
DOI: 10.1371/journal.pone.0081415

William Michael Landau et al. PLoS One. 2013.

Affiliation

¹ Department of Statistics, Iowa State University, Ames, Iowa, United States of America.

Abstract

A central goal of RNA sequencing (RNA-seq) experiments is to detect differentially expressed genes. In the ubiquitous negative binomial model for RNA-seq data, each gene is given a dispersion parameter, and correctly estimating these dispersion parameters is vital to detecting differential expression. Since the dispersions control the variances of the gene counts, underestimation may lead to false discovery, while overestimation may lower the rate of true detection. After briefly reviewing several popular dispersion estimation methods, this article describes a simulation study that compares them in terms of point estimation and the effect on the performance of tests for differential expression. The methods that maximize the test performance are the ones that use a moderate degree of dispersion shrinkage: the DSS, Tagwise wqCML, and Tagwise APL. In practical RNA-seq data analysis, we recommend using one of these moderate-shrinkage methods with the QLShrink test in the QuasiSeq R package.
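To make the role of the dispersion concrete, the following is a minimal Python sketch, not code from the article. It assumes the common parameterization in which a count with mean `mu` and dispersion `phi` has variance `mu + phi * mu**2`, and it draws counts as a gamma-Poisson mixture. The function names (`nb_variance`, `sample_nb`, `moment_dispersion`) are hypothetical, and the method-of-moments estimator is only an illustration; it is not one of the methods compared in the study.

```python
import math
import random


def nb_variance(mu, phi):
    """Variance of a negative binomial count with mean mu and dispersion phi
    (assumed parameterization: Var = mu + phi * mu^2)."""
    return mu + phi * mu ** 2


def sample_nb(mu, phi, rng):
    """Draw one negative binomial count as a gamma-Poisson mixture."""
    # Gamma rate with shape 1/phi and scale phi*mu: mean mu, variance phi*mu^2.
    lam = rng.gammavariate(1.0 / phi, phi * mu)
    # Knuth's multiplication method for a Poisson draw; adequate for the
    # moderate rates used in this illustration.
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def moment_dispersion(counts):
    """Illustrative method-of-moments estimate: (s^2 - mean) / mean^2,
    floored at zero for samples that show no overdispersion."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return max((var - mean) / mean ** 2, 0.0)


if __name__ == "__main__":
    rng = random.Random(2013)
    true_phi = 0.3
    # With only a handful of replicates per gene, as is typical in RNA-seq,
    # the per-gene estimate is very noisy.
    small = [sample_nb(10.0, true_phi, rng) for _ in range(4)]
    large = [sample_nb(10.0, true_phi, rng) for _ in range(5000)]
    print("Poisson variance:", nb_variance(10.0, 0.0))
    print("NB variance:", nb_variance(10.0, true_phi))
    print("estimate from 4 counts:", moment_dispersion(small))
    print("estimate from 5000 counts:", moment_dispersion(large))
```

An estimate that is too small understates the variance of the counts and so makes chance differences look significant, while one that is too large has the opposite effect. The noise in small-sample estimates is the usual motivation for shrinking gene-wise dispersions toward a shared value.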

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. A look at the data.**
Hammer data and Hammer-generated pseudo-data are shown in blue, while Pickrell data and Pickrell-generated pseudo-data are shown in black. The top two panels show the gene-wise log geometric mean counts and log dispersion estimates, estimated with the QL method, for the Hammer and Pickrell datasets. The bottom two panels plot the analogous quantities for example simulated pseudo-datasets, except that the log dispersions plotted are the true dispersions used to simulate the pseudo-counts and the gene-wise log geometric mean counts are the values (log geometric mean counts from the real data) used in the simulations. The vertical bar in the plots of the log dispersions is an artifact of the QL method, which sets extremely low dispersions (i.e., dispersions of non-overdispersed genes) to a common minimum value.

**Figure 2. Dispersion-mean relationships.**
The top two panels show the relationship between the log QL-method-estimated dispersions and the gene-wise log geometric mean counts of the Hammer and Pickrell datasets. The bottom two panels plot the analogous quantities for example simulated pseudo-datasets, except that the log dispersions plotted are the true log dispersions used to simulate the pseudo-counts and the gene-wise log geometric mean counts are those used in the simulations. Bins in these two-dimensional histograms are shaded by their log frequency.

**Figure 3. Mean squared error of the transformed dispersions.**

**Figure 4. Simulation setting II: estimated vs true dispersions for an example pseudo-dataset.**
Dispersions with gene-wise log geometric mean counts below the median (log mean from −2.17 to 1.63) are shown in black, while those above the median (log mean from 1.63 to 10.6) are shown in light blue. Overlapping points are shown in dark blue. Results for simulation settings I and III are similar.

**Figure 5. Simulation setting V: estimated vs true dispersions for an example pseudo-dataset.**
Dispersions with gene-wise log geometric mean counts below the median (log mean from −2.17 to 4.49) are shown in black, while those above the median (log mean from 4.49 to 12.3) are shown in light blue. Overlapping points are shown in dark blue. Results for simulations IV and VI are similar.

**Figure 6. Simulation setting I: areas under ROC curves.**
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.

**Figure 7. Simulation setting II: areas under ROC curves.**
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.

**Figure 8. Simulation setting III: areas under ROC curves.**
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.

**Figure 9. Simulation setting IV: areas under ROC curves.**
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.

**Figure 10. Simulation setting V: areas under ROC curves.**
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.

**Figure 11. Simulation setting VI: areas under ROC curves.**
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
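The AUC summaries in Figures 6 to 11 can be illustrated with a short sketch. The Python function below is an assumption-laden example, not the article's code: the name `auc_from_pvalues` is hypothetical, and it computes the full area under the ROC curve through the Mann-Whitney statistic from per-gene p-values and known differential expression (DE) labels, which are available in a simulation.

```python
def auc_from_pvalues(pvalues, is_de):
    """Area under the ROC curve via the Mann-Whitney statistic.

    Smaller p-values should indicate differential expression, so each gene
    is scored by its negated p-value. Ties count as half a win.
    """
    pos = [-p for p, de in zip(pvalues, is_de) if de]
    neg = [-p for p, de in zip(pvalues, is_de) if not de]
    if not pos or not neg:
        raise ValueError("need at least one DE and one non-DE gene")
    wins = 0.0
    for s in pos:
        for t in neg:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: both truly DE genes have smaller p-values than the others.
    pvals = [0.001, 0.20, 0.03, 0.80]
    truth = [True, False, True, False]
    print(auc_from_pvalues(pvals, truth))
```

The pairwise loop is quadratic in the number of genes, which keeps the sketch readable; a rank-based formulation would be preferable for tens of thousands of genes. In a simulation study of this kind, one such value would be computed per pseudo-dataset and the values summarized in a boxplot.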

Cited by

Differentially expressed heterogeneous overdispersion genes testing for count data.
Yuan Y, Xu Q, Wani A, Dahrendorff J, Wang C, Shen A, Donglasan J, Burgan S, Graham Z, Uddin M, Wildman D, Qu A. PLoS One. 2024 Jul 17;19(7):e0300565. doi: 10.1371/journal.pone.0300565. eCollection 2024. PMID: 39018275. Free PMC article.
Chemotherapy for pain: reversing inflammatory and neuropathic pain with the anticancer agent mithramycin A.
Xu Z, Lee MC, Sheehan K, Fujii K, Rabl K, Rader G, Varney S, Sharma M, Eilers H, Kober K, Miaskowski C, Levine JD, Schumacher MA. Pain. 2024 Jan 1;165(1):54-74. doi: 10.1097/j.pain.0000000000002972. Epub 2023 Jun 27. PMID: 37366593. Free PMC article.
Detection of genes with differential expression dispersion unravels the role of autophagy in cancer progression.
Le Priol C, Azencott CA, Gidrol X. PLoS Comput Biol. 2023 Mar 9;19(3):e1010342. doi: 10.1371/journal.pcbi.1010342. eCollection 2023 Mar. PMID: 36893104. Free PMC article.
Cancer-related cognitive impairment is associated with perturbations in inflammatory pathways.
Oppegaard K, Harris CS, Shin J, Paul SM, Cooper BA, Chan A, Anguera JA, Levine J, Conley Y, Hammer M, Miaskowski CA, Chan RJ, Kober KM. Cytokine. 2021 Dec;148:155653. doi: 10.1016/j.cyto.2021.155653. Epub 2021 Aug 10. PMID: 34388477. Free PMC article.
Genomic profiling of Nipah virus using NGS driven RNA-Seq expression data.
Hassan MZ, Ahmed MS, Khan MM, Uddin MA, Chowdhury F, Kamruzzaman M. Bioinformation. 2019 Dec 31;15(12):853-862. doi: 10.6026/97320630015853. eCollection 2019. PMID: 32256005. Free PMC article.

Grants and funding

This work is supported in part by the National Science Foundation Grant IOS-1127017. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
