PLoS One. 2013 Dec 9;8(12):e81415.
doi: 10.1371/journal.pone.0081415. eCollection 2013.

Dispersion estimation and its effect on test performance in RNA-seq data analysis: a simulation-based comparison of methods

William Michael Landau et al. PLoS One. 2013.

Abstract

A central goal of RNA sequencing (RNA-seq) experiments is to detect differentially expressed genes. In the ubiquitous negative binomial model for RNA-seq data, each gene is given a dispersion parameter, and correctly estimating these dispersion parameters is vital to detecting differential expression. Because the dispersions control the variances of the gene counts, underestimation may lead to false discoveries, while overestimation may lower the rate of true detection. After briefly reviewing several popular dispersion estimation methods, this article describes a simulation study that compares them in terms of point estimation and their effect on the performance of tests for differential expression. The methods that maximize test performance are those that use a moderate degree of dispersion shrinkage: DSS, Tagwise wqCML, and Tagwise APL. In practical RNA-seq data analysis, we recommend using one of these moderate-shrinkage methods with the QLShrink test in the QuasiSeq R package.
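To make the variance argument concrete: under the common mean-dispersion parameterization of the negative binomial, a count with mean mu and dispersion phi has variance mu + phi * mu^2. The Python sketch below is illustrative only. It shows that formula, a crude method-of-moments dispersion estimate, and a toy linear shrinkage toward a common value. The function names and the shrinkage weight are assumptions made for this example; the code does not reproduce DSS, Tagwise wqCML, Tagwise APL, or the QL method.

```python
from statistics import mean, variance
from typing import Sequence


def nb_variance(mu: float, phi: float) -> float:
    """Variance of a negative binomial count with mean mu and dispersion phi."""
    return mu + phi * mu ** 2


def moment_dispersion(counts: Sequence[float]) -> float:
    """Crude method-of-moments estimate (an assumption, not a method from the paper).

    Solves s^2 = m + phi * m^2 for phi, where m is the sample mean and
    s^2 the sample variance, and floors the result at zero.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # unbiased sample variance
    return max((s2 - m) / m ** 2, 0.0)


def shrink(phi_gene: float, phi_common: float, weight: float) -> float:
    """Toy linear shrinkage of a gene-wise estimate toward a common value.

    weight is a hypothetical tuning knob in [0, 1]: 0 keeps the gene-wise
    estimate, 1 replaces it with the common value.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * phi_common + (1.0 - weight) * phi_gene


if __name__ == "__main__":
    counts = [5, 15, 10, 20, 0]
    phi_hat = moment_dispersion(counts)
    print("gene-wise estimate:", phi_hat)
    print("after moderate shrinkage:", shrink(phi_hat, 0.1, 0.5))
    print("implied variance at mean 100:", nb_variance(100, phi_hat))
```

At a mean of 100, raising the dispersion from 0.01 to 0.1 raises the variance from 200 to 1100. A test that plugs in the smaller value when the larger one is true treats ordinary noise as signal, which is the false-discovery risk the abstract describes.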

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. A look at the data.
Hammer data and Hammer-generated pseudo-data are shown in blue, while Pickrell data and Pickrell-generated pseudo-data are shown in black. The top two panels show the gene-wise log geometric mean counts and log dispersion estimates, estimated with the QL method, for the Hammer and Pickrell datasets. The bottom two panels plot the analogous quantities for example simulated pseudo-datasets, except that the log dispersions plotted are the true dispersions used to simulate the pseudo-counts and the gene-wise log geometric mean counts are the log geometric mean counts from the real data that were used in the simulations. The vertical bar in the plots of the log dispersions is an artifact of the QL method, which sets extremely low dispersions (i.e., dispersions of non-overdispersed genes) to a common minimum value.
Figure 2. Dispersion-mean relationships.
The top two panels show the relationship between the log QL-method-estimated dispersions and the gene-wise log geometric mean counts of the Hammer and Pickrell datasets. The bottom two panels plot the analogous quantities for example simulated pseudo-datasets, except that the log dispersions plotted are the true log dispersions used to simulate the pseudo-counts, and the gene-wise log geometric mean counts are those used in the simulations. Bins in these two-dimensional histograms are shaded by their log frequency.
Figure 3. Mean squared error of the transformed dispersions.
Figure 4. Simulation setting II: estimated vs true dispersions for an example pseudo-dataset.
Dispersions with gene-wise log geometric mean counts below the median (log mean from −2.17 to 1.63) are shown in black, while those above the median (log mean from 1.63 to 10.6) are shown in light blue. Overlapping points are shown in dark blue. Results for simulation settings I and III are similar.
Figure 5. Simulation setting V: estimated vs true dispersions for an example pseudo-dataset.
Dispersions with gene-wise log geometric mean counts below the median (log mean from −2.17 to 4.49) are shown in black, while those above the median (log mean from 4.49 to 12.3) are shown in light blue. Overlapping points are shown in dark blue. Results for simulation settings IV and VI are similar.
Figure 6. Simulation setting I: areas under ROC curves.
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
Figure 7. Simulation setting II: areas under ROC curves.
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
Figure 8. Simulation setting III: areas under ROC curves.
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
Figure 9. Simulation setting IV: areas under ROC curves.
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
Figure 10. Simulation setting V: areas under ROC curves.
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
Figure 11. Simulation setting VI: areas under ROC curves.
Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.
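The AUC summaries in Figures 6 to 11 score how well a test ranks truly differentially expressed genes ahead of the rest. As a minimal sketch of that idea, the Python function below computes the area under the ROC curve through its rank (Mann-Whitney) form, using 1 minus the p-value as the score. The function name, the scoring choice, and the toy p-values are assumptions for illustration; this record does not state how the study computed its AUC values.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], is_de: Sequence[bool]) -> float:
    """Area under the ROC curve via the Mann-Whitney statistic.

    scores: larger means stronger evidence of differential expression.
    is_de:  True for genes that are truly differentially expressed.
    Returns the fraction of (DE, non-DE) pairs in which the DE gene has
    the higher score, counting ties as one half.
    """
    if len(scores) != len(is_de):
        raise ValueError("scores and is_de must have the same length")
    pos = [s for s, d in zip(scores, is_de) if d]
    neg = [s for s, d in zip(scores, is_de) if not d]
    if not pos or not neg:
        raise ValueError("need at least one DE and one non-DE gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy p-values for four genes (made-up numbers, not study results).
    p_values = [0.01, 0.20, 0.03, 0.50]
    truth = [True, False, True, False]
    print(roc_auc([1.0 - p for p in p_values], truth))
```

In a simulation, the truth labels are known by construction, so one AUC can be computed per pseudo-dataset and per combination of dispersion estimator and test, which gives the kind of values a boxplot would summarize.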

Grants and funding

This work is supported in part by the National Science Foundation Grant IOS-1127017. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.