BMC Genomics. 2012 Jul 7;13:304.
doi: 10.1186/1471-2164-13-304.

Technical and biological variance structure in mRNA-Seq data: life in the real world

Ann L Oberg et al. BMC Genomics.

Abstract

Background: mRNA expression data from next generation sequencing platforms are obtained in the form of counts per gene or exon. Counts have classically been assumed to follow a Poisson distribution, in which the variance is equal to the mean. The Negative Binomial distribution, which allows for over-dispersion (i.e., for the variance to be greater than the mean), is also commonly used to model count data.
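
The two variance assumptions described above can be written out as a minimal sketch. The function names and the dispersion symbol `phi` are illustrative assumptions rather than notation taken from this abstract, and the Negative Binomial form shown (variance = mean + phi * mean^2) is the common parameterization, assumed here for illustration.

```python
def poisson_variance(mu: float) -> float:
    """Poisson assumption: the variance equals the mean."""
    return mu


def nb_variance(mu: float, phi: float) -> float:
    """Negative Binomial assumption (common parameterization, assumed here):
    the variance exceeds the mean by a term that is quadratic in the mean.
    phi is a hypothetical dispersion parameter; phi = 0 recovers the Poisson case.
    """
    return mu + phi * mu ** 2


# For a gene with mean count 100, a dispersion of 0.25 inflates the variance 26-fold.
print(poisson_variance(100.0), nb_variance(100.0, 0.25))
```

With `phi = 0` the two functions agree, which is why the Poisson model can be seen as the no-over-dispersion special case.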

Results: In mRNA-Seq data from 25 subjects, we found that technical variation generally followed a Poisson distribution, as has been reported previously, and that biological variability was over-dispersed relative to the Poisson model. The mean-variance relationship across all genes was quadratic, in keeping with a Negative Binomial (NB) distribution. Over-dispersed Poisson and NB distributional assumptions demonstrated marked improvements in goodness-of-fit (GOF) over the standard Poisson model assumptions, but with evidence of over-fitting in some genes. Modeling of experimental effects improved GOF for high variance genes but increased the over-fitting problem.
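
To illustrate what over-dispersion relative to the Poisson model means for a single gene, the sketch below computes a simple method-of-moments dispersion estimate from replicate counts. This is an assumption-laden illustration, not the estimation procedure used in the paper: the function name `moment_dispersion` is hypothetical, and the estimator simply solves variance = mean + phi * mean^2 for phi, truncating at zero.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of an NB dispersion phi for one gene.

    Solves s^2 = m + phi * m^2 for phi, where m is the sample mean and
    s^2 the sample variance, and truncates at 0 when the data are not
    over-dispersed. Illustrative only; not the estimator from the paper.
    """
    if len(counts) < 2:
        raise ValueError("need at least two counts to estimate a variance")
    m = mean(counts)
    if m == 0:
        return 0.0  # no reads at all: nothing to estimate
    s2 = variance(counts)  # unbiased sample variance
    return max((s2 - m) / m ** 2, 0.0)


# Sample mean 6 and sample variance 10, so phi = (10 - 6) / 36.
print(moment_dispersion([2, 4, 6, 8, 10]))
```

A value of zero is consistent with Poisson-like variation, while larger values indicate variance growing faster than the mean.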

Conclusions: These conclusions will guide development of analytical strategies for accurate modeling of variance structure in these data and sample size determination which in turn will aid in the identification of true biological signals that inform our understanding of biological systems.

Figures

Figure 1
Study design. A) Cartoon depicting the allocation of subject samples to flow cells. One high (H) and one low (L) responder were allocated to each flow cell. Within a flow cell, each patient was randomly allocated to lanes 1-4 or 5-8, ensuring that H/L response was balanced over lanes across all flow cells. Finally, for each subject, two technical replicates of their stimulated and unstimulated specimens were randomly allocated to the first two or second two lanes such that stimulation status was balanced over lanes. B) Flow diagram showing the full initial sample set and the reasons lanes of data were excluded from the final analysis data set.
Figure 2
Distributions of counts. A) Histogram of total reads per lane for 46 lanes (unstimulated specimens) on the scale of millions of reads. B) Frequency histogram of average counts per gene per lane on the log10 scale. C) Cumulative percent of average counts per lane as a function of the percent of genes contributing. Lines for both high (red) and low (blue) responders were drawn, but not distinguishable.
Figure 3
Assessing presence and magnitude of over-dispersion. A) The horizontal axis indicates the mean scaled count within each of the high/low response groups on the square root scale (labeled on the raw scale) and the vertical axis indicates the variation on the standard deviation (i.e. square root of the variance) within each group. Each gene is thus represented by two points, one for each response group. The green line corresponds to the Poisson assumptions, the blue line corresponds to OD Poisson assumptions, and the red line corresponds to NB assumptions, with lines constructed as described in the text. B) Local estimates of φ from the edgeR function versus per-group mean count. The shading indicates density of points in that area with darker shading representing higher density.
Figure 4
Distribution of GOF statistics. Residual QQ plots of model fits normalized with the 75% count and no blocking factor. Tick-marks along the top indicate deciles. The top 5% of GOF statistics are indicated in alternate colors with the top 1% being red and the next 4% being blue. A) Standard Poisson, B) NB with a global estimate of φ, C) NB with per-gene estimates of φ, D) NB with local estimates of φ. Panels E-H are as in A-D but zoomed in on the bottom left corner of the plots.
Figure 5
Distribution of GOF statistics when experimental factors are included in the model. QQ plots of model fits with the NB distribution, local estimates of φ and 75th percentile count offset including blocking factors as indicated. Tick-marks along the top indicate deciles. The top 5% of GOF statistics are indicated in alternate colors with the top 1% being red and the next 4% being blue. A) lane-pair, B) library preparation batch, C) flow cell. Panel D is the same as A, but zoomed in on the bottom left corner of the plot; no zoom is needed for panels B and C.
Figure 6
Distribution of flow cell effects. Box plots of contrast coefficient estimates indicating the difference of flow cells 2 – 13 from flow cell 1 sorted by run order. The flow cells represented by the left four (blue) boxes were analyzed with SCS v 2.01 while the right-most eight (red) were analyzed with SCS v 2.4. A) Results from models without an offset to account for differences in total counts per lane. B) Results from models including the 75th percentile offset.
Figure 7
Understanding causes of poor model fit. A) GOF statistics for genes with an average count per subject <5 are shown in red on the QQ plot from NB, locally estimated variance with no blocking factor models. B) Dot plot demonstrating the large variance in a gene with an extremely large GOF statistic.
