forked from jtleek/genstats
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03_09_Calculating_statistics.Rmd
202 lines (144 loc) · 5.43 KB
/
03_09_Calculating_statistics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
---
title: Calculating statistics
author: Jeff Leek
output:
rmarkdown::html_document:
toc: true
vignette: >
%\VignetteIndexEntry{Calculating statistics}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r front, child="./../front.Rmd", echo=FALSE}
```
## Dependencies
This document depends on the following packages:
```{r load_hidden, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
library(devtools)
library(Biobase)
library(limma)
library(edge)
library(genefilter)
})
```
```{r load}
library(devtools)
library(Biobase)
library(limma)
library(edge)
library(genefilter)
```
To install these packages you can use the code (or if you are compiling the document, remove the `eval=FALSE` from the chunk.)
```{r install_packages, eval=FALSE}
install.packages(c("devtools"))
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biobase","limma","genefilter","jdstorey/edge"))
```
## Download the data
Here we are going to use some data from the paper [Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays.](http://www.ncbi.nlm.nih.gov/pubmed?term=21455293) that is a comparative RNA-seq analysis of different mouse strains.
```{r}
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(file=con)
close(con)
bot = bottomly.eset
pdata=pData(bot)
edata=as.matrix(exprs(bot))
fdata = fData(bot)
ls()
```
## Transform the data
Here we will transform the data and remove lowly expressed genes.
```{r}
edata = log2(as.matrix(edata) + 1)
edata = edata[rowMeans(edata) > 10, ]
```
## Calculate t- or F-statistics rapidly
The `genefilter` package lets you compute statistics rapidly for very simple cases (two or multi-group) comparisons. These are not moderated in any way.
```{r}
tstats_obj = rowttests(edata,pdata$strain)
names(tstats_obj)
hist(tstats_obj$statistic,col=2)
```
For multi-group comparisons use the F-statistic
```{r}
fstats_obj = rowFtests(edata,as.factor(pdata$lane.number))
names(fstats_obj)
hist(fstats_obj$statistic,col=2)
```
## Fit many statistics with limma
This approach fits many moderated statistics simultaneously
```{r}
mod = model.matrix(~ pdata$strain)
fit_limma = lmFit(edata,mod)
ebayes_limma = eBayes(fit_limma)
head(ebayes_limma$t)
```
If the model is unadjusted you get the moderated version of the statistic from `rowttests`
```{r}
plot(ebayes_limma$t[,2],-tstats_obj$statistic,col=4,
xlab="Moderated T-stat",ylab="T-stat")
abline(c(0,1),col="darkgrey",lwd=3)
```
## Fit many adjusted statistics with limma
Here we adjust for the lane number, now the test-statistic for the strain is adjusted for the lane number (a surrogate for a batch effect).
```{r}
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number))
fit_limma_adj = lmFit(edata,mod_adj)
ebayes_limma_adj = eBayes(fit_limma_adj)
head(ebayes_limma_adj$t)
```
If the model is adjusted you get the moderated version of the statistic but it doesn't match `rowttests` because it is adjusted
```{r}
plot(ebayes_limma_adj$t[,2],-tstats_obj$statistic,col=3,
xlab="Moderated T-stat",ylab="T-stat")
abline(c(0,1),lwd=3,col="darkgrey")
```
## Calculating a nested model comparison with limma
Sometimes we want to compare the null model to the alternative model with some additional covariates. Here we have to know which coefficients we want to test in the alternative model.
Suppose we wanted to find lane effects then we can fit a limma model and find which coefficients belong to the lane variable.
```{r}
mod_lane = model.matrix(~ as.factor(pdata$lane.number))
fit_limma_lane = lmFit(edata,mod_lane)
ebayes_limma_lane = eBayes(fit_limma_lane)
head(ebayes_limma_lane$t)
```
Then we can get the F-statistics with `topTable`
```{r}
top_lane = topTable(ebayes_limma_lane, coef=2:7,
number=dim(edata)[1],sort.by="none")
head(top_lane)
```
This is again the moderated version of the F-statistic we saw earlier
```{r}
plot(top_lane$F,fstats_obj$statistic,
xlab="Moderated F-statistic",ylab="F-statistic",col=3)
```
## Calculating a nested comparison with edge
We can also perform the unmoderated nested comparisons in the `edge` package, which also has functions for calculating a more powerful [odp statistic](http://genomine.org/papers/SDL_Biostat_2007.pdf)
```{r}
edge_study = build_study(edata, grp = as.factor(pdata$lane.number))
de_obj = lrt(edge_study)
qval = qvalueObj(de_obj)
plot(qval$stat,fstats_obj$statistic,col=4,
xlab="F-stat from edge",ylab="F-stat from genefilter")
```
We can easily adjust for variables by passing arguments to the `adj.var` variable.
```{r}
edge_study2 = build_study(edata, grp = as.factor(pdata$lane.number),
adj.var=pdata$strain)
de_obj2 = lrt(edge_study2)
qval2 = qvalueObj(de_obj2)
plot(qval2$stat,fstats_obj$statistic,col=4,
xlab="F-stat from edge",ylab="F-stat from genefilter")
```
## More information
You can find a lot more information on this model fitting strategy in:
* The [limma paper](http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
* The [limma vignette](http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
## Session information
Here is the session information
```{r session_info}
devtools::session_info()
```
It is also useful to compile the time the document was processed. This document was processed on: `r Sys.Date()`.