---
title: "Analysis of Liver Hepatocellular Carcinoma Data in R"
author: 'Bo Yang, Rui Nie, Yueying Hu'
output:
  pdf_document:
    fig_width: 3
    fig_height: 2
    latex_engine: xelatex
header-includes:
  - \usepackage{geometry}
  - \geometry{left=0.6in, right=0.6in, top=0.6in, bottom=0.6in}
  - \usepackage{setspace}
  - \singlespacing
  - \usepackage{enumitem}
  - \setlist[itemize]{noitemsep, topsep=0pt}
  - \setlist[enumerate]{noitemsep, topsep=0pt}
fontsize: 8pt
bibliography: references.bib
---
```{r, setup, message=FALSE, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = TRUE, fig.width=6, fig.height=4)
```
```{r, message=FALSE, include=FALSE}
rm(list = ls())
# Load packages
library("TCGAbiolinks")
library("limma")
library("edgeR")
library("glmnet")
library("caret")
library("SummarizedExperiment")
library("gplots")
library("survival")
library("survminer")
library("RColorBrewer")
library("gProfileR")
library("genefilter")
library("jsonlite")
library("rpart")
library("rpart.plot")
library("randomForest")
library("pROC")
library("data.table")
library("clusterProfiler")
library("org.Hs.eg.db")
# library(countToFPKM)
```
# Introduction
Hepatocellular carcinoma (LIHC/HCC) is one of the leading causes of cancer death, accounting for 8.2% of global cancer incidence and 4.7% of cancer-related mortality [@fitzmaurice2017global]. Previous studies suggested that lifestyle factors and personal habits, including alcohol consumption and smoking, are positively associated with the onset of LIHC [@kuper2000tobacco], which has greatly improved the efficiency of screening for the disease by health providers. Nevertheless, the disease remains difficult to detect, so an indicator with higher sensitivity would advance the stage at diagnosis and prolong patients' survival.
RNA sequencing (RNA-seq) uses high-throughput sequencing to provide insight into the transcriptome of a cell. By examining the gene expression patterns reflected in RNA-seq results, we aim to locate indicator genes for LIHC with the help of statistical learning methods. Furthermore, given the massive number of genes in the human genome, our problem is high-dimensional, and one of our side goals is to perform variable selection to reduce the risk of overfitting.
# Methods
## LIHC data from TCGA
We are primarily interested in analyzing RNA-seq data provided through The Cancer Genome Atlas (TCGA) for liver cancer patients. The dataset includes 424 LIHC samples and expression levels for over 20,000 genes. The inclusion of both normal and tumor samples allows for comparative analysis between cancerous and non-cancerous liver tissue. We used the `GDCquery`, `GDCdownload`, and `GDCprepare` functions from TCGAbiolinks to download and prepare the LIHC data.
```{r, message=FALSE, results='hide'}
query_TCGA_LIHC = GDCquery(project = "TCGA-LIHC", data.category = "Transcriptome Profiling",
experimental.strategy = "RNA-Seq", workflow.type = "STAR - Counts",
data.type = 'Gene Expression Quantification')
GDCdownload(query = query_TCGA_LIHC)
tcga_data_LIHC = GDCprepare(query_TCGA_LIHC)
# transform a large summarized experiment into readable matrix
count=assay(tcga_data_LIHC)
```
Normalizing read counts helps account for gene length and sequencing depth. Fragments Per Kilobase of transcript per Million mapped reads (FPKM) provides a comparable measure for interpreting and using RNA-seq results. We standardized our data into FPKM format for fair representation and comparison of gene expression across samples.
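Concretely, for gene $i$ in a given sample, the FPKM value computed below is
$$
\mathrm{FPKM}_i = \frac{q_i \times 10^9}{\ell_i \sum_{j} q_j},
$$
where $q_i$ is the raw read count of gene $i$ in that sample, $\ell_i$ is the gene length in base pairs, and $\sum_j q_j$ is the sample's total read count.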
```{r}
## load gene annotation data
file.annotations <- system.file("extdata", "Biomart.annotations.hg38.txt",
package="countToFPKM")
gene.annotations <- read.table(file.annotations, sep="\t", header=TRUE)
## combine gene annotation and expression data
genelength=data.frame(gene=gene.annotations$Gene_ID,
length=gene.annotations$length,
type=gene.annotations$Gene_type)
rownames(count)<-substring(rownames(count),1,15)
df_temp = merge(genelength, count, by.x="gene",by.y="row.names")
## FPKM calculation by hand (per-sample normalization by library size and gene length)
countToFpkm <- function(counts, effLen)
{
  ## counts: raw read counts for one sample; effLen: gene length in bp
  N <- sum(counts)
  exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}
## apply per sample so each column is scaled by its own total read count
df <- apply(df_temp[, -c(1:3)], 2, countToFpkm, effLen = df_temp$length)
rownames(df)<-df_temp$gene; X = as.matrix(t(df));dim(df)
```
## Preprocessing
We collected the sample IDs of normal and tumor samples separately and retrieved their indices for the later train-test split. In summary, of the 424 samples, 50 are normal and 374 are tumor. Because the labels are imbalanced, we consider metrics beyond crude accuracy for the classification tasks, such as precision, F1, and AUC-ROC, to account for the potential bias this imbalance introduces.
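For reference, writing TP, FP, TN, and FN for the true/false positives and negatives of the positive class (note that `confusionMatrix` from caret treats the first factor level, here `normal`, as the positive class unless `positive` is specified), the metrics used below are
$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \quad
\text{Specificity} = \frac{TN}{TN + FP}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
F_1 = \frac{2\,\text{Precision}\cdot\text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.
$$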
```{r}
## retrieve the label for each sample
sample_ID = colnames(df)
tumor_sample_ID = sample_ID[which(substring(sample_ID,14,15)!=11)]
normal_sample_ID = sample_ID[which(substring(sample_ID,14,15)==11)]
## index for each label group
tumor_index = which(sample_ID %in% tumor_sample_ID)
normal_index = which(sample_ID %in% normal_sample_ID)
label = rep(0, length(sample_ID)); label[tumor_index] = 1
label = factor(label, labels = c("normal", "tumor")); table(label)
```
## Variable Selection
Of all the variables (genes) in our preprocessed data, we first filter out those with constant values across samples, as they provide no information for differentiating tumor samples from normal samples. This removes 381 genes.
```{r}
sd_x = apply(X, 2, sd)
ind = which(sd_x == 0)
X = X[, -ind];length(ind)
```
To select variables while keeping the process interpretable, we use variance filtering (based on the interquartile range of expression) to keep the genes with the most variable expression across the included samples. We set the cutoff at 0.95, so only the top 5% (985) of the remaining genes are carried forward for statistical analysis.
```{r}
Xnew = t(varFilter(t(X), var.func=IQR, var.cutoff = 0.95, filterByQuantile = TRUE));dim(Xnew)
```
## Statistical Learning
To build predictive models for LIHC classification, we continue with the 985 selected genes and fit Elastic Net, decision tree, and random forest models, assessing their performance under a train-test split. We also use cross-validation to tune the hyperparameters of each model and compare their performance in terms of specificity, sensitivity, precision, F1, and AUC-ROC scores.

## Train-Test Split
Given the imbalanced labels of the two classes, it is important to preserve the representation of both classes in the training and testing samples so that the models are not overly biased towards one class. We therefore use `createDataPartition` to create a stratified 3:1 train-test split.
```{r}
set.seed(123)
train_index = createDataPartition(label, p = 0.75, list = FALSE)
trainX = Xnew[train_index, ]; testX = Xnew[-train_index, ]
train_label = label[train_index]; test_label = label[-train_index]
```
## Elastic Net
Elastic Net is a penalized regression method that combines the L1 and L2 penalties of lasso and ridge regression. The L1 penalty shrinks some coefficients exactly to zero, which amounts to variable selection, while the L2 penalty shrinks all coefficients towards zero. Following the logistic regression framework, we use the formula below as the loss function for Elastic Net in classification problems:
$$
\min_{\beta_0, \beta} \frac{1}{N} \sum_{i=1}^N \log(1 + \exp(-y_i(\beta_0 + x_i^T \beta))) + \lambda \left[ \frac{1}{2} (1 - \alpha) \sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p |\beta_j| \right]
$$
We determine the alpha and lambda coefficients through a grid search, where alpha controls the balance between the L1 and L2 penalties and lambda controls the strength of the penalty.
```{r, warning=FALSE, message=FALSE}
## grid search for hyperparameter tuning
set.seed(123)
alpha_grid = seq(0, 1, 0.1)
lambda_grid = seq(0.001, 0.1, 0.001)
cv_control <- trainControl(method = "cv", number = 5,classProbs = TRUE)
elastic <- train(x = trainX, y = train_label, method = "glmnet", trControl = cv_control,
family = "binomial", tuneGrid = expand.grid(alpha = alpha_grid, lambda = lambda_grid))
elastic$bestTune
```
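As a sanity check on the caret grid search, the lambda path at the selected alpha can also be cross-validated directly with `cv.glmnet`; a minimal sketch (not used for the reported results):
```{r, eval=FALSE}
## Optional check (sketch): cross-validate lambda at the caret-selected alpha
set.seed(123)
cv_fit <- cv.glmnet(trainX, train_label, family = "binomial",
                    alpha = elastic$bestTune$alpha, nfolds = 5,
                    type.measure = "class")
cv_fit$lambda.min  # compare with the lambda chosen by caret
```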
```{r}
## Evaluate the performance of Elastic Net
set.seed(123)
## refit glmnet at the hyperparameters selected by the grid search above
elastic_model = glmnet(trainX, train_label, family = "binomial",
                       alpha = elastic$bestTune$alpha, lambda = elastic$bestTune$lambda)
elastic_pred = predict(elastic_model, testX, type = "class")
result.elastic = confusionMatrix(factor(elastic_pred), test_label)$byClass[c(1, 2, 5, 7)]
## Variables selected by Elastic Net
coef.elastic = coef(elastic_model)
idx = which(coef.elastic[,1]!=0)
variables <- row.names(coef.elastic)[idx]
variables <- variables[!variables %in% c('(Intercept)')]
names = rownames(rowData(tcga_data_LIHC))
names = gsub("\\..*","", names)
genes <- rowData(tcga_data_LIHC)[match(variables, names), "gene_name"]; genes[1:10]
```
## Decision Tree
A decision tree is a non-parametric supervised learning method that can be used for both classification and regression problems. It works by recursively partitioning the data into smaller subsets based on the most informative variable at each step; the partitioning is repeated until the data are completely separated or a stopping criterion is met. We display the gene names of the most important variables selected by the pruned tree.
```{r}
## Pruning the tree by ten-fold cross validation
rt = rpart(train_label~., data = data.frame(trainX))
cp = rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"]
rt.prune = prune(rt, cp = cp)
variables <- rt.prune$variable.importance[order(rt.prune$variable.importance, decreasing = TRUE)]
genes <- rowData(tcga_data_LIHC)[match(names(variables), names), "gene_name"]; genes
```
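Since `rpart.plot` is already loaded, the pruned tree can also be visualized directly; a minimal sketch:
```{r, fig.width=5, fig.height=3}
## Visualize the pruned classification tree (split labels are Ensembl gene IDs)
rpart.plot(rt.prune, type = 2, extra = 104, main = "Pruned decision tree")
```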
```{r}
## Evaluate the performance of Decision Tree
rt_pred = predict(rt.prune, newdata = data.frame(testX), type = "class")
result.decision = confusionMatrix(factor(rt_pred), test_label)$byClass[c(1, 2, 5, 7)]
```
## Random Forest
Random forest is an ensemble learning method that combines multiple decision trees to improve predictive performance. It builds many decision trees on bootstrap samples of the data, considering a random subset of variables at each split, and aggregates their predictions by majority vote for classification. A random forest is typically more robust to overfitting and less sensitive to outliers than a single decision tree. Here, we use the `randomForest` function with its default settings to build the model.
```{r}
set.seed(123)
rf = randomForest(x = trainX, y = train_label, importance=TRUE)
## Evaluate the performance of Random Forest
rf.pred = predict(rf, newdata = testX)
result.rf = confusionMatrix(factor(rf.pred), test_label)$byClass[c(1, 2, 5, 7)]
```
Further, we explore variable importance using two measures: Mean Decrease Accuracy and Mean Decrease Gini. The gene TSPAN6 is the most important predictor under both measures, which is consistent with the results of the decision tree and the elastic net regression.
```{r}
imp <- rf$importance
variables <- names(head(sort(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)))
genes <- rowData(tcga_data_LIHC)[match(variables, names), "gene_name"]; genes
variables <- names(head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE)))
genes <- rowData(tcga_data_LIHC)[match(variables, names), "gene_name"]; genes
```
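The same two importance measures can also be inspected graphically with `varImpPlot` from randomForest; a minimal sketch:
```{r, fig.width=6, fig.height=4}
## Plot the top genes by Mean Decrease Accuracy and Mean Decrease Gini
varImpPlot(rf, n.var = 10, main = "Random forest variable importance")
```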
## Discussion
Overall, the three statistical learning methods have similar performance in terms of specificity and raw accuracy, and they all identify similar genes as the most important predictors ("TSPAN6", "TNMD", "DPM1", "SCYL3", "C1orf112", "FGR"). However, the single decision tree has the lowest sensitivity, F1 score, and AUC-ROC score, which suggests that it overfits the training data. In comparison, the elastic net and random forest models are more robust to this imbalanced dataset. All code is available on GitHub: <https://github.com/rrrrn/LIHC625/tree/main>
```{r}
data.frame(elastic = result.elastic, decision = result.decision, rf = result.rf)
```
```{r, warning=FALSE, message=FALSE, echo=FALSE, results='hide', out.width="50%"}
## Plot the ROC curves
elastic_prob = as.numeric(predict(elastic_model, testX, type = "response"))
roc(test_label, elastic_prob, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=2, print.auc=TRUE)
decision_prob = predict(rt.prune, newdata = data.frame(testX), type = "prob")
plot.roc(test_label, decision_prob[,2], percent=TRUE, col="#4daf4a", lwd=2, print.auc=TRUE, add=TRUE, print.auc.y=30)
rf_prob = predict(rf, newdata = testX, type = "prob")
plot.roc(test_label, rf_prob[,2], percent=TRUE, col="#b237b8", lty=2, lwd=2, print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Elastic Net", "Decision Tree", "Random Forest"), col=c("#377eb8", "#4daf4a","#b237b8"), lwd=2)
```
## Member Contribution
| Member | Contribution |
|----------|-------------------------------------------------------------|
| Bo Yang | Data preprocessing, data analysis; wrote the LIHC data from TCGA and Preprocessing sections of the report |
| Rui Nie | Variable selection, data analysis; wrote the Introduction, Variable Selection, and Elastic Net sections of the report |
| Yueying Hu | Data collection, data analysis; wrote the Discussion, Decision Tree, and Random Forest sections of the report |
## References