Dear Biostars-Community,
I am new to RNA-seq analysis (I have a molecular biology/genetics background) but I don´t have someone to confirm whether my pipeline is appropiate for my biological question. Therefore, I would appreciate if somebody more knowledgeable than me can take a look at the pipeline and the form of normalization and tell me whether it makes sense or not. I have treated samples and control samples to compare them to, both taken as triplicates. Another dimension that comes into play is the timepoint of taking the samples. First batch of triplicates was taken after 5h and one after 15h. Earlier samples have very poor RNA quality due to treatment. As I want to do comparisons between samples, I decided to perform the edgeR TMM normalization method to detect DE after using rsem to get expected counts (reads were quality trimmed and deduplicated). After using (y=DGEList)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
I used the following to generate the design and QL dispersion estimation(which Im still not 100% sure what it is), an calculated the logfc using gmlTreat (only show one contrast here)
design <- model.matrix(~0+group, data=y$samples)
colnames(design) <- levels(y$samples$group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
C3y15hvsI3y15h_log2 <- glmTreat(fit, contrast=c(1,0,-1,0))
(...)
The (probably unelegant) design looks like this
group lib.size norm.factors
C15h C5h IS15h IS5h
C5h_R1 0 1 0 0
C5h_R2 0 1 0 0
C5h_R3 0 1 0 0
C15h_R1 1 0 0 0
C15h_R2 1 0 0 0
C15h_R3 1 0 0 0
I5h_R1 0 0 0 1
I5h_R2 0 0 0 1
I5h_R3 0 0 0 1
I15h_R1 0 0 1 0
I15h_R2 0 0 1 0
I15h_R3 0 0 1 0
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
The library sizes and norm factors look like this
group lib.size norm.factors
C5h_R1 C5h 3570360 0.8834285
C5h_R2 C5h 3976661 1.0916737
C5h_R3 C5h 3209819 0.9523015
C15h_R1 C15h 4992727 0.6459741
C15h_R2 C15h 4724964 0.7069480
C15h_R3 C15h 4235307 0.6576700
I5h_R1 I5h 2763710 1.3489477
I5h_R2 I5h 2713486 1.3310887
I5h_R3 I5h 2968483 1.4400839
I15h_R1 I15h 4676785 0.9398597
I15h_R2 I15h 4430954 1.2580874
I15h_R3 I15h 4966053 1.1857328
The samples that have a very low RIN are I5h_R1 to _R3 (hence, had a high duplication rate) . I assume dedupliaction is the reason for the small library size? I also assume that C5h_R2 was used as reference sample? Now, besides Goseq, I want to generate a heatmap and use cpm() to get normalized expression values.I also plotted the log values of unnormalized and TMM normalized values and got this
Apparantly, the samples, especially the I5h, have a lot of highly or lowly exressed genes? So I started to wonder whether EdgeR is the right choice, as it is filtering out the top and bottom M and A values during normalization, if I understood correctly? Furthermore, I wanted to look a specific genes between samples and plot their fold change in a bar plot. I found that oftentimes while eg. Gene A would see an increase of expression in eg. I5h compared to C5h according to rsem generated TPM, the same Gene A experiences an appareant decrease of expression when using EdgeR normalized values (disrearding p value and FDR at this point). Now I´m worried to loose information or discover false positives because I do not use appropriate normalization. I understand that normalization of RNA-seq data is a heavily debated topic and I lack experience and knowledge to make an educated choice on which would be a good method for my problem. So I would highly appreciate any input on whether the pipeline so far makes sense or whether I should try something different. If any important information is missing at this point, I can gladly add it. Thank you to anyone who even reads this wall of text.