Question

which normalization method for my RNA-seq data set?

1

Entering edit mode

4.6 years ago

nyoegut ▴ 10

Dear Biostars-Community,

I am new to RNA-seq analysis (I have a molecular biology/genetics background) but I don´t have someone to confirm whether my pipeline is appropiate for my biological question. Therefore, I would appreciate if somebody more knowledgeable than me can take a look at the pipeline and the form of normalization and tell me whether it makes sense or not. I have treated samples and control samples to compare them to, both taken as triplicates. Another dimension that comes into play is the timepoint of taking the samples. First batch of triplicates was taken after 5h and one after 15h. Earlier samples have very poor RNA quality due to treatment. As I want to do comparisons between samples, I decided to perform the edgeR TMM normalization method to detect DE after using rsem to get expected counts (reads were quality trimmed and deduplicated). After using (y=DGEList)

keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

I used the following to generate the design and QL dispersion estimation(which Im still not 100% sure what it is), an calculated the logfc using gmlTreat (only show one contrast here)

    design <- model.matrix(~0+group, data=y$samples)
        colnames(design) <- levels(y$samples$group)
        y <- estimateDisp(y, design)
        fit <- glmQLFit(y, design)
    C3y15hvsI3y15h_log2 <- glmTreat(fit, contrast=c(1,0,-1,0))
(...)

The (probably unelegant) design looks like this

group lib.size norm.factors
     C15h C5h IS15h IS5h
    C5h_R1       0    1      0     0
    C5h_R2       0    1      0     0
    C5h_R3       0    1      0     0
    C15h_R1      1    0      0     0
    C15h_R2      1    0      0     0
    C15h_R3      1    0      0     0
    I5h_R1       0    0      0     1
    I5h_R2       0    0      0     1
    I5h_R3       0    0      0     1
    I15h_R1      0    0      1     0
    I15h_R2      0    0      1     0
    I15h_R3      0    0      1     0
    attr(,"assign")
    [1] 1 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$group
    [1] "contr.treatment"

The library sizes and norm factors look like this

group lib.size norm.factors
    C5h_R1     C5h  3570360    0.8834285
    C5h_R2     C5h  3976661    1.0916737
    C5h_R3     C5h  3209819    0.9523015
    C15h_R1   C15h  4992727    0.6459741
    C15h_R2   C15h  4724964    0.7069480
    C15h_R3   C15h  4235307    0.6576700
    I5h_R1   I5h  2763710    1.3489477
    I5h_R2   I5h  2713486    1.3310887
    I5h_R3   I5h  2968483    1.4400839
    I15h_R1 I15h  4676785    0.9398597
    I15h_R2 I15h  4430954    1.2580874
    I15h_R3 I15h  4966053    1.1857328

The samples that have a very low RIN are I5h_R1 to _R3 (hence, had a high duplication rate) . I assume dedupliaction is the reason for the small library size? I also assume that C5h_R2 was used as reference sample? Now, besides Goseq, I want to generate a heatmap and use cpm() to get normalized expression values.I also plotted the log values of unnormalized and TMM normalized values and got this

Apparantly, the samples, especially the I5h, have a lot of highly or lowly exressed genes? So I started to wonder whether EdgeR is the right choice, as it is filtering out the top and bottom M and A values during normalization, if I understood correctly? Furthermore, I wanted to look a specific genes between samples and plot their fold change in a bar plot. I found that oftentimes while eg. Gene A would see an increase of expression in eg. I5h compared to C5h according to rsem generated TPM, the same Gene A experiences an appareant decrease of expression when using EdgeR normalized values (disrearding p value and FDR at this point). Now I´m worried to loose information or discover false positives because I do not use appropriate normalization. I understand that normalization of RNA-seq data is a heavily debated topic and I lack experience and knowledge to make an educated choice on which would be a good method for my problem. So I would highly appreciate any input on whether the pipeline so far makes sense or whether I should try something different. If any important information is missing at this point, I can gladly add it. Thank you to anyone who even reads this wall of text.

RNA-Seq edgeR DeSeq2 TMM TPM • 994 views

ADD COMMENT • link updated 4.6 years ago by GenoMax 146k • written 4.6 years ago by nyoegut ▴ 10

score 2 · Answer 1 · 2020-04-02

Some thoughts (and thanks for all the details, that really allows to understand what you are talking about, people often neglect that).

You typically do not remove duplicates in RNA-seq. It is not standard since duplicates are expected in RNA-seq. Do not do that.
If one sample has poor RIN then it can well be that all differences you see is due to poor RNA quality rather than biological effect. Be sure to validate your findings before making big claims.
FilterByExpr requires a design to properly work. It is not very clear from the manual, I admit that. That is likely the reason some of your genes are missing even though highly expressed in some samples. Without a design edgeR will treat each column of the count matrix as a group and once you have then low count in any sample for that gene it will be gone since with n=1 and low count there is no chance this could ever be significant. Be sure to first define the design and then feed it to the function using the design argument.
the standard normalization is fine. You can check with MA-plots if the majority of genes is centered at y ~ 0. edgeR has a function for this afaik. Not sure, I always use custom functions. Check the manual.
The design by the way is fine as far as I can see.You encode conditions as full factorial (so one factor per group). I do it like this most of the time unless designs are more complex like any co-factors that have to be taken into account.
Don't use the RSEM TPMs for anything. Trust the output of edgeR after making the suggested changes. FDR must be considered when making claims about expression differences, logFC or logCPM alone is not sufficient.
If you do not specifiy a Null hypothesis other than 0 to glmTreat then it is the same as glmQLFTest. This is just a remark, you probably know that.
For quality control a PCA or MDS plot (e.g. plotMDS) can be helpful to detect outliers/bad samples.