I performed a de novo assembly (of RNA-seq reads) of the transcriptome of my target organism by means of Trinity.
Next, I followed the Trinity pipeline and scripts to get the following data matrices about the assembled genes:
FPKM
TPM
TMM
My question is:
Which of these matrices (FPKM, TPM or TMM) should I use to perform hierarchical clustering of the genes and draw a heatmap?
I'd like to use TMM because it is normalized across samples (and the Trinity scripts use TMM for clustering and heatmaps).
However, I've seen in some papers that the FPKM values are used instead.
Also, which kind of normalization is better for drawing a heatmap? z-score or centered log2 transformation?
I think VST counts from DESeq2 might be a good choice for heatmaps and MDS, since VST corrects for sequencing depth and composition bias. However, I don't think VST accounts for gene length, and I am not sure whether it is possible to get a length-normalised VST.
The normalization should be performed by the tool you are using (the most popular being edgeR, DESeq2 and limma). Each of them normalizes the data differently, but if your data are robust (one important factor is having enough replicates), they should give similar results.
If you are using Trinity, there is a script called "run_DE_analysis.pl" that performs the normalization (using edgeR, DESeq2 or limma, as you choose) and pairwise comparisons among all of your samples. To learn how to run it, you can follow this Trinity tutorial: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression. As you can read on that page, the script asks for a "matrix of raw read counts (not normalized!)". The tutorial explains every step, including drawing heatmaps.
Now, if you want information on how FPKM, RPKM and TPM work, I find this video useful (and, by the way, all the StatQuest videos are good): https://www.youtube.com/watch?time_continue=608&v=TTUrtCY2k-w. Basically, FPKM, RPKM and TPM normalize for library size (sequencing depth) and transcript length, which should be enough if all your samples come from the same tissue.
I do not know a lot about TMM but, as I understand it, it also adjusts for library composition. That makes it useful if you want to compare different tissues: if a gene is heavily expressed in one tissue and not in the other, it will "absorb" most of the reads and the other genes will seem less expressed. Here is a video explaining how DESeq2 normalizes data:
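To make the depth and length normalization concrete, here is a minimal Python sketch of how FPKM and TPM can be computed from raw counts. The counts and transcript lengths are made-up toy values, and the function names are my own. Real tools such as RSEM use effective transcript lengths rather than raw lengths, so treat this as an illustration and not as what the Trinity pipeline runs.

```python
# Toy data (hypothetical): raw fragment counts and transcript lengths in bp
# for three transcripts in a single sample.
counts = [10, 20, 30]
lengths_bp = [1000, 2000, 1000]


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    # Divide by library size (in millions) and by length (in kilobases).
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print("FPKM:", fpkm_values)
print("TPM: ", tpm_values)
print("TPM sum:", sum(tpm_values))  # TPM sums to one million in every sample
```

Within one sample, TPM is just FPKM rescaled so that the values sum to one million, which is why the two only differ once you compare across samples.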
So in the end it depends on your experiment / data type.
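The "absorbing reads" effect described above can be shown with a small Python sketch. The two samples and their counts are invented. The scaling factor here is a crude median of ratios that I use as a stand-in: it is not edgeR's TMM, which uses a trimmed, weighted mean of log ratios, and it is not DESeq2's exact procedure either.

```python
from statistics import median

# Hypothetical samples sequenced to the same depth (1000 fragments each).
# g1..g4 are truly expressed at the same level in both; g5 is only
# expressed in sample B and takes half of its reads.
sample_a = {"g1": 250, "g2": 250, "g3": 250, "g4": 250, "g5": 0}
sample_b = {"g1": 125, "g2": 125, "g3": 125, "g4": 125, "g5": 500}


def cpm(counts):
    """Counts per million: corrects for library size only."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def composition_factor(cpm_ref, cpm_other):
    """Median ratio over genes detected in both samples (simplified stand-in)."""
    ratios = [
        cpm_other[g] / cpm_ref[g]
        for g in cpm_ref
        if cpm_ref[g] > 0 and cpm_other[g] > 0
    ]
    return median(ratios)


cpm_a = cpm(sample_a)
cpm_b = cpm(sample_b)

# With CPM alone, g1 looks half as expressed in B although it did not change.
print("g1 CPM in A:", cpm_a["g1"], "in B:", cpm_b["g1"])

factor = composition_factor(cpm_a, cpm_b)
adjusted_b = {gene: value / factor for gene, value in cpm_b.items()}
print("composition factor:", factor)
print("g1 adjusted in B:", adjusted_b["g1"])
```

After dividing by the factor, the unchanged genes line up again between the two samples, which is the kind of correction that TMM-style methods aim for.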
Thanks for your response. I had followed the Trinity instructions and scripts to perform differential expression analysis (using the gene counts matrix). The Trinity scripts also provide a means to automatically perform several analyses, including a heatmap where the TMM matrix of differentially expressed genes is represented. Trinity scripts also provide a TPM matrix, and an FPKM matrix can be easily obtained from the RSEM output. However, I'd like to draw additional heatmaps for specific gene sets.
Trinity scripts help to draw a heatmap based on mean-centered log2(TMM+1) values. I thought of using this metric because I compare among samples in my experimental design. However, many papers employ FPKM values instead, others use CPM (counts per million), and so on, even when they compare among samples (as in my case). Additionally, some papers use z-scores instead of a log2 transformation.
By comparing heatmaps drawn with different metrics (TMM, TPM or FPKM) and transformations (log2 or z-score), I got different heatmap coloring patterns and clusters. So my doubt remains: is it better (or more accepted by the scientific community) to use a particular metric and transformation, or is it one's own choice which metric to present? This is just for the specific case of drawing and clustering gene sets in a heatmap.
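One reason the coloring patterns differ between the two transformations can be sketched in a few lines of Python. The expression values are invented, and I assume the z-score is taken on the log2(x+1) values using the sample standard deviation, which may not match what every paper does.

```python
import math
from statistics import mean, stdev


def log2_centered(row):
    """Mean-centered log2(x + 1): keeps the size of the change."""
    logged = [math.log2(x + 1) for x in row]
    m = mean(logged)
    return [v - m for v in logged]


def zscore(row):
    """Per-gene z-score of log2(x + 1): also divides by the gene's spread."""
    logged = [math.log2(x + 1) for x in row]
    m = mean(logged)
    s = stdev(logged)
    if s == 0:
        # A flat gene has no spread; report zeros instead of dividing by zero.
        return [0.0 for _ in logged]
    return [(v - m) / s for v in logged]


# Hypothetical normalized expression (e.g. TMM) over four samples.
big_change = [1, 1, 63, 63]    # 32-fold change on the log2(x + 1) scale
small_change = [7, 7, 15, 15]  # 2-fold change on the log2(x + 1) scale

print("centered big:  ", log2_centered(big_change))
print("centered small:", log2_centered(small_change))
print("z-score big:   ", zscore(big_change))
print("z-score small: ", zscore(small_change))
```

With centering, the gene with the large change gets much stronger colors than the gene with the small change. With z-scores, both genes end up with the same row, because only the shape of the pattern across samples is kept. So the two transformations answer slightly different questions, and the clusters can legitimately differ.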