Hi,
I am currently analysing RNA-seq data, and I realized that the GTF file (Ensembl v.96) which I use for mapping contains ~19000 clone-based (Ensembl) genes.
Some of them share exons, in terms of genomic location, with their 'parent' protein-coding genes.
I am considering removing these clone-based genes, as they will affect the genome alignment statistics (especially the multi-mappers) and the transcript quantification which I plan to do later on.
The total number of genes in the Ensembl hg38.p12 GTF file is ~58000. So if I remove the clone-based ones, I am left with ~37000 genes.
Would it be a good call to remove these clone-based genes from the GTF file? Or would it lower the power of the analyses?
Examples:
- where non-coding Z83844.1 (Clone-based (Ensembl) gene) exons overlap with NOL12:
- where coding AC008403.1 exons overlap with the CYTH2 gene:
I would appreciate your suggestions. Thank you in advance.
Can you post a screenshot from a genome browser showing such a gene? The term is non-standard. Do you mean different isoforms? If so, then no, absolutely do not remove them. The cell uses isoforms for a reason, so they might be biologically meaningful. It is also important to correct for isoform usage during analysis.
Say cell A only expresses isoA with length L1 and cell B only expresses isoB with length L1*1.5 (it is 1.5 times longer): at the same expression level you will get 1.5 times more reads from isoB, which would then come out as differentially expressed purely due to length bias. A recommended way of handling this is to use the tximport package, which calculates offsets for the linear models of the DEG analysis based on transcript length. A recommended pipeline is salmon-tximport-DEG, with DEG being e.g. DESeq2, edgeR or limma.
Sure. I attached a link to a screenshot in the question.
I do not refer to them as isoforms, as they do not share the same gene source: their transcripts are generated separately from their exons, which do not have a consistent description between Ensembl genome versions.
I could restrict to protein-coding genes and transcripts when using tximport to get rid of them in the downstream analyses, but wouldn't the reads that map to the exons of these clone-based genes already have skewed the numbers by the time I reach the tximport stage?
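For anyone who wants to compare runs with and without these genes, here is a minimal Python sketch of how the clone-based records could be split out of the GTF instead of being deleted outright. It rests on an assumption that should be checked against the actual file: that clone-based genes can be recognised by a gene_name that looks like an accession plus version (e.g. AC008403.1, Z83844.1). The regular expression, the function names and the example records (coordinates and gene IDs are made up) are all hypothetical.

```python
import re

# Assumption, not an Ensembl rule: clone-based gene names look like an
# accession with a version suffix, e.g. "AC008403.1" or "Z83844.1".
CLONE_NAME = re.compile(r"^[A-Z]{1,2}\d{5,6}\.\d+$")
# Matches key "value" pairs in the 9th (attribute) column of a GTF line.
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTRIBUTE.findall(field))


def is_clone_based(gtf_line):
    """True if the record's gene_name matches the accession-like pattern."""
    if gtf_line.startswith("#"):
        return False  # header/comment lines are never flagged
    columns = gtf_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return False
    name = parse_attributes(columns[8]).get("gene_name", "")
    return bool(CLONE_NAME.match(name))


def split_gtf(lines):
    """Return (kept, flagged) lists of GTF lines, preserving order."""
    kept, flagged = [], []
    for line in lines:
        (flagged if is_clone_based(line) else kept).append(line)
    return kept, flagged


# Made-up records for illustration only (fake IDs and coordinates).
EXAMPLE = [
    "#!genome-build GRCh38.p12",
    '22\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "FAKE1"; gene_name "NOL12";',
    '22\thavana\tgene\t150\t800\t.\t+\t.\tgene_id "FAKE2"; gene_name "Z83844.1";',
    '19\thavana\tgene\t200\t700\t.\t+\t.\tgene_id "FAKE3"; gene_name "CYTH2";',
    '19\thavana\tgene\t250\t650\t.\t+\t.\tgene_id "FAKE4"; gene_name "AC008403.1";',
]

kept, flagged = split_gtf(EXAMPLE)
print(len(kept), "records kept;", len(flagged), "records flagged as clone-based")
```

Writing the flagged records to a separate file rather than discarding them makes it possible to quantify against both annotations and see how much the multi-mapper and gene-level numbers actually change.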