GSEA on non-model organisms: How to handle duplicated KEGG KO
6 months ago
mksaad28 • 0

Hi!

I am using clusterProfiler's gseKEGG to perform pathway enrichment on RNA-seq data from my non-model organism (Scomber scombrus). My organism does not have an organism code in KEGG, so to perform enrichment analysis I used a combination of GhostKOALA and eggNOG-mapper to obtain KEGG Orthology (KO) identifiers for my genes. Since many genes are similar (e.g., isoforms), the same KEGG_ko often appears more than once in my gene list (screenshotted below as an example).

An example of a KEGG_ko that is duplicated in my data

When I run gseKEGG, I get this warning: "Warning: There are duplicate gene names, fgsea may produce unexpected results." I saw a previous post about a similar situation but did not find a clear solution.

I am wondering what the best way to handle the duplicated KEGG KOs is. Right now, I have a geneList of log2FC values, ranked in decreasing order. Two ideas that came to mind are 1) to average the log2FCs for each KEGG_ko and use that as the unique entry for gseKEGG, or 2) to use the entrezgene_id with the highest baseMean as the unique entry for gseKEGG.
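
A minimal sketch of idea 1 (averaging log2FC per KO) in base R. The table `res` and its columns `gene`, `ko`, and `log2FC` are hypothetical toy data, not the actual results; genes without a KO assignment are assumed to have been dropped beforehand.

```r
# Hypothetical toy table: one row per gene with its KO and log2FC.
res <- data.frame(
  gene   = c("g1", "g2", "g3", "g4"),
  ko     = c("K00001", "K00001", "K00002", "K00003"),
  log2FC = c(1.0, 3.0, -0.5, 0.2),
  stringsAsFactors = FALSE
)

# One value per KO: the mean log2FC of all genes mapped to that KO.
ko_mean <- tapply(res$log2FC, res$ko, mean)

# Build a named numeric vector sorted in decreasing order,
# which is the shape of input gseKEGG takes as geneList.
kegglist_fc_sort <- sort(
  setNames(as.numeric(ko_mean), names(ko_mean)),
  decreasing = TRUE
)
```

The resulting vector has exactly one entry per KO, so the duplicate-name warning should no longer be triggered by this input.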

Does either of these seem appropriate? Any other suggestions would be appreciated!

I am using clusterProfiler_4.10.1 and am implementing gseKEGG as follows:

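set.seed(12345)

# kegglist_fc_sort: named numeric vector of log2FC values
# (names are KO IDs), sorted in decreasing order
gseaKegg <- gseKEGG(
  geneList = kegglist_fc_sort,
  organism = 'ko',
  minGSSize = 10,
  maxGSSize = 500,
  eps = 0,
  pvalueCutoff = 0.05,
  verbose = FALSE,
  nPermSimple = 10000,
  seed = TRUE)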
KEGG GSEA clusterProfiler GhostKOALA • 704 views
6 months ago
alserg ▴ 980

First of all, you should get rid of the duplicate gene names. Making this a warning instead of an error was a mistake on our part in fgsea; we'll try to change it. The problem stems from the fact that GSEA's null hypothesis assumes the genes are independent of one another, so dependence within the ranking (such as several entries for the same KO) distorts the null distribution.
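
A quick way to see which names are duplicated before running the enrichment, as a small base R sketch. The `geneList` vector here is hypothetical toy data standing in for the real ranked list.

```r
# Hypothetical ranked list in which one KO appears twice.
geneList <- c(K00001 = 2.1, K00002 = 1.3, K00001 = 0.4, K00003 = -1.7)

# KO IDs that occur more than once among the names.
dup_names <- unique(names(geneList)[duplicated(names(geneList))])

if (length(dup_names) > 0) {
  message("Duplicated names: ", paste(dup_names, collapse = ", "))
}
```

If `dup_names` is empty, the names are unique and the input is safe to pass on in that respect.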

As for how, there are multiple options, but importantly the choice should be made without taking the results of differential expression into account (e.g. don't select the gene with the lowest p-value). Selecting the one gene with the highest baseMean is a good option. You can also sum up the counts of the different genes for the same KO prior to doing differential expression. Summing logFC could be reasonable as well, as it is similar to summing log-expressions, but there can be nuances.
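
The first two options could be sketched in base R roughly as follows. Everything here is toy data: the table `res`, its column names, and the `counts` matrix are hypothetical, and genes without a KO are assumed to have been removed already.

```r
# Hypothetical per-gene results with KO annotation.
res <- data.frame(
  gene     = c("g1", "g2", "g3", "g4"),
  ko       = c("K00001", "K00001", "K00002", "K00003"),
  baseMean = c(50, 400, 120, 80),
  log2FC   = c(1.0, 3.0, -0.5, 0.2),
  stringsAsFactors = FALSE
)

# Option A: per KO, keep the gene with the highest baseMean.
# Sorting by baseMean first means !duplicated() keeps the top gene per KO.
res_sorted <- res[order(res$baseMean, decreasing = TRUE), ]
res_unique <- res_sorted[!duplicated(res_sorted$ko), ]
geneList <- sort(setNames(res_unique$log2FC, res_unique$ko),
                 decreasing = TRUE)

# Option B: sum raw counts per KO before differential expression.
# Hypothetical count matrix: rows are genes, columns are samples.
counts <- matrix(
  c(10, 20, 5, 7,    # sample s1
    30, 40, 6, 9),   # sample s2
  nrow = 4,
  dimnames = list(res$gene, c("s1", "s2"))
)
ko_counts <- rowsum(counts, group = res$ko)  # one row per KO
```

With option A the selection depends only on baseMean, not on log2FC or p-values, which matches the requirement above. With option B, `ko_counts` would then replace the gene-level matrix as input to the differential expression step.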

6 months ago
Andres ▴ 20

I think duplicated KEGG codes are expected in a pathway enrichment analysis. Are you sure this is not due to "entrezgene_id" or "gene_symbol" duplicates? Isoforms may be collapsed into a single gene, depending on your question.

If you only have a single copy of each KO in your gene list, there would be no overrepresentation/enrichment to begin with; each KO would have the same probability of appearing. I guess if there is a higher-level hierarchical grouping among different KOs, you can compare that. But choosing a gene or averaging fold changes doesn't feel right. I've never worked with clusterProfiler; I've done enrichment with other tools, and this strikes me as weird.
