Hi!
I am using clusterProfiler's gseKEGG to perform pathway enrichment on RNA-seq data from a non-model organism (Scomber scombrus). My organism does not have an organism code in KEGG, so to perform the enrichment analysis I used a combination of GhostKOALA and eggNOG-mapper to assign KEGG Orthology (KO) identifiers to my genes. Since many genes are similar (e.g., isoforms), the same KEGG_ko often appears more than once in my gene list (screenshotted below as an example).
When I run gseKEGG, I get this warning: "Warning: There are duplicate gene names, fgsea may produce unexpected results." I saw a previous post about a similar situation but did not find a clear solution.
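As far as I can tell, the warning comes from repeated names in the ranked vector. Here is a rough sketch of how I check for them; `toy_fc` is a made-up stand-in for my real geneList, and the KO identifiers and values in it are only for illustration:

```r
# Toy ranked vector standing in for my real geneList (names and values are made up)
toy_fc <- c(K00001 = 2.0, K00001 = 1.0, K00002 = -0.25, K00003 = 0.5, K00003 = -1.5)

# Index of the first repeated name; 0 would mean every name is unique
first_dup <- anyDuplicated(names(toy_fc))

# Number of entries per KO, keeping only the KOs that occur more than once
ko_counts <- table(names(toy_fc))
dup_kos <- ko_counts[ko_counts > 1]
print(dup_kos)
```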
I am wondering what the best way to handle the duplicated KEGG KOs is. Right now, my geneList is the log2FC values ranked in decreasing order. Two ideas came to mind: 1) average the log2FC values for each KEGG_ko and use that as the unique entry for gseKEGG, or 2) keep only the entrezgene_id with the highest baseMean for each KEGG_ko and use its log2FC as the unique entry for gseKEGG.
Does either of these seem appropriate? Any other suggestions would be appreciated!
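To make the two options concrete, here is a minimal sketch of what I have in mind. The data frame `res` and its column names (`gene_id`, `KEGG_ko`, `log2FC`, `baseMean`) are hypothetical stand-ins for my annotated results table, and the numbers are invented:

```r
# Toy table standing in for my annotated DE results (all values are made up)
res <- data.frame(
  gene_id  = c("g1", "g2", "g3", "g4", "g5"),
  KEGG_ko  = c("K00001", "K00001", "K00002", "K00003", "K00003"),
  log2FC   = c(2.0, 1.0, -0.25, 0.5, -1.5),
  baseMean = c(100, 500, 50, 10, 200),
  stringsAsFactors = FALSE
)

# Option 1: average the log2FC of all genes sharing a KO
mean_fc <- tapply(res$log2FC, res$KEGG_ko, mean)
geneList_mean <- sort(setNames(as.numeric(mean_fc), names(mean_fc)),
                      decreasing = TRUE)

# Option 2: per KO, keep the log2FC of the gene with the highest baseMean
res_ord <- res[order(res$baseMean, decreasing = TRUE), ]
res_top <- res_ord[!duplicated(res_ord$KEGG_ko), ]
geneList_top <- sort(setNames(res_top$log2FC, res_top$KEGG_ko),
                     decreasing = TRUE)
```

Either vector has one entry per KO and is sorted in decreasing order, so it could be passed to gseKEGG in place of the duplicated list. Which of the two is statistically more sensible is exactly what I am unsure about.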
I am using clusterProfiler_4.10.1 and am implementing gseKEGG as follows:
set.seed(12345)
gseaKegg <- gseKEGG(
  geneList     = kegglist_fc_sort,  # ranked log2FC, named by KEGG_ko
  organism     = 'ko',              # generic KO reference instead of an organism code
  minGSSize    = 10,
  maxGSSize    = 500,
  eps          = 0,
  pvalueCutoff = 0.05,
  verbose      = FALSE,
  nPermSimple  = 10000,
  seed         = TRUE)