Importing data from GEO
1
1
Entering edit mode
2.6 years ago
LHA_trash ▴ 10

Hi everyone,

I am struggling since hours to import expression data from GEO. I am trying to work on this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151095 It is a Superseries. Some scRNAseq but also bulkRNA seq. I am especially interested in the bulkRNA seq samples (in detail: DC2 and DC3 samples with and without TLR as stated in the sample list).

I have tried this approach in R:

library(DESeq2)
library(tidyverse)
library(GEOquery)
gse<-getGEO(GEO='GSE151095', GSEMatrix=TRUE)
gse <- gse[[1]]
metadata <-pData(phenoData(gse))
exprs(gse)

GSM4565845 GSM4565846 GSM4565847 #and so on

So this does not seem to work, it only gives me the GSM characters. I also could not find a .csv file with countdata to import manually. Any advice, on how I can work on this dataset or on GEO datasets in general? Is there a recommended workflow?

Thanks in advance!

GEO GSE R • 2.0k views
ADD COMMENT
2
Entering edit mode
2.6 years ago

Hello,

getGEO() only works for retrieving the expression data for array data, not high throughput sequencing [data] (bulk RNA-seq, scRNA-seq, etc).

You will have to retrieve the data that is in the TAR file (see https://ftp.ncbi.nlm.nih.gov/geo/series/GSE151nnn/GSE151095/suppl/) and then use that manually. To import the MTX scRNA-seq data, you can use DropletUtils (R), or, indeed, Seurat, or something else.

Kind regards,

Kevin

ADD COMMENT
0
Entering edit mode

Hey Kevin, thanks for your answer. Is it correct to import the data for the bulk sequencing from here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151073, I really only want to compare DC2 and DC3 with and without TLR. THose samples are included.

Using the GSE151073_PG_Bulk_Blood_raw_counts.csv.gz?

gse<-getGEO(GEO='GSE151073', GSEMatrix=TRUE)
data<-read_csv("GSE151073_PG_Bulk_Blood_raw_counts.csv.gz") ##https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151073

gse<-gse[[1]]
metadata <-pData(phenoData(gse))

I did this and i got totally different results after DESeq2, than the group who published the results. But I really don't unterstand why. Looking back at the raw counts, their results are impossible. So i think, that I might have done something wrong.

ADD REPLY
0
Entering edit mode

Hi again. In which way were the results different?

ADD REPLY
0
Entering edit mode

I Will try to quickly state my workflow and post their results. This might be a bigger post then.

head(data)
# A tibble: 6 × 45
  ...1       `DC2-CD5neg-d1` `DC2-CD5neg-d2` `DC2-CD5neg-d3` `DC2-CD5neg-d4` `DC2-CD5pos-d1` `DC2-CD5pos-d2` `DC2-CD5pos-d3` `DC2-CD5pos-d4` `DC3-d1` `DC3-d2` `DC3-d3` `DC3-d4` `Mono-CD88pos-…` `Mono-CD88pos-…`
  <chr>                <dbl>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>            <dbl>            <dbl>
1 ENSG00000…               0               2               7               0               2               0               2               0        1        0        0        0                0                3
2 ENSG00000…              32              37              46              32              47              41              31              29       36       33       49       27               28               22
3 ENSG00000…               0               2               0               2               0               0               0               0        0        0        0        0                1                0
4 ENSG00000…               0               0               1               2               0               0               0               0        0        0        0        0                1                3
5 ENSG00000…               0               0               0               0               0               0               0               0        0        0        0        0                0                0
6 ENSG00000…               0               0               0               0               0               0               0               0        0        0        3        0                0                0
# … with 30 more variables: `Mono-CD88pos-d3 ##and so on

I performed tidying to adjust metadata and the data (=countdata)

metadata1 <-select(metadata,c(1,20,48,49,50))
metadata2 <- metadata1 %>%
  filter(title %in% c("DC3_1","DC3_2","DC3_3","DC3_4","DC2_CD5pos_1","DC2_CD5pos_2","DC2_CD5pos_3","DC2_CD5pos_4","DC2_TLR_1","DC2_TLR_3","DC2_TLR_4","DC3_TLR_1","DC3_TLR_3","DC3_TLR_4"))
data2 <-select(data,c(1,6,7,8,9,10,11,12,13,28,29,30,31,32,33)) #those are the corresponding columns to metadata (also DC2 and DC3)
metadata3<-metadata2 %>% 
  mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
metadata4<- metadata3 %>%
  dplyr::rename("celltype"="cell type:ch1") %>%
  dplyr::rename("description"="description.1")%>%
  dplyr::rename("donor"="donor:ch1")%>%
  dplyr::rename("tissue"="tissue:ch1")
metadata5 <- metadata4
vec <- c(rep("DC2_ctrl",4),rep("DC3_ctrl",4),rep("DC2_TLR",3), rep("DC3_TLR",3))

vec2 <- c(rep("DC2",4),rep("DC3",4),rep("DC2",3), rep("DC3",3))

vec3 <- c(rep("Ctrl",8),rep("TLR",6))

metadata6 <- cbind(metadata5,cell_group=vec,celltype2=vec2,stimulation=vec3)
rownames(metadata6) <- metadata6$description
data3 <- data2 %>%
  rename("GeneID"="...1")

Getting these final data.frames

head(data3)
# A tibble: 6 × 15
  GeneID    `DC2-CD5pos-d1` `DC2-CD5pos-d2` `DC2-CD5pos-d3` `DC2-CD5pos-d4` `DC3-d1` `DC3-d2` `DC3-d3` `DC3-d4` `DC2-BTLA-S-d1` `DC2-BTLA-S-d3` `DC2-BTLA-S-d4` `DC3-CD163-S-d1` `DC3-CD163-S-d3` `DC3-CD163-S-d4`
  <chr>               <dbl>           <dbl>           <dbl>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>           <dbl>           <dbl>           <dbl>            <dbl>            <dbl>            <dbl>
1 ENSG0000…               2               0               2               0        1        0        0        0               2               6               2                2                5                0
2 ENSG0000…              47              41              31              29       36       33       49       27               4               6               1                8                9                5
3 ENSG0000…               0               0               0               0        0        0        0        0               0               1               0                2                0                0
4 ENSG0000…               0               0               0               0        0        0        0        0               1               0               0                0                0                0
5 ENSG0000…               0               0               0               0        0        0        0        0               0               0               0                0                0                0
6 ENSG0000…               0               0               0               0        0        0        3        0               0               0               0                0                0                0
head(metadata6)
                 title   description         celltype donor          tissue cell_group celltype2 stimulation
DC2-CD5pos-d1 DC2_CD5pos_1 DC2-CD5pos-d1   CD1c+CD5+cells     1 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC2-CD5pos-d2 DC2_CD5pos_2 DC2-CD5pos-d2   CD1c+CD5+cells     2 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC2-CD5pos-d3 DC2_CD5pos_3 DC2-CD5pos-d3   CD1c+CD5+cells     3 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC2-CD5pos-d4 DC2_CD5pos_4 DC2-CD5pos-d4   CD1c+CD5+cells     4 Peripheralblood   DC2_ctrl       DC2        Ctrl
DC3-d1               DC3_1        DC3-d1 CD1c+CD163+cells     1 Peripheralblood   DC3_ctrl       DC3        Ctrl
DC3-d2               DC3_2        DC3-d2 CD1c+CD163+cells     2 Peripheralblood   DC3_ctrl       DC3        Ctrl

prepareing DESeq

countdata <- data3 %>%
  column_to_rownames("GeneID") %>%
  as.matrix()
design2 <- as.formula(~cell_group)
ddsObj.raw2 <- DESeqDataSetFromMatrix(countData = countdata,
                                     colData = metadata6,
                                     design = design2)
ddsObj2 <- DESeq(ddsObj.raw2)
rDC3vDC2_c<-results(ddsObj2,
                    name="cell_group_DC3_ctrl_vs_DC2_ctrl",
                    alpha=0.05)
rDC3vDC2_t<-results(ddsObj2,
                    contrast=c("cell_group","DC3_TLR","DC2_TLR"),
                    alpha=0.05)

I will stop here and just quickly say what I did next: I used lfcShrink (for the contrast group I used ashr), I used biomaRt to annotate the genes (actually the gene names were listed behind the ENSG-number, so I could see if i did any mistakes and then I plotted the values. My vulcano plot looks totally different then the one from the publication.

Vulcano plot from publication u can see, that theey state e.g. IL1B as upregulated DEG for DC3_TLR, but when i look back at the original raw count data table for IL1B, this is what I see: rawcounts_IL1B I know that those are raw counts, but how can this end up as a DEG upregulated in DC3? (This always refers to the DC3-CD163_s column, which means TLR stimulated)

my plot as comparison, my plot

ADD REPLY
1
Entering edit mode

Hmm, the volcano plot from the publication looks completely bogus. For one, the x-axis values all seem positive, and why is that 10 and not -5?

Some of your genes also appear to have the opposite directionality, in terms of fold-change, but --would you believe-- I have seen this more often than I would like to admit. It is clear that a few --possibly many-- published works contain erroneous results.

ADD REPLY
0
Entering edit mode

Thanks for taking the time to look at it, because I was unsure whether I fucked it up at some place. Well this is sad, as this is a major paper in our field...

ADD REPLY

Login before adding your answer.

Traffic: 864 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6