Molecular signatures database (MSigDB) 3.0

Abstract

Motivation: Well-annotated gene sets representing the universe of biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets.

Results: We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site.

Availability and Implementation: MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.

Contact: gsea@broadinstitute.org

1 INTRODUCTION

Microarrays and other high-throughput genomic technologies typically produce long lists of potentially interesting genes, which are not always easily interpreted. Recognizing the importance of coordinately expressed sets of genes, our seminal paper (Mootha et al., 2003) introduced Gene Set Enrichment Analysis (GSEA) to discover metabolic pathways altered in human type 2 diabetes mellitus. GSEA and other analytical enrichment tools summarize genomic data in prioritized lists of higher-level biological features. As underscored by a recent survey of 68 enrichment tools, they critically depend on ‘backend annotation databases’ (Huang et al., 2009). Typically, such databases focus on a particular domain of knowledge or annotation procedure. For example, Gene Ontology (GO) (Ashburner et al., 2000) represents a hierarchy of controlled terms to describe individual gene products, while TRANSFAC (Matys et al., 2006) stores information about transcription factor binding sites. A growing number of databases obtain sets from gene expression signatures reported in the literature. These include SignatureDB (Shaffer et al., 2006), GeneSigDB (Culhane et al., 2009), CCancer (Dietmann et al., 2010) and L2L and LOLA (Cahan et al., 2007).

The Molecular Signatures Database (MSigDB) differs from these resources in several respects. (i) MSigDB is explicitly designed to provide gene sets for enrichment analysis methods. As such, it is natively integrated with our GSEA software (Subramanian et al., 2005). (ii) MSigDB covers a substantially wider and more diverse range of gene set sources and types. These include signatures extracted from original research publications, and entire collections of sets derived from specialized resources such as GO, KEGG (Kanehisa and Goto, 2000), TRANSFAC and L2L. (iii) MSigDB gene sets are acquired both through manual curation and by automatic computational means, whereas other databases emphasize only one of these approaches. (iv) Finally, MSigDB contains the largest number of gene sets overall.

The initial MSigDB database, released in 2005 with the GSEA software, contained 1325 sets. In contrast, MSigDB 3.0, released in September 2010, includes 6769 sets and a richer set of annotations. Here, we describe the MSigDB 3.0 sets and the accompanying online resource in more detail.

2 RESULTS

Gene set collections: gene sets in MSigDB 3.0 are organized into five collections according to their derivation:

C1: Genes located in the same chromosome or cytogenetic band.
C2: Gene sets representing canonical pathways from pathway resources [including 430 new sets contributed by Reactome (Matthews et al., 2009)], and sets corresponding to chemical and genetic perturbations from 786 scientific publications.
C3: Sets of genes sharing cis-regulatory motifs in their promoter (transcription factor targets) or 3′ UTR (micro-RNA targets) sequences.
C4: Clusters of coexpressed modules defined by computational analysis of large gene expression compendia.
C5: Gene sets corresponding to GO terms.

Table 1 shows the growth of the MSigDB collections and database since the initial release (see also online Release Notes).

Table 1. MSigDB versions and changes in the number of gene sets

Gene set category	1.0 (2005)	2.5 (2008)	3.0 (2010)
C1: positional	319	386	326^a
C2: curated (total)	522	1892	3272
C2: chemical and genetic perturbations	50	1186	2392
C2: canonical pathways	472	639	880
C2: uncategorized	0	66	0
C3: motifs (total)	57	837	836^a
C3: transcription factor targets	57	500	615
C3: micro-RNA targets	0	222	221^a
C3: uncategorized	0	115	0
C4: computational	427	883	881^a
C5: GO terms	0	1454	1454
MSigDB total	1325	5452	6769

^aDecrease in number due to the removal of sets with too few genes to run GSEA.

Gene set annotations: each MSigDB gene set is a list of genes with relevant annotations and links to external resources. MSigDB focuses on human gene sets, but we also include sets from some model organisms, and the annotations of every set identify its organism. We use HUGO gene symbols and, as of version 3.0, human Entrez Gene IDs as universal identifiers. These Entrez IDs are guaranteed to be unique and stable, can easily be mapped into a variety of other identifiers and are natively integrated with the GenBank resources of primary nucleic and protein sequences. We also preserve whatever original identifiers were used in the gene set source. All sets have unique database identifiers and names, and include brief and full descriptions. Other annotations depend on the type of gene set. Annotations linking to external resources are especially important as they allow researchers to place the sets in the context of a specific study and facilitate decisions on follow-up experiments.

Gene sets from publications are the most richly annotated. Their annotations include the PubMed ID of the publication, pointers to other gene sets from the same publication, and now also details on the exact table or figure from which the gene set was extracted. For version 3.0, we updated the names of these gene sets to make them more descriptive and standardized and the accompanying brief descriptions to follow a more uniform and consistent format. Other annotation features introduced with version 3.0 include links to source datasets in Gene Expression Omnibus (GEO) (Barrett et al., 2009) and ArrayExpress (Parkinson et al., 2009). Canonical pathway sets include links to the pathway at the source web site.
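To make the annotation model concrete, the following is a minimal sketch of how one annotated gene set might be held in memory. The class name, field names and example values are hypothetical and chosen only for illustration; they are not the schema of the MSigDB XML files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneSetRecord:
    """Hypothetical record for one annotated gene set (not the MSigDB schema)."""

    set_id: str                  # unique database identifier
    name: str                    # unique, descriptive set name
    collection: str              # one of "C1" to "C5"
    organism: str                # organism the set was derived from
    brief_description: str
    full_description: str = ""
    gene_symbols: List[str] = field(default_factory=list)  # HUGO gene symbols
    entrez_ids: List[int] = field(default_factory=list)    # human Entrez Gene IDs
    original_ids: List[str] = field(default_factory=list)  # identifiers as given by the source
    # The remaining annotations apply only to some types of gene set.
    pubmed_id: Optional[str] = None        # publication the set was extracted from
    source_exhibit: Optional[str] = None   # table or figure within that publication
    geo_accession: Optional[str] = None
    arrayexpress_accession: Optional[str] = None
    pathway_url: Optional[str] = None      # canonical pathway sets only

    def is_from_publication(self) -> bool:
        """True when the record carries a PubMed ID annotation."""
        return self.pubmed_id is not None


# Made-up example values; only the gene symbol / Entrez ID pairs are real.
demo = GeneSetRecord(
    set_id="DEMO0001",
    name="DEMO_SET_A",
    collection="C2",
    organism="Homo sapiens",
    brief_description="Illustrative record, not an MSigDB entry.",
    gene_symbols=["TP53", "MYC"],
    entrez_ids=[7157, 4609],
)
print(demo.is_from_publication())
```

Keeping the type-specific annotations optional mirrors the statement that annotations beyond the identifiers, names and descriptions depend on the type of gene set.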

File formats: MSigDB gene set files are available for download in plain text and XML formats. The plain text files contain simple listings of gene set membership, while the XML files also include the annotations. To ensure reproducibility of GSEA results, older versions of the MSigDB files are always available. Note that users of our GSEA software do not need to download the MSigDB files as the tool directly and automatically retrieves the gene sets.
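As an illustration of reading a plain text membership listing, the sketch below assumes the tab-delimited GMT layout used by GSEA, in which each line holds a set name, a description field and then the member genes. The function name and the two example lines are hypothetical; consult the file format documentation on the web site for the authoritative layout.

```python
from typing import Dict, Iterable, Set


def parse_gene_set_lines(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Parse lines of the assumed form: name<TAB>description<TAB>gene1<TAB>gene2..."""
    gene_sets: Dict[str, Set[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"expected a name, a description and at least one gene: {line!r}"
            )
        name, _description, *genes = fields
        # Drop empty fields left by trailing tabs.
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Made-up lines for illustration; these are not real MSigDB sets.
example_lines = [
    "DEMO_SET_A\tmade-up description\tTP53\tMYC\tEGFR\n",
    "DEMO_SET_B\tmade-up description\tMYC\tKRAS\n",
]
demo_sets = parse_gene_set_lines(example_lines)
print(sorted(demo_sets["DEMO_SET_A"]))
```

To read a downloaded file, the same function can be applied to an open file handle, since iterating over a text file yields its lines.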

3 MSigDB ONLINE RESOURCE

In version 3.0, we updated the MSigDB web site. First introduced in July 2007, the site allows users to view the annotated gene sets and perform simple search and analysis tasks. Each gene set and all of its annotations are presented on a separate web page (Fig. 1). Embedded hyperlinks connect annotations to the corresponding external web resources, including PubMed, GEO, ArrayExpress, PubChem and Entrez Gene.

Fig. 1. A typical gene set page on the MSigDB web site. The list of genes has been abbreviated from 41 to 2 for the purposes of this figure.

The MSigDB web site allows users to find gene sets by searching for keywords in the annotations. The online analysis tools allow users to: (i) compute overlaps between gene sets; (ii) view a heat map of a gene set in one of the reference expression compendia; and (iii) categorize the genes in a set by gene families. Gene families offer a quick view of a gene set by grouping its members into a small number of informative categories. We have updated the gene families and they now include: oncogenes, tumor suppressors, translocated cancer genes, transcription factors, protein kinases, homeodomain proteins, cell differentiation markers and cytokines/growth factors.
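The text does not state how the site scores an overlap between gene sets. As a sketch of one common approach, the function below counts the shared genes and computes a hypergeometric tail probability, that is, the chance of seeing at least that many shared genes if the second set were drawn at random from a gene universe of a given size. The function name, the universe size and the example sets are assumptions made for illustration.

```python
from math import comb
from typing import Set, Tuple


def overlap_with_p_value(
    set_a: Set[str], set_b: Set[str], universe_size: int
) -> Tuple[int, float]:
    """Return (k, p) where k = |A & B| and p = P(X >= k) for a hypergeometric X.

    Assumes both sets are drawn from the same universe of `universe_size` genes.
    """
    n_a, n_b = len(set_a), len(set_b)
    if max(n_a, n_b) > universe_size:
        raise ValueError("a gene set cannot be larger than the universe")
    k = len(set_a & set_b)
    # Number of ways to pick n_b genes with exactly i of them inside set A,
    # summed over every overlap size from k up to the largest possible one.
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return k, tail / comb(universe_size, n_b)


# Toy example with made-up sets and a universe of 10 genes.
k, p = overlap_with_p_value({"A", "B", "C"}, {"B", "C", "D"}, universe_size=10)
print(k, round(p, 4))
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for small examples; for genome-scale universes a log-space or library implementation would be a more practical choice.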

ACKNOWLEDGEMENTS

We thank J. Roberston, L. Saunders and L. Kazmierski for gene set collection; H. Kuehn and J. McLaughlin for documentation; and M. Wrobel for web site development.

Funding: National Cancer Institute (5R01CA121941).

Conflict of Interest: none declared.

REFERENCES

Ashburner, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000.

Barrett, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009 (pg. D15).

Cahan, et al. Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene 2007, vol. 401.

Culhane, et al. GeneSigDB – a curated database of gene expression signatures. Nucleic Acids Res. 2009 (pg. D716-D725).

Dietmann, et al. CCancer: a bird's eye view on gene lists reported in cancer-related studies. Nucleic Acids Res. 2010, vol. Suppl (pg. W118-W123).

da Huang, et al. Nucleic Acids Res. 2009.

Kanehisa, Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000.

Matthews, et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 2009 (pg. D619-D622).

Matys, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006 (pg. D108-D110).

Mootha, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 2003 (pg. 267-273).

Parkinson, et al. ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009 (pg. D868-D872).

Shaffer, et al. A library of gene expression signatures to illuminate normal and pathological lymphoid biology. Immunol Rev. 2006, vol. 210.

Subramanian, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 2005, vol. 102 (pg. 15545-15550).

Author notes

Associate Editor: Alex Bateman
