Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.
The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software.
Please be aware that the homologs were computationally predicted for distinct genes. The full pathways may not be well conserved across species.
Load package.
library(msigdbr)
All gene sets in the database can be retrieved without specifying a collection/category.
all_gene_sets = msigdbr(species = "Mus musculus")
head(all_gene_sets)
#> # A tibble: 6 x 17
#> gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr>
#> 1 C3 MIR:MIR_… AAACCA… 239273 Abcc4 10257 ABCC4 M126… ""
#> 2 C3 MIR:MIR_… AAACCA… 109359 Abraxas2 23172 ABRAXAS2 M126… ""
#> 3 C3 MIR:MIR_… AAACCA… 60595 Actn4 81 ACTN4 M126… ""
#> 4 C3 MIR:MIR_… AAACCA… 11477 Acvr1 90 ACVR1 M126… ""
#> 5 C3 MIR:MIR_… AAACCA… 11502 Adam9 8754 ADAM9 M126… ""
#> 6 C3 MIR:MIR_… AAACCA… 23794 Adamts5 11096 ADAMTS5 M126… ""
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> # gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> # ortholog_sources <chr>, num_ortholog_sources <dbl>
There is a helper function to show the available species.
msigdbr_species()
#> # A tibble: 11 x 2
#>    species_name             species_common_name
#>    <chr>                    <chr>
#>  1 Bos taurus               cattle
#>  2 Caenorhabditis elegans   roundworm
#>  3 Canis lupus familiaris   dog
#>  4 Danio rerio              zebrafish
#>  5 Drosophila melanogaster  fruit fly
#>  6 Gallus gallus            chicken
#>  7 Homo sapiens             human
#>  8 Mus musculus             house mouse
#>  9 Rattus norvegicus        Norway rat
#> 10 Saccharomyces cerevisiae baker's or brewer's yeast
#> 11 Sus scrofa               pig
You can retrieve data for a specific collection, such as the hallmark gene sets.
h_gene_sets = msigdbr(species = "Mus musculus", category = "H")
head(h_gene_sets)
#> # A tibble: 6 x 17
#> gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr>
#> 1 H "" HALLMA… 11303 Abca1 19 ABCA1 M5905 ""
#> 2 H "" HALLMA… 74610 Abcb8 11194 ABCB8 M5905 ""
#> 3 H "" HALLMA… 52538 Acaa2 10449 ACAA2 M5905 ""
#> 4 H "" HALLMA… 11363 Acadl 33 ACADL M5905 ""
#> 5 H "" HALLMA… 11364 Acadm 34 ACADM M5905 ""
#> 6 H "" HALLMA… 11409 Acads 35 ACADS M5905 ""
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> # gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> # ortholog_sources <chr>, num_ortholog_sources <dbl>
Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.
cgp_gene_sets = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
head(cgp_gene_sets)
#> # A tibble: 6 x 17
#> gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr>
#> 1 C2 CGP ABBUD_… 66395 Ahnak 79026 AHNAK M1423 145761…
#> 2 C2 CGP ABBUD_… 11658 Alcam 214 ALCAM M1423 145761…
#> 3 C2 CGP ABBUD_… 71452 Ankrd40 91369 ANKRD40 M1423 145761…
#> 4 C2 CGP ABBUD_… 93760 Arid1a 8289 ARID1A M1423 145761…
#> 5 C2 CGP ABBUD_… 12040 Bckdhb 594 BCKDHB M1423 145761…
#> 6 C2 CGP ABBUD_… 239691 AU021092 146556 C16orf89 M1423 145761…
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> # gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> # ortholog_sources <chr>, num_ortholog_sources <dbl>
There is a helper function to show the available collections.
msigdbr_collections()
#> # A tibble: 22 x 3
#>    gs_cat gs_subcat         num_genesets
#>    <chr>  <chr>                    <int>
#>  1 C1     ""                         299
#>  2 C2     "CGP"                     3358
#>  3 C2     "CP"                        56
#>  4 C2     "CP:BIOCARTA"              292
#>  5 C2     "CP:KEGG"                  186
#>  6 C2     "CP:PID"                   196
#>  7 C2     "CP:REACTOME"             1554
#>  8 C2     "CP:WIKIPATHWAYS"          587
#>  9 C3     "MIR:MIRDB"               2377
#> 10 C3     "MIR:MIR_Legacy"           221
#> 11 C3     "TFT:GTRD"                 348
#> 12 C3     "TFT:TFT_Legacy"           610
#> 13 C4     "CGN"                      427
#> 14 C4     "CM"                       431
#> 15 C5     "GO:BP"                   7573
#> 16 C5     "GO:CC"                   1001
#> 17 C5     "GO:MF"                   1697
#> 18 C5     "HPO"                     4494
#> 19 C6     ""                         189
#> 20 C7     ""                        4872
#> 21 C8     ""                         302
#> 22 H      ""                          50
The msigdbr() function output is a data frame and can be manipulated using standard data frame methods, such as dplyr.
all_gene_sets %>%
  dplyr::filter(gs_cat == "H") %>%
  head()
#> # A tibble: 6 x 17
#> gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr>
#> 1 H "" HALLMA… 11303 Abca1 19 ABCA1 M5905 ""
#> 2 H "" HALLMA… 74610 Abcb8 11194 ABCB8 M5905 ""
#> 3 H "" HALLMA… 52538 Acaa2 10449 ACAA2 M5905 ""
#> 4 H "" HALLMA… 11363 Acadl 33 ACADL M5905 ""
#> 5 H "" HALLMA… 11364 Acadm 34 ACADM M5905 ""
#> 6 H "" HALLMA… 11409 Acads 35 ACADS M5905 ""
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> # gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> # ortholog_sources <chr>, num_ortholog_sources <dbl>
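The same kind of manipulation also works without dplyr. The following is a minimal base R sketch on a toy data frame that only mimics a few columns of the msigdbr() output. The gene set names SET_A, SET_B, and SET_C are hypothetical placeholders, not real MSigDB identifiers.

```r
# Toy stand-in for msigdbr() output (assumption: only three of the real columns,
# with hypothetical gene set names).
toy_gene_sets <- data.frame(
  gs_cat = c("H", "H", "H", "C2"),
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_C"),
  gene_symbol = c("Abca1", "Abcb8", "Acaa2", "Ahnak"),
  stringsAsFactors = FALSE
)

# Base R equivalent of dplyr::filter(gs_cat == "H").
h_only <- toy_gene_sets[toy_gene_sets$gs_cat == "H", ]

# Count the genes in each remaining gene set.
genes_per_set <- table(h_only$gs_name)
print(genes_per_set)
```

With real data, `toy_gene_sets` would be replaced by the data frame returned by msigdbr().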
The msigdbr output can be used with various popular pathway analysis packages.
Use the gene sets data frame for clusterProfiler with genes as Entrez Gene IDs.
msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)
Use the gene sets data frame for clusterProfiler with genes as gene symbols.
msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
Use the gene sets data frame for fgsea.
msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)
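The split() call turns the long data frame into a named list with one character vector of genes per gene set. A minimal sketch on toy data (hypothetical gene set names) shows the resulting structure:

```r
# Toy stand-in for the msigdbr data frame (hypothetical gene set names).
toy_msigdbr_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Acaa2"),
  stringsAsFactors = FALSE
)

# One list element per gene set, named by gs_name.
toy_list <- split(x = toy_msigdbr_df$gene_symbol, f = toy_msigdbr_df$gs_name)
str(toy_list)

# Number of genes in each set, useful as a quick sanity check.
print(lengths(toy_list))
```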
Use the gene sets data frame for GSVA.
Which version of MSigDB was used?
This package was generated with MSigDB v7.2 (released September 2020). The MSigDB version is used as the base of the package version. You can check the installed version with packageVersion("msigdbr").
Can I download the gene sets directly from MSigDB instead of using this package?
Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse experiments. If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.
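For illustration, a GMT file is tab-delimited with one gene set per line: the set name, a description field, then the member genes. The sketch below parses two made-up records in base R as an alternative to getGmt(). The set names and example.org URLs are placeholders, and with a real file the lines would come from readLines() on a path of your choosing.

```r
# Two illustrative GMT records (hypothetical names and descriptions).
gmt_lines <- c(
  "SET_A\thttp://example.org/SET_A\tABCA1\tABCB8",
  "SET_B\thttp://example.org/SET_B\tACAA2"
)

# Split each record on tabs; field 1 is the name, field 2 the description.
fields <- strsplit(gmt_lines, "\t", fixed = TRUE)
gmt_list <- lapply(fields, function(x) x[-(1:2)])
names(gmt_list) <- vapply(fields, function(x) x[1], character(1))

# Convert to a long data frame with one gene per row.
gmt_df <- data.frame(
  gs_name = rep(names(gmt_list), lengths(gmt_list)),
  gene_symbol = unlist(gmt_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(gmt_df)
```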
Can I convert between human and mouse genes just by adjusting gene capitalization?
That will work for most genes, but not all.
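A quick check using three human and mouse pairs that appear in the example output earlier in this document shows where the shortcut breaks down. The helper to_mouse_case() is a hypothetical function written for this sketch.

```r
# Human symbols and the mouse symbols paired with them in the output shown earlier.
human_symbols <- c("ABCA1", "ACTN4", "C16orf89")
mouse_symbols <- c("Abca1", "Actn4", "AU021092")

# Hypothetical helper: capitalize the first letter, lowercase the rest.
to_mouse_case <- function(x) {
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

guessed <- to_mouse_case(human_symbols)

# The third pair does not match: C16orf89 has the mouse ortholog AU021092.
print(data.frame(human_symbols, mouse_symbols, guessed,
                 matches = guessed == mouse_symbols))
```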
Can I convert human genes to any organism myself instead of using this package?
Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.
Aren’t there already other similar tools?
There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file. MSigDF is based on the WEHI resource, but is converted to a more tidyverse-friendly data frame. When msigdbr was initially released, these were multiple releases behind the latest version of MSigDB, so they may not be actively maintained.
The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.
Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
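The selection rule can be sketched in base R as follows. This is not the package's actual implementation: the gene names are hypothetical, the table is a toy, and the tie-breaking behavior (keeping whichever candidate sorts first) is an assumption, since the text does not say how ties are resolved.

```r
# Toy candidate orthologs (hypothetical gene names and source counts).
toy_orthologs <- data.frame(
  human_gene_symbol = c("GENE1", "GENE1", "GENE2"),
  gene_symbol = c("Gene1a", "Gene1b", "Gene2"),
  num_ortholog_sources = c(3, 9, 5),
  stringsAsFactors = FALSE
)

# Sort so the best-supported candidate comes first within each human gene.
ord <- order(toy_orthologs$human_gene_symbol, -toy_orthologs$num_ortholog_sources)
sorted <- toy_orthologs[ord, ]

# Keep only the first (best-supported) row per human gene.
best <- sorted[!duplicated(sorted$human_gene_symbol), ]
print(best)
```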