Introduction to msigdbr

Overview

Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software.

Please be aware that the homologs were computationally predicted for distinct genes. The full pathways may not be well conserved across species.

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

All gene sets in the database can be retrieved without specifying a collection/category.

all_gene_sets = msigdbr(species = "Mus musculus")
head(all_gene_sets)
#> # A tibble: 6 x 17
#>   gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#>   <chr>  <chr>     <chr>         <int> <chr>             <int> <chr>       <chr> <chr>  
#> 1 C3     MIR:MIR_… AAACCA…      239273 Abcc4             10257 ABCC4       M126… ""     
#> 2 C3     MIR:MIR_… AAACCA…      109359 Abraxas2          23172 ABRAXAS2    M126… ""     
#> 3 C3     MIR:MIR_… AAACCA…       60595 Actn4                81 ACTN4       M126… ""     
#> 4 C3     MIR:MIR_… AAACCA…       11477 Acvr1                90 ACVR1       M126… ""     
#> 5 C3     MIR:MIR_… AAACCA…       11502 Adam9              8754 ADAM9       M126… ""     
#> 6 C3     MIR:MIR_… AAACCA…       23794 Adamts5           11096 ADAMTS5     M126… ""     
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> #   gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> #   ortholog_sources <chr>, num_ortholog_sources <dbl>

There is a helper function to show the available species.

msigdbr_species()
#> # A tibble: 11 x 2
#>    species_name             species_common_name      
#>    <chr>                    <chr>                    
#>  1 Bos taurus               cattle                   
#>  2 Caenorhabditis elegans   roundworm                
#>  3 Canis lupus familiaris   dog                      
#>  4 Danio rerio              zebrafish                
#>  5 Drosophila melanogaster  fruit fly                
#>  6 Gallus gallus            chicken                  
#>  7 Homo sapiens             human                    
#>  8 Mus musculus             house mouse              
#>  9 Rattus norvegicus        Norway rat               
#> 10 Saccharomyces cerevisiae baker's or brewer's yeast
#> 11 Sus scrofa               pig

You can retrieve data for a specific collection, such as the hallmark gene sets.

h_gene_sets = msigdbr(species = "Mus musculus", category = "H")
head(h_gene_sets)
#> # A tibble: 6 x 17
#>   gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#>   <chr>  <chr>     <chr>         <int> <chr>             <int> <chr>       <chr> <chr>  
#> 1 H      ""        HALLMA…       11303 Abca1                19 ABCA1       M5905 ""     
#> 2 H      ""        HALLMA…       74610 Abcb8             11194 ABCB8       M5905 ""     
#> 3 H      ""        HALLMA…       52538 Acaa2             10449 ACAA2       M5905 ""     
#> 4 H      ""        HALLMA…       11363 Acadl                33 ACADL       M5905 ""     
#> 5 H      ""        HALLMA…       11364 Acadm                34 ACADM       M5905 ""     
#> 6 H      ""        HALLMA…       11409 Acads                35 ACADS       M5905 ""     
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> #   gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> #   ortholog_sources <chr>, num_ortholog_sources <dbl>

Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.

cgp_gene_sets = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
head(cgp_gene_sets)
#> # A tibble: 6 x 17
#>   gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#>   <chr>  <chr>     <chr>         <int> <chr>             <int> <chr>       <chr> <chr>  
#> 1 C2     CGP       ABBUD_…       66395 Ahnak             79026 AHNAK       M1423 145761…
#> 2 C2     CGP       ABBUD_…       11658 Alcam               214 ALCAM       M1423 145761…
#> 3 C2     CGP       ABBUD_…       71452 Ankrd40           91369 ANKRD40     M1423 145761…
#> 4 C2     CGP       ABBUD_…       93760 Arid1a             8289 ARID1A      M1423 145761…
#> 5 C2     CGP       ABBUD_…       12040 Bckdhb              594 BCKDHB      M1423 145761…
#> 6 C2     CGP       ABBUD_…      239691 AU021092         146556 C16orf89    M1423 145761…
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> #   gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> #   ortholog_sources <chr>, num_ortholog_sources <dbl>

There is a helper function to show the available collections.

msigdbr_collections()
#> # A tibble: 22 x 3
#>    gs_cat gs_subcat         num_genesets
#>    <chr>  <chr>                    <int>
#>  1 C1     ""                         299
#>  2 C2     "CGP"                     3358
#>  3 C2     "CP"                        56
#>  4 C2     "CP:BIOCARTA"              292
#>  5 C2     "CP:KEGG"                  186
#>  6 C2     "CP:PID"                   196
#>  7 C2     "CP:REACTOME"             1554
#>  8 C2     "CP:WIKIPATHWAYS"          587
#>  9 C3     "MIR:MIRDB"               2377
#> 10 C3     "MIR:MIR_Legacy"           221
#> 11 C3     "TFT:GTRD"                 348
#> 12 C3     "TFT:TFT_Legacy"           610
#> 13 C4     "CGN"                      427
#> 14 C4     "CM"                       431
#> 15 C5     "GO:BP"                   7573
#> 16 C5     "GO:CC"                   1001
#> 17 C5     "GO:MF"                   1697
#> 18 C5     "HPO"                     4494
#> 19 C6     ""                         189
#> 20 C7     ""                        4872
#> 21 C8     ""                         302
#> 22 H      ""                          50

The msigdbr() function output is a data frame and can be manipulated using standard data frame methods, such as dplyr verbs.

library(dplyr)  # provides the %>% pipe used below
all_gene_sets %>%
  dplyr::filter(gs_cat == "H") %>%
  head()
#> # A tibble: 6 x 17
#>   gs_cat gs_subcat gs_name entrez_gene gene_symbol human_entr… human_gene… gs_id gs_pmid
#>   <chr>  <chr>     <chr>         <int> <chr>             <int> <chr>       <chr> <chr>  
#> 1 H      ""        HALLMA…       11303 Abca1                19 ABCA1       M5905 ""     
#> 2 H      ""        HALLMA…       74610 Abcb8             11194 ABCB8       M5905 ""     
#> 3 H      ""        HALLMA…       52538 Acaa2             10449 ACAA2       M5905 ""     
#> 4 H      ""        HALLMA…       11363 Acadl                33 ACADL       M5905 ""     
#> 5 H      ""        HALLMA…       11364 Acadm                34 ACADM       M5905 ""     
#> 6 H      ""        HALLMA…       11409 Acads                35 ACADS       M5905 ""     
#> # … with 8 more variables: gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>,
#> #   gs_description <chr>, species_name <chr>, species_common_name <chr>,
#> #   ortholog_sources <chr>, num_ortholog_sources <dbl>

Pathway enrichment analysis

The msigdbr output can be used with various popular pathway analysis packages. In the examples below, msigdbr_df stands for any data frame returned by msigdbr().

Use the gene sets data frame for clusterProfiler with genes as Entrez Gene IDs.

msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for clusterProfiler with genes as gene symbols.

msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
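
As a minimal sketch of what the TERM2GENE input looks like, the following uses a small made-up data frame in place of real msigdbr() output. The toy_df and toy_t2g names and the SET_A/SET_B gene sets are hypothetical and only for illustration. The two-column table is built with base R here instead of dplyr.

```r
# Toy stand-in for msigdbr() output: made-up gene sets, real column names.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Abca1", "Acaa2", "Acadl"),
  entrez_gene = c(11303L, 74610L, 11303L, 52538L, 11363L),
  stringsAsFactors = FALSE
)

# Two-column term-to-gene table: gene set name first, gene identifier second.
# unique() drops repeated gene set / gene pairs if there are any.
toy_t2g <- unique(toy_df[, c("gs_name", "entrez_gene")])
print(toy_t2g)
```

Swapping entrez_gene for gene_symbol gives the symbol-based version of the same table.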

Use the gene sets data frame for fgsea.

msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)

Use the gene sets data frame for GSVA.

msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)
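
To see what the split() call used for fgsea and GSVA produces, here is a small sketch on made-up data. The toy_df and toy_list names and the SET_A/SET_B gene sets are hypothetical stand-ins for real msigdbr() output.

```r
# Toy stand-in for msigdbr() output: made-up gene sets, real column names.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Abca1", "Acaa2", "Acadl"),
  stringsAsFactors = FALSE
)

# split() groups the gene symbols by gene set name and returns a named list
# with one character vector of genes per gene set.
toy_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)
str(toy_list)

# Number of genes per gene set.
print(lengths(toy_list))
```

The result is a named list with one character vector of genes per gene set, which is the structure passed as pathways and gset.idx.list in the calls above.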

Questions and concerns

Which version of MSigDB was used?

This package was generated with MSigDB v7.2 (released September 2020). The MSigDB version is used as the base of the package version. You can check the installed version with packageVersion("msigdbr").

Can I download the gene sets directly from MSigDB instead of using this package?

Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse experiments. If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.
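
For illustration, a GMT file is a tab-delimited text file with one gene set per line: the gene set name, a description field, and then the genes. The sketch below parses two made-up GMT lines with base R only. It is not how getGmt() is implemented, and the gene set names and URLs are placeholders.

```r
# Two made-up GMT lines; with a real file, use readLines("path/to/file.gmt").
gmt_lines <- c(
  "SET_A\thttp://example.org/SET_A\tABCA1\tABCB8",
  "SET_B\thttp://example.org/SET_B\tABCA1\tACAA2\tACADL"
)

# Split each line on tabs.
fields <- strsplit(gmt_lines, "\t", fixed = TRUE)

# Drop the first two fields (name and description) to keep only the genes,
# and use the first field as the list names.
gmt_list <- lapply(fields, function(x) x[-(1:2)])
names(gmt_list) <- vapply(fields, function(x) x[1], character(1))
str(gmt_list)
```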

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most genes, but not all.
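
A quick sketch of why capitalization alone is not enough: the hypothetical to_mouse_case() helper below applies the naive rule, which works for ABCA1 and ACTN4 but fails for TP53, whose mouse ortholog is named Trp53.

```r
# Naive conversion: first letter upper case, the rest lower case.
to_mouse_case <- function(symbols) {
  paste0(toupper(substring(symbols, 1, 1)), tolower(substring(symbols, 2)))
}

human_symbols <- c("ABCA1", "ACTN4", "TP53")
naive <- to_mouse_case(human_symbols)

# Official mouse symbols for the same genes.
official <- c("Abca1", "Actn4", "Trp53")

# Compare the naive guess with the official symbol for each gene.
comparison <- data.frame(human_symbols, naive, official,
                         match = naive == official)
print(comparison)
```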

Can I convert human genes to any organism myself instead of using this package?

Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren’t there already other similar tools?

There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file. MSigDF is based on the WEHI resource, but is converted to a more tidyverse-friendly data frame. When msigdbr was initially released, these were multiple releases behind the latest version of MSigDB, so they may not be actively maintained.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
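
The selection rule described above can be sketched in base R on made-up data. This is an illustration under assumptions and not the package's actual code: the candidates table, the gene names, and the support counts are invented, and ties are resolved by simply keeping the first row, which may differ from what the package does.

```r
# Made-up candidate orthologs: one row per human gene / candidate pair.
candidates <- data.frame(
  human_gene_symbol = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE2"),
  gene_symbol = c("Gene1a", "Gene1b", "Gene2a", "Gene2b", "Gene2c"),
  num_ortholog_sources = c(3, 10, 7, 2, 5),
  stringsAsFactors = FALSE
)

# Sort so the best-supported candidate comes first within each human gene.
sorted_candidates <- candidates[order(candidates$human_gene_symbol,
                                      -candidates$num_ortholog_sources), ]

# Keep only the first (best-supported) row per human gene.
best <- sorted_candidates[!duplicated(sorted_candidates$human_gene_symbol), ]
print(best)
```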