Introducing the mygene R package

As we mentioned in a previous post, Bioconductor has accepted mygene, the official MyGene.info R client, and included it in the biannual release on October 14, 2014. MyGene.info provides simple-to-use REST web services for querying and retrieving gene annotation data, and mygene is a wrapper for accessing those services from R.

Over the past several months, I have been developing this software so that R users can incorporate MyGene.info's superb web services into their analyses. What began as a side project for coding practice became a full-fledged effort to contribute to Bioconductor's repository.

Recently, I was able to present the work in progress at BioC2014 in Boston, Bioconductor's annual conference highlighting developments and breakthroughs in genomic analysis. After a short talk and poster session, I received invaluable feedback from staff as well as other scientists who use R/Bioconductor, which I believe helped secure the package's acceptance.

Besides discovering how collaborative the R community is, which allowed me to tailor the code and functions for easy integration with other Bioconductor packages, I learned how elaborate and standardized the documentation must be: any ambiguity should be fully resolved by the manual pages and vignettes (R CMD check/build wouldn't cut me any slack!). mygene is completely free and open source. Give it a try by downloading and installing it from Bioconductor, or install it from your R console:

source("http://bioconductor.org/biocLite.R")
biocLite("mygene")

If you have not yet upgraded to the latest Bioconductor version, you must do that first:

biocLite("BiocUpgrade")

Here are a few quick code examples using mygene.
The package exposes four primary functions for getting annotations: getGene, getGenes, query, and queryMany. Use getGene, a wrapper for the GET query of the "/gene/" service, to return the gene object for a given gene ID.

> gene <- getGene("1017", fields="all")
> length(gene)
[1] 36
> gene$name
[1] "cyclin-dependent kinase 2"
> gene$taxid
[1] 9606
> gene$uniprot
$`Swiss-Prot`
[1] "P24941"
$TrEMBL
[1] "B4DDL9" "E7ESI2" "G3V317" "G3V5T9"
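
The other two single-purpose functions follow the same pattern. As a rough sketch (the gene IDs, the search term, and the field selection are illustrative choices, not taken from the example above), getGenes takes a vector of IDs and query takes a free-text search term. The calls contact MyGene.info, so they are guarded to run only in an interactive session with mygene installed:

```r
# Illustrative inputs: two Entrez IDs and one Ensembl ID, plus a search term.
gene_ids <- c("1017", "1018", "ENSG00000148795")
search_term <- "cdk2"

# Network calls: only attempted interactively and when mygene is available.
if (interactive() && requireNamespace("mygene", quietly = TRUE)) {
  library(mygene)

  # getGenes: batch counterpart of getGene, one record per requested ID.
  genes <- getGenes(gene_ids, fields = c("name", "symbol", "taxid"))
  print(genes)

  # query: search for matching genes; size caps the number of hits returned.
  res <- query(search_term, size = 5)
  print(res$hits)
}
```

No output is shown here because the results depend on the current state of the MyGene.info database.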

Just as easily, you can retrieve annotations for several thousand genes without worrying about clogging up our servers. Use queryMany, a wrapper for the POST query of the "/query" service, to return the batch query result.

> xli <-c('DDX26B','CCDC83','MAST3','FLOT1','RPL11','ZDHHC20','LUC7L3','SNORD49A','CTSH')
> queryMany(xli, scopes="symbol", fields="entrezgene", species="human")
Finished
DataFrame with 9 rows and 3 columns
        query entrezgene         _id
  <character>  <integer> <character>
1      DDX26B     203522      203522
2      CCDC83     220047      220047
3       MAST3      23031       23031
4       RPL11       6135        6135
5     ZDHHC20     253832      253832
6      LUC7L3      51747       51747
7    SNORD49A      26800       26800
8        CTSH       1512        1512
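
A common next step is to turn such a result into a symbol-to-ID lookup. The sketch below hard-codes the rows printed above into a plain data.frame so it runs without a network connection. With a live result you would presumably convert the returned DataFrame first, for example with as.data.frame(). The name symbol_to_entrez is just an illustrative choice.

```r
# Rows copied from the queryMany output printed above.
res <- data.frame(
  query = c("DDX26B", "CCDC83", "MAST3", "RPL11",
            "ZDHHC20", "LUC7L3", "SNORD49A", "CTSH"),
  entrezgene = c(203522L, 220047L, 23031L, 6135L,
                 253832L, 51747L, 26800L, 1512L),
  stringsAsFactors = FALSE
)

# Named integer vector: names are gene symbols, values are Entrez Gene IDs.
symbol_to_entrez <- setNames(res$entrezgene, res$query)

# Look up a couple of symbols by name.
symbol_to_entrez[c("RPL11", "CTSH")]
```

A named vector keeps lookups to a single subsetting call, which is handy when annotating another table by symbol.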

Two utility functions are also available: metadata and makeTxDbFromMyGene. metadata lets the user retrieve MyGene.info's metadata and view the fields available for querying.

> mg<-MyGene()
> metadata(mg)
> metadata(mg)$available_fields
[1] "accession"         "alias"             "biocarta"          "chr"               "end"              
[6] "ensemblgene"       "ensemblprotein"    "ensembltranscript" "entrezgene"        "exons"            
[11] "flybase"           "generif"           "go"                "hgnc"              "homologene"       
[16] "hprd"              "humancyc"          "interpro"          "ipi"               "kegg"             
[21] "mgi"               "mim"               "mirbase"           "mousecyc"          "name"             
[26] "netpath"           "pdb"               "pfam"              "pharmgkb"          "pid"              
[31] "pir"               "prosite"           "ratmap"            "reactome"          "reagent"          
[36] "refseq"            "reporter"          "retired"           "rgd"               "smpdb"            
[41] "start"             "strand"            "summary"           "symbol"            "tair"             
[46] "taxid"             "type_of_gene"      "unigene"           "uniprot"           "wikipathways"     
[51] "wormbase"          "xenbase"           "yeastcyc"          "zfin"
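
One way to use this list is to check a set of requested fields before sending a query. The following is a minimal sketch in base R: the available vector is a hand-copied subset of the output above, and the requested vector (including the deliberately unlisted "pathway") is hypothetical. In a live session you could use metadata(mg)$available_fields instead of the hard-coded subset.

```r
# Hand-copied subset of the available_fields output shown above.
available <- c("entrezgene", "ensemblgene", "symbol", "name",
               "uniprot", "go", "taxid")

# Hypothetical request; "pathway" is deliberately not a listed field.
requested <- c("symbol", "uniprot", "pathway")

# Report anything that is not a known field.
unknown <- setdiff(requested, available)
if (length(unknown) > 0) {
  message("Not in available_fields: ", paste(unknown, collapse = ", "))
}

# Keep only the valid fields and join them into one comma-separated string.
valid <- intersect(requested, available)
fields_param <- paste(valid, collapse = ",")
fields_param
```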

makeTxDbFromMyGene lets the user store transcript annotations from a mygene 'exons' query in a SQLite database. TxDb is a convenient Bioconductor container for transcript annotations that other packages can read and query directly.

> txdb <- makeTxDbFromMyGene(xli, scopes="symbol", species="human")
> transcripts(txdb)
GRanges object with 22 ranges and 2 metadata columns:
       seqnames                 ranges strand   |     tx_id      tx_name
          <Rle>              <IRanges>  <Rle>   | <integer>  <character>
[1]           1   [24018268, 24022915]      +   |        13 NM_001199802
[2]           1   [24018268, 24022915]      +   |        14    NM_000975
[3]          11   [85566143, 85631063]      +   |         2 NM_001286159
[4]          11   [85566143, 85631063]      +   |         3    NM_173556
[5]          13   [21946709, 22033508]      -   |        18    NM_153251
...         ...                    ...    ... ...       ...          ...
[18] 6_mann_hap4 [  2043604,   2058547]      -   |         9    NM_005803
[19]  6_mcf_hap5 [  2077393,   2090996]      -   |        10    NM_005803
[20]  6_qbl_hap6 [  1988442,   2003382]      -   |         5    NM_005803
[21] 6_ssto_hap7 [  2027817,   2042770]      -   |         7    NM_005803
[22]           X [134654554, 134716460]      +   |         1    NM_182540
-------
seqinfo: 15 sequences from an unspecified genome; no seqlengths
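
Because the result is an ordinary TxDb, the usual GenomicFeatures accessors should work on it as well. The sketch below assumes GenomicFeatures is installed alongside mygene and reuses the symbols from the example above. It contacts MyGene.info, so it is guarded to run only in an interactive session.

```r
# Same gene symbols as in the queryMany example.
symbols <- c("DDX26B", "CCDC83", "MAST3", "FLOT1", "RPL11",
             "ZDHHC20", "LUC7L3", "SNORD49A", "CTSH")

# Network calls: only attempted interactively and when both packages exist.
if (interactive() &&
    requireNamespace("mygene", quietly = TRUE) &&
    requireNamespace("GenomicFeatures", quietly = TRUE)) {
  library(mygene)
  library(GenomicFeatures)

  txdb <- makeTxDbFromMyGene(symbols, scopes = "symbol", species = "human")

  # All exons as a single GRanges object.
  print(exons(txdb))

  # Exons grouped by transcript, as a GRangesList.
  print(exonsBy(txdb, by = "tx"))
}
```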

Feedback is very welcome and appreciated. Thanks to Ryan Thompson and the Su lab, especially Chunlei and Tobias, for the guidance!