As we mentioned in a previous post, Bioconductor has accepted mygene, the official MyGene.info R client, and included it in the biannual release on October 14, 2014. MyGene.info provides simple-to-use REST web services to query and retrieve gene annotation data. mygene is a wrapper for accessing the MyGene.info services from R.
Over the past several months, I have been developing this software to let R users incorporate MyGene.info's superb web services into their analyses. What began as a side project for coding practice became a full-fledged effort to contribute to Bioconductor's repository.
Recently, I was able to present the work in progress at BioC2014 in Boston, Bioconductor's annual conference that highlights developments and breakthroughs in genomic analysis. After a short talk and poster session, I received valuable feedback from staff as well as from other scientists who use R/Bioconductor, which I believe helped secure the package's acceptance.
Besides discovering how collaborative the R community is, which allowed me to tailor the code and functions for easy integration with other Bioconductor packages, I learned how thorough and standardized the documentation must be: any ambiguity has to be resolved by the manual pages and vignettes (R CMD check/build wouldn't cut me any slack!).
mygene is completely free and open-source. Give it a try by downloading and installing it from Bioconductor, or install it from your R console:
source("http://bioconductor.org/biocLite.R")
biocLite("mygene")
If you have not already done so, you must first upgrade to the latest Bioconductor version:
biocLite("BiocUpgrade")
Here are a few quick examples of using mygene.
The user is exposed to four primary functions for getting annotations: getGene, getGenes, query, and queryMany. Use getGene, the wrapper for the GET query of the "/gene/" service, to return the annotation for a single gene.
> gene <- getGene("1017", fields="all")
> length(gene)
[1] 36
> gene$name
[1] "cyclin-dependent kinase 2"
> gene$taxid
[1] 9606
> gene$uniprot
$`Swiss-Prot`
[1] "P24941"

$TrEMBL
[1] "B4DDL9" "E7ESI2" "G3V317" "G3V5T9"
Just as easily, you can retrieve annotations for several thousand genes without worrying about clogging up our servers. Use queryMany, a wrapper for the POST query of the "/query" service, to return the batch query result.
> xli <-c('DDX26B','CCDC83','MAST3','FLOT1','RPL11','ZDHHC20','LUC7L3','SNORD49A','CTSH')
> queryMany(xli, scopes="symbol", fields="entrezgene", species="human")
Finished
DataFrame with 9 rows and 3 columns
        query entrezgene         _id
  <character>  <integer> <character>
1      DDX26B     203522      203522
2      CCDC83     220047      220047
3       MAST3      23031       23031
4       RPL11       6135        6135
5     ZDHHC20     253832      253832
6      LUC7L3      51747       51747
7    SNORD49A      26800       26800
8        CTSH       1512        1512
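One common way for a client to avoid overloading a server with a very long ID list is to split the list into batches and send one POST request per batch. The base-R sketch below illustrates only that chunking step. The helper chunk_ids, the batch size of 1000, and the placeholder IDs are all illustrative assumptions, not a description of how queryMany is implemented.

```r
# Hypothetical helper (not part of mygene): split a vector of IDs into
# batches of at most `size` elements each.
chunk_ids <- function(ids, size = 1000) {
  stopifnot(size >= 1)
  if (length(ids) == 0) return(list())
  # ceiling(i / size) gives the same group number to each run of `size` IDs
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

ids <- sprintf("gene%04d", 1:2500)      # made-up placeholder IDs
batches <- chunk_ids(ids, size = 1000)  # assumed batch size
lengths(batches)
# [1] 1000 1000  500
```

Each element of batches could then be passed to a single batch query, with the per-batch results combined afterwards.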
Two utility functions are also available: metadata and makeTxDbFromMyGene. metadata allows the user to retrieve MyGene.info's metadata and view the available fields for querying.
> mg<-MyGene()
> metadata(mg)
> metadata(mg)$available_fields
[1] "accession" "alias" "biocarta" "chr" "end"
[6] "ensemblgene" "ensemblprotein" "ensembltranscript" "entrezgene" "exons"
[11] "flybase" "generif" "go" "hgnc" "homologene"
[16] "hprd" "humancyc" "interpro" "ipi" "kegg"
[21] "mgi" "mim" "mirbase" "mousecyc" "name"
[26] "netpath" "pdb" "pfam" "pharmgkb" "pid"
[31] "pir" "prosite" "ratmap" "reactome" "reagent"
[36] "refseq" "reporter" "retired" "rgd" "smpdb"
[41] "start" "strand" "summary" "symbol" "tair"
[46] "taxid" "type_of_gene" "unigene" "uniprot" "wikipathways"
[51] "wormbase" "xenbase" "yeastcyc" "zfin"
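A list of available fields like this can be used to catch misspelled field names before a query is sent. The sketch below is a minimal base-R illustration: unknown_fields is a hypothetical helper, not a mygene function, and available_fields is a hardcoded subset of the names printed above so that the example runs without a network connection. It checks top-level field names only.

```r
# A subset of the field names printed by metadata(mg)$available_fields,
# hardcoded here so the example is self-contained.
available_fields <- c("accession", "alias", "entrezgene", "ensemblgene",
                      "go", "name", "refseq", "symbol", "taxid", "uniprot")

# Hypothetical helper (not part of mygene): return the requested field
# names that do not appear in the list of available fields.
unknown_fields <- function(requested, available) {
  setdiff(requested, available)
}

requested <- c("symbol", "entrezgene", "uniprot", "entrez_gene")
unknown_fields(requested, available_fields)
# [1] "entrez_gene"
```

In a live session, metadata(mg)$available_fields could take the place of the hardcoded vector.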
makeTxDbFromMyGene allows the user to store transcript annotations from a mygene 'exons' query in a sqlite database. TxDb is a really convenient Bioconductor container for transcript annotations: other packages can read and use it directly, which makes it easy to interoperate with them.
> txdb <- makeTxDbFromMyGene(xli, scopes="symbol", species="human")
> transcripts(txdb)
GRanges object with 22 ranges and 2 metadata columns:
        seqnames                 ranges strand |     tx_id      tx_name
           <Rle>              <IRanges>  <Rle> | <integer>  <character>
 [1]           1   [24018268, 24022915]      + |        13 NM_001199802
 [2]           1   [24018268, 24022915]      + |        14    NM_000975
 [3]          11   [85566143, 85631063]      + |         2 NM_001286159
 [4]          11   [85566143, 85631063]      + |         3    NM_173556
 [5]          13   [21946709, 22033508]      - |        18    NM_153251
 ...         ...                    ...    ... ...     ...          ...
[18] 6_mann_hap4    [ 2043604, 2058547]      - |         9    NM_005803
[19]  6_mcf_hap5    [ 2077393, 2090996]      - |        10    NM_005803
[20]  6_qbl_hap6    [ 1988442, 2003382]      - |         5    NM_005803
[21] 6_ssto_hap7    [ 2027817, 2042770]      - |         7    NM_005803
[22]           X [134654554, 134716460]      + |         1    NM_182540
  -------
  seqinfo: 15 sequences from an unspecified genome; no seqlengths
Feedback is very welcome and appreciated. Thanks to Ryan Thompson and the Su lab, especially Chunlei and Tobias, for the guidance!