MyGene.info v3 API is out!

by Sebastien Lelong

We're proud to announce the next version of MyGene.info! This third version brings new features, fixes some issues, and is available at http://mygene.info/v3. MyGene.info v2 will remain up and active during the transition to v3. Stay tuned: we'll also post a step-by-step guide for migrating from v2 to v3.

Here's a brief list of changes. We'll discuss some of them in depth in upcoming posts:

RefSeq accession numbers with version

Since some of you requested this feature (see "We need your opinion"), we now store accession numbers with their version. You can search on this field with or without the version, so the following requests will give the same results:

...
  "refseq": {
    "genomic": [
      "NC_000012.12",
      "NC_018923.2",
      "NG_034014.1"
    ],
    "protein": [
      "NP_001277159.1",
      "NP_001789.2",
      "NP_439892.2",
      "XP_011536034.1"
    ],
    "rna": [
      "NM_001290230.1",
      "NM_001798.4",
      "NM_052827.3",
      "XM_011537732.1"
    ],
...

Note: v2 doesn't store the version; see http://mygene.info/v2/query?q=refseq.rna:NM_001798&fields=refseq
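
As a rough sketch of the two equivalent requests described above, the snippet below builds a v3 query URL for the same accession with and without its version. The helper names `refseq_query_url` and `strip_version` are hypothetical, and the v3 URL pattern is assumed to mirror the v2 query shown in the note. Nothing is fetched here.

```python
from urllib.parse import urlencode

BASE_URL = "http://mygene.info/v3"  # v3 base URL from the announcement


def refseq_query_url(accession, field="refseq.rna"):
    """Build a /query URL searching `field` for `accession` (hypothetical helper)."""
    params = {"q": f"{field}:{accession}", "fields": "refseq"}
    # safe=":" keeps the field:value separator unescaped in the URL
    return f"{BASE_URL}/query?{urlencode(params, safe=':')}"


def strip_version(accession):
    """Drop a trailing '.N' version suffix, if there is one."""
    base, sep, version = accession.rpartition(".")
    return base if sep and version.isdigit() else accession


# Per the text above, both of these should return the same hits in v3.
with_version = refseq_query_url("NM_001798.4")
without_version = refseq_query_url(strip_version("NM_001798.4"))
```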

RNA-protein mapping

"refseq", "accession" and "ensembl" now contain the association between each RNA and its protein product, within an added inner key "translation", as shown in the following example for gene ID 1017.

Note: if an RNA or protein accession number isn't available in the association, then the pair isn't added to this list.

http://mygene.info/v3/gene/1017?fields=refseq

{
    "_id": "1017",
    "refseq": {
        ...
        "translation": [
          {   
            "protein": "XP_011536034.1",
            "rna": "XM_011537732.1"
          },  
          {   
            "protein": "NP_001789.2",
            "rna": "NM_001798.4"
          },  
          {   
            "protein": "NP_439892.2",
            "rna": "NM_052827.3"
          },  
          {   
            "protein": "NP_001277159.1",
            "rna": "NM_001290230.1"
          }   
        ]   
    }   
}

Note: v2 doesn't provide this information, see http://mygene.info/v2/gene/1017?fields=refseq
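
To illustrate how the "translation" key might be consumed on the client side, here is a small sketch that turns it into an RNA-to-protein lookup. The sample data is copied from the gene 1017 response shown above. The function name `rna_to_protein` is hypothetical, and the handling of a bare dictionary (instead of a list) is a defensive assumption rather than documented behavior.

```python
# "translation" entries copied from the gene 1017 example above
refseq = {
    "translation": [
        {"protein": "XP_011536034.1", "rna": "XM_011537732.1"},
        {"protein": "NP_001789.2", "rna": "NM_001798.4"},
        {"protein": "NP_439892.2", "rna": "NM_052827.3"},
        {"protein": "NP_001277159.1", "rna": "NM_001290230.1"},
    ]
}


def rna_to_protein(annotation):
    """Map each RNA accession to its protein product (hypothetical helper)."""
    translation = annotation.get("translation", [])
    # Assumption: a single association could arrive as a dict, not a list.
    if isinstance(translation, dict):
        translation = [translation]
    return {pair["rna"]: pair["protein"] for pair in translation}


mapping = rna_to_protein(refseq)
```

With this lookup, `mapping["NM_001798.4"]` gives `"NP_001789.2"`.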

"exons" inner structure

The inner structure is now a list of dictionaries. Each dictionary describes the exons of one transcript, with a "transcript" key containing the accession number. The "position" inner key contains the positions of the individual exons.

http://mygene.info/v3/gene/1017?fields=exons

{
    "_id": "1017",
    "_score": 21.731894,
    "exons": [
    {   
        "cdsend": 55971625,
        "cdsstart": 55967008,
        "chr": "12",
        "position": [
          [   
            55966768,
            55967124
          ],  
          [   
            55968048,
            55968169
          ],  
          [   
            55968777,
            55968948
          ],  
          [   
            55971043,
            55971247
          ],  
          [   
            55971520,
            55972789
          ]   
        ],  
        "strand": 1,
        "transcript": "NM_001290230",
        "txend": 55972789,
        "txstart": 55966768
    },  
    ...
    ]
}

Note: you can compare this structure with the current v2, which uses a dictionary instead of a list of dictionaries: http://mygene.info/v2/gene/1017?fields=exons
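
The following sketch shows one way to walk the new list-of-dictionaries structure, using the entry from the example above. The helper names are hypothetical, and computing an exon length as `end - start` assumes a particular coordinate convention that the post does not spell out.

```python
# One "exons" entry copied from the gene 1017 example above
exons = [
    {
        "cdsend": 55971625,
        "cdsstart": 55967008,
        "chr": "12",
        "position": [
            [55966768, 55967124],
            [55968048, 55968169],
            [55968777, 55968948],
            [55971043, 55971247],
            [55971520, 55972789],
        ],
        "strand": 1,
        "transcript": "NM_001290230",
        "txend": 55972789,
        "txstart": 55966768,
    },
]


def exons_by_transcript(exon_list):
    """Index exon positions by transcript accession (hypothetical helper)."""
    return {entry["transcript"]: entry["position"] for entry in exon_list}


def exon_lengths(positions):
    """Length of each exon, assuming length = end - start."""
    return [end - start for start, end in positions]


positions = exons_by_transcript(exons)["NM_001290230"]
lengths = exon_lengths(positions)
```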

Better mapping between Ensembl and Entrez gene IDs

There are some annoying cases of one-to-many matches between Ensembl IDs and Entrez IDs, based on the mapping from Ensembl. For example, Ensembl gene ID ENSMUSG00000071350 is associated with Entrez gene IDs 628705 and 239122. While these ambiguous mappings won't disappear completely, the majority of them can be fixed by cross-checking the mappings from other sources. We worked hard to improve this mapping and remove as many discrepancies as we could. We'll post more about this soon.

Querying "reporter" data source

Because some "reporter" IDs are integers (e.g. Affymetrix HuGene_1-1 array), just like Entrez gene IDs, the "reporter" field now needs to be named explicitly in the query to avoid any confusion:

http://mygene.info/v3/query?q=reporter:2845421&fields=reporter
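
A minimal sketch of building such a field-scoped query follows. The `query_url` helper is hypothetical; it only assembles the URL shown above and does not contact the service.

```python
from urllib.parse import urlencode


def query_url(term, field=None, fields=None, base="http://mygene.info/v3"):
    """Build a /query URL, optionally scoping `term` to `field` (hypothetical helper)."""
    q = f"{field}:{term}" if field else str(term)
    params = {"q": q}
    if fields:
        params["fields"] = fields
    # Keep ':' and ',' unescaped so the URL stays readable
    return f"{base}/query?{urlencode(params, safe=':,')}"


# An integer reporter ID has to be scoped to the "reporter" field explicitly.
explicit = query_url(2845421, field="reporter", fields="reporter")
```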

Change in dot.field notation default

The "dot.field" notation returns nested keys joined with a dot, like ["refseq.rna"], instead of a nested structure, such as ["refseq"]["rna"]. This behavior can be triggered using dotfield=1 in conjunction with the fields parameter. By default, results are now returned using the nested structure, unless dotfield=1 is explicitly specified.

{
  "_id": "1017",
  "refseq.rna": [
    "NM_001290230",
    "NM_001798",
    "NM_052827",
    "XM_011537732"
  ]
}

{
  "_id": "1017",
  "_score": 21.731894,
  "refseq": {
    "rna": [
      "NM_001290230.1",
      "NM_001798.4",
      "NM_052827.3",
      "XM_011537732.1"
    ]
  }
}

Note: this change applies only to the annotation endpoint /gene. The query endpoint /query already defaults to the nested structure.
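
To make the difference between the two shapes concrete, here is a client-side sketch that flattens a nested document into dot.field keys. It illustrates the notation only and is not the server's implementation; `to_dotfield` is a hypothetical name, and lists are left untouched as an assumption.

```python
def to_dotfield(doc, prefix=""):
    """Flatten nested dictionaries into dot-separated keys (hypothetical helper)."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the dotted path
            flat.update(to_dotfield(value, path))
        else:
            # Scalars and lists are kept as-is under the dotted key
            flat[path] = value
    return flat


nested = {
    "_id": "1017",
    "refseq": {
        "rna": ["NM_001290230.1", "NM_001798.4", "NM_052827.3", "XM_011537732.1"]
    },
}
flat = to_dotfield(nested)
```

Here `flat` has the keys `"_id"` and `"refseq.rna"`, while `nested["refseq"]["rna"]` reaches the same list through the nested structure.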

We focus on your needs, so you're more than welcome to give feedback, comment on any of these changes, and request more. Again, stay tuned for more about this new version!