New default behavior for 'species' parameter

by Cyrus Afrasiabi

The MyGene.info API supports both "/gene" and "/query" endpoints. On the /query endpoint, an optional species parameter lets users pass one or more species (as common species names or taxonomy IDs) to filter the query results.
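As a rough illustration of how the species parameter fits into a request, here is a minimal Python sketch that only builds the query URL and makes no network call. The helper name build_query_url is made up for this example and is not part of any MyGene.info client. The comma-separated form for multiple species follows the "human,mouse,rat" notation used in this post.

```python
from urllib.parse import urlencode

BASE_URL = "http://mygene.info/v3/query"


def build_query_url(q, species=None):
    """Build a /query URL (hypothetical helper, for illustration only).

    `species` may be a single string such as "pig" or "all", or a list of
    common names and/or taxonomy IDs, which is joined with commas.
    """
    params = {"q": q}
    if species is not None:
        if not isinstance(species, str):
            species = ",".join(str(s) for s in species)
        params["species"] = species
    # Keep commas unescaped so the URL reads like the examples in this post.
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_query_url("cdk2"))
print(build_query_url("cdk2", species=["human", 10090]))
print(build_query_url("F1RW06", species="pig"))
```

Leaving species unset produces the same URL as the plain q=cdk2 example, so the server-side default decides which species are searched.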

Previously, the default species was set to "human,mouse,rat". This meant that, unless you explicitly specified another value for the species parameter, the results of a query (e.g. "q=cdk2") might look like this:

http://mygene.info/v3/query?q=cdk2

{
  "max_score": 457.24393,
  "took": 11,
  "total": 32,
  "hits": [
    {
      "_id": "1017",
      "_score": 457.24393,
      "entrezgene": 1017,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9606
    },
    {
      "_id": "12566",
      "_score": 329.98914,
      "entrezgene": 12566,
      "name": "cyclin-dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10090
    },
    {
      "_id": "362817",
      "_score": 279.2216,
      "entrezgene": 362817,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10116
    },
    {
      "_id": "143384",
      "_score": 22.91444,
      "entrezgene": 143384,
      "name": "CDK2 associated cullin domain 1",
      "symbol": "CACUL1",
      "taxid": 9606
    },
    {
      "_id": "52004",
      "_score": 20.558783,
      "entrezgene": 52004,
      "name": "CDK2-associated protein 2",
      "symbol": "Cdk2ap2",
      "taxid": 10090
    },
    {
      "_id": "78832",
      "_score": 17.98903,
      "entrezgene": 78832,
      "name": "CDK2 associated, cullin domain 1",
      "symbol": "Cacul1",
      "taxid": 10090
    },
    {
      "_id": "365493",
      "_score": 14.489841,
      "entrezgene": 365493,
      "name": "CDK2-associated, cullin domain 1",
      "symbol": "Cacul1",
      "taxid": 10116
    },
    {
      "_id": "13445",
      "_score": 13.166027,
      "entrezgene": 13445,
      "name": "CDK2 (cyclin-dependent kinase 2)-associated protein 1",
      "symbol": "Cdk2ap1",
      "taxid": 10090
    },
    {
      "_id": "690181",
      "_score": 8.355364,
      "entrezgene": 690181,
      "name": "similar to S-phase kinase-associated protein 1A (Cyclin A/CDK2-associated protein p19) (p19A) (p19skp1)",
      "symbol": "LOC690181",
      "taxid": 10116
    },
    {
      "_id": "690646",
      "_score": 7.2449207,
      "entrezgene": 690646,
      "name": "similar to S-phase kinase-associated protein 2 (F-box protein Skp2) (Cyclin A/CDK2-associated protein p45) (F-box/WD-40 protein 1) (FWD1)",
      "symbol": "LOC690646",
      "taxid": 10116
    }
  ]
}

With no species parameter specified in the query, 32 hits were returned, corresponding to all human, mouse, and rat genes with a match to cdk2 in some field (such as symbol or name). You could return the matched genes from all species by specifying species=all in the query.

While "human,mouse,rat" was a useful default for users who only needed to query genes in these common species, it could cause confusion for query terms relevant only to other species. For example, a query like q=F1RW06 previously returned no hits instead of the matching pig CDK3 gene, unless you added "species=pig" or "species=all".

Now, based on feedback from many users, the default value of the "species" parameter has been changed to "all". The same "q=cdk2" query will now return matched genes from all species:

http://mygene.info/v3/query?q=cdk2

{
  "max_score": 393.0346,
  "took": 115,
  "total": 611,
  "hits": [
    {
      "_id": "1017",
      "_score": 393.0346,
      "entrezgene": 1017,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9606
    },
    {
      "_id": "12566",
      "_score": 327.42117,
      "entrezgene": 12566,
      "name": "cyclin-dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10090
    },
    {
      "_id": "362817",
      "_score": 270.2593,
      "entrezgene": 362817,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10116
    },
    {
      "_id": "100925631",
      "_score": 268.31903,
      "entrezgene": 100925631,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9305
    },
    {
      "_id": "100981695",
      "_score": 268.31903,
      "entrezgene": 100981695,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9597
    },
    {
      "_id": "105864946",
      "_score": 268.31903,
      "entrezgene": 105864946,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 30608
    },
    {
      "_id": "ENSMEUG00000005552",
      "_score": 268.31903,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9315
    },
    {
      "_id": "103465316",
      "_score": 268.31903,
      "entrezgene": 103465316,
      "name": "cyclin dependent kinase 2",
      "symbol": "cdk2",
      "taxid": 8081
    },
    {
      "_id": "100117828",
      "_score": 268.31903,
      "entrezgene": 100117828,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 7425
    },
    {
      "_id": "101544122",
      "_score": 268.31903,
      "entrezgene": 101544122,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 42254
    }
  ]
}
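If a script needs to tell the previously default species apart from the newly included ones, the taxid field in each hit is enough. The sketch below is only an illustration: the function name split_by_legacy_species is hypothetical, and the sample hits are a trimmed copy of entries from the response above (taxids 9606, 10090, and 10116 are human, mouse, and rat).

```python
from collections import Counter

# Taxonomy IDs of the former default species: human, mouse, rat.
LEGACY_TAXIDS = {9606, 10090, 10116}


def split_by_legacy_species(hits):
    """Split hits into (human/mouse/rat hits, hits from other species)."""
    legacy = [h for h in hits if h.get("taxid") in LEGACY_TAXIDS]
    others = [h for h in hits if h.get("taxid") not in LEGACY_TAXIDS]
    return legacy, others


# A few hits copied (and trimmed) from the response shown above.
sample_hits = [
    {"_id": "1017", "symbol": "CDK2", "taxid": 9606},
    {"_id": "12566", "symbol": "Cdk2", "taxid": 10090},
    {"_id": "362817", "symbol": "Cdk2", "taxid": 10116},
    {"_id": "100925631", "symbol": "CDK2", "taxid": 9305},
    {"_id": "ENSMEUG00000005552", "symbol": "CDK2", "taxid": 9315},
]

legacy, others = split_by_legacy_species(sample_hits)
print([h["_id"] for h in legacy])
print(Counter(h["taxid"] for h in others))
```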

We think this new default behavior for the "species" parameter will give more intuitive results for most users. You can easily mimic the old behavior by explicitly specifying species=human,mouse,rat in the query. It's also worth mentioning that, as before, our customized weighting function makes sure that human, mouse, and rat genes with the same match (e.g. the same symbol match of "cdk2") always appear before those from other species.
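For scripts that relied on the old default, one option is to pin the species value on the client side. The following is a minimal sketch under that assumption; with_legacy_species is a hypothetical helper, not something provided by MyGene.info, and it only prepares the query parameters without sending a request.

```python
LEGACY_SPECIES = "human,mouse,rat"  # the former server-side default


def with_legacy_species(params):
    """Return a copy of `params` with species pinned to the old default.

    An explicitly supplied species value (e.g. "pig" or "all") is kept.
    """
    merged = dict(params)
    merged.setdefault("species", LEGACY_SPECIES)
    return merged


print(with_legacy_species({"q": "cdk2"}))
print(with_legacy_species({"q": "F1RW06", "species": "pig"}))
```

The resulting dictionary can be passed as the query parameters to whichever HTTP client the script already uses.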

As always, let us know if you have any comments or concerns via help@mygene.info or @mygene.info.