MyVariant: Lessons learned and what's in store

In case you missed it, a sneak peek at what's in store for MyGene.info was posted last week, so it's only FAIR to share our plans for MyVariant. Although the development of MyVariant.info naturally followed that of MyGene.info, the scale of variant annotation data presented a difficult challenge that required additional architectural and performance considerations. At the time MyVariant.info was first being developed, there were about 18 million genes in the 6+ data resources of interest compared to 340 million variants in the 12+ data resources of interest, a ~20x scale-up in the number of items to index.

For this reason, the development of MyVariant.info required a bit of tailoring to make it work. In overcoming the architectural and performance challenges, the MyVariant.info team members learned a lot of valuable lessons on abstracting and standardizing the creation of APIs like MyGene.info and MyVariant.info for different types (and scales) of biological entity data. To learn more about the general architecture behind MyGene and MyVariant services, check out the 2016 paper in Genome Biology.

The lessons learned from wrangling data at the scale handled by MyVariant.info will be valuable as the BioThings team looks towards incorporating data from Ensembl Genomes, which will drastically scale up the amount of data offered by MyGene.info. In abstracting the process of building MyGene and MyVariant, the BioThings team has laid the foundation for building additional APIs centered around biological entities like chemicals and diseases!

Furthermore, the BioThings team will take the lessons learned and incorporate them into their efforts to create a generic Software Development Kit (SDK) for generating APIs around biological entities like genes and variants. More on the BioThings SDK later.

Variant annotation data is much more valuable in the context of genes; hence, the MyVariant team has been exploring ways to increase interoperability of the MyGene and MyVariant services using JSON-LD.

Linking data with JSON-LD

Both MyVariant.info and MyGene.info store annotation data from different resources in JSON documents; however, differences in keys for the same data across the two services can make it challenging to obtain results for chained queries. JSON-LD provides a standard way to add semantic context to the existing JSON data structure, enhancing the interpretability and therefore the interoperability of the JSON data.

Basically, each API (like MyGene and MyVariant) specifies a JSON-LD context (i.e., a JSON document that can provide a Uniform Resource Identifier (URI) mapping for each key in the output JSON document). The use of URIs provides consistency when specifying subjects and objects, allowing the results of a multistep chained query to be obtained through a much simpler query.
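To make the idea concrete, here is a minimal Python sketch of how two contexts could be used to chain a variant result into a gene query. Everything in it is an assumption for illustration: the key names (gene_id, rsid, entrezgene, symbol), the URIs, and the helper functions are hypothetical and are not the actual contexts or code published by MyGene.info or MyVariant.info.

```python
# Hypothetical JSON-LD contexts. The keys and URIs are illustrative only,
# not the real contexts served by MyGene.info or MyVariant.info.
VARIANT_CONTEXT = {
    "@context": {
        "gene_id": "http://identifiers.org/ncbigene/",
        "rsid": "http://identifiers.org/dbsnp/",
    }
}
GENE_CONTEXT = {
    "@context": {
        "entrezgene": "http://identifiers.org/ncbigene/",
        "symbol": "http://identifiers.org/hgnc.symbol/",
    }
}


def keys_by_uri(context):
    """Invert a context: map each URI to the list of keys that point to it."""
    inverted = {}
    for key, uri in context["@context"].items():
        inverted.setdefault(uri, []).append(key)
    return inverted


def shared_keys(context_a, context_b):
    """Find (key_a, key_b, uri) triples where both APIs map a key to the same URI."""
    by_uri_b = keys_by_uri(context_b)
    triples = []
    for key_a, uri in context_a["@context"].items():
        for key_b in by_uri_b.get(uri, []):
            triples.append((key_a, key_b, uri))
    return triples


def chain(doc_a, context_a, context_b):
    """Turn values in a document from API A into query terms for API B."""
    queries = []
    for key_a, key_b, _uri in shared_keys(context_a, context_b):
        if key_a in doc_a:
            queries.append(f"{key_b}:{doc_a[key_a]}")
    return queries


# A made-up variant document using the hypothetical keys above.
variant_doc = {"rsid": "rs0000000", "gene_id": 1017}
gene_queries = chain(variant_doc, VARIANT_CONTEXT, GENE_CONTEXT)
print(gene_queries)
```

Because both contexts map a key to the same URI, the sketch can tell that gene_id in the variant document and entrezgene in the gene API refer to the same kind of identifier, even though the key names differ. That shared URI is what lets the second query be built without hand-written key mappings.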

Learn about how it works here and imagine the possibilities.