Building a resource sharing site with BioThings SDK and the CD2H Data Discovery Engine (part 1)

by Ginger Tsueng

Although there has been a proliferation of biological datasets in recent years, this information often isn’t machine readable, which makes it hard for services like Google Dataset Search to find and index it. In this series of blog posts, we’ll outline how we are working to make open data, including the datasets that our collaborators generate, more findable, accessible, interoperable, and reusable. We’ll also describe the tools that we’ve developed to make it easier to share data.

The Su and Wu labs have always been a bit obsessed with making biological information more findable, accessible, interoperable, and reusable. Our projects have improved the accessibility and interpretability of publicly available genomic data (SymAtlas, BioGPS), enhanced the interoperability of publicly available biological data sources (MyGene.info, MyVariant.info, MyChem.info, and other BioThings APIs, Wikidata), enabled community curation of biological information (Gene Wiki), and made it easy to share data through the creation of custom APIs (BioThings SDK). Recently, we’ve turned our attention to a problem plaguing open data efforts: the findability of information by search engines.

Towards that end, we have sought to make the use of schemas (more on schemas) more accessible via the CTSA NCATS CD2H-sponsored development of a Data Discovery Engine. We’ll go into this a bit more in another post. This post is really about what can be done now that both the BioThings SDK and the Data Discovery Engine have matured.

Try this:

Search Google for ‘hla hemorrhagic dataset’

You’ll notice that one of the first results takes you to the dataset made available by the Center for Viral Systems Biology (CViSB).

What you won’t see is that the metadata for this dataset actually follows a schema developed by the NIAID Systems Biology Data Dissemination Working Group. Concurrent with the development of the schema, which would culminate in the CViSB data sharing site, we developed the Data Discovery Engine to make it easier to standardize metadata and thereby improve findability.
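To give a rough sense of what schema-conforming, machine-readable metadata can look like, here is a minimal Python sketch that builds a generic schema.org Dataset record and wraps it in the JSON-LD script tag that search engines typically read. This is an illustration only: the helper names are hypothetical, the field values are placeholders, and the handful of properties shown are general schema.org ones, not the actual schema developed by the NIAID working group or the metadata used on the CViSB site.

```python
import json


def build_dataset_jsonld(name, description, url, keywords):
    """Return a generic schema.org Dataset record as a plain dict.

    Hypothetical helper for illustration; the properties are common
    schema.org Dataset fields, not the NIAID working group's schema.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
    }


def to_script_tag(record):
    """Serialize the record as JSON-LD inside an HTML script tag."""
    payload = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Placeholder values, assumed for this example only.
example = build_dataset_jsonld(
    name="Example HLA dataset",
    description="Placeholder description of an example dataset.",
    url="https://example.org/dataset/hla",
    keywords=["HLA", "hemorrhagic fever"],
)
print(to_script_tag(example))
```

Embedding a block like this in a dataset’s landing page is one common way to expose structured metadata to crawlers. A shared schema keeps the field names consistent from one dataset to the next.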

Because the NIAID Systems Biology Data Dissemination Working Group paved the way for the development of a dataset schema and encouraged the development of a user-friendly schema builder, all the tools needed to build a search engine-friendly, resource sharing API were already in place when the SARS-CoV-2 outbreak began. This meant that we could readily spin up a site for sharing resources related to the novel coronavirus outbreak, which is exactly what we did.

Try visiting https://outbreak.info/resources and having a look at any of the datasets listed under resources.

Although there are more types of resources available in outbreak.info than from the CViSB data sharing site, the backends behind the information served up for both sites were built using the same set of tools: namely the BioThings SDK and the Data Discovery Engine.

This blog series will detail how to use the BioThings SDK and the Data Discovery Engine to spin up an API for resource distribution. In our next post, we’ll delve into schemas, the workhorse for describing our data in a way that improves consistency and interoperability between datasets.