Although a growing number of biological datasets have been made available in recent years, this information often isn’t machine readable, which makes it hard for services like Google Dataset Search to find and index the datasets. In this series of blog posts, we’ll outline how we are working to make open data and the datasets our collaborators generate more findable, accessible, interoperable, and reusable, and we’ll introduce tools we’ve developed to make sharing data easier.
The Su and Wu labs have always been a bit obsessed with making biological information more findable, accessible, interoperable, and reusable. Our projects have improved the accessibility and interpretability of publicly available genomic data (SymAtlas, BioGPS), enhanced the interoperability of publicly available biological data sources (MyGene.info, MyVariant.info, MyChem.info, other BioThings APIs, and Wikidata), enabled community curation of biological information (Gene Wiki), and made it easy to share data through the creation of custom APIs (BioThings SDK). Recently, we’ve turned our attention to a problem plaguing open data efforts: the findability of information by search engines.
Towards that end, we have sought to make schemas (more on those in our next post) easier to use through the CTSA NCATS CD2H-sponsored development of a Data Discovery Engine. We’ll go into this a bit more in another post. This post is really about what can be done now that both the BioThings SDK and the Data Discovery Engine have matured.
Try this:
Search Google for ‘hla hemorrhagic dataset’.
You’ll notice that one of the first results takes you to the dataset made available by the Center for Viral Systems Biology (CViSB).
What you won’t see is that the metadata for this dataset follows a schema developed by the NIAID Systems Biology Data Dissemination Working Group. While that schema was being developed (work that would culminate in the CViSB data sharing site), we built the Data Discovery Engine to make it easier to standardize metadata and so improve findability.
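To give a rough sense of what machine-readable dataset metadata can look like, here is a minimal Python sketch that builds a schema.org-style `Dataset` record as JSON-LD, the kind of structured markup that dataset search engines can read. This is not the NIAID working group’s schema: the helper names (`build_dataset_jsonld`, `missing_fields`, `to_script_tag`), the list of required fields, and the example values are all assumptions made up for illustration.

```python
import json

# Hypothetical minimum set of fields for this sketch.
# A real schema defines its own required and recommended properties.
REQUIRED_FIELDS = ("name", "description", "url")


def build_dataset_jsonld(name, description, url, keywords=()):
    """Return a schema.org-style Dataset record as a plain dict."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    if keywords:
        record["keywords"] = list(keywords)
    return record


def missing_fields(record, required=REQUIRED_FIELDS):
    """List the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


def to_script_tag(record):
    """Wrap the record in the script tag used to embed JSON-LD in a web page."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder values only; this does not describe a real dataset.
example = build_dataset_jsonld(
    name="Example HLA sequencing dataset",
    description="Placeholder description of an example dataset.",
    url="https://example.org/dataset/hla-example",
    keywords=["HLA", "example"],
)

if __name__ == "__main__":
    problems = missing_fields(example)
    if problems:
        print("Missing required fields:", ", ".join(problems))
    else:
        print(to_script_tag(example))
```

Checking for missing fields before publishing is the point of the sketch: metadata that consistently carries the same well-defined properties is much easier for a crawler to index than free text on a landing page.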
Because the NIAID Systems Biology Data Dissemination Working Group paved the way in developing a dataset schema and encouraged the development of a user-friendly schema builder, all the tools needed to build a search-engine-friendly, resource-sharing API were already in place when the SARS-CoV-2 outbreak began. This meant that we could readily spin up a site for sharing resources related to the novel coronavirus outbreak, which is exactly what we did.
Try visiting https://outbreak.info/resources and having a look at any of the datasets listed under resources.
Although outbreak.info offers more types of resources than the CViSB data sharing site, the backends serving both sites were built with the same set of tools: the BioThings SDK and the Data Discovery Engine.
This blog series will detail how to use the BioThings SDK and the Data Discovery Engine to spin up an API for resource distribution. In our next post, we’ll delve into schemas, the workhorse for describing our data and for improving consistency and interoperability between datasets.