New Data Release 201703

by Chunlei Wu

You might have noticed that MyVariant.info went quite a while without a data update, until recently. No, we did not go on vacation :-). Over the past few months, we have been busy refactoring our backend data aggregation and updating our code base. If you're interested, check out the more than one hundred commits we've made in recent months to improve the efficiency and automation of our code.

The first new data release with this refactored data backend was just rolled out. Specifically, it went live last Thursday (03/23/2017). Many data sources have been updated to their latest versions for both GRCh37/hg19 and GRCh38/hg38 variants. No new data source has been added in this data release. Of course, all changes in this data release are backwards-compatible.

Data Sources Updated

Three popular data sources, ClinVar, dbSNP and dbNSFP, were updated to their latest versions (the same version for both the hg19 and hg38 assemblies):

Some numbers for GRCh37/hg19 variants:

Source    Last release   New release   # of variants in last release   # of variants in new release
ClinVar   2016-11        2017-03       166,681                         262,061
dbSNP     147            149           153,037,251                     153,968,878
dbNSFP    3.2a           3.3a          82,366,649                      82,366,649

Similarly, some numbers for GRCh38/hg38 variants:

Source    Last release   New release   # of variants in last release   # of variants in new release
ClinVar   2016-11        2017-03       166,889                         262,254
dbSNP     147            149           152,728,552                     153,745,925
dbNSFP    3.2a           3.3a          82,443,934                      82,443,934
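To put the numbers above in perspective, here is a small sketch that computes the absolute and relative change per source. The counts are copied from the two tables; the `growth` helper and the `COUNTS` layout are only illustrative names, not part of MyVariant.info.

```python
# Variant counts copied from the two tables above: (last release, new release).
COUNTS = {
    "hg19": {
        "ClinVar": (166_681, 262_061),
        "dbSNP": (153_037_251, 153_968_878),
        "dbNSFP": (82_366_649, 82_366_649),
    },
    "hg38": {
        "ClinVar": (166_889, 262_254),
        "dbSNP": (152_728_552, 153_745_925),
        "dbNSFP": (82_443_934, 82_443_934),
    },
}


def growth(before, after):
    """Return (absolute change, percent change) between two releases."""
    delta = after - before
    return delta, 100.0 * delta / before


for assembly, sources in COUNTS.items():
    for name, (before, after) in sources.items():
        delta, pct = growth(before, after)
        print(f"{assembly} {name}: {delta:+,} variants ({pct:+.2f}%)")
```

Most of the growth in this release comes from ClinVar and dbSNP; the dbNSFP counts in the tables are the same in both releases.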

ClinVar, dbSNP and dbNSFP annotations are available under the "clinvar", "dbsnp" and "dbnsfp" subfields, respectively, for each annotated variant. MyVariant.info aggregates annotations from ClinVar, dbSNP, dbNSFP and 11 other sources for each variant, so you can access them all in one request.
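As a rough sketch of what "one request" can look like, the snippet below builds a URL that asks for only the three updated subfields of a single variant. It assumes the `/v1/variant/<id>` endpoint and the `fields` parameter of the MyVariant.info API, neither of which is spelled out in this post; the HGVS ID is just an illustrative hg19 variant, and `variant_url` and `fetch_variant` are hypothetical helper names.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the variant annotation endpoint lives under this base URL.
BASE_URL = "http://myvariant.info/v1"


def variant_url(hgvs_id, fields=("clinvar", "dbsnp", "dbnsfp")):
    """Build a URL requesting only the given subfields for one variant."""
    # Keep commas readable in the comma-separated field list.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    # Percent-encode characters such as ">" in the HGVS ID, but keep ":".
    return f"{BASE_URL}/variant/{quote(hgvs_id, safe=':')}?{query}"


def fetch_variant(hgvs_id, fields=("clinvar", "dbsnp", "dbnsfp")):
    """Fetch the annotation document as a dict (needs network access)."""
    with urlopen(variant_url(hgvs_id, fields)) as resp:
        return json.load(resp)


# Illustrative variant ID only; no request is sent here.
print(variant_url("chr7:g.140453134T>C"))
```

Calling `fetch_variant("chr7:g.140453134T>C")` would then return a dictionary in which each requested source appears under its own key, if the variant has annotations from that source.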

The total number of unique variants is now over 341M (341,289,677), compared to 340M (340,102,225) previously. More details about the variant data we provide from MyVariant.info are always available in our documentation. Programmatic access to this information is available from our metadata endpoint.

New field added for flagging observed variants:

  • observed

    To provide the maximum coverage of variants, MyVariant.info includes annotations for both observed and theoretical variants. Theoretical variants come from data sources like dbNSFP and CADD, which calculate the possible impact of every theoretical variant based on the human genome, while most other resources, like dbSNP, ClinVar and ExAC, provide annotations only for "real" (observed) variants.

    We have now added a new observed field to all "observed" variants, with a boolean value of "true" ({"observed": true}). No observed field is added to a theoretical variant. Thus, you can easily filter for only "observed" variants:

http://myvariant.info/v1/query?q=observed:true

http://myvariant.info/v1/query?q=_exists_:observed (equivalent to the first query)

or combine with other query terms:

http://myvariant.info/v1/query?q=cadd.polyphen.cat:possibly_damaging AND _exists_:observed

http://myvariant.info/v1/query?q=dbnsfp.sift.pred:d AND observed:true
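When sending these queries from a script rather than a browser, the spaces and colons in the query string should be percent-encoded. The snippet below is a minimal sketch of that using only the Python standard library; `observed_query_url` is a hypothetical helper name, and the query terms are the ones from the examples above.

```python
from urllib.parse import quote, urlencode

QUERY_URL = "http://myvariant.info/v1/query"


def observed_query_url(extra_term=None):
    """Build a query URL restricted to observed variants.

    extra_term is an optional additional query term, for example
    "dbnsfp.sift.pred:d", combined with the observed filter via AND.
    """
    if extra_term is None:
        q = "observed:true"
    else:
        q = f"{extra_term} AND observed:true"
    # quote_via=quote encodes spaces as %20 instead of "+".
    return f"{QUERY_URL}?{urlencode({'q': q}, quote_via=quote)}"


print(observed_query_url())
print(observed_query_url("dbnsfp.sift.pred:d"))
```

The resulting URLs can be passed to any HTTP client; the encoded form is equivalent to the human-readable queries shown above.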

That's all! And as always, feel free to reach us at help@myvariant.info or @myvariantinfo if you have any questions or feedback.