New data release out! ClinVar data structure overhauled

by Jiwen Xin

Update: On Jan. 18th, we updated ClinVar data to their latest 201501 release.

Another new data release was just rolled out. ClinVar and CADD data were updated to their latest:

Source    Last release   New release   # of variants in new release   # of variants in last release
ClinVar   201509         201511        127,745                        114,627
CADD      v1.2           v1.3          226,932,858                    163,690,986

ClinVar and CADD annotations are available under the "clinvar" and "cadd" subfields, respectively, for each annotated variant. MyVariant.info aggregates annotations from ClinVar, CADD and 12 other sources for each variant, so you can access them all in one request.

The total number of unique variants is now over 334M (334,443,525), compared to 316M previously. More details about the variant data we provide from MyVariant.info are always available in our documentation. The same information is available programmatically from our metadata endpoint.

Since ClinVar data was first included in MyVariant.info, we have received a lot of feedback from our users, which led to the overhaul of the ClinVar data structure in this new release. The changes are detailed below:

Data Structure Change:

  • RCV records

    This change reflects the fact that a single variant may correspond to one or more RCV records. In this release, fields specific to an RCV record (e.g. accession number, clinical significance, number of submitters, review status, last evaluated date, preferred name, origin and conditions) are now moved under the "clinvar.rcv" field (see an example here). If a variant has multiple RCV records, each record is represented as an element in a list under the "clinvar.rcv" field (see examples here).

    This change should resolve the missing-data issue in the previous release. The current release includes 127,745 unique variants (with ~150K RCV records) annotated by ClinVar. Roughly 9K RCV records were left out because their corresponding variants could not be properly mapped to the human reference genome.
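Because "clinvar.rcv" may hold either one record or a list of records, client code usually benefits from normalizing it first. The sketch below is one possible way to do that, assuming a single record is stored as a plain object rather than a one-element list. The helper name rcv_records, the "clinical_significance" key, and the sample values are illustrative assumptions, not taken from real ClinVar data.

```python
def rcv_records(clinvar):
    """Return the RCV records of a "clinvar" annotation as a list.

    Handles three cases: no "rcv" field, a single record (a dict),
    and several records (a list of dicts).
    """
    rcv = clinvar.get("rcv")
    if rcv is None:
        return []
    if isinstance(rcv, dict):
        return [rcv]
    return list(rcv)


# Hypothetical annotations for illustration only; the accessions and
# values below are made up and do not refer to real ClinVar records.
single = {
    "rcv": {"accession": "RCV000000001", "clinical_significance": "Pathogenic"}
}
multiple = {
    "rcv": [
        {"accession": "RCV000000002", "clinical_significance": "Benign"},
        {"accession": "RCV000000003", "clinical_significance": "Likely benign"},
    ]
}

for annotation in (single, multiple):
    print([record["accession"] for record in rcv_records(annotation)])
```

With this in place, downstream code can always loop over the records without checking the type of "clinvar.rcv" each time.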

New fields added:

  • ClinVar Variant ID

    The ClinVar Variant ID is now included as the "clinvar.variant_id" field. The definition of the ClinVar Variant ID can be found here.

  • hg38 position

    When available, the genomic position on the hg38 assembly is now included as the "clinvar.hg38" field, alongside the existing "clinvar.hg19" field, which holds the hg19 genomic position of each variant.

  • More xref IDs

    To facilitate cross-referencing with other variant databases, the available IDs mapping each variant to OMIM, COSMIC, UniProt and dbVar are now included as separate fields under "clinvar". The previous data release only included the rsid from dbSNP.
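As a rough illustration of how the separate cross-reference fields might be consumed, the sketch below gathers whichever ones are present on a "clinvar" annotation. Only "clinvar.omim" and "clinvar.uniprot" are named explicitly in this post; the keys "rsid", "cosmic" and "dbvar" are assumed by analogy, and the sample annotation uses placeholder values.

```python
# Keys assumed to hold cross-reference IDs directly under "clinvar".
# "omim" and "uniprot" appear in this post; the others are assumptions.
XREF_FIELDS = ("rsid", "omim", "cosmic", "uniprot", "dbvar")


def collect_xrefs(clinvar):
    """Map each cross-reference field that is present to a list of IDs."""
    xrefs = {}
    for field in XREF_FIELDS:
        value = clinvar.get(field)
        if value is None:
            continue
        # A field may hold one ID or several; always return a list.
        xrefs[field] = value if isinstance(value, list) else [value]
    return xrefs


# Hypothetical annotation with placeholder IDs, for illustration only.
example = {
    "variant_id": 12345,
    "rsid": "rs0000000",
    "omim": ["000000.0001", "000000.0002"],
}

print(collect_xrefs(example))
```

Fields that are absent for a given variant are simply skipped, so the result only lists databases that actually reference it.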

Fields deleted:

  • xref

    The "clinvar.xref" field is now removed. All available xref IDs were either moved out as separate fields (e.g. "clinvar.uniprot", "clinvar.omim") or included under the "clinvar.rcv.conditions.identifiers" field.

New data structure examples:

A few examples of the new ClinVar data structure can be found at this gist.

Query examples:

In this demo, we will show you how to use our Python client myvariant.py to query for ClinVar data with only a few lines of code. The complete tutorial is provided as a Jupyter notebook here (the raw ipynb file is here).

  • Installing myvariant.py is easy with pip:
pip install myvariant
  • Now you just need to import it and instantiate the MyVariantInfo class:
import myvariant
mv = myvariant.MyVariantInfo()
  • To get the available ClinVar annotations for a variant, you can pass an HGVS ID, with "clinvar" as the fields parameter, to the getvariant method:
mv.getvariant('chr6:g.26093141G>A', fields='clinvar')
  • Also, you can query for a single RCV accession number:
mv.query('clinvar.rcv.accession:RCV000198299')
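When looking up several RCV accessions, it can help to build the query string in one place. The sketch below is one way to do so; the helper name rcv_query is hypothetical, the "RCV plus nine digits" pattern is an assumption based on the accession shown above, and joining terms with OR assumes the query endpoint accepts boolean operators in query strings.

```python
import re

# Assumed accession shape: "RCV" followed by nine digits, as in RCV000198299.
RCV_PATTERN = re.compile(r"RCV\d{9}")


def rcv_query(accessions):
    """Build a fielded query string matching any of the given RCV accessions."""
    terms = []
    for accession in accessions:
        if not RCV_PATTERN.fullmatch(accession):
            raise ValueError("not an RCV accession: %r" % (accession,))
        terms.append("clinvar.rcv.accession:" + accession)
    return " OR ".join(terms)


print(rcv_query(["RCV000198299"]))

# With the client instantiated as above, the string could then be passed on:
# mv.query(rcv_query(["RCV000198299"]))
```

Validating the accession before sending the request catches typos locally instead of returning an empty result set.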

Finally, we would like to thank all users who provided feedback and suggestions. As always, feel free to reach us at help@myvariant.info or @myvariantinfo if you have any questions or feedback.