New Data Release for MyVariant.info 201706

by Jiwen Xin

Another fresh data release for MyVariant.info is out! In this release, we have updated the ClinVar data to its latest version and added two new fields, one under ClinVar and one under ExAC, to handle two specific cases: genotype sets and multi-allelic variants. Here are more details.

Data Sources Updated

ClinVar was updated to its latest release (the same version for both the hg19 and hg38 assemblies).

Some numbers for GRCh37/hg19 variants:

| | Last release | New release | # of variants in last release | # of variants in new release |
|---|---|---|---|---|
| ClinVar | 2017-04 | 2017-06 | 282,772 | 307,101 |

Similarly, some numbers for GRCh38/hg38 variants:

| | Last release | New release | # of variants in last release | # of variants in new release |
|---|---|---|---|---|
| ClinVar | 2017-04 | 2017-06 | 282,956 | 307,286 |

ClinVar annotations are available under the "clinvar" subfields of each annotated variant. MyVariant.info aggregates annotations from ClinVar, dbSNP, dbNSFP and 12 other sources for each variant, so you can access them all in one request.

The total number of unique variants is now over 424M (424,519,520), slightly higher than in our previous release in April 2017 (424,515,266). More details about the variant data we provide from MyVariant.info are always available in our documentation. This information is also available programmatically from our metadata endpoint (and the hg38 metadata endpoint).


#### New Field for Genotype Set under ClinVar

A few submissions in ClinVar represent assertions about simple or complex genotypes. To include this information in MyVariant.info, we have added a new "genotypeset" field under "clinvar". It has two subfields, "genotype" and "type". The "genotype" field lists, as HGVS ids, all variants that share the same genotype set as the target variant. The "type" field specifies the kind of genotype these variants share, e.g. "CompoundHeterozygote".

  • Query for variants having "genotypeset" information:
curl 'http://myvariant.info/v1/query?q=_exists_:clinvar.genotypeset'
  • Or query for the genotypeset information for a specific variant:
curl 'http://myvariant.info/v1/variant/chr5:g.151208511G>A?fields=clinvar.genotypeset'

{
  "_id": "chr5:g.151208511G>A",
  "_version": 2,
  "clinvar": {
    "_license": "https://goo.gl/OaHML9",
    "genotypeset": {
      "genotype": [
        "chr5:g.151239534C>A",
        "chr5:g.151208511G>A"
      ],
      "type": "CompoundHeterozygote"
    }
  }
}
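To make the structure of the new field concrete, here is a minimal Python sketch that reads the response shown above and lists the other variants in the same genotype set. The helper name `genotype_partners` is hypothetical, and the sketch assumes "genotypeset" is a single object, as in this example; it works on the sample JSON rather than calling the API.

```python
import json

# Sample response copied from the example above (not fetched live).
SAMPLE = """
{
  "_id": "chr5:g.151208511G>A",
  "_version": 2,
  "clinvar": {
    "_license": "https://goo.gl/OaHML9",
    "genotypeset": {
      "genotype": [
        "chr5:g.151239534C>A",
        "chr5:g.151208511G>A"
      ],
      "type": "CompoundHeterozygote"
    }
  }
}
"""


def genotype_partners(doc):
    """Return (genotype type, other variants in the set) for a variant document.

    Hypothetical helper; assumes "genotypeset" is a single object as in the sample.
    """
    gset = doc.get("clinvar", {}).get("genotypeset")
    if not gset:
        # Variant has no genotype set information.
        return None, []
    # Drop the target variant itself, keeping only its partners.
    partners = [v for v in gset.get("genotype", []) if v != doc["_id"]]
    return gset.get("type"), partners


doc = json.loads(SAMPLE)
gtype, partners = genotype_partners(doc)
print(gtype, partners)
```

Variants without the field simply yield `(None, [])`, so the same helper can be applied to any variant document.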

#### New Field for Multi-allelic Variants under ExAC and the ExAC-nonTCGA Subset

A [*multiallelic*](http://gatkforums.broadinstitute.org/gatk/discussion/6455/biallelic-vs-multiallelic-sites) site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. The [VCF source file](ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/) from ExAC groups all multi-allelic variants at the same locus into one record. Hence, we decided to include a '*multi-allelic*' field under '*exac*'. The field lists, as HGVS ids, all multi-allelic variants related to a specific variant.

Thus, users can query for all multi-allelic variants for a target variant, e.g. chr12:g.103234255C>G, using:

curl 'http://myvariant.info/v1/variant/chr12:g.103234255C>G?fields=exac.multi-allelic'

{
  "_id": "chr12:g.103234255C>G",
  "_version": 3,
  "exac": {
    "_license": "https://goo.gl/MH8b34",
    "multi-allelic": [
      "chr12:g.103234255C>T",
      "chr12:g.103234255C>G"
    ]
  }
}
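As an illustration of how this field can be consumed, the following Python sketch lists the other alleles at the same site from the response shown above. The helper name `other_alleles` is hypothetical. The string check is a defensive assumption (that a lone value might come back as a plain string rather than a list), not something stated in this release.

```python
# Sample response from the example above, written as a Python dict (not fetched live).
doc = {
    "_id": "chr12:g.103234255C>G",
    "_version": 3,
    "exac": {
        "_license": "https://goo.gl/MH8b34",
        "multi-allelic": [
            "chr12:g.103234255C>T",
            "chr12:g.103234255C>G",
        ],
    },
}


def other_alleles(doc):
    """Return the HGVS ids under exac["multi-allelic"], excluding the variant itself.

    Hypothetical helper for illustration only.
    """
    # The field name contains a hyphen, so it has to be used as a string key.
    ids = doc.get("exac", {}).get("multi-allelic", [])
    if isinstance(ids, str):
        # Assumption: normalize a single string value to a one-element list.
        ids = [ids]
    return [v for v in ids if v != doc["_id"]]


print(other_alleles(doc))
```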

Or query for all multi-allelic variants in ExAC using:

curl 'http://myvariant.info/v1/query?q=_exists_:exac.multi-allelic'
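For readers who prefer to build these requests in code rather than with curl, here is a small Python sketch that constructs the same URLs using only the standard library. The function names `variant_url` and `exists_query_url` are hypothetical; the base URL and field names are taken from the curl examples in this post. The sketch only builds the URLs and does not send any request.

```python
from urllib.parse import quote, urlencode

# Base URL as used in the curl examples in this post.
BASE = "http://myvariant.info/v1"


def variant_url(hgvs_id, fields):
    """Build the annotation URL for one variant, limited to the given fields."""
    # Percent-encode characters such as ">" in the HGVS id; keep ":" readable.
    return f"{BASE}/variant/{quote(hgvs_id, safe=':')}?{urlencode({'fields': fields})}"


def exists_query_url(field):
    """Build a query URL matching all variants that have the given field."""
    return f"{BASE}/query?{urlencode({'q': '_exists_:' + field})}"


print(variant_url("chr12:g.103234255C>G", "exac.multi-allelic"))
print(exists_query_url("exac.multi-allelic"))
print(exists_query_url("clinvar.genotypeset"))
```

Percent-encoding the `>` and `:` characters is a conservative choice; the unencoded curl forms shown in this post express the same requests.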

Please note that these two fields do not introduce any incompatible changes to the data structure, so your existing code should continue to work.

That's all! And as always, feel free to reach us at help@myvariant.info or @myvariantinfo if you have any questions or feedback.