New "fetch_all" feature for fast retrieval of large queries using the MyVariant.py client

by Cyrus Afrasiabi

The latest release of MyVariant.py adds a new parameter to the query method: fetch_all. Setting fetch_all=True allows fast, unordered retrieval of large query results through a generator. Consider the following query:

In [1]: import myvariant

In [2]: mv = myvariant.MyVariantInfo()

In [3]: query = 'dbnsfp.polyphen2.hdiv.score:>0.999 AND dbnsfp.chrom:Y'

In [4]: top_10_hits = mv.query(query)

In [5]: top_10_hits['total']
Out[5]: 21718

This query has ~22,000 total results. Without fetch_all, paging through all of them requires the size and skip parameters, as in:

In [6]: %time all_sorted_hits = [mv.query(query, size=1000, skip=s) for s in [1000 * i for i in range(0,22)]]
CPU times: user 5.87 s, sys: 2.38 s, total: 8.25 s
Wall time: 2min 53s

Here we generate a new skip value for each of the 22 requests (22 requests are needed because size is currently limited to 1000 hits per request). This can be simplified using the new fetch_all feature, which returns a generator over all query hits, in no particular order:

In [7]: %time all_unsorted_hits = list(mv.query(query, fetch_all=True))
Fetching 21718 variant(s)...
CPU times: user 8.44 s, sys: 2.72 s, total: 11.2 s
Wall time: 2min 42s
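
For comparison, the arithmetic behind the paged approach in In [6] can be written out explicitly. The following is only a sketch: page_offsets and flatten_pages are hypothetical helper names, not part of MyVariant.py, and flatten_pages assumes each paged response is a dictionary that stores its results under a 'hits' key alongside 'total'.

```python
def page_offsets(total, size=1000):
    """Return the skip value for each page needed to cover `total` hits."""
    # Ceiling division: a final partial page still needs its own request.
    n_pages = (total + size - 1) // size
    return [size * i for i in range(n_pages)]


def flatten_pages(pages):
    """Concatenate the per-page hit lists into a single list.

    Assumes each page is a dict with its results under 'hits'.
    """
    return [hit for page in pages for hit in page.get('hits', [])]


offsets = page_offsets(21718)
print(len(offsets))   # 22 requests
print(offsets[-1])    # the last request starts at hit 21000
```

With fetch_all=True none of this bookkeeping is needed, since the generator yields individual hits directly.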

The fetch_all feature also allows fast retrieval of a large number of query results, as in:

In [8]: large_query = '_exists_:clinvar'

In [9]: top_10_lq_hits = mv.query(large_query)

In [10]: top_10_lq_hits['total']
Out[10]: 114627

In [11]: %time all_unsorted_lq_hits = list(mv.query(large_query, fetch_all=True))
Fetching 114627 variant(s)...
CPU times: user 39.8 s, sys: 4.34 s, total: 44.1 s
Wall time: 13min 53s

This larger query returns ~115,000 results in about 14 minutes. If you need to iterate through all results of a query this size or larger, it's less demanding on server resources to use the fetch_all option (for more information about how this works, check out our documentation).
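
Because fetch_all returns a generator, the results do not have to be collected into a list before use; they can be consumed one at a time. The sketch below illustrates this with a stand-in generator so that it runs without a network connection. fake_hits and take are hypothetical names introduced here for illustration, and the '_id' values are placeholders rather than real variant IDs.

```python
from itertools import islice


def fake_hits(total):
    """Hypothetical stand-in for mv.query(large_query, fetch_all=True)."""
    for i in range(total):
        yield {'_id': 'variant-%d' % i}


def take(hits, n):
    """Return the first n items of an iterable, leaving the rest unconsumed."""
    return list(islice(hits, n))


hits = fake_hits(114627)

# Peek at a few hits without materializing the whole result set.
first_five = take(hits, 5)

# Keep streaming from where we left off, holding one hit in memory at a time.
remaining = sum(1 for _ in hits)
```

The same pattern should apply to the real generator: replace fake_hits(114627) with the mv.query call and process each hit inside a for loop.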

If you are only interested in certain fields, you can specify them with the fields parameter. This speeds up retrieval considerably, as in:

In [12]: %time all_unsorted_lq_ids = [hit['_id'] for hit in mv.query(large_query, fetch_all = True, fields = '_id')]
Fetching 114627 variant(s)...
CPU times: user 485 ms, sys: 85.4 ms, total: 570 ms
Wall time: 31 s

In [13]: len(all_unsorted_lq_ids)
Out[13]: 114627

In [14]: all_unsorted_lq_ids[:10]
Out[14]:
[u'chr6:g.161071470G>A',
 u'chr6:g.152708291G>A',
 u'chr6:g.152730851T>C',
 u'chr6:g.161771171C>T',
 u'chr7:g.295835G>C',
 u'chr6:g.157488340C>T',
 u'chr6:g.157488357C>T',
 u'chr6:g.160483653T>C',
 u'chr7:g.42005007G>A',
 u'chr7:g.42005836C>G']
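
The example above reads the top-level '_id' field. For a nested field such as the dbnsfp.polyphen2.hdiv.score used in the first query, a small helper can pull the value out of each hit. This is a sketch under the assumption that hits are nested dictionaries keyed by the segments of the dotted field name; get_field is a hypothetical helper, not part of MyVariant.py, and the sample hits below are made-up placeholders.

```python
def get_field(hit, dotted_name, default=None):
    """Look up a dotted field name (e.g. 'a.b.c') in a nested dict."""
    value = hit
    for key in dotted_name.split('.'):
        # Stop early if the path does not exist in this hit.
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value


# Placeholder hits shaped like nested dictionaries (not real query results).
sample_hits = [
    {'_id': 'id-1', 'dbnsfp': {'polyphen2': {'hdiv': {'score': 1.0}}}},
    {'_id': 'id-2', 'dbnsfp': {'chrom': 'Y'}},
]

scores = [get_field(h, 'dbnsfp.polyphen2.hdiv.score') for h in sample_hits]
```

Hits that lack the requested field yield the default value instead of raising a KeyError, which is convenient when iterating over a large, heterogeneous result set.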