As some of you may have noticed, MyGene.info went down intermittently on September 11th, resulting in about 117 minutes of total downtime. In the interest of transparency, and in hopes of avoiding similar situations in the future, here's a behind-the-scenes look at how MyGene.info was knocked down and what happens when a BioThings API like MyGene.info goes down.
Because of the way it is set up, MyGene.info can usually handle a high volume of queries; over the last 30 days it has served more than 8 million requests. Not all requests are equal, though: POST requests batching thousands and thousands of query terms are a much heavier burden than simple GET requests. The incident on September 11th involved a single IP address sending many of these batched (big-burden) queries. We don't know exactly what these queries were, but they appear to have been heavy ones: the last successful query right before the downtime took ~60 seconds to finish. This load proved to be too much, and everyone else attempting to query the API experienced outages.
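To make the "not all requests are equal" point concrete, here is a rough sketch of the difference between a simple GET query and a batched POST query, using the public v3 endpoints and the Python requests library; the gene IDs below are made up purely for illustration:

```python
import requests

# A lightweight GET request: one query term, cheap to serve.
light = requests.get("https://mygene.info/v3/query", params={"q": "CDK2"})
print(light.json()["total"])

# A heavy batched POST request: ~1,000 query terms in a single call.
# Each term fans out into its own search on the backend, so a request
# like this costs far more than the GET above.
ids = ",".join(str(i) for i in range(1000, 2000))  # illustrative IDs only
heavy = requests.post(
    "https://mygene.info/v3/query",
    data={"q": ids, "scopes": "entrezgene", "fields": "symbol,name"},
)
print(len(heavy.json()))
```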
As soon as our UptimeRobot monitor detected the outage, Andrew, Jerry, and Sebastian immediately set out to determine its cause and how to get around it. Sometimes issues can be addressed with a quick fix like rebooting an overloaded node or adding a spot instance (when fiscally feasible). In this case, the bombardment of heavy queries kicked all of the nodes out of the Elastic Load Balancer, and nothing could be fixed as long as the bombardment continued. As a result, the BioThings team reluctantly resorted to temporarily filtering out the IP/source of the heavy queries in order to get the API up and running again.
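For readers curious what "filtering out the IP/source" can amount to, here is a minimal, hypothetical sketch of an application-level filter; it is not the actual BioThings implementation (however the team applied the block in practice), and the address shown is a documentation placeholder:

```python
# Hypothetical per-source filter; the IP below is a documentation placeholder.
BLOCKED_IPS = {"203.0.113.42"}

def allow_request(remote_ip: str) -> bool:
    """Reject requests from temporarily blocked sources before any heavy work runs."""
    return remote_ip not in BLOCKED_IPS

# Example: drop the request early, before it reaches the search backend.
if not allow_request("203.0.113.42"):
    print("source temporarily blocked")
```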
The fixes appear to have worked, but the debate on how to address heavy query bombardment is ongoing (better throttling?). We do have some throttling set up to limit the total number of requests per second from a single IP, but it's always hard to judge whether an incoming request is a heavy one or not. If you have suggestions, feel free to share them with our team by posting in our BioThings Google Group. In the meantime, please help us avoid downtime by limiting massive queries to one per second (or at least not parallelizing them to the point that you take down our nodes); see the sketch below for one polite batching pattern.
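If you do need to run a very large job, one client-friendly pattern is to split your IDs into serial batches and pause between them instead of firing them off in parallel. The sketch below uses the Python requests library; the batch size and pause are illustrative choices, not official limits:

```python
import time
import requests

def query_in_batches(ids, batch_size=1000, pause=1.0):
    """Query MyGene.info serially in batches, pausing between requests.

    batch_size and pause are illustrative, not official limits; the point
    is to send one batch at a time rather than many in parallel.
    """
    results = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        resp = requests.post(
            "https://mygene.info/v3/query",
            data={"q": ",".join(batch), "scopes": "entrezgene", "fields": "symbol"},
        )
        results.extend(resp.json())
        time.sleep(pause)  # roughly one heavy query per second, as requested above
    return results

# Example usage: hits = query_in_batches([str(i) for i in range(1000, 6000)])
```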