In the Times Interactive News Department, we pay careful attention to caching strategy to make sure it’s a good fit for the project at hand.
Publishing live election results requires a carefully tuned system: the setup must be able to withstand some of the most intense traffic levels seen all year at NYTimes.com, but at the same time, it needs to get information to our readers quickly as we receive updates from our editors and the Associated Press.
Most of our projects live as Rails applications deployed behind Varnish. They typically run on Amazon’s EC2, backed by MySQL on Amazon RDS.
We knew we wanted to build out the election results as part of our usual stack, but we also knew the setup — given its scale and profile — needed to be resilient against the following:
- Varnish failure — varnishd, so far, has never crashed on us, but hardware failure, misconfiguration or a strange bug could still take Varnish offline.
- Application errors — no matter how rigorous your testing, live data feeds for live events always invite unexpected quirks. We wanted our readers to be well-insulated from any errors that might crop up.
- Extreme traffic — historically, election nights bring some of the most intense traffic of the year. Not only did we want to avoid overloads, but we also wanted to make sure response times were consistent and fast.
We also wanted the system to have these characteristics:
- A simple cache-busting scheme — given the volume of pages we were publishing (184 in total) and the speed at which they were updating, we worried that a complicated scheme to clear the cache ran the risk of not busting the right data on the right machines at critical moments.
- Simple, rapid scalability — if required, a single team member should be able to rapidly scale up the infrastructure just by launching instances and editing minimal configuration files.
Trading Varnish for Flat Files
With these points in mind, we decided our customary Rails + Varnish setup left too much room for error. Although Varnish’s grace and saint modes make some allowances for struggling applications, leaning entirely on an ephemeral cache seemed too risky for an evening when seconds of downtime matter.
Instead, we decided to center our app on the simplest of all caching strategies: the flat file. As with many of our applications, responses on election night didn’t need to be dynamic — everyone can receive the same content.
The Setup
We wrote and deployed a dynamic application to four Rails application servers, fronted by an EC2 micro instance running HAProxy.
Another central server ingested our results feed from the AP and handled post-processing. After each new batch of data arrived, this server determined which pages needed to be re-rendered and, using the Typhoeus libcurl-multi bindings for Ruby, pulled a freshly rendered copy of each of those pages from the render pool.
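To give a sense of that fan-out step, here is a minimal sketch of what parallel re-rendering with Typhoeus might look like. The pool URL, page paths, output directory and concurrency level are all hypothetical stand-ins, not our production values:

```ruby
require "typhoeus"
require "fileutils"

# Hypothetical values for illustration only.
RENDER_POOL = "http://render-pool.internal" # HAProxy in front of the Rails servers
OUTPUT_DIR  = "/var/elections/rendered"

# Fetch every stale page from the render pool in parallel and
# write each response to disk, ready to be pushed to the web servers.
def rerender(stale_paths)
  hydra = Typhoeus::Hydra.new(max_concurrency: 20)

  stale_paths.each do |path|
    request = Typhoeus::Request.new("#{RENDER_POOL}#{path}")
    request.on_complete do |response|
      if response.success?
        file = File.join(OUTPUT_DIR, path)
        FileUtils.mkdir_p(File.dirname(file))
        File.write(file, response.body)
      else
        warn "render failed for #{path}: HTTP #{response.code}"
      end
    end
    hydra.queue(request)
  end

  hydra.run # blocks until every queued request has completed
end

rerender(["/president.html", "/senate/ohio.html"])
```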
The newly rendered pages were then rsynced to a bank of Apache web servers that served them as flat files. Apache, in turn, was fronted by a Varnish instance that cached requests with a hard-coded 5-second TTL. This TTL proved long enough to improve response times to readers and buffer traffic to the Apache servers, while ensuring that new data appeared quickly.
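The publish step itself can be as simple as a loop over the web servers. A minimal sketch, again with hypothetical host names and paths (a production version would also want retries and alerting on partial failures):

```ruby
# Hypothetical hosts and paths for illustration only.
WEB_SERVERS = %w[apache1.internal apache2.internal apache3.internal]
RENDER_DIR  = "/var/elections/rendered/" # trailing slash: sync contents, not the dir
DOC_ROOT    = "/var/www/elections/"

# Mirror the freshly rendered pages to every Apache host.
# --delete keeps retired pages from lingering on the web servers.
def publish
  WEB_SERVERS.each do |host|
    ok = system("rsync", "-az", "--delete", RENDER_DIR, "#{host}:#{DOC_ROOT}")
    warn "rsync to #{host} failed" unless ok
  end
end
```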
HAProxy fronted the Varnish instance; it handled mobile user-agent detection and was capable of sending traffic directly to Apache or even to an alternate data center in case of failure. An Amazon ELB provided additional redundancy to divert traffic if an outage occurred.
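Routing rules like these live in HAProxy’s configuration file. A minimal sketch of what user-agent detection and a Varnish-to-Apache fallback might look like; the backend names, addresses and health check are hypothetical, and our actual configuration, including the alternate data center routing, was more involved:

```
frontend results_front
    bind *:80
    # Send anything that looks like a mobile browser to its own backend.
    acl is_mobile hdr_sub(User-Agent) -i mobile
    use_backend mobile_pool if is_mobile
    default_backend varnish_pool

backend varnish_pool
    option httpchk GET /health
    server varnish1 10.0.0.10:6081 check
    # "backup" servers only receive traffic when every primary is down,
    # so a Varnish failure sends readers straight to Apache.
    server apache1  10.0.0.20:80   check backup

backend mobile_pool
    server mobile1 10.0.0.30:80 check
```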
With this load balancing and caching in place, we were able to handle thousands of requests per second on election night with minimal system load — and a final EC2 hosting bill of a few hundred dollars.