Using Flat Files So Elections Don’t Break Your Server

In the Times Interactive News Department, we pay careful attention to caching strategy to make sure it’s a good fit for the project at hand.

Publishing live election results requires a carefully tuned system: the setup must be able to withstand some of the most intense traffic levels seen all year at NYTimes.com, but at the same time, it needs to get information to our readers quickly as we receive updates from our editors and the Associated Press.

Most of our projects live as Rails applications deployed behind Varnish. They typically run on Amazon’s EC2, backed by MySQL on Amazon RDS.

We knew we wanted to build out the election results as part of our usual stack, but we also knew the setup — given its scale and profile — needed to be resilient against the following:

  1. Varnish failure — varnishd, so far, has never crashed on us, but hardware failure, misconfiguration or a strange bug could still take Varnish offline.
  2. Application errors — no matter how rigorous your testing, live data feeds for live events always invite unexpected quirks. We wanted our readers to be well-insulated from any errors that might crop up.
  3. Extreme traffic — historically, election nights bring some of the most intense traffic of the year. Not only did we want to avoid overloads, but we also wanted to make sure response times were consistent and fast.

We also wanted the system to have these characteristics:

  • A simple cache-busting scheme — given the volume of pages we were publishing (184 in total) and the speed at which they were updating, we worried that a complicated scheme to clear the cache ran the risk of not busting the right data on the right machines at critical moments.
  • Simple, rapid scalability — if required, a single team member should be able to rapidly scale up the infrastructure just by launching instances and editing minimal configuration files.

Trading Varnish for Flat Files

With these points in mind, we decided our customary Rails + Varnish setup left too much room for error. Although Varnish’s grace and saint modes make some allowances for struggling applications, we decided it was too risky to lean entirely on an ephemeral cache for an evening where seconds of downtime matter.

Instead, we decided to center our app on the simplest of all caching strategies: the flat file. As with many of our applications, responses on election night didn’t need to be dynamic — everyone could receive the same content.

The Setup

[Figure: The election night setup at NYTimes.com]

We wrote a dynamic application and deployed it to four Rails application servers, fronted by an EC2 micro instance running HAProxy.

Another central server ingested our results feed from the AP and handled post-processing. After each batch of new data was received from the AP, this server determined which pages needed to be re-rendered and, using the Typhoeus libcurl-multi bindings for Ruby, pulled new data for each of these pages from the render pool.
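The ingestion code itself isn’t part of this post, but a minimal sketch of that parallel-fetch step with Typhoeus’s Hydra might look like the following. The page list, render-pool hostname and output path are all invented for illustration, and Typhoeus’s API has shifted over the years, so treat this as a sketch rather than our production code:

    require 'typhoeus'
    require 'fileutils'

    # Hypothetical values; the real list of dirty pages was derived
    # from the contents of each AP batch.
    dirty_pages = ['president/map.html', 'senate/ky.html']
    RENDER_POOL = 'http://render-pool.internal'   # hypothetical hostname
    OUTPUT_DIR  = '/var/www/results'              # hypothetical path

    hydra = Typhoeus::Hydra.new(max_concurrency: 20)

    dirty_pages.each do |path|
      request = Typhoeus::Request.new("#{RENDER_POOL}/#{path}")
      request.on_complete do |response|
        if response.code == 200
          out = File.join(OUTPUT_DIR, path)
          FileUtils.mkdir_p(File.dirname(out))
          File.open(out, 'w') { |f| f.write(response.body) }
        else
          warn "render failed for #{path}: HTTP #{response.code}"
        end
      end
      hydra.queue(request)
    end

    hydra.run  # runs all queued requests in parallel via libcurl-multi

Queuing every dirty page into a single Hydra run keeps the render pool busy without serializing on network round trips, which matters when a large AP batch touches dozens of pages at once.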

The newly rendered pages were then rsynced to a bank of Apache web servers that served them as flat files. Apache, in turn, was fronted by a Varnish instance that cached requests on a hard-coded 5-second TTL. This TTL proved long enough to improve response times to readers and buffer traffic to the Apache servers, while ensuring that new data appeared quickly.
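For reference, pinning a fixed TTL like that takes only a couple of lines of VCL. The sketch below uses the Varnish 2.x-era vcl_fetch hook that was current at the time (later releases renamed it vcl_backend_response); it isn’t our exact production config:

    sub vcl_fetch {
        # Cache every page from Apache for a flat 5 seconds,
        # regardless of what the backend response headers say.
        set beresp.ttl = 5s;
        return (deliver);
    }

A short, hard-coded TTL also means the cache never needs an explicit purge: a stale page simply ages out within five seconds of the next rsync.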

HAProxy fronted the Varnish instance; it handled mobile user-agent detection and was capable of sending traffic directly to Apache or even to an alternate data center in case of failure. An Amazon ELB provided additional redundancy to divert traffic if an outage occurred.
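We haven’t published our actual load-balancer configuration, but the behavior described above can be sketched in HAProxy terms roughly as follows. All backend names and addresses here are invented; the backup keyword is what keeps the fallback servers idle until the primary path fails its health checks:

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend public
        bind *:80
        # Send iPad traffic to its own backend ahead of the cache.
        acl is_ipad hdr_sub(User-Agent) -i ipad
        use_backend mobile if is_ipad
        default_backend results

    backend results
        # Normal path: the local Varnish instance.
        server varnish1 10.0.0.10:6081 check
        # Fallbacks, used only when Varnish is down: Apache directly,
        # then the alternate data center.
        server apache1 10.0.0.20:80 check backup
        server alt_dc  results.alt.example.com:80 check backup

    backend mobile
        server mobile1 10.0.0.30:80 check

This arrangement is also what keeps manual intervention cheap: rerouting traffic comes down to editing one file and reloading HAProxy.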

With this load balancing and caching in place, we were able to handle thousands of requests per second on election night with minimal system load — and a final EC2 hosting bill of a few hundred dollars.

Comments

Great article! As one of the main committers to Typhoeus, I was wondering what your experience was with the parallel HTTP library, and if you had any feedback about it? I’d especially like to know how it held up in high-traffic situations!

My GitHub user is dbalatero if you’d like to contact me. Thanks!

What is all this nerd terminology? You’re journalists, not some start-up company from Y Combinator. Leave this to the nerds and get back to reporting.

@Rudiger

If you aren’t interested in “nerd terminology”, maybe you shouldn’t be hanging out in the NYTimes developer APIs blog…

You know, the one whose subtitle is “All the code that’s fit to printf()”…

I quite enjoyed reading this.

@Rudiger: Not everyone at the NYT is a journalist and hearing about scaling issues is of interest to some people.

Please keep writing articles such as this in the future, thanks!

Rudiger – this is a blog, not a published article. Furthermore, it’s a technology blog. It may surprise you, but the works of high journalism you so clearly pine for are delivered to you through the labor of… nerds. For that matter, your privilege to laboriously hunt-and-peck your whining for our perusal is all thanks to nerds.

Some of us nerds greatly appreciate it that the gents over at the Times care to share what they’ve learned. Keep up the good work!

@Lex @Scott @addicted Dry humor isn’t your strong suit, is it? :-)

It’s fantastic to hear NYT reporting back from the trenches. The coverage was excellent that night – too bad the results weren’t any better :-/

Lots of good info on how such a large site handles their obscene levels of traffic, thanks for sharing.

I’m really curious why you are using Apache to serve flat files and not nginx, as the latter is sometimes faster and less resource-hungry.

This is very interesting, good article.

Kudos! A good write-up of an elegant solution to a relevant problem.

Just out of curiosity, why HAProxy on the public-facing side instead of an Amazon load balancer? I’m currently using ELB, but now I’m curious if I should switch to a micro/HAProxy solution instead.

I use the term static files instead of flat files in this context.

From what I understand, craigslist does the same as you; however they don’t need to update that frequently.

Stephan

Gabriel — we considered nginx but ultimately went with Apache simply because it’s more familiar. Right now, our Rails stack runs on top of Apache for historic reasons; we didn’t want to be fumbling through unfamiliar configs on election night. I suspect this might change in future years.

Meink — We actually used an Amazon ELB in front of our HAProxy instances to handle instance failure and redundancy across datacenters gracefully. We thought about sending traffic from the ELB directly to Varnish, but decided to front it with HAProxy since that gave us more flexibility in case of difficulty: we could fail backend requests over from Varnish to the alternate datacenter, redirect, or push traffic wherever we needed. We also used HAProxy to run User-Agent detection for the iPad ahead of the Varnish cache.

Another key point for us was the HAProxy stats page, which provides a really clear picture of system status and load. We certainly could have accomplished the same with CloudWatch on the ELB and a more complex Varnish config, but the flexibility of the HAProxy arrangement made us more comfortable.

Interesting article. Thanks for sharing.

Why not simply use Amazon S3 to serve static files (maybe with some AJAX/JavaScript thrown in for added funkiness and user transparency)?

//aws.typepad.com/aws/2007/06/live_blogging_e.html

How many people worked on the election night apps, in total? And when did preparation first begin?

This is a great post, but I’m curious about the bigger picture stuff on the project, too.
