Test Everything: Notes on the A/B Revolution

How A/B testing, the practice of performing real-time experiments on a site’s live traffic, came to rule the web. And why it's seeping into ever-greater swaths of modern life.
In an A/B world, something either works or it doesn’t.

Welcome, guinea pigs. Because if you’ve spent any time using the web today — and if you’re reading this, that’s a safe bet — you’ve most likely already been an unwitting subject in what’s called an A/B test. It’s the practice of performing real-time experiments on a site’s live traffic, showing different content and formatting to different users and observing which performs better.
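
To make the mechanics concrete, here is a minimal sketch, in Python, of how a site might split live traffic between two versions of a page and tally which performs better. The identifier scheme, the experiment name, and the even 50/50 split are illustrative assumptions, not any particular vendor’s implementation.

```python
import hashlib

# Hypothetical scheme: deterministically bucket each visitor by hashing a
# stable identifier, so repeat visits always see the same variant.
def assign_variant(visitor_id: str, experiment: str = "homepage-headline") -> str:
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Tally how often each variant is shown and how often it gets clicked.
impressions = {"A": 0, "B": 0}
clicks = {"A": 0, "B": 0}

def record_impression(visitor_id: str) -> str:
    variant = assign_variant(visitor_id)
    impressions[variant] += 1
    return variant  # the caller serves this variant's content

def record_click(visitor_id: str) -> None:
    clicks[assign_variant(visitor_id)] += 1

def click_through_rate(variant: str) -> float:
    return clicks[variant] / impressions[variant] if impressions[variant] else 0.0
```

Comparing the two click-through rates once enough traffic has flowed through is, in miniature, the entire trick; production tools differ mainly in scale, instrumentation, and statistical rigor.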

Though it came into its own on the World Wide Web, the idea of A/B testing predates it, going back at least as far as catalog mailers and infomercials. In those metric-poor times, different phone numbers or discount codes could be shown onscreen or be printed on an insert as a way to track the allure of one pitch versus another. This data was a big step towards solving the age-old marketer’s bane (“half of my budget is wasted; I just don’t know which half”), but as a rule, any business insight ended at the point of sale.

If you were a blender company, you knew what made for sales conversions, but you couldn’t know how many people used the blender, at what time, how often, or whether it was for a milkshake or a margarita. On the web, and more recently in smartphone apps, companies are effectively able to monitor each press of the purée button. An app or site developer can know, for instance, exactly how many users are looking at a particular screen or clicking a certain button at a given moment—and often where in the world they’re doing so.

The rise of A/B testing online began around the turn of the millennium with internet titans like Google and Amazon, and in recent years it has slowly seeped into ever-greater swaths of modern life, becoming more or less standard practice everywhere from the leanest startups to the biggest political campaigns. The touted “internet of things” may, in the next decade, bring the world of physical commerce up to speed with its software counterpart, finally making the purée button report back to corporate HQ.

More than a best practice, though, A/B testing is a way of thinking, and for some, even a philosophy. Once you are initiated into the A/B ethos, it becomes a lens that colors just about everything, online and off.

One Nation, Randomly Divisible for Statistical Significance

“It is one of the happy incidents of the federal system,” wrote Associate Supreme Court Justice Louis D. Brandeis in 1932, “that a single courageous State may, if its citizens choose, serve as a laboratory; and try novel social and economic experiments without risk to the rest of the country.”

In the realm of politics, A/B testing makes an unexpected argument for things like block grants and state, as opposed to federal, power. As Silicon Valley’s A/B devotees can increasingly attest, not everything is best solved by discussion and debate. Differences in the way states implement policy and address issues make for a rough 50-way A/B test, yielding empirical data that can go where partisan thought experiments, and even debate at its most productive (but nonetheless theoretical), cannot.

Consider, for instance, the relationship between a society’s criminal justice system and its crime rates. A 2009 report from The Pew Center on the States shows that Idaho’s “correctional control” (jail, prison, probation, and parole) population increased by 633% from 1982 to 2007, while neighboring Utah’s increased by only 30% over the same period. In 2008, Alabama spent 2.5% of its state general fund on corrections; Michigan spent almost an order of magnitude more: 22.0%. What effect, if any, did such huge differences in policy have on the relative safety of those states? Such interstate differences allow for a kind of side-by-side analysis that tracking federal data across different time periods cannot.

Of course, 2007 Idaho and 2007 Utah are different places, with other variables in play besides their correctional policies, and this blunts the impact of the data. A true political A/B test would compare truly randomized groups drawn from the same population: say, by divvying Social Security numbers into cohorts and applying different legal outcomes to each.

Here’s one way that could play out. Say (as has too often been the case) my car gets ticketed on street sweeping day: the ticketing officer runs my plates, which show whether I’m in the Restitutive Group or the Punitive Group. If the former, I’m fined the $10 it takes the city to hand-sweep that fifteen-foot section of curb. If the latter, I’m fined the $75 it will take to make me think twice every time I park. Lawmakers would determine the relevant metric (say, recidivism) and would quickly establish, to a scientific certainty, whether the stiffer penalty had the desired effects. Why debate when you can test?
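
How might lawmakers actually “establish, to a scientific certainty” that one penalty outperforms the other? One standard tool is a two-proportion z-test on the chosen metric. A minimal sketch follows; the cohort sizes and recidivism counts are invented purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x_a: int, n_a: int, x_b: int, n_b: int) -> tuple[float, float]:
    """Test whether two observed rates (x successes out of n trials) differ."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Made-up numbers: 140 repeat violations out of 1,000 $10 (restitutive)
# tickets, versus 110 out of 1,000 $75 (punitive) tickets.
z, p = two_proportion_z_test(140, 1000, 110, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.03, p = 0.043: significant at the 5% level
```

With citations flowing in by the thousands, such a test would resolve quickly, which is exactly what makes the thought experiment both seductive and unsettling.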

Seemingly absurd notions like this one, with multiple codes of law operating simultaneously, start to make an uncanny amount of sense once one starts drinking Silicon Valley’s A/B Kool-Aid. Such a world, in which different permutations of the law apply to different citizens in the same jurisdiction at the same time, starts to resemble strange speculative-fictional dystopian noirs like China Miéville’s The City & The City. It also starts to resemble the contemporary web.

The Creative Process and the Slap of Data

A/B testing also casts an odd light on a practice close to home for me personally: writing. During my visit to the offices of the all-things-gaming site IGN, I was allowed to try my hand at creating some alternative headline copy for the IGN homepage. I perused the day’s trending stories and found one whose headline seemed a little flat. I concocted an alternative that varied by just a word or two but was, I thought, snappier. Within seconds the test was live on IGN’s traffic, and within minutes the results were clear. My headline bombed.

I had officially been “slapped in the face by data,” as one developer put it: something of a rite of passage for A/B testers. The bigger slap, though, was the realization that my chosen profession was perhaps more quantitative and empirical than I’d imagined.

“It’s your favorite copyeditor,” says IGN co-founder Peer Schneider. “You can’t have an argument with an A/B testing tool like Optimizely when it shows that more people are reading your content because of the change. There’s no arguing back. Whereas when your copyeditor says it, he’s wrong, right?” This comment stings in retrospect, as forty-eight hours later I would cost his company umpteen clicks with my misguided “improvement.”

Conversations like this over the past months have prompted unexpected reflections on my own work. “So, like, how many A/B tests did you guys do when you were deciding the subtitle for your book?” a developer at one startup asked me. All of a sudden I felt the flush of shame. “Uh—none. We just all got together and discussed and picked one.”

“Huh,” said the developer, his eyebrows registering a mix of curiosity and concern.

Of course, what works for headlines and subtitles doesn’t work for novels, with their 90,000 moving parts. Indeed, developers seemed to treat me with sympathy and pity: As an author, I am expected to periodically disappear for 12 to 18 months and emerge with a massive and nearly finished product, virtually unseen before publication and unalterable afterwards. Its ultimate success or failure won’t be clearly measurable until years after its release, if it is measurable within my lifetime at all. For anyone in a data-driven culture, this is a nightmare scenario. And I confess there are days when I long for the tester’s certainty: the headline or ad-copy writer who takes three cracks at a sentence before 9:30 am, and by quarter of 10 knows once and for all which was best.

Ultimately, though, there are reasons to be grateful that life on the whole remains unamenable to the A/B test. The unholy thing about A/B testing is that it tends to treat users as fungible. Testing ad copy works because man-on-the-street X’s reaction is presumed to be a useful guide to man-on-the-street Y’s reaction. And when you run the test and the statistics hold up, it is. But in the political example, learning that a particular sentence is excessive comes only after you’ve administered it to real people living real lives.

And as for finding the right words: Many of our most important letters, remarks, decisions, and questions are meant for an audience of one—a population size that admits no sampling. Where it counts the most—in family, in friendship, in love—we are operating by instinct, no A’s, no B’s, flying blind.