The Pitfalls of Using Google Ngram to Study Language

Garbage in, garbage out when it comes to big data analysis of language and culture.

Five years ago, Google unveiled a shiny new toy for nerds. The Google Ngram Viewer is seductively simple: Type in a word or phrase and out pops a chart tracking its popularity in books. Millions of books, billions of words—suddenly accessible with just a few keystrokes. It's a fun and clever offshoot of the Google Books program, which scanned books from over a dozen university libraries.

With Google Ngram, you could easily track the fame of Mickey Mouse versus Marilyn Monroe, the evolution of irregular verbs, censorship in Nazi Germany, and the decline of God. And so, so, so much more. At least, that was the promise from researchers who published a splashy paper in the prestigious journal Science. They even went ahead and gave their new field a name: “culturomics.”

Since then, Google Ngram has been popping up in the scientific literature and all over the internet in pop social science articles. Even if you haven’t heard the word Ngram, you’ve seen the charts, in the familiar red, blue, and green of Google’s logo.

But—and you can probably sense a “but” coming—relying on Google Ngram to study the rise and fall of words and ideas has plenty of pitfalls. A new paper published in PLoS ONE outlines some of the major problems with the corpus of scanned books that powers Google Ngram. “It’s so beguiling, so powerful,” says Peter Sheridan Dodds, an applied mathematician at the University of Vermont who co-authored the paper. “But I think there’s a misrepresentation of what people should expect from this corpus right now.” Here are some of the problems.

OCR Errors

OCR, or optical character recognition, is how computers take the pixels of a scanned book and convert them into text. It's never a perfect process, and it only gets harder when computers are trying to decipher squiggles on a 200-year-old page. Let's look at a particularly amusing and profane example:

[Chart: Google Ngram]

From the data alone, you might wonder why "fuck" almost completely disappears in books only to be revived in 1960. But, well, it didn't. The lowercase long s in old books looks a lot like an f, a fact that has long fooled computers and confused kids trying to read the Constitution. As Mark Liberman, a computational linguist at the University of Pennsylvania, points out, the confusion of s and f turns up time and again: case read as cafe, sunk as funk, same as fame. Plenty of OCR errors probably exist, but systematic ones like confusing s and f are where you have to start being careful.
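
The long-s problem is predictable enough that you can guard against it. Here's a minimal sketch (mine, not from Dodds' paper) of a helper that, given a modern spelling, generates the plausible f-for-s misreading so you can check both variants when poking around in pre-19th-century data:

```python
# Toy illustration: in older typography the long s appears in non-final
# position, and OCR frequently reads it as "f". Given a modern spelling,
# build the likely misreading so both forms can be queried.

def long_s_misreading(word: str) -> str:
    """Replace every non-final lowercase 's' with 'f', mimicking the common
    OCR confusion; a word-final short s is usually read correctly."""
    if len(word) < 2:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last

for w in ["case", "sunk", "same", "best"]:
    print(f"{w} -> {long_s_misreading(w)}")
# case -> cafe, sunk -> funk, same -> fame, best -> beft
```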

Overabundance of Scientific Literature

Still, one wrong letter is pretty trivial. The corpus gets skewed in less visible ways, and these are more insidious. Google Books' English-language corpus is a mishmash of fiction, nonfiction, reports, proceedings, and, as Dodds' paper seems to show, a whole lot of scientific literature. "It's just too globbed together," he says. His study tracks the frequency of words common in academia, such as the capitalized "Figure," likely to appear in the caption of a paper, versus the lowercase "figure," which has many more common uses.

[Chart: Google Ngram]
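
The Ngram Viewer treats queries as case-sensitive, so a comparison like this one is easy to reproduce. As a rough sketch only: the viewer draws its charts from an unofficial JSON endpoint, and the URL, parameters, and response shape below are inferred from the viewer's own network requests and could change or disappear at any time.

```python
# Rough sketch: fetch case-sensitive frequencies for "Figure" vs. "figure"
# from the Ngram Viewer's unofficial JSON endpoint. This is not a documented
# API; parameter names and the response format are inferred from the
# viewer's own requests and may break without warning.
import requests

params = {
    "content": "Figure,figure",  # comma-separated ngrams, case-sensitive
    "year_start": 1900,
    "year_end": 2008,
    "corpus": "en-2012",         # assumed corpus label; older versions used numeric codes
    "smoothing": 0,
}
resp = requests.get("https://books.google.com/ngrams/json", params=params, timeout=30)
resp.raise_for_status()

# Expected shape: [{"ngram": "Figure", "timeseries": [floats by year]}, ...]
for series in resp.json():
    print(series["ngram"], "peak relative frequency:", max(series["timeseries"]))
```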

The changing composition of the corpus over time isn't a new criticism. Many have noted that the pre-20th-century corpus has way more sermons. Jean Twenge, a psychologist at San Diego State University who has used Google Ngram to study narcissism, cautions against "throwing the baby out with the bathwater." For example, she notes, the fact that scientific literature grew so much is indicative of a shift in society, too.

But the tricky part here is more subtle. If scientific publications are taking up more and more of the corpus, certain non-scientific terms may appear to fall in relative popularity. For example, are writers less interested in writing about "autumn," or are there simply more scientific papers, totally unrelated to "autumn," crowding the corpus?
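
The mechanism is just arithmetic: an ngram's frequency is its count in a given year divided by all the words published that year, so a flood of unrelated text drags a term down even if writers use it exactly as often as before. A toy calculation, with numbers invented purely for illustration:

```python
# Entirely made-up numbers showing how a growing, unrelated genre dilutes a
# word's relative frequency even when its absolute use stays constant.

autumn_mentions = 1_000          # hypothetical yearly mentions of "autumn" in fiction
fiction_words = 100_000_000      # hypothetical fiction words in the corpus per year

for science_words in (0, 100_000_000, 400_000_000):
    total_words = fiction_words + science_words
    rel_freq = autumn_mentions / total_words
    print(f"science words: {science_words:>11,}   relative frequency of 'autumn': {rel_freq:.2e}")

# Output: 1.00e-05, then 5.00e-06, then 2.00e-06. The word "falls" by 80%
# without a single writer changing their habits.
```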

[Chart: Google Ngram]

Messy Metadata

When Google scans books, it also populates the metadata: date published, author, length, genre, and so on. Like OCR, this is a largely automated process, and like OCR, it's prone to error. Over at the blog Language Log, University of California linguist Geoff Nunberg has documented books whose dates are badly wrong. He notes that a search for Barack Obama restricted to years before his birth turns up 29 results. Some of these errors have since been fixed; Google is pretty good about correcting Google Books once errors are pointed out.

But the fixes don't make it into the indexed corpus that powers Google Ngram right away. That corpus has been updated only once, in 2012. "Our paper is a bit of an appeal to Google to release a third edition which would be more nuanced," says Dodds. "We need a recleaning of the data."

Popularity Contests

One of the traps in using ngrams to divine the popularity of people, ideas, or concepts is that a book appears in the corpus only once—whether it's been read once or millions of times. The Lord of the Rings is in there once, notes Dodds, and so is some random paper on mechanics. The two texts are weighted equally. The corpus doesn't reflect what people are talking about so much as what people are publishing about—and until very recently, most people didn't have access to publishing. So what does this really tell you about language?

[Chart: Google Ngram]
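
To make the weighting issue concrete, here's a hypothetical comparison (numbers invented for illustration) of what the corpus counts versus what readers actually encounter:

```python
# Hypothetical illustration: every book contributes its text to the corpus
# exactly once, so a bestseller and an obscure report carry equal weight.
# All numbers below are invented purely to show the arithmetic.

books = [
    {"title": "blockbuster novel",   "phrase_mentions": 5,  "copies_read": 10_000_000},
    {"title": "obscure proceedings", "phrase_mentions": 50, "copies_read": 40},
]

corpus_count = sum(b["phrase_mentions"] for b in books)
readership_weighted = sum(b["phrase_mentions"] * b["copies_read"] for b in books)

print("corpus count (what the Ngram Viewer sees):", corpus_count)                  # 55
print("readership-weighted count (what readers encounter):", readership_weighted)  # 50002000

# The obscure proceedings dominate the corpus count, while nearly all real
# exposure to the phrase comes from the novel.
```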

Erez Lieberman Aiden, a computational geneticist at Baylor who published the original culturomics paper, agrees that these problems exist in the Ngram corpus, though he stresses it’s true of any measurement tool in science. In his mind, this doesn’t indicate a fatal flaw in the field. “Any healthy field is going to include people who are sort of being overly enthusiastic, using data in ways that can’t possibly be justified. And other people try to slam the brakes on it,” he says.

Google Ngram is a powerful tool that researchers a decade ago could only have dreamed of. But in a way, it's so easy to use that it lends itself to overuse—and misuse. The field has arrived at a backlash. Now it just has to wait for the backlash to the backlash.