Six Degrees of Kevin Bacon in six languages

20 years of efficiently computing Bacon numbers

apollo1

The Oracle at Delphi spoke just one language, a cryptic one that priests “compiled” into ancient Greek. The Oracle of Bacon—the website that plays the Six Degrees of Kevin Bacon game for you—has, in its 20-year existence, been written in six languages. Read on for the history and the reasons why.

1995-1999: C

The original version of the Oracle of Bacon, written by Brett Tjaden in 1995, was all C. The current version, my stewardship of it, and my revision control history only go back to 1999, so that’s where I’ll start. In 1999, I rewrote the Oracle… still entirely in C. Expensive shortest-path and spell-check algorithms? Definitely in C. String processing to build the database? Also C! Presentation layer to parse CGI parameters and generate HTML? C here, too!

Why C? The rationale for the algorithmic component was straightforward: the Oracle of Bacon ran on a slow, shared Unix machine that other people were using to get actual work done. Minimizing CPU and memory resource requirements was the polite thing to do. I needed a compiled language that let me optimize time and space extensively. The loops all counted down, not up, because comparing against zero was fractionally faster on SPARC. It had to be C.

But why were the offline string processing and the CGIs in C? Mostly, I think, to reuse code from the other parts of the code base and from previous projects I’d written when C was the only language I knew.

2004-2005: Perl

As the site added features, I got tired of writing code to generate HTML in C. I wrote new CGIs, then rewrote existing CGIs, in Perl. Simply put, writing the CGIs in an interpreted language made me more productive. I had hash tables and vectors built into the language and CGI support a simple “use” statement away. I didn’t have to compile on one server and then deploy to another—I could edit the CGIs right there on the web server. Good deployment practices it wasn’t, but it made me more productive as a programmer, and the performance of the CGIs didn’t matter all that much.

2008-2012: PHP

For years, the header and footer of each web page were static files that the CGIs simply wrapped around content to produce pages. Initially, these header and footer files were generated using M4, but that’s another story, and not a choice I ever made. As I redecorated the site, I wanted the headers and footers to be parameterized: different menus, search boxes, and social-media links on different pages. PHP did that more cleanly than static files, so I gradually rewrote the presentation code in PHP. Doing that had an added performance bonus: since PHP ran in the context of Apache, it saved a measurable amount of fork, exec, and parsing time on each request. No, my hosting provider didn’t have mod_perl.

2010: Java?

One common theme to this point is that only the presentation layer ever changed languages. The database, which must search millions of actors, actresses, and movies in about a half second of CPU time, was always C. In 2010, a friend who teaches CS to undergraduates asked me to give him some Oracle of Bacon sample code for an assignment. And, oh, could it be Java? That seemed reasonable. The data structures in the Oracle of Bacon read like a textbook example of object-oriented programming. Each actor has a name and a list of movies. Each movie has a name, a year, a genre, and a list of actors. Java seemed like a great fit. I was even working on some Java server code at my day job at this point.

So did it work? As the kids say, lolno. Every object ballooned by 8-16 bytes. Every string ballooned by 24. Actually, it was worse than that, because every object had to be on the garbage-collected heap, and the C version of the Oracle uses a custom heap with less allocation overhead. In short, the memory footprint jumped by a factor of about ten. That would be fine for undergraduates’ laptops, sure, but not for a production server. I sent the professor my Java sample code and haven’t touched it since.

2013: Go?

Approximately 70% of the programmers I’ve ever met now work at Google.(Maybe not quite, but it seems that way.) And some of them like Go. I like the idea of Go, so I wrote a prototype of the Oracle of Bacon in Go. I wanted to add a scheduler and worker threads to the Oracle, both of which would be straightforward in Go. Unfortunately, Go, like Java, has a managed heap. So my Go prototype, like my Java prototype, was far slower and far larger than the existing C code.

In both Go and Java, I could have sliced all my data structures into arrays of simple types and eliminated most of the object and heap overhead. Or I could have left any code that touched the movie database in C, using Java and Go native bindings. But either of these approaches would have fractured the code, leaving it less maintainable and arguably less type-safe than the (existing, optimized, and debugged) C version. Since maintainability and type safety were the purported advantages of a high-level-language rewrite, I rejected these ideas. In C it stays.

2014: Ruby

Why Ruby? Why now? Ruby was the easiest language, for me, this year, in which to add the Oracle of Bacon’s first unit tests. (Really. Partly, I chose Ruby because it’s something I use now at my day job. I also chose it because I never mastered Perl’s object-oriented syntax, and so my Ruby code is always better organized than my Perl code.

Onward and upward

I don’t have a favorite programming language, and I never have. I made the language choices I made based mostly on the trade-off between run-time performance and my productivity in the languages I knew at the time. The Oracle of Bacon has always been tightly resource constrained—originally because it ran on shared login machines, and more recently because I pay for the hosting costs out of pocket. To this day, the Oracle of Bacon runs on a server with just 640MB of memory and a correspondingly tiny CPU allowance. Keeping resource requirements at those levels has saved me thousands of dollars in hosting costs.

But it’s silly, as I learned after a scant five years maintaining the site, to write everything in C. Code that handles 10K of HTML doesn’t have the same performance requirements as code that traverses a six-million-node movie database. I can write presentation code faster and maintain it more easily in an interpreted language—generally, whichever one I’m using for the other projects in my life.

The pragmatic approach here doesn’t leave much room to experiment with new languages. For that, there are always quines, polyglots, and bottles of beer on the wall. For a production system, give me the languages I know.

tags: , , , , , ,