Friday, September 18, 2015

Indexing Google's Source Code

I couldn't talk about this before, but now that Wired magazine has disclosed the size of Google's code base (2 billion lines), I can discuss my authorship of gtags, what it did for Google, and some funny stories that arose along the way.

I wrote the first version of gtags in 1991 (yes, gtags is older than Google!), when I was at Geoworks. GEOS was several million lines of assembly, including every freaking app written for that OS. Since every object could potentially call any other object, the entire code base was relevant. Needing to get my head around that code base, I tried to build a TAGS database and that immediately caused my Emacs to start swapping. The performance was unacceptable.

The core insight was this: there's no reason to use a disk-based search on a TAGS database. Stick the entire database into RAM, use a hash table to look up your keywords, and search performance would go from multiple seconds (half a minute in some cases) to sub-second response times. So one weekend I coded up the hash table, wrote code to load a TAGS database into memory, and implemented a simple UI that let me browse code in Emacs. Soon I enjoyed sub-second search times and could grok code that would have been impenetrable any other way.
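The whole trick fits in a few lines. Here's a minimal sketch of the idea in Python (the original was not Python, and the record layout below is an assumption for illustration): load every tag into an in-memory hash table keyed by symbol, and every lookup becomes a constant-time probe instead of a scan over a TAGS file on disk.

    # Minimal sketch: an in-memory tags index. The (symbol, filename, line)
    # tuple layout is assumed for illustration.
    from collections import defaultdict

    def load_tags(tag_records):
        index = defaultdict(list)
        for symbol, filename, line in tag_records:
            index[symbol].append((filename, line))  # overloads share a key
        return index

    index = load_tags([("draw", "view.asm", 120), ("draw", "icon.asm", 48)])
    print(index["draw"])  # O(1) lookup: [('view.asm', 120), ('icon.asm', 48)]

The table costs a bit of startup time and RAM, but after that no search ever touches the disk.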

If I ever needed validation that the tool-building approach to coping with large-scale software was the right one, this was it. Once the senior engineers (remember, I was an intern at Geoworks then) got hold of the tool, I saw even loyal vi users switch over to Emacs just to get their hands on the code-browsing functionality (going from half a minute per search to sub-second responses was that critical).

After I left Geoworks, most of my code was in C, C++, or other high-level languages. Computers got so fast, and IDEs so sophisticated, that I never dealt with a code base that couldn't be loaded into the IDE. It seemed to me that ever more powerful machines had obviated the need for such functionality.

That was, until I joined Google in 2003. By then, Google's code base was already approaching 1 billion lines, spread across multiple languages, and I needed to wrap my head around it in a hurry. Various teams were using ad hoc tricks to subset Google's code base into their IDEs, which I thought was a kludgy and unsatisfactory way to work. So in my 20% time, I rewrote my old tool using Google infrastructure (thanks to Craig Silverstein, who was the only person who believed in my code enough to spend precious time reviewing it; even then he was skeptical that my tool would be widely used or even substantially useful, given the huge amount of effort people had put into subsetting the code base). I coded up the UI again in Emacs Lisp. I actually had to put some effort into the UI this time, since C++ (and Java) overloading meant you had multiple search results for any given search term. Thankfully, Arthur Gleckler came in to lend a hand. Reading Arthur's Lisp code was like reading poetry: you can't believe the succinctness and elegance that can be expressed in so little space. It's worth your time to learn Emacs Lisp just so you can read Arthur's code.

Just as I expected, gtags took off in a huge way inside Google's engineering team. (By the time I left, the metric was 2,500 daily active users, or about 25% of Google's engineering workforce. The internal tools team did a survey once and discovered that nearly every engineering workstation had a copy of Stephen Chen's gtags-mixer running on it.) There wasn't a wholesale conversion from vi to Emacs, though: Laurence Gonsalves stepped in and wrote a vim script that emulated the Emacs code. I don't even remember how I managed the code review for that checkin, but anything to help gtags: I must have gritted my teeth and reviewed it myself, or asked Laurence to find someone competent to review it.

But I wasn't even close to done. Because of the huge amount of ambiguity and overloading in C++ and Java, I wanted gtags to do a better job of ranking the results of any given search. Phil Sung took a first crack at it, introducing Sung-ranking and, later on, an include-rank that mirrored PageRank, except for code. Stephen Chen solved the problem of how to intermix protocol buffer files into the search results. Matei Zaharia (now a professor at MIT) spent a summer integrating a parser into the indexer for gtags, so that it was no longer a dumb lexical scanner but a full-on type-comprehension system for both C++ and Java. He also designed and implemented incremental indexing of Google's code base, no mean feat. Leandro Groisman and Nigel D'Souza both also made major contributions to gtags.
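I can't reproduce Phil's exact formulation here, but an include-rank in the spirit of PageRank might look something like the following sketch, where rank flows across the include graph so that widely depended-upon files (and the symbols they define) float to the top. Everything below (the function name, damping constant, iteration count) is my reconstruction for illustration, not the actual gtags algorithm.

    # Hypothetical include-rank: PageRank-style iteration over the include
    # graph. Dangling-node rank mass is glossed over for brevity.
    def include_rank(includes, iterations=20, damping=0.85):
        """includes: dict mapping each file to the list of files it #includes."""
        files = set(includes) | {f for deps in includes.values() for f in deps}
        rank = {f: 1.0 / len(files) for f in files}
        for _ in range(iterations):
            fresh = {f: (1.0 - damping) / len(files) for f in files}
            for src, deps in includes.items():
                for dep in deps:
                    fresh[dep] += damping * rank[src] / len(deps)
            rank = fresh
        return rank  # symbols defined in high-rank files sort first

    ranks = include_rank({"a.cc": ["base.h"], "b.cc": ["base.h"], "base.h": []})
    print(max(ranks, key=ranks.get))  # -> base.h, the most included file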

For several years, I had the entire Google source repository checked out on a dedicated gtags indexing machine sitting under my desk. It was a standard underpowered workstation of that era (dual core, 2GB of RAM, and 500GB of disk) with a special p4 client that eliminated the need to download any binary assets, since gtags was only interested in code! It was probably a major security hole, but I figured that since Bill Coughran knew about it, I wasn't violating any corporate policies.

This illustrates a very important point: 2 billion lines of code sounds like a lot, but if you do the math (assuming 50 characters per line), you get only about 100GB of data, uncompressed. After dropping comments, whitespace, and lines that don't declare anything, your index is going to be pretty tiny. And since you need to split that code base into several corpora anyway (C++, Java, protocol buffer declarations, Python), each individual server could easily hold its entire corpus in RAM without any fancy sharding. Too many people get caught up in applying fancy Google techniques built for terabytes of data when they're dealing with tiny amounts that fit into RAM and can be managed with traditional programming techniques.
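To make the arithmetic concrete (the 50-characters-per-line figure is the same assumption as above):

    lines = 2_000_000_000               # 2 billion lines of code
    bytes_per_line = 50                 # assumed average line length
    print(lines * bytes_per_line / 1e9, "GB")  # -> 100.0 GB, uncompressed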

In any case, gtags was a very hardware-light project: it never took more than one machine to index all of Google's code base (we never had to apply any fancy MapReduce techniques), nor did the serving cluster ever exceed about 10 machines. We came close to maxing out the RAM available on 32-bit machines for a while, but between Phil's string table optimization, which reduced memory use by 75%, and the switch to a 64-bit architecture, we never had to split the index for any given language (there was a server per language) across multiple servers. Those servers were under-utilized, of course (they could probably have served 25,000 or even 250,000 users at once), but on the flip side, you always got sub-10ms response times out of gtags. We switched from dedicated gtags desktops sitting under people's desks to Google's internal cloud fairly early on, with Ken Ashcraft doing much of the work of converting gtags into a borg-ready service.
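I can't reproduce Phil's actual optimization here, but string interning is the classic way to get that kind of memory win in a tags index: file names and symbol names repeat constantly, so store each unique string once and let every tag carry a small integer id instead of its own copy. A hypothetical sketch:

    # Hypothetical string table: each unique string is stored once;
    # tags hold integer ids rather than duplicate strings.
    class StringTable:
        def __init__(self):
            self.ids = {}        # string -> id
            self.strings = []    # id -> string

        def intern(self, s):
            if s not in self.ids:
                self.ids[s] = len(self.strings)
                self.strings.append(s)
            return self.ids[s]

        def lookup(self, string_id):
            return self.strings[string_id]

    table = StringTable()
    assert table.intern("base/logging.h") == table.intern("base/logging.h")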

The security question came to a head when Google opened its China office sometime in 2005 or so. After that, the powers that be decided that high intellectual property (HIP) code required special permissions to access. Since I wasn't HIP enough, I simply stopped indexing that code. This burdened the HIP people so much that eventually some of them (including Sandor Dornbush) contributed to gtags. A HIP engineer would take on the burden of downloading and indexing the HIP code with our indexer, then put up a gtags server with that code behind a HIP firewall. The gtags-mixer would then be configured to talk to the HIP server and mix in its results if you were HIP enough.
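The shape of that mixing logic is simple enough to sketch. All the names below (TagServer, mixed_search, the user_is_hip flag) are invented for illustration; the real gtags-mixer interfaces were never public.

    # Hypothetical sketch of the mixer: query every per-language server,
    # and fold in the HIP server's results only for HIP users.
    class TagServer:
        def __init__(self, index):
            self.index = index                       # symbol -> locations

        def query(self, symbol):
            return self.index.get(symbol, [])

    def mixed_search(symbol, user_is_hip, servers, hip_server=None):
        results = []
        for server in servers:                       # one per language corpus
            results.extend(server.query(symbol))
        if hip_server is not None and user_is_hip:   # behind the HIP firewall
            results.extend(hip_server.query(symbol))
        return results

    cc = TagServer({"Frobnicate": [("base/frob.cc", 10)]})
    hip = TagServer({"Frobnicate": [("hip/frob.cc", 99)]})
    print(mixed_search("Frobnicate", True, [cc], hip))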

One of my prouder moments at Google was when Rob "Commander" Pike came to me and asked how gtags worked. It turned out that he didn't want to talk to the gtags mixer or the gtags server; he just wanted his programming environment/editor to directly grok the output of the indexer. I was happy to give him access to the index to do whatever he wanted with it. I forget the mechanism: he might have simply scp'd the index over to his machine, or I might have had the indexer push the index to him whenever a run finished. This was great, because Rob became one of the folks who would notice whenever the indexer broke: the file would stop updating!

In any case, as with many things at Google, after I left, gtags was replaced by a cloud solution that took far more resources than me, Arthur, and a bunch of interns ever did, and I'm sure everything I wrote has long since been retired, with the possible exception of the Emacs Lisp front-end.

Even after I left Google, gtags paid me back. Soon after I met my wife, she talked to some of her friends at Google about whom she was dating. One of them did a p4 lookup on my changes and said to her, "Hey wow, this guy has code committed everywhere, even the protocol compiler." So I guess that worked out as far as a back-door reference check was concerned. (That change to the protocol compiler was necessitated by my wanting to inject a clue into its output: the clue let the gtags indexer map a generated C++ source file back to its original .proto form. It was far easier to have the protocol compiler emit the clue than to try to guess the mapping. It was a trivial change, and Sanjay approved it in seconds.)
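To illustrate what such a clue might look like (the actual marker text isn't given in this post, so the format below is invented): if the protocol compiler stamps each generated file with a comment naming its source, the indexer can recover the mapping with a one-line scan.

    import re

    # Hypothetical clue format; the real marker emitted by the protocol
    # compiler is not shown in the post.
    CLUE = re.compile(r"^// Generated from: (\S+\.proto)")

    def proto_source(generated_cc_text):
        for line in generated_cc_text.splitlines():
            m = CLUE.match(line)
            if m:
                return m.group(1)        # original .proto path
        return None

    print(proto_source("// Generated from: net/user.proto\n// ..."))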

If it seems unbelievable that such an illustrious group of people worked on a tiny 20% project during that period, it should. But I maintain that the test of a high-quality engineering organization is whether that organization is able and willing to invest time, money, and effort into building tools that enable it to move faster and produce higher-quality code. Google met that test and passed with flying colors.
