Saturday, December 18, 2010

The Size of Shakespeare, or, A Comedy of Errors


It's been a little while, but I haven't neglected you, three blog readers. I've been working on a project in order to blow your collective mind, or at least give it a little something to chew on.

Specifically, I have delved deep into the realms of my Gmail chat logs and have begun to discover: data. Oh man, the trip this has been. And it's not over. There will be charts, there will be graphs, AND! there may be PODCASTING.

I intend to give you the tidbits I have learned in chewable form, piece by piece. Today's episode is: why you should examine your data thoroughly before you draw any conclusions.

I wrote a Perl script to turn my wad of uncooked data into a delicious patty; it returned the size of each individual chat file along with other important stats. In the statistical scripting language R, I discovered that the sum total of chat content produced was, in a word, ridiculous. I did some calculations and made a graph that looks a little something like this:

Yes, it appeared that even just my most chatty friend had produced with me a larger corpus of work, bytewise, than Bill Shakespeare himself (the Bard wrote about 5 MB worth). Sweet mercy. Note: I have been using Gmail's Chat client, and occasionally Google Talk, since the former launched in the middle of 2006.
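
For anyone who wants to try the size-tallying step on their own logs, here is a minimal sketch in Python (not the Perl script described above). It assumes each chat is saved as its own file in a single directory; the function names and that layout are hypothetical, chosen only for illustration.

```python
import os


def chat_file_sizes(log_dir):
    """Return a dict mapping each file name in log_dir to its size in bytes."""
    sizes = {}
    for entry in os.scandir(log_dir):
        # Skip subdirectories; only regular files count as chat logs here.
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def total_megabytes(sizes):
    """Sum the per-file byte counts and convert to megabytes (10**6 bytes)."""
    return sum(sizes.values()) / 1_000_000
```

Calling `total_megabytes(chat_file_sizes("chats"))` on a hypothetical `chats` directory would give a single number to hold up against the Bard's roughly 5 MB.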

The problem with this graph is that it's not actually accurate. When I examined the files a little further, I realized most of them looked like this:

Uh oh. So, of course, I had to write another Perl script. I found that code accounted for roughly 77 percent of the content created by Gmail chat. Here's the revised graph:

Tada! I think the most interesting development from this graph is quite simply that, even with the code stripped from the chats, a few of my friends and I have produced an entire corpus of text.
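
The code-stripping step can be sketched along the same lines. This is a rough Python illustration rather than the second Perl script itself, and it assumes the "code" in the files is HTML-style markup, which may not match the real format exactly. The names `TextOnly`, `strip_markup`, and `markup_fraction` are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(raw):
    """Return the text of raw with tags and attributes removed."""
    parser = TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def markup_fraction(raw):
    """Return the share of raw's bytes that is markup rather than text."""
    if not raw:
        return 0.0
    text_bytes = len(strip_markup(raw).encode("utf-8"))
    raw_bytes = len(raw.encode("utf-8"))
    return 1 - text_bytes / raw_bytes
```

Running `markup_fraction` over each file and weighting by file size would yield an overall figure comparable to the 77 percent reported above, assuming the markup assumption holds.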

In the next few months, I'll look at some of the sociological implications of the date and size data, and then all the way into the textual aspect of the transcripts. Analyzing the text itself should be tremendously interesting.

[Note: I changed some things as other pursuits have prevented me from diving into the actual text. Someday...someday.]


  1. I think the implications are clear: The Bard is probably really, really happy that the minutiae of his life weren't dutifully logged in Google's servers.

  2. I'm really sure that our rants about girls were worthy to be kept within the bowels of gmail chat