Monday, December 27, 2010

As the Sands of the Hourglass...

This is a nice graph showing where people came into and went out of my Gmail chat life. That gray line is the first day I used the service, which as you'll note is NOT in the middle of 2006 (tee hee, previous post. tee hee.)

Personally, I think this is an incredibly telling graph. It's taking, again, my top nine gchat buddies and showing how the volume of our interaction changes. For example, KRS shows a very solid line at the start of this period, meaning frequent interaction, but then tapers off. RPM starts off as a very casual gchat friend, but gains in intensity near the end. CRS is an interesting latecomer; she's my wife. She was given a Gmail account by ASU when she was accepted to the doctoral program, and I switched her personal email over to Gmail as well. We didn't start chatting until we were essentially engaged.

You'll note that she still makes it into my top nine. This is because of the size of those early chats. Unfortunately, this graph is not weighted; you get one dot for every day we converse, no matter how long or short the chat. I'm still learning R, and am trying to find out exactly how one goes about changing plotting colors according to the values of a variable. (Rather, I know how to do this for larger symbols, but not for dots.)
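For the curious, here is a rough sketch of what "weighted" could mean, written in Python rather than R so it stands on its own. Everything in it is an assumption for illustration: the function names `daily_weights` and `gray_level`, the `(buddy, day, size_bytes)` record shape, and the idea of shading heavier days darker. It only computes the numbers a plot would need; it is not the code behind the graph above.

```python
from collections import defaultdict


def daily_weights(chats):
    """Sum chat sizes per (buddy, day).

    `chats` is an iterable of (buddy, day, size_bytes) tuples; this record
    shape is hypothetical, chosen only for this sketch.
    """
    totals = defaultdict(int)
    for buddy, day, size_bytes in chats:
        totals[(buddy, day)] += size_bytes
    return dict(totals)


def gray_level(weight, max_weight):
    """Map a day's weight to a gray level between 0.0 and 0.9.

    0.0 is black (the heaviest day), 0.9 is light gray (an empty day).
    """
    if max_weight <= 0:
        return 0.9
    return round(0.9 * (1 - weight / max_weight), 3)
```

Each dot would then take its shade from `gray_level(weight, max(weights.values()))`, so a day with one marathon chat looks different from a day with a single "hey".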

A note on the three-letter codes. In the interest of privacy, they are generally not the current complete initials of the person listed. However, in order to make it readable for me, they are pretty close. If you're listed here, and you want me to change the initials to protect your privacy on the internet, please let me know. I'm not Mark Zuckerberg, you know.

Saturday, December 18, 2010

The Size of Shakespeare, or, A Comedy of Errors


It's been a little while, and I haven't neglected you, three blog readers. I've been working on a project in order to blow your collective mind, or at least give it a little something to chew on.

Specifically, I have delved deep into the realms of my Gmail chat logs and have begun to discover: data. Oh man, the trip this has been. And it's not over. There will be charts, there will be graphs, AND! there may be PODCASTING.

I intend to give you the tidbits I have learned in chewable form, piece by piece. Today's episode is: why you should examine your data thoroughly before you make any conclusions.

I wrote a Perl script to turn my wad of uncooked data into a delicious patty; it returned the size of each individual chat file along with other important stats. In the statistical scripting language R, I discovered that the sum total of chat content produced was, in a word, ridiculous. I did some calculations and made a graph that looks a little something like this:
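The size-gathering step is simple enough to sketch. This is a Python stand-in, not the Perl script itself, and it assumes the chat logs sit as individual files in one directory; the names `chat_file_sizes` and `total_bytes` are made up for the example.

```python
import os


def chat_file_sizes(log_dir):
    """Return {filename: size in bytes} for every regular file in log_dir.

    Assumes one chat per file, all in a single flat directory.
    """
    sizes = {}
    for entry in os.scandir(log_dir):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def total_bytes(sizes):
    """Add up the per-file sizes to get the size of the whole pile."""
    return sum(sizes.values())
```

Note that this counts every byte in each file, which is exactly the trap described below.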

Yes, it appeared that even just my most chatty friend had produced with me a larger corpus of work, bytewise, than Bill Shakespeare himself (the Bard wrote about 5 MB worth). Sweet mercy. Note: I have been using Gmail's Chat client, and occasionally Google Talk, since the former launched in the middle of 2006.

The problem with this graph is that it's not actually accurate. When I examined the files a little further, I realized most of them looked like this:

Uh oh. So, of course, I had to write another Perl script. I found that code accounted for roughly 77 percent of the content created by Gmail chat. Here's the revised graph:
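Here is roughly how that kind of measurement can be made, as a Python sketch rather than the second Perl script. It assumes the "code" is HTML-style markup wrapped around the chat text, which is a guess for the sake of the example, and the names `TextOnly`, `strip_markup`, and `markup_fraction` are invented here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(raw):
    """Return the text of `raw` with HTML-style tags removed."""
    parser = TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def markup_fraction(raw):
    """Fraction of the bytes in `raw` that are markup rather than text."""
    if not raw:
        return 0.0
    text_bytes = len(strip_markup(raw).encode("utf-8"))
    raw_bytes = len(raw.encode("utf-8"))
    return 1 - text_bytes / raw_bytes
```

Running `markup_fraction` over every file and weighting by file size would give an overall figure comparable to the 77 percent above. One caveat: this sketch keeps the contents of any script or style blocks as "text", so a real cleanup would need to handle those separately.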

Tada! I think the most interesting development from this graph is quite simply that, even with the code stripped from the chats, a few of my friends and I have produced an entire corpus of text.

In the next few months, I'll look at some of the sociological implications of the date and size data, and then all the way into the textual aspect of the transcripts. Analyzing the text itself should be tremendously interesting.

[Note: I changed some things as other pursuits have prevented me from diving into the actual text. Someday...someday.]