Monday, April 25, 2011

Finding the Weird Words (Vocab and Gmail Corpus Part I)

Today, I can finally begin to analyze the text of my Gmail Chat corpus.

My first thought is that IM text is probably somehow different from speech and pre-planned text: books, articles, letters, etc. After the rather painful process of piecing together Perl scripts to divide my corpus between the text I wrote and the text other people wrote, I ran those two files through Laurence Anthony's great AntWordProfiler, which compares a text against the General Service List and Academic Word List.

The GSL and the AWL are general vocabulary lists designed to help ESL teachers give learners of English the most useful vocabulary first. While English has one of the largest lexicons (if not the largest) in the history of language, about 80% of most texts consists of a few thousand very common words. The 2,000 "most useful" of these words make up the GSL. The AWL adds words that appear in basic academic texts and newspapers.

AntWordProfiler divided the words from my texts into four categories: those found in the first thousand words of the GSL (K1), those found in the second thousand (K2), those found in the AWL, and those found in neither the GSL nor the AWL. The mapping is not completely accurate, but one might think of the four groups as "very frequent", "frequent", "somewhat frequent", and "not found".
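For anyone who wants to see what this kind of sorting amounts to, here is a rough sketch in Python. It is not AntWordProfiler's code, and it does not reproduce whatever matching rules that tool actually applies. The three tiny word sets are placeholders standing in for the real lists, and the names `BANDS`, `band_counts`, and `percentages` are made up for this illustration.

```python
import re
from collections import Counter

# Placeholder word lists (assumption): the real GSL and AWL are far larger.
# Dict order matters: a token is assigned to the first band that contains it.
BANDS = {
    "K1": {"the", "a", "i", "to", "see", "you", "is", "at"},
    "K2": {"lunch", "sorry", "hurry"},
    "AWL": {"analyze", "data", "research"},
}
OFF_LIST = "Not in List"


def band_counts(text):
    """Count how many tokens of `text` fall into each band."""
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for band, words in BANDS.items():
            if token in words:
                counts[band] += 1
                break
        else:
            # No band claimed the token.
            counts[OFF_LIST] += 1
    return counts


def percentages(counts):
    """Convert band counts into percentages of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {band: 100.0 * n / total for band, n in counts.items()}


if __name__ == "__main__":
    sample = "I analyze the data at lunch lol"
    for band, pct in percentages(band_counts(sample)).items():
        print(f"{band}: {pct:.1f}%")
```

Running the same function over each of the two files (my text and everyone else's) would give two breakdowns that can be compared band by band.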

This is the distribution of the four groups, divided in two between words typed by others and words typed by myself. I ended up using a lower percentage of K1 (first thousand) GSL words, and a higher percentage of off-list words. This is probably because my corpus of interlocutors' words includes more than 80 different people with varying word-choice preferences. As the average should be 80% K1 + K2 words, neither I nor my associates are more loquacious than the average bear. (Not true: Average bear K1 is probably 0; most bear speech is "Not in List".)

What I find most interesting, really, is how similar the two breakdowns are. The differences between the percentages of GSL K2 and AWL words were a matter of fractions of a percent each, and both corpora show the distributions predicted for normal English text. That suggests that instant messaging text is no different from speech or pre-planned text in its word choice (as opposed to, say, SMS text, which almost certainly has a different distribution).

Next up: What was in that "Not in List" group? A look at my weird words.


  1. Interesting. My only question is whether there's an inherent bias in the people you most frequently chat with. I'm guessing they skew pretty heavily toward more literate text, no?

  2. Oh yes, and purposeful misspellings. But that should yield a smaller percentage of GSL words than the 80+ we're seeing.

    Also, I see what you did there.

  3. I am want see how me did on thing. Please.