Sunday, December 19, 2010

Ngram and Inference

Google's new ngram word mapping has been making the blogging rounds. Basically, it charts the frequency of words that occur in Google Books scans. I think Kevin Drum aptly suggests "the potential here for timewasting disguised as scholarly research." But let me take it seriously as scholarly research for a moment, because the ngram viewer simply puts a quantitative face on a key practice that humanities scholars adopt regularly: the historicization of ideas. It's Raymond Williams' Keywords with numbers. See, for instance, Aaron Bady's discussion of concepts of race.

My book-in-progress historicizes both the concept of the "social problem" and, consequently, the "social problem film." This has involved an intellectual history of the former and a reception study of the latter. The word-mapping is a good, if very partial, check to see how representative either pursuit is.

The rise in "social problem" usage does at least correspond roughly, first to progressive discourse and second to functionalist sociology. Correlation does not equal causation, but at least the chart does not contradict my main argument. The map of the "social problem film" is interesting because it suggests the genre term has become more solidified in the last few decades.

One caveat, though: the ngram viewer maps only books, meaning that popular periodical usage is not present. I can attest that in periodicals the term was more prevalent in the 1940s than in the 2000s.

There are some other problems to consider. The search is case sensitive; to use an example from Bady's post, chart "negro problem" and "Negro problem" and you will get two very different timeframes. The chart also does not track tone or context. For that I would recommend the Corpus of Historical American English (COHA), which does more or less the same thing Google's service does, but with the context preserved. The COHA site also lists some further limitations of the Google charts and critiques their accuracy.
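For anyone who wants to work around the case-sensitivity issue by hand, here is a minimal sketch in Python of one way to do it: chart each capitalization separately, then add the per-year frequencies together. Everything in it is an assumption for illustration. The function name `combine_case_variants` is hypothetical, the input is assumed to be a dictionary of per-year frequencies you have already collected yourself, and the numbers are placeholders rather than real ngram output.

```python
from collections import defaultdict


def combine_case_variants(series_by_variant):
    """Sum per-year frequencies across spellings that differ only by case.

    series_by_variant maps each spelling to a {year: frequency} dict.
    Returns one merged {year: frequency} dict per lowercased term.
    """
    combined = defaultdict(lambda: defaultdict(float))
    for term, by_year in series_by_variant.items():
        key = term.lower()  # treat all capitalizations as one term
        for year, freq in by_year.items():
            combined[key][year] += freq
    # Convert to plain dicts with years in ascending order
    return {key: dict(sorted(years.items())) for key, years in combined.items()}


# Placeholder values for illustration only, NOT real ngram data
toy = {
    "social problem": {1900: 1.0, 1950: 3.0},
    "Social Problem": {1900: 0.5, 1950: 0.25},
    "Social problem": {1950: 0.25},
}

merged = combine_case_variants(toy)
print(merged)
```

Summing is only reasonable if each variant's frequency is measured against the same yearly total, which I am assuming here. It also does nothing about tone or context, so it addresses the first problem and not the second.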

1 comment:

Chuck said...

I've been playing with it, primarily for entertainment, just to trace the popularity of various authors, filmmakers, and public figures at various points. I think you're right to speculate about the problems with case sensitivity, etc. I put as many as five terms in at a time, so I would guess that you could put in multiple terms and spellings and get pretty solid visual data.