
N-gram, Corpus, Field

A quick reaction to "Quantitative Analysis of Culture Using Millions of Digitized Books."

Thanks to coverage in the mass media (NY Times) and, more importantly, the scholarly blogosphere (Language Log and again Language Log, Arcade's own Natalia Cecire, David Crystal ...indeed as I am writing this more and more responses are popping up, more than I can read today [later: including right here on Arcade, so go read Paula Moya's post, too!]), many of you will have already heard about an interesting development coming out of Google Books: an article, to be published in Science, "Quantitative Analysis of Culture Using Millions of Digitized Books," by Jean-Baptiste Michel, Erez Lieberman Aiden, et al. (Don't miss the important supplement describing their methods in more detail.) Michel and Lieberman Aiden and their collaborators at Harvard and Google, who have backgrounds not in the humanities but in mathematics, computer science, and biology, have compiled tables of "n-grams," n = 1...5, for a large number of books in the Google Books database. An n-gram is just a sequence of n words. Looking at changes in the frequencies of these n-grams over time, the authors seek to watch cultural change happening---not just qualitatively, but, they claim, quantitatively. Google has also released an interface to the n-gram tables, allowing you to graph the changing frequency of an n-gram of your choice over time.
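For readers who want the mechanics made concrete: the core operation is just sliding a window of n words over each text and counting, then normalizing each year's count by the total number of n-grams published that year. Here is a minimal sketch in Python; the toy corpus and its year labels are invented for illustration, not the authors' actual data or code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every length-n sequence of consecutive words in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy "corpus": year -> list of tokenized texts.
corpus = {
    1900: ["the quantitative study of culture".split()],
    1950: ["quantitative analysis of culture using books".split(),
           "the study of culture".split()],
}

def relative_frequency(corpus, target, n):
    """Frequency of one target n-gram per year, normalized by the
    total number of n-grams appearing in that year's texts."""
    result = {}
    for year, texts in corpus.items():
        counts = Counter(g for t in texts for g in ngrams(t, n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result

print(relative_frequency(corpus, ("of", "culture"), 2))
# → {1900: 0.25, 1950: 0.25}
```

A plot of such per-year relative frequencies is essentially what the public n-gram viewer draws, at vastly greater scale.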

There are many things, good and bad, to be said about this enterprise, and lots of people with more expertise than I to say them. Already we have Geoff Nunberg's reaction in the Chronicle, which makes some key points. No doubt much foolish ink will also be spilled on how this will revolutionize/destroy/discredit/do nothing at all to the traditional humanities with their "Theory" and their "libraries," or how Google is a force of enlightenment/darkness, or how computers are making us brilliant/zombies. Also I'm looking forward to commentary by the specialists in corpus linguistics, digital humanities, etc.

Meanwhile, a thought on what it means for this study to claim that its object is "culture" (the authors have a website, we are supposed to see an analogy to computational genomics). Everyone should be quite skeptical of the quality and scope of the data used by these researchers to represent culture over the last 500 years. As anyone who's used Google Books knows, the metadata is usually of poor quality, and dates in particular are extremely unreliable. Furthermore, despite years of protests from people like Nunberg and Robert Darnton, Google has been programmatically nonchalant and evasive about the problem, insisting that the flaws are in the catalogues provided to them by libraries. The truth is that Google is not concerned with the niceties of bibliography and book history; they are concerned to sell books in the present.

In any case, even Michel et al. acknowledge the problem in the supplement. Their solution was to find ways to prune the dataset in an attempt to isolate books with reliable metadata. Clever idea, but chief among these choices was what they call a "Serial Killer" algorithm, in which they attempted to remove all periodicals. Periodicals are one of Google Books's most conspicuous weaknesses--and one would think that the idea of removing every trace of periodicals would then have given them pause before they claimed to be inaugurating the quantitative study of "culture." Indeed, given that one of the "results" discussed in the Science article has to do with the time course of individual fame and celebrity, it is pretty much insane to rule newspapers and magazines out of discussion in favor of books. It is, or should be, a genuine research question whether "book fame" and "periodical fame" follow the same trajectories.

There's a more general point here. "Culture" becomes, in this study, a bizarrely undifferentiated mass--a bag of words, as the machine-learning people say. It's the move to the bag that makes it possible to think ignoring periodicals in order to produce higher-quality data might be okay. Now the problem is not that computational techniques neglect the putative singularity of the individual text or the need for interpretive "close reading." The challenge of producing meaningful knowledge about cultural history from large corpora can, I believe, be met by sufficiently careful researchers. But not by imagining culture as a uniform mass. Instead, a computational approach needs to see the cultural field--it needs to see that books come in multiple genres and forms, that they have multiplicitous readerships, that they organize themselves into debates, series, rivalries, subcultures, that they make different claims to permanence, veracity, accessibility... Good metadata would allow digital humanists to begin to take note of this essential fact about culture. [And no, Google Books's, and Michel et al.'s, use of the Book Industry Standards and Communication subject categories, devised for contemporary booksellers, does not even come close to being okay. What account of "culture" would try to apply the categories of contemporary book commerce over four centuries? Time series and graphs notwithstanding, no account with any pretense to a historical sensibility.]
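The "bag of words" at issue here can be made concrete: the representation keeps only token counts, discarding word order and everything the bibliographic code carries. A minimal sketch (the two documents and their metadata are invented for illustration):

```python
from collections import Counter

# Two hypothetical publications with very different bibliographic codes.
novel = {"genre": "novel", "publisher": "small press",
         "text": "culture is not a uniform mass"}
magazine = {"genre": "periodical", "publisher": "mass-market weekly",
            "text": "a uniform mass is not culture"}

def bag_of_words(document):
    """Reduce a document to word counts, dropping order and all metadata."""
    return Counter(document["text"].split())

# Different genres, publishers, and even opposite word orders
# collapse into one and the same bag.
print(bag_of_words(novel) == bag_of_words(magazine))  # → True
```

Everything that distinguishes these two objects in the cultural field vanishes at the first step of the pipeline, which is exactly the flattening the paragraph above objects to.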

But then this "world is flat" ideology of culture imbues the whole Google enterprise. It is, of course, the ideology of Silicon Valley neoliberalism, projected into culture: every n-gram gets an equal opportunity at riches, or anyway high frequency! (Michel et al.'s choice of examples is, by the way, a very tempting target for armchair ideology critique.) The freedom of the market rules! The great sea of "the world's information" speaks for itself! But of course "information" has been very selectively defined here, paring away the information contained, for example, in the "bibliographic code" of a book's format, publisher, etc.--the information used in the symbolic struggles that constitute the cultural field--in favor of a dubious ensemble of "plain" texts ("98% accurate," they proclaim, as if that were a good number).

It's possible that these kinds of information really are, in the great sea of n-grams, the merest noise in comparison to the signals from the big trends. But surely that is a question for investigation, not one that should be decided before the research begins...

Andrew Goldstone is an Assistant Professor of English at Rutgers University, New Brunswick. His book, Fictions of Autonomy: Modernism from Wilde to de Man, is published by Oxford University Press. He specializes in twentieth-century literature in English, with interests in modernist and non-modernist writing, literary theory, the sociology of literature, and the digital humanities.