The announcement today in the New York Times about the forthcoming article in the journal Science by psychologist Jean-Baptiste Michel and mathematician Erez Lieberman Aiden about the creation of a digital storehouse got me to thinking. The two Harvard researchers have created the database by culling information from nearly 5.2 million books digitized by the internet giant Google. Truly, the creation of such a database is a momentous and exciting development for our field. Depending on what humanities researchers and others do with it, it may even offer what the NYT trumpets as a “new window on culture.” But much depends on what we do with the opportunity with which we have been presented. Without turning our backs on the possibilities opened up by the ability to quantify what has previously not been counted, humanities scholars must continue to exercise our ability to look beyond, behind, and around, as well as inside, the database.
The ability to quantify literary texts does not, as some humanities scholars may fear, eviscerate our traditional contributions to knowledge, but rather makes them more urgent. What we do best—examining the discursive framings of the questions being asked, and contextualizing historically and ideologically the texts being produced—remains crucially important. The effect of quantitative data, when it is presented in neat tables and visually attractive graphs, can indeed be stunning. The problem with the "stunning" quality of data—especially as it operates on those not accustomed to working with it—is that it can shut down inquiry. After all, the meaning of a graph appears so obvious: “Oh! Jimmy Carter was more important than Marilyn Monroe after all!” But whose idea was it to compare Marilyn Monroe to Jimmy Carter in the first place? And what does it mean? More specifically, what does it mean when, in what context, and to whom?
I wish to make a second point about this database. As comprehensive as it may appear to be, surely the words that appear in 5.2 million books are not all that can be counted as “culture.” What about that which has not been captured by that treasure trove of data? What books, or newsletters, or chapbooks were judged too ephemeral, or just too unworthy, to be scanned? What kinds of sayings, oral stories, or urban legends barely or never make it into print, and so are by necessity excluded or undercounted? What about habits of interaction and ways of being in the world that are gestural, or visual?
So much has been written about some canonical authors that scholars now argue about whether an author or his or her editor should be credited with the elegance of the resulting prose. There will come a time, I have no doubt, when scholars will compare the number of semi-colons a canonical author uses in one text with the number he or she uses in another. Overkill? Maybe, or maybe not. But it is the kind of project that could be facilitated by the easy availability of quantitative data. Meanwhile, the kinds of scholarly projects that examine underground knowledges, insurgent cultural formations, or non-text-based literatures are at risk of becoming even more marginalized than they already are, or even of being reduced to the status of “not-knowledge.”
I look forward to meeting the challenges presented to us by “culturomics.” I hope we meet them well.


Paula--Thank you for this post, which says so perfectly both what is appealing and what is distressing about this latest piece of news from the text-mines. A great deal would be lost if we came to accept the unreflective idea of "culture" lodged in this study and indeed in the database of Google Books as it currently exists.
On the other hand, digital media and huge databases have enormous potential for supporting, preserving, and making available for study the kinds of underground knowledges and cultural productions outside the sphere of mainstream print that you're concerned about. This is the insurgent potential of the Internet and digital media--they can bypass established methods of fixation and legitimation of cultural products. But in academia these are subjects of interest to humanists--and sociologists and anthropologists. By contrast, when true disciplinary outsiders like Jean-Baptiste Michel and his team enter the arena of cultural history and cultural studies from the side of science and engineering, they must be looking to legitimate themselves by proving that their approach "works" for subjects that they imagine will be widely recognized as significant. Hence the focus on American celebrities, on figures deemed famous by Wikipedia, American exports in new vocabulary, and (implicitly) American superiority in the domain of freedom of speech vis-a-vis Nazi Germany and post-Tiananmen China in the Science article. Hence too the prioritizing of highly canonical authors in the development of big digital archives like the Whitman Archive or the Jane Austen manuscript site you allude to. [I'm cribbing here from wonderful talks I heard at the ALA in May by Amanda Gailey and Matt Cohen.] These new ventures, operating outside normal channels for consecration in humanistic scholarship, need to appear mainstream in a way "insider" humanists taking on the same technical means would not.
Are some of the big challenges, then, those of collaboration and the organization of academic work? Why did the team publishing in Science feel okay not asking any humanists or social scientists to participate in their work? Did the journal consider this question before approving it for publication?
Why did the team publishing in Science feel okay not asking any humanists or social scientists to participate in their work? Did the journal consider this question before approving it for publication?
Who knows, maybe they asked, but didn't get any takers. And just how should Science have taken this into consideration? Should they have refused an exploratory piece of work because the team didn't include a historian or a sociologist?
Steve Pinker doesn't count? Sure he's just the 12th author and all, but still, he has been known to do linguistics research and has likely trawled through a corpus or two in his day.
Looks like the research group behind the paper also has some other psychologists on board, at least in an advisory capacity.
http://www.fas.harvard.edu/~ped/people/affiliates/
But, as you say, no "humanists" if by that we mean non-empirical scholars of the humanities who would be primarily affiliated with a department like History or English or the like.
Quite right. Pinker is there, and he certainly counts as a social scientist. Although having a psycholinguist as your team social scientist is also a meaningful choice. It's not a question of "empirical" disciplines or not so much as a question of involving anyone whose specialty might be said to have a central cultural or historical focus--since cultural history is the focus of the paper. [Most historians would call their work empirical!] That kind of specialization is not conspicuous, anyway, in this group. I guess we shouldn't presume anything about how the team was formed or what happened in the review process--but it seems reasonable to assume that the questionable assumptions Paula and others have been pointing out made it into the published paper because those cultural/historical specialists were never consulted. All of this talk isn't so much wagon-circling as it is trying to probe what seem like fairly thick barriers to interdisciplinary communication.
It's ironic that within a couple of weeks of anthropologists (people who tend to study culture) debating whether or not to drop "science" from their field's aspirations (some would say pretensions--see http://chronicle.com/article/Anthropologists-Debate-Whether/125571/), we have this attempt to squeeze into existence "culturomics" through a simple and simply awful neologism. No doubt with this christening it will blossom, but I think the last line of the NYT's piece is wonderfully coy and astute at the same time--what will that word mean twenty years hence? How many hits will it have when the program is run on it then? The beauty of this is of course even if it has disappeared--hey, the system works (remember, that is what they said during Watergate). That is to say, it will still (always already) "mean" something. What? God knows. That is to say, nothing is not meaningful once you have set up the metrics.
In this dance around "science," anthropologists have zigged whilst some humanist types have zagged, it seems. Yet I cannot help but find some strain behind Menand's brave attempt to make this interesting. It's as if it is not politically correct to say that enormous masses of data processed in a millisecond don't always portend something awesome. The garbled English translation of Derrida comes back to haunt us: there is nothing outside the text. I agree completely with all the fine points Paula Moya makes, but once the machine gets set up, nothing can escape it, for better or worse. The Weberian iron cage. A mentality has set in at this particular juncture wherein the humanities have to speak another, quantifiable language to be "relevant" and "innovative" and, in a word, substantial. For some rebuttals, I point the interested reader to the first issue of Occasion, found here on Arcade, which treats rational choice theory and the humanities.
Thanks for pointing out that nice chiasmus, David. I think Alex Golub's post on "#AAAfail" at Savage Minds wonderfully gets at some of the issues you're pointing up here, in particular, the tendency to cede the category of knowledge to quantitative research:
Someone on Twitter, noting the awfulness of the word "culturomics," proposed the alternative "freakumanities." That's just right--Freakonomics is Exhibit A of the public's appetite for dubious number-crunching and for basing facetious arguments thereon.
That doesn't have to be what text-mining is about, but that's mostly what the web-based Google Books N-gram tool is good for, prompting large numbers of normally sober-minded, subtle thinkers to confess, with equal giddiness and guilt, in the last few days, that they've been "playing around with" the n-gram tool, which is extraordinarily "fun." By consistently describing the tool in the language of play, we admit to both its relatively low stakes and its pleasures. It's a kind of epistemic candy: no nutrition value, but nonetheless rather (as Patricia Cohen writes in the NYT) "addictive." It delivers a knowledge-effect with none of the hard work that knowledge normally demands. "With a click," as Cohen writes.
I played around with it for about half an hour. Now I'm bored.
Here's what's interesting and supports Josh's comment on another post. NGram is case sensitive. So you get really telling differences between searching for "Mother" and searching for "mother"; "Father" and "father"; and "Vengeance is mine" and "vengeance is mine". By "telling" I mean interesting to note that differences of spelling and capitalization completely skew the results (though maybe we get interesting data here on the history of capitalization as well).
Almost nothing on lower case or even upper case "son" but plenty on "Sonne" (17th c.) and "sonne" (16th c.)
And what's up with the spike in "Lyrical Ballads" in the early eighteenth century? Whereas "lyrical ballads" spikes only in the middle of the 19th.
Well, I'll tell you what's up. A lot of the old-style metadata -- maybe the front-matter in rebound or reissued versions of pre-1798 books -- contain advertisements for Lyrical Ballads, at least if you search "lyrical ballads" in Google Books and restrict your search to pre 1750. So I think there's a lot of interesting data to mine, but if medical data is dubious, as Cecile points out, I think even specialer care is needed for using this toy.
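For anyone who wants to see the capitalization effect in miniature, here is a rough sketch in Python. It assumes a made-up list of tokens, not the Google Books data, and the function name count_forms is purely illustrative; it says nothing about how the n-gram tool itself is built. It only shows how the same words yield different tallies depending on whether case is folded before counting.

```python
from collections import Counter

def count_forms(tokens, fold_case=False):
    """Tally tokens; optionally fold case so 'Mother' and 'mother' merge."""
    if fold_case:
        tokens = [t.casefold() for t in tokens]
    return Counter(tokens)

# Hypothetical toy sample standing in for a scanned corpus.
sample = "Mother and mother ; Father , father and Mother".split()

sensitive = count_forms(sample)                # 'Mother' and 'mother' kept apart
folded = count_forms(sample, fold_case=True)   # both forms counted together

print(sensitive["Mother"], sensitive["mother"], folded["mother"])
```

A case-sensitive count splits one word into two series, so a shift in capitalization habits can look like a shift in how often the word was used.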
http://thebinderblog.com/2010/12/17/googles-word-engine-isnt-ready-for-p...
But does this just forestall the inevitable "perfection"? And then what?
This one contains recommendations for moving on from here:
http://thebinderblog.com/2010/12/21/how-to-fix-googles-word-engine/
Great post, Paula! Following up on David's point and Natalie Binder's blog pieces, here's a funny thing that Tom Scocca discovered: since Google's OCR routinely reads the long S as an F, some unfortunate results occur with old books talking about, say, bees sucking nectar. Who knew the f-word was such a big thing back in the 18th century?
It may be worth adding that we've had engines like this for a while now, albeit not on this scale: the ARTFL project, that wonderful resource, has been around for almost thirty years. They're immensely helpful when it comes to certain questions, and I'm deeply grateful for the existence of all of them. But at the end of the day they're just tools, and like most tools, they need skilled operators. The founders of the ARTFL project had the good grace not to adopt the messianic rhetoric, or the pose of making everything else redundant, that we have witnessed in more recent years. Might be nice to see a bit more of that modesty over at Google (and closer to home).