I love words. And I love books. And I have been known to fall in love with a couple of databases (specifically those I don’t have to compile myself). So I should have fallen head over heels at the news (wonderfully commented on Arcade already) that a team of computer scientists, mathematicians, social scientists, and biologists from Harvard and MIT had undertaken to aggregate the texts of all the books printed and digitized from 1800 to 2000 and count all the words printed in a given year (8 billion in 2000!) in a single, free, graph-producing database (what old-school humanities scholars call an archive). That they would offer this mass of digitized documents as the database to mine for a quantitative approach to “culture” is quite an homage to the persisting prestige, in spite of its real decline, of the printed word and of that old technology called the printed book. And that, for a literary scholar made to feel on an everyday basis like one of the last (generation of) dinosaurs, feels nice. (So, first: thank you.)
Yet if there is one thing that literary scholars and historians agree can arguably not be culled from books alone, it is precisely what we call culture. Even someone like myself, with such a vested (indeed vital) interest in promoting the equation culture = books = words, cannot pretend to believe that culture is made of words found in printed books. Much as I wish it were the case, and much as my very elitist and bookish French education tells me it ought to be so, the quickest mental survey of what “culture” can mean (and its meaning is anything but monolithic) and of how words signify (even more complexity on that side) seems to disqualify the whole enterprise as a nice, but skewed, illusion. (Sigh: the dream of a new academic and cultural order where scientists would work to produce ingenious tools for literary scholars, because they (and not scientists) hold the key to the knowledge of who we are as a culture, will probably not come true, even though the culturomics project seemed to be heralding that dream.)
The premises of the project being thus questionable, the real work lies ahead: to do the actual work of interpreting what kinds of conclusions can be drawn from such a humongous (and fascinating) database, and to figure out in the first place what exactly we are talking about when we draw from this tool. It is still unclear what this database is giving us access to. Strictly speaking, it can tell us a lot about books printed between 1800 and 2000, and certainly, by inference, about what they say or fantasize about what lay outside of them (the world, history, etc.). But to claim abruptly that from the frequency of this or that word we can deduce something about “the culture” of the time is a leap of faith that is surprisingly un-scientific. As with medical data, there is this weird ellipsis (in the name of quantifying the humanities—body or soul) on the need for, shall we say the word?, interpretation.
A short sentence at the beginning of the article should make us pause: “Periodicals were excluded.” And so, we might add, were manuscripts, images, videos, radio archives, clothes, restaurant and cafeteria menus, boarding-school policies, bylaws, billboards, flyers, posters, graffiti, songs, speeches, discussions, and oral debates. Not to mention that the frequency of a word, be it the name of a historical figure, can mean any number of things: “I have never read Proust,” “who would want to read Proust,” “Proust’s book is pretentious and boring,” and “Proust is the single most important French writer of the 20th century” are widely different clues to Proust’s place in French culture at a given time. So yes, it took less time for Britney Spears to make it into Celebrity World (in print) than Marilyn Monroe, and she will vanish (from printed books) faster, according to the graphs under the part entitled “In the Future, Everyone Will Be World Famous for 7.5 Minutes” (as Whatshisname might have put it), but what exactly does it say about what? Maybe something about cultural memory, and about how closely tied to the news cycle what gets into books has become. But what is the intellectual gain of knowing by how many years one gets closer to fame now than in the 50s? In the same way, we did not have to wait for any mathematical equation to infer from observation that new technology enters popular culture (and books) at an accelerated rate: my father waited until he got married to buy his first TV; it took me three months to resist buying the iPad. Quantifying “culture” can certainly yield valuable results, but so far, the ratio between the financial/time/intellectual/manpower investment and tangible, new results is disappointing.
Another statistical flaw that might seriously limit the import of any result drawn from the culturomics database: Google Books digitizes only one copy per edition (if that). The corpus is therefore not statistically representative unless adjusted for the volume of distribution of each edition: apparently, the frequency of “n-grams” found in a given year has not been weighted by the number of volumes published or sold for the edition where each n-gram is spotted. As far as I know, Google Books is not interested in digitizing re-editions, nor does it make any specific attempt to digitize all known editions of a work (not to mention that a number of national libraries have refused to give their troves to Google Books: this is a big problem for the 19th century, where rare and important books belong to the special collections of municipal, national, or university libraries). So the question remains: if you are going to build a database on “culture” based solely on books, why not try to bring in a specialist of the history of the book?
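To make the statistical point concrete, here is a minimal, purely hypothetical sketch of the missing adjustment. The per-edition print runs are invented for illustration (the culturomics corpus records no such figures, as far as I know); the sketch only shows how much weighting by circulation can change a word’s apparent frequency.

```python
# Hypothetical sketch: adjusting raw n-gram frequencies for circulation.
# The print-run numbers below are invented; the actual corpus counts each
# digitized copy once, with no weighting by copies published or sold.

def weighted_frequency(editions):
    """editions: list of (hits, copies_printed, tokens_in_edition).
    Returns the word's share of all tokens actually put into circulation,
    rather than its share of the one digitized copy per edition."""
    hits_in_circulation = sum(h * copies for h, copies, _ in editions)
    tokens_in_circulation = sum(copies * toks for _, copies, toks in editions)
    return hits_in_circulation / tokens_in_circulation

def unweighted_frequency(editions):
    """What one-copy-per-edition digitization measures instead."""
    return sum(h for h, _, _ in editions) / sum(t for _, _, t in editions)

# A rare scholarly edition (300 copies) uses the word often; a mass-market
# reprint (50,000 copies) of another text barely uses it.
editions = [
    (50, 300, 100_000),     # 50 hits per 100,000 tokens, 300 copies
    (2, 50_000, 100_000),   # 2 hits per 100,000 tokens, 50,000 copies
]
print(unweighted_frequency(editions))  # 0.00026: the rare edition dominates
print(weighted_frequency(editions))    # ~0.0000229: readers mostly met the reprint
```

The point is not the arithmetic but the order of magnitude: without circulation data, the one-copy corpus over-represents the rare edition here by roughly a factor of ten.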
In the same way, the small piece on censorship sounded very interesting, and the supplement showed an effort to compare the results with what a human would have come up with (a human asked to work on the same Google/Wikipedia-based original database, that is—which is not, I think, what most historians of censorship would use spontaneously). So we learn that censorship in Nazi Germany was very effective in suppressing the names of forbidden artists from the cultural production of the time. I am not a specialist of censorship in dictatorships, but having worked on censorship in the early modern world and read a decent amount of scholarship on censorship across history, I found the conclusion at odds with other cultural trends, and thus deserving more contextualization. One rather widespread if apparently paradoxical conclusion scholars (starting with Foucault) point out is that censorship generates discourses as much as it suppresses them, and that there is a correlation between the rise in circulation of a forbidden word/image/kind of book and the repression that attempts to control or eradicate it (in other words, if you ban Madame Bovary, it becomes a best-seller—this even Wikipedia knows). Maybe things do work differently in a totalitarian regime (but are forbidden editions likely to find their way into Google Books if they had to be hidden? what about manuscripts and oral culture as counter-cultures?). It would have been worth situating this finding within the broader history of censorship to point out how original it was, if it was. As for the methodological precautions, or rather validation, described in the supplement: why not compare the results yielded by culturomics with those of “old-fashioned” scholars specialized in Nazi censorship, instead of replicating by hand the data-mining operation done via computers?
It seems that, from a methodological point of view, what matters is to compare the ways one builds a relevant database in the first place: that is, how one first interprets the “data” as an indicator of the cultural production of the time.
The part I found most interesting was the one on linguistics: not surprisingly, when the data the study is based on (words) coincide with the object of study (words in languages), the results are more convincing. They amount to something closer to the “objective” knowledge the authors seem to be looking for. I was thrilled to learn that the English vocabulary has been steadily on the rise, by thousands of words a year at that. Even as we are said to have entered a “culture of the image” (culture de l’image), words have been forged, texts written, and books printed exponentially. We, the human species, are forever drifting away from the Ur-tongue, forever further away from the original coincidence between word and deed, word and thing. Maybe the world contained more worlds that our ancestors had not yet tried to name; maybe our minds came up with more fictions than ever recorded. We have added to our universe not only an infinite trash of objects, but also a trail of words, obsolete and new. (And, more pessimistically: while there are more words printed out there every year, other trends suggest that each individual knows fewer of them than a few generations ago.) Yet even this fascinating part of the study, which shows more clearly than others the potential of the database, seemed awkwardly blind to its most obvious methodological bias: the assumption that languages are accurately represented by the written word and, even worse from a methodological perspective, by books. Maybe paying lip service to the now old, but essential, distinction between langue and parole would have been appropriate?
Not to mention, of course, the question of polysemy, ambiguity, irony, semantic shifts across history (does “nation” mean the same in 1800 and 2000? “subject”? “class”?), periphrasis (if you look for De Gaulle, you might as well look for “le Général,” “l’homme du 18 juin,” “le retraité de Colombey-les-Deux-Églises,” but also, in other publications, “le traître,” “notre sauveur,” “l’homme providentiel,” etc.; trickier with Mitterrand, who was called “the sphinx”), puns, double entendre, and all the various ways that texts say things in circuitous, ambiguous ways: in ways that need to be interpreted in context, and for which meaning cannot be extracted from the definition found in dictionaries. Naming is the poorest, if the most effective, way of designating someone or something. Analyzing words as if they meant the same thing across time and across sentences is at best an approximation, at worst a sign of naïveté. As for the fact that the database reveals that there are far more words in books than in dictionaries, well, I am glad this is officially quantified, but that’s not exactly news to lexicographers.
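If one wanted to chase a figure through his periphrases anyway, even the crudest sketch makes the interpretive problem visible. The alias list below simply reuses the examples just given; a real study would need period- and publication-specific lists, and no amount of string-matching resolves irony or double entendre.

```python
# Illustrative sketch only: counting a figure's periphrases alongside his name.
# The alias list is hand-made; deciding that "le traître" and "notre sauveur"
# point to the same man is itself an act of interpretation that no raw
# n-gram query performs.
from collections import Counter

DE_GAULLE_ALIASES = [
    "De Gaulle", "le Général", "l'homme du 18 juin",
    "le retraité de Colombey-les-Deux-Églises",
    "le traître", "notre sauveur", "l'homme providentiel",
]

def mention_counts(text, aliases):
    """Count each designation separately, case-insensitively."""
    lower = text.lower()
    return Counter({a: lower.count(a.lower()) for a in aliases})

text = ("Le Général parla. Pour les uns, notre sauveur; "
        "pour les autres, le traître.")
counts = mention_counts(text, DE_GAULLE_ALIASES)
print(sum(counts.values()), counts["De Gaulle"])  # 3 mentions, 0 uses of the name
```

Even this toy example shows what raw name-frequency misses: three mentions of the man, none of which contains the name the database would search for.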
Self-Reflective Cultural Relevance
And yet this enterprise remains fascinating, and culturally relevant, precisely because it points back to our own era: to 2010 rather than to 1800-2000. That this young and promising team of undergraduates and post-doctoral fellows relies on Wikipedia and Google Books for its data is, in itself, culturally, if sadly, relevant. That the n-gram “interpretation” appears only twice, at the very end of the article, to allude at the last minute to the core scientific issue underlying the whole enterprise and to dismiss the possibility of offering much more than sketches for lack of space (and interest?), is another interesting cultural fact. That Google apparently funded the study in large part, and not the institutions that do have a reliable database of all books published from 1800 to 2000 because publishers have a legal obligation to deposit a copy of each publication there (the Library of Congress, the Bibliothèque nationale de France), is also suggestive.
That scholars in the humanities, notably lexicographers, historians of the book, literary scholars, cultural historians, and others specializing in the complexity of words, books, and culture, are decidedly absent from the research team is another perplexing, but probably telling, fact. (Sadly, I suspect that this absence is in part a by-product of the widening gap between “text-based” humanities and the social and hard sciences, rather than a deliberate omission.)
Now what? The intellectual gist of the project is priceless: who would not dream of a database of all printed words and books? Why not bring in more people and more fields of expertise to weigh in on the conception of the database, more manpower (rather than computer power) to check that the “tokenized” words are actually words (rather than “con-” in French, for instance), and, most importantly, to analyze and interpret what these sources are telling us about history, languages, and, maybe, culture?