In the last post, I spoke about the difference between King Lear as an abstract idea (what readers want) and the many material copies of King Lear that Google has gathered into its database (what they get). Certainly, the preservationist aspect of the project is impressive.
By scanning volumes housed in college libraries and collecting these images into a database, Google created a permanent record of the books as they were at some point in the mid-2000s. (Once or twice, I’ve even borrowed books from the Stanford library that have deteriorated since the images were taken.) At the same time, the aim was not to present flat facsimiles: it was to render these books searchable. To do that, scanned images needed to be converted into texts. But what exactly does it mean to read the texts rather than the images of these books, or to search "inside" them? For someone who wanted to read a Shakespeare play, wouldn’t a single error-corrected version (the Project Gutenberg model) be better than hundreds or even thousands of erroneous texts spirited from the pages of books?
Containment is an old metaphor for describing the relationship between the book or the manuscript and the true text inside it. We might also speak of the text as animating soul and the book as its fragile decaying body; the text as the book’s antecedent or cause; and of one true text and many, possibly corrupt books. When Coleridge described being interrupted by the notorious person from Porlock, he imagined that "Kubla Khan" itself, the text of his poem, had already come into existence in a visionary instant. The "scattered lines" he managed to get down were merely a poor copy of what could have been.
Google Books reverses the traditional ontology by a) producing texts automatically from the pages of books (by OCR) and b) positing not a one-to-many, but a one-to-one relationship between the text and its material instantiation. Each book generates its own text, and those texts sit, each alone, siloed off from each other like books on a library shelf. Instead of imagination giving rise to texts (Coleridge’s "scattered lines"), which in turn give rise to material copies in a degenerative process, the already interlined and beaten-up books produce the texts. To toggle between them, Google provides buttons on the top right of the screen: "page images" and "plain text." That last phrase itself smacks vaguely of early Protestantism. "This comfort shalt thou evermore find in the plain text and literal sense," wrote William Tyndale in 1530. But, as we will see, these texts are far from plain and they are always "worse" than the books from which they come.
Textual scholars over the ages have tended to hypothesize a moment (before copying, printing, dissemination, or editing) at which a text was most fully itself, "ideal," uncorrupted. The work of textual criticism has been to get back to that moment by dealing with the corruption that transmission inevitably entails. As Fredson Bowers grimly put it: "only a practising textual critic and bibliographer knows the remorseless corrupting influence that eats away at a text over the course of its transmission." Medieval scribes made unconscious errors and perverse alterations as they copied, and type-setting too produces errors: a line of type is set twice, a printer’s clerk swaps out a word he doesn’t recognize, or an author deals carelessly with proofs. The editor’s work was to understand this process of error-formation so that it might subsequently be reversed.
The processes that Google uses to render texts to us also involve corruption, but no one has yet articulated a clear rationale for dealing with it. The most important and counter-intuitive point is this: instead of leading to, or existing prior to, the process of transmission, "plain texts" are produced at the end of a long process of mediation. They are the final, not the original, item on the stemma. This means that the distinction scholars sometimes make between "linguistic" and "bibliographic code" isn’t very meaningful. Material features of the books (colored or decorative capitals, unusual fonts) inevitably bleed into or "infect" the plain texts.
For example, Google’s database has as many different texts of Henry James’s novel The American as it does scanned copies of the book. It doesn’t, and can’t, show the fact that there are three historically different versions: the Atlantic Monthly serial, the 1877 first book edition, and the 1908 New York Edition. Potentially, we can imagine an infinite number of "bibliographic codes" for the novel (it could be reprinted in many ways), but there are only three significant textual versions. Moreover, even when duplicate copies have been scanned in from different libraries (the same impression of the same edition), no provision is made for linking up the texts or, more interestingly, for using them to correct each other. Both the Bodleian and the Oxford English Faculty Library have near-identical copies of the Macmillan 1879 edition of the novel, but each scanned copy generates a different and differently mistaken plain text.
This is the reading of the Bodleian copy, page 20:
“And now,” began Mr. Tristram, when they had tasted the decoction which he had caused to be served to them, “now just give an account of yourself. lVhat are your ideas, what are your plans, where have you come from and where are you going? In the Hrst place, where are you sta ing2”
Would it be possible to correct that text against this one, from the English Faculty Library?
"And now," began Mr. Tristram, when they had tasted the deeoctiorT which he had caused to be served to them, "now just give an account of yourself. What are your ideas, what are your plans, where have you come from and where are you going? In the first place, where are you staying?"
What would a textual criticism for OCR-based texts look like? Reversing the process of corruption involves understanding how it comes about in the first place, so to answer this question we’d want to trace the whole multi-stage process by which a book on a college library shelf becomes an electronic, searchable plain text. How many steps are there? What kind of error is introduced at each stage? The process includes:
a) Decay or damage to the book after printing. In some cases, the book may remain legible to us but be imperfectly so to a computer. In this copy of The Voyage Out, for example, underlining has made the text hard to read.
Here is the plain text:
As he did not leave her, however, she had to wipe her eyes, and to raise them to the level of the factory chimneys on the other bank. She saw also the arches of Waterloo,Bridge and the carts moving across IhemTlike the~Iine of animals in a shooting-gallery. They were seen blankly, but to see anything waToT course: |ojmdJierjive€ping and_bjgjnJo jvalkT
b) Problems during the scanning process, including a spectral hand over the page, and difficulty dealing with spinal curvature. Here we have an image from Dickens' novel Our Mutual Friend.
T. S. Eliot thought about using Betty's description as the title for part of The Waste Land, but anyone searching for "He do the Police in different voices" would be out of luck with this edition. Rats' alley is more like it. The plain text reads "He do the Mice," as the letters "Pol" become deformed into "M."
c) Error occurring during the OCR conversion process. This is the most common reason for textual mistakes in the plain texts, and it's particularly likely in the case of unusual typefaces or non-English words. The medial (long) s is a recurrent problem in earlier texts. What do you think the name Persephone is rendered as in this book from 1730, The Scripture Chronology?
The answer, with strange anachronism, is "telephone."
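Many medial-s errors are simpler than this one: the long s is just read as an f, as in a token like "caufe." Here is a rough sketch, in Python, of how such readings might be flagged. It assumes a hypothetical word list `WORDS` and a made-up helper name, and it relies on the convention that the long s was not used at the end of a word, so a final f is left alone. It would do nothing for a reading as far gone as "telephone."

```python
from itertools import product

# Hypothetical word list; in practice this would be a period-appropriate dictionary.
WORDS = {"cause", "suffer"}

def long_s_candidates(token, words=WORDS):
    """Return known words obtained by re-reading some non-final 'f' as a long s."""
    # The long s was not used word-finally, so the last character is never changed.
    positions = [i for i, ch in enumerate(token[:-1]) if ch == "f"]
    found = []
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != token and candidate.lower() in words:
            found.append(candidate)
    return found

print(long_s_candidates("caufe"))   # ['cause']
print(long_s_candidates("fuffer"))  # ['suffer']
print(long_s_candidates("of"))      # []
```

A check like this only proposes candidates. Deciding whether "caufe" should be silently modernized to "cause" is an editorial choice, not a mechanical one.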
d) Errors of correction. This problem is by far the rarest, and I would be interested to see more examples. In traditional textual criticism, the analogy would be a false, foolish emendation. When Google acquired reCAPTCHA in 2009, the aim was to "improve the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher." This is a brilliant idea: when we sign up for a new account, or download a file from the internet, we're in theory emending the errors in OCR texts. One of the words is clear but nonsensical (this is the control) and the other, which usually makes sense, comes from a book. I solved the CAPTCHA below by typing "caufe ompostat," where "ompostat" is presumably the control and "caufe" the word that comes from a book. But is this really the solution?
Humans are much better than computers at figuring out problems due to the physical state of a volume (a and b), but without knowing when a book was printed, or where, or something about the history of fonts, we aren't necessarily better than the OCR at interpreting unfamiliar typefaces (c).
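To see how an error of correction could become fixed in a text, here is a toy model in Python of the control-word idea as I have described it. It is a sketch under my own assumptions, not Google's actual procedure: the function name, the vote threshold `min_votes`, and the list of responses are all invented for illustration. A typist's reading of the book word counts only if the control word was typed correctly, and a reading is adopted once enough people agree on it.

```python
from collections import Counter

def tally_readings(responses, control_answer, min_votes=3):
    """Pick the most common reading of the unknown word among trusted responses.

    responses: (control_typed, unknown_typed) pairs, one per person.
    Returns None if no reading has reached min_votes (a hypothetical threshold).
    """
    votes = Counter(
        unknown
        for control, unknown in responses
        if control.lower() == control_answer.lower()
    )
    if not votes:
        return None
    reading, count = votes.most_common(1)[0]
    return reading if count >= min_votes else None

# Invented responses, for illustration only.
responses = [
    ("ompostat", "caufe"),
    ("ompostat", "caufe"),
    ("ompostat", "cause"),
    ("ompostat", "caufe"),
    ("ompoftat", "cause"),  # control mistyped, so this vote is discarded
]

print(tally_readings(responses, "ompostat"))  # caufe
```

In this toy run the majority transcribes the letterforms faithfully and the word wrongly, and the consensus reading is "caufe." Agreement among human readers is not the same thing as a correct emendation.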
This final human-introduced kind of error is rare; most of the mistakes in the plain texts aren't semantically motivated. "Persephone" becomes a non-word more often than "telephone." In the next post, I want to think about how this leads to problems with counting the number of occurrences of a word or phrase, and more generally about quantitative research in the database. In the meantime, I'd be glad to collect some more examples of plain text oddities. Please send them my way if you come across any in your own research!