When I began thinking about this, I had to ask “What isn’t data in literary studies?” Everything is data, in some sense, and it depends on the position of the analyst and the nature of the project. So I want to narrow the question by situating it: what is data to whom? and for what?
In this talk, “data” is whatever can serve as input for computer analysis by someone working with texts, using the kind of natural language machine learning I’ve used to isolate significant word clusters: topic modeling.
That’s pretty specific, but even then there are two distinctions to draw. Lisa Gitelman points out, in her collection “Raw Data” Is an Oxymoron, that data is abstract but requires material expression. So I’ll start with the material form, to argue:
Data is not literature.
I don’t mean this in the sense of cultural value, or even in the sense of what our primary sources of analysis are. I mean it in purely practical terms. Texts have to be prepared as a corpus for computer programs to use them, and the result doesn’t look anything like literature. This is chapter 3 of Great Expectations:
To identify sets of word clusters, or “topics,” you want useful groups of words that can tell you something about the texts, other than that the English-language ones use “the” and “and” a lot. You want to identify groups of significant words, generally nouns and verbs. So you get rid of, or “stop list,” all the articles, conjunctions, and pronouns. And the honorifics: Mr, Miss, Sir. And the interjections, “Ah!” “Hem!” You eliminate proper names, because they don’t tell you anything substantive about thematic concerns, and a topic that tells you who all the major characters in a book are is not very useful. If you are really ambitious, you might want to transform dialect into standard English, so that “guv’ner” and “governor” count as the same word, rather than two separate ones. What’s left is basically unreadable and doesn’t look anything like literature, and that’s without going to the next level and stemming the text, so that all forms of a verb, past or present tense, count as the same word and not different ones.
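To make the scrubbing concrete, here is a minimal sketch in Python of the steps just described. It is an illustration only, not the actual pipeline behind this talk: the word lists are tiny and hypothetical, the function name `scrub` is my own, and the sample sentence is invented rather than quoted from Dickens. Stemming is left out.

```python
import re

# Hypothetical, deliberately tiny lists; a real project would use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "but", "or", "i", "he", "she", "it",
              "you", "my", "his", "her", "of", "to", "in", "was", "is", "said"}
HONORIFICS = {"mr", "mrs", "miss", "sir"}
INTERJECTIONS = {"ah", "hem", "oh"}
PROPER_NAMES = {"pip", "joe", "magwitch"}
DIALECT = {"guv'ner": "governor"}  # dialect spelling -> standard spelling


def scrub(text):
    """Lowercase, tokenize, normalize dialect, and drop stop-listed words."""
    # Treat the typographic apostrophe the same as the plain one.
    text = text.lower().replace("\u2019", "'")
    # Words are runs of letters, optionally joined by internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text)
    drop = STOP_WORDS | HONORIFICS | INTERJECTIONS | PROPER_NAMES
    kept = []
    for token in tokens:
        token = DIALECT.get(token, token)  # "guv'ner" becomes "governor"
        if token not in drop:
            kept.append(token)
    return kept


# Invented sample sentence, for illustration only.
sample = "Ah! said Mr Pip, the guv'ner and the governor walked."
print(scrub(sample))
```

Everything that makes the sentence readable as a sentence is gone; what remains is a bag of content words, with the two spellings of “governor” now counted as one.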
So postulate #1: The material expression of data is a representation of literature, not literature.
If we turn to the abstract form of data, in this context, the first thing that becomes apparent is that data is a product of analysis rather than its object:
Data is not the ground of analysis but its product.
The scrubbing process I’ve just described is premised on how the algorithm in topic modeling works. It’s premised on a model of what the documents contain, that is, a theoretical version of the data it will analyze. The abstraction in topic modeling works like this, according to its creator, David Blei, at Princeton:
Documents arise from a particular generative process, i.e., a story about how texts are written. The model assumes that writers begin with a group of topics, say 100, each containing an equal number of different words. This is the imaginary writer’s imaginary data.
The writer then randomly distributes these topics among all the documents, using different proportions for each. So document a has a lot of topic 1, and document b is mostly topic 2 and a little of topic 1, and so on. Every document has all 100 topics within it, in different proportions; 90% of a document might consist of two topics alone, and the others are barely there at all.
Finally, the writer distributes the words from each topic randomly within the documents, according to the given proportions. Thus: “It was a dark and stormy night.”
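The imaginary writer’s procedure can be sketched in a few lines of Python. This is a toy under stated assumptions, not Blei’s implementation: the function name `generate_corpus`, the two three-word topics, and the parameter values are all hypothetical. Following the simplified story told here, words within a topic are picked with equal chance, whereas the full model gives each topic its own unequal word probabilities. The per-document proportions are drawn by normalizing gamma-distributed weights, which is one standard way to sample from a Dirichlet distribution.

```python
import random


def generate_corpus(topics, n_docs, doc_length, concentration=0.5, seed=0):
    """Write n_docs imaginary documents from a fixed list of topics."""
    rng = random.Random(seed)
    corpus, proportions = [], []
    for _ in range(n_docs):
        # Step 2: give this document its own random mix of topics.
        weights = [rng.gammavariate(concentration, 1.0) for _ in topics]
        total = sum(weights)
        theta = [w / total for w in weights]
        # Step 3: for each word slot, pick a topic according to the mix,
        # then pick a word from that topic.
        words = []
        for _ in range(doc_length):
            k = rng.choices(range(len(topics)), weights=theta)[0]
            words.append(rng.choice(topics[k]))
        corpus.append(words)
        proportions.append(theta)
    return corpus, proportions


# Step 1: the writer's starting topics (two instead of 100, to stay readable).
toy_topics = [["dark", "stormy", "night"], ["marsh", "river", "mist"]]
docs, mixes = generate_corpus(toy_topics, n_docs=3, doc_length=8)
for words, mix in zip(docs, mixes):
    print([round(p, 2) for p in mix], " ".join(words))
```

A small `concentration` value tends to produce documents dominated by one or two topics, which matches the picture of a document that is 90% two topics with the rest barely present.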
It bears no relation to real life, but that’s not the point. It’s a model—an abstraction—of computer-generated documents that another computer program could realistically analyze, using probabilistic modeling. Based on the observable words alone, it can infer the hidden topics and their proportions in the documents and the words per topic that it all started with. Reverse genesis.
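For readers who want to see the reverse direction, here is a minimal sketch of one common inference method, collapsed Gibbs sampling. I am assuming this method for illustration; it is not necessarily the one any particular topic modeling tool uses. The function name `infer_topics`, the smoothing values `alpha` and `beta`, and the toy documents are hypothetical. The program sees only the words, starts from random guesses about which topic produced each word, and repeatedly revises each guess in light of all the others.

```python
import random


def infer_topics(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Estimate topic proportions per document from the words alone."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]  # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        guesses = []
        for w in doc:
            k = rng.randrange(n_topics)
            guesses.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(guesses)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word's current guess from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Favor topics common in this document and fond of this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Turn the final counts into smoothed proportions per document.
    proportions = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return proportions, topic_word


toy_docs = [
    ["dark", "night", "stormy", "night", "dark"],
    ["marsh", "river", "mist", "river", "marsh"],
    ["dark", "night", "marsh", "mist", "stormy"],
]
mixes, word_counts = infer_topics(toy_docs, n_topics=2)
for mix in mixes:
    print([round(p, 2) for p in mix])
```

The number of topics is something the analyst supplies, not something the program discovers, and the topics come back unlabeled: which cluster counts as “topic 1” is arbitrary and can differ from run to run.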
So postulate #2: Analysis creates an illusion of meaningful data as prior to itself.
This isn’t a new concept for literature scholars; we know the argument in literary theory that interpretation creates an illusion of textual significance as prior to interpretation. And I don’t see any reason to expect that the function of data’s two forms, abstract and concrete, should be any more or less complex a construct than literature itself.