Comments

Comments on Christof Schöch: Big? Smart? Clean? Messy? Data in the Humanities

While the data is not always fully representative of the original source, it is usually representative enough that we can draw clear and accurate conclusions. For example, when studying the text of an 18th-century novel, the fact that our data takes the form of a digital document hardly impedes our ability to analyze the text. Similarly, I understand Drucker’s notion of capta, but I don’t quite see this as distinct from data except in name: data are always gathered and, for the most part, oriented by goals (we choose to collect certain data and not certain other data, for example). It’s all subjective. However, I respect the point that data, as we often consider it, is thought to be objective, while it is really a subjective selection of material “of interest” to us for particular reasons. In this sense, I am more in the camp of Trevor Owens, mentioned here, who sees data as manufactured and “mobilized as evidence in support of an argument.”

I very much agree with and see the promise of the author’s distinction between big data and smart data. In sense-making, there is also the distinction between thick data and thin data, which similarly distinguishes, essentially, between “how deep” the data goes and “how voluminous” the data is. I take small issue with the author’s label of “smart data,” since the data itself isn’t intelligent, although I’m not sure what a more appropriate term might be (and I do understand his point).

I am curious about the author’s “tagging” system, described about halfway through the article, and how it differs from well-known approaches to annotating large bodies of textual data. For example, NVivo is a tool used to annotate and draw connections among many documents within a single corpus (or within a single document, video, audio clip, survey, etc.). In cases like this, I have performed a great deal of annotation that never needed to fit within the confines of a TEI encoding structure, which allows sources of drastically different forms to be read and compared together (a rough sketch of what I mean follows below). In this case, I wonder whether the author would call this smart or big data.
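To illustrate, here is a minimal sketch, in Python, of the kind of schema-free “standoff” annotation I have in mind. This is purely hypothetical: the field names, file names, and tags are my own invention, not NVivo’s data model or the author’s tagging system.

```python
# Hypothetical standoff annotations: each tag points at a span of some
# source rather than being embedded in a TEI document hierarchy, so
# sources of very different shapes can share one annotation layer.

annotations = [
    {"source": "novel_ch3.txt", "span": (1043, 1101),
     "tag": "theme:travel", "note": "narrator departs for the coast"},
    {"source": "interview_02.mp4", "span": ("00:04:12", "00:04:40"),
     "tag": "theme:travel", "note": "speaker recalls the same journey"},
]

def by_tag(items, tag):
    """Gather annotations sharing a tag across differently structured sources."""
    return [a for a in items if a["tag"] == tag]

for a in by_tag(annotations, "theme:travel"):
    print(a["source"], a["span"], "-", a["note"])
```

Nothing here enforces a document grammar; whether such loosely structured annotation counts as “smart” in the author’s sense is exactly my question.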

In reading the author’s comments on the current state of digitization of human cultural records, I am astonished, and dismayed, at just how little of our historical cultural record has been digitized, and to consider that those records that have been digitized were selected for very particular reasons, so we have little in the way of an objective record. Of note: I am particularly fond of the idea, mentioned near the end of this paper, of (automated) crowdsourcing to create richer, larger, and smarter datasets, especially when potentially combined with approaches to engagement and motivation (e.g., gamification).

Comments on Miriam Posner: Humanities Data: A Necessary Contradiction

The key point here, for me, is that humanists have different ways of knowing than social scientists or “hard” scientists. I deeply appreciate the author’s example of early silent film and the conventions of melodrama, which makes this difference clear; I’ve also struggled over the years to explain to colleagues from other disciplines why the “data” I’m presenting is valid or trustworthy, and this example effectively encapsulates that notion of an alternate way of knowing. This sentence stands out: “With a source, like a film or a work of literature, you’re not extracting features in order to analyze them; you’re trying to dive into it, like a pool, and understand it from within.” The related point about reproducible results is also interesting, and I believe it is true. I’ve thought recently about what it would mean to “validate” our results in the humanities, as social scientists must do; I’m still uncertain.

I chuckled at the mention of the “historical dataset” of images, all named by their dates with a seemingly random number at the end. During our last class, our group was speaking about this very situation: computers organize information in ways that are easy for computers to locate and arrange, but not for humans. When we think of a photographed memory and go through our phone’s storage to locate that picture, we are scrolling for shapes, colors, anything other than a numerical sequence to identify the original. In any case, it was a curious association to have in mind alongside the author’s nearby statement: “So humanists — even those who aren’t digital humanists — desperately need some help managing their stuff.”

The notion of nontraditional datasets is also touched on here and is of particular interest to me (namely, the example of a student who studied the frames made for paintings during the 17th and 18th centuries). As noted here: “So it’s quantitative evidence that seems to show something, but it’s the scholar’s knowledge of the surrounding debates and historiography that give this data any meaning.”

The author goes on to list numerous limitations of the digital humanities that need to be overcome. Among them, I am drawn to the issue of data modeling, which requires that our data be clean and consistently formatted. To me, reconciling records that arrive in different formats may demand considerable subjective interpretation during data preparation (a small sketch of what I mean follows below). Otherwise, I am most interested here in constructing and using nontraditional datasets in new ways (a non-humanistic example, from elsewhere, would be the NSF study showing that nighttime light, as viewed from space, can be an effective measure of economic activity). In summary, the article raises several notable issues in the digital humanities concerning the data we use, where it comes from, how it is formatted, and how we can then use it creatively.
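As a concrete illustration of that interpretive work, here is a minimal Python sketch of date normalization across inconsistently formatted records. The sample dates, the list of formats, and the choice to read slashed dates as day-first are all my own assumptions for the example, not anything from the article.

```python
# Illustrative sketch: even a simple normalization step forces interpretive
# choices. The dates below appear in three formats; deciding that
# "05/06/1788" is day-first (European) rather than month-first is a
# judgment the scholar, not the code, must make.

from datetime import datetime

RAW_DATES = ["1788-06-05", "05/06/1788", "June 5, 1788"]

# Order matters: the first matching format wins, encoding our assumption
# that slashed dates in this (hypothetical) corpus are day/month/year.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def normalize(raw):
    """Return an ISO date string, or None for records needing manual review."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print([normalize(d) for d in RAW_DATES])
# -> ['1788-06-05', '1788-06-05', '1788-06-05']
```

The point is that the ordering of FORMATS encodes a scholarly judgment; a different assumption would silently produce different “clean” data.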