Reading… into Text Mining

Starting with Marti Hearst’s ‘What is Text Mining?’, one can gain a fundamental understanding of the concept behind computer mining of text and its various applications for text extraction. When the words machine learning enter a conversation, my gut instinct is to cringe; not because I believe it is wrong, unethical, or useless, but because I fear the removal of the human from the process and the care that comes with human oversight. While human error is oftentimes quite large, human nature plays a major role in social science, and understanding the person behind the text or data is always important. Hearst speaks to this when discussing the difference between data mining and text mining. Text mining, she states, works on natural language text rather than structured databases, precisely because text is written for people to read rather than for machines to process automatically. This statement quickly put my apprehensions to rest and allowed me to read and understand the entirety of the article objectively. Furthermore, her crime-fighting analogy, which distinguishes discovering new knowledge from merely showing trends, allowed me to apply her viewpoints to my personal work, which I appreciated. She strikes a balance between providing information and insightful critique that allows a reader like me, who is new to the topic, to feel as though I have a grasp on the general concept: an effective writing style, to say the least.

For the second reading I decided to read Marti Hearst’s ‘Untangling Text Data Mining’ in order to further my knowledge on the topic and explore the effectiveness of her writing style. Chronologically, this article was written four years prior to the one mentioned above. She begins by comparing finding information that meets a researcher’s needs to ‘looking for needles in a haystack’, and honestly I couldn’t agree more. The ability to apply a tool such as text data mining to initial research procedures could greatly reduce the amount of time spent jumping down rabbit holes, so to speak. On the other, more critical hand, the inherent bias that exists in computer text data mining is hard to overlook. As the article suggests, there are different ways for computers to extract meaning from text, and the one that might be most successful, in my opinion, is categorization. Similar to applying tags or subfolders in archives, by starting with categories the researcher is able to navigate their personal bias rather than having it implied from the beginning by the computer (a rough sketch of what I mean appears below). Again, this brings up a range of issues; however, the possibility of human guidance is important. One part of the article that I found difficult to overlook, in terms of proving the applicability of the technology, is the section addressing how to use text to uncover social impact. The case study references the mix of operations leading to the results and shows how this process can both simplify and complexify an argument. On one hand, it was able to process thousands of documents; on the other, as noted by the author, much of the work was done by hand because not all of the data was available online. This speaks to my apprehension about applying this tool to history or non-contemporary research, as much of the documentation is handwritten or exists only in print. Lastly, she speaks to large text collections as an untapped resource and the need not to rely simply on artificial intelligence, which makes for a rounded argument.
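To make the idea of researcher-guided categorization a little more concrete, here is a minimal sketch in Python of sorting documents into categories the researcher defines up front, much like tags or subfolders in an archive. The category names and keyword lists are my own illustrative assumptions, not anything from Hearst’s article, and real text categorization systems are of course far more sophisticated than keyword counting.

```python
# A toy sketch of researcher-defined categorization: the categories and their
# keywords come from the researcher, not from the machine, so the human's
# framing (and bias) is explicit from the start. Category names and keywords
# below are purely illustrative assumptions.
from collections import Counter

CATEGORIES = {
    "labor":    {"strike", "union", "wages", "factory"},
    "health":   {"epidemic", "hospital", "vaccine", "illness"},
    "politics": {"election", "ballot", "senate", "governor"},
}

def categorize(text: str) -> str:
    """Assign a document to the researcher-defined category whose keywords appear most often."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter({cat: sum(w in kws for w in words) for cat, kws in CATEGORIES.items()})
    best, hits = counts.most_common(1)[0]
    return best if hits > 0 else "uncategorized"

if __name__ == "__main__":
    sample = "The union called a strike after factory wages were cut again."
    print(categorize(sample))  # expected output: labor
```

The point of the sketch is simply that the starting categories are the researcher’s own, so any bias they carry is at least visible and adjustable, rather than buried inside whatever structure the computer infers on its own.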

I thoroughly enjoyed the third reading, ‘“Raw Data” Is an Oxymoron’. The author, Lisa Gitelman, writes in an almost satirical voice to reiterate the complexity of defining and categorizing various types of data, using words like ‘sexy’ and ‘silly’ and comparing raw data to ‘jumbo shrimp’ in an effort to pull the reader into a critical discourse on contemporary applications of data. She takes us on a historical journey of data categorization from the 1960s to the present day, discussing various professions’ viewpoints on how data is used and spoken about. In particular, I thought her view on the reduction of literary objects to ‘graphs, maps, and other data visualizations’ was an important point about the resistance in some fields versus others. The need to create universal definitions for terms is important, and one I constantly grapple with in a cross-disciplinary field. Her ability to represent various applications and points of view in a concise way allows the reader to see the different vantage points on the topic and come to their own conclusions. This is all said prior to her deep dive into objectivity, which creates a new layer of complexity that was more difficult for me to untangle. All in all, I think the paper provides a wonderful introduction to raw data and to the complex discussions and arguments that arise when digging further into the topic and questioning the intent of its very definition.