Data Exploration and Cleanup with OpenRefine – US Deaths in Jails Data: Qingyu Cai

For this project assignment, I chose the data set on US deaths in jails and focused on the New York cases. After reviewing the data in OpenRefine, I came up with the following questions. First, I could not find where the abbreviations used in this data set are explained. For instance, what do DA, H, M, O, and S stand for in the cause_short column? I think a digitized data set like this should come with a note defining its codes so that people can read the information correctly.

Secondly, in the cause details of the deaths, a trailing space creates two separate facets for what is really the same value, because OpenRefine treats a space at the end of a word or phrase as part of the value instead of ignoring it. Capitalization needs the same care when entering data, since values that differ only in upper and lower case also end up as separate entries. Moreover, there are some obvious cases where two differently worded phrases mean the same thing. For instance, natural and natural causes are the same.
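
The cleanup described above can also be sketched outside OpenRefine. The following is only a minimal Python illustration, not the workflow I used: the sample values, the function name normalize_cause, and the SYNONYMS table are my own assumptions, and only the natural / natural causes pair comes from the data set itself.

```python
import re
from collections import Counter

# Hypothetical synonym table; only this one pair is taken from the data set.
SYNONYMS = {"natural": "natural causes"}


def normalize_cause(value: str) -> str:
    """Collapse whitespace, trim the ends, lowercase, then merge known synonyms."""
    cleaned = re.sub(r"\s+", " ", value).strip().lower()
    return SYNONYMS.get(cleaned, cleaned)


# Made-up sample values showing the three problems: trailing space,
# inconsistent capitalization, and two wordings for the same cause.
raw_values = ["Natural causes", "natural causes ", "natural", "NATURAL  CAUSES"]

counts = Counter(normalize_cause(v) for v in raw_values)
print(counts)
```

Before normalization these four strings would show up as four separate facets; afterwards they fall into a single group.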

Finally, I have a question about how to control the level of detail when digitizing data. For instance, among the causes of death, hanging by a bed sheet, hanging by cloth around his neck attached to a pipe on the upper wall, and hanging all appear in the data set, at three different levels of detail. It would be a waste simply to discard the details. So, I would suggest adding one more field for a broader cause category and grouping the three entries above under a single category of hanging, while keeping the original descriptions.
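
One way to picture this suggestion is a second field that holds the broad category next to the untouched detailed description. The sketch below is a rough Python illustration under my own assumptions: the field names cause_detail and cause_category, the keyword rule, and the "uncategorized" fallback are hypothetical, and only the three hanging descriptions come from the data set.

```python
# Hypothetical keyword rules: (keyword to look for, broad category to assign).
KEYWORD_TO_CATEGORY = [("hang", "hanging")]


def broad_category(detail: str) -> str:
    """Return a broad category for a detailed cause, without altering the detail."""
    text = detail.strip().lower()
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in text:
            return category
    return "uncategorized"


# The three descriptions mentioned above, at three levels of detail.
details = [
    "hanging by a bed sheet",
    "hanging by cloth around his neck attached to a pipe on the upper wall",
    "hanging",
]

# Keep the original text and add the category as a separate field.
records = [
    {"cause_detail": d, "cause_category": broad_category(d)} for d in details
]

for record in records:
    print(record["cause_category"], "|", record["cause_detail"])
```

With this layout, one can facet on the broad category for counting while the detailed wording stays available for closer reading.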