09-19-OpenRefine Blog-Elva Si

OpenRefine is easy-to-use software for data clean-up and exploration. Once the CSV documents were imported, the software could quickly categorize the data with a text facet (genre > Facet > Text facet). Sorting and tracking the data this way is much easier than in an Excel sheet. Duplicate entries were usually caused by a minor typo such as an extra space, which could be easily fixed.
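To make the extra-space problem concrete, here is a minimal Python sketch of the same idea outside OpenRefine. The genre values are made up for illustration and are not taken from plays.csv, and the `normalize` helper is a hypothetical name, not an OpenRefine function.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Trim both ends and collapse runs of whitespace into a single space."""
    return " ".join(value.split())


# Hypothetical genre values; some differ only by stray spaces.
genres = [
    "comedy",
    "comedy ",
    " comedy",
    "tragedy",
    "bourgeois  tragedy",
    "bourgeois tragedy",
]

raw_counts = Counter(genres)
clean_counts = Counter(normalize(g) for g in genres)

print(len(raw_counts), "distinct values before cleaning")
print(len(clean_counts), "distinct values after cleaning")
```

A text facet counts each distinct string separately, which is why values that differ only by whitespace show up as separate entries until they are cleaned.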

From exploring plays.csv, I can tell that comedy was the most popular form of play at the Comédie-Française, while bourgeois tragedy was the least popular. The registers.csv file is a little more challenging to explore because we need to sort numerical facets (e.g. weekdays/register number) using the text facet. Therefore, if I want to investigate days that have a register number below or equal to 10, I need to scroll through the list and click each of the values 1-10 to filter the exact data. It would be better if we could cluster the data by a particular pattern (e.g. register number below 10, 10-100, 100-200, above 200) at an earlier stage. I also want to cross-examine the data to see whether weekday/weekend or season has an impact on the register number.
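The kind of grouping I had in mind can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample numbers are invented, the `bucket` function is a hypothetical helper, and I chose the exact boundaries (10 or below, 11-100, 101-200, above 200) so that the ranges do not overlap.

```python
from collections import Counter


def bucket(register_number: int) -> str:
    """Place a register number into one of four non-overlapping ranges."""
    if register_number <= 10:
        return "10 or below"
    if register_number <= 100:
        return "11-100"
    if register_number <= 200:
        return "101-200"
    return "above 200"


# Hypothetical register numbers, not taken from registers.csv.
sample = [1, 7, 10, 11, 85, 150, 200, 201, 320]

counts = Counter(bucket(n) for n in sample)
for label, days in counts.items():
    print(label, days)
```

With the numbers grouped this way, a single click on "10 or below" would replace ten separate clicks in the text facet.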

Some of my interesting findings are:

The register number was above 200 in January and February.
91 days have a register number of 10 or below. The register number may not be affected by weekday/weekend because these days were evenly distributed from Monday to Sunday.
Yet the register number was affected by season. April was the off-season because its register numbers were low in general. The first days of May, June, July, August, September, and October were also dull days because they all had the lowest register number, 1.
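
The weekday/weekend and seasonal comparison could also be checked by averaging the register number per group. The sketch below assumes each row is a (date, register number) pair; the rows are invented for illustration, and `average_by` is a hypothetical helper rather than anything OpenRefine provides.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical (date, register number) rows, not taken from registers.csv.
rows = [
    (date(1750, 1, 5), 240),
    (date(1750, 1, 10), 260),
    (date(1750, 4, 6), 40),
    (date(1750, 4, 11), 60),
    (date(1750, 5, 1), 1),
]


def average_by(rows, key):
    """Group rows by key(date) and return the mean register number per group."""
    groups = defaultdict(list)
    for day, number in rows:
        groups[key(day)].append(number)
    return {group: mean(numbers) for group, numbers in groups.items()}


# Average per month (1 = January, ..., 12 = December).
by_month = average_by(rows, lambda d: d.month)

# Average per day type; weekday() returns 5 and 6 for Saturday and Sunday.
by_day_type = average_by(
    rows, lambda d: "weekend" if d.weekday() >= 5 else "weekday"
)

print(by_month)
print(by_day_type)
```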