OpenRefine Comment

During this data clean-up exercise, I noticed that many duplicates are caused by minor formatting errors in the data set, such as trailing spaces after the text or different spellings of the same information. Users need to spot the similarities and differences and then edit each facet value to merge the same data into one group. It would be more efficient if there were a way to group similar facet values and show what the minor differences are, so we can standardize the data. I also hope there is a tool to mass-standardize all the similar data into one category so the user doesn’t need to edit them one by one.
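As a rough illustration of the grouping and mass-standardizing idea, here is a minimal Python sketch. It is not how OpenRefine works internally; the `fingerprint` and `standardize` helpers and the sample titles are made up for this example. It assumes that values differing only in case, spacing, punctuation, accents, or word order should count as the same entry, and it will not catch real misspellings, which would need some form of fuzzy matching.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical grouping key: ignores case, extra spaces,
    punctuation, accents, and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop combining marks so accented and unaccented forms match.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, keep letters, digits, and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def standardize(values):
    """Replace every value with the most common variant in its group."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    canonical = {
        key: Counter(members).most_common(1)[0][0]
        for key, members in groups.items()
    }
    return [canonical[fingerprint(value)] for value in values]


# Made-up sample data: four variants of one title plus a different title.
titles = ["Le Cid", "le cid ", "Le Cid", "Cid, Le", "Tartuffe"]
print(standardize(titles))
```

Picking the most common variant as the standard form is only one possible choice; a real tool would probably let the user confirm or override it for each group.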

In addition, some of the information in the data is more detailed than it needs to be. In the Plays.csv file, the records are accurate to the second. This level of detail is unnecessary and makes it hard to sort the opera data by day. I wish there were a way to filter the data while ignoring the seconds, and to sort the opera data by the first few characters of the value instead of the whole value.
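To show what ignoring the time of day could look like, here is a small Python sketch. I am assuming timestamps of the form `YYYY-MM-DD HH:MM:SS`; the actual format in Plays.csv may differ. The column names `title` and `date`, the helper names, and the sample rows are hypothetical.

```python
from datetime import datetime
from itertools import groupby


def day_of(timestamp: str) -> str:
    """Keep only the calendar day of a timestamp (assumed format)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.date().isoformat()


def group_by_day(rows):
    """Sort rows by day and collect the titles that share a day."""
    ordered = sorted(rows, key=lambda row: day_of(row["date"]))
    return {
        day: [row["title"] for row in group]
        for day, group in groupby(ordered, key=lambda row: day_of(row["date"]))
    }


# Made-up rows standing in for records from the file.
rows = [
    {"title": "A", "date": "1789-04-14 17:30:05"},
    {"title": "B", "date": "1789-04-13 18:00:00"},
    {"title": "C", "date": "1789-04-14 20:15:42"},
]
print(group_by_day(rows))
```

With this kind of key, records from the same day end up together no matter what the hours, minutes, and seconds are.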

Moreover, in the register.csv data file, there seems to be a bug when sorting Register Num in its default mode (sort by name): it compares the numbers as text, one character at a time starting from the first, so ‘9’ is listed after ‘89’. It would be helpful if there were more ways of sorting, e.g. sort by value.
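The difference between the two sort orders can be reproduced in a few lines of Python. The register numbers below are made up, and the function names are only for illustration.

```python
def sort_by_name(numbers):
    """Compare the values as text, character by character."""
    return sorted(numbers)


def sort_by_value(numbers):
    """Compare the values as integers."""
    return sorted(numbers, key=int)


# Made-up register numbers stored as text, as they would be read from a CSV file.
register_nums = ["89", "9", "100", "12"]

print(sort_by_name(register_nums))   # ['100', '12', '89', '9']
print(sort_by_value(register_nums))  # ['9', '12', '89', '100']
```

Sorting by name puts "9" after "89" because the character "8" comes before "9", while sorting by value gives the expected numeric order.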