Data & Tools

  • Blog Post About Cleaning Up Data

    Insights into Data: It is no wonder that the period is called the Comédie-Française: according to play.csv, most plays written during this period were comedies or a combination of comedy and another genre. No single author took the period by storm; a variety of authors contributed to the comedy genre in France. The plays varied in length, but a majority had 5 acts. According to register.csv, they were spread evenly across the days of the week.

    Insights into Platform: The platform is easy to use for interpreting information that is already stored in a database format.

  • Data exploration in OpenRefine of Marvel Comics & Characters Data

    Marvel Characters and Comic Books

    While I was unable to download this dataset from the Canvas link, I was able to find it online. The data is arranged by character name, with the character’s real name (e.g., Bruce Wayne) in parentheses. The “id” field describes whether that real name is a public identity or a secret identity. The dataset also records each character’s physical characteristics, living status (e.g., “deceased character”), number of comic appearances, and first appearance. The dataset was already fairly clean, but I was able to remove unnecessary spacing at the end of strings and split the “first appearance” category into two columns (year and month). That left two columns with the same year information, so I removed one of them.
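
    The two clean-up steps described above (trimming stray whitespace and splitting “first appearance” into two columns) can be sketched outside OpenRefine as well. This is only a rough illustration: the “Mon-YY” format, the column names, and the sample row are assumptions made for the example, not values taken from the dataset.

```python
def clean_cell(value: str) -> str:
    """Remove leading and trailing whitespace from a cell."""
    return value.strip()


def split_first_appearance(value: str) -> tuple[str, str]:
    """Split a 'Mon-YY' string into (month, year).

    The 'Mon-YY' layout is an assumption for this sketch.
    """
    month, _, year = clean_cell(value).partition("-")
    return month, year


# Hypothetical row with trailing spaces, as described in the post.
row = {"name": "Spider-Man (Peter Parker)  ", "first_appearance": "Aug-62 "}
row["name"] = clean_cell(row["name"])
row["month"], row["year"] = split_first_appearance(row.pop("first_appearance"))
print(row)
```

    If the dataset already has its own year column, the extracted year can simply be dropped, which mirrors the redundant column removed above.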

  • OpenRefine Comment_Xiaofan Ye

    During this data clean-up exercise, I noticed that duplicates are caused by many minor formatting errors in the data set, such as a space after the text or different spellings of the same information. Users need to spot the similarities and differences and then edit each facet to organize the same data into one group. It would be more efficient if there were a way to group similar facets and compare their minor differences, so we could standardize the data. I also hope for a tool that mass-standardizes all the similar data into one category, so the user doesn’t need to edit the entries one by one.
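
    One way to approach the grouping wished for here is key-collision clustering, which is the idea behind OpenRefine’s “Cluster and edit” feature: normalize each value to a key and treat values with the same key as one group. The sketch below is a simplified version under stated assumptions. The sample values are made up, and this normalization only catches differences in case, spacing, and punctuation, not genuinely different spellings.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a key: trim, lowercase, drop ASCII punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def standardize(values: list[str]) -> list[str]:
    """Replace every value with the most common spelling that shares its key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    canonical = {
        key: Counter(members).most_common(1)[0][0]
        for key, members in groups.items()
    }
    return [canonical[fingerprint(value)] for value in values]


# Hypothetical genre column with trailing spaces, case, and punctuation noise.
genres = ["comédie", "Comédie ", "comédie.", "comédie", "tragédie"]
print(standardize(genres))
```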

    There may also be redundancies in the data. In the Plays.csv file, the records are accurate to the second. This level of detail is unnecessary and makes it hard to sort the opera data by day. I wish there were a way to filter the data (ignoring the seconds) and sort it by the first few characters of each value instead of the whole string.
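
    As a rough sketch of sorting by day while ignoring the time of day, each timestamp can be reduced to its date before sorting. The “YYYY-MM-DD HH:MM:SS” layout and the sample rows are assumptions for illustration; the actual format in Plays.csv may differ.

```python
from datetime import date, datetime


def day_key(timestamp: str) -> date:
    """Parse a timestamp and keep only the calendar date."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()


# Hypothetical (title, timestamp) rows.
rows = [
    ("Play B", "1700-03-02 00:00:05"),
    ("Play A", "1700-03-01 20:15:00"),
    ("Play C", "1700-03-02 00:00:01"),
]

# sorted() is stable, so rows on the same day keep their original order.
by_day = sorted(rows, key=lambda row: day_key(row[1]))
print([title for title, _ in by_day])
```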

    Moreover, in the register.csv file, there is a bug when sorting Register Num in its default mode (sort by name): the values are compared character by character, starting with the first, so ‘9’ is listed after ‘89’. It would be helpful to have more ways of sorting, e.g., sort by value.
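
    The ordering described above is what happens when numbers are compared as text. A small sketch with made-up register numbers shows the difference between sorting by name and sorting by value:

```python
# Hypothetical register numbers stored as text, as they would be in a CSV file.
register_nums = ["89", "9", "100"]

# Text sort: compares character by character, so "9" lands after "89".
as_text = sorted(register_nums)

# Numeric sort: converts each value to an integer before comparing.
as_numbers = sorted(register_nums, key=int)

print(as_text)
print(as_numbers)
```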

  • mbox-converter

    Here’s a GitHub repository with a simple script you can run to turn an .mbox file, the format your emails are exported to, into a .csv file, which you can view as a spreadsheet. Instructions for how to install and use it are included in the repository’s home page.
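
    To give a sense of what such a conversion involves, here is a minimal sketch using Python’s standard mailbox and csv modules. It is not the repository’s script: the function name and the choice of header fields are assumptions for this example, and message bodies and attachments are ignored.

```python
import csv
import mailbox

# Header fields to export; this selection is an assumption for the sketch.
FIELDS = ["From", "To", "Date", "Subject"]


def mbox_to_csv(mbox_path: str, csv_path: str) -> int:
    """Write one CSV row per message and return the number of messages."""
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(FIELDS)
            for message in box:
                # Missing headers become empty cells.
                writer.writerow([str(message.get(field, "")) for field in FIELDS])
                count += 1
    finally:
        box.close()
    return count
```

    Calling mbox_to_csv("export.mbox", "export.csv") with your own file names would produce a spreadsheet with one row per email.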

  • Open Data Catalogs

    dataportals.org hosts a huge collection of (collections of) data, organized by place. This is a great source for municipal data sets: for example, zooming in on Boston will lead you to the Metro Boston DataCommon, where you can explore visualizations relating to culture and the arts, among many other categories, and export the raw data from those visualizations or from new ones you create.

  • NYPL Maps

    The New York Public Library has a lot of excellent digitized collections, among them a huge set of maps of New York and other areas. In addition to browsing and search interfaces, they provide an awesome tool called Map Warper, which lets you overlay historical maps onto a dynamic present-day map. You can check out some highlights here, and view the collection itself at maps.nypl.org.