6 Provocations of Big Data

The two provocations I read were “Bigger Data are Not Always Better Data” and “Just Because it is Accessible Doesn’t Make It Ethical”. In “Bigger Data are Not Always Better Data”, the authors argue that researchers need to acknowledge which group of people their data sets actually represent. There is a large focus on researchers who use Twitter as part of their research. The authors state that “Twitter does not represent all people although…researchers refer to ‘people’ and Twitter users as synonymous”. Following this, they discuss why Twitter is not representative (what people actually see from the API is a very limited subset) and why it is important to acknowledge that a data set is incomplete: when researchers use Twitter as part of their research, they need to keep in mind that the metadata they find is not necessarily complete or correct. “Just Because it is Accessible Doesn’t Make It Ethical” brings up the use of public content. The authors discuss the questions researchers face when considering whether to use online content in their work, as well as some of the regulation around using humans as research subjects. There is also a paragraph on the accountability of researchers in these situations; however, the authors never clearly define what the researchers must be accountable for.

As someone who has tried to access the Twitter API, I can only imagine what it is like to actually use it. When I tried, gaining permission was such a hassle that in the end I gave up on using the Twitter data directly and relied on websites built around providing Twitter metadata. While scrolling through those sites looking for the information I wanted, I could not get a comprehensive view, and I recognized that. I considered using a web scraper to pull information off Twitter, but because I had little experience building web scrapers, and because the focus of my data scraping was emojis, I gave that up as well. Even so, to a normal person attempting to use Twitter information, the numbers are huge. Looking at the scale of the data, others may think that such huge numbers are good enough to support general conclusions. But I agree with the paper: just because you have all this information doesn’t make it comprehensive. It is representative of a small subsection of users, and that should be kept in mind when using the data in research. Truly in-depth and comprehensive numbers would require pulling data from another source.

I really enjoyed reading the “Just Because it is Accessible Doesn’t Make It Ethical” section, as it discusses just that. Just because you can see something doesn’t mean you can use it. The ethical dilemma of whether or not to use public information is, I can only assume, highly debated. However, something I kept thinking about while reading was the Terms and Conditions of websites like Twitter. I feel like the paper didn’t take into account what may be contained in those terms. When a user signs up for something, there is a disclaimer. But who actually reads the disclaimers? For all I know, a researcher somewhere may be using the information and content I post online, legally and rightfully, because I signed away my ability to make that choice (of yes or no) when I agreed to the Terms and Conditions.