Project Updates

  • Virtually Here - Report and Presentation

    Final Report: https://docs.google.com/document/d/1NSjnjdF9Jb7GOKXmX8exuU387puKr9a9GJqQuqPBrZA/edit?usp=sharing

    Final Presentation Slides: https://docs.google.com/presentation/d/1QJSjn6oEeJtpEkpHRIXPJp__9xi_c3a3MiI2nVUs4Fc/edit?usp=sharing

  • Final Project: Eating Around the World, In NYC

    Presentation: here.

    Source code: here.

    Website (hosted on Github Pages): here.

  • CurateMe final project draft

  • Final project updates 4/21-4/27

    Final project updates from last week (April 20) available here.

    Project updates for the week of 4/21 - 4/27:

    We are underway with the technical aspects of our project (tentatively titled CurateMe). Alan has worked on authenticating users through Twitter using a node.js library. David has been working on the VR aspect on the Google Cardboard and made a demo using a few images around campus that change like a slideshow (viewable here); in the future this can be modified to show the view at the scene along with the curated posts. David is also adding more features to the VR, such as getting it to turn based on the user's head position. Julia and Keeley have been doing the equivalent of Alan's work for photos. One potential change is that we might pull photos from Flickr instead of Instagram due to privacy issues around users' posts, but we are looking into whether we can work around them. Regardless, we are in good shape for the final project.
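
    For reference, here is a minimal sketch of what the Twitter login piece could look like, assuming Express and the passport-twitter strategy (the post does not name the exact node.js library, so both choices are assumptions):

    ```js
    // Hypothetical sketch: Twitter sign-in with Express + passport-twitter.
    // Library choice, route names, and env variable names are assumptions.
    const express = require('express');
    const session = require('express-session');
    const passport = require('passport');
    const TwitterStrategy = require('passport-twitter').Strategy;

    passport.use(new TwitterStrategy({
      consumerKey: process.env.TWITTER_CONSUMER_KEY,
      consumerSecret: process.env.TWITTER_CONSUMER_SECRET,
      callbackURL: 'http://localhost:3000/auth/twitter/callback'
    }, (token, tokenSecret, profile, done) => {
      // A real app would look the user up (or create them) in a database here.
      done(null, profile);
    }));

    const app = express();
    // The OAuth 1.0a dance needs session storage for the request token.
    app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));
    app.use(passport.initialize());

    app.get('/auth/twitter', passport.authenticate('twitter'));
    app.get('/auth/twitter/callback',
      passport.authenticate('twitter', { session: false }),
      (req, res) => res.send(`Logged in as @${req.user.username}`));

    app.listen(3000);
    ```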

    Work that needs to be done includes the sentiment analysis (most likely a group effort) as well as more work with the Twitter and Flickr/Instagram APIs to pull posts based on location, which will be delegated to respective members of our team based on our current roles.

  • Karrie & Ali - HATHITrust

    We changed the focus of our project – we’re exploring the limitations of HATHITrust, a well-known content repository, as a basis for DH projects (rather than our earlier goal of using a HATHITrust collection to create a dataset that others could explore).

    Problems encountered: There are unexpected restrictions on the data (licensing, which trumps copyright and takes large swaths of data out of reach for public scholarship), which means we will have even less than we originally imagined. The documentation for creating worksets is so poor that we had a hard time recreating work we had already done. The interface is not helpful for knowing what you are putting into your workset, which makes for tedious and brittle workflows.

    New Concept: Document the barriers to doing simple things with a repository of content that is promoted to scholars as a useful resource, by trying to accomplish a series of relatively simple tasks.

    Next Steps: We need to see if we can actually get a dataset that has all the pieces we need, possibly by Friday. Karrie will document all the barriers already encountered. Ali will work on creating a search tool and some visualizations and document how working with repository content was difficult. We will look for recommendations for massaging the data (text markup?) that might allow for more user-friendly ways to explore and visualize the content.

    Project ppt

  • Food Maps: Project Update

    This week, we explored the food tour app concept in more detail and decided to rework our concept so that it focuses more on the cultural aspect of food in New York City.

    Old Concept

    Our initial project concept had a few aspects to it: present-day food maps, historical food maps, and food tours. Based on the class feedback, we zeroed in on the food tour aspect and were planning on generating customizable food tours for NYC.

    Problems Encountered

    Long story short, there are a lot of food tour apps out there, and there is no clear way for us to add anything new to the food tour app sphere. Some notable examples include:

    1. Rama: Offers pre-written, “curated” food tours and crawls by food fanatics for a low price of $1-3 per tour.
    2. Peek: Offers scheduled tours (e.g. Pizza and Little Italy Walks, Neighborhood Wine Tours, etc.) for a price, as well as booking services for the destinations in your itinerary.
    3. Localeur: Offers crowd-sourced food collections (e.g. Best Donuts in Boston, Best Places to Brunch, etc.) subject to community up-votes and down-votes.
    4. Many, many more…

    New Concept

    We’ve decided to pivot and focus more on the cultural/academic aspect of our original idea. We are planning on building a data visualization tool that allows viewers to visualize different ethnic cuisines in NYC with overlaid auxiliary information. The auxiliary information we include is subject to the data we can find, but so far we plan on displaying information regarding:

    1. Restaurant accolades
    2. Ethnic neighborhood breakdown
    3. Fusion level

    This tool will be geared more towards academics such as food anthropologists, so it will be optimized for desktop browsers.

    Next Steps

    Borrowing from the process of the VRMFA team, we decided to split the work into front-end (Nicole) and back-end (Zygi) and focus on building bare minimum functionality: heat maps that show the distribution of different ethnic cuisines in NYC. After this, we will tackle additional dimensions to overlay on our heat maps.
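
    As a starting point, here is a bare-minimum sketch of that first heat map, assuming Leaflet with the leaflet.heat plugin and a hypothetical list of geocoded restaurants (the team has not committed to a specific library):

    ```js
    // Sketch: one heat layer per cuisine, assuming Leaflet + leaflet.heat.
    const map = L.map('map').setView([40.7128, -74.0060], 11); // centered on NYC

    L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
      attribution: '&copy; OpenStreetMap contributors'
    }).addTo(map);

    // Hypothetical geocoded restaurant records.
    const restaurants = [
      { lat: 40.7159, lng: -73.9970, cuisine: 'chinese' },
      { lat: 40.7180, lng: -73.9961, cuisine: 'chinese' },
      { lat: 40.7306, lng: -73.9866, cuisine: 'ukrainian' }
    ];

    function cuisineHeatLayer(cuisine) {
      const points = restaurants
        .filter(r => r.cuisine === cuisine)
        .map(r => [r.lat, r.lng, 1.0]); // uniform weight per restaurant
      return L.heatLayer(points, { radius: 25 });
    }

    cuisineHeatLayer('chinese').addTo(map);
    ```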

  • Project Pitch: Eating Around the World in NYC

    Presentation

    Here

    Problem Statement

    How have different ethnic cuisines mixed in NYC, in terms of neighborhood, price, and reputation?

    Pitch

    New York City has always been known for its wide variety of ethnic cuisines. There are about 24,000 restaurants in NYC, each providing unique insights into the area it operates in. Restaurants, cafes and bistros can tell us a lot about the culture, lifestyle and community of an area. Our idea is to build a web application that would let users explore the restaurants in NYC and see the map overlaid with other interesting statistics about the city’s various areas. In addition, the application would allow the user to generate a food tour of NYC. It would be customizable by length, popularity, cost and other factors, and would generate a journey that lets users experience the various cultures of NYC through the food they have to offer.

  • Final project pitch

    Final project idea: Crowd-sourced location-aware digital curation to augment space

    View our final project pitch presentation here.

    We’re interested in answering the question: How do different people view and feel about the same space? Our idea for our final project involves creating an application for the Google Cardboard. Based on GPS location, the user can put on the Cardboard and view posts curated from others at that same location, such as Instagram photos and tweets. As an additional layer, we would also include the emotion expressed in each tweet, using sentiment analysis. Our hope is that this application could have a multitude of uses, ranging from more scholarly uses, such as at historical sites and museums, to more casual ones. This would allow us to gain insight into the experiences and perspectives of the people around us.
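
    As a rough illustration of the sentiment layer, here is a sketch using the open-source `sentiment` npm package as a stand-in (the actual analysis method is still undecided):

    ```js
    // Sketch: per-tweet sentiment scoring with the "sentiment" npm package.
    const Sentiment = require('sentiment');
    const analyzer = new Sentiment();

    function scoreTweet(text) {
      const result = analyzer.analyze(text);
      // "comparative" is the score normalized by the number of tokens.
      return { text, score: result.score, comparative: result.comparative };
    }

    console.log(scoreTweet('I love the view from this bridge!'));
    // => roughly { score: 3, comparative: 0.375, ... }
    ```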

  • Karrie & Ali HATHITrust Project

    About the Time Machine project

    Using text data from a monthly student publication, we will explore ways to reveal “what it was like” to be a Chinese student in the U.S. in the early part of the 20th century. The use of HathiTrust digitized materials is partly to help understand how libraries can be the bridge between large digitized bodies of information and researchers who do not necessarily have the skills or time to make them usable as datasets rather than online books for reading. Libraries want to present information in its most usable form to researchers, and not be locked into only offering discovery and use of books, articles, etc. Our powerpoint: https://docs.google.com/presentation/d/1aIe-eKhUFYHn6wIFVUAMcghpxj6zm1_kTv9041a_Gv8/edit?usp=sharing

  • Updates for Final Project

    After talking to everyone in class, David and I thought that we had similar topics and themes because we were both interested in visualizing changes in data over time from popular sites like Reddit and Github. Keeley’s was similar as well in that we all wanted to see how certain trends come and go, whether they are on the internet or in print. Specifically regarding my original idea of visualizing the internet over time through Reddit, I realize that I should narrow down the scope of the project so that I don’t try to bite off too much at once. Besides looking at the top subreddits for every month and their corresponding metadata, we could also focus on just one topic or subreddit such as science or AMAs. It would also be interesting to explore the differences between media sites such as Reddit, NYTimes, Facebook, etc. to see how information flows through different sources or if there is an inherent bias in certain sources.

  • Project Updates

    Project Update

    After meeting with Ziggy about his idea to create an image curation website, we didn’t see any similarities between his project and mine: his website would be for curating general images, while my project is interested specifically in images of landscapes, nature, and buildings.

    New Project Idea: Timeline of History

    A new project idea would be to create a website containing a navigable timeline that graphically illustrates the history of the world. This timeline would mine data and images from Wikipedia, which should be feasible since Wikipedia has pages dedicated to the events of each year in human history. Additionally, these pages link to articles dedicated to each event, and many articles have images published under a Creative Commons license that allows reuse (as long as the author is attributed).
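
    As a quick feasibility sketch, the MediaWiki API can return the text of a year page; extracting the “Events” section and images would build on a call like this:

    ```js
    // Sketch: fetch the intro text of Wikipedia's page for a given year.
    // Parsing the "Events" list into timeline entries is left out here.
    async function fetchYearSummary(year) {
      const url = 'https://en.wikipedia.org/w/api.php' +
        '?action=query&prop=extracts&explaintext=1&exintro=1' +
        `&titles=${year}&format=json&origin=*`;
      const resp = await fetch(url);
      const data = await resp.json();
      const pages = data.query.pages;
      return pages[Object.keys(pages)[0]].extract;
    }

    fetchYearSummary(1969).then(text => console.log(text.slice(0, 200)));
    ```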

    In short, this would be a graphical timeline of Earth’s history mined from data in Wikipedia. It would be unique because of its ease of navigation and visual appeal. One significant use for the website would be that users could better understand the magnitude of time that has passed between certain events by having a chance to visualize the flow of time.

  • Final Project updates

    I saw a number of similarities between my project idea about visualizing GitHub project development and Julia’s idea about visualizing Reddit data. Both ideas involve visualizing information that develops over time and being able to easily see the state of the data at any given time. I think the main difference between our two ideas is that GitHub data would appeal primarily to software developers, while Reddit data would appeal to a broader audience. Through the discussions about my idea last lecture, I came up with a number of questions that I would like a GitHub visualization to be able to answer: Does this project look like it will be a success (has it reached a critical point)? How does the development of one open source project compare to the development of another? What are the stages of a project’s development?

  • Final Project: Modifications

    Over the break, I thought more about making the cross-cultural component of my food project more prominent. I think that food fads are hilarious and interesting, and these fads are very often cultural: Korean barbecue, sushi, Spanish tapas, etc.

    Reading through the other final project ideas in the class, I had an idea that builds off my original idea, boosts the cultural component, and incorporates elements of Keeley’s project: a tool that allows users to track and visualize the history of cultural food fads by looking at restaurant menus and reviews over time. It could also allow you to track classic dishes as they became popular in restaurant cuisine. For example, this tool could allow you to see the prevalence of Korean ingredients in restaurants rise over the past decade and visualize the origin and evolution of the Sazerac.

    This modified project has largely the same audience, but it has a more prominent cultural perspective that may draw in an anthropological crowd as well.

  • CFRP Data Visualization

    Draft 2 Update

    a. Title

    CFRP in Time (viewable here)

    b. Goal

    Our CFRP timeline aims to contextualize the performance data available via the CFRP API, and to present visual cues for the associated metadata.

    c. Approach

    We present different dimensions of the data through a variety of means. Firstly, performances are sorted chronologically by premiere date. This enables a viewer to see what time span is covered by the CFRP data and how frequently new plays were being performed. The performances are further keyed by an image representing the genre of the performance, and by some metadata. The metadata includes author, date, number of acts, and types of tickets sold. We also use color to represent the popularity of a play, if ticket sale data is available for the performance. A red background means 15,000+ tickets were sold for the performance over its lifetime, pink means 5,000 to 10,000 tickets, and light pink means fewer than 5,000 tickets sold. Lastly, we integrated historical events into the timeline to provide hints of how performance genres, ticket sale success, and frequency may have been affected by the current events of the time. This adds an additional layer to the raw metadata from the CFRP API.
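
    The popularity coloring reduces to a small rule; here is a sketch (the description above leaves the 10,000-15,000 band unspecified, so this version folds it into pink):

    ```js
    // Sketch of the popularity color rule described above.
    function popularityColor(ticketsSold) {
      if (ticketsSold == null) return 'white'; // no ticket sale data available
      if (ticketsSold >= 15000) return 'red';  // 15,000+ tickets over its lifetime
      if (ticketsSold >= 5000) return 'pink';  // 5,000-15,000 tickets
      return 'lightpink';                      // fewer than 5,000 tickets
    }
    ```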

    d. Audience

    This project is intended for anyone interested in the events of 17th- and 18th-century France and, in particular, the interplay between French performances of the time and historical events.

    e. Obstacles

    We struggled with some missing data and were unable to get ticket sale numbers for a good number of the performances. Furthermore, we would have liked to include even more details about specific actors, and to have had more information about the content of the performances, to better tie them to specific events in French history. In a future iteration, we would love to separate the French history timeline and the CFRP performance data into two side-by-side timelines, and allow a user to click an event in the French history timeline (or, conversely, a specific performance) and have the related performance or historical event highlighted in the corresponding timeline. This could help French history scholars better understand the art of the time, and art historians better appreciate the relevant historical events.

  • Project

    Project Title

    Rise of YouTube

    Problem Statement

    Analyzing the interactions of YouTubers and how they grew to gain their first 1 million followers.

    Approach

    This class has been focused heavily on timelines and visualizing things over time, so I thought it’d be cool to analyze the top 10 or so YouTube videos before their first million followers and see who they interacted with beforehand.

    Audience

    YouTubers and people interested in becoming YouTubers

    Skills Needed

    Minor coding skills needed to make the timeline

  • Potential Scam Message Ideas

    I have always been curious about how many people actually respond to scam messages, and, from there, how many people actually end up getting scammed into transferring money. If there is data available about scam message responses (such as the response rate, average amount of money made, etc.), it would be really interesting to create a visualization to see which kinds of scam messages garner the most responses and money from people. Are there certain topics and stories (e.g. cancer/disease, a fugitive seeking help) that are more appealing and persuasive to readers? The messages could be analyzed based on topic, length, and formality to see if there are any trends between those variables and their effectiveness.

    Another aspect of scam messages that I’ve found interesting is the way they are written. I actually find it off-putting how the emails are constructed: they look clumsily computer-generated, very formal sounding, with spotty punctuation and grammar. To me, the appearance of the message is enough to tip me off that it is a scam, because it looks suspiciously constructed. I wonder how much more effective scam messages would be if they were rewritten in a way that sounded more conscientious, human-written, and presentable, rather than the way they are now. Similar to the way the videos at the List exhibit showed real people reading the messages, these messages could be displayed in a proper written form and we could observe the difference it makes on readers.

  • Keeley - Potential Scam Message Idea

    I was considering a dynamic visualization with a bipartite screen. On the left side is a bare-bones, anonymous representation of a scammer, and on the right side is some equally indistinguishable representation of a mark. The visualization could then be launched, and some representation of a message would be sent from the Nigerian scammer to the mark. As this message sends, it illuminates some aspect of the initially fully anonymous scammer representation (e.g. if the message alludes to the scammer being a prince, the representation gains a crown). The message is received by the mark representation on the right-hand side, and that representation also becomes more defined. Messages ping-pong between the two sides of the screen, at each iteration illuminating more and more about each of the interacting parties.

    The evolution would represent two distinct viewpoints: the scammer as imagined by the mark, and the mark as imagined by the scammer. At the end of the string of interactions, both representations revert to their true state, from what has been imagined. The visualization shows the false perceptions in a scam interaction, and how an interaction develops over time.

  • Keeley - Final Project Pitch

    Problem Statement

    Words are introduced into the English language at an extremely rapid pace. As new words begin to be used, their meaning is often ambiguous and undefined. New words then become more integrated into our society, and their meanings begin, somewhat paradoxically, to both solidify and morph. A word’s full history, including when, and in what context, it was introduced, how its meaning has evolved over time, and in what contexts it is currently being used, provides a unique view of the English language. Further, it provides context for a given word.

    Approach

    1. Choose an interesting English word with a divergent history
    2. Identify which corpora to parse for word instances/occurrences
    3. Design a visualization to demonstrate the word’s evolution

    Audience

    This project appeals to the general public, and could also be used by researchers and scholars. Linguists in particular might be very interested in the project. The analysis would reveal how events impact a word’s use, popularity and connotation, providing both research and general interest insights.

    Skills Needed

    Text parsing is crucial for the success of this project, especially a robust pipeline that can ingest a large amount of text. The project also requires front-end skills (HTML, CSS, JS, and other data visualization tools).

  • Final project

    Project Title:

    Open Source Github Visualization

    Problem Statement

    It is difficult to see who collaborated on different sections of a project when all of the commits are viewed in list form. It is also difficult to see at what stages of a project’s development different contributors were active.

    Approach

    My final project idea is to visualize data on open-source GitHub repositories, such as when authors commit and which authors edit the same files or parts of files. I would like to graphically display this information in a way that allows a viewer to easily see which users collaborated with each other and how the development of a project has progressed over time. I would like there to be a playback feature where a viewer can watch the development of a repository unfold one commit at a time.

    Audience

    Both programmers and users interested in an open source project’s development.

    Skills Needed

    D3 knowledge to graphically visualize commits. jQuery knowledge to make API calls to GitHub.
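
    A rough sketch of the jQuery call against a placeholder repository (the commits endpoint returns at most 100 commits per page, so a full version would follow the pagination links):

    ```js
    // Sketch: pull commit history from the GitHub API with jQuery.
    function fetchCommits(owner, repo) {
      return $.getJSON(
        `https://api.github.com/repos/${owner}/${repo}/commits?per_page=100`
      ).then(commits => commits.map(c => ({
        sha: c.sha,
        author: c.commit.author.name,
        date: new Date(c.commit.author.date)
      })));
    }

    fetchCommits('torvalds', 'linux').then(commits => {
      // Sort oldest-first so a playback feature can step through them in order.
      commits.sort((a, b) => a.date - b.date);
      console.log(commits[0]);
    });
    ```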

  • Project Pitch

    Photography Exhibition Composer

    Problem Statement

    There are a lot of different tools that enable users to display their photo collections online. However, most of them are limited in one way or another. For example, while Omeka has a lot of metadata and organization features, its “Exhibition” feature isn’t flexible or user-friendly. More approachable tools like Google Photos are also more focused on organizing one’s photos than displaying them.

    Approach

    We would design and implement a tool that would be focused on organizing photos into an “exhibition” that can then be displayed publicly on the Internet. The main design goals would be:

    • User friendliness: it should be usable without any technical knowledge.
    • Floating layout: instead of being forced into a grid, the photos should be freely arrangeable on the screen.
    • Dynamic: the users should be able to actually ‘design’ the exhibition by iterating inside the tool, not just use it to create an exhibition from an existing design.
    • Style minimalism: other than displaying borders and shadows, the window itself would not be excessively styled - photos would be the content.

    Audience

    Future CMS633 students; photography hobbyists and photo collectors who want to display their photos online in a more curated way

    Skills Needed

    Frontend programming (CSS, JS) would probably take the most time; graphic design would also be important but would probably take less time because of the focus on minimalism. Backend programming work would be limited to saving the exhibitions and making them publicly visible.

  • Karrie - Final Project

    Project Title:

    Chinese Students Studying Abroad in the U.S. 1900-1920

    Problem Statement:

    How did Chinese students in U.S. universities use their key organizations and bulletins to stay connected with each other, stay informed about developments in China, shape student opinion and behavior, and reflect on their experience?

    Approach:

    • Develop a corpus of fulltext primary and secondary source materials within the HathiTrust collection relevant to the topic.
    • Develop some geographical data visualizations, such as locations of national or regional conferences, home states or cities of the officers of organizations and/or editorial boards of student bulletins.
    • Design a visualization to show the “home topics” of the key monthly bulletin over time, as a way to highlight the key topics about Chinese politics or events of interest to Chinese students studying abroad in the U.S.
    • Design an interface through which users would be able to explore the reflections and exhortations of students as written up in student publications, biographies or other primary sources.

    Audience:

    Me. Potentially some of my friends who are from China and have been students in the U.S.

    Skills Needed:

    Creation of spreadsheets. Data visualization. HathiTrust offers an API and downloading of OCR text, and it’s also possible to request the creation of datasets for pre-1923 materials. I would like to experiment with that, but I wouldn’t have the programming skills to make anything out of the data once I got it.

  • Final Project Idea: The Evolution of the Internet, As Seen Through Reddit

    Problem Statement

    In the past decade, the internet has exploded in size and relevance in our culture. The internet is a major (and perhaps the main) source of information and media for people today. How can we see what topics and perspectives people have cared about over time? How has the internet shaped what we think about, and vice versa? One option is to explore this idea through reddit.com, “the front page of the internet.”

    Approach

    We would build a data visualization tool that displays the most popular subreddits of a certain time period, such as February 2016 (right now I envision a bubble/web sort of layout). The size of each subreddit in the visualization is directly proportional to its popularity. The user can click on each subreddit to show more information, such as the top posts and comments, number of visitors, etc. Since Reddit has only been around since June 2005, it is feasible to provide a visualization for every month since its founding. Many more features are possibilities as well, such as colors for different categories of topics (politics, science, etc.), a list on the side with the top posts for that month, filters for more advanced queries, and superimposing major economic, social, and political events onto the timeline.

    Audience

    This project appeals to the general public and can be of use to scholars as well. Non-scholarly users can use the data visualization to see how the internet has evolved over time, in terms of both size and content. Scholars can use the filters to see how the general public/users of the internet reacted to certain events and analyze the popularity trends of posts and topics.

    Skills Needed

    This would definitely require front-end skills (HTML, CSS, especially JS and other data viz tools) to make a clean and responsive design. There is a Reddit API that can ideally provide the data for the backend.
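
    As a sketch of what the Reddit API exposes, the public listing below returns the currently popular subreddits; historical month-by-month snapshots would need another source, such as archived data dumps:

    ```js
    // Sketch: fetch today's most popular subreddits from Reddit's public JSON API.
    async function topSubreddits(limit = 10) {
      const resp = await fetch(
        `https://www.reddit.com/subreddits/popular.json?limit=${limit}`
      );
      const data = await resp.json();
      return data.data.children.map(c => ({
        name: c.data.display_name,
        subscribers: c.data.subscribers
      }));
    }

    topSubreddits().then(subs => console.log(subs));
    ```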

  • CFRP Academy Awards

    CFRP Awards

    Project Members

    Zygimantas Straznickas, Nicole Seo, Karrie Peterson

    Goal of Prototype/Questions Answered:

    Our group wanted to explore how we can make the CFRP more accessible to the general public. Specifically, how can we enable non-experts to explore general interest trends within this data – e.g., popularity of plays, of authors, of actors?

    Today’s audiences often think in terms of awards, box office profits, top-billed celebrity performers. We wanted to create a website that would put the CFRP data into these kinds of terms.

    Our Approach

    To achieve this goal, we created an award generator in the style of the Academy Awards. We heavily processed the data so that users could easily peruse the world of old French theatre. Other CFRP tools we have seen keep the data very much intact, allowing users to do what they want with it. We took a different approach, using the data to tell a story in modern-day terms.

    Intended Audience

    Our intended audience is the general public–people who don’t know a lot about the Comédie-Française but are perhaps Francophiles or theatre enthusiasts. The site will give these people an easy entry point into the world of 17th-18th century French theatre. They will learn some of the big names of the time and get a general sense of who’s who.

    Obstacles, Regrets, and Lessons

    • Obstacles: It was a challenge to decide exactly how we wanted to fit the Academy Awards metaphor to this data that was not made for the Oscars. It was also challenging to think about how to include awards for different actors/actresses due to limitations of the CFRP API (a challenge we weren’t able to overcome in two iterations).
    • Regrets: We regret that we weren’t able to explore other APIs in order to include more information that would bring these awards to life (e.g. critic reviews, plot summaries, date context, etc.). The point of the site was to interest the general public, and a little more information is probably necessary to do this.
    • Lessons: Your vision is often limited by your data.
  • Money Can't Buy Happiness, But It Can Buy Food

    Problem Statement

    In cuisine, there are some dishes fit for kings and some fit for peasants (historically, at least). For example, stews were originally peasant food because poor-quality meats could be masked in the heavy sauce. On the other hand, macarons used to be upper-class food because the almond flour involved, the high failure rate, and the precision in temperature needed to bake the shells made them more expensive to produce.

    What is interesting to consider is the progressive blurring of the socioeconomic divides in the culinary world. Haute cuisine is embracing ingredients and techniques that used to be frowned upon (e.g. Spam). Low-brow cuisine is embracing organic, fresh, and consequently more expensive ingredients. We still have the old-school accolades, such as the much-revered Michelin star, but we also have new forms of recognition, such as Diners, Drive-Ins and Dives and a 4.5-5 star Yelp rating.

    What I would like to investigate is the extent of the blurring between the high and low culinary worlds. Are fancy and casual dining actually finding common ground, or are there still fancy dishes and casual ones? I propose we build a tool that will help us answer this question.

    Approach

    I would like to build a visualization tool that will allow people to see the relative frequency of a certain main ingredient/dish in the dining scene, so they can see how common certain dishes are given a certain price range. The tool would allow the user to search for an ingredient or dish and see the relationship between cost and frequency over the years. It would also allow the user to see a breakdown of the most common dishes in high and low brow cuisine over the years.

    The information will be gleaned from online menus and Yelp/Locu data.

    Audience

    The audience for this project would consist of food critics/enthusiasts and food historians.

    Skills Needed

    Web scraping, UI design, data visualization

    Project Pitch

  • Final Project Idea: Photos of Time

    Problem Statement

    The flow of time can cause us to forget what used to exist but no longer does. One way we deal with this is by keeping photos and pictures, a large quantity of which have been amassed over the course of human history. The problem that this project would address is that we sometimes neglect how landscapes have changed over time. Residents of modern-day Manhattan may not realize how radically different their area is today from how it was in the days of the Native Americans. To tackle this problem, this project aims to give users an easy-to-use interface to explore how different regions have changed in appearance over time.

    Approach

    In this project, we would create a website where the user sees a map of the Earth and has a slider to set the date to a different point in time. When the user clicks on a point in the map, it will bring up a photo from a database that most closely matches the point in time and space that the user has selected.
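
    A sketch of the lookup behind a click: score every photo by its distance in space plus a weighted distance in time, and return the best match (the space/time weighting constant is an assumption to be tuned):

    ```js
    // Great-circle distance between two points, in kilometers.
    function haversineKm(lat1, lon1, lat2, lon2) {
      const toRad = d => d * Math.PI / 180;
      const dLat = toRad(lat2 - lat1), dLon = toRad(lon2 - lon1);
      const a = Math.sin(dLat / 2) ** 2 +
        Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
      return 6371 * 2 * Math.asin(Math.sqrt(a));
    }

    // Pick the photo closest to the clicked point and the slider's date.
    // kmPerYear trades off space vs. time: with the default, being one year
    // off counts the same as being 50 km away.
    function closestPhoto(photos, lat, lon, date, kmPerYear = 50) {
      let best = null, bestScore = Infinity;
      for (const p of photos) {
        const years = Math.abs(p.date - date) / (365.25 * 24 * 3600 * 1000);
        const score = haversineKm(p.lat, p.lon, lat, lon) + kmPerYear * years;
        if (score < bestScore) { bestScore = score; best = p; }
      }
      return best;
    }
    ```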

    Audience

    Our audience would be people interested in understanding how landscapes change over time. This could include historians, environmentalists, and students. Research questions that this project can help answer are: What effects did the colonial expansion of the 17th Century have on the Americas? How is deforestation affecting the appearance of woodlands? How has industrialization impacted the appearance of different regions?

    Skills Needed

    This approach would require gathering thousands of images and labeling them with spatial and temporal information. It would thus require back-end programming skills and database technology knowledge. It would also require creating a responsive and attractive front-end, and thus knowledge of HTML5 and jQuery.

  • Scam/Spam Recipes

    Before coming to MIT, I spent a year earning a degree in French Patisserie in Paris, during which I gained an immense appreciation for the art of recipe creation. Examining the scam/spam messages in the Google Sheet, I noticed that there are a handful of rhetorical components that are often mixed and matched to create many variants of scam messages–much like different ingredients and techniques are mixed and matched to create an endless variety of culinary dishes. Some example components (or ingredients) are: medical-related stories, finance-related stories, personal details, antivirus promises, flattering remarks, etc. These components are roughly analogous to the key ingredients of a dish: a protein, starch, vegetables, seasoning, sauce, etc. On the other hand, we have common techniques: evoking guilt, instilling fear, stimulating greed, confusing the reader, etc. These techniques are analogous to the procedures you learn about in culinary school: folding, beating, laminating dough, roasting, searing, etc.

    The analogy between scam messages and recipes made me think about creating some sort of scam recipe visualization. This would allow you to see the composition of these scam messages–drawing attention to the similarities and differences between different strategies. The exact details of the visualization still need to be fleshed out, but it would be reminiscent of cookbooks, online recipes, and cooking-channel-type visuals–all in the context of these scam messages.

  • Scam/Spam Idea

    While looking through the spam messages, it struck me that despite the fictional narratives taking place in different places and the fictional people being of different nationalities, there were only a few underlying structures: either it’s a person who has a lot of money and needs help moving it, or it’s someone offering free money that by some accident now belongs to you. These firm structures make it easy to generate content automatically, so I think it would be interesting to build a spam email generator - something that, when given a theme or a country, would generate a (more or less) unique heart-wrenching sob story. I’m really curious whether it’s possible to replicate the feeling of reading a spam email, or if there’s some necessary human touch that the scammers are able to inject into their fake stories.
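
    A toy sketch of the generator idea, using the two structures noted above as templates (every persona, amount, and phrasing here is made up):

    ```js
    // Sketch: fill one of two scam-story templates with random slot values.
    const templates = [
      p => `Dear friend, I am ${p.persona} from ${p.country}. I have ` +
           `${p.amount} that I must urgently move abroad, and I need ` +
           `your kind assistance.`,
      p => `Congratulations! A sum of ${p.amount} from ${p.country} has ` +
           `been credited in your name. Reply with your details to claim it.`
    ];

    function generateScam(country) {
      const pick = arr => arr[Math.floor(Math.random() * arr.length)];
      return pick(templates)({
        country,
        persona: pick(['a prince', 'a widowed banker', 'an oil executive']),
        amount: pick(['$4,500,000.00', '£2,300,000', '$12,750,000'])
      });
    }

    console.log(generateScam('Nigeria'));
    ```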

  • Scam message visualization

    I noticed a number of patterns in the spam messages, particularly in the subject lines. Most of the spam messages either fall into the category of being very formal and attempting to impersonate someone, or they are very friendly and try to act like they have met you before. A few examples of these two categories can be seen in these subject lines: “FROM ALINCO SECURITY COMPANY”, “DO REPLY: FEDEX PAYMENT LODGEMENT OF FUNDS”, “YOUR BROTHER”, and “Hello Dear Friend.” I think it would be interesting to try to visualize the messages grouped by their level of formality. One could count the number of occurrences of formal and informal words in the subject of each message and then plot each email based on how formal it is. Other metadata about the messages could also be incorporated to add additional axes to the graph, such as the geographic location of the sender or the length of the emails.
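
    A sketch of the formality count with placeholder word lists (a real version would want a validated lexicon):

    ```js
    // Sketch: score a subject line by formal minus informal cue words.
    const FORMAL = ['sir', 'madam', 'regards', 'lodgement', 'payment', 'official'];
    const INFORMAL = ['hello', 'dear', 'friend', 'hi', 'brother'];

    function formalityScore(subject) {
      const words = subject.toLowerCase().match(/[a-z]+/g) || [];
      const count = list => words.filter(w => list.includes(w)).length;
      // Positive = more formal cues; negative = more informal cues.
      return count(FORMAL) - count(INFORMAL);
    }

    console.log(formalityScore('DO REPLY: FEDEX PAYMENT LODGEMENT OF FUNDS')); // 2
    console.log(formalityScore('Hello Dear Friend'));                          // -3
    ```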

  • Microsoft Research on Nigerian Prince Scam

  • CFRP Project Updates

    Live demo:

    http://tinyurl.com/CFRPtimeline

    Project Team

    Keeley Erhardt, Julia Guo, Ali Finkelstein

    Research Question

    How can we visualize the trends of theater (specifically at Comédie-Française) over time? That is, did certain plays, writers, or actors become more popular over time? Do these trends correlate with any world events or national events in France at the time?

    How would you analyze and present the data?

    We chose to create an interactive timeline that displays each play that premiered at Comédie-Française as events on the timeline. The user can scroll through the timeline and view specific plays in more detail, including the genre of the play, the writer, and the actors.

    What kind of scholarly tools would you be building?

    We hope to build more scholarly analysis tools that can align the play premiere events with historical events from the same time. Furthermore, we hope to add a zoom tool so that scholars can focus on a specific time period, as well as a tool to examine trends in plays, actors, and authors.

    How would you draw a public audience into the project?

    We hope to draw a public audience into the project with this interactive timeline because it is much more visually appealing and digestible than a table of dates. We hope to engage the audience by first presenting them with a grand image of the theater and then giving them the freedom to explore and navigate through the timeline.

    What might be some of the technical challenges for your approach?

    We can improve technically by adding more information to each play (maybe by pulling a summary of each play, or ticket sale information from the API), as well as by figuring out a way to make the events on the timeline appear less crowded and busy. The timeline is crowded because several plays premiered around the same time, so they clump together.

  • Animals and the People of France

    After looking at the more than 500 images of French society, we decided we wanted to tell a story about how animals were treated and admired during the period covered.

    We began by downloading all the images that contained animals and thought about why each animal was featured in its image. We realized that, in general, horses and dogs are found in the background of images, while more exotic animals such as elephants and birds tend to be the main focus of the images they are in.

    Horses especially appear in many of the images. While they are usually being used for labor and carriage pulling, they sometimes appear prominently because they are being ridden by an army general or used in a jousting competition. Horses also sometimes appear in statues and sculptures associated with army generals.

    We hope that this collection of images comes together to tell a story of how the people of France used animals in their daily lives, but also admired them and used them to symbolize their ambitions and dreams.

    The collection is found at http://animalsoffrance.omeka.net/

  • Additional Email Scam Representation Ideas

    After reviewing the scam emails and using the Stanford Named Entity Recognizer, I noticed that a lot of the emails attempt to associate themselves with a location, whether a nation, such as Russia, or an organization, such as the United Nations.

    Therefore, I think a good representation would be to show a map of the world with dots representing countries, such as Nigeria, and organizations, such as the United Nations. When a user clicks on this dot, it can take them to a floor with animated characters that tell their story when clicked on.

    I also noticed that a repeating theme was that of a victim with cancer.

    This gave me the idea of exploring the relationship between storyline and promised reward. In this presentation, the user can read through scam messages where the promised sum of money has been blanked out. The user can then guess what amount of money was being promised, and they are scored according to how close they were able to get to the true sum. This would be an interesting way to test whether larger sums correlate with more dramatic stories.

  • Karrie - Scams

    I’m only able to use the online demo version of the Stanford Named Entity Recognizer, which only allows me to input a pretty small amount of text - not enough to look for patterns. Still, after reviewing the scams a bit, it occurs to me that many of the messages mention family members, and it might be possible to analyze which family members the female message writers mention versus the male message writers.

  • Curatorship as Social Practice

    I’ve been to a lot of museums: the MoMA, the National Gallery, the LACMA, the Met, and major museums in London, Paris, India, Israel, Toronto, and more. My mother is a strong believer that in order to have a good conversation you need to be educated in all regards. But my favorite “museum” I’ve ever visited is Art Basel, which happens every year a little over two hours from my house. Technically, Art Basel isn’t a museum. It’s a big convention that happens in only two locations in the entire world, Miami, Florida, and Basel, Switzerland, for about a week. However, every major and upcoming artist, dead or alive, is represented there. I’ve been twice, as my high school was fortunate enough to be able to send some students each year.

    I think that Art Basel does art presentation right, though. Technically the presentation of art isn’t done or curated by one group of people. There are booths of sorts that people come and set up to display art, and each individual booth has its own unique style and unique pieces. They have figured out how to curate their art for everyone who visits. They’ve adapted extremely well. The focus of this week’s paper was exactly this: how are museums adapting the art of showing art to a changing society?

    The one thing that Art Basel doesn’t do right, and that I’ve found many other museums don’t do right either, is give the right amount of detail. There is only so much that the little placards next to a painting or sculpture can say. What is missing is the true story: why is this painting important, how did it get from the artist to the museum, what makes it worthy of being hung next to other artists’ work? These are just my opinions, however; everyone is entitled to their own.

  • CFRP Prototype Presentation

    Below is the link to slides presenting our CFRP prototype:

    https://docs.google.com/presentation/d/1UEutsx3-RnI3O8xRCvIKav1AtXRs4lpTM3kdP1HzjTA/edit?usp=sharing

  • 6 Provocations of Big Data

    The two provocations I read were “Bigger Data are Not Always Better Data” and “Just Because it is Accessible Doesn’t Make It Ethical”. In “Bigger Data are Not Always Better Data”, the authors discuss taking into account and acknowledging the group that researchers’ data sets represent. Specifically, there was a large focus on researchers using Twitter as a part of research. The authors stated that “Twitter does not represent all people although…researchers refer to “people” and Twitter users as synonymous”. Following this, there is a long discussion of why Twitter is not representative (what people can actually see from the API is a very limited subset) and why it is important to acknowledge the incompleteness of the data set: in the example of Twitter, researchers who use it in their work need to keep in mind that the metadata they find is not necessarily complete or correct. In “Just Because it is Accessible Doesn’t Make It Ethical,” the authors take up the use of public content. They discuss the questions researchers face when considering using online content in their work, and some of the regulation of the use of humans as research subjects. There is also a paragraph on the accountability of researchers in these situations, but the authors never clearly define what researchers must be accountable for.

    As someone who has tried to access the Twitter API, I can only imagine what it is like to actually use it. When I tried, gaining permission was a huge hassle, so in the end I forwent the Twitter data and relied on websites built around providing Twitter metadata. When I was scrolling through and trying to get the information I was seeking, I was unable to get a comprehensive view, and I recognized that. I considered using a web scraper to try to get information off Twitter, but given my lack of exposure to building web scrapers, and the fact that the focus of my scraping was emojis, I forwent that as well. Still, to a normal person looking at Twitter data, the numbers are huge. Thinking about the scale of the data, one might assume the huge numbers are good enough to represent a general consensus. But I agree with the paper: just because you have all this information doesn’t make it comprehensive. It is representative of a small subsection of users, and that should be kept in mind when using the data in research. Truly in-depth and comprehensive numbers would require pulling data from another source.

    I really enjoyed reading the “Just Because it is Accessible Doesn’t Make It Ethical” section, as it discusses just that: just because you can see it doesn’t mean you can just use it. The ethical dilemma of whether to use public information is, I can only assume, highly debated. Something I was thinking about while reading, though, was the Terms and Conditions of using websites like Twitter. I feel like the paper didn’t take into account what may be contained in the terms and conditions. When a user signs up for something, there is a disclaimer. But who actually reads the disclaimers? For all I know, a researcher somewhere may be using the information and content I post online, legally and rightfully, because I signed away my ability to make that choice when I agreed to the Terms and Conditions.

  • CFRP Visualization Project

    Live demo: http://zygi.gitlab.io/cfrp-viz/

    What was in vogue at the Comédie-Française?

    Project Team

    Zygimantas Straznickas, Nicole Seo, Karrie Peterson

    Research Question

    How can we enable non-experts to explore general interest trends within this data – e.g., popularity of plays, of authors, of actors?

    Today’s audiences often think in terms of awards, box office profits, and top-billed celebrity performers. To provide an entrée to this historical data, we are attempting to enable exploration in those kinds of categories.

    Functionality of the web-based interface

    We aimed to give users the ability to discover year-by-year information using an Accolade Generator:

    • Best play – Highest grossing play
    • Nominees – Top ten highest grossing plays
    • Best in class – highest grossing in each genre (comedy, tragedy)
    • Most popular playwright – authors of the top grossing play
    • Nominees for playwrights - authors of the runners up
    • Top box office attraction - actors/actresses whose appearances total up to top grossing numbers
    • Hardest working actor in show biz - actor participating in the most plays in a given year
    • Turkey – least grossing play
    • Favorite of the cheapskates – the play that sold the most of the cheapest seats that year

    More complex data extractions could permit:

    • Mood of the season - showing the proportion of comedies vs. tragedies for each season of a given year
    • Longitudinal summaries of individual plays, actors, playwrights over a period of time, showing their frequencies.

    Longitudinal playwright popularity: Allow users to view the relative popularity of playwrights in a designated time period. Users can choose any time period and get a visualization that shows each playwright whose plays were performed in that period as a circle, with the size of each circle increasing based on total gross. If the time periods are very small, it might not be so important to correct for the change in the value of currency over time, but if the time periods were large, that correction would be needed.

    Design considerations for the user interface

    Accolade Generator - The uncomplicated accolade generator allows users to explore the best and the worst by year. Obstacles for doing this easily involve linking to additional data for identifying actors by gender (female/male), and singling out the leading roles for each play, so that minor characters are not tied with actors playing major roles. Additionally, there needs to be some documentation explaining to users that the top grossing plays will be affected by restrictions, in many of the years, on who could attend certain performances.

    Trendline Generator - Data visualizations for longitudinal trends could be bar charts showing absolute frequencies (the number of times a play was performed, for example), or, to show relative popularity, unordered bubble charts or stream graphs.

    Use of the API to capture data

    For best play and nominees: The “top grossing play” needs to be defined and documented. Since there were two plays on each night, attached to one gross sales amount, and we cannot determine whether one was the “opening act” for a more renowned play, we chose to divide the take in half and give each play for the night 50% of it. Since plays were often repeated in the same year, our assumption was that a more popular play would be repeated more often, and thereby get a higher gross. The Turkey (least grossing) would follow the same process, but taking the lowest number.

    Best in class: Similar process, first identifying the highest grossing, and then retrieving its genre. There would be one winner for tragedies, and one winner for comedies, in any given year.

    Top box office attraction and hardest working actor in show biz: it would be difficult to implement this using just the API provided - even though there are API endpoints for actors and their performances, in order to correlate them with performance profits it would require getting information about all the different combinations of performance and actor data. This can only be achieved by first using the API to download all the actor data in the project and all the performance data for each year (assuming that’s not against the terms of use) and then processing the whole dataset.

    Structure of data and technologies for query and display

    For display, we would use a simple jQuery-based script to display the top 10 lists and control the website. Graphs would be drawn using the D3 library. The data retrieval question is more complicated. In our demo, we only implemented the worst/best plays functionality, the data for which could be retrieved using a single API call to the CFRP server. For this, we just used an AJAX call from the user’s browser - the website does not need any functionality from a separate backend server, as everything is done client-side, by the browser. When implementing other views, specifically the “hardest working actor” category, a different approach would be needed. As described above, this data cannot be acquired by a single API call. The project would need a backend server that would collect all the data needed by making many API calls to CFRP, process it, derive the statistical data, and then serve it to the website visitors.
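
    For illustration, the client-side retrieval could look like the sketch below; the endpoint path, parameters, and response fields are placeholders, not the actual CFRP API routes:

    ```js
    // Sketch: one AJAX call, then rank plays by gross entirely in the browser.
    $.getJSON('https://example.org/cfrp-api/sales', { year: 1750 })
      .then(nights => {
        // Split each night's gross 50/50 between the two plays on the bill.
        const grossByPlay = {};
        nights.forEach(night => {
          [night.play1, night.play2].forEach(title => {
            grossByPlay[title] = (grossByPlay[title] || 0) + night.gross / 2;
          });
        });
        const ranked = Object.entries(grossByPlay).sort((a, b) => b[1] - a[1]);
        console.log('Best play:', ranked[0]);
        console.log('Turkey:', ranked[ranked.length - 1]);
      });
    ```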

  • Revised CFRP Project

    How would you analyze and present the data? Do you have different ideas now that you have full access to the data?

    After learning more about the CFRP records, I believe the method of analyzing data that Nicole and I came up with is still a good idea. According to the researcher, his team determined that the irregular handwriting of the records makes it difficult to parse the data through machine vision. Thus, I believe that organizing the records into piles of relatively similar formats as well as having humans enter the numbers is the most feasible approach.

    I would still present the data as an IMDB-styled website with a page dedicated to each play and theatre. However, I would expand the analytical abilities of the website to make it more useful for scholars to quantitatively analyze the data.

    What kind of scholarly tools would you be building?

    To make the presentation of the data useful for scholars, I would include several data analysis tools. One would be an application that can generate bar graphs of data by allowing the user to choose an independent variable, such as ‘Time’, and a dependent variable, such as ‘Revenue’. I would also include an expandable tree graph where researchers can browse through all the plays for which we have data. The first layer of this tree diagram could allow the user to select a year, the second layer could display available theatres, and the third layer can display a list of matching performances. Lastly, I would include a tool that can calculate statistics specified by the user’s search criteria. For example, a researcher could use this tool to find the total revenue generated in a set of theatres during the years 1700-1730.
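
    A sketch of the bar-graph generator using D3 (v4+), with example field names; the rows would come from the user’s search criteria:

    ```js
    // Sketch: draw one bar per row for two user-chosen fields.
    function drawBarGraph(rows, xField, yField) {
      const width = 600, height = 300;
      const svg = d3.select('#chart').append('svg')
        .attr('width', width).attr('height', height);

      const x = d3.scaleBand()
        .domain(rows.map(r => r[xField]))
        .range([0, width]).padding(0.1);
      const y = d3.scaleLinear()
        .domain([0, d3.max(rows, r => r[yField])])
        .range([height, 0]);

      svg.selectAll('rect').data(rows).enter().append('rect')
        .attr('x', r => x(r[xField]))
        .attr('y', r => y(r[yField]))
        .attr('width', x.bandwidth())
        .attr('height', r => height - y(r[yField]));
    }

    // e.g. drawBarGraph(records, 'year', 'revenue');
    ```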

    How would you draw a public audience into the project?

    I believe the accessibility and design of the project would help draw a public audience. Because the website will be inspired by the structure of modern movie websites, audiences will find it intuitive to navigate. Furthermore, I would design the website with a pleasing aesthetic appearance typical of commercial websites that aim to attract the casual audience, rather than the plain design that a more academic website might have. To increase my audience I would use a reasonable portion of my budget to take advantage of advertising programs such as Google Ads and Facebook Ads.

    What might be some of the technical challenges for your approach?

    The technical challenges to my approach would be data entry and algorithmic correctness. Because the records number in the thousands, it would take months to enter the data into our system, much like it did for the real-life researchers. Furthermore, because of the advanced capabilities of the research tools I am proposing, it would be important to make sure the data analysis algorithms involved are correct. To avoid creating buggy code that could produce false data, we would make use of existing code libraries and implement unit and integration software testing.

  • thoughts about scam messages

    I looked through my spam folder and quickly appreciated how much email filters have improved in the past decade. Emails in my spam folder were nearly universally actual spam rather than misfiled legitimate mail, though I did find some legitimate correspondence lost there.

    Recurring themes in my spam included the ubiquitous Nigerian Prince narrative (a rich “damsel in distress” persona who has lost access to their funds and needs some capital to get them back), low-effort Viagra-esque and hooker commercials, and other emails designed to convince me to give away my personal information. A particularly interesting deviation from my expectations occurred when I discovered emails designed to sneak in with my daily-life emails: an invitation to a bogus conference with a registration form, an internship interview request from a software-looking company, a notice masquerading as a real MIT account asking for confirmation of financial aid/tuition/email quota. Spam and spam filters have gone through a kind of arms race, both becoming increasingly tailored to user/target habits in attempts to outwit their opponents.

    The articles make mention of the social/political context surrounding the recent explosion in sheer quantity of sent scam messages. Scam emails started in Nigeria (turns out “Nigerian Prince” is contextually accurate) as a result of economic and political unrest, when fairly-educated-and-now-unemployed people turned their efforts and internet access to online exploitation. But this is the same context given for the rise of gangs, terrorist groups, what have you – the world, both corporeal and online, is full of unsavory people. I still have little empathy for people who choose to spend their time taking advantage of others.

    Also: there’s this guy, James Veitch, who gave a TED talk recently (in December) about scam emails, and I thought it was cool: Video

  • cfrp ideas Ava Chen

  • CFRP Link To Slides - Nicole and Alan

    Below is the link to our CFRP Slides presentation.

    https://docs.google.com/presentation/d/1utGO-aydCf4ssE6UGsl6ddaQI3YlGA7vASOZR5bqrGk/edit?usp=sharing

  • CFRP Ideas - Zygi and Karrie

    Link to the presentation here

    Data digitization

    While the facsimiles in CFRP all list the prices and numbers of tickets sold, there isn’t a single uniform representation of this data. Because the records span more than a hundred years, there are a lot of differences in how the documents are structured and what information is included. In addition, most of the text is handwritten, which makes it impossible to fully automate the transcription process. However, optical character recognition (OCR) can still be used to speed it up by extracting all printed text from the documents. While most of the text would still have to be transcribed manually by a human, they could reference the auto-transcribed printed text and just copy and paste the words instead of having to type them.

    The structure of CFRP data can be naturally represented by a relational database schema. Each block of entries in the CFRP records represents a PlayPair. Because a single ticket was sometimes sold for two plays performed on the same evening, a PlayPair would contain the title of the first play, the title of the second play, a date, and a list of Tickets. Each Ticket object represents a group of same-type tickets sold for the same PlayPair and would contain the number of tickets in the group, their cost, and the name of their seat type. Because of inconsistent ticket type descriptions used throughout the records, it is not enough to just enter them as text, since it would then be very difficult to compare them - is a Parterre seat better than a Place de Parquet? To make comparisons easier, it would be useful to divide all the different ticket types into a few categories and then, for each Ticket group, store both the text transcription of its type and the discrete category it belongs to. This could be done by knowledgeable transcribers, although a better solution would probably be to categorize the tickets in a post-processing step.

    Data access

    The main advantage of storing the data in a relational database format is that this allows the researchers to easily query the database and get answers to many complicated questions, e.g. how many tickets have been sold to a particular play or how the total sum of ticket prices depends on the month. Because the database would probably be quite small, there would be many ways to distribute it, from simply sharing the file and letting computer-savvy researchers process it themselves, to connecting the database to a web-based graphical user interface that would make issuing interesting data queries accessible to the public. In addition to simply providing access to data, a simple web view could be designed to show each PlayPair and its ticket records for people who just want to browse.

    Data use

    By itself, the CFRP data would probably only be useful to other researchers, as a data source to use in other projects. While the records could be used to explore some economics questions - how did the price of tickets or the difference between quantities and prices of different seat types change - it lacks a lot of context for many interesting insights. The primary use of this project would be combining it with other datasets. Specifically, having more data about the plays, their playwrights, lengths and genres would let us answer questions about which genres were popular or if there were any extraordinarily popular playwrights and how much their popularity helped sell their plays. Similarly, combining the CFRP records with data about the actors in each performance could be used in research about the “superstar” status throughout history by supporting written sources with statistical data. All in all, this project would digitize and make accessible a lot of interesting data that, when combined with other sources, could then be used for exciting research.

  • CFRP Description - Nicole and Alan

    Our approach to go from facsimile to data

    We plan to use human data entry to extract the data from the theatre records and store it electronically. Because the information is presented in a mostly handwritten, highly irregular form, we ruled out the possibility of using machine vision to extract the data. However, the sheer quantity of the register pages made us wary of taking a purely human-driven approach (i.e. hand-coding the data). Our strategy to make the process easier for human encoders is to group together records with similar formats and to create an electronic form with an intuitive user interface to facilitate the data entry process (see Slide #1).

    The process by which we will store and present the data

    We will store the data using a SQL server such as MySQL, which scales well with large amounts of data. We will store play information using database fields such as ‘Play ID’, ‘Play Name’, ‘Revenue’, ‘Expenses’, and ‘Total Attendance’. Other data, such as seats sold by section area, will contain fields such as ‘Play ID’, ‘Seat Type’, ‘Number of Seats Sold’, and ‘Revenue’. In presenting our data, we wanted to process it in a meaningful way rather than just presenting it in a table or graph. Thus, we will create a website reminiscent of modern movie sites such as IMDB or Rotten Tomatoes that will allow the user to explore the plays as if they were modern-day productions.

    The tools we will be building

    Our website will consist of pages dedicated to each play found in the records. Each page will display information regarding the play’s ticket sales, seat distribution, revenue, and profit. By cross-referencing other sources of data, we may even present critic reviews or a story synopsis. To facilitate navigation of the site, we will include a search bar where the user can find plays by searching for their title, venue, and more. Our website homepage will show lists of ‘Top Plays’ ranked by attendance, profit, and other criteria. We also plan on including a graphical tool with our website that will allow the user to generate custom bar graphs by selecting an independent variable, such as ‘Time’, and a dependent variable, such as ‘Total Attendance’. The generated graph would then allow the user to obtain a visual impression of the information that our data has to offer. The front-end portions of our project would be created with the Bootstrap framework complemented by HTML. Our back-end would use a server framework such as Node.js.

    The audience of our project

    Our target audience will consist of scholars interested in studying the culture of French theatre during the 17th and 18th Centuries. This may include French historians, cultural scholars, and drama scholars. However, the graphically appealing nature of our website will also make it possible for non-scholars to peruse it.

    Potential Research Questions our Project Addresses

    Our data can be used to examine how French theatre during this era interacted with the contextual environment of the time. For example, one could study how historical events of 17th and 18th Century France, which included various wars with Great Britain and expansive colonization efforts, affected the themes and popularity of theatrical plays. Another area of study could be how theatres in France decided to allocate their expenses and how these priorities changed with time. One could also study how actor pay changed over time, or how changes in actor pay correlated to changes in ticket revenue.