This infographic was created through a painstaking process that used almost 10 different applications. The main application used to create the word cluster graphic was Gephi, an open source platform for visualizing complex network data in an interactive environment. However, reaching this particular end result was complicated by several factors, one of which was the use of Japanese characters in the analysis.
The first step in this Japan Twitter project was to collect and archive the Twitter data coming out of Japan after the earthquake. For this, David Shepard, a member of the UCLA Digital Humanities Collaborative, wrote a PHP script that ran as a cron job. The script used the Twitter search API to find and filter tweets based on relevant hashtags, and dumped them into our own MySQL database. The cron job ran every 3 minutes for 30 days, collecting over 650,000 tweets during this time period.
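The archiving step can be sketched as follows. This is only an illustration under stated assumptions, not David Shepard's PHP script: it uses Python with the standard-library `sqlite3` module as a stand-in for MySQL, the table layout and the names `open_archive` and `store_tweets` are hypothetical, and the call to the Twitter search API is left out entirely. It also assumes that consecutive searches can return the same tweet, so rows are keyed on the tweet id and duplicates are ignored.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a small tweet archive. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, created_at TEXT NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def store_tweets(conn, tweets):
    """Insert (id, created_at, text) rows, skipping ids already archived.

    Returns the number of rows that were actually new.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, created_at, text) VALUES (?, ?, ?)",
        tweets,
    )
    conn.commit()
    return conn.total_changes - before


conn = open_archive()
# Two overlapping batches, as two consecutive scheduled runs might return them.
first_new = store_tweets(conn, [(1, "2011-03-11T14:50:00", "a"),
                                (2, "2011-03-11T14:51:00", "b")])
second_new = store_tweets(conn, [(2, "2011-03-11T14:51:00", "b"),
                                 (3, "2011-03-11T14:53:00", "c")])
total = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Keying on the tweet id means a scheduled job can safely re-run the same search without inflating the counts.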
Once the Twitter data was safely in our MySQL database, I queried it to generate 30 separate text files, one for each day following the earthquake. Each “day” file consisted of just the tweet text from the thousands of tweets that belonged to that day (on average there were about 20,000 tweets per day).
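The per-day split can be sketched in Python. This is a minimal illustration, not the query I actually ran: the function name `group_tweets_by_day` and the `(created_at, text)` row shape are assumptions standing in for whatever the database returns, and all timestamps are assumed to be in a single time zone.

```python
from collections import defaultdict
from datetime import datetime


def group_tweets_by_day(rows):
    """Group (created_at, text) rows into {"YYYY-MM-DD": [text, ...]}.

    `rows` is a hypothetical stand-in for a database result set;
    created_at is assumed to be a datetime object.
    """
    by_day = defaultdict(list)
    for created_at, text in rows:
        by_day[created_at.date().isoformat()].append(text)
    return dict(by_day)


# Made-up sample rows, for illustration only.
rows = [
    (datetime(2011, 3, 11, 15, 2), "first tweet"),
    (datetime(2011, 3, 11, 23, 59), "second tweet"),
    (datetime(2011, 3, 12, 0, 1), "third tweet"),
]
day_texts = group_tweets_by_day(rows)
# Each list can then be joined with newlines and written to its own "day" file.
```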
Here, you can see the number of tweets collected on an hourly basis:
In order to capture the range of emotions through the different phases of recovery following the disaster, I followed a methodology employed by Eiji Aramaki from Tokyo University, who took the words from an Emotion Dictionary to extract emotion patterns in a set of text files. Dr. Aramaki provided me with about 2000 of the most commonly used “emotion” words in the Japanese language, sub-divided into 10 different categories. A separate CSV file for each emotion was generated.
I then used WordSmith, an application that allows you to extract word patterns, to find occurrences of every emotion word in each “day” file. Through WordSmith’s concordance tool, I was able to run a batch process that matched each of my 10 “emotion” files against each of my 30 “day” files.
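To make the matching step concrete, here is a rough Python sketch of counting every emotion word in every day's text. It is not how WordSmith works internally, which I have not inspected: it uses plain substring counting (Japanese text has no spaces to split words on), and the name `build_count_matrix` and the input shapes are assumptions.

```python
def build_count_matrix(day_texts, emotion_words):
    """Count each emotion word in each day's text.

    day_texts:     {day: full text of that day's tweets}  (hypothetical shape)
    emotion_words: iterable of words from the emotion dictionary
    Returns {word: {day: count}}.
    """
    return {
        word: {day: text.count(word) for day, text in day_texts.items()}
        for word in emotion_words
    }


# Made-up sample text, for illustration only.
day_texts = {
    "2011-03-11": "悲しみと不安。悲しみ",
    "2011-03-12": "不安",
}
matrix = build_count_matrix(day_texts, ["悲しみ", "不安"])
```

Substring counting will also match a word that appears inside a longer word, so a real analysis may need a Japanese tokenizer.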
Here is a screenshot of WordSmith’s concordance function:
The data generated from WordSmith was exported as a series of spreadsheets. These spreadsheets were combined, merged, analyzed, and recalculated to produce a single matrix of emotion words by day. While I was able to do most of the work in Excel, character encoding problems forced me to use Google Spreadsheets, mostly to generate the CSV file format that Gephi requires as an input source file (Excel lost the Japanese text on CSV export, while Google did not).
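For readers who would rather script the export, the sketch below shows one way to produce CSV that keeps Japanese text intact, by encoding explicitly as UTF-8. It is an alternative to the spreadsheet route described above, not what I did; the helper name `rows_to_csv_bytes` and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Serialize rows to CSV and encode the result as UTF-8 bytes."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


data = rows_to_csv_bytes([
    ["word", "2011-03-11"],
    ["悲しみ", "0.5"],
])
# Writing `data` to a file opened in binary mode gives a UTF-8 CSV file.
```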
In order to create an emotion “measure” for each day, the spreadsheet generated columns that recorded how often each keyword was found on each of the 30 days, normalized per 10,000 tweets. For example, the word 悲しみ (sadness) was found 0.5 times for every 10,000 tweets on March 11th, 3.1 times on March 12th, 325 times on March 13th, and so on.
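The normalization itself is simple arithmetic: the raw count for a word on a given day is divided by the number of tweets collected that day and multiplied by 10,000. A small Python sketch, where the function name `rate_per_10k` and the counts in the usage lines are illustrative and not the project's data:

```python
def rate_per_10k(word_count, tweet_count):
    """Occurrences of a word per 10,000 tweets; 0.0 for a day with no tweets."""
    if tweet_count == 0:
        return 0.0
    return word_count * 10_000 / tweet_count


# Illustrative numbers: one occurrence among 20,000 tweets.
example_rate = rate_per_10k(1, 20_000)
```

Normalizing this way keeps days with very different tweet volumes comparable.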
The heart of the word cluster analysis was conducted in Gephi. Gephi requires you to define your data in two basic elements: Nodes and Edges. For this analysis, I chose to define these as follows:
Nodes: Every emotion word and every day was defined as a Gephi node.
Edges: Every connection between a “word” and a “day” was defined as an edge, and weighted by how many times that word was found for every 10,000 tweets, for each day.
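The node and edge definitions above can be turned into the two tables Gephi imports. The sketch below is an illustration under assumptions: the function name `build_gephi_tables`, the extra `Kind` column, and the choice to skip zero-weight edges are mine, while the `Id`/`Label` and `Source`/`Target`/`Weight` headers are the column names Gephi's spreadsheet import recognizes.

```python
def build_gephi_tables(rates):
    """Build node and edge tables from {word: {day: rate_per_10k}}.

    Returns (nodes, edges) as lists of row tuples, header row first.
    """
    words = sorted(rates)
    days = sorted({day for per_day in rates.values() for day in per_day})

    # Every emotion word and every day becomes a node.
    nodes = [("Id", "Label", "Kind")]  # "Kind" is a hypothetical extra attribute
    nodes += [(w, w, "word") for w in words]
    nodes += [(d, d, "day") for d in days]

    # Every word-day pair becomes an edge weighted by the per-10,000 rate.
    edges = [("Source", "Target", "Weight")]
    for w in words:
        for d, rate in sorted(rates[w].items()):
            if rate > 0:  # assumption: a word never seen that day gets no edge
                edges.append((w, d, rate))
    return nodes, edges


# Made-up rates, for illustration only.
nodes, edges = build_gephi_tables(
    {"悲しみ": {"2011-03-11": 0.5, "2011-03-12": 0.0}}
)
```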
Here is a screenshot of Gephi’s data view:
Once the data elements were defined, Gephi was ready to visualize (i.e., the fun part!). Gephi comes with many layout templates that you can choose from. Each layout has its own built-in algorithm that takes the nodes and edges from your data to generate a network diagram. I chose to use a layout called “Parallel Force Atlas” (it sure sounds good). You can choose to size and/or color each node by different data attributes, and do the same for the edges, which serve as the connectors between the nodes. You then press a button, configure a few parameters (such as “gravity”), and voila! You are presented with a beautiful infographic.