CMES AM Graduate Explores the Power of Twitter as Big Data
Mostak graduated from UNC-Chapel Hill in December 2006 with Bachelor’s degrees in anthropology and economics and a minor in math. After graduation he moved to Syria to teach English for Berlitz, where he began taking Arabic classes, fell in love with the language, and developed a deep interest in the region. Before starting the CMES AM program in September 2010, he also worked for a year at the Institute for Palestine Studies in Washington, DC and spent another year studying Arabic at the American University in Cairo on a CASA fellowship. While in Egypt he also worked as a translator for the Egyptian newspaper Al-Masry Al-Youm, occasionally writing for the paper as well.
Mostak chose the CMES Master’s program on the strength of its Middle East studies faculty and for its flexibility, which served him well as his interests evolved during the two-year course of study. For his Master’s thesis, he decided to examine Twitter in the context of the Egyptian revolution, having become interested in social media while working for Al-Masry Al-Youm. As he began collecting and analyzing Egyptian Twitter data, Mostak was struck by its potential as a data source. “I became more and more fascinated by Twitter not as the phenomenon you’re trying to study but as the tool to give you a massive amount of data,” he explains. “You can look at it both ways: you can look at how did social media affect the revolution or propel it forward—and that’s a very, very interesting question—but revolution or no revolution, [Twitter is] a great way to figure out what people are thinking.”
The value of Twitter as a data source is its huge volume—four to five hundred million new tweets are generated each day—and the precise time and location data associated with those tweets. Though less than two percent of tweets—those sent from phones by users who have opted to have their location shared—are geo-coded, even that small percentage represents seven to eight million tweets per day for which a precise latitude and longitude are known. That’s seven or eight million data points reflecting not only what people are thinking, but also exactly when and where.
An image from Mostak's AM thesis mapping Islamist sentiment by neighborhood in Cairo.

For his Master’s thesis, Mostak first aggregated tweets in Egypt by district, then attempted to determine whether the degree of Islamism expressed in those tweets correlated with the degree of poverty in the district. He found a correlation with rurality, but no significant correlation with poverty, instead finding that Islamism as measured by tweets seemed to cut across wealth. Along the way, the project taught him a lot about the kinds of issues this type of analysis encounters, as well as its potential, and inspired him to continue working on ways to overcome those obstacles.
A significant challenge Mostak faced for his thesis was how to measure the degree of Islamism expressed in a tweet. One method he tested was to score tweets against user posts from Ikhwan.net, an online forum associated with the Muslim Brotherhood, using an algorithm that compares similarity of discourse. He discovered that the algorithm, a relatively simple measure of word and phrase usage, did not work well on a diglossic language like Arabic.[1] Because the formal register of Arabic is associated with religion, it is commonly used by the Islamist users of Ikhwan.net. The algorithm Mostak tested correctly identified some tweets as Islamist by detecting their similarity with Ikhwan.net posts, but it also produced false positives: tweets that simply used formal Arabic, regardless of their content. Designing an algorithm specifically for Arabic that accounts for register is an area Mostak hopes to work on in the future. For his Master’s thesis, he chose instead to use the number of Muslim Brotherhood politicians each user followed on Twitter as a proxy for that user’s own degree of Islamism.
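The article does not name the specific similarity algorithm Mostak tested. A minimal sketch of the simplest family of such measures, cosine similarity over word-count vectors, illustrates the failure mode described above: any shared vocabulary, including the function words of a formal register, raises the score regardless of content.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts.

    Returns 1.0 for identical word distributions and 0.0 for texts
    that share no words at all.
    """
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Because such a measure treats every shared word equally, two texts written in the same register score as similar even when they discuss unrelated topics, which is exactly the false-positive problem the diglossia of Arabic creates.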
After graduating from CMES in May 2012, Mostak began a fellowship at the Ash Center for Democratic Governance and Innovation at the John F. Kennedy School of Government, working with his thesis adviser Tarek Masoud, an associate professor of public policy at the Kennedy School. Together Masoud and Mostak are working on several papers that analyze discourse mined from social media and other online sources to shed light on recent Egyptian political changes. One project examines posts from Ikhwan.net, while another looks at current and past platforms from various Egyptian political parties, comparing them against each other and over time. They are also using Twitter as a data source, examining how each candidate’s vote share in the recent Egyptian presidential election compares with Twitter sentiment towards that candidate and towards other issues.
The other major challenge that Mostak encountered while working on his Master's thesis had to do with the sheer volume of data he was trying to analyze and map. He remembers one analysis that was going to take forty days to run using the off-the-shelf tools he started out with. Hoping to speed things up, he began exploring other solutions. In his last semester at CMES, Mostak cross-registered for a database systems class at MIT, and for his final project built a prototype database that ran on graphics cards rather than on conventional processors. In the past nine months he has developed that project into a fully functional and staggeringly fast SQL database that has spawned collaborations with Harvard’s Center for Geographical Analysis, Reischauer Institute of Japanese Studies, and South Asia Institute, and MIT’s Computer Science and Artificial Intelligence Laboratory.
Graphics cards, also known as graphics processing units (GPUs), are designed to render computer graphics for video games and other consumer applications. GPUs have also found significant use in scientific applications that require massive amounts of computational horsepower, such as protein folding and physics research. These tasks require huge numbers of repetitive calculations, which GPUs can perform very quickly. This ability also makes GPUs good at running large databases, an application which has become possible in the last decade as programmable graphics cards have become available. The Titan supercomputer at Oak Ridge National Laboratory in Tennessee, currently the fastest computer in the world, uses 18,688 GPUs to achieve its record-breaking speed. Even on a smaller scale the speed-ups are dramatic: Mostak’s system, with only four GPUs, can query and visualize hundreds of millions of tweets in milliseconds, a task that would take a conventional processor at least one hundred times longer.
Manhattan's Central Park stands out in red on a TweetMap-generated heat map of tweets including the term "central park."

Mostak’s GPU database, which he calls MapD for “Massively Parallel Database,” forms the back end of TweetMap, a platform he developed in collaboration with the Center for Geographical Analysis (CGA) to query and render geo-coded Twitter data on an interactive map. CGA handles the front end of the application, which uses a customized version of their WorldMap platform. Individual tweets are displayed as points on the map, which, when clicked, display the tweet’s text and other details. Users can filter by time to show only tweets posted in that time period, and by term to show only tweets that contain a specific word or phrase. Filtering by term also brings up a heat map overlay, in which “hotter” red areas denote regions where the percentage of total tweets matching the specified term is above a certain adjustable threshold. Searching for “Pats,” for example, turns New England bright red, while “ain’t” is concentrated strongly in the southeastern United States. The feature is good at picking out places—“Central Park” forms a neat rectangle in Manhattan—as well as language—“vous” lights up France, North Africa, and Quebec. Mostak notes that many visitors spend hours on the site searching for trends and patterns in the data. He can relate to the fascination. “I love this stuff because you have to think of it like a detective,” he says. He recalls, for instance, being excited while working on his thesis at the noticeable drop in Twitter volume at prayer times. “You don’t have what you’re trying to study right in front of you, but you can infer it—you can see traces of it.”
In its current alpha stage of development TweetMap houses 125 million tweets collected between December 10 and December 31, 2012. Without the processing speed of the MapD GPU database, filtering and mapping that much data in real time (as soon as the user enters the filter parameters) would be prohibitively resource intensive. Heat maps, in which each pixel is a weighted average of thousands of its nearest neighbors, are particularly computationally intensive to generate, requiring many billions of calculations to render a single frame on a high-resolution monitor. The system currently powering TweetMap—a computer housed at the CGA—uses four graphics cards with approximately 1,500 processors each, allowing it to perform six thousand calculations in parallel. (A conventional processor, by comparison, can typically perform one, two, or four calculations at a time depending on how many cores it has.)
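The article does not specify the exact weighting function TweetMap uses; a minimal sketch of one common formulation, a Gaussian kernel evaluated at each pixel over all nearby data points, shows why heat maps are so costly: the inner sum below must be repeated for every pixel on the screen.

```python
from math import exp

def heat_value(px, py, points, bandwidth=1.0):
    """Heat-map intensity at pixel (px, py): a Gaussian-weighted sum
    over the data points. Rendering a full frame repeats this sum for
    every pixel, which is how the calculation count reaches billions
    for millions of points on a high-resolution display."""
    return sum(
        exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
        for x, y in points)
```

Each pixel's sum is independent of every other pixel's, which is precisely the kind of repetitive, parallelizable work that maps well onto thousands of GPU cores.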
Ben Lewis, the project lead on CGA’s WorldMap and Mostak’s partner on the TweetMap project, notes the value for researchers in being able to get instant results when analyzing big data sets. Lewis hopes that their collaboration may eventually enable the MapD technology powering TweetMap to add extra processing power to WorldMap too. The GPU database system’s speed would make it easier for researchers looking at big data in WorldMap to test multiple hypotheses looking for patterns, “to flexibly slice and dice the big data, in order to figure out what part of it they want to study in more depth,” Lewis explains.
One downside to using GPUs to run a database is their relatively limited memory. A typical server might have 32 to 128GB of CPU RAM. Each of Mostak’s graphics cards has only 4GB (for a total of 16GB across all four cards), limiting the amount of data that can be processed at once. To work around this limitation, Mostak developed a storage schema that is startlingly elegant and creative. Rather than storing the full text of each tweet to be processed by the GPUs, he converts each word in a tweet to a prime number, using the smallest primes for the most common words (“I,” “and,” “we,” etc.). The full content of each tweet is then represented as the product of its constituent prime numbers. To test whether a tweet contains a certain word, the system divides the tweet's total product by the target word’s prime number to see if it divides evenly. This method, which Mostak and his colleagues believe is a novel approach, condenses a corpus of 125 million tweets to one quarter of its full-text size.
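The scheme described above can be sketched in a few lines of Python. The vocabulary and frequency ordering here are invented for illustration, and MapD's actual GPU-side implementation is certainly more elaborate, but the arithmetic is the same: because every integer factors uniquely into primes, a word's prime divides a tweet's product exactly when the word appears in the tweet.

```python
def first_primes(n):
    """Generate the first n primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Hypothetical vocabulary ordered by frequency: the most common words
# get the smallest primes, keeping the products as small as possible.
vocab = ["i", "and", "we", "love", "central", "park"]
word_to_prime = dict(zip(vocab, first_primes(len(vocab))))

def encode(tweet):
    """Represent a tweet as the product of its words' primes."""
    product = 1
    for word in tweet.lower().split():
        product *= word_to_prime.get(word, 1)  # out-of-vocabulary words skipped
    return product

def contains(encoded, word):
    """A tweet contains a word iff the word's prime divides the product."""
    return encoded % word_to_prime[word] == 0
```

For example, with the primes 2, 3, 5, 7, 11, 13 assigned in order, "I love Central Park" encodes to 2 × 7 × 11 × 13 = 2002, and testing for "love" is a single modulo operation, the kind of repetitive arithmetic GPUs excel at.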
Further increasing the storage capacity of TweetMap is one of the features Mostak is actively working on. He’s had up to two hundred million tweets displayed at once, and his goal is one billion. Other features that he and Lewis hope to add include choropleth mapping (aggregation by region, such as country, state, or census district), the ability to regress by known regional attribute information such as census data, and the ability to generate time-based animations on the fly. Mostak also plans to add a sentiment filter, which would indicate not only when people were tweeting a certain word but whether they were doing so with a negative or positive emphasis. (This type of sentiment analysis, which Mostak has already implemented successfully in his work with Masoud, uses a naive Bayes classifier trained on emoticons—happy and sad faces—to determine the positive or negative tone of a tweet. “It sounds ridiculous,” Mostak admits, but he has tested the measure and found it to predict sentiment, as scored by other measures, with an accuracy much greater than chance.)
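The emoticon-trained classifier described above can be illustrated with a toy version. The training tweets below are invented, and a real system would train on millions of tweets labeled by the emoticons they contain, but the mechanics of a naive Bayes classifier are the same: score each class by summing smoothed log-probabilities of the words, then pick the higher-scoring class.

```python
from collections import Counter
from math import log

# Toy training set: each tweet is labeled by the (removed) emoticon it
# contained, standing in for the distant supervision described above.
train = [
    ("great game tonight", "pos"), ("love this city", "pos"),
    ("traffic is awful", "neg"), ("so tired of waiting", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

def classify(text):
    """Naive Bayes with add-one smoothing; class priors are equal here
    because the toy training set is balanced."""
    vocab = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        scores[label] = sum(
            log((counts[w] + 1) / (total + len(vocab)))
            for w in text.split())
    return max(scores, key=scores.get)
```

Trained at scale, word distributions learned from smiley-tagged versus frowny-tagged tweets generalize to tweets with no emoticon at all, which is why the approach predicts sentiment well above chance despite sounding, as Mostak says, ridiculous.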
Though his attention has been focused on a wide variety of projects recently, Mostak remains deeply interested in the Middle East and hopes to continue working on projects related to the region, particularly where he can develop computational tools to further social science research. He has plans, for instance, to use GPUs to accelerate the traditionally slow process of machine learning in order to develop better automated voice recognition for Arabic. In February, Mostak began working as a researcher at MIT’s Computer Science and Artificial Intelligence Laboratory, where he will continue to work on MapD and TweetMap, both in general and for specific projects. One such application will be a partnership with researchers from Massachusetts General Hospital who hope to use the platform to examine health trends such as flu outbreaks and immunization rates. Though his Ash Center fellowship ended in December 2012, Mostak also hopes to continue collaborating with Masoud and sees great potential in their future research. “There’s a million different things that are interesting that you could measure,” Mostak says. Judging by what he’s accomplished in eighteen months, it’s hard not to believe him.
—Article by Johanna Bodnyk
[1] Arabic is diglossic, meaning most of its speakers use two distinct dialects or registers: a formal register used in political and religious contexts, among others, and a colloquial register used in everyday life.