It used to be easy for researchers to study digital social systems. Not anymore. A few unethical scientists, political operatives, and capitalists, along with irresponsible privacy policies like Facebook’s during the Cambridge Analytica scandal, have rightly put Facebook and Twitter on the defensive. The days of tapping into their application programming interfaces (APIs) and drinking in gigabytes of data are over. And while ethical researchers can still get some data, new limitations make answering some of society’s most pressing questions more difficult and, in many cases, impossible.
Our research group at the School of Information Studies at Syracuse University hosts a website where we’ve collected and analyzed social media data around recent US elections. Those of us doing election research on social media platforms have seen our work become increasingly difficult since 2017. Both Twitter and Facebook are imposing tighter restrictions on their data. This limits how well academic researchers can act as sources for journalists, and thus how well the public is informed.
Insights from this data are the substance of much of our work. In one of our recently published papers we analyzed a collection of retweets of gubernatorial candidates’ tweets in the 2018 elections. We found that users with a middle range of followers, between 1,800 and 26,000, tend to have the most influence over the flow of political information. We call this group the information middle class. These accounts outnumber top information producers like the New York Times, and their followers, on average, tend to be more active in propagating political information. Our research looking back at the activity of automated accounts (bots) during the 2016 US presidential election found that while bots amplify messages, meaning they retweet, they tend not to amplify messages as well as people do because they rarely have many followers. This means that, unlike humans, bots don’t push candidates’ messages out to new audiences. Looking back further, to the 2014 election, we used computational methods to determine when candidates were more likely to use attack messages and when the public tended to spread those messages. We found that candidates tended to attack more on Twitter than on Facebook, and that challengers were more likely to attack than incumbents. We also found that incumbent-generated messages were more likely to be shared and retweeted, and that, regardless of incumbency status, attack messages were more likely to be spread than other types of messages.
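To make the "information middle class" definition concrete, here is a minimal Python sketch that sorts accounts into tiers by follower count. Only the 1,800 and 26,000 thresholds come from the findings above; the tier names, the function name, and the choice to treat both boundaries as inclusive are assumptions made for illustration, not the method used in the paper.

```python
# Thresholds taken from the text; whether they are inclusive is an assumption.
MIDDLE_MIN = 1_800
MIDDLE_MAX = 26_000


def follower_tier(followers: int) -> str:
    """Return a hypothetical tier label for an account's follower count."""
    if followers < MIDDLE_MIN:
        return "low"
    if followers <= MIDDLE_MAX:
        return "middle"  # the "information middle class"
    return "high"


# Example follower counts (made up for illustration).
tiers = {count: follower_tier(count) for count in (500, 1_800, 12_000, 26_000, 3_000_000)}
```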
We collect our data in real time throughout an election cycle, and store it on servers for analysis. In total, we have archived around 1 billion tweets and hundreds of thousands of Facebook posts since 2014. We do not do analytics on individuals other than public candidates running for election. Our servers are behind firewalls and password protected, and only those of us on the research team have access to the data. We generally employ traditional statistical methods used in social sciences along with a lot of data visualization.
Most of our research addresses messaging, such as negativity in campaign messages, whether messages are related to a candidate’s image or the issues, what kinds of messages are spread, and who influences the diffusion of political information. Without this research, the public would be less informed about how politicians and the public use social media during the political process. We would lack an understanding of the ways platforms like Facebook and Twitter are affecting how we access political information.
Historically, Facebook and Twitter have provided very little information on these topics. It seems reasonable to assume that the kinds of in-house research they do are limited to topics that affect their bottom lines. This is obviously not always in line with the public interest.
In 2014, collecting most of our data was fairly easy. We wrote simple Python applications that connected to the Facebook or Twitter API and collected messages posted by candidates. On Twitter we collected retweets of candidates’ tweets and mentions of their handles. Every so often a platform would change its API; in response, we updated our code, or our collection systems stopped. Often there would be minor changes in the data such that we needed to be nimble and forgiving of unexpected alterations. Many times, the platforms didn’t update their documentation as fast as they made modifications to the API. As a result, we were always vigilant and ready to adapt on the fly, because Twitter doesn’t wait for the election to conclude before implementing changes to its API.
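Being "forgiving of unexpected alterations" can be sketched in a few lines of Python. The example below is a hedged illustration, not our collection code: the field names (`text`, `full_text`, `user`, `screen_name`, `retweet_count`) and the sample records are assumptions chosen to resemble a typical social media payload, and the function name is hypothetical. The idea is simply to read each field defensively so that a renamed or missing key skips one record instead of stopping the collector.

```python
from typing import Any, Optional


def extract_message(payload: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Pull the fields of interest from one record, tolerating missing keys."""
    # Accept either of two assumed spellings for the message body.
    text = payload.get("text") or payload.get("full_text")
    if text is None:
        return None  # skip records we cannot interpret rather than crash
    user = payload.get("user") or {}  # the user object may be absent or null
    return {
        "id": str(payload.get("id", "")),
        "text": text,
        "author": user.get("screen_name"),
        "created_at": payload.get("created_at"),
        "retweets": int(payload.get("retweet_count") or 0),
    }


# Made-up records showing a complete payload, a renamed field, and a stub.
records = [
    {"id": 1, "text": "Vote on Tuesday", "user": {"screen_name": "cand_a"}, "retweet_count": 12},
    {"id": 2, "full_text": "Thank you, Ohio", "user": None},
    {"id": 3},
]
parsed = [m for m in (extract_message(r) for r in records) if m is not None]
```

A collector written this way keeps running through small schema changes, at the cost of silently dropping records it cannot read, so in practice the skipped records would also need to be counted or logged.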
The 2016 presidential election changed all this, though it wasn’t until March 2018, with news of the Cambridge Analytica scandal, that the public learned a researcher had misused Facebook data in an effort to influence election results. The subsequent media storm and congressional hearings caused social media companies to lock down their data. Some of the changes were small: During the 2016 election, for a given comment on a candidate’s Facebook page, we were able to collect the comment, the time of the post, the screen name of the poster, and how many times it had been liked and shared. Since Facebook’s API changes, we cannot see who the poster was. Other changes have been larger and more destructive. For the 2018 election, we collected data for every gubernatorial, Senate, and House candidate, more than 900 candidates in all. To do this, we needed multiple “apps,” each of which connected to the API and collected data for a subset of the candidates. Now, however, Twitter allows just one application. Our previous methods have become impracticable. Facebook has restricted some kinds of work entirely, such as automated coding, in which software categorizes messages. These restrictions appear to be a moving target for Facebook and Twitter, with new ones occasionally being introduced as new policies are announced.
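The multi-app arrangement used in 2018 amounts to dividing the candidate list into subsets, one per app. A minimal Python sketch of that division follows. The number of apps (four), the placeholder candidate names, and the round-robin scheme are all assumptions for illustration; the text above says only that more than 900 candidates were split across multiple apps.

```python
def partition(candidates: list[str], n_apps: int) -> list[list[str]]:
    """Split candidates into n_apps roughly equal subsets, round-robin."""
    if n_apps < 1:
        raise ValueError("n_apps must be at least 1")
    # Slice i takes every n_apps-th candidate starting at position i.
    return [candidates[i::n_apps] for i in range(n_apps)]


# Placeholder handles standing in for the real candidate list.
candidates = [f"candidate_{i:03d}" for i in range(900)]
batches = partition(candidates, 4)  # four apps is an assumed figure
```

With a single permitted application, `n_apps` is effectively fixed at 1, so the whole list falls to one app and its rate limits.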
During the 2020 election, we will still be collecting social media data, but we will get less of it. It will also be harder to maintain access for our apps. But we are updating our Illuminating site to include access to the 2014, 2016, 2018, and 2020 elections. The tools will support analysis within states, within elections, or by office (Senate only, for example). We are also introducing new data visualizations that we hope will make it easier to see how candidates are using social media in their election campaigns.
We believe the research we do supports journalists and, ultimately, democracy, but we are concerned that in a justified effort to improve privacy, social media companies are making it harder for researchers to do work that is valuable for the health of democracy. Currently, researchers submit requests for access to Facebook and Twitter through the same channels as for-profit businesses. Other models deserve consideration: fair use laws in the US allow for the free use of portions of copyrighted material for educational purposes. We think informing the public is worthy of such special consideration.