Anti-Immigrant Sentiment and BREXIT

In this blog post, we discuss how we engineered metrics to capture the two major concepts in our research question, using the British Election Study panel data:

How do attitudes toward immigrants & immigration influence the volatility of opinions on BREXIT?

Anti-Immigrant Sentiment (AIS)

To quantify an individual’s immigration sentiment, we identified 4 survey questions that explicitly asked about respondents’ attitudes towards immigrants and immigration. Because these 4 questions were asked across multiple waves, before computing a single person-level immigration index we first need to confirm that an individual’s immigration attitude does not vary drastically over time. In particular, we look at the range of attitude change: if respondents’ immigration sentiment varies within a small range over time, we can assume that the average of an individual’s responses across waves accurately represents their attitude at any point in time.

In practice, we computed the range of change as the difference between an individual’s maximum and minimum responses over the period in which they participated in the survey. We defined ‘tolerable change’ as a change of at most 1 level for a 5-level response question (e.g. from “Disagree” to “Strongly Disagree”), or at most 2 levels for a 7-level response question (e.g. from 2 to 4 on a scale of 1 to 7), over the entire course of the survey. Any larger change is defined as ‘significant change’.
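The classification rule above can be sketched as follows. This is a minimal Python illustration of the logic (the original analysis was done in R; the numeric codings of the response levels here are hypothetical):

```python
def change_range(responses):
    """Range of a respondent's answers across the waves they participated in."""
    return max(responses) - min(responses)

def change_type(responses, scale_levels):
    """Classify attitude change as 'tolerable' or 'significant'.

    Tolerable: at most 1 level of change on a 5-level question,
    or at most 2 levels on a 7-level question.
    """
    threshold = 1 if scale_levels == 5 else 2
    return "tolerable" if change_range(responses) <= threshold else "significant"

# e.g. moving between "Strongly Disagree" (1) and "Disagree" (2) on a 5-level scale
print(change_type([2, 1, 2, 2], scale_levels=5))  # tolerable
print(change_type([2, 4, 3], scale_levels=7))     # tolerable
print(change_type([1, 5], scale_levels=5))        # significant
```

Respondents classified as 'tolerable' (or showing no change at all) are the ones for whom a mean response is a reasonable summary.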

Fig.1 below shows that more than 75% of respondents exhibit no change or only tolerable change in their immigration attitude. This holds for all 4 survey questions we identified. Therefore, we decided to quantify an individual’s immigration attitude using their mean response.

The four questions are:

  • immigEcon: “Do you think immigration is good or bad for Britain’s economy?”
  • immigCultural: “Do you think that immigration undermines or enriches Britain’s cultural life?”
  • immigrantsWelfareState: “How much do you agree or disagree with the following statements? ‘Immigrants are a burden on the welfare state?’”
  • changeImmig: “Do you think that immigration is getting higher, getting lower or staying about the same?”

Fig.2 Correlation matrix between responses to the 4 survey questions on immigration sentiment.

Fig.2 shows that responses to the first 3 questions are highly correlated (correlation coefficients > 0.75). Based on this, we computed an overall Anti-Immigrant Sentiment (AIS) index from the responses to the first 3 questions.

Volatility of BREXIT Voting Preference

Next, we want a way to quantify opinion volatility in respondents’ attitudes towards Brexit. In every wave except wave 5, the survey asks respondents: “If there was a referendum on Britain’s membership of the European Union$turnoutText, how do you think you would vote?”. We flag a response as a switch if it flips from “stay” to “leave” or from “leave” to “stay” relative to the respondent’s response in the previous wave.

We computed two person-level metrics: switch_ratio and if_switch. switch_ratio is the number of switches an individual made divided by the number of waves they participated in minus 1: for n waves of participation, there are only n – 1 transitions in which an individual could possibly switch their voting preference. It measures how frequently an individual switches their opinion. if_switch is a binary variable that indicates whether an individual ever switched their voting preference.
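The two metrics can be sketched in a few lines. This is a Python illustration of the definitions above (the original analysis was done in R; the vote labels are hypothetical stand-ins for the survey responses):

```python
def volatility_metrics(votes):
    """Compute (switch_ratio, if_switch) from a respondent's per-wave votes.

    A switch is any flip between consecutive waves, e.g. "stay" -> "leave".
    """
    switches = sum(1 for prev, cur in zip(votes, votes[1:]) if prev != cur)
    possible = len(votes) - 1              # n waves allow only n - 1 switches
    switch_ratio = switches / possible if possible > 0 else 0.0
    if_switch = 1 if switches > 0 else 0
    return switch_ratio, if_switch

# a respondent observed in 4 waves who switched once: ratio = 1/3, if_switch = 1
print(volatility_metrics(["stay", "stay", "leave", "leave"]))
```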

In practice, we found that if_switch is much more helpful for differentiating respondents in terms of opinion volatility. First, only about 1 in every 6 respondents ever switched their voting preference from one side to the other (in the counts below, 1 indicates a switch and 0 otherwise).

##     0     1 
## 36892  7932

Therefore, in effect, the distribution of switch_ratio looks like:

Such a highly skewed distribution makes it difficult for switch_ratio to effectively represent opinion volatility in modeling tools.

In addition, we looked at the relationship between switch_ratio and immigration sentiment (AIS).

As the graphs above suggest, AIS remains relatively constant regardless of switch_ratio. We observe volatility in AIS when switch_ratio exceeds 0.5, but the extremely small sample size in that range undermines the significance of that observation. Given that we do not observe a difference in immigration sentiment no matter how frequently respondents switch their voting preference, we hypothesize that the frequency of switching is not as critical as whether individuals switch their opinion at all. Therefore, we decided to use if_switch instead of switch_ratio to advance our investigation.

Later on, we combine if_switch with respondents’ final vote to categorize respondents into 4 different voter types:

  • Voters who always voted for “stay”
  • Voters who always voted for “leave”
  • Voters who have switched their opinion and voted for “stay”
  • Voters who have switched their opinion and voted for “leave”

With this categorization, we developed an interactive Shiny application to directly visualize the differences in immigration sentiment among the 4 voter types.

Source: British Election Panel Study Data

New DASIL Space in the Humanities and Social Studies Center (HSSC)

Data — from a single digit to terabytes of information — increasingly shapes everything from public policy to the way we go about our daily lives. To some this is exciting; to others, intimidating.

The Data Analysis and Social Inquiry Lab (DASIL) helps students and faculty members integrate data analysis into both classroom work and research by facilitating workshops, helping students collect and analyze data, and offering software training and data-set preparation.

“Since the fall of 2017, DASIL has been active in providing services to over 170 clients, including students and faculty,” says Xavier Escandell, associate professor of anthropology and faculty director for DASIL. “We have also successfully continued our support of the institution at large.”

Sentiment Analysis of a Podcast


There has been an increase in the exploration of text as a rich data source. Quantifying textual data can reveal trends, patterns, and characteristics that human interpretation may initially miss. Combining quantitative analysis with the computing capabilities of modern technology allows for quick processing of very large amounts of text.

Here we present a sentiment analysis of an intriguing form of text – podcast transcripts – to provide a discussion on the process of text analysis.

Podcast transcripts are a unique form of text because their initial intent was to be listened to, not read, creating a more intimate form of communication. The text used in this example comes from the transcripts of the NPR “Serial” podcast hosted by Sarah Koenig. “Serial” explores the investigation and trial of Adnan Syed, who was accused of the murder of his girlfriend in 1999. The podcast consists of 12 episodes averaging 43 minutes and 7,480 words each. Here we examine the 12 episodes together as a whole text.

Sentiment analysis involves processing text data to identify, quantify, and extract subjective information from the text. Using the tools available in the tidytext package in R, we can examine the polarity (positive or negative tone) and the emotional associations (joy, anger, fear, trust, anticipation, surprise, disgust, and sadness) of the text. We present one method of sentiment analysis, which involves referencing a sentiment dictionary: a list of words coded for the attribute of interest. For polarity, each word is given a positive, negative, or neutral value; for emotion, each word is tagged with any associated emotions. As an example, the word “murder” is coded as negative and tagged with fear and sadness. We chose the NRC sentiment dictionary for this analysis because it is the only one of the commonly used lexicons that includes emotions, and it was created as a general-purpose lexicon that works well for different types of text.
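To make the dictionary-lookup approach concrete, here is a minimal Python sketch using a tiny hand-made lexicon. The actual analysis used the full NRC dictionary in R; the lexicon below is only illustrative, though the codings for “murder” and “friend” match those mentioned in this post:

```python
from collections import Counter

# Toy lexicon: each word maps to its polarity and emotion tags,
# in the style of the NRC dictionary (entries here are illustrative).
TOY_LEXICON = {
    "murder": {"negative", "fear", "sadness"},
    "crime":  {"negative", "fear"},
    "friend": {"positive", "trust"},
    "talk":   {"positive"},
}

def sentiment_counts(text, lexicon):
    """Tally polarity/emotion tags over every lexicon word in the text."""
    counts = Counter()
    for token in text.lower().split():
        counts.update(lexicon.get(token.strip(".,!?\"'"), ()))
    return counts

counts = sentiment_counts("A friend described the murder.", TOY_LEXICON)
print(counts["negative"], counts["positive"], counts["trust"])  # 1 1 1
```

Dividing each tag's count by the total number of words tallied gives the percentage-of-text figures shown in the bar graphs below.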

Starting with an overall visualization of the emotions and polarity of the podcast, a bar graph (Figure 1) displays the percentage of the text characterized by each emotion. In examining the text in this way, a particularly intriguing discovery is that the most common emotion is trust, which may be surprising for a podcast about a murder investigation and trial. The next most common emotion is anticipation. This confirms what one may expect in the context of podcasts: hosts would want to keep their listeners interested in the story so anticipation would play a key role in getting people to listen regularly.

Figure 2 shows that overall this text is positive as a larger percent of the words are coded as positive. Looking closer at which words occur most often within a specific sentiment or emotion, a sorted word cloud allows one to visually identify the most commonly used words coded as positive or negative.

The most frequently used negative words are crime, murder, kill, and calls. The most frequently used positive words are friend, talk, police, and pretty. It is important to examine the context of the most common words. Consider the word “pretty”: in the text, “pretty” was used as an adverb, not an adjective (e.g. “I’m pretty sure I was in school. I think– no?”). All 53 instances of “pretty” in the text were used to convey uncertainty. However, the NRC dictionary codes “pretty” as an adjective describing something as attractive. This mismatch between usage in the text and in the dictionary affects the sentiment analysis, so one should carefully consider how to handle such words.

Similarly we can examine each emotion in more detail. These graphs allow one to see which words were most represented in each emotion.


This graph again illustrates the importance of critically examining the results. The word “don” appears as a top positive word; however, in this text “don” is a person’s name and, like the other names, should be coded as neutral. The NRC lexicon instead codes “don” in its sense of a gentleman or mentor. Similar concerns may arise for other words with multiple meanings. These words should be considered carefully, particularly if they are among the most frequently used words in the text.

These graphs show a few of the many ways to quantify and visualize text data through a sentiment analysis to understand a text more objectively. As text analyses become more prevalent, it is imperative to actively engage in the process and critically examine results paying attention to not only the numbers and graphs but also the subject matter of the text.


Logo for Tableau Software

Software Review: Tableau as a Teaching Tool

Tableau is a unique and valuable teaching tool because it provides an easy interface for creating charts, graphs and even maps.  Students can explore data in sophisticated ways after only a short training session.  Even better, students can get free licenses for the software, allowing faculty to use it in classes without incurring large financial commitments.

A map showing fatalities and event types for different violent events in Africa

What sets Tableau apart from other data visualization or business intelligence software is its intuitive, user-friendly drag-and-drop interface. For more sophisticated applications this is supplemented by a variety of easy to understand menus. By using contextual menus and panels instead of typing in code, Tableau lowers the learning curve needed to create visualizations. For example, creating a line graph or a map is as easy as selecting the variables in question and selecting the appropriate type of visualization.

Classic tables like the one below are easy to construct and can also be augmented with color-coded hotspot analyses.

A highlight table showing the number of violent events happening in Egypt, Libya, South Sudan, and Sudan broken down by Country and Event Type

Tableau provides the opportunity to construct data visualizations that are more complex than those generated by most traditional statistical packages.  For example, the graphic below compares the number of conflicts over time for four African countries in a fairly standard line plot, but adds an additional variable, the number of fatalities, by varying line thickness.

A line graph showing the trend of the number of violent events in 4 African countries (Egypt, Libya, South Sudan, and Sudan) between 1997 and 2015. The thickness of the lines represents the number of fatalities.

For classes working with data, Tableau presents a significant opportunity for instructors to integrate more data into the classroom, especially with students who might not have experience with more advanced statistical software. Making it easier for students to explore and understand data, and to ask their own questions through investigative learning, encourages them to gain a deeper appreciation for data as it relates to their discipline. In fact, as of this writing, Tableau is being used successfully in several of our classes at Grinnell College.

However, Tableau does have its drawbacks. In particular, visualizations created with Tableau are not as customizable as those built with more powerful languages such as R or JavaScript. In addition, Tableau is not designed for data analysis: it is a data visualization tool, not a statistical package. Another small downside is that data entered into Tableau must be formatted in a specific way.  While Tableau can do some data manipulation, spreadsheet programs like Excel are much easier for this. As a result, Tableau’s role in the classroom or in research may be restricted to surface-level exploration of the data in question. Despite these limitations, Tableau remains a tool with great potential, especially in the possibilities it offers for creating quick and easy visualizations.

Student Spotlight: Racial Bias in the NYPD Stop-and-frisk Policy

Donald Trump recently came out in favor of the New York Police Department’s (NYPD) old “stop-and-frisk” policy, which allowed police officers to stop, question and frisk individuals for weapons or illegal items. The policy came under harsh criticism for racial profiling and was declared unconstitutional by a federal judge in 2013.

An earlier post by Krit Petrachainan showed potential racial discrimination against African-Americans within different precincts. Expanding on this topic, we decided to look at data from 2014, one year after the policy had been reformed but before major official policy changes had taken place.

More specifically, this study examined whether race (Black or White) influenced the chance of being frisked after being stopped in NYC in 2014, after taking physical appearance, the population distribution of suspects across races, and suspected crime types into account.

2014 Data From NYPD and Study Results

For this study, we used the 2014 Stop, Question and Frisk dataset retrieved from the New York City Police Department. After data cleaning, the final dataset has 22,054 observations. To address our research question, we built a logistic regression model and ran a drop-in-deviance test to determine the importance of the Race variable in our model.

Our results suggest that, once a suspect is stopped, race does not significantly influence the chance of being frisked in NYC in 2014. A drop-in-deviance test on a logistic regression model predicting the likelihood of being frisked gave a G-statistic of 8.99 and a corresponding p-value of 0.061. This marginally non-significant p-value means we do not have enough evidence to conclude that adding the terms associated with Race improves the predictive power of the model.
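A drop-in-deviance test compares the deviances of the model with and without the Race terms; the difference (the G-statistic) is referred to a chi-square distribution. The Python sketch below reproduces the reported p-value from the reported G = 8.99, assuming the Race terms contribute 4 degrees of freedom (the df is our inference from the reported numbers, not stated in the post):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function P(X > x), closed form for even df."""
    assert df % 2 == 0, "closed form shown only for even df"
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

G = 8.99              # reported drop-in-deviance (G) statistic
p = chi2_sf(G, df=4)  # df = 4 is assumed, consistent with p = 0.061
print(round(p, 3))    # 0.061
```

In R this is typically done with `anova(reduced_model, full_model, test = "Chisq")` on two fitted `glm` objects.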

Logistic regression plot predicting probability of being frisked from precinct population Black, compared across race

Figure 1. Logistic regression plot predicting probability of being frisked from precinct population Black, compared across race

To better visualize the relationship in interactions between race and other variables, we created logistic regression plots predicting the probability of being frisked from either Black Pop or Age, and bar charts comparing proportion of suspects frisked across sex and race.

Interestingly, given that suspects are stopped, as the precinct proportion of Black residents increases, both Black and White suspects are more likely to be frisked. Furthermore, this trend is more pronounced for Black suspects than for White suspects (Figure 1).

Additionally, young Black suspects are much more likely than their White counterparts to be frisked, given that they are stopped. This difference diminishes as suspect age increases (Figure 2).

Logistic regression plot predicting probability of being frisked from age, compared across race

Figure 2. Logistic regression plot predicting probability of being frisked from age, compared across race

Finally, male suspects are much more likely to be frisked than females, given they are stopped (Figure 3). However, the bar charts indicate that the effect of race on the probability of being frisked does not depend on gender.

Proportion frisked by race, compared across sex

Figure 3. Proportion frisked by race, compared across sex

Is stop-and-frisk prone to racial bias?

Our results suggest that, given that a suspect is stopped and after taking other external factors into account, race does not significantly influence the chance of being frisked in NYC in 2014. However, the relationships between race and precinct Black population, age, and sex suggest that NYPD “stop-and-frisk” practices may be prone to racial bias, posing a threat to minority citizens in NYC. It is crucial that the NYPD continue to evaluate its “stop-and-frisk” policy and make appropriate changes to the policy and/or officer training in order to prevent racial profiling at any level of investigation.

*** This study by Linh Pham, Takahiro Omura, and Han Trinh won 2nd place in the 2016 Undergraduate Statistics Class Project Competition (USCLAP).

Check out the 2016 USCLAP Winners here.