Measuring Interviewer Effects in the November 2018 Grinnell Poll

Overview of Findings

In this post, we investigate whether responses to the Grinnell College National Poll (GCNP) are affected by the interviewer conducting the poll. The poll is conducted over the phone, through live interviews by professional staff working from a call center. Past research has demonstrated that the perceived air of authority, race, and gender of the interviewer can influence responses from interview subjects (Cotter 1982, Kane 1993).

The GCNP denotes each interviewer with an anonymous code and records the gender of each interviewer. We use this evidence to assess whether interviewers appear to be influencing responses in the poll. We found the following:

  • Evidence of an effect of the interviewer’s gender on the answers to some questions (approval of Trump, for example), but not necessarily on specifically gendered questions such as the level of discrimination against women in the United States.
  • A consistent pattern of female interviewers getting more ‘Not Sure’ answers than their male counterparts across many questions.
  • Some interviewers received noticeably more ‘Not Sure’ answers than their peers, and some interviewers had more unusual responses than would be expected.

We cannot definitively determine whether these abnormalities are due to interviewers explicitly influencing responses or whether these interviewers got unusual respondents by chance. In addition, an interviewer interviews on average 7 respondents, which means any influence that interviewer has on responses is limited to a small subset of the poll. Overall, we did not find substantial evidence that effects from interviewers are biasing the results in the GCNP.

Characteristics of Interviewers in the GCNP

Because the GCNP records only an anonymous code and a gender for each interviewer, we know just two things about the interviewers: their gender and the number of respondents that they interviewed. In the left graph below, we can see a density plot of the number of respondents that each interviewer interviewed. The vast majority of interviewers interviewed fewer than 15 respondents (131 out of 141 interviewers, 93%), and nearly half interviewed 5 or fewer respondents (64 out of 141 interviewers, 45%). We can also see at least two outliers on the high end, indicated by the two small humps on the right side of the density plot, with one at 27 and the other at 32 respondents interviewed.

In the graph on the right, we can see that roughly 2/3 of the interviewers are female. However, only 530 of the 1000 respondents were interviewed by women, which suggests that there was some effort to balance respondents across interviewer genders.
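The workload figures in this section are simple tallies over the interviewer code attached to each respondent. Here is a minimal Python sketch of that tally, assuming a hypothetical list with one interviewer code per respondent. The toy codes are invented for illustration and are not taken from the poll.

```python
from collections import Counter

def workload_summary(interviewer_codes, small=5, large=15):
    """Summarize how many respondents each interviewer handled."""
    per_interviewer = Counter(interviewer_codes)  # code -> number of respondents
    counts = list(per_interviewer.values())
    n = len(counts)
    return {
        "interviewers": n,
        "mean_respondents": sum(counts) / n,
        "share_at_most_small": sum(c <= small for c in counts) / n,
        "share_under_large": sum(c < large for c in counts) / n,
    }

# Toy data: three hypothetical interviewers with 2, 6, and 16 respondents.
codes = ["a1"] * 2 + ["b2"] * 6 + ["c3"] * 16
summary = workload_summary(codes)
print(summary)
```

Applied to the poll itself, the same arithmetic gives 1000 / 141, or roughly 7.1 respondents per interviewer.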

Prevalence of Not Sures

In any poll, a ‘not sure’ option is necessary so that individuals do not feel forced into specific answers: a respondent may feel uncomfortable answering a particular question, or may not know the subject matter well enough to give a proper answer. In either case, a ‘not sure’ response poses problems for further analysis. In this poll, it is difficult to place on the ordinal scale that the rest of the answers fall on and, more importantly, it amounts to missing information about a respondent’s true feelings about a question.

In relation to interviewer effects, there are two key aspects to consider: how to measure an interviewer’s effect on the prevalence of ‘not sure’ answers, and what could be the cause of an abundance of ‘not sure’ responses. We will first consider the potential causes of an abundance of ‘not sure’ responses from the respondents of a particular interviewer, and how the interviewer could be affecting that abundance:

  • Random Chance: It could be that the interviewer randomly got an ambivalent group of respondents, and it had nothing to do with the tone, quality, or ability of the interviewer.
  • Interviewer Influence: Some interviewers might be more open to offering the ‘not sure’ response than other interviewers. It seems likely that you could have three groups of interviewers; one where they explicitly read out ‘not sure’ as an option when reading the question, one where they offer the ‘not sure’ response after a long pause without a response from the respondent, and one where they never explicitly offer ‘not sure’ as a response, but will of course record that if the respondent does say ‘not sure’ unprompted. It seems straightforward that these three groups would get a different prevalence of ‘not sure’ responses.
  • Human Error: Polling is a human enterprise. There have been incidents of researchers fabricating data (Anil Potti as an example), and interviewers can make errors when gathering data, such as by entering a response incorrectly. Such errors have the potential to be systematic. For example, an error in the way questions were displayed on an interviewer’s computer screen reportedly led to the cancelling of a Des Moines Register Iowa Poll just prior to its release.

Unfortunately, it is impossible to determine the actual reason from the data alone, but we can flag interviewers for a more in-depth investigation. With that said, there are several ways to measure an interviewer’s potential impact on the prevalence of ‘not sure’ responses:

  • Occurrence of strings of ‘not sure’ responses. This usually means that there was either a sequence of questions that the respondent didn’t know enough about, or that they were getting tired of doing the survey and were thus saying ‘not sure’ to many questions in a row. An interviewer could certainly influence the latter, and the recurrence of these strings, especially towards the ends of interviews, could be an indication of interviewer impact.
  • Logistic regression with the response variable as whether or not the respondent said ‘not sure’, and the key explanatory variable being the interviewer code (or interviewer gender), with control variables for the respondent’s demographics. This would likely shed less light on individual influential interviewers, and more on questions that seemed to be swayed by interviewers. The key issue with this test is that the sample sizes within each interviewer are small, so there would likely be noticeable differences in mean responses across interviewers just due to chance.
  • Distribution of ‘not sure’ responses per respondent grouped by the interviewer. This is the simpler technique, and what we primarily looked at when investigating the interviewer’s effect on the prevalence of ‘not sure’ responses.
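The first option in the list above, strings of ‘not sure’ responses, can be sketched in a few lines of Python. This is an illustration only. It assumes each respondent’s answers are available as a list in question order, and the answer labels are made up.

```python
def longest_not_sure_run(answers, not_sure="not sure"):
    """Return (length, start_index) of the longest consecutive run of 'not sure'."""
    best_len, best_start = 0, None
    run_len, run_start = 0, None
    for i, answer in enumerate(answers):
        if answer == not_sure:
            if run_len == 0:
                run_start = i  # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0  # any other answer breaks the run
    return best_len, best_start

# Hypothetical respondent who trails off into 'not sure' at the end.
answers = ["approve", "not sure", "agree", "not sure", "not sure", "not sure"]
length, start = longest_not_sure_run(answers)
print(length, start)
```

A long run that starts late in the interview is the pattern described above as a possible sign of respondent fatigue.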

Below, you can see boxplots that represent the third option detailed above. Each of the little blue dots is an individual respondent. Respondents are grouped along the x-axis by their interviewer, and the y-axis is the number of ‘not sure’ responses that they gave out of the 57 questions that they were asked.

Caution was taken to make the interviewers anonymous: the identifiers along the x-axis are randomly generated, so the person behind a code cannot be identified. One of the biggest challenges with investigating interviewer effects in the context of these polls is the limited sample size for some of the interviewers, so this plot only includes interviewers that had 10 or more respondents. The biggest ‘outliers’ that we found have their labels in red: q198, q456, q466, and q964. q466 and q964 both have higher quartile values than their peers, especially at the lowest quartile (25%). q964 is especially peculiar, as it is the only interviewer with no respondents who gave an answer to every question (0 on the y scale). q198 is odd in that it has 4 respondents who gave ‘not sure’ answers to more than 15 questions. q456 also has a larger interquartile range than most other interviewers, and has two respondents with over 35 ‘not sure’ answers.
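The quantities read off the boxplots (quartiles, minimum, maximum) can also be tabulated directly. The Python sketch below does this under assumptions: the input is a hypothetical list of (interviewer code, ‘not sure’ count) pairs, one per respondent, and the codes `x1` and `x2` are invented. The quartile convention (`method="inclusive"`) is a choice made here and may differ from what the plotting software used.

```python
from collections import defaultdict
from statistics import quantiles

def not_sure_quartiles(records, min_respondents=10):
    """records: iterable of (interviewer_code, not_sure_count), one per respondent."""
    by_interviewer = defaultdict(list)
    for code, count in records:
        by_interviewer[code].append(count)

    summary = {}
    for code, counts in by_interviewer.items():
        if len(counts) < min_respondents:
            continue  # skip interviewers with too few respondents
        q1, median, q3 = quantiles(counts, n=4, method="inclusive")
        summary[code] = {
            "n": len(counts),
            "min": min(counts),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(counts),
        }
    return summary

# Hypothetical data: x1 has 11 respondents, x2 has only 2 and is filtered out.
records = [("x1", c) for c in range(11)] + [("x2", 4), ("x2", 9)]
summary = not_sure_quartiles(records)
print(summary)
```

In a table like this, an interviewer whose `min` is above 0 would correspond to the q964 pattern described above, where no respondent answered every question.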

While we identified four potentially interesting interviewers with respect to how often their respondents gave a ‘not sure’ answer, we don’t recommend any remedial action based on these differences. The point of this exercise was to check for interviewers with consistently abnormal numbers of ‘not sure’ responses, which we don’t see here.

Deviance from Expectation

To measure the second component of our analysis (how the answers given to certain interviewers differ from what would be expected of the respondents given their demographics), we developed a four-step process for detecting interviewers who consistently got answers that would be unexpected based on their respondents’ demographics. One major thing that we want to note here is that it is impossible to know whether any of the detected ‘outlier’ interviewers are actually influencing the responses given, or whether they just happened to interview individuals whose beliefs don’t line up with what would be expected based on their demographics. This is especially true with small sample sizes (which we have with these polls), where we can’t reasonably assume that these non-aligning individuals are diluted by more typical respondents for each interviewer.

The four steps to our process are outlined below, with further detail provided for each step in their own sections:

  1. Build Ordinal or Logistic Model for each question based on respondent demographics
  2. Calculate Loss for each question for each respondent
  3. Calculate average loss for each question for each interviewer
  4. Outlier detection at the interviewer level with features as average loss for each question

1. Building Predictive Models for each Question based on Respondent Demographics.

All of the questions in these polls follow an ordinal scale (ordered responses, such as a favorability scale) or have a binary response (do you support Donald Trump), so they can be modeled with either an ordinal logistic regression or a logistic regression model. In principle, an ordinal logistic model reduces to logistic regression when there are just two categories, but the R function we used (MASS::polr) requires a response with at least three levels, so binary questions need a separate logistic regression. For each question and each interviewer, we first split the poll results into the respondents that were interviewed by that interviewer (test dataset) and those that weren’t (train dataset). We should note here that we removed the ‘not sure’ responses, as we didn’t want to subjectively place this answer on an ordinal scale with the other responses. We then fit either an ordinal or a logistic regression on the train dataset to predict responses from respondent demographics. Once we have this model, we can run it on the respondents that were interviewed by our particular interviewer of interest (test dataset) and compare the predicted results to the actual responses. With both the ordinal and the logistic regression, we can get probabilities for each response based on each respondent’s demographics.
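The held-out-interviewer split described above can be sketched in Python. To keep the example self-contained, the regression is replaced by a deliberately crude stand-in: smoothed response frequencies within one demographic group. This is not the model used in the analysis, and the field names (`interviewer`, `party`, `Q1`) and rows are hypothetical.

```python
from collections import Counter, defaultdict

def split_by_interviewer(rows, interviewer):
    """Train on everyone else's respondents, test on this interviewer's."""
    train = [r for r in rows if r["interviewer"] != interviewer]
    test = [r for r in rows if r["interviewer"] == interviewer]
    return train, test

def fit_frequency_model(train, question, categories, group_key="party"):
    """Stand-in for the regression: add-one-smoothed response shares per group."""
    counts = defaultdict(Counter)
    for r in train:
        if r[question] == "not sure":
            continue  # 'not sure' responses are dropped before modeling
        counts[r[group_key]][r[question]] += 1

    def predict(row):
        c = counts[row[group_key]]
        total = sum(c.values()) + len(categories)
        return {cat: (c[cat] + 1) / total for cat in categories}

    return predict

# Hypothetical rows, one per respondent.
rows = [
    {"interviewer": "i1", "party": "D", "Q1": "disapprove"},
    {"interviewer": "i1", "party": "R", "Q1": "approve"},
    {"interviewer": "i2", "party": "D", "Q1": "disapprove"},
    {"interviewer": "i2", "party": "R", "Q1": "not sure"},
    {"interviewer": "i3", "party": "D", "Q1": "approve"},
]
train, test = split_by_interviewer(rows, "i3")
predict = fit_frequency_model(train, "Q1", ["approve", "disapprove"])
probs = predict(test[0])
print(probs)
```

The property that matters is that the held-out interviewer’s respondents never contribute to the model that predicts their answers.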

2. Calculate Loss for each question for each respondent

Once we have predicted probabilities for each category of the tested question for each respondent, we can calculate a loss function. In this case, the loss function is just 1 minus the predicted probability for the actual response that each respondent gave. With this set-up, we have a value that is bounded from below by 0, and bounded from above by 1, with larger values meaning that the actual response was more unexpected.
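As a small worked example of this loss in Python (the probabilities are made up for illustration):

```python
def response_loss(predicted_probs, actual):
    """Loss = 1 - predicted probability of the response actually given."""
    return 1.0 - predicted_probs[actual]

# Hypothetical predicted probabilities for one respondent on a binary question.
probs = {"approve": 0.8, "disapprove": 0.2}
expected_answer_loss = response_loss(probs, "approve")
surprising_answer_loss = response_loss(probs, "disapprove")
print(expected_answer_loss, surprising_answer_loss)
```

A respondent who gives the answer the model favored gets a loss near 0.2, while one who gives the disfavored answer gets a loss near 0.8.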

3. Calculate average loss for each question for each interviewer

Once we have loss values for each respondent for each question, we can calculate average losses for each interviewer for each question. Density plots of average losses for different questions in the November 2018 version of the Grinnell Poll are shown below. As we can see, distributions vary based on the spread of answers and the number of potential answers each question had.
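The averaging step is a plain group-by. Here is a minimal Python sketch, assuming the per-respondent losses are available as hypothetical (interviewer, question, loss) triples:

```python
from collections import defaultdict

def average_loss_table(losses):
    """losses: iterable of (interviewer, question, loss).

    Returns {interviewer: {question: mean loss}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for interviewer, question, loss in losses:
        sums[interviewer][question] += loss
        counts[interviewer][question] += 1
    return {
        i: {q: sums[i][q] / counts[i][q] for q in sums[i]}
        for i in sums
    }

# Hypothetical losses for two interviewers.
losses = [
    ("i1", "Q1", 0.25), ("i1", "Q1", 0.75), ("i1", "Q14E", 0.5),
    ("i2", "Q1", 0.125),
]
table = average_loss_table(losses)
print(table)
```

Each interviewer then becomes one row of features, with one average loss per question. That table is the input to the outlier detection in step 4.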

  • Q1: Approve/Disapprove of the job Donald Trump has done as President? This is a binary response, and we can see that the model for this question is fairly accurate, with peak average loss around 0.2.
  • Q14E: Number of Immigrants from Middle East: Increase, Decrease, or Stay the Same? This has three options, so a naive model would have average losses of 2/3. Peak average loss is just below this figure, but the distribution extends beyond 2/3, showing that maybe demographics aren’t a great predictor of these answers.
  • Q20A: Importance of being born in America to being a ‘Real American’? We see here average losses above 0.5, but mainly below 0.75. Looking at the distribution of responses to this question, we see a decent spread across each of the responses (24, 6, 18, 49), which is likely why we see average losses just below 0.75 (what would be expected from a naive model).
  • Q20I: Importance of Believing in Treating People Equally to being a ‘Real American’? Like Q20A above, this question also had 4 response levels, but we see a density plot that has a much lower peak. The biggest reason for this is that most people responded with ‘Very Important’, so the model can typically be pretty confident in assigning this response to each respondent.
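The ‘naive model’ reference points quoted in the list above (2/3 for three options, 0.75 for four) correspond to a model that assigns equal probability to every category. For comparison, the Python sketch below also computes the expected loss of a second naive model that predicts the overall response shares for everyone, which works out to 1 minus the sum of squared shares. It assumes the Q20A figures (24, 6, 18, 49) are response shares and renormalizes them to sum to 1. That reading of the figures is an assumption.

```python
def uniform_baseline_loss(n_categories):
    """Expected loss when every category is given probability 1 / K."""
    return 1.0 - 1.0 / n_categories

def marginal_baseline_loss(shares):
    """Expected loss when the overall response shares are predicted for everyone.

    With shares p_i, a respondent gives answer i with probability p_i and
    incurs loss 1 - p_i, so the expectation is 1 - sum(p_i ** 2).
    """
    total = sum(shares)
    return 1.0 - sum((s / total) ** 2 for s in shares)

q14e_uniform = uniform_baseline_loss(3)                    # three options
q20a_uniform = uniform_baseline_loss(4)                    # four options
q20a_marginal = marginal_baseline_loss([24, 6, 18, 49])    # shares from the text
print(q14e_uniform, q20a_uniform, q20a_marginal)
```

Under that assumption, the shares-only baseline for Q20A comes out at roughly 0.65, somewhat below the uniform value of 0.75.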

4. Outlier detection at the interviewer level

Looking back at the average loss density plots above, we notice a couple of density humps outside of the main density. For example, in Q1 we see a small hump around 0.55, in Q20A a small hump at 0.5, and in Q20I a small hump around 0.8. Potential problem interviewers are those that repeatedly find themselves on the edges of these distributions. However, one side of these distributions is more problematic than the other: it is more problematic for an interviewer to be on the right side, as that means their respondents are consistently giving unexpected answers.

This process of detecting outliers is an entire field of analytics more generally known as anomaly detection, and we pulled out a specific process to apply to our dataset of average losses. The process we chose is a form of ‘Local Outlier Factor’. Local outlier factor will consider a point an outlier if the density around this point is considerably sparser than the density around each of its neighbors. We can apply this idea to our dataset of average loss for each question for each interviewer to identify interviewers that are in sparse regions.
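To make the density comparison concrete, here is a sketch of the textbook Local Outlier Factor computation in plain Python. It is not the R function used in this analysis and may differ from it in its details. The points are hypothetical average-loss vectors for two questions. For simplicity it takes exactly k nearest neighbors, ignores distance ties, and assumes no point has k exact duplicates.

```python
import math

def local_outlier_factor(points, k=3):
    """Textbook LOF score for each point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        nearest = order[:k]
        neighbors.append(nearest)
        k_distance.append(dist[i][nearest[-1]])  # distance to the k-th neighbor

    def reach_dist(i, j):
        # Reachability distance of point i from neighbor j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # LOF: average neighbor density relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical (Q1, Q20A) average losses: six similar interviewers and one odd one.
points = [
    (0.20, 0.60), (0.22, 0.62), (0.18, 0.58),
    (0.21, 0.59), (0.19, 0.61), (0.23, 0.60),
    (0.55, 0.80),
]
scores = local_outlier_factor(points, k=3)
print([round(s, 2) for s in scores])
```

The last point sits in a much sparser region than its neighbors, so its score is far above 1, while the clustered points score close to 1.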

To do this in R, we used a package called OutlierDetection that contains many functions related to anomaly detection, choosing the ‘dens’ function. To deal with different sample sizes, we restricted our dataset to only include interviewers that had at least 5 respondents. Running this function on our dataset of average losses, we detected three outlier interviewers: q234, q456, and q495.

To determine if these are truly abnormal responses, we compared our results from the Grinnell Poll to randomly generated data. In this way, we can understand if the trends we notice in these outliers could be purely random, and not systematic. Below are violin plots, which can show density in the y variable, which in this case is percentiles of the average loss. These percentiles are for each question (so for q456 and their percentile value for Q1, it is the percentile of q456 average loss for Q1 compared to all other average losses for Q1). A troublesome pattern would be a fatter density at high percentiles and a skinnier density at low percentiles, as this would mean that they are more likely to be on the right side of the distributions above in the Average Loss section. We see that with q456. However, when looking at the randomly generated data on the right, we see a similar trend, albeit reversed, for random id 40.
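The percentiles behind the violin plots, and one possible way to build a random comparison, can be sketched in Python. The post does not say how the random data were generated. Permuting interviewer labels across respondents, as below, is one common approach and is an assumption here, as are all names and numbers.

```python
import random

def percentile_rank(value, population):
    """Percent of values in the population that are <= value."""
    return 100.0 * sum(v <= value for v in population) / len(population)

def question_percentiles(avg_loss, interviewer):
    """avg_loss: {interviewer: {question: mean loss}}.

    Returns the interviewer's percentile among all interviewers, per question.
    """
    result = {}
    for question, value in avg_loss[interviewer].items():
        population = [row[question] for row in avg_loss.values() if question in row]
        result[question] = percentile_rank(value, population)
    return result

def shuffle_interviewer_labels(respondents, seed=0):
    """respondents: list of (interviewer, losses). Permutes the interviewer labels."""
    labels = [code for code, _ in respondents]
    random.Random(seed).shuffle(labels)
    return [(label, losses) for label, (_, losses) in zip(labels, respondents)]

# Hypothetical average losses for four interviewers on one question.
avg_loss = {
    "a": {"Q1": 0.20}, "b": {"Q1": 0.30},
    "c": {"Q1": 0.60}, "d": {"Q1": 0.25},
}
print(question_percentiles(avg_loss, "c"))

respondents = [("a", {"Q1": 0.1}), ("a", {"Q1": 0.3}), ("b", {"Q1": 0.9})]
shuffled = shuffle_interviewer_labels(respondents, seed=1)
```

Recomputing the average losses and percentiles on the shuffled data gives a picture of how lopsided an interviewer’s percentiles can look when no interviewer effect exists.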

Impact of Interviewer Gender

To investigate the impact of the Interviewer’s Gender in the Grinnell National Poll, we looked at two things:

  • Is one gender more likely to get ‘not sure’ responses compared to the other gender? Does this relationship change based on the context of the question (gender-related vs not gender-related)?
  • Does the interviewer’s gender have a directional impact on the answers given by respondents? Is this influenced again by the context of the question?

To answer the first set of questions, we can look at the prevalence of not sure responses broken down by question and the gender of the interviewer. We have 50+ questions in the November 2018 Poll, so instead of showing the comparison for all questions, here we will just show the questions that had the largest difference between the prevalence of not sures for male interviewers compared to female interviewers as well as questions that are specifically related to gender. As we can see below, all of these have a higher proportion of female interviewers getting ‘not sure’ responses compared to their male counterparts, and this trend is largely seen throughout all questions. The same plot with all questions is shown in the appendix.

The first ten questions (Q13H through Q20K) are the questions that had the largest differences between the prevalence of not sure responses between the genders of the interviewer. Q13H references discrimination towards African Americans, and the Q20 questions reference what it means to be a ‘Real American’ (none of them seemingly relate to gender). The last two questions on the right, Q13E (Discrimination against Women) and Q18C (birth control pills related to health insurance) are the two questions from the November 2018 poll with clear connections to gender, and neither of them see significant differences in the prevalence of not sure responses.

One thing to keep in mind when viewing this is that ‘not sure’ responses are not independent within the same respondent. If an individual responds with ‘not sure’ to one question, they are more likely to respond ‘not sure’ to other questions. This means that the consistent trend across questions likely reflects the same set of respondents, and not necessarily a consistent effect of the interviewer’s gender on the prevalence of ‘not sure’ responses. However, even with this in mind, several questions have a statistically significant difference between the number of ‘not sure’ responses recorded by female interviewers and by male interviewers. Without domain knowledge, we do not want to comment on potential specific causes of these differences, although it could certainly be an important thing to investigate in a more thorough study.
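The post does not state which significance test was used for these per-question comparisons. One common choice for comparing two proportions is a pooled two-proportion z-test, sketched below in Python. The group sizes (530 respondents with female interviewers, 470 with male interviewers) come from the poll, but the ‘not sure’ counts are invented for illustration.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical counts: 53 'not sure' of 530 (female) vs 28 of 470 (male).
z, p_value = two_proportion_z_test(53, 530, 28, 470)
print(round(z, 2), round(p_value, 3))
```

This test treats respondents as independent within a single question. It says nothing about the dependence across questions noted above, so a run of significant results across many questions should not be read as many independent confirmations.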

In addition to the above work on the prevalence of ‘not sure’ responses, we also want to see if the interviewer’s gender influences the responses to specific questions, and when possible, investigate the direction of this association. To do this, we set up an ordinal logistic regression for each question, with the interviewer’s gender as our main explanatory variable of interest, and controlled for respondent demographics like race, gender, income level, religion, political party, and level of schooling.
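For reference, a proportional-odds (ordinal logistic) model of this kind can be written as follows. The notation is ours rather than the post’s: $Y_i$ is respondent $i$’s answer on a $J$-level ordered scale, $g_i$ is an indicator for the interviewer’s gender (the coding direction is an assumption), and $\mathbf{d}_i$ collects the respondent’s demographic controls.

```latex
\operatorname{logit} P(Y_i \le j \mid g_i, \mathbf{d}_i)
  = \zeta_j - \left( \beta_{\text{gender}} \, g_i + \boldsymbol{\gamma}^{\top} \mathbf{d}_i \right),
  \qquad j = 1, \dots, J - 1
```

In this parameterization, which is the one MASS::polr uses, a positive $\beta_{\text{gender}}$ shifts responses toward higher categories on the scale and a negative one toward lower categories. That is why the direction of a coefficient can only be interpreted once the ordering of each question’s response values is known.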

Through this modeling, we found a couple of questions where the gender of the interviewer had a significant impact on the response even after controlling for demographics of the respondent. Some of these have a tangential connection to gender (Q3: would you vote to re-elect Trump in 2020?), but others have seemingly no relation to gender (Q15: Does the U.S. have a moral responsibility to take in refugees?). In the graph below, in addition to the questions with a significant coefficient for the interviewer’s gender, we have also included the two questions in the poll with clear ties to gender: Q13E and Q18C. Neither of them shows a significant coefficient; however, the direction of both is in line with intuition (Q13E: smaller response value means that the respondent thinks women face a lot of discrimination; Q18C: larger response value means that the respondent believes removing birth control pills from health plans is discrimination). These findings mirror the findings above: the gender of the interviewer does appear to affect the responses, but not necessarily in predictable or intuitive ways.

Conclusions and Final Thoughts

In this post, we outlined several approaches to investigating the effect that the interviewer can have on a poll. Verifying the absence (or at least a limited presence) of interviewer effects can go a long way toward validating findings and analysis done with the poll, as you no longer have to continuously worry about the confounding impact of the interviewer. In this analysis of the November 2018 GCNP, we see that interviewers could have some impact on responses; however, where present, the impact is relatively small and seemingly no more than would be expected in a poll (especially one with 141 interviewers). Further, the small sample size for each interviewer limits any impact one particular interviewer could have on full-scale analysis, and we cannot distinguish a causal impact of the interviewers from random chance in the sampling. With all of this in mind, we recommend no specific remedial action in the context of the November 2018 GCNP.

The methods outlined in this post (while applied to the November 2018 GCNP) are generalizable to any poll with identified interviewers, and the exact methods and statistical techniques can certainly be tweaked and improved to fit the more specific circumstances of a survey or poll.

Acknowledgement

We would like to thank Professor Hanson for his assistance with domain specific knowledge surrounding the GCNP and polls/surveys in general and his help preparing this for publication. We would also like to thank Professor Miller for his help identifying techniques and best practices for the analysis outlined in this document.

Matthew Palmeri ’20 is a mathematics major at Grinnell College with a concentration in statistics.

Jasper Yang ’21 is a biology major at Grinnell College with a concentration in statistics.

Ethan Pannell ’21 is a political science major at Grinnell College with a concentration in statistics.

Appendix

These are the same density plots as were shown above in the Average Loss section, with red vertical lines indicating the location of interviewer q456. Again, the troublesome locations for these red lines are on the far right of each of these distributions.

This is the same graph from the investigation into the effect of the gender of the interviewer on the prevalence of not sure responses without any restrictions (so every question).

This is the same graph from above with interviewer gender model coefficient, but with all questions.
