Full disclosure: I approach this topic both as a social scientist and as someone who has taught a traditional introductory statistics class for over twenty years. I am, thus, myself part of the problem. While I am mainly following the dictates of some of the most popular textbooks, it is fully within my power to diverge from the book. When I do not do so, it is really my own fault: a sheep following the sheep dogs.
Our worst failure as statistics teachers is to teach as if all or most of the data that our students will engage with in their future careers are from simple random samples.
Not so! Most social science research, including that conducted by political pollsters, newspapers and market research companies, over-samples some groups in order to ensure that they are sufficiently represented in the sample to allow inferences to be made about them. This technique, called stratified random sampling, is discussed in most, but not all, introductory statistics textbooks.
Even if the initial intent is to use a simple random sampling technique, the systematic propensity of some sub-groups to be easier to find or more likely to cooperate with the researcher often means that, in the end, some groups are over-sampled relative to their proportions in the population. This problem may be exacerbated by recent declines in willingness to participate in surveys. According to the Pew Research Center for the People & the Press, in 1997 the response rate was about 36%, but by 2012 it was down to 9%, at least for some surveys. In order to make generalizations about the entire population, these over-samplings, whatever their cause, must be compensated for by giving less weight to individuals from the over-sampled groups and more weight to individuals from the under-sampled ones.
Most (but not all) statistical software will do this automatically using a weighting variable that the researcher must, of course, understand and set up appropriately. Despite discussing the stratified random sample as a research method, only a few introductory statistics texts provide information about how to calculate a weighting variable, or even how to analyze data from stratified random samples when a weighting variable is already provided with a downloaded data set.
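To make the idea of a weighting variable concrete, here is a minimal sketch in Python of one common way to build one: divide each group's share of the population by its share of the sample. The group names and all of the numbers are made up for illustration, and real survey weights (including those supplied with published data sets) are usually built from several adjustments, not just this one.

```python
# Hypothetical example: a survey that deliberately over-samples rural residents.
# All names and numbers here are illustrative assumptions, not real data.
population_share = {"urban": 0.80, "rural": 0.20}  # known population proportions
sample_counts = {"urban": 500, "rural": 500}       # respondents actually interviewed

n = sum(sample_counts.values())

# Weight for each group = population share / sample share.
# Over-sampled groups get a weight below 1, under-sampled groups a weight above 1.
weights = {
    group: population_share[group] / (sample_counts[group] / n)
    for group in sample_counts
}

for group, w in weights.items():
    print(f"{group}: weight = {w:.2f}")

# Sanity check: the weighted respondent count should equal the sample size.
weighted_n = sum(weights[g] * sample_counts[g] for g in sample_counts)
print(f"weighted n = {weighted_n:.0f}")
```

In this made-up case each rural respondent counts for 0.4 of a person and each urban respondent for 1.6, so the weighted sample has the same urban/rural mix as the population.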
The effect of failing to use a weighting variable when the sampling structure demands it can be great or small, critical to interpretation or not, but it always leads to statistical inaccuracies. The two graphs above compare the use of weighted and unweighted data gathered by the Pew Research Center poll on U.S. Religious Knowledge. The 5 point difference in the results shown above may not seem critical in this case, but a similar difference in polling data for an election certainly might.
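A small sketch shows how such a gap can arise. The code below compares an unweighted and a weighted proportion for a yes/no question. The responses, group sizes, and weights are hypothetical numbers chosen for illustration; they are not the Pew data, and the function name `weighted_proportion` is my own.

```python
def weighted_proportion(responses, weights):
    """Share of 'yes' answers (coded 1) when each respondent counts by their weight."""
    total_weight = sum(weights)
    return sum(r * w for r, w in zip(responses, weights)) / total_weight


# Hypothetical sample: 500 urban and 500 rural respondents (rural over-sampled).
# 1 = "yes", 0 = "no". Urban: 300 yes of 500; rural: 200 yes of 500.
responses = [1] * 300 + [0] * 200 + [1] * 200 + [0] * 300

# Assumed weights: urban respondents 1.6, rural respondents 0.4
# (population share divided by sample share, with an 80/20 population split).
weights = [1.6] * 500 + [0.4] * 500

unweighted = sum(responses) / len(responses)
weighted = weighted_proportion(responses, weights)

print(f"unweighted: {unweighted:.1%}")
print(f"weighted:   {weighted:.1%}")
```

With these invented numbers the unweighted estimate is 50% and the weighted estimate is 56%, a six point gap that comes entirely from ignoring how the sample was drawn.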
The bright spot is that some (but not all) instructors in Political Science, Economics, Sociology, Anthropology or other disciplines DO tell students how to weight data and require them to do so in data analyses, when appropriate. Unfortunately, these classes usually do not have the time to also investigate more fully the rationales behind weighting and ways to calculate your own weighting variables.
This post discusses what I consider introductory statistics courses' biggest disservice to social studies students. The other disservice is an overemphasis on parametric statistics, often to the extent of not even explaining the existence of non-parametric alternatives. More about that later…
In the near future, look for a post with a suggested lesson for teaching undergraduate introductory statistics students how to deal with weighting variables.