A Tool for Visualizing Regression Models

Will sales of a good increase when its price goes down? Does a country’s life expectancy have anything to do with its GDP? To answer questions like these about how different measures relate to one another, researchers and analysts often use regression techniques.

Linear regression is a widely used tool for quantifying the relationship between two or more quantitative variables. The underlying premise is simple: it is no more complicated than drawing a straight line through a scatterplot! This simple tool is nevertheless used for everything from market forecasting to economic models. Because it is so pervasive in analytical fields, it is important to develop an intuition for regression models and what they actually do. To that end, I have developed a visualization tool that allows you to explore the way regressions work.

You can import your own dataset or choose from a selection of others; the default dataset contains information on a selection of movies. Suppose you want to know the strategy for making the most money from a film. In regression terminology, you are asking which variables (factors) might be good predictors of a film’s box office gross.

The response variable is the measure you want to predict, which in this situation will be the box office gross (BoxOfficeGross). The attribute that you think might be a good predictor is the explanatory variable. The budget of the film might be a good explanatory variable to predict the revenue a film might earn, for example. Let’s change the explanatory variable of interest to Budget to explore this relationship. Do you see a clear pattern emerge from the scatterplot? Can you find a better predictor of BoxOfficeGross?

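If you suspect that other variables also influence the response, you can include them in your model as control variables. The model then estimates the relationship between the explanatory variable and the response while accounting for the effects of those other variables.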

Below the scatterplot are two important measures that are used in evaluating regression models: the p-value and the R2 value. The p-value tells us how likely we would be to see a relationship at least as strong as the one in our data if there were really no relationship at all, that is, if the pattern arose just by chance. In the context of a regression model, it suggests whether the specific combination of explanatory and control variables really does seem to be related to the response variable in some way: a lower p-value means that there seems to be something actually going on with the data, as opposed to the points being just scattered randomly. The R2 value, on the other hand, tells us what proportion of the variability in the response (predicted) variable is explained by the explanatory (predictor) variable; in other words, how well the model fits the data. If a model has a low R2 value and is incredibly bad at predicting our response, it might not be such a good model after all.
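To make the R2 value concrete, here is a minimal sketch of how a straight-line fit and its R2 can be computed by hand. It is not the code behind the visualization tool; the function name fit_line and the budget and gross figures are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus its R^2 value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (unexplained variation) / (total variation in the response).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical (budget, gross) pairs in millions of dollars, made up for this example.
budgets = [10, 20, 30, 40, 50]
grosses = [25, 38, 70, 85, 110]

slope, intercept, r_squared = fit_line(budgets, grosses)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

Because these made-up points lie close to a straight line, the R2 comes out near 1; scattering the gross values more widely would push it toward 0.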

[Figure: scatterplot of RottenTomatoesScore vs. RunTime]

If you want to predict a movie’s RottenTomatoesScore from its RunTime, for example, the incredibly small p-value might tempt you to conclude that, yes, longer movies do get better reviews! However, if you look at the scatterplot, you might get the feeling that something’s not right. The R2 value tells us this other side of the story: though RunTime does appear to be correlated with RottenTomatoesScore, the relationship is too weak to support useful predictions!
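The combination of a tiny p-value and a low R2 is easy to reproduce with synthetic data. The sketch below does not use the movie dataset: it generates 2,000 made-up points with a weak linear trend, and it approximates the p-value with a normal distribution in place of the exact t distribution, which is only reasonable for large samples. All names and numbers are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for the slope of a one-predictor regression.

    Uses the t statistic r * sqrt((n - 2) / (1 - r^2)) but looks it up in a
    normal distribution, an approximation that assumes n is large.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * NormalDist().cdf(-abs(t))


# Synthetic stand-ins for standardized run times and review scores.
rng = random.Random(0)
n = 2000
runtimes = [rng.gauss(0, 1) for _ in range(n)]
scores = [0.2 * x + rng.gauss(0, 1) for x in runtimes]  # weak trend, lots of noise

r = correlation(runtimes, scores)
r_squared = r * r  # with a single predictor, R^2 is the squared correlation
p_value = approx_p_value(r, n)
print(f"R^2 = {r_squared:.3f}, approximate p-value = {p_value:.2e}")
```

With this many points, even a faint trend is very unlikely to be a fluke, so the p-value is tiny, yet the trend still explains only a small share of the variation in the scores.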

Play around with the default dataset provided, or use your own dataset by going to the Change Dataset tab at the top of the page. This visualization tool can be used to develop an intuition for regression analysis, to get a feel for a new dataset, or even in classrooms as a visual introduction to linear regression techniques.

Modeling Population Growth in Excel

The Malthus and Condorcet Equations, simple formulas that model relatively complex trends in population growth, are now accessible with an Excel calculator that allows the user full control over every component of the equations. Students can use the Excel file to model human population growth under the assumption that a human carrying capacity exists.

The Malthus Equation expresses the growth rate of a population as a function of the current population size and current carrying capacity. Specifically, the growth rate of a population is equal to a Malthusian parameter multiplied by the current population size multiplied by the difference between the current carrying capacity and the current population size. This relationship produces a high growth rate once a population is large enough to reproduce at its full potential, but a low growth rate when the population is very small or when it is nearing its carrying capacity and feeling the effect of constrained resources. The Malthusian parameter is almost invariably between zero and one: a negative Malthusian parameter would lead to a population’s gradual extinction, while a Malthusian parameter greater than one would lead to explosive population growth that would greatly exceed the carrying capacity. In the latter situation, unrealistically rapid and extreme periods of growth and contraction would ensue.
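The shape described above can be checked with a few lines of code. This is only an illustrative sketch: the function name and the values of the Malthusian parameter, the carrying capacity, and the three population sizes are assumptions, not figures from the spreadsheet.

```python
def malthus_growth_rate(r, population, capacity):
    """Growth rate = Malthusian parameter * population * (capacity - population)."""
    return r * population * (capacity - population)


# Hypothetical values: r = 0.1 and a fixed carrying capacity of 10.
r = 0.1
capacity = 10.0

# A very small population, one at half the capacity, and one close to the capacity.
for population in (0.1, 5.0, 9.9):
    rate = malthus_growth_rate(r, population, capacity)
    print(f"population={population:4.1f}  growth rate={rate:.3f}")
```

For a fixed carrying capacity, the growth rate is small at both extremes and peaks when the population is half the carrying capacity.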

The Condorcet Equation expresses the growth rate of the carrying capacity of a population as equal to the growth rate of the population multiplied by a constant termed the Condorcet parameter. The logic behind this mathematical relationship is that the carrying capacity rises or falls in proportion to the growth rate of the population, because an additional person can have a positive or negative effect on the carrying capacity. A Condorcet parameter greater than one describes a society where an additional individual increases the number of people that can be supported by more than one, even after taking into account the resources that individual consumes; this could happen when there are increasing returns to labor. If doctors cure diseases better when more of them work together, this is reflected by a Condorcet parameter greater than one. A Condorcet parameter between zero and one is most realistic for human populations because another person will probably increase the carrying capacity, but by less than one additional person. A negative parameter implies that an additional person would actually lower the carrying capacity; perhaps every additional person would consume natural resources at a rate greater than the previous individual’s rate.
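One way to see why the size of the Condorcet parameter matters is to integrate the Condorcet Equation directly. The sketch below writes P(t) for the population, K(t) for the carrying capacity, c for the Condorcet parameter and r for the Malthusian parameter; P_0 and K_0 are labels introduced here for the starting values, and the derivation assumes r > 0 and K_0 > P_0 > 0.

```latex
\begin{align*}
\frac{dK}{dt} = c\,\frac{dP}{dt}
  \;&\Longrightarrow\; K(t) = K_0 + c\,\bigl(P(t) - P_0\bigr), \\
K(t) - P(t) &= (K_0 - P_0) - (1 - c)\,\bigl(P(t) - P_0\bigr), \\
K(t) = P(t) \;&\Longleftrightarrow\; P(t) = P^{*} = \frac{K_0 - c\,P_0}{1 - c}
  \qquad (c \neq 1).
\end{align*}
```

Under these assumptions, when c is below one the gap between carrying capacity and population shrinks as the population grows, so growth levels off as the population approaches P*. When c equals one the gap never changes and the population grows exponentially at rate r(K_0 - P_0). When c is greater than one the gap widens as the population grows, so growth keeps accelerating.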

As Cohen (1995, Science 269: 341-346) points out, the equations are not necessarily realistic models of human population growth. There is no consensus about whether or not a human carrying capacity exists. In theory, we as a species might be able to continually develop technology at such a rate that we never approach a carrying capacity. A slowdown in overall human population growth is more likely due to a global increase in income per capita that leads to altered reproductive strategies.

Figure 1: with r=0.1 and c=0.1 as parameters, the population experiences a positive but steadily decreasing growth rate because the carrying capacity increases at 1/10th the rate of population growth, and since population growth slows as the population size approaches the carrying capacity, we observe almost asymptotic behavior. This is a realistic pattern for human population growth if a carrying capacity exists.

The calculator defines the Malthus Equation as dP(t)/dt=rP(t)[K(t)-P(t)] and the Condorcet Equation as dK(t)/dt=c dP(t)/dt (see Cohen 1995: 343). The user may enter values for r (the “Malthusian parameter”), c (the “Condorcet parameter”), the initial values of P(t) (population size) and K(t) (carrying capacity), t_0 (the starting time for the model) and dt (the length of one interval in time); together these determine all of the future changes in population size. The rates of change of population and carrying capacity at time t, dP(t)/dt and dK(t)/dt respectively, are determined by the equations. The Malthusian and Condorcet parameters are held constant in a growth model, on the assumption that no exogenous shocks change the nature of population or carrying capacity growth. Because of this, they do not vary as a function of t.
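For readers who would rather experiment in code than in Excel, the two equations can be stepped forward in time with a few lines of Python. This is a minimal sketch that assumes a simple forward Euler update (each new value is the old value plus its rate of change times dt); the spreadsheet may use a different update rule. The values r=0.1 and c=0.1 match the figure's parameters, while the starting population, starting carrying capacity, dt and number of steps are assumptions chosen for illustration.

```python
def simulate(r, c, p0, k0, t0=0.0, dt=0.1, steps=1000):
    """Step the Malthus and Condorcet Equations forward with forward Euler.

    Returns a list of (t, population, carrying_capacity) tuples.
    """
    t, p, k = t0, p0, k0
    history = [(t, p, k)]
    for _ in range(steps):
        dp_dt = r * p * (k - p)  # Malthus Equation: dP/dt = r P (K - P)
        dk_dt = c * dp_dt        # Condorcet Equation: dK/dt = c dP/dt
        p += dp_dt * dt
        k += dk_dt * dt
        t += dt
        history.append((t, p, k))
    return history


# r and c follow the figure; p0, k0, dt and steps are hypothetical choices.
history = simulate(r=0.1, c=0.1, p0=1.0, k0=2.0)

# Print every 200th step to show the population leveling off.
for t, p, k in history[::200]:
    print(f"t={t:6.1f}  P={p:.4f}  K={k:.4f}")
```

With these inputs the population rises quickly at first and then flattens as it closes in on the slowly rising carrying capacity. Changing c to a value above one in the call to simulate shows the runaway growth described earlier, although a small dt is needed to keep the Euler steps stable in that case.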

To explore the Malthus-Condorcet calculator, please follow this link to an automatic download of the Excel spreadsheet containing the calculator.

Data Across the Curriculum: Using Geospatial Data to Illustrate Historical Change

History is a discipline that is founded on looking at changes over time, and for Sarah Purcell, Professor of History, data is an essential tool in measuring that change. More specifically, Purcell employs geospatial data to investigate historical change in both time and space for her Civil War & Reconstruction class, which focuses on the causes, progress, and consequences of the Civil War and Reconstruction with an emphasis on race, politics, economics, gender, and military conflict.

Purcell uses a stair-step approach to introduce students to geospatial data. Students first use Google Maps to compare Civil War battleground locations to the locations of their hometowns, then investigate how other historians have used data, especially economic and demographic data, in tandem with historical narrative. Finally, Purcell has her students work with ArcGIS, an analytical map-making software, to visualize geographic trends in various historical data. For example, students in the class explore data on Black soldiers who enlisted in the U.S. Army during the Civil War in an in-class exercise (Figure 1) that encourages them to think critically about military data.




To Sarah Purcell, data is important due to its wide applicability: using data in the context of history teaches a valuable lesson about how data can enhance just about any discipline. Moreover, the history field offers a broad array of data types, both qualitative and quantitative. While Purcell admits that some students work with data more easily than others, she stresses that the struggle is important for internalizing quantitative literacy and getting accustomed to confronting data, an essential skill. The experience with data that students gain in her courses has affected them in a variety of ways: some have gone on to get further training in ArcGIS via formal coursework, and others have secured jobs, reporting that employers are attracted to data skills in historical work.

Data Across the Curriculum: Teaching Data Skills in Sociology

Casey Oberlin, Assistant Professor of Sociology, understands the importance of using data in the classroom, especially in a discipline such as Sociology, which those outside the discipline commonly view as a field with less real-life application of hard skills (e.g. data analysis). This conception is far from the truth, and Oberlin’s approach to data in the classroom gives her students a holistic and interactive view of data analysis in the field that shows how data is part and parcel of the discipline.

Oberlin uses both her introductory Sociology courses and Research Methods courses as opportunities for students to immerse themselves in the data-rich, multi-tiered research process of the field. Data in Sociology is very diverse, as it involves both quantitative and qualitative measures, so Oberlin’s approach focuses on exposing students to the vast array of data types, as well as the techniques, technologies, and methods used to interpret each type.


At the introductory level, Oberlin focuses on data consumption as a first step to data concepts. Students study infographics (see Figure 1) and other data visualizations to learn how to present data and interpret the data being presented. Oberlin’s Research Methods courses are reserved for her experiential approach to data, which teaches students two data software programs throughout the semester, one quantitative (SPSS) and the other qualitative (NVivo), shows students the wide range of data used in Sociology, and has students grapple with the entire research process for themselves. In Research Methods, students create research questions, form hypotheses and expectations, clean or assess the dataset, analyze their results, and present their work in a professional manner. Her heavy guidance through the research process helps to mitigate students’ understandable anxiety about trying new techniques and presenting their ongoing work, setting them up to then develop their own sustained research project throughout the semester. Oberlin states that this immersive method is beneficial to and enthusiastically received by students, as the practice in research opens doors to internships, jobs, and grad schools.

All in all, Casey Oberlin’s use of data in the classroom gives students exposure to the intensive research process that is integral to Sociology and teaches important data skills and concepts that are applicable both in the real world and in a classroom setting.