Throwback Thursday: Big Data in the Early 20th Century

Last week, we talked about the 1888 invention of one of the first tools that could be used to process “big data,” the Hollerith Machine. A fascinating book published in 1935, Practical Applications of the Punched Card Method in Colleges and Universities, records some of the “big data” research that academics undertook using this new technology, including an effort by an anthropology professor at Harvard University to determine precise anatomical profiles for various classes of criminals. He and his research team recorded information about 125 biometric variables for 17,000 criminals, then used Hollerith machines to look for correlations in the data. I’ll let Prof. E. A. Hooton tell you more about this in his own words:

In the course of elaborating our criminal data, one process was performed by the Hollerith sorter which in its complexity is probably unique in anthropometric research. In our series of native white criminals of native parentage there is included a group of 414 robbers. These robbers display as a group a number of statistically significant excesses and deficiencies of certain categories of morphological features…. It was desired to ascertain how many individual robbers manifested each one of every mathematically possible combination of these nine morphological peculiarities. Since there are 512 possible combinations of the presence and absence of these characters, the sorting task involved was stupendous and consumed several weeks of the entire working time of the sorter…. The outcome of the research was a conclusive demonstration that, by taking a sufficient number of peculiarities of the robber group in combination and selecting all of the individuals who possessed that combination, it was possible to pick out a type which was 100 per cent robber. At the same time it was demonstrated that only one robber out of 414 showed this complete and exclusive type combination. It was therefore apparent that morphological type combinations were of no practical use in determining the offenses of criminals, so far as our particular data were concerned.(1)
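The combinatorics behind Hooton's "stupendous" sorting task are simple to state: nine present/absent traits yield 2⁹ = 512 possible profiles. A minimal Python sketch of the same counting exercise, using randomly generated stand-in records rather than Hooton's actual data:

```python
from itertools import product
from random import Random

# Nine binary morphological traits give 2**9 = 512 possible profiles.
N_TRAITS = 9
all_profiles = list(product((0, 1), repeat=N_TRAITS))

# Simulated stand-ins for the 414 robbers (random values, NOT Hooton's data).
rng = Random(1935)
robbers = [tuple(rng.randint(0, 1) for _ in range(N_TRAITS))
           for _ in range(414)]

# The sorting task: count how many individuals show each exact profile.
counts = {profile: 0 for profile in all_profiles}
for person in robbers:
    counts[person] += 1

print(len(all_profiles), "possible profiles for", len(robbers), "individuals")
```

A task that takes a modern laptop milliseconds consumed "several weeks of the entire working time of the sorter." Note too that with 512 possible profiles and only 414 individuals, many profiles must match no one at all, which helps explain why the "type which was 100 per cent robber" described exactly one man.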

While this project is notable as much for how ridiculous its premise sounds to us 80 years later as for the scale of its undertaking, other chapters in the book record efforts that would not be out of place today, such as attempts to code information about large numbers of hospital patients in an effort to learn more about the causes of mortality, or a survey of over 30,000 businesses in three states to gauge the impact of newly imposed sales taxes. I’ll let Edwin H. Spengler, author of the latter chapter, conclude with a statement that, with only minor changes, could easily appear in any modern work on “big data”:

Much as the compilation of certain statistical data may be desired, however, the expense and the time involved in sorting and tabulating the information, have frequently deterred individuals from going ahead with a given project…. To a large extent, the introduction of mechanical methods of counting, sorting and tabulating numerical facts has eliminated these difficulties. Electric machinery, capable of performing routine operations at the rate of several hundred per minute, has increased the speed and lowered the expense of preparing statistical tabulations. This has resulted in a broadening of the field of statistical research and analysis and has stimulated the projection of studies which, without the use of such equipment, would no doubt have been considered impossible or impractical of accomplishment. (2)

What would Spengler and his colleagues have thought about today’s supercomputers, which can perform more than 10¹² operations per second? And what will researchers 80 years from now view as quaint when looking back at our “big data” research?
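For a rough sense of the gap, take Spengler's "several hundred" operations per minute as about 500, and a modern supercomputer as 10¹² operations per second (both round illustrative figures, not measurements):

```python
# Spengler's "several hundred" operations per minute vs. a modern
# supercomputer's ~10**12 operations per second (illustrative round figures).
hollerith_ops_per_sec = 500 / 60          # ~8.3 operations per second
supercomputer_ops_per_sec = 1e12

speedup = supercomputer_ops_per_sec / hollerith_ops_per_sec
print(f"Roughly a {speedup:.0e}x speedup")
```

That is a speedup on the order of a hundred billion times, in about 80 years.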

(1) E. A. Hooton, “Anthropology,” in G. W. Baehne, editor, Practical Applications of the Punched Card Method in Colleges and Universities. New York: Columbia University Press, 1935, p. 387.
(2) Edwin H. Spengler, “Economics,” in G. W. Baehne, editor, Practical Applications of the Punched Card Method in Colleges and Universities. New York: Columbia University Press, 1935, pp. 397–398.

Throwback Thursday: Big Data in the Late 19th Century

“Big data” is one of the buzzwords of 21st century research. In the sciences, it has been the subject of a special issue of Nature; in the social sciences and humanities, the National Endowment for the Humanities has sponsored a “Digging into Data Challenge” to encourage “big data” research in these fields. Reports on the impact of big data on research have been written by everyone from the Council on Library and Information Resources to Microsoft Research. Much of this writing emphasizes the unprecedented ability that ever-more-powerful computers have given us to collect and analyze massive quantities of data.

But tools for working with “big data” long predate the invention of the modern, integrated-circuit-based computer*. The Hollerith Machine, a “computer” that could rapidly tabulate information recorded on punched cards, was invented in 1888 to solve a pressing big data problem of the day: how to tabulate the Decennial Census data gathered from the U.S.’s rapidly growing population. This sort of punched card technology was used to process Census data for fifty years.

Hollerith Census Machine pantograph
You can read more about how the Hollerith Machine worked in the “History” section of the U.S. Census Bureau’s website.

c.1900 Hollerith Census Tabulator

Not long after the Hollerith Machine was invented, academic researchers were considering how to apply this new tool and its successors to their own “big data” research problems. In our next Throwback Thursday post, we’ll look at some research from 1935 that used punchcards to analyze “big data.”

*Invented by Grinnell alumnus Robert Noyce.

Images from flickr users Marcin Wichery and Erik Pitti respectively, with no changes made, under the Creative Commons License 2.0.

Visualizing Marriage and Social Inequality

With Valentine’s Day just around the corner, it’s a good time to take a look at data on marriage in the United States. It’s been a hot topic lately not just among demographers and sociologists, but also among economists and others who are worried about economic inequality. Although it’s now old news that marriage rates in the United States are declining, with people waiting until later to marry and an increasing number not marrying at all, the class differences that have appeared in marriage rates have not been as widely discussed. DASIL has created two visualizations that let you explore aspects of these changes from the 1970s to the present.


Less-educated Americans are now less likely to be married than more-educated Americans. The visualization above shows marital status by education and gender for Americans from 1976 to the present, based on data from the General Social Survey.

Americans who are not married tend to have lower incomes than those who are. This visualization shows the median income of Americans age 18 and over by marital status, race, and gender, from 1974 to the present, based on data from the Current Population Survey.

Visualizing Disease Outbreaks: A Question of Scale?

Vaccinations are a hot-button issue right now as measles outbreaks crop up throughout the United States. Measles, mumps, rubella, whooping cough, and polio are all deadly diseases that can be easily prevented with vaccines. Outbreaks of these diseases have been occurring worldwide for a long time, but they have been increasing in the U.S. while declining in other countries, according to the video below:


Visualizing the Budget of the United States

Each year, the President of the United States follows the State of the Union Address with the budget proposal. But what does the US budget look like?

The New York Times created a visualization of the 2012 budget that breaks down the spending by size and color. The larger the rectangle, the more money spent. Green symbolizes an increase in that budget area, and red shows a budget cut.

A similar visualization can be found for the 2014 budget at The Washington Post. Again, size is used to represent the amount spent, but their visualization also includes revenue information. The Washington Post also includes a breakdown of mandatory vs. discretionary spending.

Here at DASIL, we have created a visualization tracking budget spending over time, including estimates until 2019, using data from The White House. You can use our tool to compare outlays for various government agencies.
For example, let’s examine spending on the Department of Education and the Department of Homeland Security:

Outlays by Agency: Department of Homeland Security vs. Department of Education
