Adventures With Collaborative OCR Projects: The Scarlet and Black Plain Text Files

The Digital Liberal Arts Collaborative (DLAC), along with the College Libraries, DASIL, and other partners, is excited to launch a public archive of plain-text files from Grinnell College’s student newspaper, the Scarlet and Black (1894-2010).

The project is available via a public GitHub repo.

History:

As described in Grinnell Magazine: “Since its first publication on Sept. 12, 1894, the Scarlet & Black has served as a vital source of up-to-date news on campus, an important record of Grinnell College and a rich historical resource.”

In 2014, former Samuel R. and Marie-Louise Rosenthal Librarian of the College Richard Fyffe and Special Collections Librarian and Archivist Chris Jones contracted ArcaSearch to digitize past issues of the S&B and place them in a publicly accessible database. This effort was described in articles published in Grinnell Magazine and the S&B.

Screenshot of Compass Research Systems interface for Scarlet and Black digital archive.

That database launched in October 2014 and remains the primary access point for individuals interested in searching and browsing the S&B.

Methods:

As part of the Digital Studies Innovation Fund grant, Sarah J. Purcell (L.F. Parker Professor of History & Grinnell Class of 1992) and Erik Simpson (Samuel R. and Marie-Louise Rosenthal Professor of Humanities) began initial efforts to procure plain-text data of the digitized S&B. This effort was supported by Julia Bauder (Social Studies and Data Services Librarian) and Liz Rodrigues (Humanities and Digital Scholarship Librarian).

In Spring 2018, Sarah J. Purcell worked with student research assistant Papa Kojo Ampim-Darko (Grinnell College Class of 2019) to explore possibilities for converting the digitized S&B issues into plain-text files that could be used for computational text analysis and other digital research methods. This work was supported by Julia Bauder and Katherine Walden (Digital Liberal Arts Specialist). Ampim-Darko’s work was supported by the L. F. Parker Chair in History student research fund.

OCR conversion of the TIFF image scans was accomplished using Tesseract 4 on a virtual high-performance computing (HPC) cluster. HPC support was provided by Mike Conner (Linux Administrator).
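For readers who want to try a similar conversion, a batch run of the Tesseract command-line tool might be sketched as follows. This is not the script used for this project: the function names, the directory layout, and the `-l eng` language setting are assumptions made for illustration, and cluster job scheduling is left out entirely.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(tiff_path, out_dir, lang="eng"):
    """Build the argument list for one Tesseract run.

    Tesseract takes an input image and an output base name,
    and adds the ".txt" extension to the base name itself.
    """
    tiff_path = Path(tiff_path)
    output_base = Path(out_dir) / tiff_path.stem
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang]


def ocr_directory(tiff_dir, out_dir):
    """Run Tesseract on every .tif/.tiff file in tiff_dir (hypothetical layout)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for tiff_path in sorted(Path(tiff_dir).iterdir()):
        if tiff_path.suffix.lower() in {".tif", ".tiff"}:
            # check=True stops the batch if Tesseract reports an error.
            subprocess.run(build_tesseract_command(tiff_path, out_dir), check=True)
```

Keeping the command construction separate from the loop makes it easy to inspect the exact command before committing cluster time to a full run.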

Two spell checkers, aspell and hunspell, were used from Python to clean the OCR output. The results of both cleaning processes are included in the project’s GitHub repo, along with the Python scripts that produced them.
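To give a sense of what dictionary-based cleaning involves, here is a minimal sketch that uses only the Python standard library. It is a stand-in, not the project's method: it swaps aspell and hunspell for `difflib` and a tiny hypothetical word list, and the `cutoff` value is an arbitrary assumption. The actual scripts in the repo should be treated as the reference.

```python
import difflib
import re

# Tiny stand-in word list (hypothetical); a real run would use a full dictionary.
KNOWN_WORDS = {"the", "college", "students", "football", "game", "saturday"}


def clean_token(token, known_words=KNOWN_WORDS, cutoff=0.8):
    """Return token unchanged if known, else the closest known word, if any.

    Replacements come back in lowercase, since the word list is lowercase.
    """
    lower = token.lower()
    if lower in known_words:
        return token
    # sorted() keeps the candidate order, and so the result, deterministic.
    matches = difflib.get_close_matches(lower, sorted(known_words), n=1, cutoff=cutoff)
    return matches[0] if matches else token


def clean_text(text, known_words=KNOWN_WORDS):
    """Apply clean_token to each alphabetic run, leaving other characters alone."""
    return re.sub(r"[A-Za-z]+", lambda m: clean_token(m.group(0), known_words), text)
```

Words with no close match are left as they are, which matters for a newspaper archive full of personal names and place names that no general dictionary will contain.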

Repo Structure:

Screenshot of Scarlet and Black plain-text GitHub repository.

  • aspell_cleaner.py and hunspell_cleaner.py: the Python scripts used to clean the Tesseract output.
  • original_ocr_results folders: the direct output of the Tesseract OCR process.
  • ocr_results_aspell folders: the output from the aspell cleaning process.
  • ocr_results_hunspell folders: the output from the hunspell cleaning process.
  • unstructured folders: the entire batch of txt files, with no sub-folder structure.
  • structured folders: the txt files organized by decade, year, and month.

The file naming convention (maintained from ArcaSearch digitized files) adheres to the following structure:

  • Sample file name: usiagrc_scb_1894_09_12_50_000_00001-00000_000.txt
    • Issue year: 1894
    • Issue month: 09 (September)
    • Issue date: 12 (Twelfth)
    • Issue page: 00001
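
The naming convention above lends itself to automated parsing. The sketch below pulls out the four documented fields and builds a decade/year/month path. The regular expression assumes that the undocumented parts of the name are always digits, and the folder names it produces (such as "1890s") are hypothetical and may not match the repo's actual structured folders.

```python
import re
from pathlib import PurePosixPath

# Only year, month, day, and page are documented; the other
# numeric fields are matched but not interpreted (assumption).
FILENAME_PATTERN = re.compile(
    r"^usiagrc_scb_(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})_"
    r"\d+_\d+_(?P<page>\d{5})-\d+_\d+\.txt$"
)


def parse_filename(name):
    """Return year, month, day, and page as integers, or None if no match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def structured_path(name):
    """Build a decade/year/month path; the folder names are hypothetical."""
    fields = parse_filename(name)
    if fields is None:
        raise ValueError(f"Unrecognized file name: {name}")
    decade = fields["year"] // 10 * 10
    month = f"{fields['month']:02d}"
    return PurePosixPath(f"{decade}s", str(fields["year"]), month, name)
```

A parser like this also makes it simple to filter the unstructured folders by date range before running any text analysis.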

Acknowledgements:

This project would not have been possible without the indefatigable efforts of Grinnell alum Papa Kojo Ampim-Darko, supported by Sarah Purcell. Julia Bauder provided invaluable technical expertise in getting the project off the ground and provided access to the digitized S&B files. Katie Walden served as de facto project manager, coordinating the various units and individuals involved. The support of Mike Conner and Sam Rebelsky (Professor of Computer Science) provided necessary access to HPC resources. Insight provided by Jerod Weinman (Associate Professor of Computer Science) helped us select Tesseract as an OCR resource. Additional thanks to Jarren Santos (Data Scientist and Grinnell College Class of 2017) of the Data Analysis and Social Inquiry Lab for his support of this project.
