In the spring of 2016, the German Department and the Bancroft Library partnered to pursue a collaborative research grant through the Digital Humanities at Berkeley program. Jeroen Dewulf, Associate Professor in the UC Berkeley Department of German and Director of Dutch Studies, and David De Lorenzo, Associate Director of The Bancroft Library at UC Berkeley, formed a peer-to-peer collaboration, with Dewulf as the authority on the Dutch colonial documents and De Lorenzo as the architect of the information and technology aspects of the project.
The project plan was to prepare a digital research collection from selected primary source materials in the Engel Sluiter Historical Documents Collection. Working with students from UC Berkeley, the project sought to design a web presentation for the “Colonial New Netherland” collection using a selection of documents in the Sluiter collection that focus on the seventeenth-century Dutch colony of New Netherland, which later became New York. These unique and valuable materials include many typed transcriptions of archival materials that were previously inaccessible to most researchers because of the difficulties in reading seventeenth-century paleography. The project sought to digitize, extract, and clean the historic text in order to present “research ready” text that supports natural language and machine processing of these archival documents. The project was awarded funding through the Digital Humanities at Berkeley program, with additional support from the Bancroft Library, for the academic year 2016-2017. The project hired Julie van den Hout as the Project Consultant and Lauren Dooley as the WordPress designer. Prior to the start of the project, David De Lorenzo left UC Berkeley and was replaced by Mary Elings, Assistant Director of the Bancroft Library, as the project lead at Bancroft.
The Engel Sluiter Historical Documents Collection at The Bancroft Library is a research collection on the seventeenth-century Atlantic, assembled by former UC Berkeley historian Engel Sluiter (1906-2001). It consists predominantly of copies and transcriptions of Spanish, Portuguese, Dutch, French, and English primary source materials from archives in Europe, the United States, the Caribbean, and Latin America. The processing of the collection, which consists of 98 boxes containing some 160,000 pages of documents, was generously funded by the New Netherland Institute. The documents related to the seventeenth-century Dutch colony of New Netherland are held in seven folders (carton 13, folders 12-18).
The 823 documents in these seven folders were digitized by Library personnel as TIFF files, and the digitized versions were then run through Optical Character Recognition (OCR) to generate text files. The Project Consultant, Julie van den Hout, who is fluent in Dutch, manually reconciled and corrected the OCR text files against the original digitized material. She was assisted by Quinn Dombrowski, of Berkeley Research Computing (BRC), in using the OCR Virtual Desktop supported by BRC’s Analytic Environments on Demand (AEoD) service. Van den Hout used this service to extract the corrected text and recombine it into new PDF files. While the project had intended to analyze the text with “ArchExtract,” a text analysis tool developed by a previous digital humanities fellow, the texts were instead run through “Voyant Tools,” a web-based text analysis environment, to explore the texts and determine whether they were research ready.
The results of the project were published on a project website, created by Lauren Dooley, which outlines the project and presents the final research products. The products comprise the corrected texts, which are presented as PDF files on the project website for use by researchers interested in doing text analysis on these archival documents. A few examples of the Voyant text analysis are presented to demonstrate the potential of these texts for additional text-based research. The text files can be used with other natural language processing techniques, such as topic modeling, entity extraction, and keyword extraction, to explore and expand access to the documents. In addition to the project website presentation, the corrected texts, together with the archival page images, are published and fully text searchable through Calisphere, a digital collection platform hosted by the California Digital Library.
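As a rough illustration of the kind of keyword extraction mentioned above, the following minimal Python sketch counts the most frequent terms in a plain-text string. It is not part of the project's workflow: the function name `top_terms`, the short stopword list, and the sample sentence are all hypothetical, and the sample text is invented rather than taken from the collection.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (Dutch and English);
# a real analysis would use a much fuller list.
STOPWORDS = {"de", "het", "een", "en", "van", "in", "the", "of", "and", "to"}


def top_terms(text, n=5):
    """Return the n most frequent terms in text, skipping stopwords."""
    # Lowercase the text and keep runs of letters only (no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Drop stopwords and very short tokens, then count what remains.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)


# Invented sample sentence for illustration only; not a document from the collection.
sample = (
    "De schepen van de Compagnie zeilden naar Nieuw Nederland. "
    "De Compagnie zond schepen en kolonisten naar Nieuw Nederland."
)
print(top_terms(sample, 5))
```

In practice, a researcher would read the text from one of the corrected files instead of a hard-coded string, and would likely substitute a dedicated natural language processing library for this simple frequency count.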
Mary W. Elings 2017-09-22