Die Grenzboten on its Way to Virtual Research Environments and Infrastructures

  • Manfred Nölte State and University Library Bremen
  • Martin Blenkle State and University Library Bremen

Abstract

The State and University Library Bremen (SuUB) is dedicated to the digitization of its historical collections. Digitization is an important instrument for improving the accessibility of valuable information contained in fragile historical documents. It facilitates academic research and teaching and is indispensable to the digital humanities. Research on digital serial publications in particular benefits from ‘recent systematic digitization efforts, often initiated by libraries […]. More and more historical periodicals and other serial publications are now digitally available in full, i.e., all of their issues’ [Piotrowski, this volume]. The historical journal presented in this article is one of these, and the final section discusses why it can be considered a complete corpus. Digitization projects usually produce digital images, metadata for cataloguing and web navigation, and OCR full text for searching. This information is made available through the library's web portal for digital collections. Digital humanists, however, need high-quality full texts enriched with metadata in an appropriate format so that they can analyse them with powerful software tools.


The historical journal Die Grenzboten serves as an exemplary model for bridging the gap between digitization projects in libraries and research infrastructures. Die Grenzboten is a long-running serial publication (1841–1922) that can be classified as a literary journal that also covered politics and the arts. We demonstrate that OCR post-correction and page-wise structuring are prerequisites for creating a high-quality TEI version of the full text. The TEI version was created in cooperation with the Deutsches Textarchiv (DTA) at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). A fully automated OCR post-correction procedure developed at the SuUB Bremen is freely available on GitHub.


To enable scientists to work with powerful software tools, high-quality full texts must be transferred to research infrastructures. We describe such transfers of full text and the experience we have gained, but some general questions persist: What has to be done to prepare raw OCR output for this purpose in a reasonable and cost-effective manner? What quality is needed or expected? Which metadata and file formats are needed? Should there not be closer cooperation between research infrastructures and the libraries handling the digitization? OCR full texts, even when post-corrected, are not perfect, but character recognition rates of around 99% certainly allow more uses than a search index alone. A vast amount of textual resources is available, ready to be made fully accessible for scientific research! Finally, we offer some suggestions for scholars and researchers working on digital serial publications.

Published
2019-06-30