Die Grenzboten on its Way to Virtual Research Environments and Infrastructures
The State and University Library Bremen (SuUB) is dedicated to the digitization of its historical collections. Digitization is an important instrument for improving the accessibility of valuable information contained in fragile historical documents. It facilitates academic research and teaching and is indispensable to the digital humanities. Research on digital serial publications in particular benefits from ‘recent systematic digitization efforts, often initiated by libraries […]. More and more historical periodicals and other serial publications are now digitally available in full, i.e., all of their issues’ [Piotrowski, this volume]. The historical journal presented in this article is one of these, and the final section discusses why it can be considered a complete corpus. Digitization projects usually produce digital images, metadata for cataloguing and web navigation, and OCR full text for searching. This information is made available through the library's web portal for digital collections. Digital humanists, however, need high-quality full texts enriched with metadata in an appropriate format in order to analyse them with powerful software tools.
The historical journal Die Grenzboten serves as a model for bridging the gap between digitization projects in libraries and research infrastructures. Die Grenzboten is a long-running serial publication (1841–1922). It can be classified as a literary journal that also covered politics and the arts. We demonstrate that OCR post-correction and page-wise structuring are prerequisites for the creation of a high-quality TEI version of a full text. The TEI version was created in cooperation with the Deutsches Textarchiv (DTA) at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). A fully automated OCR post-correction developed at the SuUB Bremen is freely available on GitHub.
Transferring high-quality full texts to research infrastructures is a necessary step in enabling researchers to work with powerful software tools. We describe such transfers of full text and the experience we have gained, but some general questions remain: What has to be done to prepare raw OCR output for this purpose in a reasonable and cost-effective manner? What quality is needed or expected? Which metadata and file formats are needed? Should there not be closer cooperation between research infrastructures and the libraries handling the digitization? OCR full texts, even when post-corrected, are not perfect, but character recognition rates around 99% certainly provide more options than use as a search index alone. A vast amount of textual resources is available, ready to be made fully accessible for scientific research! Finally, we give some suggestions for scholars and researchers working on digital serial publications.
Copyright (c) 2019 Manfred Nölte, Martin Blenkle
This work is licensed under a Creative Commons Attribution 4.0 International License.