This webpage contains information about the datasets used and the code developed as part of the KB researcher-in-residence project: "Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks." The project comprehensively assesses the impact of OCR quality on Dutch newspaper, journal and book collections, comparing the results with published findings for English and French. This is done via extrinsic evaluation: assessing results from a set of representative downstream tasks, such as text classification or clustering. The ultimate goal of the project is to contribute guidelines detailing when OCR quality is to be considered good enough, in order to inform the development and use of textual collections.
We considered the following downstream tasks, which are relevant to the digital humanities and the library and information communities. In particular, this research focused on:
- Document classification: the goal is to assign one or more labels to each document. The model learns from a manually annotated dataset; this is therefore a supervised learning task.
- Document clustering: the goal is to group documents into clusters by similarity. This task also includes topic modelling, a widely used technique in the digital humanities. These are unsupervised learning tasks, because they are performed in the absence of annotated data.
- Post-OCR correction: the goal is to improve the quality of the OCR post hoc. This can be done using supervised, unsupervised, or rule-based techniques. This task is somewhat different from the previous two, as its purpose is to improve the quality of OCRed texts, independently of their subsequent use.
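To make the first two tasks concrete, here is a minimal sketch (not the project's actual pipeline) of supervised document classification and unsupervised clustering, using scikit-learn on a few invented toy snippets standing in for OCRed newspaper text:

```python
# Toy illustration of two downstream tasks; documents and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

docs = [
    "the parliament debated the new budget today",
    "the minister announced new budget measures",
    "the football team won the national championship",
    "the match ended with a late winning goal",
]
labels = ["politics", "politics", "sports", "sports"]

# Turn texts into TF-IDF vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Supervised document classification: learn labels from annotated data.
clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vectorizer.transform(["budget talks in parliament"]))
print(pred[0])  # prints "politics"

# Unsupervised document clustering: group documents by similarity, no labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignments for the four documents
```

OCR errors perturb the TF-IDF features, which is precisely how degraded OCR can hurt both tasks downstream.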
Results for this research are based on the following datasets:
- Historical newspapers from the KB OCR research project.
- A selection of books from DBNL.
- ICDAR 2019 post-OCR correction challenge.
A brief description of the datasets can be found on this Wiki page.
On March 18th, Dr. Giovanni Colavizza gave a webinar about the results of his researcher-in-residence project.
The recording of his presentation is available online.
Mirjam Cuper (KB) and Konstantin Todorov (UvA) have also contributed to this work as follows: Mirjam has contributed data, including the evaluation of OCR quality, and provided invaluable support throughout the project; Konstantin is fully responsible for the post-OCR correction task.
When using this dataset, please cite it as follows:
Giovanni Colavizza, & Mirjam Cuper. (2021). Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4498186
Data & code for this project
The code used for this project can be found on the GitHub page.
The dataset that was generated for this project and used to evaluate the OCR quality can be found on the Zenodo page.
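OCR quality is commonly measured intrinsically as the character error rate (CER): the edit distance between the OCR output and the ground truth, normalised by the ground-truth length. A minimal sketch in plain Python (not the project's exact evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(ocr: str, truth: str) -> float:
    """Character error rate: edit distance normalised by truth length."""
    return levenshtein(ocr, truth) / len(truth)

# Three character substitutions out of 19 ground-truth characters.
print(round(cer("Tbe qu1ck brown f0x", "The quick brown fox"), 3))  # prints 0.158
```

Lower CER means better OCR; the project's guiding question is at which CER levels the downstream tasks above start to degrade.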
For more information about the post-OCR correction dataset, see https://zenodo.org/record/4033104, or read 'Transfer learning for historical corpora: an assessment on post-OCR correction and named entity recognition' by Konstantin Todorov and Giovanni Colavizza: http://ceur-ws.org/Vol-2723/long32.pdf