The IMPACT KB dataset was created for evaluating and training OCR software. The original OCR and layout recognition of a selection of KB material have been manually corrected to 99.95% accuracy to provide a 'perfect' result, also known as ground truth. The set consists of:
- 2055 book pages dating from 1630 to 1796, from Early Dutch Books Online and Digitale Topstukken
- 1024 newspaper pages dating from 1618 to 1885, from Delpher
- 1179 parliamentary proceedings dating from 1814 to 1945, from Staten Generaal Digitaal
- 205 typewritten radio bulletins from 1937, from Delpher
The set was made as part of the IMPACT project, a European-funded research project led by the KB. From 2008 to 2012, 26 partners worked together to make OCR for historical text better, faster and cheaper. The project has concluded, but its resources and tools have been transferred to the IMPACT Centre of Competence.
When using this dataset, please cite it as follows:
IMPACT Project, IMPACT KB Ground-truth. KB Lab: The Hague. http://lab.kb.nl/dataset/ground-truth-impact-project
Each page has a master image in TIF format and a corresponding PAGE XML file that contains the ground truth for both the text and the layout.
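As a rough illustration of how the ground-truth text could be read from such a file, the following Python sketch collects the region-level text from a PAGE XML document. It assumes the usual PAGE structure, in which each `TextRegion` element holds a `TextEquiv` child containing a `Unicode` element. Because the PAGE namespace differs between schema versions and the version used in this dataset is not stated here, the sketch matches elements by local name only. The function names and the sample document are hypothetical and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def region_texts(page_xml):
    """Return the text of each TextRegion in a PAGE XML document, in document order.

    `page_xml` may be a str or bytes object holding the whole document.
    """
    root = ET.fromstring(page_xml)
    texts = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        # Only look at TextEquiv elements that are direct children of the
        # region, not those nested inside TextLine or Word elements.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for node in child:
                if local_name(node.tag) == "Unicode":
                    texts.append(node.text or "")
    return texts


# Hypothetical, heavily simplified sample (assumption: real files also carry
# coordinates, metadata and possibly a different namespace version).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="heading">
      <TextEquiv><Unicode>Eerste regel</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <TextLine id="l1">
        <TextEquiv><Unicode>Tweede</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>Tweede regel</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""

if __name__ == "__main__":
    for text in region_texts(SAMPLE):
        print(text)
```

For a file on disk, passing the raw bytes (for example `region_texts(open(path, "rb").read())`, where `path` is a placeholder) lets the parser honour the encoding declared in the XML header.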
The IMPACT KB dataset reflects the KB collections as they were at the time of production (2008-2012). The full dataset is being made available in the public domain. You can download the individual sets as zipped archives from here:
This spreadsheet gives all metadata about the files per category. Please note that some files were made from test batches and cannot be linked directly to live versions of the files.