Ground-truth IMPACT project


    The IMPACT KB dataset was created for evaluating and training OCR software. The original OCR and layout recognition of a selection of KB material has been manually corrected to 99.95% accuracy to provide a 'perfect' result, also known as ground truth. The set consists of:

    The set was made as part of the IMPACT project, a European-funded research project led by the KB. From 2008 to 2012, 26 partners worked together to make OCR for historical text better, faster and cheaper. The project has concluded, but its resources and tools have been transferred to the IMPACT Centre of Competence.


    When using this dataset, please cite it as follows:

    IMPACT Project, IMPACT KB Ground-truth. KB Lab: The Hague.


    Each page has a master image in TIF format and a corresponding PAGE XML file that contains the ground truth for both the text and the layout. 
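    As a rough illustration of how the text ground truth might be read from one of these PAGE XML files, here is a minimal Python sketch. It relies only on the standard PAGE element names `TextRegion`, `TextEquiv` and `Unicode`. The sample document, its namespace version and the region ids are assumptions made up for this example, not taken from the dataset, and the helper names `local_name` and `region_texts` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix so the code does not depend on
    which PAGE schema version a file declares."""
    return tag.rsplit("}", 1)[-1]


def region_texts(page_xml):
    """Return (region id, text) pairs for every TextRegion.

    Only the TextEquiv that is a direct child of the region is read,
    so text repeated at TextLine or Word level is not counted twice.
    """
    root = ET.fromstring(page_xml)
    texts = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    texts.append((elem.get("id"), sub.text or ""))
    return texts


# Simplified, made-up sample (assumption): real files also carry
# coordinates and metadata, and may declare another schema version.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="200">
    <TextRegion id="r1" type="paragraph">
      <TextLine id="l1"><TextEquiv><Unicode>Eerste regel</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>Eerste regel</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2"><TextEquiv><Unicode></Unicode></TextEquiv></TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for region_id, text in region_texts(SAMPLE):
        print(region_id, repr(text))
```

    Matching on local element names rather than a fixed namespace is a deliberate choice here, since the schema version used by the dataset files is not stated on this page.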

    The IMPACT KB dataset represents the KB collections as they were at the time of production (2008-2012). The full dataset is available in the public domain. You can download the individual sets as zipped archives from here:


    • TIF (15.1 GB)
    • XML (9.0 MB)


    • TIF (16.7 GB)
    • XML (12.8 MB)

    Parliamentary Proceedings:

    • TIF (5.3 GB)
    • XML (8.5 MB)

    Radio Bulletins:

    • TIF (0.9 GB)
    • XML (0.5 MB)

    This spreadsheet gives all metadata about the files per category. Please note that some files were made from test batches and cannot be linked directly to live versions of the files.