Historical newspapers OCR ground-truth

    Introduction
    Body

    For a research project about the OCR of our historical newspapers, we produced 2000 pages of ground-truth. This dataset is available for research purposes and structured as follows;

    Years of publication/software

    ABBYY 8.1

    ABBYY 9.0

    ABBYY 10.0

    Total

    1700-1882

    237

    0

    37

    275

    1883-1947 (minus 1940-1945)

    652

    21

    494

    1166

    1948-1995

    436

    11

    112

    559

    Total

    1325

    32

    643

    2000

     

    Selection and correction

    For each category a randomised selection was made using newspaper issue identifiers, after which either the first or the second page of the issue was selected for the dataset. The pages were manually corrected to either 99,95% correct characters (digitised from paper) or 99,5% correct characters (digitised from microfilm). After each batch delivery a random paragraph of each page was checked by the KB.

    All pages were sent to a service provider and were rekeyed using the software package Aletheia, developed by the PRImA Research Lab during the IMPACT project. The instructions they received were the following:

    • The text should be 99,95% correct, leaving the segmentation in the pages as they are.
    • Change ligatures to separate characters: æ à ae - ij à ij, etc.
    • Change fractures to separate characters: ¼ à 1/4, etc.
    • Type what you see including capitals and printing mistakes if there are any.
    • Any layout information, such as bullet points or indentations, or style, such as bold or italics, can be ignored.
    • Leave the long s (ʃ) and guilder (ƒ) as a special character.
    • Only type logos and other textual lines in more visual parts of the newspapers such as advertisements if they are clearly legible.
    • Initials are part of the first line.
    • Ignore spaces in words that are there for layout reasons.
    • Normalise special characters to ASCII (except long s and guilder).

    Data

    This set consists of;

    • 2000 Images in JP2 format
    • 2000 Original OCR files in ALTO format
    • 2000 Manually corrected OCR ground-truth files in ALTO format

    For more information, please see the available metadata file:

    Citaat

    When using this data set we ask you to cite it as follows;

    Wilms, L., Nijssen, R., Koster, T. (2020), Historical newspaper OCR ground-truth data set. KB Lab: The Hague. 

    Toegang

    To obtain this dataset, please send an email with your request to dataservices@kb.nl, including the following information:

    • Your name
    • Affiliation/institution
    • Why you would like access to the dataset
    • How long you would like access

    A representative of the KB will contact you and can provide access to the dataset for scientific or scholarly purposes after a contract has been signed. Please note this process can take a couple of working days before access can formally be granted.