Named entity recognition (NER) on historical DBNL

    Introduction
    Body

    The dataset contains the public domain part of DBNL (3542 publications) annotated with 3 NER models. The models were fine-tuned on Dutch historical data and described in the paper. The dataset consists of 4 parts: the predictions of each of the 3 NER models, and the predictions obtained by combining the models in majority voting. It can be used as silver data for training entity linking models, or in digital humanities applications. The annotations are available as tsv files in the standard IOB format (each token is labelled as beginning, inside or outside of a named entity). 

    Acknowledgements: support with the data and research was provided by Marieke Moolenaar

    Afbeelding
    Image
    Annotation NER example
    Citaat

    When using this dataset we ask you to cite it as follows;

    V. Provatorova, Named entity recognition (NER) on historical DBNL (version 1, 12-08-2024) KB Lab, the Hague.

    Toegang

    The raw data can be found on Zenodo.

    Examples

    The dataset can be used as silver data for training entity linking models, or in digital humanities applications.

    Inhoudsblokken
    Afbeelding
    Image
    An example of an annotation: “Gezicht op Leiden, Aelbert Cuyp, 1630 - voor 1649”. Leiden and Aelbert Cuyp are marked as named entities: location and person, respectively. Aelbert Cuyp is an old form of the name Albert Cuyp, which demonstrates the ability of the model to correctly annotate text with historical spelling variations.
    Bijschrift

    An example of an annotation: “Gezicht op Leiden, Aelbert Cuyp, 1630 - voor 1649”. Leiden and Aelbert Cuyp are marked as named entities: location and person, respectively. Aelbert Cuyp is an old form of the name Albert Cuyp, which demonstrates the ability of the model to correctly annotate text with historical spelling variations.