Introduction

The dataset contains the public domain part of DBNL (3542 publications) annotated with 3 NER models. The models were fine-tuned on Dutch historical data and described in the paper. The dataset consists of 4 parts: the predictions of each of the 3 NER models, and the predictions obtained by combining the models in majority voting. It can be used as silver data for training entity linking models, or in digital humanities applications. The annotations are available as tsv files in the standard IOB format (each token is labelled as beginning, inside or outside of a named entity).

Acknowledgements: support with the data and research was provided by Marieke Moolenaar

When using this dataset we ask you to cite it as follows;

V. Provatorova, Named entity recognition (NER) on historical DBNL (version 1, 12-08-2024) KB Lab, the Hague.

Access

The raw data can be found on Zenodo.

Examples

The dataset can be used as silver data for training entity linking models, or in digital humanities applications.

An example of an annotation: “Gezicht op Leiden, Aelbert Cuyp, 1630 - voor 1649”. Leiden and Aelbert Cuyp are marked as named entities: location and person, respectively. Aelbert Cuyp is an old form of the name Albert Cuyp, which demonstrates the ability of the model to correctly annotate text with historical spelling variations.