Europeana Newspapers NER

    Introduction

    The KB Europeana Newspapers NER dataset was created for evaluating and training NER (named entity recognition) software. The original OCR of a selection of European newspapers has been manually annotated with named entity information to provide a 'perfect' result, also known as ground truth. The collections of four project partners have been manually tagged with all instances of Person, Location and Organisation, and this data is available under CC0 via this Lab.

    There are four sets available for download: Dutch, German, Austrian and French newspapers.

    Each set consists of a number of ALTO files, a BIO file and a trained classifier for Stanford NER.
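    As an illustration of how a BIO file might be read, the following is a minimal sketch. It assumes one token per line with the tag in the last whitespace-separated column and tags of the form B-PER, I-PER or O; the exact layout of the files in this dataset is not described here and may differ. The function names and the sample sentence are hypothetical.

```python
def read_bio(lines):
    """Yield (token, tag) pairs; lines with fewer than two columns are skipped."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # Assumption: token is the first column, tag is the last column.
            yield parts[0], parts[-1]


def extract_entities(pairs):
    """Group B-/I- tagged tokens into (entity_type, text) tuples."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in pairs:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity before starting the next one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # An O tag (or anything else) ends the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


if __name__ == "__main__":
    # Made-up sample, not taken from the dataset.
    sample = ["De O", "heer O", "Jan B-PER", "Jansen I-PER",
              "uit O", "Den B-LOC", "Haag I-LOC", ". O"]
    print(extract_entities(read_bio(sample)))
```

    To read a real file, the same functions can be fed an open file handle instead of the sample list, provided the column layout matches the assumption above.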

     

    Citation

    When using this dataset we request you to cite it as follows:

    Europeana Newspapers project, (2014), KB Europeana Newspapers NER Dataset. KB Lab: The Hague. http://lab.kb.nl/dataset/europeana-newspapers-ner 

     

    Access

    Each set can be downloaded as ZIP files and consists of ALTO files, a BIO file and a trained classifier.

    Named Entities in Dutch Newspapers Set:

    • ALTO (md5sum: d481d7e40d84bb479a479aa5266f8d0d)
    • BIO (md5sum: 4724261eca17dc7eb7daad6175627488)
    • Trained classifier for Stanford NER: eunews.nl.crf.gz (153 MB)

    Named Entities in German Newspapers Set:

    • ALTO (md5sum: 217f955f9c7e643c99c4657611bb3570)
    • BIO (md5sum: 85872b4841fbaec49268839cdd0875a4)
    • Trained classifier for Stanford NER: eunews.de.crf.gz (19 MB)

    Named Entities in Austrian Newspapers Set:

    • ALTO (md5sum: 3b4e76e268f64ed26956b4a677e8043e)
    • BIO (md5sum: 3d0e909baf9aedfc27673170ed1fcdff)
    • Trained classifier for Stanford NER: eunews.at.crf.gz (34 MB)

    Named Entities in French Newspapers Set:

    • BIO (md5sum: cf8d59c45bca837fc076eae403fcd166)
    • Trained classifier for Stanford NER: eunews.fr.crf.gz (47 MB)
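    The md5sum values listed above can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard hashlib module; the local file name dutch_alto.zip is a hypothetical placeholder, since the actual archive names are not given here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Hypothetical local file name; the checksum is the one listed
    # for the ALTO archive of the Dutch set.
    archive = Path("dutch_alto.zip")
    if archive.exists():
        ok = verify_download(archive, "d481d7e40d84bb479a479aa5266f8d0d")
        print("checksum OK" if ok else "checksum mismatch")
    else:
        print("archive not found:", archive)
```

    A mismatch usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.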

     

    Examples

    All methods were evaluated by holding out a number of pages per language. Precision and recall are calculated from the number of true positives, false positives and false negatives.
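    For reference, precision is conventionally TP / (TP + FP) and recall is TP / (TP + FN). The sketch below shows that arithmetic on sets of entities. It assumes exact matching of (type, text) pairs, which may not be the matching rule used in the evaluation reported here; the function names and example entities are made up for illustration.

```python
def count_matches(gold, predicted):
    """Return (tp, fp, fn) for two sets of entities, using exact matching."""
    tp = len(gold & predicted)   # entities found and correct
    fp = len(predicted - gold)   # entities found but not in the ground truth
    fn = len(gold - predicted)   # ground-truth entities that were missed
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall); each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up entities, not taken from the dataset.
    gold = {("PER", "Jan Jansen"), ("LOC", "Den Haag"), ("ORG", "KB")}
    predicted = {("PER", "Jan Jansen"), ("LOC", "Haag")}
    tp, fp, fn = count_matches(gold, predicted)
    precision, recall = precision_recall(tp, fp, fn)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

    A tagger that proposes few entities but gets most of them right will score high on precision and lower on recall.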

    [Image: Europeana Newspapers NER 1]

    These figures were derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm that the Stanford NER tagger tends to be somewhat “conservative”: it accepts a somewhat lower recall in exchange for higher precision. This was the intended behaviour, as high precision is the most valuable for our users.

    The French material was evaluated by LIP6, which resulted in the following figures:

    [Image: Europeana Newspapers NER 2]

    Fewer pages were available for evaluating the German and Austrian sets, because problems with the export function of the training tool left several pages unusable. The decision was made to use as many pages as possible for training, which resulted in a smaller evaluation set. The outcomes are therefore not split up per category, as this would leave too few entities for a sound evaluation. Five pages from LFT and six pages from the ONB were used for the following evaluation.

    [Image: Europeana Newspapers NER 3]