Datasets

Is your OCR good enough?

This webpage contains information about the datasets used and code developed as part of the

Word embedding playground

Word embedding playground provides tools for training and fine-tuning word embedding models.

DIGGER

DIGGER contains geocoded place names for 102 million news items from the digitised newspapers from Delpher.

Historical newspapers OCR ground-truth

A dataset consisting of 2000 pages of historical newspaper ground truth, OCR and images.

Newspaper ngram collection

This dataset was generated by PoliticalMashup and contains yearly counts for word ngrams for n ranging

Frame generator

Tool for extracting topics, keywords and their co-occurrence patterns from a Dutch corpus.

Genre classifier

The Genre classifier predicts the genre of a Dutch newspaper article, using plain text as input.

Dictionary viewer

The Dictionary viewer visualises the appearance of a word list in the newspaper corpus over time.

Europeana Newspapers NER

Data set for evaluation and training of NER software for historical newspapers in Dutch, French, Austrian

Ground-truth IMPACT project

Collection of 99.95% correct OCR of books, newspapers, parliamentary papers and radio bulletins meant for training

Example set

This collection consists of a small selection of our digitised publications from the years 1870-1871.

Keyword generator

A command-line tool to extract significant keywords from a collection of sample texts.

ALTO Edit

ALTO Edit is a simple browser-based post correction tool for ALTO XML files.

PoliMedia

PoliMedia allows cross-media analysis of coverage of parliamentary debates in a uniform search interface.

Newspaper ngram viewer

The PoliticalMashup ngram viewer visualises the frequency of a certain phrase in the Delpher newspaper collection.
