Dataset

Automatically extract XML content with Python

A quick-start guide to working with XML files in Python. The course covers various XML formats.
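To give a flavour of what such extraction can look like, here is a minimal sketch using only Python's standard library. It is not taken from the course itself: the sample document, its element names (`records`, `record`, `title`) and the helper `extract_titles` are hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real collections use their own schemas.
SAMPLE_XML = """<records>
  <record id="1"><title>First title</title></record>
  <record id="2"><title>Second title</title></record>
</records>"""


def extract_titles(xml_text):
    """Return (id, title) pairs for every <record> element in the XML text."""
    root = ET.fromstring(xml_text)
    return [(rec.get("id"), rec.findtext("title")) for rec in root.iter("record")]


titles = extract_titles(SAMPLE_XML)
for record_id, title in titles:
    print(record_id, title)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Formats that use XML namespaces need the namespace included in the element names passed to `iter` and `findtext`.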

Clockwork picture of an itinerant dentist performing an extraction in French rural scene, wood frame, metal workings, first half 19th century. Science Museum, London. Attribution 4.0 International (CC BY 4.0) (cropped from original).

Extracting text from EPUB files in Python

Johan van der Knijff published a brief introduction to extracting unformatted text from EPUB files.
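As a rough illustration of the idea, and not the method described in that introduction, the sketch below treats an EPUB as what it is at container level: a ZIP archive holding XHTML documents. The helper names `epub_to_text` and `make_demo_epub` are hypothetical. The sketch assumes the content files are well-formed XML, and it reads them in archive-name order instead of following the reading order declared in the package document.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def epub_to_text(epub_file):
    """Return the unformatted text of all XHTML documents in an EPUB container."""
    chunks = []
    with zipfile.ZipFile(epub_file) as archive:
        # Simplification: archive-name order, not the spine order from the OPF file.
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                root = ET.fromstring(archive.read(name))
                # Collect all text nodes and collapse runs of whitespace.
                chunks.append(" ".join(" ".join(root.itertext()).split()))
    return "\n".join(chunks)


def make_demo_epub():
    """Build a tiny in-memory stand-in for an EPUB (demo only, not a valid e-book)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "OEBPS/chapter1.xhtml",
            '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
            "<h1>Chapter 1</h1><p>Hello, reader.</p></body></html>",
        )
    buffer.seek(0)
    return buffer


demo_text = epub_to_text(make_demo_epub())
print(demo_text)
```

Real-world EPUB files often contain HTML entities or markup that a strict XML parser rejects, so a dedicated EPUB or HTML library is usually the more robust choice for a whole collection.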

Louis Couperus

Dutch Novels 1800-2000

Dataset that contains a corpus of 1346 novels from DBNL.

Thorium Alice in Wonderland

Accessible e-books and audiobooks

We asked specialists to convert several public domain publications into accessible versions.

Total number of Target Records

Historical growth of the KB web archive

Description of the KB web archive and its growth since the KB started archiving.

Courante_uyt_Italien

Is your OCR good enough?

Comprehensive assessment of the impact of OCR quality in Dutch newspaper, journal and book collections.

NL-Blogosphere

Web collection NL-blogosfeer

Metadata datasets and a collection description for the NL-blogosfeer, a collection of Dutch weblogs.

Example Dataset Entangled - French

Entangled Histories: Ordinances of the Low Countries

This special collection is made up of 108 books of ordinances published in the Early Modern Era.

DBNL OCR Data set

This data set consists of 220 texts digitised by the DBNL in TEI and txt (OCR).

Europeana Newspapers NER

Data set for evaluation and training of NER software in Dutch, French, Austrian and German.

Ground-truth IMPACT project

Collection of 99.95% correct OCR of books, newspapers, parliamentary papers and radio bulletins.

Icon for a parliamentary paper

Example set

This collection consists of a small selection of our digitised publications from the years 1870-1871.