The KBK-1M Dataset (‘Koninklijke Bibliotheek Kranten – 1 Miljoen’) is a collection of 1,603,396 images and accompanying captions of the period 1922 – 1994. We extracted the images from digitised newspapers that are stored in the National Library (KB) Newspaper Archive and that are publicly accessible via www.delpher.nl . Via Delpher visitors can search and browse through several collections including Dutch newspapers. One way to narrow down retrieved results is by clicking on facets. One of these is ‘illustraties met onderschrift’ (illustrations with caption) that contain photographs (black & white and colour), comic strips, political cartoons and weather-forecasts. This KBK-1M dataset contains these illustrations with captions of all newspapers in the period 1922-1994 which were on Delpher when we crawled the illustrations, in August 2015.
Creation of the datase
In the newspaper archive of the KB, each issue is stored as a set of scanned pages with one JPEG per newspaper page. Each page is associated with a set of metadata files which describe the locations of each image, caption and article on that page. During the digitisation process of the newspapers, these locations were manually annotated by trained workers. The article and caption texts are available through automatic OCR-processed output. We took these data as starting point when we built the harvester to create the KBK-1M dataset. The data harvester was built using the Python programming language which prepared and extracted the images and captions using KB-internal RESTful APIs. Figure 1 below, shows how we transformed the raw source material into the dataset that contains JPEG files for the images and JSON files for the metadata.
All relevant metadata for each image is stored in a JSON file. Listing 1 below shows the JSON file of the image as shown in figure 1. In order to create this file, we serialised the caption (“caption”), the title of the newspaper issue (“paper_title”), the page (“page”), the date of publication (“date”), and the identifiers of the content and text blocks (“content_block & text_block” and “content_block”) as stored in the original repository metadata document. Each newspaper issue is stored with a unique identifier linking an image caption pair (“content_block_url” & “jp2_url”) directly back to the newspaper issue ID (“alto_url”) from the Newspaper Archive. Finally, we created a unique filename for each JPEG/JSON file (“image_name”).
This dataset was created while Martijn Kleppe & Desmond Elliott worked as Researcher-in-Residence at the National Library of the Netherlands (KB) on the Photos in and out of Context (PhoCon) project (Kleppe 2015). During the creation process of the KBK-1M dataset they were assisted by Willem Jan Faber of the Research Department of the KB. Desmond Elliott was also supported by an ERCIM ABCDE Fellowship 2014-23.
The authors wish to thank Lotte Wilms, Steven Claeyssens and Annemarie Beunen of the KB for their assistance in the creation of this dataset and making it available to the research community.
This dataset was created by Martijn Kleppe, Desmond Elliott & Willem Jan Faber. When using this dataset we request you to cite it as follows:
Access to this dataset can be requetsed via the EASY system of DANS: https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:73749/rd/1
Please log in the EASY system, click on the tab 'Data files' and request for access by include the following information;
- Your name
- Why you would like access to the dataset
- How long you would like access
A representative of the KB will contact you and can provide you access to the dataset for scientific or scholarly purposes only after a contract has been signed. Please note this process can take a couple of working days before access can formally be granted.
Set-up of the dataset
This KBK-1M dataset consists of a collection of zipfiles that correspond with each year. In each zipfile three types of files are available:
- JPEG files that contain the actual image. Each image has a unique filename. In the case of the example image above it is ‘1951/DDD:ddd:010474896:mpeg21/p001-P1_CB00003.jpg. The OCR is provided on article level, that all have classifications. The articles that we harvested for the KBK-1M dataset all have classification 'Illustratie met onderschrift’ (‘Image with caption'). Since the images are stored on page-level, each image in this set is a subset of a newspaper page. Using the example above the URL for the page on which this image is published is the following: http://resolver.kb.nl/resolve?urn=ddd:010474896:mpeg21:p001
- JSON files that contain all information about the image as listed above in listing 1.
- 1 Excelfile containing all information of each individual JSON file. This file can be used as en index-file of all photos of that particular year.
The creators of this dataset foresee possible use in several domains that are also described in Elliott and Kleppe (2016). Humanities scholars can e.g. use it to analyse photographic styles changes, the representation of people and societal issues and or the creation of new tools for exploring photograph reuse via image-similarity-based search (Kleppe & Elliott 2016).
Computer scientists can e.g. use the dataset for experiments in automatic image captioning, image-article matching, object recognition, and data-to-text generation for weather forecasting.
Articles & blogposts about the KBK-1M Dataset
- Elliott, D. & Kleppe, M. (2016), ‘1 Million Captioned Dutch Newspaper Images’ in Calzolari, N. et al, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portoroz: European Language Resources Assiciation (ELRA) http://www.lrec-conf.org/proceedings/lrec2016/summaries/448.html
- Kleppe, M., Desmond, E. (2016). 1 Million Dutch Newspaper Images available for researchers: The KBK-1M Dataset. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 824-825. http://dh2016.adho.org/abstracts/193
- Kleppe, M. (2015). Foto’s in en uit context (FoCon) http://www.martijnkleppe.nl/onderzoek/focon/