The KBK-1M Dataset (‘Koninklijke Bibliotheek Kranten – 1 Miljoen’) is a collection of 1,603,396 images and accompanying captions from the period 1922-1994. We extracted the images from digitised newspapers that are stored in the National Library (KB) Newspaper Archive and that are publicly accessible via www.delpher.nl. Via Delpher, visitors can search and browse several collections, including Dutch newspapers. One way to narrow down the retrieved results is by clicking on facets. One of these facets is ‘illustraties met onderschrift’ (illustrations with caption), which contains photographs (black & white and colour), comic strips, political cartoons and weather forecasts. The KBK-1M dataset contains the captioned illustrations from all newspapers of the period 1922-1994 that were available on Delpher when we crawled the illustrations in August 2015.
Creation of the dataset
In the newspaper archive of the KB, each issue is stored as a set of scanned pages with one JPEG per newspaper page. Each page is associated with a set of metadata files that describe the location of each image, caption and article on that page. During the digitisation of the newspapers, these locations were manually annotated by trained workers. The article and caption texts are available as automatically OCR-processed output. We took these data as the starting point when we built the harvester that created the KBK-1M dataset. The harvester was written in Python and prepared and extracted the images and captions using KB-internal RESTful APIs. Figure 1 below shows how we transformed the raw source material into the dataset, which contains JPEG files for the images and JSON files for the metadata.
Figure 1: An example of an image and caption extracted from the front page of the January 27th 1951 issue of De Nieuwsgier. The image (red) and caption (black) carry gold-standard annotations in the KB Newspaper Archive. We automatically extracted the image and text from inside these annotations and saved the resulting content as JPEG and JSON data (Elliott & Kleppe, 2016).
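The extraction step described above can be sketched in a few lines. This is a minimal illustration only, assuming each annotated location is an axis-aligned rectangle given in page pixel coordinates. The actual metadata format and the KB-internal APIs are not described in this text, so the `Region` type, the `crop` helper and the toy page below are all hypothetical.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical annotated rectangle in page pixel coordinates."""
    x: int       # left edge (column index)
    y: int       # top edge (row index)
    width: int
    height: int


def crop(page: List[List[int]], region: Region) -> List[List[int]]:
    """Return the pixels of `page` (rows of grey values) inside `region`."""
    rows = page[region.y : region.y + region.height]
    return [row[region.x : region.x + region.width] for row in rows]


# A toy 4x6 "page": each value encodes its position as row * 10 + column.
page = [[r * 10 + c for c in range(6)] for r in range(4)]

# Hypothetical image annotation covering rows 0-1 and columns 1-3.
image_region = Region(x=1, y=0, width=3, height=2)
image_pixels = crop(page, image_region)
```

With a real scanned page, the same rectangle would typically be passed to an imaging library's crop function, and the caption text would be read from the OCR output belonging to the caption annotation.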
All relevant metadata for each image is stored in a JSON file. Listing 1 below shows the JSON file of the image shown in Figure 1. To create this file, we serialised the caption (“caption”), the title of the newspaper issue (“paper_title”), the page (“page”), the date of publication (“date”), and the identifiers of the content and text blocks (“content_block” and “text_block”) as stored in the original repository metadata document. Each newspaper issue is stored with a unique identifier, so each image-caption pair (“content_block_url” and “jp2_url”) links directly back to the newspaper issue ID (“alto_url”) in the Newspaper Archive. Finally, we created a unique filename for each JPEG/JSON file (“image_name”).
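To illustrate how such a record could be assembled and serialised, here is a minimal sketch in Python. It is not the listing itself: the key names follow the fields named above (with “text_block” assumed to be the key for the text block identifier), the title, page and date are taken from the Figure 1 example, and every other value is a placeholder. The date format and the `build_record` helper are assumptions made for illustration.

```python
import json


def build_record(caption, paper_title, page, date, content_block, text_block,
                 content_block_url, jp2_url, alto_url, image_name):
    """Collect the metadata fields of one image-caption pair in a dict."""
    return {
        "caption": caption,
        "paper_title": paper_title,
        "page": page,
        "date": date,
        "content_block": content_block,
        "text_block": text_block,  # assumed key name
        "content_block_url": content_block_url,
        "jp2_url": jp2_url,
        "alto_url": alto_url,
        "image_name": image_name,
    }


record = build_record(
    caption="<OCR caption text>",              # placeholder
    paper_title="De Nieuwsgier",
    page=1,                                    # front page
    date="1951-01-27",                         # assumed date format
    content_block="<content block id>",        # placeholder
    text_block="<text block id>",              # placeholder
    content_block_url="<content block URL>",   # placeholder
    jp2_url="<JP2 URL>",                       # placeholder
    alto_url="<ALTO URL>",                     # placeholder
    image_name="<unique image name>",          # placeholder
)

# ensure_ascii=False keeps accented characters in captions readable.
serialised = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialised)
```

Loading the serialised text with `json.loads` returns the same fields, so an image can be traced back to its newspaper issue through the URL fields.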
This dataset was created while Martijn Kleppe and Desmond Elliott worked as Researchers-in-Residence at the National Library of the Netherlands (KB) on the Photos in and out of Context (PhoCon) project (Kleppe, 2015). During the creation of the KBK-1M dataset they were assisted by Willem Jan Faber of the Research Department of the KB. Desmond Elliott was also supported by an ERCIM ABCDE Fellowship 2014-23.
The authors wish to thank Lotte Wilms, Steven Claeyssens and Annemarie Beunen of the KB for their assistance in creating this dataset and in making it available to the research community.