Introduction
The SIAMESET dataset was created by Melvin Wevers and Juliette Lonij for the tool SIAMESE. The dataset consists of images and metadata of advertisements from two Dutch national newspapers: Algemeen Handelsblad (1945-1969) and NRC Handelsblad (1970-1994). We narrowed the set of advertisements using two criteria. First, we removed images with a width or height smaller than 500px, as well as advertisements whose dimensions resembled those of classifieds. Second, we removed images with a character proportion higher than 0.0005, that is, advertisements consisting mostly of text. This resulted in a dataset of 426,777 advertisements for the period 1945-1994.
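The two numeric thresholds described above can be sketched as a simple filter. This is an illustration only, not the code used to build the dataset: the function and constant names are hypothetical, and the check for classified-like dimensions is left out because its exact rule is not stated here.

```python
# Thresholds taken from the description above; the names are illustrative.
MIN_SIDE_PX = 500
MAX_CHAR_PROPORTION = 0.0005


def keep_advertisement(width_px: int, height_px: int, char_proportion: float) -> bool:
    """Return True if an advertisement passes both stated thresholds."""
    # Criterion 1: drop images whose width or height is below 500 px.
    if width_px < MIN_SIDE_PX or height_px < MIN_SIDE_PX:
        return False
    # Criterion 2: drop images whose character proportion exceeds 0.0005.
    if char_proportion > MAX_CHAR_PROPORTION:
        return False
    return True
```

Under these assumptions, an 800 by 600 px advertisement with a character proportion of 0.0001 would be kept, while the same image with a proportion of 0.001 would be removed.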
Data format
The dataset consists of the high-resolution advertisement images in JPEG format grouped into folders, one for each year. Each image has a unique filename corresponding to its KB identifier. The dataset includes a CSV file that provides metadata for each of the images, such as identifier, date, size, position, page number, total number of pages, and character proportion. This file also contains any textual content of the advertisement that was identified by our OCR software.
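As a rough sketch of how the metadata and the per-year folders might be combined, the snippet below reads the CSV file and builds the expected path of each image. Several details are assumptions rather than documented facts: the column names "identifier" and "date", a date value that starts with a four-digit year, UTF-8 encoding, and a ".jpg" file extension. Check the header row of the CSV file and the folder contents before relying on any of them.

```python
import csv
from pathlib import Path


def read_metadata(csv_path):
    """Read the metadata CSV into a list of dicts, one per advertisement."""
    # UTF-8 is an assumption; adjust the encoding if the file differs.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def image_path(dataset_root, row):
    """Build the expected image path: <root>/<year>/<identifier>.jpg."""
    # Assumes hypothetical columns "date" (starting with YYYY) and "identifier".
    year = row["date"][:4]
    return Path(dataset_root) / year / (row["identifier"] + ".jpg")
```

For example, a row with identifier "ad123" and date "1965-03-12" would map to the path siameset/1965/ad123.jpg when the dataset root is "siameset". Both values are made up for illustration.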
When using this dataset, please cite it as follows:
Wevers, M., Lonij, J. (2017) SIAMESET. KB Lab: The Hague. http://lab.kb.nl/dataset/siameset
Access
To obtain this dataset, please send an email with your request to dataservices@kb.nl, including the following information:
- Your name
- Affiliation/institution
- Why you would like access to the dataset
- How long you would like access
A representative of the KB will contact you and can provide access to the dataset for scientific or scholarly purposes after a contract has been signed. Please note that it can take a couple of working days before access is formally granted.