The SIAMESET dataset was created by Melvin Wevers and Juliette Lonij for the SIAMESE tool. It consists of images and metadata of advertisements from two Dutch national newspapers: Algemeen Handelsblad (1945-1969) and NRC Handelsblad (1970-1994). We narrowed the collection using two criteria. First, we removed images with a width or height smaller than 500px, as well as advertisements whose dimensions resembled those of classifieds. Second, we removed images with a character proportion higher than 0.0005, that is, advertisements that contained mostly text. This resulted in a dataset of 426,777 advertisements for the period 1945-1994.
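The two numeric thresholds described above can be illustrated with a minimal sketch. This is not the code used to build the dataset: the `Ad` record, its field names, and the `keep` function are hypothetical, and the separate check for classified-like dimensions is left out because its exact rule is not stated here.

```python
from dataclasses import dataclass

# Thresholds taken from the description above.
MIN_SIDE_PX = 500
MAX_CHAR_PROPORTION = 0.0005


@dataclass
class Ad:
    """Hypothetical record; the field names are assumptions for illustration."""
    identifier: str
    width: int
    height: int
    char_proportion: float


def keep(ad: Ad) -> bool:
    """Return True if the advertisement passes both numeric criteria."""
    # Criterion 1: drop images with a width or height smaller than 500px.
    if ad.width < MIN_SIDE_PX or ad.height < MIN_SIDE_PX:
        return False
    # Criterion 2: drop images whose character proportion exceeds 0.0005.
    if ad.char_proportion > MAX_CHAR_PROPORTION:
        return False
    return True
```

An advertisement sitting exactly on a threshold (500px on a side, or a character proportion of exactly 0.0005) is kept under this reading, since the description speaks of values "smaller than" and "higher than" the limits.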
The advertisement images are provided as high-resolution JPEG files, grouped into folders, one for each year. Each image has a unique filename corresponding to its KB identifier. The dataset also includes a CSV file that provides metadata for each image, such as identifier, date, size, position, page number, total number of pages, and character proportion. This file also contains any textual content of the advertisement that was identified by our OCR software.
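As a rough sketch of how the metadata file and the per-year folders might be used together, the snippet below reads CSV rows and derives an image path from each row. Several details are assumptions rather than documented facts: the column names `id` and `date`, a date value that starts with a four-digit year, the `.jpg` extension, and the helper names `load_metadata` and `image_path`. Check the actual CSV header before relying on any of them.

```python
import csv
import io
from pathlib import Path


def load_metadata(csv_file):
    """Read all metadata rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


def image_path(root, row, id_col="id", date_col="date"):
    """Build the expected image location: <root>/<year>/<identifier>.jpg.

    Assumes the date value begins with a four-digit year (e.g. 1965-03-12)
    and that the filename is the identifier plus a .jpg extension.
    """
    year = row[date_col][:4]
    return Path(root) / year / f"{row[id_col]}.jpg"


# Usage with an in-memory stand-in for the metadata file
# (the identifier shown is made up, not a real KB identifier).
sample = io.StringIO("id,date\nad-0001,1965-03-12\n")
rows = load_metadata(sample)
first_image = image_path("siameset", rows[0])
```

With a real copy of the dataset, `sample` would be replaced by the opened CSV file, for example via `open(..., newline="", encoding="utf-8")`.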