CHRONIC

    Introduction

    The CHRONIC (Classified Historical Newspaper Images) dataset consists of metadata for 313K classified images harvested from Delpher’s digitised Dutch newspapers for the period 1860-1922. Thomas Smits and Willem-Jan Faber created the CHRONIC database by applying several computer vision techniques to classify the images. CHRONIC was originally created to test whether state-of-the-art computer vision techniques could be applied to historical images, and to investigate when Dutch newspapers started to use photographs instead of drawings to visually represent the news [1].

    Pipeline

    We used a pipeline consisting of four steps to classify the images: a harvester, a face recognition classifier, a classification into nine different categories using TensorFlow’s Inception-V3 convolutional neural network (CNN), and a classification of all the images into photographs and drawings by a convolutional neural network built by Leonardo Impett.
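
    The sketch below is a purely illustrative outline, in Python, of how the four steps fit together; each helper function is a hypothetical stand-in for the component described in the sections that follow, not the code we actually ran.

        # Purely illustrative: each helper stands in for one pipeline component.
        def detect_faces(path):
            return []                # step 2: Dlib-based face recognition

        def classify_category(path):
            return "maps"            # step 3: retrained Inception-V3 (nine categories)

        def photo_or_drawing(path):
            return "photograph"      # step 4: Impett's CNN + SVM

        def process(harvested_paths):
            # step 1, the harvester, delivers the image files to classify
            for path in harvested_paths:
                yield {"image": path,
                       "faces": len(detect_faces(path)),
                       "category": classify_category(path),
                       "medium": photo_or_drawing(path)}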

    Harvester

    In the first step of the pipeline, we harvested images from digitised Dutch newspapers. In the ALTO XML files of the digitised newspaper pages, the ‘imageblock’ element denotes images. Around 1900, Dutch newspapers contained many small images, such as the recurring illustrations used at the beginning of a specific section, or small images that accompanied advertisements. Because we were mainly interested in images of the news, we decided to include only images that could be related to newspaper articles (via the XML file), to exclude images of advertisements, and to discard all images with a file size smaller than 30KB.
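
    As a rough illustration of these selection criteria, the following Python sketch shows how one might filter harvested images; the exact element names in Delpher’s ALTO XML and the way articles are linked are assumptions here and may differ from the actual files.

        import os
        import xml.etree.ElementTree as ET

        MIN_SIZE_BYTES = 30 * 1024  # discard images smaller than 30KB

        def image_blocks(alto_path):
            """Yield elements whose tag marks an image block; the tag name is an
            assumption based on the description above."""
            tree = ET.parse(alto_path)
            for elem in tree.iter():
                if elem.tag.lower().endswith("imageblock"):
                    yield elem

        def keep_image(image_path, linked_to_article):
            """Selection criteria from the text: the image must be related to an
            article (so not an advertisement) and be at least 30KB on disk."""
            return linked_to_article and os.path.getsize(image_path) >= MIN_SIZE_BYTES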

    Faces and categories

    In the second step, we used Adam Geitgey’s facial recognition API, built on the Dlib facial recognition library, to detect faces in the images [2] (a minimal usage sketch follows the category list below). In the third step, we applied TensorFlow’s Inception-V3 convolutional neural network to recognize nine different categories:

    • buildings,
    • cartoons,
    • chess,
    • crowds,
    • logos,
    • maps,
    • schematics,
    • sheet music, and
    • weather reports.
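
    A minimal usage sketch of the face-detection step, using Geitgey’s face_recognition package [2]; the file name is illustrative.

        import face_recognition

        # Load a harvested newspaper image and detect faces in it.
        image = face_recognition.load_image_file("newspaper_image.jpg")
        face_locations = face_recognition.face_locations(image)
        has_face = len(face_locations) > 0
        print(f"{len(face_locations)} face(s) detected")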

    In the last few years, machine learning has made tremendous progress in object detection and classification. Deep convolutional neural networks in particular achieve high performance on these kinds of tasks. Inception-V3 is a deep convolutional neural network trained for the ImageNet Large Scale Visual Recognition Challenge. In order to recognize our nine categories, we retrained Inception’s final layer [3]. Although the creators of this method note that it will be outperformed by a full training run, it is surprisingly effective (see below for performance) and does not require GPU hardware [4]. We used training sets of around forty images for each category.
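
    The classifier was built with TensorFlow’s image-retraining tutorial script [3, 4]. The sketch below shows the same transfer-learning idea in present-day tf.keras terms: the pretrained Inception-V3 base is frozen and only a new nine-way classification head is trained. It is an equivalent approach under those assumptions, not the exact code we used.

        import tensorflow as tf

        NUM_CLASSES = 9  # buildings, cartoons, chess, crowds, logos, maps, schematics, sheet music, weather

        # Pretrained Inception-V3 without its original classification layer.
        base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                                 pooling="avg", input_shape=(299, 299, 3))
        base.trainable = False  # freeze the feature extractor; only the new head is trained

        model = tf.keras.Sequential([
            base,
            tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new final layer
        ])
        model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
        # model.fit(train_images, train_labels, epochs=10)  # ~40 labelled images per category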

    Photographs/Drawings

    We asked Leonardo Impett to build a CNN that could recognize whether images were drawings or photographs. Although this task sounds relatively simple, the heterogeneity of the material makes it quite hard. Building on the work of Paul Fyfe and Qian Ge, we decided to focus on reproduction techniques: engraving for illustrations and the half-tone process for photographs. Using MATLAB, Ge devised a method to analyse two low-level features of images: the pixel ratio (the number of low-intensity pixels divided by the total number of pixels) and the entropy level (the amount of information contained in the image). By juxtaposing these two features, Fyfe and Ge were able to sort the images of illustrated newspapers according to the technique used for their reproduction: half-tones, used to reproduce photographs, exhibit both a high pixel ratio and a high entropy level, while engravings, used to reproduce illustrations, display lower pixel ratios and entropy levels. Applying Fyfe and Ge’s technique, we found that it was relatively good at recognizing both high-quality engravings and photographs. However, Dutch newspapers mainly printed low-quality half-tones and engravings, which were not recognized by their model. Furthermore, newspapers frequently used the half-tone technique to reproduce illustrations.
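
    The following Python sketch re-creates Ge’s two low-level features; the intensity threshold that defines a ‘low-intensity’ pixel is an assumption, as it is not given in the description above.

        from skimage.io import imread
        from skimage.measure import shannon_entropy

        def pixel_ratio_and_entropy(path, threshold=128):
            gray = imread(path, as_gray=True)          # grayscale values in [0, 1]
            low_intensity = (gray * 255) < threshold   # 'dark' pixels (assumed cut-off)
            pixel_ratio = low_intensity.sum() / gray.size
            entropy = shannon_entropy(gray)            # information content of the image
            return pixel_ratio, entropy

        # Half-tones (photographs) tend towards high values of both features,
        # engravings (illustrations) towards lower values.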



    Impett instead built a classifier that focuses on the lower layers of a CNN and trained a support vector machine (SVM) on those activations to divide the images into photographs and illustrations. This method is based on the same idea as Fyfe and Ge’s, but uses the lower layers of a CNN instead of pixel ratios and entropy levels.
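
    The sketch below illustrates the general idea, activations from an early convolutional layer fed to an SVM, but the backbone, the choice of layer and the pooling are our assumptions; they are not Impett’s actual architecture.

        import tensorflow as tf
        from sklearn.svm import LinearSVC

        # Use an early convolutional block of a pretrained network as a feature extractor.
        backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
        feature_layer = tf.keras.Model(inputs=backbone.input,
                                       outputs=backbone.get_layer("block2_pool").output)

        def low_level_features(images):
            """images: array of shape (n, 224, 224, 3), preprocessed for VGG16."""
            activations = feature_layer.predict(images)
            return activations.mean(axis=(1, 2))  # average each feature map to one value per channel

        # X = low_level_features(training_images); y = 0 for illustration, 1 for photograph
        # svm = LinearSVC().fit(X, y)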

    Performance

    In order to calculate the F1-scores of the applied computer vision techniques, we manually tagged 500 random images from the entire dataset (1860-1922), 500 random images from the years before 1900 (1860-1900), and 500 images from the years after 1900 (1900-1922). With an F1-score of around 0.85, Impett’s CNN can be confidently used to recognize photographs in digitised visual source material of this period. The high scores of the chess and weather categories show that Inception is very good at recognizing images with a high degree of visual similarity. Although it has more trouble with conceptual similarity, the F1-scores for ‘maps’, ‘buildings’, and ‘crowds’ show that this method can also be used for these kinds of tasks. The same goes, albeit to a lesser extent, for the category ‘cartoons’, which captures stylistic similarity.

    Category      F1-score (1860-1900)  F1-score (1900-1922)  F1-score (1860-1922)
    Photo         -                     0.81                  0.90
    Faces         0.79                  0.58                  0.57
    Buildings     0.89                  0.65                  0.45
    Cartoon       0.67                  0.70                  0.67
    Chess         0.99                  0.95                  -
    Crowds        0.74                  0.68                  0.72
    Logos         0.78                  0.51                  0.72
    Maps          0.67                  0.81                  0.80
    Schematics    0.82                  0.81                  0.85
    Sheet music   -                     -                     -
    Weather       0.67                  0.95                  0.94
    Weather2      -                     0.72                  0.75
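
    For reference, a minimal sketch of how such F1-scores can be computed from the manual tags, using scikit-learn; the toy labels below are illustrative.

        from sklearn.metrics import f1_score

        y_true = [1, 0, 1, 1, 0]   # manual annotation: is this image a photograph?
        y_pred = [1, 0, 0, 1, 0]   # classifier output for the same images
        print(f"F1-score: {f1_score(y_true, y_pred):.2f}")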

    References

    [1]  https://lab.kb.nl/about-us/blog/can-computer-vision-find-illustrations-nineteenth-century-railway-crashes

    [2] Adam Geitgey, Face_recognition: The World’s Simplest Facial Recognition Api for Python and the Command Line, Python, 2017, https://github.com/ageitgey/face_recognition.

    [3] Jeff Donahue et al., “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” arXiv:1310.1531 [cs], October 5, 2013, http://arxiv.org/abs/1310.1531; “How to Retrain Inception’s Final Layer for New Categories,” TensorFlow, accessed November 23, 2017, https://www.tensorflow.org/tutorials/image_retraining.

    [4] “How to Retrain Inception’s Final Layer for New Categories.”

    Citation

    When using this dataset, please cite it as follows:

    Smits, T., Faber, W.J. (2018) CHRONIC (Classified Historical Newspaper Images). KB Lab: The Hague. http://lab.kb.nl/dataset/chronic-classified-historical-newspaper-images

    Access

    Data

    The dataset consists of separate CSV files for each year (1860-1922) that provide metadata for every image. The first column records the XML category: images are either linked to articles (‘artikel’) or are captioned illustrations (‘illustratie met onderschrift’). The second column records the newspaper title, the third the date, the fourth the link to the article in Delpher, the fifth whether an image contains a face and which of the nine categories it belongs to, and the sixth whether an image is classified as a drawing or a photograph. A merged CSV file of all the separate files is also available.
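
    A hedged example of loading one of the yearly files with pandas; the file name and the column labels below are illustrative, since the actual files may ship with different (or no) header names.

        import pandas as pd

        columns = ["xml_category", "newspaper", "date", "delpher_link",
                   "classification", "medium"]
        df = pd.read_csv("chronic_1900.csv", names=columns)

        # e.g. count how many images of that year were classified as photographs vs. drawings
        print(df["medium"].value_counts())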

    Access

    This dataset is available on GitHub.

    Examples

    Tools for analysis

    The CSV files can be analysed in many ways. To give users an idea of the possibilities, we created a Jupyter Notebook that analyses and visualizes several large-scale trends in the use of images by Dutch newspapers. We also created CHRONReader, a tool that allows users to search the database in a more exploratory manner.
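
    As an illustration of the kind of analysis the notebook performs, the sketch below plots the share of photographs per year from the merged CSV file; the file name, column labels and the ‘photo’ value are assumptions.

        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("chronic_all_years.csv",
                         names=["xml_category", "newspaper", "date",
                                "delpher_link", "classification", "medium"])
        df["year"] = pd.to_datetime(df["date"], errors="coerce").dt.year
        share_photo = (df["medium"] == "photo").groupby(df["year"]).mean()

        share_photo.plot(title="Share of photographs per year, 1860-1922")
        plt.ylabel("share of images classified as photographs")
        plt.show()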