It is well known that “machine-readable” data, and the digital humanities built on it, have changed the course of research methodologies for scholars. The next turn is “machine-learnable” data: humanities material prepared as the basic ingredient for data-driven intelligence has yet to be properly introduced. It is worth remembering that the impressive performance of modern computer vision comes at the price of a large number of human-annotated images. It takes millions of images labeled with thousands of object classes (machine-learnable data) to teach a computer to “see” the world [1].
This is perhaps the most intriguing question in one’s mind. The primary answer is that computer vision scientists care the most. Novel data, in this case archival visual data, brings new challenges and research opportunities to the discipline of computer vision. In turn, benchmark datasets of archival imagery provide the foundation for advances in the automatic recognition of objects in this context. However, the result does more than satisfy the curiosity of computer scientists interacting with brand-new data. In fact, librarians and humanities scholars are the ones who benefit the most from the advent of tailor-made data analytic tools for their material.
Objects of Past Imagination
Illustrations are visualizations of their creators’ minds, just as photographic imagery reflects our world. The pixel representation of illustrations enables large-scale analysis of this type of data with respect to the concepts, styles and objects created by illustrators. The question is whether we have competent tools to analyze illustrations, where the depicted objects are as diverse as one’s imagination, ranging from real to conceptual and even fictional figures.
Challenges of Open-Set Object Detection
From the perspective of a computer vision scientist, the challenges of visual understanding of illustration imagery are presented below. The task of object detection is chosen here because of its general-purpose use in understanding image content: the model identifies each object instance and its category, together with its location within the image, expressed as a bounding box.
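Before turning to the challenges, a minimal sketch of the detection output itself may help. The class name and fields below are hypothetical, not part of any particular library; they simply show the three ingredients a detector produces per object: a label, a confidence score and a box.

```python
from dataclasses import dataclass

# Hypothetical container for one detected object. The box is an
# axis-aligned rectangle (x_min, y_min, x_max, y_max) in pixel coordinates.
@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max)

    def area(self) -> float:
        x0, y0, x1, y1 = self.box
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

# Example: a horse detected in a 416x416 illustration.
det = Detection(label="horse", score=0.87, box=(50, 60, 200, 300))
print(det.area())  # (200-50) * (300-60) = 36000
```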
- State-of-the-art object detection models, including YOLO [2], are trained on high-quality photo datasets such as COCO [3]. Illustrations are commonly sparse in the pixel representation space compared to natural photos; in particular, texture information learned from high-quality digital photos may not generalize well to illustrations.
- The COCO dataset includes 80 categories, such as baseball bat, computer mouse and donut, that are not relevant to the context of our illustrations. To adjust the class labels to a custom dataset of historical illustrations, one needs to fine-tune the classification layer of the pre-trained neural network on the new label set. But what if there is no prior information on the categories appearing in the illustration catalogue to be explored? This is not far from reality, as we would like to use automatic tools to discover the content of our large-scale repositories. Recent advances in computer vision and machine learning include zero-shot learning [4], which aims to recognize, at inference time, novel objects that were never presented to the model during training.
- Subjectivity in bounding box labeling can confuse the learning process. People have different notions of an object’s boundary, particularly when many instances of the same object are grouped together or when objects occlude one another. While the exact object boundary may not matter for many applications, the loss functions used in state-of-the-art object detection models can easily be dominated by noisy ground truth for object locations. This leads the model to over-focus on localizing objects while neglecting the classification objective.
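The last point can be made concrete with the standard intersection-over-union (IoU) measure used to compare two boxes. The short sketch below (not tied to any particular detector) shows how even a modest disagreement between two annotators over an object’s boundary shifts the overlap score:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two annotators drawing "the same" object, shifted by 10 px per side:
print(round(iou((100, 100, 200, 200), (110, 110, 210, 210)), 3))  # 0.681
```

A 10-pixel shift on a 100-pixel box already drops IoU to about 0.68; under the stricter matching thresholds commonly used in evaluation, the two annotations would no longer agree.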
Ot & Sien Dataset
To provide a research foundation for object recognition in illustration images, we have collected a dataset of illustrations from the children’s books published as Ot & Sien. The images were manually labeled by people at the KB. The collection includes 1512 images with 7210 objects (an average of 4.8 per image) across 264 classes, each annotated with an object class and a bounding box. The images are resized to 416×416 with black padding at the edges to fit the training procedure. The illustrations were harvested from the KB repository, from books published before 1880. The choice of historical books is mainly to avoid copyright issues that restrict the creation of open-source datasets. We are currently working on extending the dataset with contemporary images, published from 1960 onward, from the DBNL collection.
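The black edges suggest a letterbox-style resize: the image is scaled to fit a 416×416 square while preserving its aspect ratio, and the remainder is padded. As an illustrative sketch (the dataset’s exact procedure is not specified beyond the padded resize), the geometry can be computed as follows:

```python
def letterbox_params(w, h, target=416):
    """Scale a w*h image into a target*target square, preserving aspect
    ratio; returns (new_w, new_h, pad_x, pad_y) where the pads are the
    black margins added on each side."""
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A landscape book scan, e.g. 600x400 pixels:
print(letterbox_params(600, 400))  # (416, 277, 0, 69)
```

Note that any bounding box annotations must be scaled and shifted by the same parameters so they stay aligned with the padded image.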
We provide a baseline model using the Ultralytics YOLOv5 architecture as a backbone, where the classification layer is replaced to match the number of classes in our dataset. The weights are initialized at the beginning of training with the pre-trained YOLO weights. There are three losses to optimize during training: the object, box and classification losses. The object loss guides the network to detect objects wherever they are present; the classification loss concerns the recognition of the correct object category; and the box loss is a regression loss aligning the predicted object boundary with the ground truth. Evaluation is performed with the mean average precision (mAP) metric, which averages precision over all object classes and over the samples in the validation or test set, during training or inference, respectively.
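As a rough sketch of the evaluation idea (a toy approximation, not the exact mAP implementation used by YOLOv5), the average precision for one class can be computed from a ranked list of detections; mAP is then the mean of this quantity over all classes:

```python
def average_precision(matches, n_gt):
    """Toy AP for one class: `matches` lists booleans for detections
    sorted by descending confidence, True meaning the detection matched
    an unclaimed ground-truth box; `n_gt` is the number of ground-truth
    objects. AP here is the mean of the precision values observed at
    each true positive (one common approximation)."""
    if n_gt == 0:
        return 0.0
    tp = 0
    precisions = []
    for i, m in enumerate(matches, start=1):
        if m:
            tp += 1
            precisions.append(tp / i)
    return sum(precisions) / n_gt

# 3 ground-truth objects; four detections ranked by confidence,
# of which the 1st, 3rd and 4th are correct:
print(round(average_precision([True, False, True, True], 3), 3))  # 0.806
```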
The challenging task of object detection on open-set data is yet to be solved.
We would like to invite researchers and computer vision practitioners to crack this task on our illustration dataset. Contributions can be made at different levels of expertise, from a purely technical computer vision perspective to digital humanities and crowdsourcing. You can request access to the data by sending an email to firstname.lastname@example.org. Students and scholars are most welcome to include this dataset as part of their graduation or thesis projects.
[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," IJCV, 2015.
[2] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.
[3] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," ECCV 2014, Lecture Notes in Computer Science, vol. 8693, Springer, Cham, doi: 10.1007/978-3-319-10602-1_48.
[4] Y. Xian, C. H. Lampert, B. Schiele and Z. Akata, "Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly," IEEE TPAMI, vol. 41, no. 9, pp. 2251-2265, 2019, doi: 10.1109/TPAMI.2018.2857768.