12 Dec 2023

Investigating automatic metadata applied to Dutch Cultural Heritage


1. Introduction

The transformative force of Artificial Intelligence (AI) in many areas of daily life has led cultural institutions to ask what benefits AI could bring to cultural heritage. How, and from which perspectives, we approach the pairing of AI and cultural heritage are questions arising from the young field of Digital Humanities, and they have generated two interrelated approaches: "Culture for AI" (culture playing an important role in the development of AI) and "AI for Culture" (the application of AI to cultural heritage) [1].

One of the research fields within the wider study of AI applied to cultural heritage is the automatic generation of metadata for digital cultural objects. Automatic metadata generation arose together with the advent of digital technology and, in the last decade, has returned to prominence thanks to the potential of data managed with the help of intelligent machines. Indeed, the development of AI and machine learning has shown that automatic metadata generation, when performed with intelligent systems, produces robust and scalable metadata for cultural institutions and their collections (Liddy et al., 2002; Han et al., 2003).

Some methodologies implemented in recent years focus on the automatic classification of objects (categorization of an object by form and content), on image recognition (recognition of visual concepts), on automatic entity recognition (people, places, events) and on automatic authorship attribution (recognition of the author).
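
To make the third of these directions concrete, a minimal rule-based sketch of entity recognition is shown below. The lexicons, the function name and the sample description are all invented for illustration; real systems use trained statistical models rather than fixed word lists.

```python
# Minimal gazetteer-based entity recognition sketch.
# The lexicons and the sample record are invented examples; production
# systems learn entity boundaries and types from annotated data.

PEOPLE = {"Multatuli", "Rembrandt"}
PLACES = {"Amsterdam", "Rotterdam", "Utrecht"}

def recognize_entities(text):
    """Return (entity, type) pairs found in a metadata description."""
    found = []
    for token in text.replace(",", " ").replace(".", " ").split():
        if token in PEOPLE:
            found.append((token, "person"))
        elif token in PLACES:
            found.append((token, "place"))
    return found

description = "Letter sent from Amsterdam by Multatuli in 1860."
print(recognize_entities(description))
# [('Amsterdam', 'place'), ('Multatuli', 'person')]
```

Even this toy version shows why the task matters for metadata: each recognized pair can be attached to the record as a structured access point, which a plain keyword index cannot provide.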

2. A quick reference to some projects 

The Netherlands is actively involved in innovative research in this field and has launched several projects. Among these, the National Archive worked between 2016 and 2019 with Huygens ING and the War Sources Network on the TRIADO [2] research project, in which experiments were carried out on the automatic classification of documents from the Central Archive of Special Jurisdiction (CABR) using deep learning techniques.

Also active in this field is the Netherlands Institute for Sound and Vision (NISV), which has been working since 2018 on the automatic description of TV and radio programs through speech recognition and text processing, automatic speaker labeling and facial recognition, as well as keyword assignment (Kleppe et al., 2019).

Recommendation systems such as Bookarang [3] also fall under the theme of automatic metadata, because they use automatic keyword assignment algorithms.

The KB, National Library of the Netherlands, is another institution attentive to the application of AI to its digital collections. Focusing on metadata quality, topic dissemination and automatic metadata techniques (Kleppe et al., 2019), the KB supports, on the one hand, broad access to cultural resources and, on the other, an indispensable "healthy" relationship between human and machine, as emerges from the seven principles the KB has defined (Van Wessel, 2020).

In 2020 the KB, in collaboration with Rotterdam University, developed the Entangled Histories project (Romein, de Gruijter & Veldhoen, 2020), aimed at the datafication of early modern ordinances, using machine learning techniques for text recognition, segmentation and categorization of norms.

Between 2020 and 2021 the KB carried out the Demosaurus project (Haagsma, 2021), experimenting with automatic keyword and author indexing techniques on Dutch e-books.

In collaboration with Utrecht University and the KNAW Humanities Cluster, the KB experimented between 2017 and 2019 with image recognition, classifying images in historical newspapers as photos, drawings or cartoons (Wevers & Smits, 2020).

In 2021, the KB produced the Ot & Sien dataset in order to develop methods of automatic image recognition by detecting the objects the images contain.

Between 2021 and 2022, within the Krant & foto's verbonden project, the KB explored the use of AI to link heritage collections, matching press photographs to their corresponding newspaper publications (de Gruijter et al., 2022).

In addition, the KB has deepened the study of entity recognition systems, as evidenced by Puck Wildschut's research at Radboud University on recognizing characters in novels and mapping their relationships (Wildschut & Faber, 2017).

Finally, another project launched in 2020 by the KB in partnership with other institutions, and still under development, is AI:CULT [4]. The project addresses the gap between artificial intelligence and digital cultural heritage and is closely linked to the new automatic metadata models: in two case studies, it will offer institutions methods for detecting and filtering bias in automatically generated classifications and descriptions of collection data.

3. Considerations

Analyzing and comparing these projects reveals interesting aspects that reflect the state of the art of these experimental activities.

Firstly, there is a clear focus on data pre-processing. On the one hand, awareness of the quantity and quality of data drives researchers to improve the data through corrections, the addition of missing values, and the enrichment of descriptions. On the other hand, data modeling is needed so that data are represented in a form that is as suitable as possible for the machine to process.
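
A pre-processing step of this kind can be sketched in a few lines. The field names, the placeholder value and the sample record below are illustrative assumptions, not the schema of any of the projects mentioned.

```python
# Sketch of metadata pre-processing: normalising free-text fields and
# completing missing values so that "unknown" is represented
# consistently before records are fed to a learning algorithm.
# Field names and the fallback value are illustrative assumptions.

def preprocess(record):
    cleaned = dict(record)
    # Collapse repeated whitespace in the title, if present.
    if cleaned.get("title"):
        cleaned["title"] = " ".join(cleaned["title"].split())
    # Make missing or empty fields explicit for the machine.
    for field in ("title", "creator", "date"):
        cleaned.setdefault(field, None)
        if not cleaned[field]:
            cleaned[field] = "unknown"
    return cleaned

record = {"title": "  De  Nachtwacht ", "date": "1642"}
print(preprocess(record))
# {'title': 'De Nachtwacht', 'date': '1642', 'creator': 'unknown'}
```

The design choice here mirrors the point above: a blank field and an absent field mean the same thing to a cataloguer, but not to a model, so pre-processing maps both onto one explicit representation.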

In the aforementioned Entangled Histories project on the datafication of early modern ordinances, for example, given the language of the digitized books, which varies from Dutch to French and Latin, and above all the presence of different Roman and Dutch Gothic type fonts, it proved promising to experiment with Handwritten Text Recognition (HTR) techniques through Transkribus, which applies HTR methods combining pattern recognition with AI and neural networks.

Another example is offered by a thesis written in synergy with the KB and focused on automatic authorship attribution of publications through supervised machine learning (Hirzalla, 2020). The project investigates how the scarce available text of a publication can best be represented. Taking text snippets (titles and short descriptions of content) as input data, and testing various text-representation models with different machine learning techniques, it shows that a Support Vector Machine (SVM) achieves excellent predictive results thanks to the combination of text and metadata features, while a Gradient Boosting Machine (GBM) is useful for implementing ensemble learning and similarity learning. BERT (Bidirectional Encoder Representations from Transformers), a technique that uses contextualized embeddings for data representation, has also been tested: it turned out to be very promising but not yet fully mature.
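
The general shape of such an approach, TF-IDF features over short text snippets feeding a linear SVM, can be sketched with scikit-learn. This is not the thesis's actual pipeline: the titles, author labels and parameter choices below are invented for illustration.

```python
# Sketch of authorship attribution on short text snippets: TF-IDF
# character n-gram features with a linear SVM. Titles, labels and
# hyperparameters are invented; the real thesis also combined
# metadata features with the text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = [
    "recepten uit de hollandse keuken",
    "koken met seizoensgroenten",
    "inleiding tot de sterrenkunde",
    "het heelal voor beginners",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

# Character n-grams are robust when only a title is available.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(titles, authors)

print(model.predict(["nieuwe recepten voor de keuken"]))
```

With such a tiny training set the prediction is only suggestive, but the structure is the one the paragraph describes: sparse text representation on one side, a discriminative classifier on the other.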

Secondly, the difference in data perception between humans and machines is evident. The aforementioned thesis offers an illuminating example. To answer the research question on the potential of additional metadata, and to understand the thought process domain experts use to link ambiguous authors to publications, a survey was conducted among metadata experts and cataloguers at the KB on the methodologies they use and their personal perceptions of different types of information. Comparing the survey results with the results of the machine learning models reveals methodological and conceptual differences in authorship attribution: experts attach greater importance to author information such as age, autobiographical notes and role, while machine learning models give more weight to publication metadata.

This aspect also materializes in projects that aim to make machines capable of detecting the prejudices inherent in cultural data, which are by nature subjective and partial. One example is the De-BIAS [5] project, which proposes the development of an AI tool that automatically detects biases in cultural heritage metadata and provides information on their problematic context. On the one hand, the project attempts to improve the description of digital collections and to propose a more appropriate narration of the stories of minority communities; on the other, it raises the question of how a machine can understand and manage the prejudices present in the data, in order to overcome the outdated and sometimes offensive or harmful view carried by some terms.
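
A simple baseline for this kind of detection is term matching against a curated vocabulary, with a contextual note attached to each flagged term. The term list, notes and function below are invented placeholders, not De-BIAS data or its actual method, which aims to go beyond fixed lists.

```python
# Sketch of term-based bias flagging in collection metadata: each
# flagged word is returned with a contextual note, roughly the kind
# of output an automatic detection tool could attach to a record.
# The vocabulary and notes are invented placeholders.

CONTENTIOUS_TERMS = {
    "primitive": "Evaluative colonial-era term; prefer a neutral description.",
    "exotic": "Othering term; name the specific culture or region instead.",
}

def flag_terms(description):
    """Return (term, note) pairs for flagged words in a description."""
    flags = []
    for word in description.lower().split():
        word = word.strip(".,;:")
        if word in CONTENTIOUS_TERMS:
            flags.append((word, CONTENTIOUS_TERMS[word]))
    return flags

print(flag_terms("Collection of exotic masks and primitive tools."))
```

The limits of this baseline are exactly the open question the paragraph raises: a word list cannot tell a harmful use of a term from a historical quotation or a reclaimed usage, which is why contextual understanding is the research goal.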

4. Conclusions

It is clear that many risks and challenges have been faced, and many remain to be addressed: from the type of data representation required for automatic recognition, to the linguistic aspects and the related thesauri and ontologies of a sector, to copyright and personal data, which raise management problems for intelligent systems, as well as the identification of homonyms. But experimenting with methods and tools on cultural datasets, and combining text data with images, means recognizing the possibilities and limitations of the various self-learning algorithms and pursuing research on improving data understanding through contextual metadata that helps AI learn from data in different perspectives.

In conclusion, given that machines perceive and understand data through reasoning techniques other than those applied by humans, it is critical to provide intelligent systems with an appropriate framework so that AI can approach data in a conscious way. Recent research shows that the automatic generation of quality metadata, and of metadata in addition to the raw data, produces a better and more balanced context for heritage data when carried out in synergy with expert knowledge.

The assignment of additional contextual metadata, which helps the system detect context, and the methodologies applied to implement learning and automatic classification techniques not only help AI achieve a high degree of reliability (Floridi, 2019), in terms of precision and recall, but also improve the representation of knowledge, offering information in a more inclusive and balanced way and contributing to the broader research on the co-creation of knowledge graphs.
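
Precision and recall, the two reliability measures just mentioned, are straightforward to compute for automatically assigned keywords against a set supplied by a human expert. The keyword sets below are invented for illustration.

```python
# Precision and recall of automatically assigned keywords, measured
# against the keywords a human expert assigned to the same object.
# The two keyword sets are invented examples.

def precision_recall(predicted, expert):
    """Precision: share of predicted keywords that are correct.
    Recall: share of expert keywords that were found."""
    true_positives = len(predicted & expert)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expert) if expert else 0.0
    return precision, recall

predicted = {"painting", "portrait", "landscape"}
expert = {"painting", "portrait", "17th century"}

print(precision_recall(predicted, expert))
# Both are 2/3: two of three predictions are right, and two of the
# three expert keywords were recovered.
```

Reporting both values matters for heritage metadata: a system that assigns very few, very safe keywords scores high on precision but leaves collections under-described, which is precisely the access problem automatic metadata is meant to solve.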

 

Bibliography

  • de Gruijter, M., Balmashnova, E., Bertrams, R., Brons, M., Franken, J., Groothuis, M., de Groot, R., Kampen, B., Kleppe, M., Kruidhof, J., Manyuhina, O., Vriend, N. & van der Wal, D. (2022). Krant & foto's verbonden. Een verkenning om kunstmatige intelligentie in te zetten om erfgoedcollecties te verbinden (1.0). Zenodo. https://doi.org/10.5281/zenodo.6183002.
  • Floridi, L. (2019). Establishing the rules for building trustworthy AI. Nature Machine Intelligence, 1(6), 261-262.
  • Haagsma, D. (2021). Onderzoekassistentie van de Demosaurus. https://doi.org/10.5281/zenodo.5705945.
  • Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z., & Fox, E. A. (2003). Automatic document metadata extraction using support vector machines. In L. Delcambre, G. Henry, & C. C. Marshall (Eds.), Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003 (pp. 37-48). Article 1204842 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries; Vol. 2003-January). Institute of Electrical and Electronics Engineers Inc. doi:10.1109/JCDL.2003.1204842.
  • Hirzalla, N. (2020). Automating Authorship Attribution in Heterogeneous and Sparse Publication Data through Supervised Machine Learning (Master thesis, Vrije Universiteit Amsterdam, Amsterdam, Netherlands). Retrieved from http://www.victordeboer.com/wp-content/uploads/2020/11/Masterthesis_NizarHirzalla_Final.pdf.
  • Kleppe, M., Veldhoen, S., van der Waal-Gentenaar, M., den Oudsten, B. & Haagsma, D. (2019). Verkenning mogelijkheden automatisch metadateren. Zenodo. https://doi.org/10.5281/zenodo.3373316.
  • Liddy, E. D., Allen, E., Harwell, S., Corieri, S., Yilmazel, O., Ozgencil, N. E., Diekema, A., McCracken, N., Silverstein, J., & Sutton, S. (2002). Automatic metadata generation & evaluation. SIGIR Forum (ACM Special Interest Group on Information Retrieval), 401-402. doi:10.1145/564376.564464.
  • Romein, C. A., de Gruijter, M., & Veldhoen, S. F. (2020). The Datafication of Early Modern Ordinances. DH Benelux Journal, 2. http://journal.dhbenelux.org/journal/issues/002/article-23-romein/article-23-romein.pdf.
  • Van Wessel, J. W. (2020). AI in Libraries: Seven Principles. Zenodo. https://doi.org/10.5281/zenodo.3865344.
  • Wevers, M., & Smits, T. (2020). The visual digital turn: Using neural networks to study historical images. Digital Scholarship in the Humanities, 35, 194-207. Retrieved from https://doi.org/10.1093/llc/fqy085.
  • Wildschut, P., & Faber, W. J. (2017). Narralyzer. Retrieved from http://lab.kb.nl/tool/narralyzer.