20 Jan 2025

Identifying Austronesian Entities in Dutch Texts: Translingual Named Entity processing of East Indies Newspapers from the 1930s and 1950s

Simon Kemper was a researcher-in-residence at the KB in 2021.

His project ‘Identifying Austronesian Entities in Dutch Texts: Translingual Named Entity processing of East Indies Newspapers from the 1930s and 1950s’ explored the question: Is it possible to improve entity recognition by linking historically related languages?

Project

Simon Kemper explored how multilingual models can be used to detect Asian entities in historical Indonesian and Dutch texts.

His first blog post explores entity recognition in Indonesian and Dutch newspapers, including its challenges and the differences in digitization and HTR quality. It also covers the collection and labelling of a corpus of Malay, Dutch, Sundanese and Javanese texts in Latin and Javanese scripts.

His second blog post covers linking languages through Named Entity Recognition (NER). It gauges how far multilingual training data improves NER for low-resource historical languages, and describes the preprocessing, training and evaluation steps.

Author
Simon C. Kemper
PhD candidate
BIO
Simon C. Kemper specialises in colonial Southeast Asian history and the intricacies of combining Asian and European sources within the digital humanities.