Simon Kemper was a researcher-in-residence at the KB in 2021.
His project ‘Identifying Austronesian Entities in Dutch Texts: Translingual Named Entity processing of East Indies Newspapers from the 1930s and 1950s’ explored the question: Is it possible to improve entity recognition by linking historically related languages?
Project
Simon Kemper explored how multilingual models can be used to detect Asian entities in historical Indonesian and Dutch texts.
His first blog post explores entity recognition in Indonesian and Dutch newspapers, including the challenges and differences in digitization and HTR quality. It covers the collection and labelling of a corpus of Malay, Dutch, Sundanese and Javanese texts across Latin and Javanese scripts.
His second blog post looks at linking languages through Named Entity Recognition (NER). It gauges how much multilingual training data improves NER for low-resource historical languages, and describes the preprocessing, training and evaluation steps.