DH 2023: Collaboration as Opportunity


This year’s DH2023 conference takes place in person in Graz, Austria, from July 10th to July 14th. From the KB Lab, Mirjam Cuper will present a poster titled ‘Interchangeability of ngrams models between heterogeneous dataset’.

What: DH 2023: Collaboration as Opportunity

The annual ADHO (Alliance of Digital Humanities Organizations) Digital Humanities Conference is the central and largest event of the international DH community and unites scholars from across the globe, presenting them with a unique opportunity for the exchange of their work and ideas and the fostering of future collaborations.

The conference theme “Collaboration as Opportunity” showcases transdisciplinary and transnational collaboration, with a special focus on the South-Eastern European DH community. It will explore how mutual empowerment and collaboration of neighboring countries – regardless of continent and geopolitical placement – can transform regional hubs of expertise into international networks of excellent research, for the benefit of the global DH community.

When: July 10th to July 14th.

 

Interchangeability of ngrams models between heterogeneous dataset

Wednesday, July 12th, 6:00 pm to 8:00 pm, Poster Reception at the MCG Gallery

By Mirjam Cuper 

A lot of heritage material is digitally available, with more being added continuously. Regrettably, the OCR software used to digitize texts is not flawless, so the quality of digitized texts varies considerably. To address this, OCR errors must be detected and corrected.

One method that appears to be accurate for detecting and correcting errors in (digitized) texts is the use of ngram models. For example, ngram models are used to detect spelling errors (El Atawy / Abd ElGhany 2018; Wu et al. 2013) and are implemented in the quality pipelines of both the National Library of the Netherlands and the National Library of Luxembourg (Cuper 2022; Schneider / Maurer 2022). Furthermore, various studies use ngram models for the post-processing of OCRed texts (Nguyen et al. 2021; Chiron et al. 2017).
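To make the idea concrete, here is a minimal sketch of ngram-based error detection. It is an illustration only and not the pipeline used by either library: it assumes character trigrams, a tiny trusted word list, and the simple rule that a word is suspicious if it contains a trigram the model has never seen. The function names `char_ngrams`, `build_model`, and `suspicious_words` are hypothetical.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Return the character ngrams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_model(reference_words, n=3):
    """Count every character ngram seen in a trusted (corrected) word list."""
    model = Counter()
    for word in reference_words:
        model.update(char_ngrams(word, n))
    return model

def suspicious_words(words, model, n=3):
    """Flag words that contain at least one ngram absent from the model."""
    return [w for w in words if any(g not in model for g in char_ngrams(w, n))]

# Toy data: a trusted reference text and an OCR output with one error ("1ibrary").
reference = "the library keeps the old newspapers in the archive".split()
ocr_output = "the 1ibrary keeps the old newspapers".split()

model = build_model(reference)
flagged = suspicious_words(ocr_output, model)
print(flagged)
```

A realistic model would be built from a large corrected corpus and would use frequency thresholds rather than a strict seen/unseen rule, since rare but valid words would otherwise be flagged too.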

However, the heritage texts that are digitally available vary widely. Not only are there different types of publications, such as fiction and non-fiction books, periodicals, and newspapers, but time periods also need to be taken into consideration. Due to this large variation in material, and because of the effort required to create manually corrected versions of texts, representative ngram models are only available for a small number of corpora.

It is not evident that ngram models built upon one corpus can be used to measure the quality of another corpus with different characteristics. Previous research has shown that ngrams can be used to measure similarity between texts (Islam et al. 2012). Furthermore, ngrams have been successfully applied to author profiling (Basile 2017). These results imply that there are significant differences between the ngram models of various corpora. Following this line of thought, each type of corpus may well need its own ngram model if ngram models are to be used for error detection and correction. This would mean that using ngram models is only feasible when an ngram model of a representative corpus is available.

This leads to our research question: How, and to what extent, can an ngram model built upon corpus A be used to measure the quality of corpus B?
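One simple way to get a first impression of how well a model transfers is to check how much of corpus B is covered by the ngrams of corpus A. The sketch below is an assumption for illustration, not the method of the poster: it uses word bigrams, whitespace tokenization, and a hypothetical `coverage` score defined as the fraction of bigram occurrences in B that also occur in the model of A.

```python
from collections import Counter

def word_bigrams(text):
    """Split a text on whitespace and return its consecutive word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def coverage(model_a, text_b):
    """Fraction of bigram occurrences in text_b that also occur in model_a."""
    bigrams_b = word_bigrams(text_b)
    if not bigrams_b:
        return 0.0  # no bigrams to compare
    known = sum(1 for bigram in bigrams_b if bigram in model_a)
    return known / len(bigrams_b)

# Toy corpora standing in for two collections with different vocabulary.
corpus_a = "the king of the city sent the letter to the court"
corpus_b = "the king of the village sent the parcel to the court"

model_a = Counter(word_bigrams(corpus_a))
score = coverage(model_a, corpus_b)
print(f"coverage of corpus B by model A: {score:.2f}")
```

A low score on clean text from corpus B would suggest that the model of corpus A flags valid language as errors, which is exactly the risk when models are interchanged between heterogeneous datasets.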