Examining a multi-layered approach for classification of OCR quality without Ground Truth

With more and more heritage collections digitally available, the use of computer-driven research tasks is increasing. However, there is a pitfall: the Optical Character Recognition (OCR) quality of digitised text is not always high. Quantifying OCR errors is difficult without so-called Ground Truth (digitised texts that have been manually corrected). Some methods have been developed to measure quality without Ground Truth, such as dictionary comparison, but the reliability of these measures depends heavily on factors like the time period and type of text. In this presentation, Mirjam Cuper will demonstrate her concept for a multi-layered approach that combines several measurements to classify OCR quality more accurately.

The NewsEye International Conference

This lecture is part of the conference What’s Past is Prologue: The NewsEye International Conference, which takes place over two days: 16 and 17 March. The NewsEye project’s international conference seeks to both present and examine the wide range of DH methods and tools which impact the digital research landscape of today.

More information is available in the conference programme. Mirjam Cuper's presentation is part of session 3 on day 2.

Registration

Registration is free and now open at: https://www.newseye.eu/wpip21/save-the-date/

Date and time

17 March, 13:30 - 15:00

Programme of Session 3: Digitised Historical Material: Improving Data Quality

Moderated by Juha Rautiainen (The National Library of Finland)

Examining a multi-layered approach for classification of OCR quality without Ground Truth (Mirjam Cuper)
Discovering Spatial Relations in Literature: What is the influence of OCR noise? (Gaël Lejeune and Caroline Parfait)
Evaluating the multilingual capabilities of PERO-OCR with digitised historical newspapers: A Belgian case study (Julie M. Birkholz, Sally Chambers, Michal Hradis and Pavel Smrz)
Two Examples of Analysis of Textual Document in Oriental and Under-Resourced Languages (Chahan Vidal-Gorène)