With more and more heritage collections digitally available, the use of computer-driven research methods is increasing. However, there is a pitfall: the Optical Character Recognition (OCR) output of digitised text is not always of high quality. Quantifying OCR errors is difficult without so-called Ground Truth (digitised texts that have been manually corrected by humans). Some methods have been developed to measure quality without Ground Truth, such as dictionary comparison, but the reliability of these measures depends heavily on factors like time period and type of text. In this presentation, Mirjam Cuper will demonstrate her concept for a multi-layered approach that combines several different measurements to obtain a more accurate estimate of OCR quality.
This lecture is part of the conference What’s Past is Prologue: The NewsEye International Conference, which takes place over two days, 16 and 17 March. The NewsEye project’s international conference seeks to both present and examine the wide range of DH methods and tools that shape today's digital research landscape.
Registration is free and now open at: https://www.newseye.eu/wpip21/save-the-date/
17 March: 13:30 - 15:00
Programme: Session 3, Digitised Historical Material: Improving Data Quality
Moderated by Juha Rautiainen (The National Library of Finland)