20 Jan 2025

Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks

Dr Giovanni Colavizza was a researcher-in-residence at the KB in 2020.

His project ‘Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks’ explored the question: How does the quality of optical character recognition affect further analyses?

Project

Giovanni Colavizza explored the effect of OCR quality on downstream tasks, to help determine when OCR quality is good enough and when improvement is necessary.

His first blog post introduces the project and discusses the challenges of OCR and of determining OCR quality. The second blog post covers the extrinsic evaluation of the impact of OCR quality on three downstream tasks: topic modelling, document classification and post-OCR correction.

The collected dataset, as well as the code used, are available through the dataset page below. 

Author
Dr Giovanni Colavizza
Assistant professor of Digital Humanities
BIO
Giovanni Colavizza is an assistant professor of digital humanities at the University of Amsterdam, and a visiting researcher at The Alan Turing Institute (UK) and at the Centre for Science and Technology Studies (CWTS, Leiden University).