Dr Giovanni Colavizza was a researcher-in-residence at the KB in 2020.
His project ‘Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks’ explored the question: How does the quality of optical character recognition affect further analyses?
Project
Giovanni Colavizza explored the effect of OCR quality on downstream tasks, with the aim of helping to determine when OCR quality is good enough and when improvement is necessary.
His first blog post introduces the project and discusses the challenges of OCR and of determining OCR quality. The second blog post covers the extrinsic evaluation of the impact of OCR quality on three downstream tasks: topic modelling, document classification and post-OCR correction.
The collected dataset and the code used are available through the dataset page below.