ICDAR 2023: The 17th International Conference on Document Analysis and Recognition

This year’s ICDAR takes place in person in San José, California, from August 21st to 26th. From the KB Lab, Mirjam Cuper will be present with her paper ‘Unraveling confidence: examining confidence scores as proxy for OCR quality’.

What: ICDAR 2023: The 17th International Conference on Document Analysis and Recognition

The International Conference on Document Analysis and Recognition (ICDAR) is the premier international event for scientists and practitioners involved in document analysis and recognition, a field of growing importance in the current age of digital transition.

When: August 21st to August 26th

Unraveling confidence: examining confidence scores as proxy for OCR quality

By Mirjam Cuper

While performing Optical Character Recognition (OCR), most engines provide confidence scores. These scores indicate how certain an engine is that a word or character has been recognized correctly. The practical application of these scores is not yet clear, and various studies have discussed the (un)usability of confidence scores as an estimation of OCR quality.

Using a dataset of 2000 historical Dutch newspapers, we investigated different aspects of the confidence score as provided by ABBYY FineReader, while also looking for a way to use the confidence score as an indication of quality. Such an indication could be used by institutions to determine which parts of their collection would benefit from re-OCRing or post-processing.

We found that the reliability of the confidence score as a measure of quality is largely dependent on the way the engine has been configured. In addition, we show that when there is a high enough correlation between the word confidence and the Word Character Error (order independent), the word confidence can be used to calculate a proxy measure for categorizing digitized texts. However, such a measure must be recalculated for individual OCR engine setups and producers. For our dataset, this proxy measure performs well at separating digitized texts into those of very good and those of very bad quality, with a total accuracy of 83%.
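To make the idea of a confidence-based proxy more concrete, here is a minimal Python sketch. It is not the method from the paper: the function names, the cut-off values (0.90 and 0.60) and the toy numbers are all assumptions chosen for illustration. It shows the three general steps one might take: check how strongly the mean word confidence correlates with an error measure on texts that have ground truth, assign a quality category from the confidence alone, and compare those categories with ground-truth-based labels.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def categorize(mean_confidence, good_cutoff=0.90, bad_cutoff=0.60):
    """Assign a quality category from the mean word confidence of a text.

    The cut-offs are hypothetical; they would have to be recalculated
    for each OCR engine setup, as noted above.
    """
    if mean_confidence >= good_cutoff:
        return "good"
    if mean_confidence <= bad_cutoff:
        return "bad"
    return "uncertain"


def accuracy(predicted, reference):
    """Fraction of texts whose predicted category matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy data (made up): mean word confidence and error rate per text.
confidences = [0.95, 0.90, 0.70, 0.50]
error_rates = [0.02, 0.05, 0.20, 0.45]

# A strongly negative value suggests confidence tracks quality here.
r = pearson(confidences, error_rates)

# Categories from confidence alone, compared with hypothetical
# ground-truth-based labels for the same four texts.
predicted = [categorize(c) for c in confidences]
reference = ["good", "good", "uncertain", "bad"]
score = accuracy(predicted, reference)
```

In practice the cut-offs would be derived from a sample with ground truth for one specific engine configuration, and then applied to the rest of the collection produced with that same configuration.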

The full paper is now available:

Cuper, M., van Dongen, C., Koster, T. (2023). Unraveling Confidence: Examining Confidence Scores as Proxy for OCR Quality. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_7