In a past blog we wrote about research on the quality of the OCR of our digitised newspapers. Since then we’ve developed 2000 pages of ground-truth (GT) and have used those to examine if and how we can improve the recognised text of one of our biggest digital collections. This post describes the development of the GT and the research we have done to better understand the OCR quality of our digitised newspapers.
Once we established that we could in fact use the access images for our research (instead of the hard to access master images) we selected 2000 pages from the collection. These 2000 pages only represent a mere 0,17% of our entire historical newspaper collection (at the time), but budget restraints meant this was what we could produce. We subdivided the to be selected files into the following categories:
Years of publication/software
1883-1947 (minus 1940-1945)
A random selection of digitisation identifiers was made for each of the categories after which alternatingly the first or second page of that issue was selected. Given that we were looking for pages with as much text as possible we opted for the first two pages of each issue where we would mostly see text and could hopefully avoid full pages of advertisements. Due to the varying typefaces and print quality during the Second World War we chose to exclude those years from the selection, as it is not representative for the whole collection.
All pages were sent to a service provider and were rekeyed using the software package Aletheia, developed by the PRImA Research Lab during the IMPACT project. The instructions they received were the following:
For each batch a random paragraph was manually checked for each image and the batch was then either approved (with a 99,95% accuracy) or returned to the service provider for further correction.
As is often the case in the development of GT, it took some weeks for the process to run smoothly. As we didn’t want to touch the segmentation of the articles the people rekeying had to work from existing OCR and we quickly found that the scans that were made from microfilms were of such a bad quality that it was very difficult to reach the 99,95% accuracy. The decision was made to extract those images and process them separately with a lower accuracy of 99,5%. In the end we spent an extra 4 months on the development of the GT than originally planned.
We worked together with research software engineer dr. Janneke van der Zwaan who developed a machine learning application that can correct existing OCR. Due to the delays in the GT production Janneke had less time than we originally planned and could therefore only use a portion of the files to train her engine. We chose to select the 1325 pages digitised with ABBYY 8 as we expected the best results in post-correction there. Out of these, 80% was used for training, 10% for validation and 10% for evaluation. A more elaborate description of the application, its development process and the results, please see this more elaborate post about Ochre by Janneke.
The application is available on Github: https://github.com/KBNLresearch/ochre.
All pages were re-OCRed using ABBYY11, the latest version of the software we have been using since starting our digitisation processes. As we have manually corrected headings for all our articles our original plan was to not process these headings. However, due to the way our metadata is designed, the ALTO did not contain information about what a heading is conclusively, as this is stored in either a MPEG21 file or a METS file. Creating a pipeline in which both these files could be taken into account when re-OCRing was unfortunately too costly for this research project. We also tried to extract all headings in the evaluation, but the results were not certain enough to work with. We therefore chose to evaluate the existing version with corrected headings and the re-OCRed version which could have introduce more mistakes than we currently offer. This risk is of course always present when presenting new OCR results that have been automatically processed, but should be taken into account when examining the results.
Once we had all results from the service provider we asked the IMPACT Centre of Competence to evaluate all data. For this they used the ocrevalUAtion tool which presents three measures; the character error rate, word error rate and the word error rate (order independent). Since we were evaluating the data for our mass digitisation platform Delpher we chose to work with the word error rate (order independent) in our final results. The word error rate is calculated as follows, where the order of the words is not taken into account for the order independent option:
Fig 1. Description of calculation of word error rate by Carrasco, Text Digitisation
Since we subdivided the selection into time periods and OCR software the results can be displayed for various subsets. In figure 1 the results are displayed for the entire set, subdivided into all scans (Total), microfilm scans and scans from paper. All three columns are also subdivided into the three software packages used in digitisation (ABBYY 8.1, ABBYY 9 and ABBYY 10). The blue bars (left) represent the average word error rate for the original OCR files which we currently offer on Delpher. The orange bars (middle) represent the average word error rate for the reprocessed scans with ABBYY FineReader 11. And finally, where possible, the grey bars (right) represent the average word error rate for post-corrected OCR using Ochre.
Fig. 2: Complete set compared on word error rate (order independent)
Fig. 3: Pages published before 1883 compared on word error rate (order independent)
Fig. 4: Pages published between 1883 and 1947 compared on word error rate (order independent)
Fig. 5: Pages published after 1947 compared on word error rate (order independent)
After examining the results above, we concluded that for us at this moment it is not cost-effective to re-OCR legacy digitisation projects. The changes needed in our infrastructure are too disruptive to merit the little improvement reprocessing the pages brings. For ABBYY 9 pages (which were all from a single newspaper title) even more errors are introduced when reOCRing. For us, it is better to use the available resources to digitise more items. However, we do see potential in using post-correction tools such as Ochre for the material which is digitised from microfilms and processed with ABBYY 8.1. The recognition rate for those pages is so low that any improvement can be helpful to our users and further developing the tool in-house is worthwhile, also in understanding more about the quality of our digitisation outputs and about machine learning applications in a library.
One of the major lessons we learned during this project is that ground-truth production is always more complicated and takes more time than what you plan for. Next to that, working with a digital collection that was not designed to publish new OCR versions presents issues which can be difficult to implement in a mass digitisation workflow. We saw this in for example the corrected headings, but also the lack of a version management process in our digital library system. This is something that we don’t have an easy solution for currently, but should and will be taking into account in future projects.
After this project we will continue to work on the topic of OCR quality within the KB and KB Lab. We currently have Kimberly Stoutjesdijk working with us as an intern on evaluation of an open-source OCR engine, and have an internal working group consisting of colleagues from the Research, Collections, Digitisation, Information policy and DBNL departments tasked with researching the quality and possible correction of our digital collections. We hope to further develop Ochre for post-correction on microfilm scans and plan to use the ground-truth developed in this project for more evaluation projects.
If you are interested in an internship using our digital collection, ground-truth datasets or OCR quality, please send us a message to discuss possible topics and options.