The KB has been digitising its newspaper collection since 2006 and has now published more than 12 million pages on the national platform Delpher.nl. Over these past 12 years, optical character recognition (OCR) has improved a great deal, partly due to the efforts made in the KB-coordinated FP7 project IMPACT.
However, we have never really known the quality of the OCR we provide to our users, and, perhaps more importantly, we would like to improve the older OCR. We have therefore started a project to find out what the quality of our newspaper OCR is and whether there is a (semi-)automated process we can use to improve it. We will be looking at reOCRing old files, a machine learning application (Ochre, by Dr Janneke van der Zwaan) and a text-induced corpus cleaning application (TICCL, by Dr Martin Reynaert). We also ran a small test to see what the difference was between OCRing from master and access images, and found that the difference was not actually significant!
To evaluate the OCR we need ground truth (99.95% corrected text), which is rather costly to produce. We have therefore chosen to work with a sample of 2,000 newspaper pages, a mere 0.17% of the total. We know this is very little, but we hope it gives us some insight into what we have and which options we have to improve it, without it costing us an arm and a leg (which could be used for more digitisation).
To select a representative sample we first needed an overview of the total number of pages and the software package with which they were processed. This information is available in the ALTO files of the digitised newspapers, in the <OCRProcessing> tag:
Image 1: OCRProcessing tag in KB's ALTO files
In February 2017 we extracted this information from all the ALTO files we had online at that time and, after some cleaning with OpenRefine, loaded it into a database (available on request). This resulted in the following overview. All our newspapers were digitised with ABBYY software, with versions ranging from 7.0 to 10.0.
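As a minimal sketch, the software information can be pulled from an ALTO file with Python's standard XML library. The sample snippet and element layout below are illustrative assumptions based on the ALTO v2 schema; the exact structure of the KB's files may differ.

```python
import xml.etree.ElementTree as ET

# Illustrative ALTO v2 snippet; the exact layout of the KB's
# <OCRProcessing> element is assumed here, not copied from a real file.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <OCRProcessing ID="OCRPROCESSING_1">
      <ocrProcessingStep>
        <processingSoftware>
          <softwareCreator>ABBYY</softwareCreator>
          <softwareName>FineReader</softwareName>
          <softwareVersion>8.1</softwareVersion>
        </processingSoftware>
      </ocrProcessingStep>
    </OCRProcessing>
  </Description>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def ocr_software(alto_xml):
    """Return all 'name version' strings found in processingSoftware elements."""
    root = ET.fromstring(alto_xml)
    found = []
    for sw in root.iterfind(".//alto:processingSoftware", NS):
        name = sw.findtext("alto:softwareName", default="", namespaces=NS)
        version = sw.findtext("alto:softwareVersion", default="", namespaces=NS)
        found.append(f"{name} {version}".strip())
    return found

print(ocr_software(ALTO_SAMPLE))  # ['FineReader 8.1']
```

Run over the whole collection, a loop like this yields exactly the per-version page counts needed for the overview below.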
One discrepancy was found in the data: a number of files listed both ABBYY 8.1 and ABBYY 9.0 as the processing software in their metadata. We could not find a clear explanation for this and therefore decided to mark all these files as 8.1.
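That cleaning rule is simple enough to state as code. The function name and version labels below are hypothetical; only the rule itself (files listing both versions count as 8.1) comes from our data cleaning.

```python
def normalize_version(versions):
    """Project rule: files listing both ABBYY 8.1 and 9.0 count as 8.1.

    `versions` is the list of software versions found in one file's metadata.
    """
    if {"ABBYY 8.1", "ABBYY 9.0"} <= set(versions):
        return "ABBYY 8.1"
    # Unambiguous files keep their single recorded version.
    return versions[0]

print(normalize_version(["ABBYY 8.1", "ABBYY 9.0"]))  # ABBYY 8.1
print(normalize_version(["ABBYY 10.0"]))              # ABBYY 10.0
```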
In addition to the division by software version, we also chose to divide the data into time periods based on spelling changes in Dutch. We opted to exclude all 17th-century newspapers, because of their poor OCR and the fact that they are being corrected manually in a volunteer project. The same applies to a large selection of WW2 newspapers, which have therefore also been excluded. This resulted in the following selection of pages (with some deviations due to rounding):
| Years/software | ABBYY 8.1 | ABBYY 9.0 | ABBYY 10.0 | Total |
|---|---|---|---|---|
| 1883-1947 (minus 1940-1945) | 652 | 21 | 494 | 1166 |
As one of our test strands assesses what improvement reOCRing old files yields, we also needed to select the images to use in our sample. When digitising, we have two types of images delivered: one for access purposes and one for preservation purposes. Both are JPEG2000, but the preservation copy (master image) is lossless while the access copy is lossy. A small selection of the master images is also in greyscale. Harvesting the access images is easy: we can simply use our API. Accessing the master images, however, requires a lot more work, as they are stored on tape and we would need to retrieve the batches with a specific application that is also used for other purposes and can currently handle only one process at a time.
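Harvesting the access copies then comes down to resolving each page identifier to a download URL and fetching it over HTTP. The resolver pattern and helper names in this sketch are hypothetical stand-ins, not the actual KB API.

```python
import urllib.request

# Hypothetical resolver pattern; the real KB endpoint differs.
ACCESS_BASE = "http://resolver.kb.example/"

def access_image_url(identifier):
    """Build the (hypothetical) download URL for a lossy JPEG2000 access copy."""
    return f"{ACCESS_BASE}{identifier}:image"

def download_access_image(identifier, path):
    """Fetch one access image over HTTP and write it to disk."""
    with urllib.request.urlopen(access_image_url(identifier)) as resp, \
         open(path, "wb") as out:
        out.write(resp.read())

print(access_image_url("ddd:012345678:p001"))
```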
We thus wanted to know whether we could use access images in our project instead of master images. We conducted a small test of 23 pages, had both the master and access images reOCRed with ABBYY FineReader 11, and had the resulting ALTO files ground-truthed. We used a similar division for the selection of these files as in the table above. The IMPACT Centre of Competence then evaluated the files for us (a service they provide to us as a member) using their evaluation tool, resulting in the following outcomes:
Image 2: Word error rate (independent word order) and p-value, access vs master images
You can see that the word error rate has indeed decreased in most files (the lower the better), with the master images producing a slightly lower rate than the access images. However, when comparing the two sets we calculated that the difference is in fact not significant in our test. We therefore chose to use access images in our reOCRing strand.
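For reference, the standard word error rate is the word-level edit distance between the ground truth and the OCR output, divided by the length of the ground truth; the IMPACT evaluation tool additionally reports an order-independent variant. A minimal sketch of the standard measure:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (Levenshtein) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two OCR substitution errors in four words -> WER of 0.5.
print(word_error_rate("de krant van gisteren", "dc krant vao gisteren"))  # 0.5
```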
Now that we knew we could use access images, and how the files should be divided between software versions and over time, we could select the identifiers to include in the sample. Using the database mentioned earlier, we took a random selection from each category, resulting in a list of 2,000 newspaper issues. Because OCR is most relevant for us on newspaper articles, we chose to select only the first and second page of each newspaper, to ensure we would have the most articles and the fewest advertisements. These images, OCR and metadata were then downloaded and made available to our partners in this project.
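The selection itself amounts to stratified random sampling. A sketch with made-up page identifiers, using the per-version targets from the table above (the real strata additionally combine ABBYY version with spelling period):

```python
import random

# Hypothetical populations of page identifiers per stratum.
population = {
    "ABBYY 8.1":  [f"id-81-{i:05d}" for i in range(50000)],
    "ABBYY 9.0":  [f"id-90-{i:05d}" for i in range(2000)],
    "ABBYY 10.0": [f"id-100-{i:05d}" for i in range(40000)],
}
# Target page counts per stratum, from the table above.
targets = {"ABBYY 8.1": 652, "ABBYY 9.0": 21, "ABBYY 10.0": 494}

random.seed(2017)  # make the selection reproducible
sample = {stratum: random.sample(ids, targets[stratum])
          for stratum, ids in population.items()}

print({stratum: len(ids) for stratum, ids in sample.items()})
```

`random.sample` draws without replacement, so each stratum's selection contains no duplicates.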
Over the next few months we will be working with a service provider to get the 2,000 ground-truth files ready for evaluation. We will also provide the original ALTO files and a selection of the ground-truth files to our external partners working on the post-correction software, who will use the data to train their tools and improve the files. Once all files have been corrected, the IMPACT Centre of Competence will again evaluate all files, and we will of course share the results here!
All data created and gathered in this project is available for reuse. Some of it contains in-copyright material, which is why we cannot publish it openly, but feel free to contact firstname.lastname@example.org and we can provide you with a copy.