We have digitised over 100 million pages of text, of which 4.4 million are from magazines. These magazines are full-text searchable via our platform Delpher, but the articles in them have not yet been segmented. This is in contrast to the newspapers, which have all been segmented into 4 types of articles. As our magazines have not yet undergone a similar process, we hope there is an automated way to add this metadata to the pages.
Together with the PRImA Research Lab of the University of Salford, we have launched a competition as part of a research track within the library to enrich our digitised material with new metadata. We therefore challenge you to design a workflow which recognises articles on a page of a digitised magazine and can then also determine to which class(es) these articles belong.
This competition focuses on the recognition of groups of text blocks within digitised historical magazines, i.e. article segmentation. The task is to first classify the type of page and then, where needed, recognise all separate articles, and define the type of article by means of a set of rules.
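As a rough illustration of the task structure described above (page classification first, then article segmentation and rule-based article typing where needed), a minimal Python sketch might look as follows. The class names, the page and article labels, and the rules themselves are hypothetical placeholders chosen for illustration; they are not the competition's actual classes, rules, or data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBlock:
    """One OCR text block on a page (hypothetical, simplified structure)."""
    block_id: str
    text: str


@dataclass
class Article:
    """An article is a group of text blocks plus a type label."""
    block_ids: List[str]
    article_type: str = "unknown"


# Assumption: only some page types need article segmentation.
PAGE_TYPES_WITH_ARTICLES = {"content"}


def classify_page(blocks: List[TextBlock]) -> str:
    """Placeholder rule: a page without text blocks is 'empty'."""
    return "content" if blocks else "empty"


def segment_articles(blocks: List[TextBlock]) -> List[Article]:
    """Placeholder rule: an all-uppercase block is treated as a heading
    and starts a new article; other blocks join the current article."""
    articles: List[Article] = []
    current: List[str] = []
    for block in blocks:
        if block.text.isupper() and current:
            articles.append(Article(block_ids=current))
            current = []
        current.append(block.block_id)
    if current:
        articles.append(Article(block_ids=current))
    return articles


def classify_article(article: Article, by_id: Dict[str, TextBlock]) -> str:
    """Placeholder rule set: label by total word count (hypothetical labels)."""
    words = sum(len(by_id[b].text.split()) for b in article.block_ids)
    return "short_item" if words < 20 else "long_item"


def process_page(blocks: List[TextBlock]) -> Tuple[str, List[Article]]:
    """Classify the page, then segment and type articles only where needed."""
    page_type = classify_page(blocks)
    if page_type not in PAGE_TYPES_WITH_ARTICLES:
        return page_type, []
    by_id = {b.block_id: b for b in blocks}
    articles = segment_articles(blocks)
    for article in articles:
        article.article_type = classify_article(article, by_id)
    return page_type, articles
```

A real submission would replace each placeholder rule with an actual method; the sketch only shows how the three steps depend on each other.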
Schedule
Until 25 April: Registration open. Developers of candidate methods register their intention to participate. The example dataset (document images and associated ground truth) will be provided soon.
15 April: Registered participants will be able to download the document images and OCR results of the evaluation dataset.
26 April: Developers submit the results of the candidate methods along with the executables of the candidate methods (to be able to replicate the results) and a brief description of each method via email. The organisers then evaluate the submitted results and prepare a detailed report describing and comparing the candidate methods.
September 2019: The results will be announced.