This year's DH Benelux Conference takes place at the University of Luxembourg. Mirjam Cuper will host a pre-conference workshop on extracting text, layout and metadata from XML files of OCR-ed historical texts. Please join us there!

What: Workshop track 4 (MSH DHLab – cap: 20) Automatically extract text, layout and metadata information from XML-files of OCR-ed historical texts by Mirjam Cuper

When: 31 May 9:30 - 12:30

The complete conference programme (including information about the workshops) is available on the DH Benelux website.

Abstract

In the digital humanities, researchers are often interested in analyzing large numbers of historical texts. Most of these texts have been digitized with Optical Character Recognition (OCR) software, and some have been manually corrected or enriched. Digital heritage institutions often store them in a variety of Extensible Markup Language (XML) formats. For most types of analysis, the plain text first needs to be extracted from these XML files. The XML files can also contain important information about reading order, style and layout, recognition confidence, and the OCR software that was used. Furthermore, XML files can contain metadata about full issues of, for example, newspapers. These metadata files contain information such as the title of the paper, the name of the publisher, the date of publication, and the type of text (e.g. article, advertisement or image). Researchers can use this information to select specific texts from the large amounts of data based on these characteristics.
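
To give an impression of what such an extraction can look like, here is a minimal Python sketch. It is not the workshop material. It assumes an ALTO-style file, one common OCR format, in which each word is a String element with a CONTENT attribute and an optional WC (word confidence) attribute. The sample document and the function names local_name and extract_lines are made up for illustration; other XML formats will need different element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-style sample (assumption); real files are much larger
# and the namespace differs between schema versions.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine>
            <String CONTENT="Historical" WC="0.95"/>
            <SP/>
            <String CONTENT="news" WC="0.80"/>
          </TextLine>
          <TextLine>
            <String CONTENT="paper" WC="0.65"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so matching works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_text):
    """Return (lines, confidences): one string per TextLine, one float per word."""
    root = ET.fromstring(xml_text)
    lines, confidences = [], []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = []
        for child in element:
            if local_name(child.tag) == "String":
                words.append(child.get("CONTENT", ""))
                if child.get("WC") is not None:
                    confidences.append(float(child.get("WC")))
        lines.append(" ".join(words))
    return lines, confidences


lines, confidences = extract_lines(SAMPLE)
plain_text = "\n".join(lines)
mean_confidence = sum(confidences) / len(confidences)
print(plain_text)
print(round(mean_confidence, 2))
```

The sketch walks the elements in document order, which only approximates the reading order; formats that record reading order explicitly would need that information to be taken into account as well.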

DH Benelux 2022 workshop Automatically extract text, layout and metadata information from XML-files of OCR-ed historical texts