Introduction

This course was created as a supplement to the workshop given at DHBenelux 2022.

This course is a quick start to working with XML files in Python. No prior knowledge of Python or XML is needed: the first lessons cover the basics of working with Python in Jupyter Notebooks and the basics of XML structure. We then give an overview of two Python packages that are often used for working with XML, followed by a practical lesson for each package in which you learn to extract content and metadata from an example XML file.
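
To give an impression of what extracting content and metadata from an XML file can look like, here is a minimal sketch. It assumes the standard-library module xml.etree.ElementTree, which may or may not be one of the two packages covered in the course. The example document and its element and attribute names (book, title, chapter, id, language) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical example document; all element and attribute names are
# made up for this sketch and are not taken from the course material.
EXAMPLE_XML = """<book id="b1" language="en">
  <title>An Example Book</title>
  <chapter n="1">First chapter text.</chapter>
  <chapter n="2">Second chapter text.</chapter>
</book>"""


def extract(xml_text):
    """Return (metadata, content) for the hypothetical book structure above."""
    root = ET.fromstring(xml_text)

    # Metadata: the attributes of the root element plus the title text.
    metadata = dict(root.attrib)
    metadata["title"] = root.findtext("title")

    # Content: the text of every <chapter> element, in document order.
    content = [chapter.text for chapter in root.iter("chapter")]
    return metadata, content


metadata, content = extract(EXAMPLE_XML)
print(metadata)
print(content)
```

Real-life formats usually add namespaces and deeper nesting, so the element lookups would need to be adapted to the format at hand.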

We continue with an introduction to three XML formats that are commonly used in Digital Heritage institutions. The remaining lessons are practical examples and exercises to get familiar with extracting content and metadata from these files with Python. We use both packages and real-life XML examples to show the differences, and to provide working code blocks to base future work on. We will end with instructions on how to perform such extractions automatically on batches of files.
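
As a rough idea of what batch extraction can involve, the sketch below loops over all .xml files in a folder and collects their text. It is an assumption-laden illustration, not the course's own code: the function name extract_text_from_folder is hypothetical, the parsing again uses the standard-library xml.etree.ElementTree, and the demo files are generated on the fly in a temporary folder.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_text_from_folder(folder):
    """Map each parseable XML file name in `folder` to its plain text.

    Hypothetical helper for illustration only.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as error:
            # A malformed file should not stop the whole batch.
            print(f"Skipping {path.name}: {error}")
            continue
        # itertext() yields the text of all elements in document order.
        pieces = (piece.strip() for piece in root.itertext())
        results[path.name] = " ".join(piece for piece in pieces if piece)
    return results


# Demo with two generated files, one of which is deliberately malformed.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.xml").write_text(
        "<doc><p>Hello</p><p>world</p></doc>", encoding="utf-8"
    )
    Path(folder, "broken.xml").write_text(
        "<doc><p>unclosed</doc>", encoding="utf-8"
    )
    results = extract_text_from_folder(folder)

print(results)
```

Catching the parse error per file is a deliberate choice here: in a large batch, one damaged file is reported and skipped rather than aborting the run.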

Data:

Apart from the example XML, the data used in this course is provided by the KB and covers the TEI, ALTO, DIDL and PAGE XML formats.

Requirements:

To follow the course, you need a working installation of Python 3 and Jupyter Notebooks.

Link to the course: https://kbnlresearch.github.io/xml-workshop/intro.html

When citing this page we request you cite it as follows:

Cuper, M., Boer, E. den, (2022) Automatically extract XML content with Python. KB Lab: The Hague. https://lab.kb.nl/tutorial/automatically-extract-xml-content-python

Previous citation (updated after migration to tutorials menu on 24 May 2023):

Cuper, M., Boer, E. den, (2022) Automatically extract XML content with Python. KB Lab: The Hague. https://lab.kb.nl/tool/automatically-extract-xml-content-python
