Extracting text from EPUB files in Python

    On his blog, our colleague Johan van der Knijff published a post that gives a brief introduction to extracting unformatted text from EPUB files. The work was prompted by a request from his Digital Humanities colleagues involved in the SANE (Secure ANalysis Environment) project, who were looking for advice on how to implement the text extraction component, preferably with a Python-based solution.

    Evaluated tools

    The post evaluates the following tools:

    1. Tika-python. This is a Python wrapper for Apache Tika (which itself is a Java application). Apache Tika is a toolkit for text and metadata extraction from a wide range of file formats, including EPUB.
    2. Textract. This offers text extraction functionality that is similar to Tika, but unlike Tika, Textract is natively written in Python.
    3. EbookLib. This is a Python library for reading and writing E-books in various formats, including EPUB (both EPUB 2 and EPUB 3). EbookLib is also the E-book library that is used by Textract.
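
    To make the comparison concrete, here is a minimal sketch of how each of the three tools might be called to get plain text out of a single EPUB file. It is an illustration only and not necessarily how the demo scripts in the blog post work. The function names (html_to_text, extract_with_ebooklib, extract_with_tika, extract_with_textract) are hypothetical, and stripping markup with the standard library's HTMLParser is an assumption made here to keep the example dependency-free. The third-party imports are done inside the functions, so each tool only needs to be installed if its function is actually called.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an (X)HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes with spaces, then collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Hypothetical helper: reduce an (X)HTML string to unformatted text."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.text()


def extract_with_ebooklib(path):
    """Sketch: read the EPUB with EbookLib and strip markup per document."""
    import ebooklib
    from ebooklib import epub

    book = epub.read_epub(path)
    parts = []
    # Items are visited in the order EbookLib yields them, which is an
    # assumption here and may differ from the reading order of the book
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        markup = item.get_content().decode("utf-8", errors="replace")
        parts.append(html_to_text(markup))
    return "\n".join(parts)


def extract_with_tika(path):
    """Sketch: let Apache Tika do the extraction (needs a Java runtime)."""
    from tika import parser

    parsed = parser.from_file(path)
    # "content" can be None when Tika finds no text
    return parsed.get("content") or ""


def extract_with_textract(path):
    """Sketch: Textract returns bytes, so decode them to a string."""
    import textract

    return textract.process(path, encoding="utf-8").decode("utf-8")
```

    Only EbookLib needs the extra markup-stripping step in this sketch, because it returns the raw (X)HTML of each document; Tika and Textract return text directly.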

    Test environment and data

    All tests were run on a simple desktop PC running Linux Mint 20.1 (Ulyssa), MATE edition, with Python 3.8.10.

    The following two data sets were used:

    1. A selection of 15 files in EPUB 2.0.1 format from the KB’s DBNL (Digital Library for Dutch Literature) collection.
    2. A selection of 10 files in EPUB 3.2 format from Standard Ebooks.
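
    Running a tool over a data set like these comes down to looping over the EPUB files in a directory. The sketch below shows one way to do that while recording the amount of extracted text, the elapsed time, and any error per file. It is an assumption about how such a batch run could look, not a description of the original demo scripts; run_batch and the result keys are hypothetical names, and the extractor argument stands for any function that takes a file path and returns a string.

```python
import time
from pathlib import Path


def run_batch(directory, extractor):
    """Apply extractor(path) to every .epub file in directory.

    Returns one result dict per file, sorted by file name.
    """
    results = []
    for path in sorted(Path(directory).glob("*.epub")):
        start = time.perf_counter()
        try:
            text = extractor(str(path))
            error = None
        except Exception as exc:
            # Record the failure and continue with the next file
            text = ""
            error = str(exc)
        results.append({
            "file": path.name,
            "characters": len(text),
            "seconds": time.perf_counter() - start,
            "error": error,
        })
    return results
```

    Passing a different extractor for each tool gives per-file results that can be compared side by side.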

    Continue reading

    His blog post at bitsgalore.org presents the simple demo scripts he wrote to show how each tool is used within a processing workflow. It also describes how he applied these scripts to the two data sets and used the results to obtain a first impression of each tool's performance.

    Citation

    Image used: clockwork picture of an itinerant dentist performing an extraction in a French rural scene, wood frame, metal workings, first half 19th century. Science Museum, London. Attribution 4.0 International (CC BY 4.0) (cropped from original).

    When citing this page, please cite it as follows:

    Knijff, J. van der (2023). Extracting text from EPUB files in Python. KB Lab: The Hague. https://lab.kb.nl/tutorial/extracting-text-epub-files-python