Introduction
Description
Annif is an open source tool developed at the National Library of Finland. It combines existing natural language processing and machine learning tools to suggest subjects from a user-defined controlled vocabulary for an input text.
We use Annif as part of a larger tool that is being developed to help our library catalogers work more efficiently by suggesting authors and keywords for a given publication. This project continues the research described in the whitepaper ‘Exploring possibilities Automated Generation of Metadata’.
Data
The data we use to train the Annif models comes from the GGC database; GGC is a collaborative cataloging system for Dutch libraries. We have trained several models on data consisting of titles, subtitles and summaries of Dutch e-books. As controlled vocabulary we used the Brinkman thesaurus. Because the Brinkman thesaurus contains both genre and subject keywords, we trained separate models for each, alongside one model that covers both.
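As an illustration of how such records could be turned into training input, the sketch below joins the title, subtitle and summary of one e-book into a single line and appends its keyword URIs. It assumes a tab-separated layout (text, then subject URIs in angle brackets) modelled on Annif's TSV document corpus format. The function name `to_corpus_line`, the sample record and the `example.org` URIs are hypothetical and do not come from the GGC data.

```python
def to_corpus_line(title, subtitle, summary, subject_uris):
    """Build one tab-separated training line for a single e-book.

    Assumed layout: "<text>\t<uri1> <uri2> ...", with each URI wrapped
    in angle brackets. Empty or missing text fields are skipped.
    """
    parts = [title, subtitle, summary]
    # Collapse tabs and newlines inside each field so the record
    # stays on a single line with exactly one tab separator.
    text = " ".join(" ".join(p.split()) for p in parts if p and p.strip())
    uris = " ".join(f"<{uri}>" for uri in subject_uris)
    return f"{text}\t{uris}"


# Hypothetical record; the keyword URIs are placeholders.
example_line = to_corpus_line(
    title="Vogels in de tuin",
    subtitle="Een gids voor beginners",
    summary="Herken de meest voorkomende tuinvogels.",
    subject_uris=[
        "http://example.org/brinkman/1",
        "http://example.org/brinkman/2",
    ],
)
```

Separate genre and subject corpora could then be produced by passing only the genre URIs or only the subject URIs for each record.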
Models
Category | Model | Description |
Full dataset | TF-IDF Brinkman | A TF-IDF model trained on the whole dataset, i.e. both genre and subject keywords. |
Subject models | Omikuji Zaaktrefwoorden | Omikuji is an efficient implementation of Partitioned Label Trees and their variations for extreme multilabel classification. |
Subject models | fastText Zaaktrefwoorden | fastText is an algorithm based on word2vec-type models, but it learns representations of character n-grams and represents each word as the sum of its n-gram vectors. Adding subword information helps the embeddings capture suffixes and prefixes. A skip-gram model is trained to learn the embeddings. |
Subject models | Ensemble Zaaktrefwoorden | Ensemble of Omikuji Zaaktrefwoorden and fastText Zaaktrefwoorden. More weight is given to the Omikuji model (3:1). |
Genre models | vw-multi ECT Brinkman Vorm | Vowpal Wabbit is a multiclass and multilabel classification system best suited for classification tasks with a relatively small number of classes. |
Genre models | Omikuji Brinkman Vorm | See Omikuji Zaaktrefwoorden for more information. |
Genre models | Ensemble Brinkman Vorm | Ensemble of vw-multi ECT Brinkman Vorm and Omikuji Brinkman Vorm. |
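To make the 3:1 weighting of Ensemble Zaaktrefwoorden concrete, the sketch below combines the scores of two models as a weighted average. It assumes that each model returns a score between 0 and 1 per keyword and that a keyword missing from a model's output counts as 0. The function name `combine_suggestions`, the keywords and the scores are made up for illustration, and the actual ensemble in Annif may differ in detail.

```python
def combine_suggestions(sources):
    """Weighted average of per-keyword scores.

    sources: list of (weight, {keyword: score}) pairs.
    A keyword that a model did not suggest contributes 0 for that model.
    """
    total_weight = sum(weight for weight, _ in sources)
    combined = {}
    for weight, scores in sources:
        for keyword, score in scores.items():
            combined[keyword] = combined.get(keyword, 0.0) + weight * score
    return {keyword: value / total_weight for keyword, value in combined.items()}


# Illustrative scores only; not real model output.
omikuji_scores = {"geschiedenis": 0.8, "oorlog": 0.4}
fasttext_scores = {"geschiedenis": 0.4, "politiek": 0.8}

# Omikuji gets three times the weight of fastText (3:1).
merged = combine_suggestions([(3, omikuji_scores), (1, fasttext_scores)])
ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

With these numbers "geschiedenis" ranks first at roughly 0.7, and "oorlog" (suggested only by the heavier Omikuji model) ends up above "politiek" even though fastText scored "politiek" higher.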
When citing this tool, please use the following reference:
Haighton, T. and Veldhoen, S. (2021) Assisted keyword assignment using Annif. KB Lab: The Hague. http://kbresearch.nl/annif/
Live demo
Instructions
Paste a summary of a book into the text box, select a model from the dropdown menu and click ‘Get suggestions’. The system generates a list of Brinkman keywords that fit the given input. You can set how many keywords are returned by choosing 10, 15 or 20 under ‘Max # of suggestions’.
Note: the demo only works for Dutch publications.
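The same steps could also be scripted against an Annif instance that exposes Annif's REST API, where a project's `suggest` endpoint accepts `text` and `limit` as form parameters. The sketch below only builds the request and defines a helper for sending it. The base URL and project identifier are placeholders, and whether this demo's server offers such an endpoint is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_suggest_request(base_url, project_id, text, limit=10):
    """Prepare a POST request for the suggest endpoint of one project.

    `limit` plays the role of 'Max # of suggestions' in the demo.
    """
    data = urlencode({"text": text, "limit": limit}).encode("utf-8")
    url = f"{base_url.rstrip('/')}/v1/projects/{project_id}/suggest"
    return Request(url, data=data, method="POST")


def fetch_suggestions(request):
    """Send the request and return the list of suggested subjects."""
    with urlopen(request) as response:
        return json.load(response)["results"]


# Placeholder server and project id; replace with a real Annif instance.
request = build_suggest_request(
    "http://localhost:5000",
    "hypothetical-project-id",
    "Een korte samenvatting van een Nederlands boek.",
    limit=15,
)
```

Calling `fetch_suggestions(request)` against a running instance should return one entry per suggested keyword, typically with a label and a score.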