Annif is an open source tool developed at the National Library of Finland. It combines existing natural language processing and machine learning tools to suggest subjects for an input text from a user-defined controlled vocabulary.

    We use Annif as part of a larger tool that is being developed to help our library catalogers work more efficiently by suggesting authors and keywords for a given publication. This project is a continuation of the research described in the whitepaper ‘Exploring possibilities Automated Generation of Metadata’.


    The data used to train the Annif models comes from the GGC database; GGC is a collaborative cataloging system for Dutch libraries. We have trained several models on data consisting of titles, subtitles and summaries of Dutch e-books. As the controlled vocabulary we used the Brinkman thesaurus. Because the Brinkman thesaurus contains both genre and subject keywords, we trained separate models for each.


    Full dataset

    TF-IDF Brinkman

    This is a TF-IDF model trained on the whole dataset, i.e. on both genre and subject keywords.
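
To illustrate the general idea behind a TF-IDF model, the sketch below scores an input text against each keyword by cosine similarity between TF-IDF vectors. It is a minimal illustration under stated assumptions, not the code used in this tool: the function names, the toy keywords and the toy texts are all hypothetical, and the real model is trained on GGC records.

```python
import math
from collections import Counter


def build_idf(docs):
    """Inverse document frequency for every term in a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Adding 1 keeps terms that occur in every document from dropping to zero.
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; terms unseen during training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest(text, keyword_texts, limit=10):
    """Rank keywords by similarity between the input and each keyword's text."""
    keywords = list(keyword_texts)
    docs = [keyword_texts[k].lower().split() for k in keywords]
    idf = build_idf(docs)
    query = vectorize(text.lower().split(), idf)
    scored = [(k, cosine(query, vectorize(d, idf))) for k, d in zip(keywords, docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical toy data: each keyword is paired with the text of its training records.
keyword_texts = {
    "koken": "recepten koken keuken recepten",
    "geschiedenis": "oorlog geschiedenis eeuw oorlog",
    "tuinieren": "tuin planten bloemen tuin",
}
suggestions = suggest("een boek met recepten voor de keuken", keyword_texts, limit=2)
```

In this toy example the input shares words only with the text for "koken", so that keyword ranks first and the other keywords score zero.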

    Subject models

    Omikuji Zaaktrefwoorden

    Omikuji is an efficient implementation of Partitioned Label Trees and its variations for extreme multilabel classification.


    fastText Zaaktrefwoorden

    fastText is an algorithm based on word2vec-type models, but it learns representations of character n-grams and represents each word as the sum of its n-gram vectors. This subword information helps the embeddings capture prefixes and suffixes. A skipgram model is trained to learn the embeddings.
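
The subword idea can be sketched in a few lines. The example below splits a word into character n-grams, using the boundary markers `<` and `>` and the 3 to 6 character range described in the fastText paper, and then sums the n-gram vectors into a word vector. The function names and the toy vectors are hypothetical; real fastText learns the vectors during training and hashes n-grams into buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all n-grams of the word; unknown n-grams add nothing."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[k] += value
    return vec


# Hypothetical two-dimensional vectors for two of the n-grams of "boek".
toy_vectors = {"<bo": [1.0, 0.0], "boe": [0.0, 2.0]}
trigrams = char_ngrams("boek", 3, 3)   # ['<bo', 'boe', 'oek', 'ek>']
vector = word_vector("boek", toy_vectors, dim=2)
```

Because "boek" and "boeken" share n-grams such as `<bo` and `boe`, their summed vectors overlap even if one of the two words was rare in the training data.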


    Ensemble Zaaktrefwoorden

    Ensemble of Omikuji Zaaktrefwoorden and fastText Zaaktrefwoorden. More weight is given to the Omikuji model (3:1).
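
As a rough illustration of the 3:1 weighting, the sketch below combines the scores of two models with a weighted average. It assumes that a keyword missing from one model's output counts as a score of zero for that model; the function name, the keywords and the scores are hypothetical and not taken from the actual models.

```python
def weighted_ensemble(scores_by_model, weights):
    """Weighted average of per-keyword scores across models."""
    total_weight = sum(weights.values())
    combined = {}
    for model, scores in scores_by_model.items():
        for keyword, score in scores.items():
            # A keyword absent from a model simply receives no contribution from it.
            combined[keyword] = combined.get(keyword, 0.0) + weights[model] * score
    return {keyword: s / total_weight for keyword, s in combined.items()}


# Hypothetical scores from the two source models for one input text.
scores_by_model = {
    "omikuji": {"koken": 0.8, "tuinieren": 0.2},
    "fasttext": {"koken": 0.4, "tuinieren": 0.6},
}
weights = {"omikuji": 3, "fasttext": 1}
combined = weighted_ensemble(scores_by_model, weights)
```

With these numbers "koken" ends up at about 0.7 and "tuinieren" at about 0.3, so the ensemble follows the Omikuji ranking even though fastText prefers the other keyword.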

    Genre models

    vw-multi ECT Brinkman Vorm

    Vowpal Wabbit is a multiclass and multilabel classification system best suited for classification tasks with a relatively small number of classes.


    Omikuji Brinkman Vorm

    See Omikuji Zaaktrefwoorden for more information.


    Ensemble Brinkman Vorm

    Ensemble of vw-multi ECT Brinkman Vorm and Omikuji Brinkman Vorm.



    When citing this tool, please use the following reference:

    Haighton, T. and Veldhoen, S., (2021) Assisted keyword assignment using Annif. KB Lab: The Hague.

    Annif instruction

    Copy and paste a summary of a book into the text box, select a model from the dropdown menu and click ‘Get suggestions’. The system generates a list of Brinkman keywords that fit the given input. You can set how many keywords are returned by choosing 10, 15 or 20 under ‘Max # of suggestions’.

    Note: the tool only works for Dutch publications.