Introduction

    Description

    Annif is an open-source tool developed at the National Library of Finland. It combines existing natural language processing and machine learning tools to suggest subjects for an input text, drawn from a controlled vocabulary chosen by the user.

    We use Annif as part of a larger tool, under development for our library catalogers, that aims to make cataloging more efficient by suggesting authors and keywords for a given publication. This project is a continuation of the research described in the whitepaper ‘Exploring possibilities Automated Generation of Metadata’.

    Data

    The data we use to train the models available in Annif come from the GGC database; GGC is a collaborative cataloging system for Dutch libraries. We trained several models on data consisting of titles, subtitles and summaries of Dutch e-books. As controlled vocabulary we used the Brinkman thesaurus, which contains both genre and subject keywords; we trained separate models for each.

    Models

    Full dataset
      TF-IDF Brinkman: A TF-IDF model trained on the whole dataset, i.e. both genre and subject keywords.

    Subject models (Zaaktrefwoorden)
      Omikuji Zaaktrefwoorden: Omikuji is an efficient implementation of Partitioned Label Trees and its variations for extreme multilabel classification.
      fastText Zaaktrefwoorden: fastText builds on word2vec-type models, but learns representations for character n-grams and represents each word as the sum of its n-gram vectors. This subword information helps the embeddings capture prefixes and suffixes. A skipgram model is trained to learn the embeddings.
      Ensemble Zaaktrefwoorden: An ensemble of Omikuji Zaaktrefwoorden and fastText Zaaktrefwoorden. The Omikuji model is given more weight (3:1).

    Genre models (Brinkman Vorm)
      vw-multi ECT Brinkman Vorm: Vowpal Wabbit is a multiclass and multilabel classification system best suited to classification tasks with a relatively small number of classes.
      Omikuji Brinkman Vorm: See Omikuji Zaaktrefwoorden for more information.
      Ensemble Brinkman Vorm: An ensemble of vw-multi ECT Brinkman Vorm and Omikuji Brinkman Vorm.
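
    To make the 3:1 weighting of the subject ensemble concrete, the sketch below merges the scores of two models into one ranked list. It assumes the ensemble takes a weighted average of the per-model scores and treats a keyword that one model did not suggest as having score 0 for that model. The function name, the keywords and the scores are made up for illustration and are not taken from the trained models.

```python
from collections import defaultdict


def combine_suggestions(sources, limit=10):
    """Merge per-model scores into one ranked list using a weighted mean.

    sources: list of (weight, {keyword: score}) pairs, one pair per model.
    A keyword missing from a model contributes 0 for that model.
    """
    total_weight = sum(weight for weight, _ in sources)
    combined = defaultdict(float)
    for weight, scores in sources:
        for keyword, score in scores.items():
            combined[keyword] += weight * score / total_weight
    # Highest combined score first.
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up scores, for illustration only.
omikuji_scores = {"geschiedenis": 0.8, "oorlog": 0.4}
fasttext_scores = {"geschiedenis": 0.4, "politiek": 0.8}

# Omikuji counts three times as much as fastText (3:1).
ranked = combine_suggestions([(3, omikuji_scores), (1, fasttext_scores)])
for keyword, score in ranked:
    print(f"{keyword}\t{score:.2f}")
```

    With these numbers, a keyword suggested only by fastText with a high score still ranks below a keyword that Omikuji suggested with a moderate score, which is the practical effect of the 3:1 weighting.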

     

    Citation

    When citing this tool, please cite it as follows:

    Haighton, T. and Veldhoen, S. (2021). Assisted keyword assignment using Annif. KB Lab: The Hague. http://kbresearch.nl/annif/

    Annif instruction

    Copy and paste a summary of a book into the text box, select a model from the drop-down menu and click ‘Get suggestions’. The system generates a list of Brinkman keywords that fit the given input. You can set how many keywords are returned by choosing 10, 15 or 20 under ‘Max # of suggestions’.

    Note: the tool only works for Dutch publications.
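
    The same steps can in principle be scripted against an Annif instance that exposes Annif's REST API. The sketch below only builds the request and does not send it. The base URL and the project ID are hypothetical placeholders, and it is an assumption that the instance you use makes the API available; the `text` and `limit` form fields correspond to the text box and the ‘Max # of suggestions’ setting.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholders: replace with the Annif instance and project you use.
API_BASE = "http://localhost:5000/v1"
PROJECT_ID = "omikuji-zaaktrefwoorden"


def build_suggest_request(text, limit=10, api_base=API_BASE, project_id=PROJECT_ID):
    """Build (but do not send) a POST request asking for up to `limit` keywords."""
    body = urlencode({"text": text, "limit": limit}).encode("utf-8")
    url = f"{api_base}/projects/{project_id}/suggest"
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder Dutch summary; the models are trained on Dutch text only.
request = build_suggest_request("Een roman over een familie in Rotterdam.", limit=15)
print(request.get_method(), request.full_url)
```

    Passing the request to `urllib.request.urlopen` would send it; an Annif server is expected to answer with JSON containing the suggested keywords and their scores.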