Annif is an open source tool developed at the National Library of Finland. It combines existing natural language processing and machine learning tools to suggest subjects for an input text from a user-defined controlled vocabulary.
We use Annif as part of a larger tool that is being developed to help our library catalogers work more efficiently by suggesting authors and keywords for a given publication. This project is a continuation of the research described in the whitepaper ‘Exploring possibilities Automated Generation of Metadata’.
The data we use to train the models available in Annif comes from the GGC database; GGC is a collaborative cataloging system for Dutch libraries. We have trained several models on data consisting of titles, subtitles and summaries of Dutch e-books. As the controlled vocabulary we used the Brinkman thesaurus. Because the Brinkman thesaurus contains both genre and subject keywords, we trained separate models for each.
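To illustrate how a catalog record might be turned into a training example, here is a minimal sketch. It assumes a tab-separated corpus layout with the document text in the first column and the subject URIs, in angle brackets, in the second; the exact format Annif expects should be checked against its documentation. The `Record` class, its field names and the example URIs are hypothetical and are not taken from the GGC data model.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical catalog record; field names are assumptions."""
    title: str
    subtitle: str = ""
    summary: str = ""
    subject_uris: list[str] = field(default_factory=list)


def to_tsv_line(record: Record) -> str:
    """Join title, subtitle and summary into one text column,
    followed by a tab and the subject URIs in angle brackets."""
    parts = [record.title, record.subtitle, record.summary]
    # Collapse all whitespace (including tabs and newlines) so that each
    # record stays on one line and the only tab is the column separator.
    text = " ".join(" ".join(p.split()) for p in parts if p.strip())
    uris = " ".join(f"<{uri}>" for uri in record.subject_uris)
    return f"{text}\t{uris}"


example = Record(
    title="Het  huis",
    summary="Een roman\tover\neen familie.",
    # Placeholder URIs, not real Brinkman identifiers.
    subject_uris=["http://example.org/brinkman/1",
                  "http://example.org/brinkman/2"],
)
print(to_tsv_line(example))
```

Empty fields are skipped, so a record without a subtitle or summary still yields a valid line.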
|Category||Model||Description|
|Full dataset||TF-IDF Brinkman||A TF-IDF model trained on the whole dataset, i.e. both genre and subject keywords.|
|Subject models||Omikuji Zaaktrefwoorden||Omikuji is an efficient implementation of Partitioned Label Trees and their variations for extreme multilabel classification.|
|Subject models||fastText Zaaktrefwoorden||fastText builds on word2vec-type models but learns representations for character n-grams; each word is represented as the sum of its n-gram vectors. This subword information helps the embeddings capture prefixes and suffixes. A skipgram model is trained to learn the embeddings.|
|Subject models||Ensemble Zaaktrefwoorden||Ensemble of Omikuji Zaaktrefwoorden and fastText Zaaktrefwoorden. More weight is given to the Omikuji model (3:1).|
|Genre models||vw-multi ECT Brinkman Vorm||Vowpal Wabbit is a multiclass and multilabel classification system best suited for classification tasks with a relatively small number of classes.|
|Genre models||Omikuji Brinkman Vorm||See Omikuji Zaaktrefwoorden for more information.|
|Genre models||Ensemble Brinkman Vorm||Ensemble of vw-multi ECT Brinkman Vorm and Omikuji Brinkman Vorm.|
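The 3:1 weighting of the subject ensemble can be illustrated with a small sketch. It assumes that an ensemble combines the per-subject scores of its source models as a weighted average, and that a subject a model did not suggest counts as a score of 0 for that model. The function name, the model keys and the example scores are made up for illustration; this is not Annif's own code.

```python
def combine_scores(suggestions_by_model, weights, limit=10):
    """Weighted average of per-subject scores across models.

    suggestions_by_model: {model_name: {subject: score in [0, 1]}}
    weights: {model_name: relative weight}, e.g. 3 and 1 for a 3:1 ensemble
    """
    total_weight = sum(weights[model] for model in suggestions_by_model)
    combined = {}
    for model, suggestions in suggestions_by_model.items():
        share = weights[model] / total_weight
        for subject, score in suggestions.items():
            # A subject missing from a model simply adds nothing here,
            # which is equivalent to that model scoring it 0.
            combined[subject] = combined.get(subject, 0.0) + share * score
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Illustrative scores only, not real model output.
suggestions = {
    "omikuji": {"geschiedenis": 0.8, "politiek": 0.4},
    "fasttext": {"geschiedenis": 0.4, "economie": 0.8},
}
for subject, score in combine_scores(suggestions, {"omikuji": 3, "fasttext": 1}):
    print(f"{subject}\t{score:.2f}")
```

With these weights a subject suggested only by the fastText model needs a much higher score to outrank one suggested only by Omikuji, which is the practical effect of giving Omikuji more weight.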