The Keyword Generator is a command-line tool that extracts significant keywords from a collection of sample texts provided by the user, based on either a topic model or tf-idf scores. It was developed within the KB Researcher-in-residence project of Dr. Pim Huijnen and is available on GitHub. The resulting keywords can be used to search the KB newspaper collection with the Dictionary viewer companion tool.
When using the Keyword Generator, please cite it as follows:
Lonij, J., Huijnen, P., Keyword generator (2016). KB Lab: The Hague http://lab.kb.nl/tool/keyword-generator
The Keyword Generator is a command-line tool written in Python that can be downloaded from GitHub.
- To run the Keyword Generator, Python 2.7 should be installed
- Gensim is needed for topic modelling and tf-idf score calculation and should be installed as well
- Mallet has to be installed in order to use the Mallet topic modelling option
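The requirements above can be checked before a first run. The following is only a minimal sketch and not part of the Keyword Generator: the helper names `gensim_available` and `find_on_path` are hypothetical, and the assumption that Mallet is reachable as a program called `mallet` on the PATH may not match how the tool itself locates Mallet.

```python
import os
import sys


def gensim_available():
    """Return True if the gensim package can be imported."""
    try:
        import gensim  # noqa: F401
    except ImportError:
        return False
    return True


def find_on_path(program):
    """Return the full path of `program` if it is on the PATH, else None."""
    for folder in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(folder, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print("Python version: %d.%d" % sys.version_info[:2])
    print("Gensim importable: %s" % gensim_available())
    # Assumption: a Mallet launcher named "mallet" is on the PATH.
    print("Mallet found at: %s" % find_on_path("mallet"))
```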
Installing the Keyword Generator only requires unpacking the zip file or downloading the source code from GitHub. This creates a folder ‘keyword-generator’ containing three files (corpus.py, keywords_lda.py, keywords_tfidf.py) and one subfolder ‘data’. The ‘data’ subfolder in turn contains three subfolders of its own: ‘documents’, ‘models’, and ‘stop_words’. Users who want stop words to be excluded can put their own stop word lists in the ‘stop_words’ folder. The Keyword Generator itself is not language-specific, but stop word lists are. The text or collection of texts from which the Keyword Generator will derive its keyword list goes in the ‘documents’ folder. The input should consist of one or more plain text files (.txt extension, UTF-8 encoded).
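Because the input has to be UTF-8 encoded plain text with a .txt extension, it can help to verify the contents of the ‘documents’ folder before running the tool. The sketch below is an illustration only and is not shipped with the Keyword Generator; the function name `check_documents` is hypothetical.

```python
import io
import os


def check_documents(documents_dir):
    """Return (valid, invalid) lists of .txt file names in documents_dir.

    A file is listed as valid when its contents can be decoded as UTF-8.
    Files without a .txt extension are skipped.
    """
    valid, invalid = [], []
    for name in sorted(os.listdir(documents_dir)):
        path = os.path.join(documents_dir, name)
        if not name.endswith(".txt") or not os.path.isfile(path):
            continue
        try:
            with io.open(path, encoding="utf-8") as handle:
                handle.read()
            valid.append(name)
        except UnicodeDecodeError:
            invalid.append(name)
    return valid, invalid


if __name__ == "__main__":
    docs = os.path.join("data", "documents")
    if os.path.isdir(docs):
        valid, invalid = check_documents(docs)
        print("UTF-8 text files: %d" % len(valid))
        for name in invalid:
            print("Not valid UTF-8: %s" % name)
    else:
        print("Folder not found: %s" % docs)
```

Run from within the ‘keyword-generator’ folder, the script reports how many .txt files decode cleanly and names any that do not, so they can be converted to UTF-8 first.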
Once installed, the Keyword Generator can be started by entering 'python keywords_lda.py' or 'python keywords_tfidf.py' at the command line from within the ‘keyword-generator’ folder. Detailed instructions explaining all available options were written by Dr. Pim Huijnen on the KB Research Blog. A brief overview with some example commands can be found on GitHub.
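For readers unfamiliar with tf-idf, the idea behind the tf-idf variant can be illustrated with a toy example. This is a sketch in plain Python under simplifying assumptions (whitespace tokenisation, natural logarithm, no normalisation); it is not the code in keywords_tfidf.py, which relies on Gensim and may weight and normalise scores differently. The function name `tfidf_keywords` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(documents, stop_words=(), top_n=5):
    """Rank the words of each document by a basic tf-idf score.

    documents: list of strings.
    Returns one list of (word, score) pairs per document, best first.
    """
    stop = set(stop_words)
    tokenized = [
        [word for word in doc.lower().split() if word not in stop]
        for doc in documents
    ]

    # Document frequency: the number of documents each word occurs in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = float(len(documents))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {}
        for word, count in counts.items():
            tf = count / float(len(tokens))          # term frequency
            idf = math.log(n_docs / doc_freq[word])  # inverse document frequency
            scores[word] = tf * idf
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        results.append(ranked[:top_n])
    return results


if __name__ == "__main__":
    samples = [
        "the tram strike in the city",
        "the harvest in the north",
        "the tram line to the harbour",
    ]
    for ranking in tfidf_keywords(samples, stop_words=["the", "in", "to"]):
        print([word for word, _ in ranking])
```

Words that occur in only one sample text, such as "strike", score higher than words shared across texts, such as "tram", which is why tf-idf tends to surface terms that are distinctive for a document.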