Code created during the KB Researcher-in-Residence project "Why girls smile and boys don't cry". This repository provides tools for training and fine-tuning word embedding models (Word2Vec and FastText) on a selected subset of the Dutch newspapers available in Delpher.
It comes with various functions to explore the trained embeddings. Lexicon expansion allows you to "travel through a vector space" and interactively create a lexicon of conceptually related words in the process. In the Bias folder, you will find various tools for analysing bias over time and along other dimensions such as political leaning and place. You can, for example, inspect how bias changes over time and compare its evolution across different facets. Besides these timelines, you can zoom in on a specific year and inspect the words that drive these differences, either by plotting the distribution of bias scores by facet and category or by ranking words by their bias scores. The notebook gives an overview of all functions for analysing bias in word embeddings trained on the Delpher newspapers.
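To make the idea of a bias timeline and a ranking by bias score more concrete, here is a minimal sketch in plain Python. It is not the code from the Bias folder: the scoring rule (mean cosine similarity to a set of "female" words minus mean cosine similarity to a set of "male" words), the function names and the tiny two-dimensional vectors are all assumptions made for illustration, and the repository may compute its scores differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bias_score(word, female_words, male_words, vectors):
    """Hypothetical score: > 0 means closer to the female words, < 0 closer to the male words."""
    to_female = sum(cosine(vectors[word], vectors[f]) for f in female_words) / len(female_words)
    to_male = sum(cosine(vectors[word], vectors[m]) for m in male_words) / len(male_words)
    return to_female - to_male

# Invented toy embeddings, one small "model" per year (assumption, not real data).
TOY_MODELS = {
    1850: {"vrouw": [1.0, 0.0], "man": [0.0, 1.0], "huilen": [0.9, 0.1], "werken": [0.2, 0.8]},
    1890: {"vrouw": [1.0, 0.0], "man": [0.0, 1.0], "huilen": [0.6, 0.4], "werken": [0.4, 0.6]},
}

# Timeline: the bias score of one target word in each year.
timeline = {
    year: bias_score("huilen", ["vrouw"], ["man"], vectors)
    for year, vectors in TOY_MODELS.items()
}

# Zoom in on one year: rank target words from most "female" to most "male".
targets = ["werken", "huilen"]
ranked_1850 = sorted(
    targets,
    key=lambda w: bias_score(w, ["vrouw"], ["man"], TOY_MODELS[1850]),
    reverse=True,
)

for year in sorted(timeline):
    print(year, round(timeline[year], 3))
print(ranked_1850)
```

Because each score is computed inside a single year's vector space, the scores can be compared across years without aligning the separately trained models.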
When citing this tool we request you cite it as follows:
Beelen, K., Cuper, M. (2020) Word embedding playground. KB Lab: The Hague. https://lab.kb.nl/tool/word-embedding-playground
Lexicon expansion provides functionality to interactively explore (i.e. travel through) word vector spaces. The screencast gives a quick overview of the process, but more functions are available.
The different steps covered in the screencast are:
- Select seed words: in this case we chose "vrouw" and "vrouwen" as the seed query
- Select sampling strategy: "average" is the simplest method and samples the closest neighbours of the query vector; the other options are "query_tokens", "entropy" and "distance".
- Annotate: Core words are added to the lexicon and influence the construction of the query vector. Peripheral words are saved but do not influence the sampling. In this scenario I added unambiguously "female" words to the Core lexicon and OCR variants to the Peripheral word list; these are saved in case they are needed later. I ignored all other words (Ignore).
- Update lexicon with annotations: the next code blocks update the lexicon with the annotations. You can now go back to the previous step to harvest more words (but don't forget to save afterwards!) or plot the results.
- Plot the lexicon and surrounding words: the visualisation plots all the selected words on a 2D plane. The re
- Save lexicon: save the results of the annotation process for later use.
The expansion normally consists of multiple iterations.
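The steps above can be sketched in plain Python as one iteration of the loop. This is only an illustration of the "average" strategy as described here (query vector = mean of the Core word vectors, candidates = nearest neighbours by cosine similarity). The function names, the toy vectors and the pre-filled annotations are hypothetical and do not come from the repository, where the annotation step is interactive.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_vector(words, vectors):
    """Query vector: element-wise mean of the vectors of the given words."""
    dims = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]

def sample_candidates(core, peripheral, vectors, topn=2):
    """Return the topn unseen words closest to the average of the Core words."""
    query = average_vector(core, vectors)
    seen = set(core) | set(peripheral)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in seen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:topn]]

# Invented toy embeddings (assumption, not real model output).
TOY_VECTORS = {
    "vrouw": [0.9, 0.1, 0.0],
    "vrouwen": [0.8, 0.2, 0.1],
    "meisje": [0.85, 0.15, 0.05],
    "vronw": [0.88, 0.12, 0.02],  # stands in for an OCR variant of "vrouw"
    "fiets": [0.0, 0.1, 0.9],
    "krant": [0.1, 0.9, 0.2],
}

# Step 1: seed words. Step 2: sample with the "average" strategy.
core = ["vrouw", "vrouwen"]
peripheral = []
candidates = sample_candidates(core, peripheral, TOY_VECTORS, topn=2)

# Step 3: annotate (hard-coded here; unlisted words count as "ignore").
annotations = {"meisje": "core", "vronw": "peripheral"}

# Step 4: update the lexicon. Only Core words change the next query vector.
for word in candidates:
    label = annotations.get(word, "ignore")
    if label == "core":
        core.append(word)
    elif label == "peripheral":
        peripheral.append(word)

print(candidates, core, peripheral)
```

Calling `sample_candidates` again with the updated lists corresponds to a second iteration: Peripheral words are excluded from the candidates but, unlike Core words, leave the query vector unchanged.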
For more information and to see the screencast in its original size, go to this page.
The repository with information and the tools can be found on the KBNL research GitHub page.
You can find the general introduction here, along with an explanation of how to use the code and hyperlinks to other modules.
For more information on the Lexicon Expansion, see this README.
For more detailed instructions, please see this Notebook.
Information about the selected subset of Dutch Newspapers available in Delpher can be found here.
There is also a repository on Zenodo which contains Word2Vec models trained on Dutch historical newspaper data covering the period from 1840 to 1890.