Newspaper ngram collection


    This dataset was generated by PoliticalMashup and contains yearly counts for word ngrams for n ranging from 1 to 5 from the KB newspaper collection. PoliticalMashup is a start-up project including the University of Amsterdam Informatics Dept., the University of Groningen Documentation Centre for Political Parties (DNPP), and the University of Twente Computer Linguistics Dept.

    The dataset is searchable in the Newspaper ngram viewer, but we also offer the data as a downloadable set for your convenience. Please note! The data has not been updated since it was produced in 2013 and does not contain the same newspapers as those currently available on Delpher.

    The data

    The Delpher newspaper corpus has been analysed to build an ngram viewer. To do this, the text was divided into sentences, and from each sentence all 1-, 2-, 3-, 4- and 5-grams were extracted. For example, the following sentence can be divided into separate words ('lorem', 'ipsum', ...), 2-grams ('lorem ipsum', 'ipsum dolor', ...), and so on up to 5-grams ('lorem ipsum dolor sit amet', ...).

    Lorem ipsum dolor sit amet, consectetur adipiscing elit.

    In addition, punctuation is removed so that 'amet' and 'amet,' are not treated as different terms. Likewise, capital letters are replaced by lowercase letters so that 'Lorem' and 'lorem' count as the same word. The downside is that words whose capitalisation signals a different meaning (the name 'Bakker' versus the profession 'bakker') are treated as one word.
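The extraction and normalisation steps described above can be sketched as follows. This is a minimal illustration, not the PoliticalMashup pipeline: the function names normalise and ngrams are hypothetical, and the exact tokenisation rules of the original processing are not documented here, so whitespace splitting after removing punctuation is an assumption.

```python
import re

def normalise(sentence):
    """Lowercase a sentence, remove punctuation, and split it on whitespace.

    Assumption: "punctuation" means every character that is neither a word
    character nor whitespace. The original pipeline may differ.
    """
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    return cleaned.split()

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
tokens = normalise(sentence)
for n in range(1, 6):
    # Show the first two n-grams for each n from 1 to 5
    print(n, ngrams(tokens, n)[:2])
```

Note that 'amet,' becomes 'amet' and 'Lorem' becomes 'lorem', as in the description above.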

    OCR errors are a big problem in the corpus and can lead to extreme variation in the number of terms. When analysing the text of the corpus, a large number of terms turn out to be present only once, which can most likely be attributed to OCR errors. The data is therefore filtered: a 1-gram has to appear at least twice in the whole corpus before it is included in the index, and 2- to 5-grams have to appear more than 5 times in a single year.

    Type and token frequencies

    The following table gives, for each n, the vocabulary size (number of types, second column) and the total number of ngram tokens (third column).

    Ngram size          Types           Tokens
    KBngram1       49.514.842   18.437.979.846
    KBngram2       39.156.451   11.821.165.297
    KBngram3       65.169.507    5.808.214.106
    KBngram4       47.955.071    2.386.522.277
    KBngram5       46.222.852    1.056.997.790
    Total         248.018.723   39.510.879.316

     

    Citation

    When using the data of the newspaper ngram viewer, please cite it as follows:

    B. de Goede, J. van Wees and M. Marx, 'PoliticalMashup Ngramviewer', in: Proceedings of the 13th Dutch-Belgian Workshop on Information Retrieval 2013, pp. 54-55.

    Instructions

    Setup

    The dataset is divided into five sets:

    • 1-grams: KBngram1
    • 2-grams: KBngram2
    • 3-grams: KBngram3
    • 4-grams: KBngram4
    • 5-grams: KBngram5

    For each n, the data contains a subdirectory KBngramN with

    • A tab-separated index file called IndexKBNgram1600-1995.tsv with schema:

         #ngram #YearFrequency #CorpusFrequency "Array with Year:Frequency pairs for all non-zero years"

    • A tab-separated yearly total file called Ngram-TotalYearFrequencies.csv with schema:

         #Total number of ngram types and tokens per year #Year # of Types # of tokens

    This information is provided for normalisation. Using the number of tokens per year, one can report the percentage of tokens matching a word rather than its absolute frequency. Because the corpus size varies greatly from year to year, this is useful (it is also the default option in the Google ngram viewer).

    • A folder indexesPerYear containing, for each year, tab-separated files with schema: ngram year frequency. Note that the counts are per year only from 1840 onwards. For 1600 and 1700 the counts cover those centuries. For 1800, 1810, 1820 and 1830 the counts cover those decades.
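The normalisation mentioned for the yearly totals file can be sketched as below. This is an illustration under assumptions: the helper names parse_totals and relative_frequencies are hypothetical, the totals file is assumed to hold one "year, types, tokens" row per line with header lines starting with '#', and the sample numbers are made up rather than taken from the dataset.

```python
def parse_totals(lines):
    """Map year -> total number of tokens.

    Assumed row layout: year<TAB>types<TAB>tokens; lines starting with '#'
    are treated as header comments and skipped.
    """
    totals = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        year, _types, tokens = line.split("\t")
        totals[int(year)] = int(tokens)
    return totals

def relative_frequencies(year_counts, totals):
    """Turn absolute yearly counts into percentages of that year's tokens."""
    return {
        year: 100.0 * count / totals[year]
        for year, count in year_counts.items()
        if totals.get(year)  # skip years without a (non-zero) total
    }

# Made-up sample rows, for illustration only
sample_totals = [
    "#Year\t# of Types\t# of tokens",
    "1900\t1000\t200000",
    "1901\t1200\t400000",
]
totals = parse_totals(sample_totals)
print(relative_frequencies({1900: 20, 1901: 20}, totals))
```

In this made-up sample the same absolute count of 20 is twice as large a share of the 1900 text as of the 1901 text, which is the effect normalisation is meant to expose.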

    The bigram folder contains indexes for quickly finding which words occur before or after a given word. A README file explains how they work and how to use them.

    The unigram index contains all words that occur at least twice in the whole collection. Thus only "corpus hapaxes" are removed from the unigram vocabulary. That leaves 50M (to be precise: 49.514.842) unique unigrams. Of these, 16M occur in only one year, and 9.6M occur just twice (the minimum) in the complete collection. The total number of unigram tokens is 18.437.979.846.

    The ngram indexes for n = 2, 3, 4, 5 contain only ngrams that occur strictly more than 5 times in at least one year. In the index files per year, only those ngrams have been kept that occur strictly more than 5 times in that year. This means that the counts for an ngram include only occurrences that fall in years with at least 6 occurrences. Thus the corpus frequency of an ngram is the sum of its year frequencies over the years in which it occurs at least 6 times.
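The effect of this yearly threshold on the corpus frequency can be illustrated with a small sketch. The function name corpus_frequency and the counts are hypothetical; only the "at least 6 occurrences in a year" rule comes from the description above.

```python
MIN_YEAR_COUNT = 6  # "strictly more than 5 times" in a single year

def corpus_frequency(year_counts, min_year_count=MIN_YEAR_COUNT):
    """Keep only years that reach the threshold and sum their counts.

    Returns (kept_years, corpus_frequency). An ngram with no kept year
    would not appear in the index at all.
    """
    kept = {year: count for year, count in year_counts.items()
            if count >= min_year_count}
    return kept, sum(kept.values())

# Made-up raw counts for one 2-gram
raw = {1900: 3, 1901: 6, 1902: 12, 1903: 5}
kept, total = corpus_frequency(raw)
print(kept, total)  # the 3 and 5 occurrences are dropped, so total is 18
```

So the corpus frequency in the index can be lower than the true number of occurrences in the newspapers: here 18 instead of 26.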

    Access

    KBngram1:

    KBngram2:

    KBngram3:

    KBngram4:

    KBngram5:

    Examples

    Word Before and After

    Here are two examples of how to use the preceding and following word indexes. Note that the files are sorted (descending) on CorpusFrequency (the last column).

    (marx@mashup3) cat WordBeforeIndex.tsv | awk -F$'\t' '$2~/^maarten$/'|head
    sint    maarten    133    37351
    van     maarten    142    6845
    en      maarten    104    5499
    aan     maarten    56     1738
    door    maarten    62     1155
    heer    maarten    77     971
    de      maarten    78     885
    jan     maarten    45     735
    dat     maarten    50     733

    (marx@mashup3) cat WordAfterIndex.tsv |awk -F$'\t' '$1~/^ondeugende/'|head
    ondeugende    meisjes          13    2409
    ondeugende    vrouwen          3     985
    ondeugende    huisvrouwtjes    2     935
    ondeugende    hete             4     796
    ondeugende    babbelaars       3     704
    ondeugende    meid             12    682
    ondeugende    jongen           50    646
    ondeugende    kinderen         49    536
    ondeugende    meiden           4     536
    ondeugende    streken          36    510
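For readers who prefer a script over awk, the same lookup can be sketched in Python. The function name words_before is hypothetical. The column layout (first word, second word, a year-frequency column, CorpusFrequency last) is inferred from the example output above and is an assumption; the README in the bigram folder is the authoritative description. The sample rows are copied from the two examples above.

```python
def words_before(lines, target, limit=10):
    """Return (preceding word, year frequency, corpus frequency) for `target`.

    Assumes four tab-separated columns per line and a file already sorted
    in descending order on the last column, so the first `limit` hits are
    the most frequent ones (like piping the awk filter into head).
    """
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed or header lines
        first, second, year_freq, corpus_freq = fields
        if second == target:
            hits.append((first, int(year_freq), int(corpus_freq)))
            if len(hits) == limit:
                break
    return hits

# Rows copied from the example output above
sample = [
    "sint\tmaarten\t133\t37351",
    "van\tmaarten\t142\t6845",
    "ondeugende\tmeisjes\t13\t2409",
]
print(words_before(sample, "maarten"))
```

To run this on the real file, pass an open file object instead of the sample list, for example open("WordBeforeIndex.tsv", encoding="utf-8"); the encoding is an assumption. Filtering on the first column instead of the second gives the equivalent of the WordAfterIndex.tsv query.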

    Note on hapaxes in the unigrams

    To keep the index manageable, we removed from it all unigrams that occurred just once in the complete corpus. Note, however, that the hapaxes are still present in the yearly index files in indexesPerYear, so in principle they can be recomputed.

    Because of this slight mismatch between the index and the indexes per year, the file with yearly counts, 1gram-TotalYearFrequencies.csv, gives for each year the complete vocabulary size and the total number of unigram tokens, both including hapaxes. The hapaxes are also counted in the total number of unigram tokens. Only in the total vocabulary size are the hapaxes NOT counted (as this is calculated as the number of lines in the index).

    Due to OCR errors the number of hapaxes per year is very large. As an example, consider 1926. It has 12.8M unique unigrams, of which 10.4M are hapaxes.

    (marx@mashup3) zcat 1926-1gram-Min1.csv.gz| awk -F$'\t' '$3==1'|wc -l
    10430964

    (marx@mashup3) zcat 1926-1gram-Min1.csv.gz|wc -l
    12820204

    (marx@mashup3) for n in `seq 1 10`; do c=`zcat 1926-1gram-Min1.csv.gz| awk -F$'\t' -v n=$n '$3==n'|wc -l`; echo -e "$n\t$c";done
    1     10430964
    2     965511
    3     372856
    4     204135
    5     127698
    6     88945
    7     65633
    8     50531
    9     40468
    10    33437
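The loop above reads the compressed file ten times. A single pass that counts how many unigrams occur exactly n times might look like the sketch below. The function name frequency_spectrum is hypothetical, the rows follow the "ngram year frequency" schema given for indexesPerYear, and the sample rows are made up to mimic OCR variants.

```python
from collections import Counter

def frequency_spectrum(lines):
    """Count how many ngram types have each frequency.

    Expects tab-separated rows 'ngram<TAB>year<TAB>frequency'; rows with
    fewer than three fields are skipped.
    """
    spectrum = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        spectrum[int(fields[2])] += 1
    return spectrum

# Made-up rows, for illustration only
sample = [
    "de\t1926\t500",
    "krant\t1926\t2",
    "krnat\t1926\t1",
    "amsterdam\t1926\t2",
    "amstcrdam\t1926\t1",
]
spectrum = frequency_spectrum(sample)
for n in range(1, 4):
    print(f"{n}\t{spectrum[n]}")
```

On the real data the lines could be streamed with gzip.open("1926-1gram-Min1.csv.gz", "rt", encoding="utf-8"), assuming the files are UTF-8 encoded; spectrum[1] would then be the number of hapaxes for that year.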