25 Jan 2017

Bridging the gap between quantitative and qualitative research in digital newspaper archives


This blog post is written by Thomas Smits, KB Researcher-in-residence from May 2017

One of the central and most far-reaching promises of the so-called Digital Humanities has been the possibility to analyse large datasets of cultural production, such as books, periodicals, and newspapers, in a quantitative way. Since the early 2000s, Humanities 3.0, as Rens Bod has called it, has been posited as being able to discover new patterns, mostly over long periods of time, that were overlooked by traditional qualitative approaches.[1] In the last couple of weeks a study by a team of academics led by Professor Nello Cristianini of the University of Bristol made headlines: “This AI found trends hidden in British history for more than 150 years” (Wired) and “What did Big Data find when it analysed 150 years of British history?” (Phys.org). Did Big Data and Humanities 3.0 finally deliver on their promise? And could the KB’s collection of digitised newspapers be used for similar research?

The study, “Content analysis of 150 years of British periodicals”, is based on a corpus of 28.6 billion words, contained in 35.9 million articles from 120 regional, or local, British newspapers from the period 1800-1950.[2] Focussing on six spheres – values and beliefs, UK politics, technology, economy, social change, and popular culture – the study is mostly based on the ‘use frequency’ of n-grams: the number of times a word, or a combination of words, appears in relation to all the words of the corpus in a specific year. For example, if a corpus consists of 100 words and the 2-gram (a combination of two words) “Digital Humanities” appears three times, the use frequency of this 2-gram is 3/100 = 0.03. In short: by counting n-grams the researchers were able to measure the relative importance of certain words, or combinations of words.
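The arithmetic of this example can be written out as a short Python sketch. It is only an illustration of the definition given above, not the code used in the British study: the function names `ngrams` and `use_frequency` and the 100-word toy corpus are invented here, and tokenisation is assumed to have been done already.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def use_frequency(tokens, phrase):
    """Occurrences of `phrase` divided by the total number of words.

    Hypothetical helper: it follows the definition in the text above
    (count of the n-gram relative to all words in the corpus).
    """
    if not tokens:
        return 0.0
    target = tuple(phrase.lower().split())
    lowered = [token.lower() for token in tokens]
    counts = Counter(ngrams(lowered, len(target)))
    return counts[target] / len(tokens)


# Toy corpus (an assumption for illustration): 100 words in total,
# in which the 2-gram "digital humanities" occurs three times.
tokens = ["digital", "humanities"] * 3 + ["word"] * 94

print(len(tokens))                                   # 100
print(use_frequency(tokens, "Digital Humanities"))   # 0.03
```

To follow a term over time, the same calculation would be repeated separately for the words published in each year, so that every year yields its own use frequency.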

By using this method, the researchers were able to pinpoint specific historic events in their corpus, such as coronations, the election of a new pope, and outbreaks of several contagious diseases. More importantly, they used their method to test the validity of certain long-held notions about the nineteenth century. For example, the study suggests a very clear timeline in the emergence of the concept of “Britishness” in the popular imagination. While recent studies have posited that ‘national identity’ has deep historical roots, predating the nineteenth century, Cristianini and his colleagues found that “British” overtook “English” only in the late nineteenth century, supporting the close connection between the production of national identity, modernity, and the rise of mass media.[3]


While the researchers carefully composed their corpus, further contextualisation could make future research less biased. First of all, while the 120 newspaper titles studied in this research represent roughly 14% of all published titles, it remains unclear to what part of the press landscape these titles belonged. For example, the study neglects to discuss the fact that newspapers predominantly aimed to reach middle-class readers. This bias is reinforced by two factors. First, publications directed at the lower classes, such as those of the Chartist movement or the so-called penny press, were often deemed unworthy of archiving.[4] Second, digitisation repeats this selection: well-known nineteenth-century titles are more likely to be digitised than lesser-known, but not necessarily less influential, cheaper and/or radical publications.[5] Furthermore, by focussing on regional newspapers, the study aspires to mitigate the London-centric bias of research based on newspaper coverage. However, it neglects to account for the fact that in the first half of the nineteenth century many regional newspapers copied articles from London-based newspapers on a large scale, while syndication of content achieved, to some extent, the same result in the final decades of the century.[6]

Most crucially, the researchers seem to equate attention given in newspapers to historical significance. By doing so, they run the risk of failing to acknowledge the most important aspect of the medial form of the newspaper: its focus on ‘newsworthy’ events. The importance of certain long-term developments, which were never perceived as being radically new by contemporary commentators, can only be recognised with the benefit of hindsight. This leads to a somewhat paradoxical situation: digital newspaper archives are used to discover long-term trends, while newspaper discourse is mostly centred on short-term developments.


What can researchers using the Delpher corpus learn from this study? The research department of the KB has already taken important steps in the large-scale analysis of digitised newspapers. Its open access n-gram viewer, developed by the University of Amsterdam, enables any user to replicate important parts of the British research. For example, a search for ‘nieuwe paus’ [new pope] yields the same results as the British study. In addition, recent projects of the KB’s fellows and researchers-in-residence use Dutch digitised newspapers in innovative ways. I hope that my own project, which applies two computer vision techniques to images in Dutch newspapers, can continue this tradition.

One could even ask why British digitised newspapers are not used more frequently for research similar to that of Cristianini. Two private companies, Gale and FindMyPast, provide access to the collection of digitised historical newspapers originally archived by the British Library. In a recent article, I compare these companies to Trove, the digital collection of Australian newspapers maintained by the government.[7] While public digital collections, such as Trove and Delpher, encourage users to tweak the archive, providing them with access to ‘raw’ data and APIs, private companies, such as FindMyPast, focus on a specific kind of use, in this case amateur historians studying their family history. As a result, access to raw data is expensive, which makes it relatively hard to use, especially for junior researchers. In my opinion, we should take a critical look at the network of actors involved in the digitisation of newspapers, books, and other sources. Private companies, such as Google and FindMyPast, increasingly shape our access to the past and our use of historical sources. As I argue in my article, we should continue to discuss how this influences the ways that both researchers and the general public are able to interpret the past and relate to it in ways that are meaningful to them.

Bridging the gap

(Media) historians using qualitative methods would be wise to take note of the results of this study. It opens up a new world of possible research and shows how quantitative analysis can be used to substantiate existing theories. More importantly, the article raises the question whether the strict separation between qualitative and quantitative research, or distant and close reading, is useful in distinguishing ‘traditional’ methods from their ‘digital’ counterparts. As the study amply shows, insights from traditional research are essential in defining the questions and contextualising both the corpus and the results of this kind of data-driven research. The project points to the importance of interdisciplinary research teams and, hopefully, will help to break down the trenches into which practitioners of the ‘digital’ and the ‘traditional’ humanities have dug themselves.

Thomas Smits
PhD researcher of illustrated newspapers and transnational visual news culture
Thomas Smits is completing a PhD on the transnational trade. He is an editor of the Journal for European Periodicals Research (JEPS) and a PhD-board member of the Royal Netherlands Historical Society.
Additional information

[1] R. Bod, “Who is afraid of patterns? The Particular versus the Universal and the Meaning of Humanities 3.0” BMGN 128, no. 4 (2013): 171-80, 175.

[2] Lansdall-Welfare et al., “Content analysis of 150 years of British periodicals” PNAS (published ahead of print January 9, 2017). doi:10.1073/pnas.1606380114.

[3] This body of scholarship is mostly connected to Benedict Anderson’s concept of the imagined community. B. Anderson, Imagined Communities. London: Verso, 1983.

[4] For an explanation of what French press historian Jean-Pierre Bacot has called the ‘downward spiral of popularity’ of nineteenth-century newspapers and periodicals see Andrew King’s work on the London Journal: J.P. Bacot, La presse illustrée au XIXe siècle: une histoire oubliée. Limoges: PULIM, 2005, 75; A. King, The London Journal 1845-83: Periodicals, Production, and Gender. Aldershot: Ashgate, 2004, 16.

[5] A. Hobbs, “The Deleterious Dominance of The Times in Nineteenth-Century Scholarship” Journal of Victorian Culture 18, no. 4 (2013): 472-497.

[6] M. Beals, “Musings on a Multimodal Analysis of Scissors-and-Paste Journalism (Part 1),” accessed November 22, 2016, http://mhbeals.com/musings-on-a-multimodal-analysis-of-scissors-and-pas…; B. Nicholson, “‘You Kick the Bucket; We Do the Rest!’: Jokes and the Culture of Reprinting in the Transatlantic Press” Media History 17, no. 3 (2012): 277-278.

[7] T. Smits, “Making the News National: Using Digitized Newspapers to Study the Distribution of the Queen’s Speech by W. H. Smith & Son, 1846–1858” Victorian Periodicals Review 49, no. 4 (2016): 598-625. DOI: 10.1353/vpr.2016.0041