Our current Researcher-in-Residence, Frank Harbers, is well under way with his project “Discerning Journalistic Styles. Exploring Automated Analysis of Journalism’s Modes of Expression”. In this blog post he gives an update on the project and its progress.
It has been several months since I wrote my first blog post about my work as researcher-in-residence, and the research project is now in full swing. The first phase of the project, connecting the metadata from my own database to the historical newspaper data (and metadata) in Delpher, is finished, and we are fully immersed in the main part of the project: training a classifier to automatically determine the genre of historical newspaper articles.
The first phase was not as successful as we had hoped, but we have managed to create a modest dataset to train the classifier. Initially, we hoped to connect the metadata of approximately 33,000 Dutch newspaper articles to the data in Delpher. A crucial factor in the success of this attempt was the extent to which the segmentation of newspaper articles in Delpher matched the way the articles had been segmented for the content analysis that produced the metadata. Unfortunately, it was far from a perfect match. For that reason the newspapers from before 1945 could not be included, which is basically half of the metadata. Furthermore, the postwar editions of De Volkskrant have not been digitized by the KB. In addition, the segmentation of De Telegraaf in the postwar period was so different that we could not include that either. In the end, this meant that we could only use the data of Algemeen Handelsblad/NRC Handelsblad in the postwar period. We quickly saw our dataset shrink from a potential 33,000 articles to a modest 2,000. It is a bit of a setback, but fortunately we can still use this smaller dataset to train a genre classifier. This experience does make clear how crucial segmentation is for the creation of datasets that can be fruitfully used for digital humanities research into journalism history.
At the moment, we are working on the second phase of the project. We have identified several genres that we would like to classify. These genres, such as news reports, reportages, interviews, opinion articles, reviews and news analyses, can shed light on the way journalism developed from a reflective, opinion-oriented practice into a more event-centered and fact-oriented one. At the core of this phase is the translation of the genre definitions into clear linguistic markers that can be identified automatically. Take for instance the news report, a genre that is defined by the use of the inverted pyramid (a story structure in which typical journalistic questions, like Who, What, Where and When, are answered in the first paragraph). Moreover, it often contains direct quotes from sources and is generally a fairly concise article written in a depersonalized, objective style. The question is how to recognize these features automatically in the text. In this case, the quotes can be recognized by the presence of quotation marks (for which high-quality OCR is crucial), and we will attempt to identify the inverted pyramid structure by using named entity recognition to see whether the questions of who was involved and where and when it happened are answered. We hope the depersonalized style can be captured by looking at the absence of a first-person perspective (the use of the pronouns ‘I’ or ‘we’) and the absence of adjectives that create a colorful and subjective account.
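To make the idea of linguistic markers a bit more concrete, here is a minimal sketch in Python of how two of the features mentioned above, quotation marks and first-person pronouns, could be counted in an article text. It is only an illustration and not the software used in the project: the function name `extract_features`, the short Dutch pronoun list and the set of quotation characters are all assumptions made for this example. Counting adjectives and named entities requires a part-of-speech tagger and a named entity recognizer, so those features are left out here.

```python
import re

# Hypothetical, deliberately short list of Dutch first-person pronouns.
# A real feature extractor would probably use a tagger instead of a word list.
FIRST_PERSON = {"ik", "mij", "me", "mijn", "wij", "we", "ons", "onze"}

# Straight and typographic double quotation marks, including the low
# opening mark that is common in older Dutch typesetting. Single quotes
# are not counted, because apostrophes would inflate the number.
QUOTE_CHARS = "\"“”„"

# A crude tokenizer: every run of letters, digits or underscores is a token.
TOKEN_RE = re.compile(r"\w+")


def extract_features(text):
    """Return a few simple surface features for one article text."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    n_tokens = len(tokens)
    first_person = sum(1 for t in tokens if t in FIRST_PERSON)
    quote_marks = sum(text.count(c) for c in QUOTE_CHARS)
    return {
        "n_tokens": n_tokens,
        "quote_marks": quote_marks,
        # Share of tokens that are first-person pronouns (0.0 for empty text).
        "first_person_ratio": first_person / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    sample = 'De minister zei: "Wij komen morgen." Ik was erbij.'
    print(extract_features(sample))
```

A news report would be expected to score relatively high on quotation marks and low on the first-person ratio, while an opinion article might show the opposite pattern. Whether these simple counts really separate the genres is exactly what the classifier experiments have to show.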
Juliette Lonij, programmer on this project, is currently developing the Python software to extract the features on which the classifier will run. She looked into different natural language processing packages for pre-processing the article texts and chose FROG for tokenization and part-of-speech tagging, which suits our research needs quite well (other packages might be added in the future). Today, we ran a first exploratory test with the classifier, which showed promising results. In the coming weeks we will keep testing and refining the classifier. So wish us luck!