Two weeks ago I took the train in Groningen at 7.16 AM and arrived around 10 AM in The Hague to start my fellowship as researcher-in-residence. My first day mainly consisted of tasks of a practical and organizational nature (login data, an access pass, printer codes, etc.). Martijn Kleppe gave me a tour of the building with all its corners and corridors. Until then I had seen little more than the general and special collections reading rooms, where I spent quite some time perusing the original historical newspaper material during my PhD research into the development of the press from the 19th century onwards.
During my PhD research (‘Between Personal Experience and Detached Information. The Development of Reporting and the Reportage in Great Britain, the Netherlands and France, 1885-2005’) I studied the development of reporting by manually coding a sample of historical newspapers for characteristics such as topic, genre, sourcing practices and images; a long and arduous endeavor. To make this easier, in my current project at the National Library of the Netherlands (KB), ‘Discerning Journalistic Styles. Exploring Automated Analysis of Journalism’s Modes of Expression’, I will attempt to automate the genre classification of newspaper articles. The large database of newspaper article metadata I compiled during my PhD provides the necessary already-coded test material. Fortunately, I don’t have to do this all by myself: Juliette Lonij, who knows far more about the technical side of this type of research, will collaborate with me on this project.
Now that all practicalities have been arranged, we can ‘really’ get started this week. The first problem we will try to solve is of a more practical nature: my database with metadata about the historical newspapers has to be linked to the actual digitized newspaper material that is found in Delpher. This is the necessary first step towards compiling the datasets we will use to explore and experiment with the different approaches and tools for automating the classification of genre.
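To give an idea of what such a linking step might look like, here is a minimal sketch in Python. It assumes that both the manually coded database and the digitized material can be described by newspaper title, issue date and page number; these field names, the `link_records` function and the example identifier are hypothetical and do not reflect the actual structure of my database or of Delpher.

```python
from datetime import date


def link_key(title, issue_date, page):
    """Build a normalized key so that small differences in spelling
    (capitalization, surrounding spaces) do not prevent a match."""
    return (title.strip().lower(), issue_date, page)


def link_records(coded_rows, digitized_items):
    """Match manually coded rows to digitized items on title, date and page.

    Returns a list of (coded id, candidate identifiers) pairs and a list
    of coded ids for which no digitized counterpart was found.
    """
    # Index the digitized items once, so every lookup is a dictionary access.
    index = {}
    for item in digitized_items:
        key = link_key(item["title"], item["date"], item["page"])
        index.setdefault(key, []).append(item["identifier"])

    linked, unmatched = [], []
    for row in coded_rows:
        key = link_key(row["title"], row["date"], row["page"])
        candidates = index.get(key, [])
        if candidates:
            linked.append((row["id"], candidates))
        else:
            unmatched.append(row["id"])
    return linked, unmatched


# Hypothetical example data, for illustration only.
coded = [
    {"id": 1, "title": "De Telegraaf ", "date": date(1925, 3, 14), "page": 1},
    {"id": 2, "title": "De Telegraaf", "date": date(1925, 3, 15), "page": 2},
]
digitized = [
    {"identifier": "example:0001", "title": "de telegraaf",
     "date": date(1925, 3, 14), "page": 1},
]

linked, unmatched = link_records(coded, digitized)
```

A page usually contains several articles, so a match on these three fields only narrows the search down to a list of candidates; picking the right article from that list would still require a further comparison, for example of headlines.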
What makes this project so challenging is the fact that genres, such as reportages, news reports, background analyses or interviews, cannot be recognized by their topical content (sports, for example), but only by their stylistic and formal characteristics. A reportage, for instance, is typified by its many depictions of atmosphere, but such depictions can relate to the aggressive atmosphere in a football stadium, an impression of the natural beauty of the Amazon, or the tension that can be felt during a police arrest. Instead of topic, you have to focus on other features, such as the use of adjectives.
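As a rough illustration of what a stylistic feature could be, the sketch below computes the share of words in a text that are adjectives. It is deliberately naive: the small word list `ILLUSTRATIVE_ADJECTIVES` and the function `adjective_share` are made up for this example, and a serious attempt would more likely rely on a part-of-speech tagger than on a fixed list. It is not the method used in the project.

```python
import re

# A tiny, made-up word list that stands in for real adjective detection.
ILLUSTRATIVE_ADJECTIVES = {"aggressive", "tense", "beautiful", "quiet", "dark"}


def adjective_share(text, adjectives=ILLUSTRATIVE_ADJECTIVES):
    """Return the fraction of word tokens in `text` found in `adjectives`."""
    # Lowercase the text and keep only runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in adjectives)
    return hits / len(tokens)


# Two invented sentences on the same topic, written in different styles.
descriptive = "The aggressive crowd filled the dark stadium"
factual = "The match ended in a draw"

descriptive_score = adjective_share(descriptive)
factual_score = adjective_share(factual)
```

Both sentences are about football, yet the first scores higher than the second. That is the point of a stylistic feature: it says something about how a text is written, regardless of its topic.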
Manually classifying the genre of articles is time-consuming, which means that you can only code a limited amount of newspaper material. This limitation makes it problematic to generalize about the development of journalism and reporting within a particular cultural context, like the Netherlands. It would therefore be an important step forward if the classification of genre could be automated. That way much larger amounts of material can be examined, making the historical analyses more robust. So, let’s get to work!