The Genre classifier was developed within the KB Researcher-in-residence project ‘Discerning Journalistic Styles’ (DJS) of Dr. Frank Harbers. The project examines the opportunities, approaches and issues involved in automatically classifying historical newspaper articles from the Netherlands by ‘genre’, understood as an expression of the historically and culturally determined conception of journalism.
Genres are defined as conventionalised communicative forms, each expressing a specific set of communicative goals. Examining the generic form of newspaper articles from a historical perspective therefore sheds light on the way newspaper journalism has developed. However, particularly in our current ‘age of abundance’, with more historical newspaper material available than traditional methods can handle, there is a great need for new research approaches. In that respect, automatic forms of content analysis are highly suitable for longitudinal and comparative historical research into the development of newspapers.
The DJS project
The DJS project aims to examine genre as a mode of expression of newspapers. This is a particularly challenging task as genres are dynamic and can change or fade away over time while new ones can emerge. Moreover, genres are ideal-typical discursive constructs, which means the textual manifestations do not always match the characteristics of these constructs perfectly, nor can they always be clearly delineated from other genres.
To examine genre in this way, we have outlined an approach to automate the genre classification of historical newspaper articles. This approach builds on an existing set of metadata describing several textual characteristics, including genre, of a large sample of historical newspaper articles. The metadata was compiled during a large-scale project on the historical development of European newspapers: “Reporting at the boundaries of the public sphere. Form, style and strategy of European journalism, 1880-2005”. It was linked to the full text of the corresponding digitised articles in the Dutch newspaper repository of the National Library (KB) to form a labelled training set for an automatic genre classifier.
Quoted text was removed from the articles by means of regular expressions, after which the articles were preprocessed with the NLP suite Frog. From the resulting annotated text a number of features were calculated for each article, including its length and the numbers of adjectives, pronouns of various types, and named entities found in the text. These features were used to train several classifiers to assign each article one of eight possible genres, ranging from news report to opinion article. A linear SVM classifier yielded the best results, with an accuracy of 65%. Since the intercoder agreement for genre in the manual content analysis was around 80%, which indicates how difficult the task is even for trained human coders, this is considered a promising first result.
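The quote removal and feature counting described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the regular expression, the function names, the tag labels "ADJ" and "PRON", and the (word, tag, is_entity) token layout are all assumptions made for illustration, and they do not reproduce Frog's real output format or tag set.

```python
import re
from collections import Counter

# Assumed pattern: passages enclosed in straight, curly, or low-high
# (Dutch style) double quotation marks. The project's real expressions
# are not documented here.
QUOTE_RE = re.compile(r'"[^"]*"|“[^”]*”|„[^”]*”')


def strip_quotes(text):
    """Remove quoted passages and collapse the whitespace left behind."""
    without_quotes = QUOTE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_quotes).strip()


def article_features(tokens):
    """Count simple per-article features.

    `tokens` is a list of (word, pos_tag, is_entity) tuples, a hypothetical
    simplification of what an NLP suite would return. The tag names are
    placeholders, not Frog's actual tags.
    """
    pos_counts = Counter(pos for _, pos, _ in tokens)
    return {
        "length": len(tokens),
        "adjectives": pos_counts["ADJ"],
        "pronouns": pos_counts["PRON"],
        "named_entities": sum(1 for _, _, is_entity in tokens if is_entity),
    }


if __name__ == "__main__":
    print(strip_quotes('Hij zei: "dit is goed" en vertrok.'))
    example = [
        ("De", "DET", False),
        ("nieuwe", "ADJ", False),
        ("minister", "N", False),
        ("Kuyper", "N", True),
        ("sprak", "V", False),
        ("hij", "PRON", False),
    ]
    print(article_features(example))
```

Feature dictionaries of this kind, one per article, could then be converted to numeric vectors and passed to a classifier such as a linear SVM, together with the manually assigned genre labels.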