Impresso

Critical text mining of newspaper archives

Source: This example was kindly contributed by Matteo Romanello, Digital Humanities Laboratory, EPFL, Switzerland

The Impresso project aims to enable critical text mining of newspaper archives with the implementation of a technological framework to extract, process, link, and explore data from print media archives.

The project works with a large corpus of historical newspaper data in several languages, mainly French and German but also Luxembourgish, covering the period from the end of the 18th century to the present (see state of newspaper collection).

To achieve this aim, several NLP techniques are employed, including:

Named entity (NE) processing (classification and disambiguation/linking)
NE recognition and classification: custom tag set, based on the Quaero annotation guidelines
NE linking: against Wikidata and an internal knowledge base (KB)
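
As a rough illustration of what these two tasks produce, the sketch below represents each mention as a stand-off record: character offsets into the text, a tag, and an optional Wikidata identifier. The class `EntityMention`, the helper `make_mention`, the coarse tag `loc`, and the sample sentence are all assumptions made for this example; the project's actual tag set and data model are not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityMention:
    """Hypothetical stand-off record for one named entity mention."""
    begin: int                         # character offset of the first character
    end: int                           # character offset one past the last character
    tag: str                           # illustrative tag, not the project's tag set
    wikidata_id: Optional[str] = None  # None when the mention is not linked

    def surface(self, text: str) -> str:
        """Return the text span this mention points to."""
        return text[self.begin:self.end]


def make_mention(text: str, surface: str, tag: str,
                 wikidata_id: Optional[str] = None) -> EntityMention:
    """Build a mention from the first occurrence of `surface` in `text`."""
    begin = text.index(surface)
    return EntityMention(begin, begin + len(surface), tag, wikidata_id)


# Sample sentence invented for this example.
text = "Le train de Paris est arrivé à Berlin."

mentions = [
    make_mention(text, "Paris", "loc", "Q90"),   # Q90 is Paris on Wikidata
    make_mention(text, "Berlin", "loc", "Q64"),  # Q64 is Berlin on Wikidata
]

for m in mentions:
    print(m.begin, m.end, m.tag, m.wikidata_id, m.surface(text))
```

Keeping the annotations separate from the text in this way means the source text is never modified, so recognition, classification, and linking results can be added or revised independently.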

Due to the special nature of the text corpus, off-the-shelf models cannot be used for the above tasks. For this reason, INCEpTION will be used as the annotation environment for an extensive annotation campaign (running from August 2018 to June 2019). We will use it to annotate NE-related information in our data, either from scratch or starting from pre-annotated data.

Features that we consider of particular importance to our use case are:

import of structured stand-off annotations generated by Python code via Cassis
linking against a KB: INCEpTION’s functionality needs to scale to Wikidata-sized resources, with tens of thousands of class instances
active learning, to speed up the annotation process
monitoring, to follow the annotators’ progress
inter-annotator agreement
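
To make the last item concrete, here is a minimal sketch of one common agreement measure, Cohen's kappa, computed over token-level labels from two annotators. The choice of Cohen's kappa is an assumption: the text above does not say which measure is used. The function name `cohen_kappa` and the sample labels are likewise invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability of agreeing if each annotator labelled
    # at random according to their own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented token-level labels ("O" marks a token outside any entity).
annotator_1 = ["O", "loc", "O", "pers", "O", "O"]
annotator_2 = ["O", "loc", "O", "O", "O", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.6
```

In this example the annotators agree on 5 of 6 tokens (observed agreement 0.833), while chance agreement is 21/36 (0.583), which gives a kappa of 0.6. Token-level comparison is only one option; agreement on entity spans or on linked identifiers would need a different unit of comparison.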