Impresso
Critical text mining of newspaper archives
Source: This example was kindly contributed by Matteo Romanello, Digital Humanities Laboratory, EPFL, Switzerland
The Impresso project aims to enable critical text mining of newspaper archives with the implementation of a technological framework to extract, process, link, and explore data from print media archives.
The project works with a large corpus of historical newspaper data in several languages (mainly French and German, but also Luxembourgish), covering the period from the end of the 18th century to the present (see state of newspaper collection).
To achieve this aim, several NLP techniques are employed, including:
- Named entity processing (classification and disambiguation/linking)
  - NE recognition and classification: custom tag set, based on the Quaero annotation guidelines
  - NE linking: against Wikidata and an internal KB
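To make the two sub-tasks concrete, the following minimal sketch shows one way a named-entity annotation could be held as stand-off data: character offsets into the text, a tag, and an optional KB identifier. The class `EntityMention`, the helper `mention_for`, the tags `org` and `loc`, and the identifier `Q807` are illustrative assumptions, not the Impresso tag set or data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityMention:
    """Hypothetical stand-off record: offsets point into the text, which stays unchanged."""
    begin: int                   # offset of the first character of the mention
    end: int                     # offset one past the last character
    tag: str                     # entity class (placeholder tag, not the real tag set)
    kb_id: Optional[str] = None  # KB identifier, or None if not linked yet

    def covered_text(self, text: str) -> str:
        """Return the span of `text` that this mention covers."""
        return text[self.begin:self.end]


def mention_for(text: str, surface: str, tag: str,
                kb_id: Optional[str] = None) -> EntityMention:
    """Build a mention for the first occurrence of `surface` in `text`."""
    begin = text.index(surface)  # raises ValueError if the surface form is absent
    return EntityMention(begin, begin + len(surface), tag, kb_id)


text = "Le Conseil fédéral s'est réuni à Lausanne."
mentions = [
    # Recognised and classified, but not linked to a KB entry yet.
    mention_for(text, "Conseil fédéral", "org"),
    # Recognised, classified and linked (illustrative identifier).
    mention_for(text, "Lausanne", "loc", kb_id="Q807"),
]

for m in mentions:
    print(m.begin, m.end, m.tag, m.kb_id, m.covered_text(text))
```

Keeping offsets separate from the text means that classification and linking can be added or revised independently without touching the newspaper text itself.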
Due to the special nature of the text corpus, off-the-shelf models cannot be used for the above tasks. For this reason, INCEpTION will be used as the annotation environment for an extensive annotation campaign (running from August 2018 to June 2019). We will use it for annotating NE-related information in our data, either from scratch or starting from automatically generated pre-annotations.
Features that we consider of particular importance to our use-case are:
- import structured stand-off annotations generated by Python code via Cassis
- linking against a KB: INCEpTION’s functionality needs to scale up to Wikidata-sized resources, with tens of thousands of class instances
- active learning will be used to speed up the annotation process
- monitoring, to follow the annotators’ progress
- inter-annotator agreement
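
As an illustration of the last point, agreement between two annotators is often summarised with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is a plain-Python version of that standard formula. It is not a description of how INCEpTION computes agreement, and the function name `cohen_kappa` and the labels are assumptions made for the example.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")

    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # relative frequencies, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder labels for six items, one label per item and annotator.
annotator_1 = ["pers", "loc", "org", "loc", "pers", "O"]
annotator_2 = ["pers", "loc", "loc", "loc", "pers", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this example the annotators disagree on one item out of six. Kappa assumes that both annotators label the same units, so for NE annotation the spans would first have to be aligned, for instance by comparing labels token by token.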