GUM Corpus

Georgetown University Multilayer Corpus (GUM)

Source: This example was kindly contributed by Amir Zeldes, Department of Linguistics, Georgetown University, Washington, D.C., USA.

The Georgetown University Multilayer Corpus (GUM) is an open-source multilayer corpus of richly annotated web texts from eight text types. The corpus is collected and expanded by students as part of the curriculum in LING-367 Computational Corpus Linguistics at Georgetown University. The text types were selected to represent different communicative purposes while coming from sources that are readily and openly available (mostly under Creative Commons licenses), so that new texts can be annotated and published with ease.

All documents are annotated with a range of annotation layers, most of which are produced or corrected manually. Layers include annotations for:

  • Multiple POS tags, morphological features and lemmatization
  • Sentence segmentation and rough speech act
  • Document structure in TEI XML (paragraphs, headings, figures, etc.)
  • ISO date/time annotations
  • Speaker information (where relevant)
  • Constituent and dependency syntax
  • Information status (given, accessible, new)
  • Entity and coreference annotation, including bridging anaphora
  • Discourse parses in Rhetorical Structure Theory

For the annotation of entities, information status, coreference resolution and bridging anaphora, we use INCEpTION, which allows students to start their annotation projects from automatic NLP tool output and allows instructors to correct student solutions in the same interface.

The entity annotation data is available on GitHub in INCEpTION’s WebAnno TSV format, alongside other formats for the other annotation types. The merged corpus, with all annotation layers searchable concurrently, can be accessed using ANNIS.
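
For readers who want to inspect the WebAnno TSV files directly, the following is a minimal sketch of a reader, not part of GUM or INCEpTION. It assumes the general WebAnno TSV 3 layout: header and text lines start with "#", and each token line holds a sentence-token ID, a character offset span, the token text, and then one column per annotation feature. The function name read_webanno_tsv, the layer name in the header, and the sample sentence with its labels are all hypothetical and chosen only for illustration; the actual column layout of the GUM files should be checked against their headers.

```python
import io


def read_webanno_tsv(lines):
    """Yield one dict per token line of a WebAnno TSV document.

    Hypothetical helper: it only splits columns and does not interpret
    stacked values, disambiguation IDs, or relation columns.
    """
    for raw in lines:
        # Token lines usually end with a trailing tab, so strip it as well.
        line = raw.rstrip("\n").rstrip("\t")
        # Skip blank lines and header lines such as #FORMAT, #T_SP, #Text.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        token_id, span, form = fields[:3]
        begin, end = span.split("-")
        yield {
            "id": token_id,                            # e.g. "1-3"
            "sentence": int(token_id.split("-")[0]),   # sentence number
            "begin": int(begin),                       # character offsets
            "end": int(end),
            "form": form,
            "annotations": fields[3:],                 # raw feature columns
        }


# Invented sample for illustration only; not taken from the GUM data.
SAMPLE = (
    "#FORMAT=WebAnno TSV 3.2\n"
    "#T_SP=webanno.custom.Entity|entity|infstat\n"
    "\n"
    "#Text=Kim visited Paris .\n"
    "1-1\t0-3\tKim\tperson\tnew\t\n"
    "1-2\t4-11\tvisited\t_\t_\t\n"
    "1-3\t12-17\tParis\tplace\tnew\t\n"
    "1-4\t18-19\t.\t_\t_\t\n"
)

tokens = list(read_webanno_tsv(io.StringIO(SAMPLE)))
for tok in tokens:
    print(tok["id"], tok["form"], tok["annotations"])
```

In this sketch an underscore in an annotation column means that the token carries no annotation for that feature. For a real file, pass an open file handle instead of the in-memory sample.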