
HIPE-2022-data

The HIPE-2022 shared task is a CLEF 2022 evaluation lab on named entity recognition and classification (NERC) and entity linking (EL) in multilingual historical documents.

Following the first CLEF-HIPE-2020 evaluation lab on historical newspapers in three languages, HIPE-2022 is based on diverse datasets and aims at confronting systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. The objective is to gain new insights into the transferability of named entity processing approaches across languages, time periods, document types, and annotation tag sets.

Key information
Primary datasets
HIPE-2022 Releases
HIPE-2022 Evaluation
Acknowledgements
References

Key information

Primary datasets

HIPE-2022 datasets are based on six primary datasets composed of historical newspapers and classic commentaries covering ca. 200 years. They feature several languages, different entity tag sets and annotation schemes, and originate from several European cultural heritage projects, from the HIPE organisers' previous research project, and from the previous HIPE-2020 campaign. Some are already published, others are released for the first time for HIPE-2022.

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---------------|--------|---------------|-----------|--------------|---------|---------|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | CC BY 4.0 |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | CC BY-NC-SA 4.0 |
| letemps | link | historical newspapers | fr | NERC-Coarse, NERC-Fine | LeTemps | CC BY-NC-SA 4.0 |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | CC BY-NC-SA 4.0 |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | CC BY 4.0 |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | CC BY 4.0 |

HIPE-2022 releases

A HIPE-2022 release corresponds to a single package composed of neatly structured and homogeneously formatted primary datasets of diverse origins. Primary datasets undergo the following preparation steps:

  • conversion to the HIPE format (with correction of data inconsistencies and metadata consolidation);
  • rearrangement or composition of train and dev splits.

Directory structure, naming conventions and versioning:

The HIPE-2022 data directory is organised by HIPE release version, dataset and language, as follows:

data
└── vx.x
    ├── dataset1
    │   ├── lg1
    │   │   ├── HIPE-2022-vx.x-dataset1-train-lg1.tsv
    │   │   └── HIPE-2022-vx.x-dataset1-dev-lg1.tsv
    │   └── lg2
    │       ├── HIPE-2022-vx.x-dataset1-train-lg2.tsv
    │       └── HIPE-2022-vx.x-dataset1-dev-lg2.tsv
    ├── dataset2
    │   └── lg1
    │       ├── HIPE-2022-vx.x-dataset2-train-lg1.tsv
    │       └── ...
    └── ...

Files and file naming conventions

  • Training and development datasets consist of UTF-8, tab-separated-values files.
  • There is one .tsv file per dataset, language and dataset split.
  • Files contain information needed for all tasks (NERC-Coarse, NERC-Fine, and entity linking).
  • Files are named according to the schema HIPE-2022-<hipeversion>-<dataset-alias>-<split>-<language>.tsv, where split = sample|train|dev|dev2|test. For example, the file HIPE-2022-v1.0-newseye-dev-sv.tsv contains the NE-annotated documents of the Swedish part of the newseye corpus which are meant as development set, in HIPE format and from HIPE-2022 release v1.0 (see the parsing sketch below).
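As an illustration, the naming schema above can be parsed with a small regular expression. The following is a minimal sketch; the helper name parse_hipe_filename is ours and not part of the release:

```python
import re

# Regular expression mirroring the naming schema described above:
# HIPE-2022-<hipeversion>-<dataset-alias>-<split>-<language>.tsv
FILENAME_RE = re.compile(
    r"HIPE-2022-(?P<version>v\d+\.\d+)-(?P<dataset>[a-z0-9]+)"
    r"-(?P<split>sample|train|dev2|dev|test)-(?P<language>[a-z]{2})\.tsv$"
)

def parse_hipe_filename(name: str) -> dict:
    """Return version, dataset alias, split and language encoded in a file name."""
    match = FILENAME_RE.search(name)
    if match is None:
        raise ValueError(f"not a HIPE-2022 file name: {name}")
    return match.groupdict()

# Example from the text above:
print(parse_hipe_filename("HIPE-2022-v1.0-newseye-dev-sv.tsv"))
# {'version': 'v1.0', 'dataset': 'newseye', 'split': 'dev', 'language': 'sv'}
```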

Versioning

  • HIPE-2022 releases are versioned with a two-part version number (Major.Minor) which appears in 1) the data directory structure and 2) the name of each file.
  • Each HIPE-2022 release has an equivalent git repository release, with release notes.
  • The version of a primary dataset is mentioned in its document metadata (see below).

HIPE format and tagging scheme

The HIPE format is a simple tab-separated, column-based text format using an IOB tagging scheme (inside-outside-beginning), in a similar fashion to the CoNLL-U format.

File structure

Files encode annotations needed for all tasks (NERC-Coarse, NERC-Fine and NEL) and contain the following lines:

  • empty lines, which mark the boundaries between documents;
  • comment lines, which give further information and start with the character #;
  • annotated lines, which contain a token followed by tab-separated annotations.

A file contains all the documents of one dataset/language/split. Documents are separated by empty lines and preceded by several metadata comment lines. The notion of document varies from one dataset to another; please refer to the dataset-specific READMEs.
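As a minimal sketch of how such a file could be read, the function below splits it into documents on empty lines; the function name and the choice to keep comment lines together with the annotated lines of the document they precede are ours:

```python
from typing import Iterator, List

def read_documents(path: str) -> Iterator[List[str]]:
    """Yield one document at a time from a HIPE-2022 .tsv file.

    Each yielded document is the list of its lines: metadata comment lines
    (starting with '#') followed by the tab-separated annotated lines.
    """
    document: List[str] = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if not line.strip():        # empty line: document boundary
                if document:
                    yield document
                    document = []
            else:
                document.append(line)   # comment line or annotated line
    if document:                        # last document without trailing empty line
        yield document
```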

Document metadata

Primary datasets provide different document metadata, with different granularity. This information is kept in HIPE-2022 files in the form of "metadata blocks". HIPE-2022 metadata blocks encode as much information as necessary to ensure that each document is self-contained with respect to HIPE-2022 settings.

Metadata blocks use namespacing to distinguish between mandatory HIPE-2022 metadata and dataset-specific (optional) metadata:

# hipe2022:document_id     = [identifier for the document inside a dataset]
# hipe2022:date            = [original document publication date (YYYY-MM-DD, with YYYY-01-01 if month or day are not available)]
# hipe2022:language        = [iso two-letter language code]
# hipe2022:dataset         = [dataset alias as in file name]
# hipe2022:document_type   = [newspaper or commentary]
# hipe2022:original_source = [path to source file in original dataset release] 
# hipe2022:applicable_columns = [all relevant columns for this dataset (TOKEN, NE-COARSE, etc.); non-applicable columns have _ values everywhere]
# DATASET:doi              = [DOI url of primary dataset release (if available)]   
# DATASET:version          = [version of the primary dataset used in the HIPE-2022 release]   
# DATASET: xxx	           = [any other metadata provided with the dataset]
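A minimal sketch of how such a metadata block could be turned into a dictionary; the helper name is ours, and it simply assumes the `# key = value` layout shown above:

```python
from typing import Dict, List

def parse_metadata(lines: List[str]) -> Dict[str, str]:
    """Collect '# namespace:key = value' comment lines into a flat dictionary.

    The namespace prefix (e.g. 'hipe2022:date' or 'DATASET:version') is kept,
    so mandatory and dataset-specific metadata stay distinguishable.
    """
    metadata: Dict[str, str] = {}
    for line in lines:
        if not line.startswith("#"):
            continue
        body = line.lstrip("#").strip()
        key, sep, value = body.partition("=")
        if sep:                         # ignore comment lines without '='
            metadata[key.strip()] = value.strip()
    return metadata

# Example with two of the lines shown above:
block = [
    "# hipe2022:language        = fr",
    "# hipe2022:document_type   = newspaper",
]
print(parse_metadata(block))
# {'hipe2022:language': 'fr', 'hipe2022:document_type': 'newspaper'}
```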

Columns

Each annotated line consists of 10 columns:

  1. TOKEN: the annotated token.
  2. NE-COARSE-LIT: the coarse type (IOB-type) of the entity mention token, according to the literal sense.
  3. NE-COARSE-METO: the coarse type (IOB-type) of the entity mention token, according to the metonymic sense.
  4. NE-FINE-LIT: the fine-grained type (IOB-type.subtype.subtype) of the entity mention token, according to the literal sense.
  5. NE-FINE-METO: the fine-grained type (IOB-type.subtype.subtype) of the entity mention token, according to the metonymic sense.
  6. NE-FINE-COMP: the component type of the entity mention token.
  7. NE-NESTED: the coarse type of the nested entity (if any).
  8. NEL-LIT: the Wikidata Qid of the literal sense, or NIL if an entity cannot be linked. Rows without link annotations have the value _.
  9. NEL-METO: the Wikidata Qid of the metonymic sense, or NIL.
  10. MISC: a flag which can take the following values:
    • NoSpaceAfter, to indicate the absence of white space after the token.
    • EndOfLine, to indicate the end of a layout line.
    • EndOfSentence, to indicate the end of a sentence.
    • Partial-START:END, to indicate the character onset/offset of mentions that do not cover the full token (esp. relevant for German compounds).

Non-specified values are marked by the underscore character (_).

Since they were created according to different annotation schemes, datasets do not systematically include all columns. The applicable columns for a dataset are specified in each document's metadata. When a column does not apply to a specific dataset, all its values are _.
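To illustrate the column layout, the sketch below maps an annotated line onto the ten column names and groups NE-COARSE-LIT IOB tags into entity mentions. The helper names and the simple space-joining of tokens (which ignores the NoSpaceAfter flag) are ours; only the column order comes from the list above:

```python
from typing import Dict, List, Tuple

# Column order as listed above; non-applicable columns carry '_' values.
COLUMNS = [
    "TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT", "NE-FINE-METO",
    "NE-FINE-COMP", "NE-NESTED", "NEL-LIT", "NEL-METO", "MISC",
]

def parse_row(line: str) -> Dict[str, str]:
    """Map one tab-separated annotated line onto the ten column names."""
    return dict(zip(COLUMNS, line.rstrip("\n").split("\t")))

def coarse_literal_mentions(rows: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tags of NE-COARSE-LIT into (type, surface) mentions."""
    mentions, tokens, current = [], [], None
    for row in rows:
        tag = row["NE-COARSE-LIT"]
        if tag.startswith("B-"):
            if tokens:
                mentions.append((current, " ".join(tokens)))
            current, tokens = tag[2:], [row["TOKEN"]]
        elif tag.startswith("I-") and tokens:
            tokens.append(row["TOKEN"])
        else:                                   # 'O' or '_' closes any open mention
            if tokens:
                mentions.append((current, " ".join(tokens)))
            current, tokens = None, []
    if tokens:
        mentions.append((current, " ".join(tokens)))
    return mentions
```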

HIPE-2022 NE annotation types

The HIPE-2022 annotation scheme originates from the CLEF-HIPE-2020 shared task and contains detailed named entity annotation types (reflected in the IOB file columns presented above). Not all HIPE-2022 primary datasets include all annotation types.

Datasets and their annotation types:

NE annotation type ajmc hipe2020 letemps topres19th newseye sonar
NE-COARSE-LIT x x x x x* x
NE-COARSE-METO x x
NE-FINE-LIT x x x x*
NE-FINE-METO x
NE-FINE-COMP x
NE-NESTED x x x x
NEL-LIT x x x x x* x
NEL-METO x

*: For this dataset, this column includes the metonymic sense when present.

Given its wide scope in terms of languages and datasets, HIPE-2022 focuses only on a selection of NE annotation types (in contrast to CLEF-HIPE-2020, which focused on fine-grained NE processing).

Overview of HIPE-2022 tasks and their annotation types:

| HIPE-2022 Tasks | NE annotation types |
|-----------------|---------------------|
| NERC-Coarse | NE-COARSE-LIT |
| NERC-Fine | NE-FINE-LIT, NE-NESTED |
| NEL | NEL-LIT |

The annotation types NE-COARSE-METO, NE-FINE-METO and NE-FINE-COMP are not considered in HIPE-2022 tasks and evaluation scenarios, but they are kept in the IOB files when present in a dataset, so that systems can use this information if beneficial.
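As a small illustration of the task/column mapping above (the dictionary name is ours), a system could restrict itself to the relevant columns like this:

```python
# Columns relevant to each HIPE-2022 task, as listed in the table above.
# NE-COARSE-METO, NE-FINE-METO and NE-FINE-COMP are deliberately left out,
# since they are not evaluated in HIPE-2022.
TASK_COLUMNS = {
    "NERC-Coarse": ["NE-COARSE-LIT"],
    "NERC-Fine": ["NE-FINE-LIT", "NE-NESTED"],
    "NEL": ["NEL-LIT"],
}
```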

Dataset statistics


Available via this Jupyter notebook.

HIPE-2022 Evaluation

To accommodate the different dimensions that characterize the HIPE-2022 Evaluation Lab (tasks, languages, document types, entity tag sets) and foster research on transferability, the evaluation lab is organized around challenges and tracks.

An overview of the evaluation settings is given below; refer to the Participation Guidelines for more information (entity tag sets, evaluation metrics, etc.).

Acknowledgements

The HIPE-2022 organizing team expresses its greatest appreciation to the CLEF 2022 Lab Organising Committee for the overall organization, to the members of the HIPE-2022 advisory board, namely Sally Chambers, Frédéric Kaplan and Clemens Neudecker, for their support, and to the partnering projects, namely AjMC, impresso-HIPE-2020, Living with Machines, NewsEye, and SoNAR, for contributing (and hiding) their NE-annotated datasets.

References

About HIPE-2022

M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, and S. Clematide (2022). Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents. In Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, edited by Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, Vol. 3180. CEUR-WS, 2022. https://doi.org/10.5281/zenodo.6979577.

@inproceedings{ehrmann_extended_2022,
  title = {Extended Overview of {{HIPE-2022}}: {{Named Entity Recognition}} and {{Linking}} in {{Multilingual Historical Documents}}},
  booktitle = {Proceedings of the {{Working Notes}} of {{CLEF}} 2022 - {{Conference}} and {{Labs}} of the {{Evaluation Forum}}},
  author = {Ehrmann, Maud and Romanello, Matteo and {Najem-Meyer}, Sven and Doucet, Antoine and Clematide, Simon},
  editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
  year = {2022},
  volume = {3180},
  publisher = {{CEUR-WS}},
  doi = {10.5281/zenodo.6979577},
  url = {http://ceur-ws.org/Vol-3180/paper-83.pdf}
}
  • LNCS HIPE-2022 Condensed Lab Overview Paper:

M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, and S. Clematide (2022). Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022). Lecture Notes in Computer Science. Springer, Cham (link to accepted version).

@inproceedings{hipe2022_condensed_2022,
  title     = {{Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents}},
  booktitle = {{Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022)}},
  series    = {Lecture Notes in Computer Science (LNCS)},
  publisher = {Springer},
  author    = {Ehrmann, Maud and Romanello, Matteo and Najem-Meyer, Sven and Doucet, Antoine and Clematide, Simon},
  year      = {2022},
  editor    = {Barrón-Cedeño, Alberto and Da San Martino, Giovanni and Degli Esposti, Mirko and Sebastiani, Fabrizio and Macdonald, Craig and Pasi, Gabriella and Hanbury, Allan and Potthast, Martin and Faggioli, Guglielmo and Ferro, Nicola}
}
  • ECIR-2022 Introduction Short Paper:

M. Ehrmann, M. Romanello, A. Doucet, and S. Clematide (2022). Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents. In: Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13186. Springer, Cham (link to postprint).

@inproceedings{ehrmann_introducing_2022,
  title     = {{Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents}},
  booktitle = {Proceedings of the 44\textsuperscript{th} European Conference on {{IR}} Research ({{ECIR}} 2022)},
  author    = {Ehrmann, Maud and Romanello, Matteo and Clematide, Simon and Doucet, Antoine},
  year      = {2022},
  publisher = {{Lecture Notes in Computer Science, Springer}},
  address   = {{Stavanger, Norway}},
  url       = {https://link.springer.com/chapter/10.1007/978-3-030-99739-7_44}
}

Datasets

Previous shared task and survey

@article{nerc_hist_survey,
  title   = {{A Survey of Named Entity Recognition and Classification in Historical Documents}},
  author  = {Ehrmann, Maud and Hamdi, Ahmed and Linhares Pontes, Elvys and Romanello, Matteo and Doucet, Antoine},
  journal = {ACM Computing Surveys},
  year    = {2022 (to appear)},
  url     = {https://arxiv.org/abs/2109.11406}
}

Appendix: Overview of the Mapping of Primary Datasets to HIPE-2022