Named Entity Linking systems for the NewsEye project.

NewsEye/Named-Entity-Linking

Named Entity Linking

A named entity is a real-world object, such as a person, location, or organization. Named entities have been shown to be key to digital library access, as they appear in a majority of the search queries submitted to digital library portals. Notably, they were found in 80% of queries submitted to Gallica, the portal of the national library of France. Collecting data from different sources reveals the problem of duplicate and ambiguous information about named entities: names are often not distinctive, since a single name may correspond to several entities. A disambiguation process is thus essential to distinguish the named entities to be indexed in digital libraries.

Named Entity Linking (NEL) is the task of recognizing and disambiguating named entities by linking them to entries of a Knowledge Base (KB). Knowledge bases (e.g. Wikipedia, DBpedia, YAGO, and Freebase) contain rich information about the world's entities, their semantic classes, and their mutual relationships. NEL is a challenging task because a named entity may have multiple surface forms, such as its full name, partial names, aliases, abbreviations, and alternate spellings. Besides digital libraries, this task is important to several NLP applications, e.g. information extraction, information retrieval (for the adequate retrieval of ambiguous information), content analysis (for the analysis of the general content of a text in terms of its topics, ideas or categorizations), question answering, and knowledge base population.

In a nutshell, NEL aims to locate mentions of a named entity (NE) and to accurately link them to the right entry of a knowledge base, a process that often requires disambiguation. A NEL system typically performs two tasks: named entity recognition (NER) and entity disambiguation (ED). NER extracts the entity mentions in a document, and ED links these mentions to their corresponding entries in a KB. Until recently, the common approach of popular systems was to solve these two sub-problems independently. However, this ignores the significant dependency between the two tasks, and errors made by NER propagate to ED without the possibility of recovery. Therefore, recent approaches propose the joint analysis of these sub-tasks in order to reduce the number of errors.
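The two-stage pipeline described above can be illustrated with a toy sketch. This is not the method of any system used in this repository: the knowledge base `TOY_KB`, the helper names (`recognize`, `disambiguate`, `link`), and the keyword-overlap scoring are all hypothetical stand-ins chosen to keep the example self-contained. It only aims to show how a mention missed by the recognition stage can never be recovered by the disambiguation stage.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: entry id -> (surface forms, context keywords).
TOY_KB = {
    "Paris_(France)": ({"paris"}, {"france", "seine", "capital"}),
    "Paris_(Texas)": ({"paris"}, {"texas", "county", "usa"}),
    "Gallica": ({"gallica"}, {"library", "digital", "france"}),
}


@dataclass
class Mention:
    text: str
    start: int  # index of the token in the document


def recognize(tokens):
    """Stage 1 (stand-in for NER): flag tokens that match a known surface form."""
    surface_forms = {form for forms, _ in TOY_KB.values() for form in forms}
    return [Mention(tok, i) for i, tok in enumerate(tokens)
            if tok.lower() in surface_forms]


def disambiguate(mention, tokens):
    """Stage 2 (stand-in for ED): pick the candidate entry whose keywords
    overlap most with the words of the document."""
    context = {tok.lower() for tok in tokens}
    candidates = [(entry_id, keywords)
                  for entry_id, (forms, keywords) in TOY_KB.items()
                  if mention.text.lower() in forms]
    best_id, _ = max(candidates, key=lambda cand: len(cand[1] & context))
    return best_id


def link(text):
    """Run both stages independently, one after the other."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    return [(m.text, disambiguate(m, tokens)) for m in recognize(tokens)]


print(link("Paris is the capital of France."))
print(link("Paris, Texas is a county seat."))
# An OCR-style error in the mention ("Parls") makes stage 1 miss it,
# so stage 2 never gets a chance to link it.
print(link("Parls is the capital of France."))
```

In this sketch the same surface form "Paris" is linked to different entries depending on context, while the misspelled mention yields no link at all, which is the kind of error propagation that joint approaches try to reduce.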

NEL in digital libraries is especially challenging because most digitized documents are indexed through their OCRed version. OCR introduces numerous errors that stem from the condition of the documents: aging, poor storage conditions, and/or the poor quality of the original printing materials. For instance, Chiron et al. (2017) analyzed a collection of OCRed documents with 12M characters from 9 sources written in 2 languages. This collection is composed of documents dating from 1654 to 2000, with error rates that vary from 1% to 4%.
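To make the notion of an OCR error rate concrete, the sketch below computes a character error rate as the Levenshtein edit distance between a clean reference and its OCR output, divided by the reference length. This is one common definition and is an assumption here: the text above does not state how the cited error rates were measured. The function names and the sample strings are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the clean reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Illustrative strings: two substituted characters out of sixteen.
print(char_error_rate("National Library", "Nati0nal Librarv"))
```

Note that even a low character error rate can damage entity mentions disproportionately, since a single wrong character in a name is enough to change its surface form.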

State-of-the-art systems

For the NewsEye project, we have used the NEL systems proposed by Ganea and Hofmann (https://github.com/dalab/deep-ed) and Le and Titov (https://github.com/lephong/mulrel-nel) as baselines to disambiguate mentions in English documents with OCR problems.

Datasets

To the best of our knowledge, there are no publicly available corpora in the literature that address both named entity linking and post-OCR correction. There are either corpora in which named entities are well recognized and linked but the text is not noisy, or, conversely, corpora in which the text generated by an OCR process is aligned with the original text but named entities are not annotated. Therefore, we started from existing NEL corpora and built noisy versions of them.

The Zenodo record at https://zenodo.org/record/3490333 provides the degraded images, the noisy texts extracted by OCR, and their versions aligned with the clean data for 6 English NEL datasets.

Impact of OCR Quality on Named Entity Linking

More details about how we use these systems and datasets are available at: https://zenodo.org/record/3529180#.XcvDbdEo85k
