A SpaCy MWE identification pipeline component

This component identifies multiword expressions (MWEs) in SpaCy documents and makes the output available at token._.wikt_mwe. The component, its underlying data and training are described in Überrück-Fries et al. (2024).

The component has been evaluated on the Deep-Sequoia corpus and reached an F1-score of 0.776. Further details on the evaluation procedure and performance can also be found in the paper.

Currently MWE identification is supported only for French.

Installation

Clone the repository and install from its root directory via

pip install .

Alternatively, install directly from GitHub via

pip install git+https://github.com/empiriker/mwe-detector.git

Usage

To identify MWEs, you'll need to import and add the pipeline to your SpaCy model:

# Must be imported before nlp.add_pipe("mwe_detector") is called.
import mwe_detector.pipeline
import spacy

nlp = spacy.load("fr_core_news_sm")
nlp.add_pipe("mwe_detector")

doc = nlp("L'identification des expressions polylexicales va bon train.")

print([tok._.wikt_mwe for tok in doc])
# ['*', '*', '*', '*', '*', '1:aller bon train:VERB', '1:aller bon train:VERB', '1:aller bon train:VERB', '*']

The model will return a wikt_mwe label per token. If a token is not part of an MWE, the label is *. If a token is part of an MWE, it will receive a label in the format [Number of MWE in doc, 1-indexed, integer]:[Lemma of MWE]:[POS of MWE]. If a token is part of multiple MWEs, the different labels are separated by |.
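The label format above can be unpacked with a few lines of plain Python. The following is a minimal sketch, not part of the package: the helper name group_mwes is hypothetical, and the token list assumes the tokenization implied by the example output above.

```python
from collections import defaultdict


def group_mwes(tokens, labels):
    """Group token texts by the MWE they belong to.

    `labels` holds one wikt_mwe string per token: "*" for no MWE,
    otherwise "index:lemma:POS", with "|" separating multiple MWEs.
    """
    groups = defaultdict(list)
    for text, label in zip(tokens, labels):
        if label == "*":
            continue
        for part in label.split("|"):
            # Split on the first and last colon so a lemma containing ":" survives.
            index, rest = part.split(":", 1)
            lemma, pos = rest.rsplit(":", 1)
            groups[(int(index), lemma, pos)].append(text)
    return dict(groups)


# Token texts are an assumption; the labels are copied from the example output.
tokens = ["L'", "identification", "des", "expressions", "polylexicales",
          "va", "bon", "train", "."]
labels = ["*"] * 5 + ["1:aller bon train:VERB"] * 3 + ["*"]

print(group_mwes(tokens, labels))
# {(1, 'aller bon train', 'VERB'): ['va', 'bon', 'train']}
```

With a real document, the two lists could be built as [tok.text for tok in doc] and [tok._.wikt_mwe for tok in doc].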

Development

To install the development dependencies, clone the repository and run

pip install ".[dev]"

Train

To train the model on your own training data (specified in config.py), run

python train.py --lang_code fr

Data

This repository contains data that has been used in training and evaluation of the pipeline.

fr_train_wiktionary.cupt contains example sentences extracted from the French Wiktionary. The sentences have been processed with SpaCy and converted to the cupt format. An additional column WIKT:MWE has been added. These labels are not exhaustive, i.e. not all MWEs that could be annotated are annotated.
fr_test_sequoia.cupt contains the original Deep-Sequoia corpus but replaces the original annotations PARSEME:MWE and FRSEMCOR:NOUN with WIKT:MWE as described in Überrück-Fries et al. (2024).
fr_rank.json is a rank dictionary derived from the Lexique383 word list. It serves to store an MWE's constituent lemmas in inverse order of frequency (rarest first), which speeds up the search for MWE candidates.
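To illustrate the idea behind the rank dictionary, here is a minimal sketch of ordering an MWE's lemmas from rarest to most frequent. It assumes a mapping from lemma to integer rank in which 1 is the most frequent word; the actual layout of fr_rank.json is not described here and may differ. The helper name order_by_rarity and the rank values are invented for illustration.

```python
def order_by_rarity(lemmas, rank, unknown_rank=float("inf")):
    """Sort lemmas so that the rarest one (highest rank number) comes first.

    Lemmas missing from `rank` are treated as maximally rare.
    """
    return sorted(lemmas, key=lambda lemma: rank.get(lemma, unknown_rank), reverse=True)


# Toy ranks (1 = most frequent); invented values, not taken from fr_rank.json.
rank = {"aller": 40, "bon": 120, "train": 900}

print(order_by_rarity(["aller", "bon", "train"], rank))
# ['train', 'bon', 'aller']
```

Checking for the rarest lemma first would let a candidate search rule out most sentences quickly, since a rare lemma is absent from most of them.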

Acknowledgements

This work was funded by an internship grant from the Graduate School in Computer Science of the Paris-Saclay University, as well as by the French Agence Nationale de la Recherche, through the SELEXINI project (ANR-21-CE23-0033-01).

License

The file fr_test_sequoia.cupt is licensed under LGPLLR. All other files in this repository are licensed under CC BY-SA 4.0.
