This repository contains the model code used in the paper "WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia" as the cross-document event coreference baseline for WEC-Eng.
The trained model can be downloaded from the Hugging Face hub: https://huggingface.co/Alon/wec
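To fetch the model programmatically, a minimal sketch using the `huggingface_hub` client (the file layout inside the `Alon/wec` repository is not described here, so adjust as needed):

```python
from huggingface_hub import snapshot_download

# Download the entire model repository to the local Hugging Face cache
# and print the local path it was stored under.
local_dir = snapshot_download(repo_id="Alon/wec")
print(local_dir)
```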
WEC-Eng can be downloaded from the Hugging Face hub: https://huggingface.co/datasets/biu-nlp/WEC-Eng
See the dataset card for instructions on how to read and use WEC-Eng.
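The splits can also be loaded directly with the `datasets` library; a minimal sketch (the split names and field descriptions are documented authoritatively in the dataset card):

```python
from datasets import load_dataset

# Load the WEC-Eng training split from the hub; see the dataset card
# for the exact split names and mention fields.
train = load_dataset("biu-nlp/WEC-Eng", split="train")
print(train[0])  # one gold event mention record
```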
- Python 3.6 or above
#>pip install -r requirements.txt
#>export PYTHONPATH=<ROOT_PROJECT_FOLDER>
The main training process requires the mention pairs and embeddings for each split.
The project contains datasets/ecb.zip, already in the input format needed for running the scripts.
Pair generation for the ECB+ test/dev/train sets is straightforward; for example, to generate the pairs just run:
#>python src/preprocess_gen_pairs.py resources/ecb/dev/Event_gold_mentions.json --dataset=ecb --topic=subtopic
#>python src/preprocess_gen_pairs.py resources/ecb/test/Event_gold_mentions.json --dataset=ecb --topic=subtopic
#>python src/preprocess_gen_pairs.py resources/ecb/train/Event_gold_mentions.json --dataset=ecb --topic=subtopic
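At a high level, pair generation enumerates every mention pair within a subtopic and labels it positive when both mentions belong to the same coreference chain. A simplified sketch (not the repository's exact code; the field names `topic_id` and `coref_chain` are assumptions):

```python
import json
from itertools import combinations
from collections import defaultdict

def generate_pairs(mentions_json):
    """Split all within-subtopic mention pairs into positives and negatives."""
    with open(mentions_json) as f:
        mentions = json.load(f)

    # Group mentions by subtopic so only pairs inside a subtopic are compared.
    by_subtopic = defaultdict(list)
    for mention in mentions:
        by_subtopic[mention["topic_id"]].append(mention)

    positives, negatives = [], []
    for subtopic_mentions in by_subtopic.values():
        for m1, m2 in combinations(subtopic_mentions, 2):
            if m1["coref_chain"] == m2["coref_chain"]:
                positives.append((m1, m2))
            else:
                negatives.append((m1, m2))
    return positives, negatives
```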
Since the WEC-Eng train set contains many mentions, generating all negative pairs is very resource- and time-consuming.
To that end, we added a parameter controlling the negative:positive pair ratio.
For the dev and test sets, which are much smaller, pair generation is the same as for ECB+ (all pairs are generated).
#>python src/preprocess_gen_pairs.py resources/wec/dev/Event_gold_mentions_validated.json --dataset=wec --split=dev
#>python src/preprocess_gen_pairs.py resources/wec/test/Event_gold_mentions_validated.json --dataset=wec --split=test
#>python src/preprocess_gen_pairs.py resources/wec/train/Event_gold_mentions.json --dataset=wec --split=train --ratio=10
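The `--ratio` flag caps how many negative pairs are kept relative to the positives; a conceptual sketch of this ratio-controlled sampling (not the repository's exact logic):

```python
import random

def downsample_negatives(positives, negatives, ratio=10, seed=1):
    """Keep all positive pairs and at most ratio * len(positives) negatives."""
    random.seed(seed)
    max_negatives = ratio * len(positives)
    if len(negatives) > max_negatives:
        negatives = random.sample(negatives, max_negatives)
    return positives, negatives
```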
To generate the embeddings for ECB+/WEC-Eng, run the following script and provide the split files' locations, for example:
#>python src/preprocess_embed.py resources/wec/dev/Event_gold_mentions.json resources/wec/test/Event_gold_mentions.json resources/wec/train/Event_gold_mentions.json --cuda=True
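The embedding script caches RoBERTa-large representations of each mention's context in a pickle file; a rough sketch with the `transformers` library (the actual tokenization, pooling, and pickle layout are assumptions and may differ from the script's):

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaModel.from_pretrained("roberta-large").eval()

def embed_context(context_tokens):
    """Return per-token RoBERTa-large embeddings for a mention's context."""
    inputs = tokenizer(" ".join(context_tokens), return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (seq_len, 1024)

# embeddings = {mention_id: embed_context(tokens) for each mention}
# pickle.dump(embeddings, open("Event_gold_mentions_roberta_large.pickle", "wb"))
```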
See the train.py file header for the complete set of script parameters.
The model file will be saved in the output folder (for each iteration that improves).
- For training over ECB+:
#> python src/train.py --tpf=resources/ecb/train/Event_gold_mentions_PosPairs.pickle --tnf=resources/ecb/train/Event_gold_mentions_NegPairs.pickle --dpf=resources/ecb/dev/Event_gold_mentions_PosPairs.pickle --dnf=resources/ecb/dev/Event_gold_mentions_NegPairs.pickle --te=resources/ecb/train/Event_gold_mentions_roberta_large.pickle --de=resources/ecb/dev/Event_gold_mentions_roberta_large.pickle --mf=ecb_pairwise_model --dataset=ecb --cuda=True
- For training over WEC-Eng:
#> python src/train.py --tpf=resources/wec/train/Event_gold_mentions_PosPairs.pickle --tnf=resources/wec/train/Event_gold_mentions_NegPairs.pickle --dpf=resources/wec/dev/Event_gold_mentions_PosPairs.pickle --dnf=resources/wec/dev/Event_gold_mentions_NegPairs.pickle --te=resources/wec/train/Event_gold_mentions_roberta_large.pickle --de=resources/wec/dev/Event_gold_mentions_roberta_large.pickle --mf=wec_pairwise_model --dataset=wec --cuda=True --ratio=10
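At its core, the pairwise model scores whether two mention representations corefer; a conceptual sketch of such a scorer (the feature construction and layer sizes are assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    """Score a pair of mention embeddings with a small MLP."""

    def __init__(self, mention_dim=1024, hidden_dim=1024):
        super().__init__()
        # Pair representation: [m1, m2, m1 * m2]
        self.mlp = nn.Sequential(
            nn.Linear(3 * mention_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, m1, m2):
        pair = torch.cat([m1, m2, m1 * m2], dim=-1)
        return torch.sigmoid(self.mlp(pair)).squeeze(-1)  # coreference probability
```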
See the inference.py file header for the complete set of script parameters.
Example of running the pairwise evaluation:
python src/inference.py --tpf=resources/ecb/test/Event_gold_mentions_PosPairs.pickle --tnf=resources/ecb/test/Event_gold_mentions_NegPairs.pickle --te=resources/ecb/test/Event_gold_mentions_roberta_large.pickle --mf=<checkpoint>/ecb_pairwise_modeliter_6 --cuda=True
Generate the pair predictions (distances) before running the agglomerative clustering script for the final results.
See the generate_pairs_predictions.py file header for the complete set of script parameters.
Running the pairs prediction algorithm:
python src/generate_pairs_predictions.py --tmf=resources/ecb/test/Event_gold_mentions.json --tef=resources/ecb/test/Event_gold_mentions_roberta_large.pickle --mf=<checkpoint>/ecb_pairwise_modeliter_6 --out=<checkpoint>/ecb_predictions --cuda=True
Run agglomerative clustering on the pairwise predictions to get the final cluster configuration.
See the cluster.py file header for the complete set of script parameters.
Running the clustering script:
python src/cluster.py --tmf=resources/ecb/test/Event_gold_mentions.json --predictions=<checkpoint>/ecb_predictions --alt=0.7
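The clustering step amounts to agglomerative clustering over the pairwise distances with a cut-off threshold (0.7 above); a sketch using scikit-learn, which may differ from the repository's own implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_mentions(distances: np.ndarray, threshold: float = 0.7):
    """Cluster mentions from a precomputed pairwise distance matrix.

    Assumes distances[i, j] = 1 - coreference_probability(mention_i, mention_j).
    Older scikit-learn versions use `affinity="precomputed"` instead of `metric`.
    """
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",
        linkage="average",
        distance_threshold=threshold,
    )
    return clustering.fit_predict(distances)  # one cluster label per mention
```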
To score our model, we used the official CoNLL coreference scorer.
Gold scorer files are in the gold_scorer/ecb/ folder.
Usage Example:
#>perl scorer/scorer.pl all gold_scorer/ecb/CD_test_event_mention_dataset.txt <checkpoint>/ecb_pairwise_modeliter_6_0.7 none
Calculate dataset file statistics (mentions, singleton mentions, clusters, etc.):
python helper_scripts/stats_calculation.py resources/ecb/dev/Event_gold_mentions.json
Create an HTML page to visualize the clusters and mentions of a given set:
python helper_scripts/visualize.py resources/ecb/dev/Event_gold_mentions.json --present=cluster
The page will be available at http://localhost:5000