SPEED++

Data for our EMNLP-2024 paper "SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness"

Owing to Twitter's data-sharing policy, we can't publicly share the data. But you can fill out this Google Form, and we can reach out to you with steps to procure the data.

Data Statistics

Here are some high level statistics of our data categorized by language (* indicates length is measured based on number of characters):

Language	# Sentences	Average Length	# Event Mentions	# Arguments
English	2,560	32.5	2,887	8,423
Spanish	1,012	32.4	614	1,485
Japanese	716	30.0	627	2,344
Hindi	819	89.2*	549	1,575

We also present the statistics for the train-dev-test splits of our data below:

Split	Disease	Language	# Sentences	# Event Mentions
Train	COVID	English	1,601	1,746
Dev	COVID	English	374	471
Test	COVID	Spanish	534	365
Test	COVID	Japanese	542	395
Test	COVID	Hindi	416	412
Test	Mpox	English	286	398
Test	Zika/Dengue	English	299	272
Test	Zika/Dengue	Spanish	478	249
Test	Zika/Dengue	Japanese	277	154
Test	Zika/Dengue	Hindi	300	215

Synthetic Data Generation

In our work, we utilize CLaP to create synthetic multilingual data by translating the original SPEED++ training English data. Specifically, we utilize Google Machine Translation (GMT) to translate the English data and utilize CLaP to create EE specific data. For subsequent training, we combine the synthetic data with our original data.

Training/Evaluating Models

For all model training/evaluations, we utilize TextEE. Majorly we export our data as a dataset in this repo, create specific configs, and train/evaluate the models using the TextEE scripts.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

SPEED++

Data for our EMNLP-2024 paper "SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness"

Data Statistics

Synthetic Data Generation

Training/Evaluating Models

Files

README.md

Latest commit

History

README.md

File metadata and controls

SPEED++

Data for our EMNLP-2024 paper "SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness"

Data Statistics

Synthetic Data Generation

Training/Evaluating Models