This is the repository for the course 'Advanced Natural Language Processing' in the 'Digital Sciences' study program at the University of Applied Sciences Cologne.
It contains the project code for the participation in the Clickbait Challenge proposed at SemEval-2023:
- Task 1 Spoiler Classification: RoBERTa model with NER and custom components
- Task 2 Spoiler Generation: RoBERTa SQuAD2.0 model with rule-based approach
- Dataset: Webis Clickbait Spoiling Corpus 2022 (see the loading sketch below)
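For orientation, here is a minimal sketch of loading one split of the corpus, assuming it ships as JSONL files. The file name `train.jsonl` and the field names (`postText`, `tags`, `spoiler`, `targetParagraphs`) follow the published corpus description and should be verified against the actual data:

```python
import json

# Minimal sketch: read one corpus split, one JSON object per line.
# Assumption: field names follow the corpus description and may differ
# from the files you actually download.
with open("train.jsonl", encoding="utf-8") as f:
    posts = [json.loads(line) for line in f]

example = posts[0]
print(example["postText"])               # the clickbait post (teaser text)
print(example["tags"])                   # spoiler type: phrase / passage / multi
print(example["spoiler"])                # human-written spoiler, Task 2 target
print(len(example["targetParagraphs"]))  # paragraphs of the linked article
```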
- `doc\`: Contains the project presentation and project report
- `task1_anlp_deploy\`: Code and Dockerfile of Task 1
- `task2_anlp_deploy\`: Code and Dockerfile of Task 2
| filename | description |
|---|---|
| `EDA.ipynb` | Code for pre-processing the Webis Clickbait Spoiling Corpus 2022 |
| `simple_transformer_task1.ipynb` | Code for training the RoBERTa model for multi-class classification (see the training sketch below) |
| `run_task_1.py` | Script for running the spoiler classification |
| `Reformat_to_SQuAD.ipynb` | Code for reformatting the spoiler questions into the SQuAD2.0 format (see the reformatting sketch below) |
| `Training_model.ipynb` | Code for training the RoBERTa-SQuAD2.0 model for the spoiler generation downstream task |
| `run_task_2.py` | Script for running the spoiler generation. Arguments: `--apply_rule_base v1` / `--apply_rule_base v2` |
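As a rough illustration of what `simple_transformer_task1.ipynb` does, here is a minimal training sketch using the `simpletransformers` library. The hyperparameters, label encoding, and single-row toy DataFrame are placeholders, not the notebook's actual setup:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Toy training data: post text plus an integer-encoded spoiler type.
# Assumption: 0 = phrase, 1 = passage, 2 = multi; the real encoding is in the notebook.
train_df = pd.DataFrame(
    [["You won't believe what happened next...", 1]],
    columns=["text", "labels"],
)

args = ClassificationArgs(num_train_epochs=3, output_dir="saved_models")
# use_cuda=False keeps the sketch runnable without a GPU
model = ClassificationModel("roberta", "roberta-base", num_labels=3,
                            args=args, use_cuda=False)
model.train_model(train_df)

# Predict the spoiler type of a new clickbait post
predictions, raw_outputs = model.predict(["This one trick saves you money"])
```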
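Similarly, the reformatting in `Reformat_to_SQuAD.ipynb` and the fine-tuning in `Training_model.ipynb` might look roughly like the sketch below. The field mapping (post text as question, article paragraphs as context, spoiler as answer) and the base checkpoint `deepset/roberta-base-squad2` are assumptions; the corpus also ships explicit spoiler positions, which the notebook may use instead of the string search here:

```python
from simpletransformers.question_answering import (
    QuestionAnsweringModel, QuestionAnsweringArgs,
)

def to_squad_example(post, qas_id):
    """Sketch: map one corpus record to a SQuAD2.0-style training entry."""
    # Assumption: postText and spoiler are lists of strings in the corpus.
    context = " ".join(post["targetParagraphs"])
    answer = post["spoiler"][0]
    return {
        "context": context,
        "qas": [{
            "id": str(qas_id),
            "question": post["postText"][0],
            "is_impossible": False,
            "answers": [{"text": answer,
                         "answer_start": context.find(answer)}],
        }],
    }

# Toy record standing in for the corpus (see the loading sketch above)
posts = [{
    "postText": ["You won't believe what happened next..."],
    "targetParagraphs": ["The cat simply fell asleep."],
    "spoiler": ["The cat simply fell asleep."],
}]
train_data = [to_squad_example(p, i) for i, p in enumerate(posts)]

args = QuestionAnsweringArgs(num_train_epochs=2, output_dir="saved_models")
# use_cuda=False keeps the sketch runnable without a GPU
model = QuestionAnsweringModel("roberta", "deepset/roberta-base-squad2",
                               args=args, use_cuda=False)
model.train_model(train_data)
```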
The Docker images can be pulled from these Docker Hub repositories:
[Task 1 Dockerhub Repo] | [Task 2 Dockerhub Repo]
Task 1:

```
docker run --rm -d >>>IMAGE_NAME<<< --input >>>INPUT_DATA<<<.jsonl --output output.jsonl --apply_ner=yes
```

Task 2, without rule-based approach:

```
docker run --rm -d >>>IMAGE_NAME<<< --input >>>INPUT_DATA<<<.jsonl --output output.jsonl --apply_rule_base=v1
```

Task 2, with rule-based approach:

```
docker run --rm -d >>>IMAGE_NAME<<< --input >>>INPUT_DATA<<<.jsonl --output output.jsonl --apply_rule_base=v2
```
Due to their size, the language models can be downloaded separately from these sciebo links:
[Task 1 Model] | [Task 2 Model]
Rename the respective root folder to `saved_models` and place it at `task1_anlp_deploy/saved_models` (Task 1) or `task2_anlp_deploy/saved_models` (Task 2) for usage.