Modeling Documentation
This document describes the NLP modeling process for DEEP. The aim of this project is to create a multi-label classification model to assist users of the DEEP platform in tagging documents, as part of the secondary data analysis performed on the platform.
1. Data
- 1.1. General information
- 1.2. Data augmentation
- 1.3. Data used for testing
2. Modelling
- 2.1. General strategies
- 2.2. Modelling using deep pretrained transformers
- 2.3. Modelling for geolocation
1.1. General information

Definitions:

- `tag`: one of the different sections taggers work on. The tags we model are listed below.
- `subtag`: any of the possible items a tagger can select inside a specific tag.
- `positive examples`: entries where taggers chose at least one subtag for a specific tag. One entry can be considered a positive example for one tag but not for another.
- `negative examples`: entries where the tagger did not choose any subtag and that belong to a lead where there is at least one positive example (leads not tagged at all for a specific tag are not considered to yield negative entries). Just like for positive examples, one entry can be considered a negative example for one tag but not for another.
- The training data consists of tagged entries retrieved from the DEEP platform. Overall, we have ...... entries, fetched from ..... different analysis frameworks.
- We work with 8 tags overall: [LINKS FOR DEF OF EACH TAG]
  - 3 primary tags: `sectors`, `subpillars_2d`, `subpillars_1d`
  - 5 secondary tags: `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`, `geolocation`
- Tags are treated independently of one another: a separate model is trained for each tag.
1.2. Data augmentation
- The dataset consists mainly of three languages: English (...%), French (...%), Spanish (...%).
- Using basic data augmentation techniques (random swapping, random synonym replacement) did not yield any improvement in results.
- Performing data augmentation with translation has two advantages:
  - The models learn to perform well on all three languages.
  - There is more data overall for training.
- In the end, each augmented entry takes the `entry_id` of the original sentence. We do this to avoid bias, so that all versions of one entry (original + translations) end up either in the training set or in the test set, never in both.
- The main idea of the data augmentation is to translate each of these three languages into the two others (see the sketch after this list). More specifically, we translate:
  - English entries to French and Spanish,
  - French entries to English and Spanish,
  - Spanish entries to French and English.
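A minimal sketch of this augmentation step, assuming a pandas DataFrame of entries and a hypothetical `translate(text, target_lang)` helper wrapping whichever translation service is used:

```python
import pandas as pd

LANGS = ["en", "fr", "es"]

def translate(text: str, target_lang: str) -> str:
    """Hypothetical helper wrapping the translation service (e.g. Google Translate)."""
    raise NotImplementedError

def augment_with_translations(df: pd.DataFrame) -> pd.DataFrame:
    """Translate each entry into the two other languages.

    Every augmented row keeps the entry_id of its original row so that
    original + translations always end up in the same split later on.
    """
    augmented_rows = []
    for _, row in df.iterrows():
        for target in LANGS:
            if target == row["lang"]:
                continue  # skip the source language
            augmented_rows.append(
                {
                    "entry_id": row["entry_id"],  # keep the original id
                    "lang": target,
                    "excerpt": translate(row["excerpt"], target),
                    "tags": row["tags"],          # labels are unchanged
                }
            )
    return pd.concat([df, pd.DataFrame(augmented_rows)], ignore_index=True)
```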
- For translating, we had two options: using Google Translate or using pretrained translation models (most notably the Helsinki-NLP translation models). We went with the first option as it was free and faster.
1.3. Data used for testing

For a proper assessment of the models, we create the test set so that it follows these criteria (a split sketch follows the list):

- stratified train/test splitting of positive examples: for each tag, the distribution of subtags must be the same across the train and the test set.
- fixed proportion of negative examples for each tag: the proportion of negative examples in the test set has to follow the distribution of negative examples in the whole dataset.
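A minimal sketch of a group-aware split under these constraints, assuming a pandas DataFrame where the tag column holds the list of selected subtags. It uses scikit-learn's `GroupShuffleSplit` so that an original entry and its translations (which share an `entry_id`) never straddle the two sets; the exact subtag stratification logic in the project may differ:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def group_train_test_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Split entries into train/test, keeping all rows sharing an entry_id together."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["entry_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

def split_for_tag(df: pd.DataFrame, tag_col: str, test_size: float = 0.2):
    """Split positives and negatives separately so the negative proportion is preserved."""
    positives = df[df[tag_col].map(len) > 0]   # at least one subtag selected
    negatives = df[df[tag_col].map(len) == 0]  # no subtag selected for this tag
    pos_train, pos_test = group_train_test_split(positives, test_size)
    neg_train, neg_test = group_train_test_split(negatives, test_size)
    train = pd.concat([pos_train, neg_train]).sample(frac=1, random_state=42)
    test = pd.concat([pos_test, neg_test]).sample(frac=1, random_state=42)
    return train, test
```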
2. Modelling

2.1. General strategies

- Overall, we have two different classification strategies depending on the tag:
  - i) (multi-label or single-label) classification using a deep pretrained transformer: `sectors`, `subpillars_2d`, `subpillars_1d`, `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`.
  - ii) NER (Named Entity Recognition) to detect specific words (location names) in the entry: `geolocation`.
2.2. Modelling using deep pretrained transformers

- Before using transformers, we tried to train models using fastText or NER models (for example for `demographic_groups`: detecting keywords and then classifying according to them), but this yielded bad results.
- Transformer selection: the transformer had to fulfill several criteria (a sketch of the single-endpoint wrapper follows this list):
  - multilingual: it needs to work for different languages.
  - good performance: in order to be useful, the model needs to perform well.
  - fast predictions: the main goal of the modelling is to give live predictions to taggers while they are tagging. Speed is critical here: the faster the model, the better.
  - one endpoint only for deployment: in order to optimize costs, we want one endpoint only for all models and predictions. To do this, we create one custom class containing the models and deploy it.
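A minimal sketch of such a wrapper, assuming three already-trained PyTorch models that each map a batch of entries to predictions; the class and method names are illustrative, not the project's actual implementation:

```python
import torch

class CombinedTaggingModel(torch.nn.Module):
    """Single deployable object bundling the per-tag-group models.

    Deploying one wrapper instead of three separate models means a single
    endpoint serves all predictions, which keeps hosting costs down.
    """

    def __init__(self, sectors_model, subpillars_model, secondary_model):
        super().__init__()
        self.models = torch.nn.ModuleDict(
            {
                "sectors": sectors_model,
                "subpillars": subpillars_model,
                "secondary_tags": secondary_model,
            }
        )

    @torch.no_grad()
    def forward(self, entries: list) -> dict:
        # Run every model on the batch and merge the per-tag predictions.
        return {name: model(entries) for name, model in self.models.items()}
```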
- We use the transformer `microsoft/xtremedistil-l6-h256-uncased` as a backbone. We also do multitask learning on some classification tasks using the last hidden states. The general architecture of the endpoint is described below.
- We train three independent models overall: one for `sectors`, one for `subpillars` and one for `secondary tags`. The `sectors` model is trained without multitask learning, on different data (we don't train on entries containing the `Cross` tag because it can mislead the model).
- For the `subpillars` and `secondary tags` models, we use tree-like multitask learning, fine-tuning the last hidden state differently for each task. We have 13 different subtasks for the `subpillars` model (Humanitarian Conditions, At Risk, Displacement, Covid-19, Humanitarian Access, Impact, Information And Communication, Shock/Event, Capacities & Response, Context, Casualties, Priority Interventions, Priority Needs) and 6 for the secondary tags model (severity, gender_kw, age, specific_needs_groups, affected groups non-displaced, affected groups displaced). Finally, each task contains different binary classifier heads, one per label. A sketch of this architecture follows.
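A minimal sketch of this tree-like multitask head structure, assuming Hugging Face `transformers`; the per-task projection size and the mean pooling are assumptions, not the project's verified hyperparameters:

```python
import torch
from transformers import AutoModel

class MultiTaskTagger(torch.nn.Module):
    """Shared backbone, one projection per task, one binary head per label."""

    def __init__(self, tasks: dict, proj_dim: int = 128):
        # `tasks` maps a task name (e.g. "Displacement") to its list of label names.
        super().__init__()
        self.backbone = AutoModel.from_pretrained(
            "microsoft/xtremedistil-l6-h256-uncased"
        )
        hidden = self.backbone.config.hidden_size  # 256 for this checkpoint
        self.task_projections = torch.nn.ModuleDict(
            {task: torch.nn.Linear(hidden, proj_dim) for task in tasks}
        )
        self.label_heads = torch.nn.ModuleDict(
            {
                task: torch.nn.ModuleDict(
                    {label: torch.nn.Linear(proj_dim, 1) for label in labels}
                )
                for task, labels in tasks.items()
            }
        )

    def forward(self, input_ids, attention_mask):
        # Mean-pool the last hidden state as a simple sentence representation.
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state.mean(dim=1)
        logits = {}
        for task, projection in self.task_projections.items():
            task_repr = torch.relu(projection(pooled))
            logits[task] = {
                label: head(task_repr).squeeze(-1)
                for label, head in self.label_heads[task].items()
            }
        return logits  # apply sigmoid + per-label threshold at inference time
```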
- We want to understand the models' performance on predicted tags as well as on non-predicted tags. For this purpose, we use the following metrics (a computation sketch follows the list):

  i) f1 score:
  - `1_f1_score`: f1 score for labels the model classified as True.
  - `0_f1_score`: f1 score for labels the model classified as False.
  - `macro_f1_score`: arithmetic mean of `1_f1_score` and `0_f1_score`.

  ii) precision score:
  - `1_precision`: precision score for labels the model classified as True.
  - `0_precision`: precision score for labels the model classified as False.
  - `macro_precision`: arithmetic mean of `1_precision` and `0_precision`.

  iii) recall score:
  - `1_recall`: recall score for labels the model classified as True.
  - `0_recall`: recall score for labels the model classified as False.
  - `macro_recall`: arithmetic mean of `1_recall` and `0_recall`.

  iv) `hamming_loss`: gives insight into the fraction of wrong labels over the total number of labels.

  v) `zero_one_loss`: the fraction of entries for which the predicted set of labels does not exactly match the ground truth.
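A minimal sketch of how these metrics can be computed with scikit-learn, assuming binary indicator matrices of true and predicted labels:

```python
import numpy as np
from sklearn.metrics import (
    f1_score,
    hamming_loss,
    precision_score,
    recall_score,
    zero_one_loss,
)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the per-class and macro metrics on binary label indicators."""
    flat_true, flat_pred = y_true.ravel(), y_pred.ravel()
    metrics = {}
    for name, fn in [("f1_score", f1_score),
                     ("precision", precision_score),
                     ("recall", recall_score)]:
        one = fn(flat_true, flat_pred, pos_label=1, zero_division=0)
        zero = fn(flat_true, flat_pred, pos_label=0, zero_division=0)
        metrics[f"1_{name}"] = one
        metrics[f"0_{name}"] = zero
        metrics[f"macro_{name}"] = (one + zero) / 2  # arithmetic mean
    # The multi-label losses are computed on the unflattened indicator matrices.
    metrics["hamming_loss"] = hamming_loss(y_true, y_pred)
    metrics["zero_one_loss"] = zero_one_loss(y_true, y_pred)
    return metrics
```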
Threshold tuning

- After training, the threshold is the minimum probability at which a tag is selected. Since the distribution of subtags differs across our data, we implemented a method (sketched below) to select, for each subtag, the threshold that:
  - maximizes the beta f1 score for that subtag,
  - avoids selecting an outlier threshold.
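A minimal sketch of this tuning step, assuming per-subtag probabilities on a validation set; the candidate grid, the beta value, and the clamping bounds are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import fbeta_score

def tune_threshold(y_true: np.ndarray, probs: np.ndarray, beta: float = 0.8,
                   low: float = 0.2, high: float = 0.8) -> float:
    """Pick the threshold maximizing the F-beta score for one subtag.

    The search is clamped to [low, high] so an outlier threshold
    (e.g. 0.02 or 0.99, driven by a handful of examples) is never selected.
    """
    candidates = np.round(np.arange(low, high + 1e-9, 0.01), 2)
    scores = [
        fbeta_score(y_true, probs >= t, beta=beta, zero_division=0)
        for t in candidates
    ]
    return float(candidates[int(np.argmax(scores))])
```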
The models are staged in MLflow and stored on Amazon Web Services (AWS). Two flavors were possible while staging: the `pytorch` flavor or the `pyfunc` flavor. We kept the `pytorch` flavor since it allowed us to deploy the models using a CPU. A minimal logging sketch follows.
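A sketch of logging a model with the `pytorch` flavor, assuming an MLflow tracking server is configured; the run name, artifact path, and placeholder model are illustrative:

```python
import mlflow
import mlflow.pytorch
import torch

# Placeholder standing in for the real combined wrapper model.
combined_model = torch.nn.Linear(1, 1)

with mlflow.start_run(run_name="tagging-models"):
    # The pytorch flavor keeps the model deployable on CPU instances.
    mlflow.pytorch.log_model(combined_model, artifact_path="model")
```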
2.3. Modelling for geolocation

For this task, we perform two steps (sketched below):

i) Detect geolocations using the spaCy pretrained model `xx_ent_wiki_sm`. It has two advantages:
- It is multilingual, so it can detect place names in different languages.
- The model is small, so it yields predictions in an acceptable time.

ii) Post-process the predictions: we only keep the locations present in the DEEP database.
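A minimal sketch of the two steps, assuming the `xx_ent_wiki_sm` model is installed and a `known_locations` set stands in for the DEEP location database:

```python
import spacy

# Multilingual NER model; install with: python -m spacy download xx_ent_wiki_sm
nlp = spacy.load("xx_ent_wiki_sm")

# Stand-in for the locations stored in the DEEP database.
known_locations = {"Caracas", "Goma", "Cox's Bazar"}

def extract_geolocations(text: str) -> list:
    """Detect location entities, then keep only those known to DEEP."""
    doc = nlp(text)
    detected = {ent.text for ent in doc.ents if ent.label_ == "LOC"}
    return sorted(detected & known_locations)

# Example on a French sentence ("Displaced families have arrived in Goma."):
print(extract_geolocations("Des familles déplacées sont arrivées à Goma."))
```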