Modeling Documentation
This document describes the NLP modeling process for DEEP. The aim of this project is to create a multi-label classification model to assist users of the DEEP platform in tagging documents, as part of the secondary data analysis performed on the platform.
1. Data
- 1.1. General information
- 1.2. Data augmentation
- 1.3. Data used for testing
2. Modelling
- 2.1. General strategies
- 2.2. Modelling using deep pretrained transformers
- 2.3. Modelling for geolocation
1.1. General information

Definitions:

- `tag`: one of the different sections taggers work on. The tags we model are listed below.
- `subtag`: any of the possible items a tagger can select inside a specific tag.
- `positive examples`: entries where taggers chose at least one subtag for a specific tag. One entry can be considered a positive example for one tag but not for another.
- `negative examples`: entries where the tagger did not choose any subtag and that belong to a lead where there is at least one positive example (leads not tagged at all for a specific tag are not considered to yield negative entries). Just like for positive examples, one entry can be considered a negative example for one tag but not for another.
- The training data consists of tagged entries retrieved from the DEEP platform. Overall, we have ...... entries, fetched from ..... different analysis frameworks.
- We work with 8 tags overall: [LINKS FOR DEF OF EACH TAG]
  - 3 primary tags: `sectors`, `subpillars_2d`, `subpillars_1d`
  - 5 secondary tags: `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`, `geolocation`
- Tags are treated independently of one another: a separate model is trained for each tag.
1.2. Data augmentation
- The dataset consists mainly of three languages: English (...%), French (...%), Spanish (...%).
- Using basic data augmentation techniques (random swapping, random synonym replacement) did not yield any improvement in results.
- Performing data augmentation with translation has two advantages:
  - The models learn to perform well on all three languages.
  - There is more data overall for training.
- In the end, each augmented entry takes the `entry_id` of the original sentence. We do this to avoid bias, so that all versions of one entry (original + translations) end up either in the training set or in the test set, never in both.
- The main idea of the data augmentation is to translate each of these three languages into the two others (see the sketch after this list). More specifically, we translate:
  - English entries to French and Spanish,
  - French entries to English and Spanish,
  - Spanish entries to French and English.
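A minimal sketch of this augmentation step, assuming a pandas DataFrame of entries and a hypothetical `translate(text, target_lang)` helper wrapping whichever translation service is used:

```python
import pandas as pd

LANGS = ["en", "fr", "es"]

def translate(text: str, target_lang: str) -> str:
    """Hypothetical helper wrapping the translation service (e.g. Google Translate)."""
    raise NotImplementedError

def augment_with_translations(df: pd.DataFrame) -> pd.DataFrame:
    """Translate each entry into the two other languages.

    Every augmented row keeps the entry_id of its original row so that
    original + translations always end up in the same split later on.
    """
    augmented_rows = []
    for _, row in df.iterrows():
        for target in LANGS:
            if target == row["lang"]:
                continue  # skip the source language
            augmented_rows.append(
                {
                    "entry_id": row["entry_id"],  # keep the original id
                    "lang": target,
                    "excerpt": translate(row["excerpt"], target),
                    "tags": row["tags"],          # labels are unchanged
                }
            )
    return pd.concat([df, pd.DataFrame(augmented_rows)], ignore_index=True)
```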
- For translating, we had two options: using Google Translate or using pretrained translation models (most notably the Helsinki-NLP translation models). We went with the first option as it was free and faster.
1.3. Data used for testing

For a proper assessment of the models, we create the test set so that it follows these criteria (a split sketch follows the list):

- stratified train/test splitting of positive examples: for each tag, the distribution of subtags must be the same across the train and the test set.
- fixed proportion of negative examples for each tag: the proportion of negative examples in the test set has to follow the distribution of negative examples in the whole dataset.
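A minimal sketch of a group-aware split under these constraints, assuming a pandas DataFrame where the tag column holds the list of selected subtags. It uses scikit-learn's `GroupShuffleSplit` so that an original entry and its translations (which share an `entry_id`) never straddle the two sets; the exact subtag stratification logic in the project may differ:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def group_train_test_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Split entries into train/test, keeping all rows sharing an entry_id together."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["entry_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

def split_for_tag(df: pd.DataFrame, tag_col: str, test_size: float = 0.2):
    """Split positives and negatives separately so the negative proportion is preserved."""
    positives = df[df[tag_col].map(len) > 0]   # at least one subtag selected
    negatives = df[df[tag_col].map(len) == 0]  # no subtag selected for this tag
    pos_train, pos_test = group_train_test_split(positives, test_size)
    neg_train, neg_test = group_train_test_split(negatives, test_size)
    train = pd.concat([pos_train, neg_train]).sample(frac=1, random_state=42)
    test = pd.concat([pos_test, neg_test]).sample(frac=1, random_state=42)
    return train, test
```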
2. Modelling

2.1. General strategies

- Overall, we have two different classification strategies depending on the tag:
  - i) (multi-label or single-label) classification using a deep pretrained transformer: `sectors`, `subpillars_2d`, `subpillars_1d`, `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`.
  - ii) NER (Named Entity Recognition) to detect specific words (location names) in the entry: `geolocation`.
2.2. Modelling using deep pretrained transformers

- Before using transformers, we tried to train models using fastText or NER models (for example for `demographic_groups`: detecting keywords and then classifying according to them), but this yielded bad results.
- Transformer selection: the transformer had to fulfill several criteria (a sketch of the single-endpoint wrapper follows this list):
  - multilingual: it needs to work for different languages.
  - good performance: in order to be useful, the model needs to perform well.
  - fast predictions: the main goal of the modelling is to give live predictions to taggers while they are tagging. Speed is critical here: the faster the model, the better.
  - one endpoint only for deployment: in order to optimize costs, we want one endpoint only for all models and predictions. To do this, we create one custom class containing the models and deploy it.
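A minimal sketch of such a wrapper, assuming three already-trained PyTorch models that each map a batch of entries to predictions; the class and method names are illustrative, not the project's actual implementation:

```python
import torch

class CombinedTaggingModel(torch.nn.Module):
    """Single deployable object bundling the per-tag-group models.

    Deploying one wrapper instead of three separate models means a single
    endpoint serves all predictions, which keeps hosting costs down.
    """

    def __init__(self, sectors_model, subpillars_model, secondary_model):
        super().__init__()
        self.models = torch.nn.ModuleDict(
            {
                "sectors": sectors_model,
                "subpillars": subpillars_model,
                "secondary_tags": secondary_model,
            }
        )

    @torch.no_grad()
    def forward(self, entries: list) -> dict:
        # Run every model on the batch and merge the per-tag predictions.
        return {name: model(entries) for name, model in self.models.items()}
```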
- We use the transformer `microsoft/xtremedistil-l6-h256-uncased` as a backbone. We also do multitask learning on some classification tasks using the last hidden states. The general architecture of the endpoint is described below.
- We train three independent models overall: one for `sectors`, one for `subpillars` and one for `secondary tags`. The `sectors` model is trained without multitask learning, on different data (we don't train on entries containing the `Cross` tag because it can mislead the model).
- For the `subpillars` and `secondary tags` models, we use tree-like multitask learning, fine-tuning the last hidden state differently for each task. We have 13 different subtasks for the `subpillars` model (Humanitarian Conditions, At Risk, Displacement, Covid-19, Humanitarian Access, Impact, Information And Communication, Shock/Event, Capacities & Response, Context, Casualties, Priority Interventions, Priority Needs) and 6 for the secondary tags model (severity, gender_kw, age, specific_needs_groups, affected groups non-displaced, affected groups displaced). Finally, each task contains different binary classifier heads, one per label. A sketch of this architecture follows.
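A minimal sketch of this tree-like multitask head structure, assuming Hugging Face `transformers`; the per-task projection size and the mean pooling are assumptions, not the project's verified hyperparameters:

```python
import torch
from transformers import AutoModel

class MultiTaskTagger(torch.nn.Module):
    """Shared backbone, one projection per task, one binary head per label."""

    def __init__(self, tasks: dict, proj_dim: int = 128):
        # `tasks` maps a task name (e.g. "Displacement") to its list of label names.
        super().__init__()
        self.backbone = AutoModel.from_pretrained(
            "microsoft/xtremedistil-l6-h256-uncased"
        )
        hidden = self.backbone.config.hidden_size  # 256 for this checkpoint
        self.task_projections = torch.nn.ModuleDict(
            {task: torch.nn.Linear(hidden, proj_dim) for task in tasks}
        )
        self.label_heads = torch.nn.ModuleDict(
            {
                task: torch.nn.ModuleDict(
                    {label: torch.nn.Linear(proj_dim, 1) for label in labels}
                )
                for task, labels in tasks.items()
            }
        )

    def forward(self, input_ids, attention_mask):
        # Mean-pool the last hidden state as a simple sentence representation.
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state.mean(dim=1)
        logits = {}
        for task, projection in self.task_projections.items():
            task_repr = torch.relu(projection(pooled))
            logits[task] = {
                label: head(task_repr).squeeze(-1)
                for label, head in self.label_heads[task].items()
            }
        return logits  # apply sigmoid + per-label threshold at inference time
```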
- We want to understand the models' performance on predicted tags as well as on non-predicted tags. For this purpose, we use the following metrics (a computation sketch follows the list):

  i) f1 score:
  - `1_f1_score`: f1 score for labels the model classified as True.
  - `0_f1_score`: f1 score for labels the model classified as False.
  - `macro_f1_score`: arithmetic mean of `1_f1_score` and `0_f1_score`.

  ii) precision score:
  - `1_precision`: precision score for labels the model classified as True.
  - `0_precision`: precision score for labels the model classified as False.
  - `macro_precision`: arithmetic mean of `1_precision` and `0_precision`.

  iii) recall score:
  - `1_recall`: recall score for labels the model classified as True.
  - `0_recall`: recall score for labels the model classified as False.
  - `macro_recall`: arithmetic mean of `1_recall` and `0_recall`.

  iv) `hamming_loss`: gives insight into the fraction of wrong labels over the total number of labels.

  v) `zero_one_loss`: the fraction of entries for which the predicted set of labels does not exactly match the ground truth.
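A minimal sketch of how these metrics can be computed with scikit-learn, assuming binary indicator matrices of true and predicted labels:

```python
import numpy as np
from sklearn.metrics import (
    f1_score,
    hamming_loss,
    precision_score,
    recall_score,
    zero_one_loss,
)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the per-class and macro metrics on binary label indicators."""
    flat_true, flat_pred = y_true.ravel(), y_pred.ravel()
    metrics = {}
    for name, fn in [("f1_score", f1_score),
                     ("precision", precision_score),
                     ("recall", recall_score)]:
        one = fn(flat_true, flat_pred, pos_label=1, zero_division=0)
        zero = fn(flat_true, flat_pred, pos_label=0, zero_division=0)
        metrics[f"1_{name}"] = one
        metrics[f"0_{name}"] = zero
        metrics[f"macro_{name}"] = (one + zero) / 2  # arithmetic mean
    # The multi-label losses are computed on the unflattened indicator matrices.
    metrics["hamming_loss"] = hamming_loss(y_true, y_pred)
    metrics["zero_one_loss"] = zero_one_loss(y_true, y_pred)
    return metrics
```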
Threshold tuning

- After training, the threshold is the minimum probability at which a tag is selected. Since the distribution of subtags differs across our data, we implemented a method (sketched below) to select, for each subtag, the threshold that:
  - maximizes the beta f1 score for that subtag,
  - avoids selecting an outlier threshold.
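A minimal sketch of this tuning step, assuming per-subtag probabilities on a validation set; the candidate grid, the beta value, and the clamping bounds are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import fbeta_score

def tune_threshold(y_true: np.ndarray, probs: np.ndarray, beta: float = 0.8,
                   low: float = 0.2, high: float = 0.8) -> float:
    """Pick the threshold maximizing the F-beta score for one subtag.

    The search is clamped to [low, high] so an outlier threshold
    (e.g. 0.02 or 0.99, driven by a handful of examples) is never selected.
    """
    candidates = np.round(np.arange(low, high + 1e-9, 0.01), 2)
    scores = [
        fbeta_score(y_true, probs >= t, beta=beta, zero_division=0)
        for t in candidates
    ]
    return float(candidates[int(np.argmax(scores))])
```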
The models are staged in MLflow and stored on Amazon Web Services (AWS). Two flavors were possible while staging: the `pytorch` flavor or the `pyfunc` flavor. We kept the `pytorch` flavor since it allowed us to deploy the models using a CPU. A minimal logging sketch follows.
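A sketch of logging a model with the `pytorch` flavor, assuming an MLflow tracking server is configured; the run name, artifact path, and placeholder model are illustrative:

```python
import mlflow
import mlflow.pytorch
import torch

# Placeholder standing in for the real combined wrapper model.
combined_model = torch.nn.Linear(1, 1)

with mlflow.start_run(run_name="tagging-models"):
    # The pytorch flavor keeps the model deployable on CPU instances.
    mlflow.pytorch.log_model(combined_model, artifact_path="model")
```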
2.3. Modelling for geolocation

For this task, we perform two steps (sketched below):

i) Detect geolocations using the spaCy pretrained model `xx_ent_wiki_sm`. It has two advantages:
- It is multilingual, so it can detect place names in different languages.
- The model is small, so it yields predictions in an acceptable time.

ii) Post-process the predictions: we only keep the locations present in the DEEP database.
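A minimal sketch of the two steps, assuming the `xx_ent_wiki_sm` model is installed and a `known_locations` set stands in for the DEEP location database:

```python
import spacy

# Multilingual NER model; install with: python -m spacy download xx_ent_wiki_sm
nlp = spacy.load("xx_ent_wiki_sm")

# Stand-in for the locations stored in the DEEP database.
known_locations = {"Caracas", "Goma", "Cox's Bazar"}

def extract_geolocations(text: str) -> list:
    """Detect location entities, then keep only those known to DEEP."""
    doc = nlp(text)
    detected = {ent.text for ent in doc.ents if ent.label_ == "LOC"}
    return sorted(detected & known_locations)

# Example on a French sentence ("Displaced families have arrived in Goma."):
print(extract_geolocations("Des familles déplacées sont arrivées à Goma."))
```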