Modeling Documentation
I) Data
- General information
- Data augmentation
- Data used for testing
II) Modelling
- General strategies
- Modelling using deep pretrained transformers
- Modelling for geolocation
- `tag`: one of the different sections taggers work on (the 8 listed above)
- `subtag`: all the possible items a tagger can select inside a specific tag
- `positive examples`: entries where taggers chose at least one subtag for a specific tag. One entry can be considered a positive example for one tag but not for another
- `negative examples`: entries where the tagger did not choose any subtag and that belong to a lead with at least one positive entry for that tag (leads not tagged at all for a specific tag are not considered to contain negative entries). Just like for positive examples, one entry can be considered a negative example for one tag but not for another
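The definitions above can be sketched as a small helper (hypothetical names; this is an illustration, not code from the project):

```python
def example_type(entry_subtags, lead_has_positive):
    """Classify one entry for one tag.

    - "positive": the tagger chose at least one subtag for this tag
    - "negative": no subtag chosen, but the entry's lead contains at
      least one positive entry for this tag
    - "ignored": the lead was not tagged at all for this tag
    """
    if entry_subtags:
        return "positive"
    if lead_has_positive:
        return "negative"
    return "ignored"

# The same entry can be positive for one tag and negative for another.
print(example_type(["Health"], True))  # positive
print(example_type([], True))          # negative
print(example_type([], False))         # ignored
```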
- The training data is collected from tags retrieved from the DEEP platform. Overall, we have ...... entries, fetched from ..... analysis frameworks.
- We work with 8 tags overall:
  - 3 primary tags: `sectors`, `subpillars_2d`, `subpillars_1d`
  - 5 secondary tags: `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`, `geolocation`
- Different tags are treated independently of one another; one model is trained separately for each tag.
- The dataset consists of mainly three languages: English (...%), French (...%), Spanish (...%).
- Using basic data augmentation techniques (random swapping, random synonym replacement) did not improve results.
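For reference, random swapping of the kind we tried can be sketched as follows (a minimal illustration, not the exact implementation used):

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Swap n_swaps random pairs of tokens (a basic augmentation move)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

original = "floods displaced thousands of families".split()
augmented = random_swap(original, n_swaps=1, seed=0)
print(augmented)  # same words, two positions swapped
```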
- Performing data augmentation with translation has two advantages:
  - The models learn to perform well on all three languages.
  - There is more overall data for training.
- In the end, each augmented entry keeps the entry_id of the original sentence. We do this to avoid bias and so that one entry (original + translations) ends up entirely in either the training set or the test set.
- The main idea of the data augmentation is to translate each of these three languages into the two others. More specifically, we translate:
- English entries to French and Spanish.
- French entries to English and Spanish.
- Spanish entries to French and English.
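The augmentation scheme above, including the shared entry_id, can be sketched as follows (`translate` is a placeholder for whichever backend performs the translation; field names are hypothetical):

```python
LANGS = ("en", "fr", "es")

def augment_entry(entry, translate, langs=LANGS):
    """Translate one entry into the two other languages.

    Each translated copy keeps the original entry_id, so an entry and
    its translations always fall on the same side of a train/test
    split. translate(text, src, tgt) stands in for the actual backend
    (Google Sheets formulas or a pretrained translation model).
    """
    copies = []
    for tgt in langs:
        if tgt == entry["lang"]:
            continue
        copies.append({
            "entry_id": entry["entry_id"],  # same id as the original
            "lang": tgt,
            "text": translate(entry["text"], entry["lang"], tgt),
        })
    return copies

# Dummy backend just to show the shape of the output.
fake_translate = lambda text, src, tgt: f"[{tgt}] {text}"
print(augment_entry({"entry_id": 1, "lang": "fr", "text": "inondations"}, fake_translate))
```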
- For translation, we had two options: using Google Sheets, or using pretrained translation models (mainly the Helsinki translation models).
| | pros | cons |
|---|---|---|
| Google Sheets translation | free | data needs to be processed in chunks (otherwise Google Sheets crashes) |
| | predictions generated faster | not automated: data needs to be uploaded to Drive and then saved manually |
| pretrained translation models | can translate all the data with one script | costly: needs a GPU |
| | data gets downloaded and stored automatically | slow compared to the first option |
- We chose the first option (translating with Google Sheets), mainly to save time and money.
For a proper assessment of the models, we create the test set so that it follows these criteria:
- stratified train test splitting of positive examples: For each tag, the distribution of subtags must be the same across the train and the test set.
- fixed proportion of negative examples for each tag: the proportion of negative examples in the test set has to follow the distribution of negative examples in the whole dataset.
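A per-tag split along these lines can be sketched as follows (a simplified version with hypothetical field names: positives are stratified by their exact subtag combination, and negatives are split with the same ratio so their share in the test set matches their share in the whole dataset):

```python
import random
from collections import defaultdict

def split_for_tag(entries, test_ratio=0.2, seed=0):
    """Split entries for one tag into train/test sets.

    Positives (non-empty "subtags") are grouped by subtag combination
    so each combination keeps the same proportion in train and test;
    negatives (empty "subtags") form their own group and are split with
    the same ratio.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for e in entries:
        buckets[tuple(sorted(e["subtags"]))].append(e)
    train, test = [], []
    for group in buckets.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```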
- Overall, we have two different classification strategies depending on the tag:
  - i) (multi-label or single-label) classification using a deep pretrained transformer: `sectors`, `subpillars_2d`, `subpillars_1d`, `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`.
  - ii) Named Entity Recognition to detect specific words (location names) in an entry: `geolocation`.
Before using transformers, we tried to train models using fastText or NER models (for example, for `demographic_groups`: detecting keywords and then classifying according to them yielded bad results).
- Transformer selection process: the transformer had to fulfill several criteria:
- multilingual: it needs to work for different languages
- good performance: the model needs to perform well in order to be useful
- fast predictions: the main goal of the modelling is to give live predictions to taggers while they are working on tagging. Speed is critical in this case and the faster the model the better.
- We chose the transformer `microsoft/xtremedistil-l6-h256-uncased`. Here is the model architecture:
In order to introduce as little bias as possible to users (a user seeing a false prediction (a false positive) and assuming it is right), we want to favor precision over recall, while also keeping recall from being too low. To achieve this, we use the F-beta score as the main metric, with beta=........ .
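For reference, the F-beta score combines precision and recall as F_beta = (1 + beta²)·P·R / (beta²·P + R); values of beta below 1 weight precision more heavily, which matches the preference described above (the actual beta value used in the project is not reproduced here):

```python
def fbeta(precision, recall, beta):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 0.5, a precision-heavy model scores higher than a
# recall-heavy one with mirrored values.
print(fbeta(0.9, 0.5, 0.5))  # ~0.776
print(fbeta(0.5, 0.9, 0.5))  # ~0.549
```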
- After training, the threshold is the minimum probability above which a subtag is selected. Since the distribution of subtags differs in our data, we implemented a method to select, for each subtag, the threshold that:
  - maximizes the F-beta score for that subtag
  - avoids selecting an outlier threshold
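One way to implement this per-subtag search can be sketched as follows (the guard against outlier thresholds is approximated here by restricting the grid to a central range; the project's exact guard may differ):

```python
def best_threshold(probs, labels, beta=0.5, grid=None):
    """Pick the threshold maximizing the F-beta score for one subtag.

    probs: predicted probabilities for this subtag; labels: 0/1 ground
    truth. The grid is limited to 0.10..0.90 as a simple way to avoid
    extreme (outlier) thresholds.
    """
    grid = grid or [i / 100 for i in range(10, 91)]

    def score(t):
        preds = [p >= t for p in probs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return max(grid, key=score)

# A threshold between the positive and negative probability clusters
# separates them perfectly.
print(best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```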
The models are staged in MLflow and stored on Amazon Web Services (AWS). Two flavors were possible for staging: the pytorch flavor or the pyfunc flavor. We kept the pytorch flavor since it allowed us to deploy models on CPU.
For this task, we proceed in two steps:
i) Detect geolocations using the spaCy pretrained model `xx_ent_wiki_sm`. It has two advantages:
- It is multilingual, so it can detect place names in different languages.
- The model is small, so it yields predictions in an acceptable time.
ii) Post-process predictions: we only keep the locations present in the DEEP database.
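The two steps can be sketched as follows. The spaCy call is shown as a comment and the NER output is mocked, so the post-processing (step ii) runs on its own; `DEEP_LOCATIONS` is a hypothetical stand-in for the DEEP location database:

```python
# Step i) with spaCy would look like:
#   import spacy
#   nlp = spacy.load("xx_ent_wiki_sm")
#   spans = [(ent.text, ent.label_) for ent in nlp(text).ents]
# Here we mock that output to illustrate step ii).

DEEP_LOCATIONS = {"Kabul", "Herat"}  # stand-in for the DEEP database

def postprocess(spans, known_locations=DEEP_LOCATIONS):
    """Keep only LOC entities that exist in the DEEP database."""
    return [text for text, label in spans
            if label == "LOC" and text in known_locations]

spans = [("Kabul", "LOC"), ("WHO", "ORG"), ("Atlantis", "LOC")]
print(postprocess(spans))  # ['Kabul']
```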