Modeling Documentation
I) Data
- General information
- Data augmentation
- Data used for testing
II) Modelling
- General strategies
- Modelling using deep pretrained transformers
- Modelling for geolocation
- `tag`: one of the different sections taggers work on (the 8 listed above)
- `subtag`: all the possible items a tagger can select inside a specific tag
- `positive examples`: entries where taggers chose at least one subtag for a specific tag. One entry can be considered a positive example for one tag but not for another
- `negative examples`: entries where the tagger did not choose any subtag and that belong to a lead with at least one positive entry for that tag (leads not tagged at all for a specific tag are not considered to contain negative entries). Just like for positive examples, one entry can be considered a negative example for one tag but not for another
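The definitions above can be sketched as a small helper (hypothetical names; this is an illustration, not code from the project):

```python
def example_type(entry_subtags, lead_has_positive):
    """Classify one entry for one tag.

    - "positive": the tagger chose at least one subtag for this tag
    - "negative": no subtag chosen, but the entry's lead contains at
      least one positive entry for this tag
    - "ignored": the lead was not tagged at all for this tag
    """
    if entry_subtags:
        return "positive"
    if lead_has_positive:
        return "negative"
    return "ignored"

# The same entry can be positive for one tag and negative for another.
print(example_type(["Health"], True))  # positive
print(example_type([], True))          # negative
print(example_type([], False))         # ignored
```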
- The training data is collected from tags retrieved from the DEEP platform. Overall, we have ...... entries, fetched from ..... analysis frameworks.
- We work with 8 tags overall:
  - 3 primary tags: `sectors`, `subpillars_2d`, `subpillars_1d`
  - 5 secondary tags: `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`, `geolocation`
- Different tags are treated independently of one another; one model is trained separately for each tag.
- The dataset consists of mainly three languages: English (...%), French (...%), Spanish (...%).
- Using basic data augmentation techniques (random swapping, random synonym replacement) did not improve results.
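For reference, random swapping of the kind we tried can be sketched as follows (a minimal illustration, not the exact implementation used):

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Swap n_swaps random pairs of tokens (a basic augmentation move)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

original = "floods displaced thousands of families".split()
augmented = random_swap(original, n_swaps=1, seed=0)
print(augmented)  # same words, two positions swapped
```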
- Performing data augmentation with translation has two advantages:
  - The models learn to perform well on all three languages.
  - There is more overall data for training.
- In the end, each augmented entry keeps the entry_id of the original sentence. We do this to avoid bias and so that one entry (original + translations) ends up entirely in either the training set or the test set.
- The main idea of the data augmentation is to translate each of these three languages into the two others. More specifically, we translate:
- English entries to French and Spanish.
- French entries to English and Spanish.
- Spanish entries to French and English.
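The augmentation scheme above, including the shared entry_id, can be sketched as follows (`translate` is a placeholder for whichever backend performs the translation; field names are hypothetical):

```python
LANGS = ("en", "fr", "es")

def augment_entry(entry, translate, langs=LANGS):
    """Translate one entry into the two other languages.

    Each translated copy keeps the original entry_id, so an entry and
    its translations always fall on the same side of a train/test
    split. translate(text, src, tgt) stands in for the actual backend
    (Google Sheets formulas or a pretrained translation model).
    """
    copies = []
    for tgt in langs:
        if tgt == entry["lang"]:
            continue
        copies.append({
            "entry_id": entry["entry_id"],  # same id as the original
            "lang": tgt,
            "text": translate(entry["text"], entry["lang"], tgt),
        })
    return copies

# Dummy backend just to show the shape of the output.
fake_translate = lambda text, src, tgt: f"[{tgt}] {text}"
print(augment_entry({"entry_id": 1, "lang": "fr", "text": "inondations"}, fake_translate))
```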
- For translation, we had two options: using Google Sheets, or using pretrained translation models (mainly the Helsinki translation models).
| | pros | cons |
|---|---|---|
| Google Sheets translation | free | data needs to be processed in chunks (otherwise Google Sheets crashes) |
| | predictions generated faster | not automated: data needs to be uploaded to Drive and then saved manually |
| pretrained translation models | can translate all the data with one script | costly: needs a GPU |
| | data gets downloaded and stored automatically | slow compared to the first option |
- We chose the first option (translating with Google Sheets), mainly to save time and money.
For a proper assessment of the models, we create the test set so that it follows these criteria:
- stratified train test splitting of positive examples: For each tag, the distribution of subtags must be the same across the train and the test set.
- fixed proportion of negative examples for each tag: the proportion of negative examples in the test set has to follow the distribution of negative examples in the whole dataset.
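A per-tag split along these lines can be sketched as follows (a simplified version with hypothetical field names: positives are stratified by their exact subtag combination, and negatives are split with the same ratio so their share in the test set matches their share in the whole dataset):

```python
import random
from collections import defaultdict

def split_for_tag(entries, test_ratio=0.2, seed=0):
    """Split entries for one tag into train/test sets.

    Positives (non-empty "subtags") are grouped by subtag combination
    so each combination keeps the same proportion in train and test;
    negatives (empty "subtags") form their own group and are split with
    the same ratio.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for e in entries:
        buckets[tuple(sorted(e["subtags"]))].append(e)
    train, test = [], []
    for group in buckets.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```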
- Overall, we have two different classification strategies depending on the tag:
  - i) (multi-label or single-label) classification using a deep pretrained transformer: `sectors`, `subpillars_2d`, `subpillars_1d`, `affected_groups`, `demographic_groups`, `specific_needs_groups`, `severity`.
  - ii) Named Entity Recognition to detect specific words (location names) in an entry: `geolocation`.
Before using transformers, we tried to train models using fastText or NER models (for example, for `demographic_groups`: detecting keywords and then classifying according to them yielded bad results).
- Transformer selection process: the transformer had to fulfill several criteria:
- multilingual: it needs to work for different languages
- good performance: the model needs to perform well in order to be useful
- fast predictions: the main goal of the modelling is to give live predictions to taggers while they are working on tagging. Speed is critical in this case and the faster the model the better.
- We chose the transformer `microsoft/xtremedistil-l6-h256-uncased`. Here is the model architecture:
In order to introduce as little bias as possible to users (a user seeing a false prediction (a false positive) and assuming it is right), we want to favor precision over recall, while also keeping recall from being too low. To achieve this, we use the F-beta score as the main metric, with beta=........ .
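For reference, the F-beta score combines precision and recall as F_beta = (1 + beta²)·P·R / (beta²·P + R); values of beta below 1 weight precision more heavily, which matches the preference described above (the actual beta value used in the project is not reproduced here):

```python
def fbeta(precision, recall, beta):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 0.5, a precision-heavy model scores higher than a
# recall-heavy one with mirrored values.
print(fbeta(0.9, 0.5, 0.5))  # ~0.776
print(fbeta(0.5, 0.9, 0.5))  # ~0.549
```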
- After training, the threshold is the minimum probability above which a subtag is selected. Since the distribution of subtags differs in our data, we implemented a method to select, for each subtag, the threshold that:
  - maximizes the F-beta score for that subtag
  - avoids selecting an outlier threshold
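One way to implement this per-subtag search can be sketched as follows (the guard against outlier thresholds is approximated here by restricting the grid to a central range; the project's exact guard may differ):

```python
def best_threshold(probs, labels, beta=0.5, grid=None):
    """Pick the threshold maximizing the F-beta score for one subtag.

    probs: predicted probabilities for this subtag; labels: 0/1 ground
    truth. The grid is limited to 0.10..0.90 as a simple way to avoid
    extreme (outlier) thresholds.
    """
    grid = grid or [i / 100 for i in range(10, 91)]

    def score(t):
        preds = [p >= t for p in probs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return max(grid, key=score)

# A threshold between the positive and negative probability clusters
# separates them perfectly.
print(best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```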
The models are staged in MLflow and stored on Amazon Web Services (AWS). Two flavors were possible for staging: the pytorch flavor or the pyfunc flavor. We kept the pytorch flavor since it allowed us to deploy models on CPU.
For this task, we proceed in two steps:
i) Detect geolocations using the spaCy pretrained model `xx_ent_wiki_sm`. It has two advantages:
- It is multilingual, so it can detect place names in different languages.
- The model is small, so it yields predictions in an acceptable time.
ii) Post-process predictions: we only keep the locations present in the DEEP database.
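The two steps can be sketched as follows. The spaCy call is shown as a comment and the NER output is mocked, so the post-processing (step ii) runs on its own; `DEEP_LOCATIONS` is a hypothetical stand-in for the DEEP location database:

```python
# Step i) with spaCy would look like:
#   import spacy
#   nlp = spacy.load("xx_ent_wiki_sm")
#   spans = [(ent.text, ent.label_) for ent in nlp(text).ents]
# Here we mock that output to illustrate step ii).

DEEP_LOCATIONS = {"Kabul", "Herat"}  # stand-in for the DEEP database

def postprocess(spans, known_locations=DEEP_LOCATIONS):
    """Keep only LOC entities that exist in the DEEP database."""
    return [text for text, label in spans
            if label == "LOC" and text in known_locations]

spans = [("Kabul", "LOC"), ("WHO", "ORG"), ("Atlantis", "LOC")]
print(postprocess(spans))  # ['Kabul']
```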