
Modeling Documentation

Ximena Contla edited this page Dec 6, 2021 · 13 revisions

This document describes the NLP modelling process for DEEP. The aim of this project is to create multi-label classification models to assist DEEP users in tagging documents, as part of the secondary data analysis performed on the platform.

Index

1. Data

2. Modelling

Dictionary [LINK TO DEFS IN THE DEEP]

  • tag: one of the sections taggers work on. The tags we model are listed in section 1.1

  • subtag: all the possible items a tagger can select inside a specific tag

  • positive examples: entries where taggers selected at least one subtag for a specific tag. An entry can be a positive example for one tag but not for another

  • negative examples: entries for which the tagger did not select any subtag and that belong to a lead containing at least one positive entry for that tag (leads not tagged at all for a specific tag are not considered sources of negative entries). As with positive examples, an entry can be a negative example for one tag but not for another
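Under hypothetical entry and lead structures (not the actual DEEP schema), the positive/negative rule above can be sketched as:

```python
# Illustrative sketch of the positive/negative labelling rule.
# The entry and lead structures here are assumptions, not the DEEP schema.

def label_entries_for_tag(leads, tag):
    """Return (positives, negatives) entry ids for one tag.

    `leads` maps lead_id -> list of entries; each entry is a dict
    {"id": ..., "subtags": {tag_name: [subtag, ...]}}.
    """
    positives, negatives = [], []
    for entries in leads.values():
        # A lead contributes negatives only if it has at least one
        # positive entry for this tag.
        has_positive = any(e["subtags"].get(tag) for e in entries)
        for e in entries:
            if e["subtags"].get(tag):
                positives.append(e["id"])
            elif has_positive:
                negatives.append(e["id"])
    return positives, negatives
```

Entries in leads with no positives for the tag are neither positive nor negative; they are simply excluded for that tag.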

1. Data

1.1. General information:

  • The training data consists of tagged entries retrieved from the DEEP platform. Overall, we have ...... entries, fetched from ..... different analysis frameworks.

  • We work with 8 tags overall: [LINKS FOR DEF OF EACH TAG]

    • 3 primary tags: sectors, subpillars_2d, subpillars_1d
    • 5 secondary tags: affected_groups, demographic_groups, specific_needs_groups, severity, geolocation
  • Tags are treated independently of one another: a separate model is trained for each tag.

  • The dataset consists mainly of three languages: English (...%), French (...%), Spanish (...%).

1.2. Data augmentation

i) Data augmentation using basic techniques:

  • Using basic data augmentation techniques (random swapping, random synonym replacement) did not yield any improvement in results.
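For illustration, the two basic techniques can be sketched as follows; the synonym dictionary here is a placeholder assumption (in practice a resource such as WordNet would supply synonyms):

```python
import random

# Minimal sketch of the two basic augmentation techniques mentioned
# above: random swapping and random synonym replacement.

def random_swap(tokens, n_swaps=1, seed=0):
    """Randomly swap pairs of tokens."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def synonym_replace(tokens, synonyms):
    """Replace each token that has a known synonym."""
    return [synonyms.get(t, t) for t in tokens]
```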

ii) Data augmentation by translation:

  • Performing data augmentation with translation has two advantages:

    • The models learn to perform well on the three languages.
    • More data overall for training
  • Each augmented entry keeps the entry_id of the original sentence. We do this to avoid bias and to ensure that an entry and its translations all end up in either the training set or the test set.

  • The main idea of the data augmentation is to translate entries in each of these three languages into the other two. More specifically, we translate:

    • english entries to french and spanish.
    • french entries to english and spanish.
    • spanish entries to french and english.
  • For translating, we had two options: using Google Translate or using pretrained translation models (mainly the Helsinki-NLP translation models). We went with the first option as it was free and faster.
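The fan-out logic above can be sketched as follows; `translate` is a stub standing in for whichever translation service is used:

```python
# Sketch of the translation fan-out: each entry is translated into the
# other two languages, and the copies keep the original entry_id so the
# whole group lands in either the train or the test split.

LANGS = ["en", "fr", "es"]

def translate(text, src, dst):
    # Placeholder: a real implementation would call a translation API.
    return f"[{dst}] {text}"

def augment(entry):
    """Return the original entry plus its two translated copies."""
    out = [entry]
    for dst in LANGS:
        if dst != entry["lang"]:
            out.append({
                "entry_id": entry["entry_id"],  # same id as the original
                "lang": dst,
                "text": translate(entry["text"], entry["lang"], dst),
            })
    return out
```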

1.3. Data used for testing

For proper assessment of the models, we build the test set to satisfy the following criteria:
- stratified train/test splitting of positive examples: for each tag, the distribution of subtags must be the same across the train and test sets.
- fixed proportion of negative examples for each tag: the proportion of negative examples in the test set has to match the proportion of negative examples in the whole dataset.
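A minimal sketch of the per-tag stratified split, grouping entries by their subtag combination (illustrative only; the real split also handles negative-example proportions):

```python
import random

# Sketch of a stratified split for one tag: entries are grouped by
# their subtag combination and each group is split with the same test
# fraction, so subtag distributions match across train and test.

def stratified_split(entries, tag, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    groups = {}
    for e in entries:
        key = tuple(sorted(e["subtags"].get(tag, [])))
        groups.setdefault(key, []).append(e)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```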

2. Modelling

2.1. General strategies

  • Overall, we have two different classification strategies depending on the tag:
    • i) (multi-label or single-label) classification using a deep pretrained transformer: sectors, subpillars_2d, subpillars_1d, affected_groups, demographic_groups, specific_needs_groups, severity.
    • ii) NER (Named Entity Recognition) to detect specific words (location names) in the entry: geolocation.

2.2. Modelling using deep pretrained transformers:

i) Preliminary work:

Before using transformers, we tried to train models using fastText or NER models (for example, for demographic_groups, detecting key words and then classifying according to them yielded poor results).

ii) Models architecture:

  • Transformer choosing process: The transformer had to fulfill some criteria:
    • multilingual: it needs to work for different languages
    • good performance: to be useful, the model needs to perform well
    • fast predictions: the main goal of the modelling is to give live predictions to taggers while they work on tagging. Speed is critical here, so the faster the model the better.
  • We chose the transformer microsoft/xtremedistil-l6-h256-uncased. Here is the model architecture:

iii) Metrics:

  • We want to understand the models' performances for predicted tags as well as not predicted tags. For this purpose, we use the following metrics:
      i) f1 score:
        - 1_f1_score f1 score for labels the model classified as True.
        - 0_f1_score f1 score for labels the model classified as False.
        - macro_f1_score arithmetic mean of the 1_f1_score and 0_f1_score.
      ii) precision score:
        - 1_precision precision score for labels the model classified as True.
        - 0_precision precision score for labels the model classified as False.
        - macro_precision arithmetic mean of the 1_precision and 0_precision.
      iii) recall score:
        - 1_recall recall score for labels the model classified as True.
        - 0_recall recall score for labels the model classified as False.
        - macro_recall arithmetic mean of the 1_recall and 0_recall.
      iv) hamming_loss: the fraction of wrong labels over the total number of labels.
      v) zero_one_loss
  • After training, the threshold is the minimum probability for which a tag is selected. Since the distribution of subtags differs in our data, we implemented a method that selects, for each subtag, the threshold that:
    • maximizes the F-beta score for that subtag
    • avoids selecting an outlier threshold
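The per-subtag threshold search can be sketched as follows, using plain F1 in place of the F-beta score and omitting the outlier guard:

```python
# Minimal sketch of per-subtag threshold selection: pick the cutoff
# that maximizes F1 on held-out probabilities. This uses plain F1 and
# omits the outlier guard mentioned above.

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probas, candidates=None):
    """Return the candidate threshold with the highest F1."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    return max(candidates,
               key=lambda t: f1(y_true, [p >= t for p in probas]))
```

Run independently for each subtag, this yields one threshold per subtag rather than a single global cutoff.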

iv) Models staging and deployment:

The models are staged in MLflow and stored in Amazon Web Services (AWS). Two flavors were possible for staging: the pytorch flavor or the pyfunc flavor. We kept the pytorch flavor since it allowed us to deploy models on CPU.

2.3. Modelling for geolocation:

For this task, we do two steps:
  i) Detect geolocations using the spaCy pretrained model xx_ent_wiki_sm. It has two advantages:
    - It is multilingual, so it can detect place names in different languages.
    - The model is small, so it yields predictions in an acceptable time.
  ii) post-process predictions: We only keep the locations present in the DEEP database.
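A sketch of the two steps; the spaCy call is left as a comment since it requires downloading xx_ent_wiki_sm, and the gazetteer contents are a placeholder:

```python
# Sketch of the two-step geolocation pipeline. Step 1 (detection) is
# shown as a comment; step 2 (post-processing), which keeps only
# locations known to the DEEP database, is runnable as-is.

# import spacy
# nlp = spacy.load("xx_ent_wiki_sm")
# detected = [ent.text for ent in nlp(text).ents if ent.label_ == "LOC"]

def postprocess_locations(detected, deep_gazetteer):
    """Keep only detected locations present in the DEEP database."""
    known = {name.lower() for name in deep_gazetteer}
    return [loc for loc in detected if loc.lower() in known]
```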
