ner-commodities

An experiment to create a Named Entity Recognition (NER) model that identifies commodities in FT content.

It runs on Jupyter Notebook and uses modules from spaCy, an open-source library for advanced natural language processing written in Python and Cython.

Setup

Creating rules

  • Click the create-rules.ipynb file to open the notebook at http://localhost:8888/notebooks/create-rules.ipynb
  • Run each of the cells in order from top to bottom, which will create the commodities_ner_rules directory that is used by subsequent notebooks
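As a rough illustration of what a rules pipeline like this could look like, the sketch below builds a rule-based pipeline with spaCy's EntityRuler. The label name COMMODITY, the short pattern list and the helper name build_rules_pipeline are assumptions made for illustration; the notebook itself may build its rules differently.

```python
import spacy

# Hypothetical subset of the commodities listed in this README.
COMMODITIES = ["aluminium", "crude oil", "iron ore", "wheat"]


def build_rules_pipeline(commodities):
    """Return a blank English pipeline with one EntityRuler pattern per commodity."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    patterns = [
        {
            "label": "COMMODITY",  # assumed label name
            # Token pattern that matches each word case-insensitively.
            "pattern": [{"LOWER": word} for word in name.split()],
        }
        for name in commodities
    ]
    ruler.add_patterns(patterns)
    return nlp


nlp = build_rules_pipeline(COMMODITIES)
doc = nlp("Crude oil and wheat futures rose on Friday.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Calling nlp.to_disk("commodities_ner_rules") on the result would persist the pipeline so that it can be reloaded later with spacy.load.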

Creating entities per article dataset

  • Click the create-entities-per-article-data.ipynb file to open the notebook at http://localhost:8888/notebooks/create-entities-per-article-data.ipynb
  • Run each of the cells in order from top to bottom, which will create the following files in the data directory:
    • entities_per_article_data.json: commodities identified in each article (non-essential for training the NER model)

Creating training datasets

  • Click the create-training-data.ipynb file to open the notebook at http://localhost:8888/notebooks/create-training-data.ipynb
  • Run each of the cells in order from top to bottom, which will create the following files in the data directory:
    • training_data.json (used by spaCy v2)
    • training_data.spacy (used by spaCy v3)
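To show how the two formats relate, here is a minimal sketch that converts annotations in the (text, {"entities": [...]}) style commonly used with spaCy v2 into a DocBin, the binary format spaCy v3 trains from. The example sentences, character offsets, COMMODITY label and the helper name to_docbin are assumptions for illustration, not the notebook's actual data or code.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, {"entities": [(start_char, end_char, label)]}).
TRAIN_EXAMPLES = [
    ("The price of corn fell.", {"entities": [(13, 17, "COMMODITY")]}),
    ("Wheat futures contracts rallied.", {"entities": [(0, 5, "COMMODITY")]}),
]


def to_docbin(examples):
    """Convert character-offset annotations into a spaCy v3 DocBin."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            # char_span returns None when offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin


doc_bin = to_docbin(TRAIN_EXAMPLES)
print(len(doc_bin))
```

Calling doc_bin.to_disk("data/training_data.spacy") would then write the file in the format that the spacy train command reads.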

Creating evaluation datasets

  • Click the create-evaluation-data.ipynb file to open the notebook at http://localhost:8888/notebooks/create-evaluation-data.ipynb
  • Run each of the cells in order from top to bottom, which will create the following files in the data directory:
    • evaluation_data.json (used by spaCy v2)
    • evaluation_data.spacy (used by spaCy v3)

Creating test dataset

  • Click the create-test-data.ipynb file to open the notebook at http://localhost:8888/notebooks/create-test-data.ipynb
  • Run each of the cells in order from top to bottom, which will create the following file in the data directory:
    • test_data.json: body text segments that were not used to train or validate the NER model and can therefore be used to test it

Creating a spaCy NER model

  • Visit spaCy's training config quickstart and apply the settings below before copying the contents to your clipboard:
    • Language: English
    • Components: ner
    • Hardware: CPU
    • Optimize for: efficiency
  • Paste the contents into a root-level file called base_config.cfg and update the [paths] variables to point to the corresponding spaCy format datasets:
    • train = null -> train = "/data/training_data.spacy"
    • dev = null -> dev = "/data/evaluation_data.spacy"
  • Run $ python -m spacy init fill-config base_config.cfg config.cfg to generate, from base_config.cfg, a complete config.cfg file that will be used to train the NER model
  • Run $ python -m spacy train config.cfg --output ./output to train a spaCy NER model on the training data; it will display output that looks like:
✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-06-03 13:10:15,235] [INFO] Set up nlp object from config
[2022-06-03 13:10:15,241] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-03 13:10:15,243] [INFO] Created vocabulary
[2022-06-03 13:10:15,244] [INFO] Finished initializing nlp object
[2022-06-03 13:10:16,276] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     23.67    0.29    0.16    1.33    0.00
  0     200         61.09   1053.11   93.33   98.30   88.83    0.93
  0     400         40.14    103.51   97.58   98.52   96.65    0.98
  0     600         34.08     45.84   99.30   99.08   99.52    0.99
  1     800         26.39     25.22   99.42   99.16   99.68    0.99
  1    1000         28.02     19.98   99.10   98.41   99.80    0.99
  2    1200         19.63     13.10   99.78   99.64   99.92    1.00
  3    1400         19.82      6.97   99.70   99.68   99.72    1.00
  4    1600         16.01      6.17   99.72   99.52   99.92    1.00
  6    1800          7.11      4.08   99.66   99.32  100.00    1.00
  7    2000         38.36     11.99   99.72   99.48   99.96    1.00
  9    2200         66.60     20.97   99.14   98.41   99.88    0.99
 12    2400         70.13     24.79   98.61   97.25  100.00    0.99
 14    2600         29.04      8.53   99.64   99.32   99.96    1.00
 17    2800          0.00      0.00   99.60   99.20  100.00    1.00
✔ Saved pipeline to output directory
output/model-last

It will also create a root-level output directory, which contains model-best and model-last sub-directories:

├── output
│   ├── model-best
│   ├── model-last
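Once training has finished, a saved pipeline such as model-best can be loaded like any other spaCy pipeline. The helper below is a minimal sketch; the function name find_commodities is hypothetical, and the labels it returns depend on how the training data was annotated.

```python
import spacy


def find_commodities(model_dir, text):
    """Load a saved spaCy pipeline and return (text, label) pairs for its entities."""
    nlp = spacy.load(model_dir)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

For example, find_commodities("output/model-best", "Copper prices hit a record high.") would run the best-scoring checkpoint over that sentence, assuming the output directory shown above exists.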

Conducting an informal NER model test

  • Click the test-informal.ipynb file to open the notebook at http://localhost:8888/notebooks/test-informal.ipynb
  • Run each of the cells in order from top to bottom; the final cell tests the specified item from the test data against the NER model

Conducting a formal NER model test

  • Click the test-formal.ipynb file to open the notebook at http://localhost:8888/notebooks/test-formal.ipynb
  • Run each of the cells in order from top to bottom; the final cell tests all items in the test data against the NER model and displays a confusion matrix
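The counts behind a confusion matrix for NER can be sketched in a few lines. The example below is not the notebook's code: it assumes entities are represented as (start, end, label) tuples and that a prediction counts as correct only when it matches a gold span exactly. The function name score_entities and the sample spans are hypothetical.

```python
def score_entities(gold, predicted):
    """Compare gold and predicted entity spans given as (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)   # spans found by both
    false_pos = len(pred_set - gold_set)  # predicted but not in gold
    false_neg = len(gold_set - pred_set)  # in gold but missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "tp": true_pos,
        "fp": false_pos,
        "fn": false_neg,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical spans for one test segment.
gold = [(0, 5, "COMMODITY"), (20, 24, "COMMODITY")]
predicted = [(0, 5, "COMMODITY"), (30, 34, "COMMODITY")]
print(score_entities(gold, predicted))
```

Exact-match scoring is strict: a prediction that overlaps a gold span but has different boundaries counts as both a false positive and a false negative.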

Sample file extracts

The sample files were sourced by searching for FT content whose body text included one of the following 20 commodities:

  • aluminium
  • cattle
  • cobalt
  • cocoa
  • coffee
  • copper
  • corn
  • cotton
  • crude oil
  • gold
  • iron ore
  • lithium
  • natural gas
  • palm oil
  • poultry
  • rice
  • silver
  • sugar
  • wheat
  • zinc

data/ft-articles-training.txt

  • Each line should contain an article, starting with its UUID, followed by triple pipes, followed by the body text split into segments delimited by double pipes (I chose to split segments where line breaks occurred)
  • The file should not end with an empty newline
  • The file includes 2,000 unique articles: 100 articles for each of the 20 commodities (though each article may potentially mention multiple commodities)
  • I chose articles that mentioned the commodities in contexts that would emphasise them as such, e.g. "aluminium traders", "the price of corn", "producers of cotton", "wheat futures contracts". This is a very manual process, as it requires avoiding homonyms, mentions of the commodity in the wrong sort of context, and metaphorical usage (examples below)
  • The UUIDs of articles used for this file can be seen in the FT articles training set wiki

Homonyms:

  • cattle -> Cattle's PLC: a British consumer finance company
  • gold -> Yamana Gold: Canadian gold mine and established producer
  • rice -> Condoleezza Rice: Former United States Secretary of State

Undesired contexts:

  • articles about coffee shop culture
  • rice pudding recipes

Metaphorical usage:

  • "…about as inviting as a bowl of cold rice pudding…"
1e852438-161d-4095-90f8-fccb810b4efe|||Lorem ipsum dolor sit amet…||Ut enim ad minim veniam…||Duis aute irure dolor.
3659322d-b762-437a-b345-22e3bc203e5c|||Sed ut perspiciatis unde…||Nemo enim ipsam voluptatem…||Neque porro quisquam est.
…
aa1e07d2-0a30-41cd-b146-b730ea5467ad|||At vero eos et accusamus…||Et harum quidem…||Nam libero tempore.
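The line format described above (UUID, triple pipes, then segments separated by double pipes) can be parsed with plain string operations. This is a minimal sketch; the function name parse_article_line is hypothetical and the notebooks may read the file differently.

```python
def parse_article_line(line):
    """Split one 'UUID|||segment||segment||...' line into (uuid, [segments])."""
    uuid, separator, body = line.rstrip("\n").partition("|||")
    if not separator or not uuid:
        raise ValueError("expected '<uuid>|||<segment>||<segment>...'")
    # Segments within the body text are separated by double pipes.
    segments = body.split("||")
    return uuid, segments


sample = (
    "1e852438-161d-4095-90f8-fccb810b4efe|||Lorem ipsum dolor sit amet…"
    "||Ut enim ad minim veniam…||Duis aute irure dolor."
)
uuid, segments = parse_article_line(sample)
print(uuid, len(segments))
```

Splitting on the first triple pipe and only then on double pipes keeps the UUID separate from the body even though both delimiters use the same character.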

data/ft-articles-evaluation.txt


data/ft-articles-test.txt

References