A Python program for training linguistic annotation taggers based on a configuration file and a list of datasets. It prepares the resulting trained models for dockerization and adds relevant metadata. It is agnostic to the tagger software, as long as a simple Python wrapper is built around it.
- galahad
- galahad-train-battery [you are here]
- galahad-taggers-dockerized
- galahad-corpus-data
- int-pie
- int-huggingface-tagger [to be released]
```sh
git clone --recurse-submodules https://github.com/INL/galahad-train-battery
```
- Run `unzip-and-clean-corpus.sh`.
- Run `train.py` (see usage below).
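
Concretely, after cloning, setup might look like this (assuming a POSIX shell and that the script is executable):

```sh
cd galahad-train-battery
./unzip-and-clean-corpus.sh   # unzip and clean the corpus data
python3 train.py              # see usage below
```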
Running `python3 train.py` trains all configs found in `configs/`. To train a specific config, supply its directory name relative to `configs/`, e.g. `python3 train.py pie/tdn`, or supply multiple. Once trained, the models will appear in the `galahad-taggers-dockerized/` folder. `docker-build.py` builds all models into images. Optionally, supply one or more specific configurations as arguments: `python3 docker-build.py pie/tdn`.
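
For example (the second configuration name, `pie/bab`, is made up for illustration):

```sh
# Train two specific configurations, then build their docker images.
python3 train.py pie/tdn pie/bab
python3 docker-build.py pie/tdn pie/bab
```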
The first level of folders in `configs/` dictates which tagger is used. A tagger with the same directory name must exist in `taggers/`. The second level of folders corresponds to models: each configuration will train a single model. A config folder contains a `datasets.json` and a `config.json` file. The format of `datasets.json` is:
```json
{
    "datasets": ["my-first-dataset", "my-second-dataset"]
}
```
Dataset names (e.g. `my-first-dataset`) must correspond to a folder in `galahad-corpus-data/training-data` (e.g. `galahad-corpus-data/training-data/my-first-dataset/`). When multiple datasets are specified, they will be merged automatically. The format of `config.json` is up to the specific tagger and can be used to set parameters such as the learning rate.
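
For instance, a `config.json` might look like this (the keys are purely illustrative; each tagger defines its own schema):

```json
{
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32
}
```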
Each folder in `galahad-corpus-data/training-data` contains one dataset, pre-split into train, dev and test. A dataset must have three files that end in `*train.tsv`, `*dev.tsv` and `*test.tsv` respectively. If multiple matching files exist, the first is chosen.

To support the merging of datasets, tsv files cannot have headers. This also means that only datasets with the same column order should be merged, so pay attention to this when selecting datasets to merge.
Tsv files are expected to have the same number of tabs (`\t`) on each line. This is especially important for taggers that read the first line to determine the number of columns.
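
A quick way to check this before training is a one-off script along these lines (a minimal sketch; the path is illustrative):

```python
from collections import Counter

def check_tab_counts(path: str) -> None:
    """Print whether every non-empty line of a tsv file has the same number of tabs."""
    with open(path, encoding="utf-8") as f:
        counts = Counter(line.count("\t") for line in f if line.strip())
    if len(counts) == 1:
        print(f"{path}: OK ({next(iter(counts))} tabs per line)")
    else:
        print(f"{path}: inconsistent tab counts {dict(counts)}")

# Illustrative path; actual file names only need to end in *train.tsv.
check_tab_counts("galahad-corpus-data/training-data/my-first-dataset/train.tsv")
```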
See the galahad-corpus-data repository for more information on the datasets.
Training a configuration creates a folder at `galahad-taggers-dockerized/[tagger-name]/[config-name]` where the trained model is stored. As a base for this folder it uses a docker prefab, stored in `docker-prefabs/[tagger-name]/`. That is, any files in the prefab (such as a Dockerfile) are conveniently copied over.
See https://github.com/INL/galahad-taggers-dockerized/ for more information on how to hook up a trained model to Galahad.
When you run `train.py`, it:

- merges the datasets defined in `datasets.json` (if multiple are defined) and stores them in `galahad-corpus-data/training-data/[dataset1]-[dataset2]/` (a conceptual sketch of this step follows the list).
- creates an output folder at `galahad-taggers-dockerized/[tagger-name]/[config-name]`.
- copies over `config.json` and `datasets.json` to the output folder for the sake of provenance.
- copies over the contents of `docker-prefabs/[tagger-name]` to the output folder (overwriting the json files if desired).
- creates a virtual environment for the tagger, located at `taggers/[tagger-name]/venv/`.
- runs `taggers/[tagger-name]/[tagger-name]/requirements.sh`, which installs all necessary packages in the venv.
- calls `taggers/[tagger-name]/train.py` from the venv with 4 console arguments (see below).
- Lastly, the tagger and its own `train.py` are now expected to produce a model at `galahad-taggers-dockerized/[tagger-name]`.
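
The merge step conceptually amounts to concatenating the corresponding splits of each dataset; a minimal sketch of the idea (not the actual implementation; the helper name and output file names are illustrative):

```python
import glob
import os

TRAINING_DATA = "galahad-corpus-data/training-data"

def merge_datasets(names: list[str]) -> str:
    """Concatenate the train/dev/test splits of several datasets into one merged dataset."""
    merged_dir = os.path.join(TRAINING_DATA, "-".join(names))
    os.makedirs(merged_dir, exist_ok=True)
    for split in ("train", "dev", "test"):
        with open(os.path.join(merged_dir, f"{split}.tsv"), "w", encoding="utf-8") as out:
            for name in names:
                # If multiple files match *train.tsv etc., take the first,
                # mirroring the behaviour described above.
                src = sorted(glob.glob(os.path.join(TRAINING_DATA, name, f"*{split}.tsv")))[0]
                with open(src, encoding="utf-8") as f:
                    data = f.read()
                # Plain concatenation only works because the files have no
                # headers and share the same column order.
                out.write(data if data.endswith("\n") else data + "\n")
    return merged_dir

merge_datasets(["my-first-dataset", "my-second-dataset"])
```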
Some taggers need to install packages with `--no-deps`, but use of `--no-deps` is not yet supported inside a requirements.txt (see pypa/pip#10837). The `requirements.sh` script is a workaround for this. If you don't need it, your `requirements.sh` can be as simple as:

```sh
pip install -r requirements.txt
```
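
If you do need `--no-deps`, your `requirements.sh` could instead run the extra install explicitly (package name and version are made up):

```sh
pip install -r requirements.txt
# Install a package without pulling in its dependencies.
pip install --no-deps my-package==1.0.0
```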
- To add a new tagger, create the folder `taggers/my-tagger`.
- In here, create `train.py`. This file will be called by the `train.py` at the repo root, which provides 4 console arguments: `python3 taggers/my-tagger/train.py [train_set_path] [dev_set_path] [config_path] [docker_path]`. `train_set_path` and `dev_set_path` will each refer to a single tsv file. `config_path` refers to a json file in `configs/my-tagger/some-configuration`. `docker_path` will refer to a folder where your tagger can save its trained model. (A minimal skeleton follows this list.)
- Add the code of your own tagger. Git clone your tagger to `taggers/my-tagger/my-tagger`. Configure your `train.py` so that it trains your tagger, and provide a `requirements.sh` in your cloned repository: `taggers/[tagger-name]/[tagger-name]/requirements.sh`.
- Add all the files that you want to be copied over to `docker_path` at `docker-prefabs/my-tagger`; a `Dockerfile`, for example.
- Add a configuration for your tagger at `configs/my-tagger/some-configuration`. This contains a `datasets.json` and a `config.json`. You can use the latter for tagger-specific configuration.
- Now, run `python3 train.py my-tagger/some-configuration`. The output model should appear in `galahad-taggers-dockerized/my-tagger`. You can now dockerize your tagger. See the galahad-taggers-dockerized repository for further steps with your docker container.
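
For reference, a minimal skeleton of `taggers/my-tagger/train.py` might look like this (the body is a placeholder; wire in your own tagger's training code):

```python
import json
import sys

def main() -> None:
    # The root train.py calls this script with exactly these 4 arguments.
    train_set_path, dev_set_path, config_path, docker_path = sys.argv[1:5]

    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    # Placeholder: replace with your tagger's actual training entry point,
    # making sure the trained model is written into docker_path.
    print(f"training on {train_set_path}, validating on {dev_set_path}")
    print(f"config: {config}")
    print(f"model should be written to {docker_path}")

if __name__ == "__main__":
    main()
```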
galahad-train-battery/

- galahad-corpus-data/
  - training-data/
    - first/
      - *train.tsv
      - *dev.tsv
      - *test.tsv
    - second/
      - *train.tsv
      - *dev.tsv
      - *test.tsv
    - pie-first-second/ # created automatically by merging the datasets
- galahad-taggers-dockerized/
  - pie/
    - first-second/ # folder is created
      - Dockerfile # copied over from docker-prefabs
- taggers/
  - pie/
    - pie/
      - requirements.sh
    - train.py
- docker-prefabs/
  - pie/
    - Dockerfile
- configs/
  - pie/
    - first-second/
      - config.json
      - datasets.json # specifies both datasets
- train.py
- docker-build.py
- We use the Black formatter.
- Python 3.10.12