Getting started with Training

Prerequisites

Complete all of the steps in the setup doc before continuing.

Launching a small-scale baseline run

opt-baselines \
  -n 2 -g 8 \
  -p test_v0 \
  --model-size 125m \
  --azure \
  --checkpoints-dir "INSERT_YOUR_CHECKPOINT_DIR" \
  --no-save-dir  # Remove this if you want to print out full save-dir path

Bring up TensorBoard for the run

tensorboard serve --logdir="INSERT_TENSORBOARD_LOGDIR"  --bind_all --port=6018

Start Aim for in-depth comparison of training runs

aim up --repo /path/to/log/dir --port 43800

Enable Aim to log metaseq training runs for snappy visualization and in-depth comparison.

To do so, pass the extra flag --aim-repo /path/to/log/dir to the train command.

Example train command with Aim enabled:

metaseq-train --task streaming_language_modeling \
  data-bin/pile-00 \
  --vocab-filename data-bin/pile-00/bpe-vocab.json \
  --merges-filename data-bin/pile-00/bpe-merges.txt \
  --criterion cross_entropy \
  --batch-size 8 \
  --save-dir /checkpoints/lm_transformer_pile-00 \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 1024 --sample-break-mode none --fp16 \
  --aim-repo /path/to/log/dir
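Before launching, it can help to confirm that the input files named in the command exist. The following is a minimal sketch, assuming a hypothetical helper called require_files; it is not part of metaseq, and the paths in the usage comment are simply the ones from the example command above.

```shell
# Hypothetical helper (not part of metaseq): report every missing input file
# and return non-zero if at least one is absent.
require_files() {
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the paths from the example train command:
#   require_files data-bin/pile-00/bpe-vocab.json data-bin/pile-00/bpe-merges.txt \
#     && echo "inputs found, safe to launch metaseq-train"
```

Checking all files in one pass, rather than stopping at the first failure, means a single run lists everything that needs fixing.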

Once training has started, open the Aim UI to see the results:

aim up --repo /path/to/log/dir --port 43800

The following message is printed once the Aim UI is up and running:

Running Aim UI on repo `<Repo#-5930451821203570655 path=/.aim read_only=None>`
Open http://127.0.0.1:43800
Press Ctrl+C to exit

Open your browser and navigate to http://127.0.0.1:43800/runs to explore the tracked runs. Depending on your setup, the host and port may be different.
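Since the host and port can vary, the URL to open can be derived from them. The sketch below assumes a hypothetical helper named aim_runs_url (not provided by Aim or metaseq); its defaults mirror the 127.0.0.1:43800 example above, and the second call uses made-up values purely for illustration.

```shell
# Hypothetical helper: print the Aim UI "runs" URL for a given host and port.
# Defaults mirror the example above (127.0.0.1 and 43800).
aim_runs_url() {
  host="${1:-127.0.0.1}"
  port="${2:-43800}"
  printf 'http://%s:%s/runs\n' "$host" "$port"
}

aim_runs_url                  # prints http://127.0.0.1:43800/runs
aim_runs_url 10.0.0.5 8080    # prints http://10.0.0.5:8080/runs
```

If curl is available, running `curl -sSf "$(aim_runs_url)" > /dev/null` is one way to check from the command line that the UI is reachable before opening a browser.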

Navigate to "Metrics Explorer", select the metrics to explore from the "+ Metrics" dropdown at the top left, and click "Search".

[Screenshot: Metrics Explorer]

  • metaseq Aim arguments:

| Argument | Description |
| --- | --- |
| aim_repo | Path where collected training logs are stored. If set to ".", logs are stored in the current working directory. |
| aim_run_hash | Training run hash. If omitted, a run is created or continued based on "save_dir". Otherwise, training metadata is stored in the specified run. |

  • aim up arguments:

| Argument | Description |
| --- | --- |
| --repo <repo_path> | Path to the stored logs (the parent directory of the .aim repo). Defaults to the current working directory. |
| -h, --host <host> | Host address to run the UI on. |
| -p, --port <port> | Port to listen on. |
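The aim up options listed above can be combined in a single invocation. The sketch below assumes a hypothetical helper named aim_up_cmd that only assembles and prints the command line, so it can be inspected before being run; the 0.0.0.0 host is an illustrative value, not a recommendation from the Aim documentation.

```shell
# Hypothetical helper: assemble an `aim up` command line from the options
# listed above. It only prints the command; it does not start Aim.
aim_up_cmd() {
  repo="$1"        # value for --repo (required)
  host="${2:-}"    # optional value for --host
  port="${3:-}"    # optional value for --port
  cmd="aim up --repo $repo"
  if [ -n "$host" ]; then cmd="$cmd --host $host"; fi
  if [ -n "$port" ]; then cmd="$cmd --port $port"; fi
  printf '%s\n' "$cmd"
}

aim_up_cmd /path/to/log/dir 0.0.0.0 43800
# prints: aim up --repo /path/to/log/dir --host 0.0.0.0 --port 43800
```

Omitting the host or port leaves the corresponding flag out, so Aim falls back to its own defaults.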

See the full Aim documentation at aimstack.readthedocs.io.