Add profiler support in llm foundry #678

j316chuck · 2023-10-17T01:16:32Z

Description

Add profiler support for llm foundry

Along for the ride:

Adding a YAML to support training MPT models in CPU mode. This is useful because you don't have to spin up an interactive session, wait for the interactive session to download, set it up, and get/download your data, and it lets you test CPU-only features quickly on a small model. The only con is no GPU 😛

Tests

composer train/train.py \
  train/yamls/pretrain/mpt-small-cpu.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m

Produces the chrome traces: composer_traces/ep0-ba6-rank0.json. Example:

 {"ph": "X", "cat": "python_function", "name": "queue.py(213): _put", "pid": 0, "tid": 1981795, "ts": 1697505580241206, "dur": 0, "args": {"Ev Idx": 2178228, "Python id": 1145107, "Python parent id": 1145104}},
{"ph": "X", "cat": "python_function", "name": "<built-in method append of collections.deque object at 0x2c5582890>", "pid": 0, "tid": 1981795, "ts": 1697505580241206, "dur": 0, "args": {"Ev Idx": 2178229, "Python id": 1145108, "Python parent id": 1145107}},
{"ph": "X", "cat": "python_function", "name": "threading.py(359): notify", "pid": 0, "tid": 1981795, "ts": 1697505580241207, "dur": 0, "args": {"Ev Idx": 2178230, "Python id": 1145109, "Python parent id": 1145104}},
{"ph": "X", "cat": "python_function", "name": "threading.py(279): _is_owned", "pid": 0, "tid": 1981795, "ts": 1697505580241207, "dur": 0, "args": {"Ev Idx": 2178231, "Python id": 1145110, "Python parent id": 1145109}},

Produces the pytorch traces: torch_traces/rank0.6.pt.trace.json. Example:

  {
    "ph": "X", "cat": "python_function", "name": "/Users/chuck.tang/composer/composer/utils/string_enum.py(69): __eq__", "pid": 54200, "tid": 1981795,
    "ts": 1697505581229305, "dur": 0,
    "args": {
      "Ev Idx": 2233416, "Python id": 1200295, "Python parent id": 1200286
    }
  },
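When several trace files land in the same folder, it can help to recover the rank and batch from each file name. The following is a minimal sketch based only on the two example names above (`ep0-ba6-rank0.json` and `rank0.6.pt.trace.json`); the helper `parse_trace_name` is hypothetical, and reading the second number in the PyTorch name as the batch index is an assumption.

```python
import re

# Patterns inferred from the two example file names in this PR description.
COMPOSER_TRACE = re.compile(r"ep(?P<epoch>\d+)-ba(?P<batch>\d+)-rank(?P<rank>\d+)\.json$")
# Assumption: in "rank0.6.pt.trace.json" the second number is the batch index.
TORCH_TRACE = re.compile(r"rank(?P<rank>\d+)\.(?P<batch>\d+)\.pt\.trace\.json$")


def parse_trace_name(path):
    """Return a dict with the trace kind and its numeric fields, or None if unrecognized."""
    for kind, pattern in (("composer", COMPOSER_TRACE), ("torch", TORCH_TRACE)):
        match = pattern.search(path)
        if match:
            info = {key: int(value) for key, value in match.groupdict().items()}
            info["kind"] = kind
            return info
    return None


if __name__ == "__main__":
    for name in ("composer_traces/ep0-ba6-rank0.json", "torch_traces/rank0.6.pt.trace.json"):
        print(name, parse_trace_name(name))
```

Sorting the parsed results by `rank` and `batch` gives a stable order for opening the traces one at a time.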

Useful for profiling memory and time usage
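For a quick look at where time goes without opening a trace viewer, the events can be aggregated by name. This is a rough sketch, not part of the PR: it assumes a trace file is valid JSON in the Chrome Trace Event Format, either a bare array of events or an object with a `traceEvents` key, and it sums the `dur` field (microseconds) of complete (`"ph": "X"`) events. The function names and the sample durations are made up for illustration.

```python
import json
from collections import defaultdict


def load_trace_events(text):
    """Return the list of event dicts from Chrome-trace JSON text."""
    data = json.loads(text)
    # Accept both layouts: a bare array, or an object wrapping "traceEvents".
    if isinstance(data, dict):
        data = data.get("traceEvents", [])
    return [event for event in data if isinstance(event, dict)]


def total_duration_by_name(events):
    """Sum `dur` (microseconds) over complete ("X") events, keyed by event name."""
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return dict(totals)


def top_n(totals, n=10):
    """Return the n (name, total_dur) pairs with the largest totals."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical sample: names borrowed from the excerpt above, durations invented.
SAMPLE = """[
  {"ph": "X", "cat": "python_function", "name": "threading.py(359): notify", "pid": 0, "tid": 1, "ts": 100, "dur": 5},
  {"ph": "X", "cat": "python_function", "name": "queue.py(213): _put", "pid": 0, "tid": 1, "ts": 90, "dur": 20},
  {"ph": "X", "cat": "python_function", "name": "threading.py(359): notify", "pid": 0, "tid": 1, "ts": 120, "dur": 7}
]"""

if __name__ == "__main__":
    for name, dur in top_n(total_duration_by_name(load_trace_events(SAMPLE))):
        print(f"{dur:>8} us  {name}")
```

To run it on a real trace, read the file contents and pass them to `load_trace_events`, for example `load_trace_events(open("composer_traces/ep0-ba6-rank0.json").read())`.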

Screen.Recording.2023-10-16.at.8.59.10.PM.mov

Perfetto View:

S3:

loggers:
  s3: {bucket_uri: s3://mosaicml-internal-checkpoints-shared/ }

aws s3 cp --recursive s3://mosaicml-internal-checkpoints-shared/chuck/mpt_causal_lm_cpu/traces/

Full training run:
mpt-7b-gpu-8-chinchilla-light-profile-ynjNZ2, mpt-7b-gpu-8-chinchilla-full-profile-uJSCOF, mpt-7b-gpu-8-chinchilla-none-profile-Cwm3GA

README.md

scripts/train/yamls/pretrain/mpt-125m-cpu.yaml

scripts/train/train.py

llmfoundry/utils/builders.py

add profiler flags

956ecb9

j316chuck requested a review from dakinggg October 17, 2023 01:16

j316chuck added 3 commits October 16, 2023 18:33

add everything

9bb077d

ok

612cd3b

ok

0cb3a4d

j316chuck requested a review from mvpatel2000 October 17, 2023 01:36

j316chuck added 6 commits October 16, 2023 20:05

format

d1c2517

fix type

50d5a85

add train

8d7b143

ok

0a9a543

add s3

f0cc81a

clean up yaml

5d16b58

mvpatel2000 approved these changes Oct 17, 2023

dakinggg reviewed Oct 17, 2023

README.md (outdated, resolved)

dakinggg reviewed Oct 17, 2023

scripts/train/yamls/pretrain/mpt-125m-cpu.yaml (outdated, resolved)

dakinggg reviewed Oct 17, 2023

scripts/train/train.py (resolved)

j316chuck added 2 commits October 17, 2023 12:00

clean up readme

2942e63

commit change

7a6dc9d

j316chuck enabled auto-merge (squash) October 17, 2023 19:24

Merge branch 'main' into chuck/add_profiler_flags

967b4d2

dakinggg reviewed Oct 17, 2023

llmfoundry/utils/builders.py (outdated, resolved)

j316chuck and others added 4 commits October 17, 2023 12:53

commit change

c4ae7ea

commit change

34be7da

fix

e8ef6c1

Merge branch 'main' into chuck/add_profiler_flags

bbfb258

j316chuck merged commit 92bd673 into main Oct 18, 2023
12 checks passed

dakinggg deleted the chuck/add_profiler_flags branch November 17, 2023 06:08
