Commit
Output eval logging (batch level) (#2977)
* prelim commit
* fix max answer lengths for cot
* add output logger
* create eval output logger
* fix pyright; git push
* change dist reduce fx
* change dist reduce fx
* fix pyright
* Add nightly docker image (#2452)

Add pytorch nightly and CUDA 12.1 support for composer docker images

What issue(s) does this change relate to?
Related to https://mosaicml.atlassian.net/browse/GRT-2305

Tests
docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0)

mcli connect temp-test-ZAVxMh
Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version)
<module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'>
>>> print(torch.__version__)
2.1.0.dev20230623+cu121
>>> print(torch.version.cuda)
12.1

Integration Test
@mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by +7-8% from 0.25 MFU to 0.27 MFU

* Fix local eval (#2465)
* fix autoresume with slashed directory
* Revert "fix autoresume with slashed directory"

This reverts commit 3dfb5f5. revert

* fix
* fix precommit
* Update in_context_learning_evaluation.py
* Update in_context_learning_evaluation.py
* Update in_context_learning_evaluation.py
* add tests
* Add torch 2.1.0 args for github release-docker workflow
* Log system metrics on each event (#2412)

Signed-off-by: Prithvi Kannan <[email protected]>
Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: eracah <[email protected]>

* Fix torch 2.1.0 docker tag (#2472)
* Upstream Generate Callback (#2449)

Upstreams and generalizes the callback that logs generations to wandb from foundry to composer.

* Upgrade torch nightly docker image for 0.18.3 NCCL version (#2476)

Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23.

* Test pytorch 2.1.0 docker images on ci/cd (#2469)

Test pytorch 2.1.0 docker images on ci/cd #2469

* Fix huggingface tokenizer loading for slow tokenizers (#2483)
* Deprecate Fused LayerNorm (#2475)

Will be removed in v0.18.

* Transformers upgrade (#2489)
* Update RTD build config with build.os (#2490)
* Update RTD build config with build.os
* Remove python.version

---------

Co-authored-by: Bandish Shah <[email protected]>

* Upgrade torch docker version and github workflow tests (#2488)
* upgrade node version (#2492)

# What does this PR do?
Security vulnerability in `semver` seen due to node. This PR upgrades the node version to bump up semver from 7.5.1 to 7.5.2

# Tests
Action Run: https://github.com/mosaicml/composer/actions/runs/6017539089
Correct version of semver seen after upgrade:
```
#14 [pytorch_stage 7/24] RUN npm list -g semver --depth=1
#14 2.223 /usr/lib
#14 2.223 `-- [email protected]
#14 2.223 `-- [email protected]
#14 2.223
#14 DONE 2.4s
```

* Gating tying modules w/ FSDP for torch 2.0 (#2467)
* Gating tying modules w/ FSDP
* Changing weight tying filtering to be less aggressive
* precommit formatting
* Removing min_params (#2494)
* Removing min_params
* formatting?
* removing overlap with another commit
* Fix torchmetrics backwards compatibility issue (#2468)
* add fix
* fix tests
* qwf
* dsfg
* add key
* remove short
* add map test
* remove comment
* filter warning
* simplify wrapping
* checkdown
* fix torchmetrics
* 300
* fix tests
* remove metric
* cleanup
* bug fixes
* fix lint
* fix lint
* fix test
* lint
* remove cuda
* fix tests
* fix ignore
* fix loading
* fix test
* save ckpt

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Your Name <[email protected]>

* Adding some fixes to FSDP tests (#2495)
* Adding some fixes to FSDP tests
* Add filter warnings
* fail count (#2496)
* Remove PR curve metrics from backward compatibility test and skip torch 1.13 (#2497)
* filter warning (#2500)
* bump version (#2498)
* Skip metrics in state dict (#2501)
* skip metrics in state dict
* fix unit tests
* Add peak memory stats (#2504)
* add peak memory stats
* fix tests
* fix sharded ckpt (#2505)
* Bump gitpython from 3.1.31 to 3.1.34 (#2509)

Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.34.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](gitpython-developers/GitPython@3.1.31...3.1.34)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Annotate `torch_prof_remote_file_name` as Optional (#2512)

The `torch_prof_remote_file_name` argument of `Profiler` is passed as the `remote_file_name` argument of `TorchProfiler`, which supports passing `None` to disable uploading trace files.

Prior to this commit, passing `None` to `Profiler` to do this whilst using a static type checker led to a type error.

* fix: when there is no train_metrics, do not checkpoint (#2502)
* Remove metric saving (#2514)
* no metric save
* fix docs
* checkdown
* fix tests
* filter warning
* move to device
* fix device gpu
* Update composer/core/state.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix daily tests by removing gpu marker (#2515)
* Refactor mosaic_fsdp.py (#2506)
* Refactor mosaic_fsdp.py
* Format file
* Rename monkey patch function
* Fix import path
* Format files
* Fix version
* fix pr (#2517)
* Add custom sharding to ChunkShardingSpec (#2507)
* Refactor mosaic_fsdp.py
* Format file
* Rename monkey patch function
* Fix import path
* Format files
* Fix version
* Fix import path
* Monkey patch ChunkShardingSpec to dynamically detect sharding dim
* Format file
* Add non divisible functionality to ChunkShardingSpec
* Format file
* Format file
* Update nightly docker image to torch nightly 09-03-23 (#2518)
* Update pre-commit in setup.py (#2522)
* Add FSDP custom wrap with torch 2.1 (#2460)
* add torch2
* add code
* tag more changes
* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* monkeypatch init
* raise pins
* add print
* more logs
* change if statements
* remove imports
* remove imports
* fix init
* fix versioning
* add hybrid shard
* checkdown
* revert hsdp
* add peak memory stats
* lint
* imports
* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Daniel King <[email protected]>

* fix wrap
* fix gate
* lint
* test
* change thresh
* import typing
* fix checks
* nuke pyright
* typo
* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Brian <[email protected]>

* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Brian <[email protected]>

* Update composer/trainer/mosaic_fsdp_utils.py

Co-authored-by: Brian <[email protected]>

* resolve comments
* add comments
* add comments

---------

Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Brian <[email protected]>

* Fix GCSObjectStore bug where hmac keys auth doesn't work (#2519)
* prelim commit
* add output logger
* create eval output logger
* change dist reduce fx
* Bump gitpython from 3.1.34 to 3.1.35 (#2525)

Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](gitpython-developers/GitPython@3.1.34...3.1.35)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pytest from 7.4.0 to 7.4.2 (#2523)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.0 to 7.4.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@7.4.0...7.4.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Upgrade to mlflow version 2.5.0 (#2528)
* disable cifar daily (#2527)
* mosaicml logger robustness improvements (#2530)
* Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation (#2531)
* Fix github actions for GCS integration testing (#2532)
* fix github actions
* make gpu test
* change dist reduce fx
* fix pyright
* Fix GCS tests (#2535)
* add PR tests
* fix test
* remove pr daily
* remove pr daily
* finish error logging cb
* fix
* add import to init
* add import to init
* add import to init
* add file writing
* add file writing
* add file writing
* add file writing
* add file writing
* move tensors to cpu
* remove tensors
* remove tensors
* remove tensors
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* add prompt to qa
* try debugging dist sync issue
* nit
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* fix syncing of non tensor state
* added gpu test
* fix error
* finish testing callback
* fix all errors
* test commit
* roll back test commit
* remove ranks
* re-tesT
* add custome gen kwargs and stopping on eos token
* modify test
* modify test
* finish
* finish
* finish
* finish
* finish pr
* implement early stop
* add tesT
* merge
* fix
* finish
* finish
* fix bug
* finish
* bug fix
* add keys
* add correcT
* modify sync
* diff split
* fix typo
* edit condition
* broken wip
* design demonstration commit
* simplify pr
* further simplify
* wip
* add comments
* add other icl metrics
* wip
* change dict method, add more stuff to logging
* fix typos, change some comments
* decode tensors, fix wrong dict key
* fix mc
* 1 to 0 lol
* wip linting
* adjust to step logging
* adjust logging names
* add mflow, rm batch keys
* add comments, check for dict in huggingface model update_metric
* add user specified logging
* move metric_name duplication to update_metric
* wip fix testing
* fix input shape error
* rm init
* rm eval_after_all
* step=None
* step=state.timestamp.batch.value
* update name to include step
* linting, wip on test
* fix test
* pyright wip
* add non-batch warning
* pyright
* debug
* rm this commit that wasn't the right branch
* log at the end of training
* rm silly wandb table logging
* add run_name
* add docstring
* add debug logging
* more logging
* rm info logging
* improve comments
* Update composer/callbacks/eval_output_logging_callback.py

Co-authored-by: Evan Racah <[email protected]>

* rm logging bool
* fix logging for schema tasks
* fix schema / mc tasks
* yapf
* rm reshape
* fix tests
* cleanup test
* pyright
* pyright
* docstring
* pyright
* update tests
* rm attention mask requirement
* Update composer/metrics/nlp.py

Co-authored-by: Mihir Patel <[email protected]>

* Update composer/metrics/nlp.py

Co-authored-by: Mihir Patel <[email protected]>

* rm todo
* lint
* lint
* lint
* more lint

---------

Signed-off-by: Prithvi Kannan <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Jeremy Dohmann <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: Charles Tang <[email protected]>
Co-authored-by: Rishab Parthasarathy <[email protected]>
Co-authored-by: Prithvi Kannan <[email protected]>
Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: eracah <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: nik-mosaic <[email protected]>
Co-authored-by: bandish-shah <[email protected]>
Co-authored-by: Bandish Shah <[email protected]>
Co-authored-by: bcui19 <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Your Name <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Scott Stevenson <[email protected]>
Co-authored-by: furkanbiten <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Mikhail Kolesov <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Tessa Barton <[email protected]>