update runners #3472

Closed
wants to merge 70 commits into from
70 commits
e72bbaf
bump (#3383)
KuuCi Jun 8, 2024
9c4b0ba
Fix backward compatibility caused by missing eval metrics class (#3385)
bigning Jun 8, 2024
e85e738
Bump version v0.23.2 (#3386)
bigning Jun 8, 2024
afa2e39
Restore dev version (#3388)
bigning Jun 8, 2024
4cbb4a2
Only requires `databricks-sdk` when inside the Databricks platform (#…
antoinebrl Jun 9, 2024
735aa6f
Update packaging requirement from <24.1,>=21.3.0 to >=21.3.0,<24.2 (#…
dependabot[bot] Jun 10, 2024
db1325a
Bump cryptography from 42.0.6 to 42.0.8 (#3391)
dependabot[bot] Jun 10, 2024
7778fcf
Skip extra dataset state load (#3393)
mvpatel2000 Jun 11, 2024
919fe91
Remove FSDP restriction from PyTorch 1.13 (#3395)
mvpatel2000 Jun 12, 2024
b07b82e
Check for 'CUDA error: out of memory' with auto-microbatching (#3400)
JAEarly Jun 13, 2024
6298d76
Add tokens to iterations (#3374)
b-chu Jun 13, 2024
9500fd1
Busy wait utils in dist (#3396)
dakinggg Jun 14, 2024
a1c581d
Add buffering time to mlflow logger (#3401)
chenmoneygithub Jun 14, 2024
fffa335
Update _patch_pytorch.py (#3402)
mvpatel2000 Jun 14, 2024
3e1396e
Add pynvml to mlflow dep group (#3404)
dakinggg Jun 17, 2024
0eb1eee
min/max flagging added to system_metrics_monitor with only non-redund…
JackZ-db Jun 17, 2024
0ee83f7
simplify launcher (#3398)
mvpatel2000 Jun 17, 2024
04ba0b6
Optionally use `flash-attn`'s CE loss for metrics (#3394)
snarayan21 Jun 17, 2024
1dfd3bc
log image fix (#3286)
jessechancy Jun 17, 2024
f7e17de
[ckpt-rewr] Save state dict API (#3372)
eracah Jun 17, 2024
0a1a6a4
Revert "Optionally use `flash-attn`'s CE loss for metrics (#3394)" (#…
snarayan21 Jun 18, 2024
0d6ef26
CPU tests image fix (#3409)
snarayan21 Jun 18, 2024
dac1995
Add setter for epoch in iteration (#3407)
b-chu Jun 18, 2024
567c6e5
Move pillow dep as required (#3412)
mvpatel2000 Jun 18, 2024
f26a1d3
fixing mlflow logging to Databricks workspace file paths with /Shared…
JackZ-db Jun 18, 2024
894a192
Bump version v0.23.3 (#3414)
karan6181 Jun 19, 2024
459a019
Update numpy requirement from <1.27.0,>=1.21.5 to >=1.21.5,<2.1.0 (#3…
dependabot[bot] Jun 20, 2024
7a4644a
Restore dev version (#3417)
karan6181 Jun 20, 2024
94f1ec1
Save checkpoint to disk for API with new save layout (#3399)
eracah Jun 21, 2024
d420765
fix typing (#3419)
mvpatel2000 Jun 21, 2024
ba17897
Fixes some typing issues (#3418)
dakinggg Jun 21, 2024
4e8ed2e
Fix small things (#3420)
b-chu Jun 21, 2024
5ba56ac
Bump coverage[toml] from 7.5.3 to 7.5.4 (#3422)
dependabot[bot] Jun 24, 2024
abfd78c
Update psutil requirement from <6,>=5.8.0 to >=5.8.0,<7 (#3424)
dependabot[bot] Jun 24, 2024
d3e95a9
Add support for variable length dataloaders in DDP (#3416)
JAEarly Jun 24, 2024
84c4723
Hsdp + MoE CI tests (#3378)
KuuCi Jun 24, 2024
4501305
bumping mlflow to 2.14.1 (#3425)
JackZ-db Jun 25, 2024
a7218d1
Skip HSDP + TP pytests that require torch 2.3 or above (#3426)
KuuCi Jun 25, 2024
8361862
remove codeql (#3429)
mvpatel2000 Jun 26, 2024
0b74933
Remove save overwrite (#3431)
mvpatel2000 Jun 27, 2024
dd3e7f9
LeDocs (#3430)
snarayan21 Jun 28, 2024
ac4bd59
Lower the system metrics logging frequency to reduce MLflow server's …
chenmoneygithub Jun 28, 2024
38e5e51
Update paramiko requirement from <3,>=2.11.0 to >=3.4.0,<4 (#3439)
dependabot[bot] Jul 1, 2024
6b461d0
bump versions (#3433)
mvpatel2000 Jul 1, 2024
6bac335
fix eval after all (#3445)
mvpatel2000 Jul 1, 2024
3cd6e6d
skip log (#3446)
mvpatel2000 Jul 1, 2024
cf76c96
Remove MosaicMLLambdaEvalClient (#3432)
aspfohl Jul 1, 2024
8fbca38
Relax hf hub pin (#3435)
dakinggg Jul 1, 2024
54d58c9
Pytest skip 2 (#3448)
KuuCi Jul 2, 2024
5a129d1
bump version (#3450)
XiaohanZhangCMU Jul 2, 2024
a0806f6
Bump ipykernel from 6.29.2 to 6.29.5 (#3459)
dependabot[bot] Jul 8, 2024
4b71141
Update torchmetrics requirement (#3460)
dependabot[bot] Jul 8, 2024
89db4e2
Bump databricks-sdk from 0.28.0 to 0.29.0 (#3456)
dependabot[bot] Jul 8, 2024
6df01ba
[Checkpoint] Fix symlink issue where symlink file uploaded before che…
bigning Jul 8, 2024
e951f0a
Correctly process `parallelism_config['tp']` when it's a dict (#3434)
snarayan21 Jul 8, 2024
6dec835
[checkpoint v2] Download api (#3447)
bigning Jul 9, 2024
18795f1
removed exception from logger (#3464)
jjanezhang Jul 9, 2024
11bad57
fixed docs for mfu (#3469)
JackZ-db Jul 11, 2024
74c7d3b
add comment (#3470)
mvpatel2000 Jul 11, 2024
14bc187
Change pytorch eval for FP8 to default to fall back to BF16 (#3454)
j316chuck Jul 11, 2024
a5dc155
Fix checkpoint events (#3468)
b-chu Jul 15, 2024
69b8b23
Add mosaicmllogger attr for fit start (#3467)
ethanma-db Jul 16, 2024
15c329e
Bump coverage[toml] from 7.5.4 to 7.6.0 (#3471)
dependabot[bot] Jul 18, 2024
8a09a3b
Bump flash attention to 2.6.1 (#3476)
dakinggg Jul 21, 2024
779ff3e
cpu
KevDevSha Jul 22, 2024
2c0eac2
gpu
KevDevSha Jul 22, 2024
bf1843a
Merge branch 'dev' into kevin/emu-allowlisted-runners
KevDevSha Jul 22, 2024
2b3ff8c
coverage fix
KevDevSha Jul 22, 2024
bd6515c
lint
KevDevSha Jul 22, 2024
4488555
Update pr-cpu.yaml
KevDevSha Jul 22, 2024
28 changes: 16 additions & 12 deletions .github/workflows/pr-cpu.yaml
@@ -1,6 +1,6 @@
name: PR CPU tests
on:
pull_request:
pull_request_target:
workflow_dispatch:
# Cancel old runs when a new commit is pushed to the same branch if not on main
# or dev
@@ -9,7 +9,8 @@ concurrency:
cancel-in-progress: ${{ github.ref != 'refs/heads/main' && github.ref != 'refs/heads/dev' }}
jobs:
pytest-cpu:
uses: mosaicml/ci-testing/.github/workflows/[email protected]
name: ${{ matrix.name }}
runs-on: linux-ubuntu-latest
strategy:
matrix:
include:
@@ -29,16 +30,19 @@ jobs:
container: mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu20.04
markers: not daily and not remote and not gpu and doctest
pytest_command: coverage run -m pytest tests/test_docs.py
name: ${{ matrix.name }}
if: github.repository_owner == 'mosaicml'
with:
composer_package_name: mosaicml
container: ${{ matrix.container }}
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest-command: ${{ matrix.pytest_command }}
pytest-markers: ${{ matrix.markers }}
safe_directory: composer
steps:
- name: Checkout code
uses: actions/checkout@v2
- name: Run PR CPU Tests
uses: mosaicml/ci-testing/.github/actions/[email protected]
with:
container: ${{ matrix.container }}
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest_command: ${{ matrix.pytest_command }}
pytest_markers: ${{ matrix.markers }}
safe_directory: composer
composer_package_name: mosaicml
coverage:
uses: ./.github/workflows/coverage.yaml
name: Coverage Results
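
The page flattens the diff above: indentation and the +/- markers are lost. As a reading aid, here is a sketch of how the `pytest-cpu` job appears to read after this change, with YAML indentation restored. It is reconstructed from the hunk line counts (16 additions, 12 deletions) and is not the authoritative file. Two things in it are assumptions: the matrix entry name `cpu-doctest` is a placeholder for an entry collapsed in the diff, and `<action-name>@<ref>` stands in for the action reference that the page rendered as `[email protected]`.

```yaml
# Sketch only: reconstructed from the flattened diff, not copied from the repo.
jobs:
  pytest-cpu:
    name: ${{ matrix.name }}
    runs-on: linux-ubuntu-latest
    strategy:
      matrix:
        include:
          # Other matrix entries are collapsed in the diff.
          # The name below is a placeholder, not taken from the source.
          - name: cpu-doctest
            container: mosaicml/pytorch:2.3.1_cpu-python3.11-ubuntu20.04
            markers: not daily and not remote and not gpu and doctest
            pytest_command: coverage run -m pytest tests/test_docs.py
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Run PR CPU Tests
        # Placeholder: the real action name and ref are garbled on the page.
        uses: mosaicml/ci-testing/.github/actions/<action-name>@<ref>
        with:
          container: ${{ matrix.container }}
          name: ${{ matrix.name }}
          pip_deps: "[all]"
          pytest_command: ${{ matrix.pytest_command }}
          pytest_markers: ${{ matrix.markers }}
          safe_directory: composer
          composer_package_name: mosaicml
```

Note that the input keys change from hyphens (`pytest-command`, `pytest-markers`) in the old reusable-workflow call to underscores (`pytest_command`, `pytest_markers`) in the new action call.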
106 changes: 60 additions & 46 deletions .github/workflows/pr-gpu.yaml
@@ -9,7 +9,8 @@
cancel-in-progress: ${{ github.ref != 'refs/heads/main' && github.ref != 'refs/heads/dev' }}
jobs:
pytest-gpu-1:
uses: mosaicml/ci-testing/.github/workflows/[email protected]
name: ${{ matrix.name }}
runs-on: linux-ubuntu-latest
strategy:
matrix:
include:
@@ -18,24 +19,28 @@
markers: not daily and not remote and gpu and (doctest or not doctest)
pytest_command: coverage run -m pytest
composer_package_name: mosaicml
name: ${{ matrix.name }}
if: github.repository_owner == 'mosaicml'
with:
composer_package_name: ${{ matrix.composer_package_name }}
container: ${{ matrix.container }}
git_repo: mosaicml/composer
mcloud-timeout: 2250
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest-command: ${{ matrix.pytest_command }}
pytest-markers: ${{ matrix.markers }}
python-version: 3.9
gpu_num: 1
secrets:
mcloud-api-key: ${{ secrets.MCLOUD_API_KEY }}

steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Run PR GPU Tests
uses: mosaicml/ci-testing/.github/actions/[email protected]
with:
composer_package_name: ${{ matrix.composer_package_name }}
container: ${{ matrix.container }}
git_repo: mosaicml/composer
mcloud_timeout: 2250
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest_command: ${{ matrix.pytest_command }}
pytest_markers: ${{ matrix.markers }}
python_version: 3.9
gpu_num: 1
mcloud_api_key: ${{ secrets.MCLOUD_API_KEY }}
ci_repo_gpu_test_ref: v0.1.0
pytest-gpu-2:
uses: mosaicml/ci-testing/.github/workflows/[email protected]
name: ${{ matrix.name }}
runs-on: linux-ubuntu-latest
strategy:
matrix:
include:
@@ -44,25 +49,30 @@
markers: not daily and not remote and gpu and (doctest or not doctest)
pytest_command: coverage run -m pytest
composer_package_name: mosaicml
name: ${{ matrix.name }}
if: github.repository_owner == 'mosaicml'
with:
composer_package_name: ${{ matrix.composer_package_name }}
container: ${{ matrix.container }}
git_repo: mosaicml/composer
mcloud-timeout: 2250
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest-command: ${{ matrix.pytest_command }}
pytest-markers: ${{ matrix.markers }}
python-version: 3.9
gpu_num: 2
secrets:
mcloud-api-key: ${{ secrets.MCLOUD_API_KEY }}
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Run PR GPU Tests
uses: mosaicml/ci-testing/.github/actions/[email protected]
with:
composer_package_name: ${{ matrix.composer_package_name }}
container: ${{ matrix.container }}
git_repo: mosaicml/composer
mcloud_timeout: 2250
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest_command: ${{ matrix.pytest_command }}
pytest_markers: ${{ matrix.markers }}
python_version: 3.9
gpu_num: 2
mcloud_api_key: ${{ secrets.MCLOUD_API_KEY }}
ci_repo_gpu_test_ref: v0.1.0


pytest-gpu-4:
uses: mosaicml/ci-testing/.github/workflows/[email protected]
name: ${{ matrix.name }}
runs-on: linux-ubuntu-latest
strategy:
matrix:
include:
@@ -71,18 +81,22 @@
markers: not daily and not remote and gpu and (doctest or not doctest)
pytest_command: coverage run -m pytest
composer_package_name: mosaicml
name: ${{ matrix.name }}
if: github.repository_owner == 'mosaicml'
with:
composer_package_name: ${{ matrix.composer_package_name }}
container: ${{ matrix.container }}
git_repo: mosaicml/composer
mcloud-timeout: 2250
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest-command: ${{ matrix.pytest_command }}
pytest-markers: ${{ matrix.markers }}
python-version: 3.9
gpu_num: 4
secrets:
mcloud-api-key: ${{ secrets.MCLOUD_API_KEY }}
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Run PR GPU Tests
uses: mosaicml/ci-testing/.github/actions/[email protected]
with:
composer_package_name: ${{ matrix.composer_package_name }}
container: ${{ matrix.container }}
git_repo: mosaicml/composer
mcloud_timeout: 2250
name: ${{ matrix.name }}
pip_deps: "[all]"
pytest_command: ${{ matrix.pytest_command }}
pytest_markers: ${{ matrix.markers }}
python_version: 3.9
gpu_num: 4
mcloud_api_key: ${{ secrets.MCLOUD_API_KEY }}
ci_repo_gpu_test_ref: v0.1.0

Check failure on line 102 in .github/workflows/pr-gpu.yaml

GitHub Actions / code-quality (3.11, [dev])

102:37 [new-line-at-end-of-file] no new line character at the end of file
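
The rule name and message in this annotation match yamllint's `new-line-at-end-of-file` rule, so the check most likely comes from yamllint. Under that assumption, a minimal config that enables the rule could look like the sketch below. The file name `.yamllint` and the `extends: default` base are illustrative, since the repo's actual lint configuration is not shown on this page.

```yaml
# .yamllint (hypothetical; the repo's real lint config is not shown in this PR)
extends: default
rules:
  # Require every YAML file to end with a newline character.
  new-line-at-end-of-file: enable
```

The fix for the failure itself is to end `.github/workflows/pr-gpu.yaml` with a newline after its last line (line 102).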