Add rosetta tests to ci.yaml #298

Merged Dec 7, 2023 (48 commits)
Changes from 11 commits
Commits
ae8badc
Fixes rosetta tests from overwriting artifacts from upstream tests
terrykong Oct 4, 2023
fa666c9
add rosetta tests to ci.yaml
ashors1 Oct 11, 2023
ea81a9a
add SUFFIX option to rosetta pax workflow
ashors1 Oct 11, 2023
4b6df40
fix missing suffix in pax workflow
ashors1 Oct 11, 2023
bf802b2
fix rosetta images in ci.yaml
ashors1 Oct 11, 2023
45f9667
replace SUFFIX with ARTIFACT_NAME
ashors1 Oct 12, 2023
145b10f
Merge branch 'main' into add-rosetta-tests-to-ci
yhtang Oct 13, 2023
b453aeb
Disable TE build/test to temporarily work-around the workflow file nu…
yhtang Oct 13, 2023
e85c77e
Fix test name in job dependency
yhtang Oct 13, 2023
1dc9d57
fix typo
ashors1 Oct 13, 2023
ad20730
add 'ARTIFACT_NAME' in a few more places in rosetta workflow
ashors1 Oct 14, 2023
49cc36a
fix metrics report when artifact prefix is used
ashors1 Oct 16, 2023
0d8bf43
fix parsing
ashors1 Oct 18, 2023
a00ee2e
update filename parsing
ashors1 Oct 18, 2023
2da2664
Merge branch 'main' into add-rosetta-tests-to-ci
terrykong Oct 19, 2023
bd2e163
fix typo
ashors1 Oct 19, 2023
3d3b2cc
small parsing fix
ashors1 Oct 19, 2023
0ff5eef
hardcode 'rosetta' prefix for rosetta pax tests
ashors1 Oct 19, 2023
7477998
add artifact name for rosetta t5x
ashors1 Oct 19, 2023
7590866
Remove unneeded comment
ashors1 Oct 30, 2023
d394b72
add ARTIFACT_NAME where missing, publish rosetta paxml results along …
ashors1 Oct 31, 2023
0416fe5
Merge branch 'main' of github.com:NVIDIA/JAX-Toolbox into add-rosetta…
ashors1 Oct 31, 2023
875baa7
remove dependence on build-te for now
ashors1 Oct 31, 2023
657018e
rename pax rosetta yaml file
ashors1 Nov 1, 2023
99619e1
merge main branch and add 'rosetta-' prefix to some more places in _t…
ashors1 Nov 2, 2023
badb873
Add ci.yaml changes from main
ashors1 Nov 2, 2023
bd9a540
temporarily remove a workflow to test changes
ashors1 Nov 3, 2023
7631ccf
temporarily remove another workflow from ci.yaml
ashors1 Nov 3, 2023
2a9a5a9
remove dependencies on commented-out workflows
ashors1 Nov 3, 2023
52c5eff
Merge branch 'main' of github.com:NVIDIA/JAX-Toolbox into add-rosetta…
ashors1 Nov 16, 2023
955b585
fix typo
ashors1 Nov 17, 2023
fac6b75
resolve merge conflicts
ashors1 Nov 27, 2023
12c2095
Merge branch 'main' of github.com:NVIDIA/JAX-Toolbox into add-rosetta…
ashors1 Nov 30, 2023
02e31fe
consolidate rosetta t5x and vit workflows
ashors1 Nov 30, 2023
4b4375a
remove vit workflow
ashors1 Nov 30, 2023
9f29a87
fix some test names, make rosetta pax status output format match rose…
ashors1 Dec 1, 2023
7028c8e
fix publish-test step for rosetta t5x
ashors1 Dec 1, 2023
25b5e3d
fix syntax
ashors1 Dec 1, 2023
865840b
fix completion tables
ashors1 Dec 3, 2023
874154c
fix vit image
ashors1 Dec 4, 2023
f672380
Merge branch 'main' of github.com:NVIDIA/JAX-Toolbox into add-rosetta…
ashors1 Dec 4, 2023
ebceefb
fix vit batch size
ashors1 Dec 4, 2023
9cb02a7
Merge branch 'main' of github.com:NVIDIA/JAX-Toolbox into add-rosetta…
ashors1 Dec 5, 2023
2e2e882
correct artifact names in nightly rosetta tests
ashors1 Dec 6, 2023
74cec84
remove _test_vit workflow from nightly tests as vit tests are now inc…
ashors1 Dec 6, 2023
f61d3ab
remove references to test-vit
ashors1 Dec 6, 2023
dab2131
fix t5x job name
ashors1 Dec 6, 2023
ccf7eb6
fix typo
ashors1 Dec 6, 2023
13 changes: 9 additions & 4 deletions .github/workflows/_test_pax.yaml
@@ -13,6 +13,11 @@ on:
description: Extra command line args to pass to test-pax.sh
default: ""
required: false
+ARTIFACT_NAME:
+type: string
+description: If provided, will prepend a prefix to the artifact name. Helpful if re-running this reusable workflow to prevent clobbering of artifacts
+default: ""
+required: false
outputs:
TEST_STATUS:
description: 'Summary pass/fail value indicating if results from tests are acceptable'
@@ -65,7 +70,7 @@ jobs:
NODES=$(((TOTAL_TASKS+MAX_GPUS_PER_NODE-1)/MAX_GPUS_PER_NODE))
GPUS_PER_NODE=$((TOTAL_TASKS/NODES))

-JOB_NAME=${GITHUB_RUN_ID}-${TEST_CASE_NAME}
+JOB_NAME=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -172,7 +177,7 @@ jobs:
shell: bash -x {0}
run: |
pip install pytest pytest-reportlog tensorboard
-for i in ${GITHUB_RUN_ID}-*DP*FSDP*TP*PP; do
+for i in ${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*DP*FSDP*TP*PP; do
SUBDIR=$(echo $i | cut -d'-' -f2)
mv $i/$SUBDIR* .
python3 .github/workflows/baselines/summarize_metrics.py $SUBDIR # create result json in baseline format
@@ -201,10 +206,10 @@ jobs:
if: ( always() )
secrets: inherit
with:
-ENDPOINT_FILENAME: 'pax-test-status.json'
+ENDPOINT_FILENAME: '${{ inputs.ARTIFACT_NAME }}pax-test-status.json'
PUBLISH: false
SCRIPT: |
-EXIT_STATUSES="${GITHUB_RUN_ID}-*DP*FSDP*TP*PP/*-status.json"
+EXIT_STATUSES="${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*DP*FSDP*TP*PP/*-status.json"
PASSED_TESTS=$(jq -r '. | select ((.state == "COMPLETED") and (.exitcode == "0")) | .state' $EXIT_STATUSES | wc -l)
FAILED_TESTS=$(jq -r '. | select ((.state != "COMPLETED") or (.exitcode != "0")) | .state' $EXIT_STATUSES | wc -l)
TOTAL_TESTS=$(ls $EXIT_STATUSES | wc -l)
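The NODES and GPUS_PER_NODE lines in the hunk at -65,7 are a ceiling division followed by an integer division. A minimal shell sketch of that arithmetic, assuming MAX_GPUS_PER_NODE=8 (the real value is set outside the lines shown in this diff) and using a hypothetical helper named plan:

```shell
#!/bin/sh
# Assumption: 8 GPUs per node. The workflow defines MAX_GPUS_PER_NODE
# elsewhere; this value is only for illustration.
MAX_GPUS_PER_NODE=8

# plan is a hypothetical helper that repeats the two expressions from the diff.
plan() {
  TOTAL_TASKS=$1
  # Ceiling division: smallest node count that can hold all tasks.
  NODES=$(( (TOTAL_TASKS + MAX_GPUS_PER_NODE - 1) / MAX_GPUS_PER_NODE ))
  # Integer division: tasks spread evenly, remainder dropped.
  GPUS_PER_NODE=$(( TOTAL_TASKS / NODES ))
  echo "tasks=${TOTAL_TASKS} nodes=${NODES} gpus_per_node=${GPUS_PER_NODE}"
}

plan 1
plan 8
plan 12
plan 16
```

With these assumed values, 12 tasks map to 2 nodes with 6 GPUs each. If TOTAL_TASKS is not divisible by NODES (for example 9 tasks on 2 nodes), the integer division rounds down, so NODES * GPUS_PER_NODE can be smaller than TOTAL_TASKS.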
23 changes: 14 additions & 9 deletions .github/workflows/_test_pax_rosetta.yaml
@@ -13,6 +13,11 @@ on:
description: Extra command line args to pass to test-pax.sh
default: ""
required: false
+ARTIFACT_NAME:
+type: string
+description: If provided, will prepend a prefix to the artifact name. Helpful if re-running this reusable workflow to prevent clobbering of artifacts
+default: ""
+required: false
outputs:
TEST_STATUS:
description: 'Summary pass/fail value indicating if results from tests are acceptable'
@@ -63,7 +68,7 @@ jobs:
NODES=$(((TOTAL_TASKS+MAX_GPUS_PER_NODE-1)/MAX_GPUS_PER_NODE))
GPUS_PER_NODE=$((TOTAL_TASKS/NODES))

-JOB_NAME=${GITHUB_RUN_ID}-${TEST_CASE_NAME}
+JOB_NAME=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -199,7 +204,7 @@ jobs:
NODES=$(((TOTAL_TASKS+MAX_GPUS_PER_NODE-1)/MAX_GPUS_PER_NODE))
GPUS_PER_NODE=$((TOTAL_TASKS/NODES))

-JOB_NAME=${GITHUB_RUN_ID}-${TEST_CASE_NAME}
+JOB_NAME=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -331,7 +336,7 @@ jobs:
NODES=$(((TOTAL_TASKS+MAX_GPUS_PER_NODE-1)/MAX_GPUS_PER_NODE))
GPUS_PER_NODE=$((TOTAL_TASKS/NODES))

-JOB_NAME=${GITHUB_RUN_ID}-${TEST_CASE_NAME}
+JOB_NAME=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -442,13 +447,13 @@ jobs:
shell: bash -x {0}
run: |
pip install pytest pytest-reportlog tensorboard
-for i in ${GITHUB_RUN_ID}-*DP*FSDP*TP*PP* ${GITHUB_RUN_ID}-*DP_TE_dropout; do
+for i in ${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*DP*FSDP*TP*PP* ${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*DP_TE_dropout; do
SUBDIR=$(echo $i | cut -d'-' -f2)
mv $i/$SUBDIR* .
python3 .github/workflows/baselines/summarize_metrics.py $SUBDIR # create result json in baseline format
done

-echo '## PAX MGMN Test Metrics' >> $GITHUB_STEP_SUMMARY
+echo '## Rosetta PAX MGMN Test Metrics' >> $GITHUB_STEP_SUMMARY
for i in *_metrics.json; do
echo $i | cut -d'.' -f1
echo '```json'
@@ -471,10 +476,10 @@ jobs:
if: ( always() )
secrets: inherit
with:
-ENDPOINT_FILENAME: 'pax-test-status.json'
+ENDPOINT_FILENAME: '${{ inputs.ARTIFACT_NAME }}pax-test-status.json'
PUBLISH: false
SCRIPT: |
-EXIT_STATUSES="${GITHUB_RUN_ID}-*DP*FSDP*TP*PP*/*-status.json ${GITHUB_RUN_ID}-*DP_TE_dropout/*-status.json"
+EXIT_STATUSES="${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*DP*FSDP*TP*PP*/*-status.json ${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*DP_TE_dropout/*-status.json"
PASSED_TESTS=$(jq -r '. | select ((.state == "COMPLETED") and (.exitcode == "0")) | .state' $EXIT_STATUSES | wc -l)
FAILED_TESTS=$(jq -r '. | select ((.state != "COMPLETED") or (.exitcode != "0")) | .state' $EXIT_STATUSES | wc -l)
TOTAL_TESTS=$(ls $EXIT_STATUSES | wc -l)
@@ -528,9 +533,9 @@ jobs:
(
cat << EOF

-## PAX MGMN training
+## Rosetta PAX MGMN training

-[view metrics](https://${{ vars.HOSTNAME_TENSORBOARD }}/#scalars&regexInput=${GITHUB_RUN_ID}&_smoothingWeight=0&tagFilter=seqs_per)
+[view metrics](https://${{ vars.HOSTNAME_TENSORBOARD }}/#scalars&regexInput=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}&_smoothingWeight=0&tagFilter=seqs_per)

EOF
) | tee $GITHUB_STEP_SUMMARY
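The metrics loop in this file derives SUBDIR with cut -d'-' -f2, which selects the second hyphen-separated field of the directory name. Once a hyphenated prefix such as rosetta- is prepended, that field is the run ID rather than the test case name. The sketch below illustrates the difference and one possible alternative (stripping the known prefix and run ID). RUN_ID, TEST_CASE, and both helper functions are hypothetical, and the commit list mentions later parsing fixes whose actual approach is not shown in this diff.

```shell
#!/bin/sh
# Hypothetical values; the real workflow gets these from GitHub Actions.
RUN_ID=6512345678
TEST_CASE=8DP1FSDP1TP1PP

# Field-based parsing as in the diff: take the 2nd '-'-separated field.
subdir_by_field() {
  echo "$1" | cut -d'-' -f2
}

# Possible alternative (an assumption, not the PR's fix): remove the
# known "<prefix><run id>-" head and keep whatever follows.
# $1 = directory name, $2 = prefix (may be empty), $3 = run id
subdir_by_strip() {
  name=$1
  head="$2$3-"
  name=${name#"$head"}
  echo "$name"
}

for PREFIX in "" "rosetta-"; do
  DIR="${PREFIX}${RUN_ID}-${TEST_CASE}"
  echo "dir=${DIR}"
  echo "  by field: $(subdir_by_field "$DIR")"
  echo "  by strip: $(subdir_by_strip "$DIR" "$PREFIX" "$RUN_ID")"
done
```

With an empty prefix both helpers print 8DP1FSDP1TP1PP. With the rosetta- prefix the field-based version prints 6512345678, while the stripping version still prints the test case name.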
11 changes: 8 additions & 3 deletions .github/workflows/_test_t5x.yaml
@@ -18,6 +18,11 @@ on:
description: Extra gin args to pass to test-t5x.sh
default: ""
required: false
+ARTIFACT_NAME:
+type: string
+description: If provided, will prepend a prefix to the artifact name. Helpful if re-running this reusable workflow to prevent clobbering of artifacts
+default: ""
+required: false
outputs:
TEST_STATUS:
description: 'Summary pass/fail value indicating if results from tests are acceptable'
@@ -58,7 +63,7 @@ jobs:
run: |
IMAGE="$(echo ${{inputs.T5X_IMAGE}} | sed 's/\//#/')"
TEST_CASE_NAME=1P${{ matrix.N_GPU }}G
-JOB_NAME=${GITHUB_RUN_ID}-${TEST_CASE_NAME}
+JOB_NAME=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
BATCH_SIZE=$((${{ inputs.BATCH_SIZE_PER_GPU }} * ${{ matrix.N_GPU }}))
@@ -312,10 +317,10 @@ jobs:
if: ( always() )
secrets: inherit
with:
-ENDPOINT_FILENAME: 't5x-test-completion-status.json'
+ENDPOINT_FILENAME: '${{ inputs.ARTIFACT_NAME }}t5x-test-completion-status.json'
PUBLISH: false
SCRIPT: |
-EXIT_STATUSES="${GITHUB_RUN_ID}-*/*-status.json"
+EXIT_STATUSES="${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-*[PG]*[GN]/*-status.json"
PASSED_TESTS=$(jq -r '. | select ((.state == "COMPLETED") and (.exitcode == "0")) | .state' $EXIT_STATUSES | wc -l)
FAILED_TESTS=$(jq -r '. | select ((.state != "COMPLETED") or (.exitcode != "0")) | .state' $EXIT_STATUSES | wc -l)
TOTAL_TESTS=$(ls $EXIT_STATUSES | wc -l)
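The new EXIT_STATUSES glob narrows the directory match from ${GITHUB_RUN_ID}-* to names that, after the optional prefix and run ID, contain a P or G followed (possibly after other characters) by a final G or N. A small sketch of which names that pattern accepts, using a POSIX case pattern with the same wildcard syntax. The run ID, the matches helper, and every directory name other than the 1P<N>G form are made up for illustration.

```shell
#!/bin/sh
# Hypothetical stand-ins for ${GITHUB_RUN_ID} and inputs.ARTIFACT_NAME.
RUN_ID=6512345678
PREFIX="rosetta-"

# matches is a hypothetical helper: prints "yes" if the directory name
# fits the directory part of the new EXIT_STATUSES glob, else "no".
matches() {
  case "$1" in
    "${PREFIX}${RUN_ID}-"*[PG]*[GN]) echo yes ;;
    *) echo no ;;
  esac
}

for d in \
  "rosetta-6512345678-1P8G" \
  "rosetta-6512345678-1G2N" \
  "rosetta-6512345678-logs" \
  "6512345678-1P8G"
do
  echo "$d -> $(matches "$d")"
done
```

Here the first two names match, the third does not (it does not end in G or N), and the last does not because it lacks the rosetta- prefix that the pattern now begins with.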
60 changes: 40 additions & 20 deletions .github/workflows/ci.yaml
@@ -119,15 +119,15 @@ jobs:
REF_XLA: ${{ needs.metadata.outputs.REF_XLA }}
secrets: inherit

-build-te:
-needs: [metadata, build-jax]
-uses: ./.github/workflows/_build_te.yaml
-with:
-BUILD_DATE: ${{ needs.metadata.outputs.BUILD_DATE }}
-BASE_IMAGE: ${{ needs.build-jax.outputs.DOCKER_TAGS }}
-REPO_TE: ${{ needs.metadata.outputs.REPO_TE }}
-REF_TE: ${{ needs.metadata.outputs.REF_TE }}
-secrets: inherit
+# build-te:
+# needs: [metadata, build-jax]
+# uses: ./.github/workflows/_build_te.yaml
+# with:
+# BUILD_DATE: ${{ needs.metadata.outputs.BUILD_DATE }}
+# BASE_IMAGE: ${{ needs.build-jax.outputs.DOCKER_TAGS }}
+# REPO_TE: ${{ needs.metadata.outputs.REPO_TE }}
+# REF_TE: ${{ needs.metadata.outputs.REF_TE }}
+# secrets: inherit

build-t5x:
needs: [metadata, build-jax]
@@ -182,7 +182,8 @@ jobs:
secrets: inherit

build-summary:
-needs: [build-base, build-jax, build-te, build-t5x, build-pax, build-rosetta-t5x, build-rosetta-pax]
+needs: [build-base, build-jax, build-t5x, build-pax, build-rosetta-t5x, build-rosetta-pax]
+# needs: [build-base, build-jax, build-te, build-t5x, build-pax, build-rosetta-t5x, build-rosetta-pax]
# needs: [build-base, build-jax, build-te, build-t5x, build-pax, build-pax-aarch64, build-rosetta-t5x, build-rosetta-pax]
if: always()
runs-on: ubuntu-22.04
@@ -197,7 +198,6 @@ jobs:
| ------------ | -------------------------------------------------- |
| Base | ${{ needs.build-base.outputs.DOCKER_TAGS }} |
| JAX | ${{ needs.build-jax.outputs.DOCKER_TAGS }} |
-| JAX-TE | ${{ needs.build-te.outputs.DOCKER_TAGS }} |
| T5X | ${{ needs.build-t5x.outputs.DOCKER_TAGS }} |
| PAX | ${{ needs.build-pax.outputs.DOCKER_TAGS }} |
| ROSETTA(t5x) | ${{ needs.build-rosetta-t5x.outputs.DOCKER_TAGS }} |
@@ -216,26 +216,45 @@ jobs:
JAX_IMAGE: ${{ needs.build-jax.outputs.DOCKER_TAGS }}
secrets: inherit

-test-te:
-needs: build-te
-uses: ./.github/workflows/_test_te.yaml
-with:
-JAX_TE_IMAGE: ${{ needs.build-te.outputs.DOCKER_TAGS }}
-secrets: inherit
-test-t5x:
+# test-te:
+# needs: build-te
+# uses: ./.github/workflows/_test_te.yaml
+# with:
+# JAX_TE_IMAGE: ${{ needs.build-te.outputs.DOCKER_TAGS }}
+# secrets: inherit
+
+test-upstream-t5x:
needs: build-t5x
uses: ./.github/workflows/_test_t5x.yaml
with:
T5X_IMAGE: ${{ needs.build-t5x.outputs.DOCKER_TAGS }}
secrets: inherit

-test-pax:
+test-rosetta-t5x:
+needs: build-rosetta-t5x
+uses: ./.github/workflows/_test_t5x.yaml
+with:
+T5X_IMAGE: ${{ needs.build-rosetta-t5x.outputs.DOCKER_TAGS }}
+# Disable packing b/c rosetta-t5x images run with TE by default, and TE does not currently support packing
+EXTRA_GIN_ARGS: "--gin.train/utils.DatasetConfig.pack=False --gin.train_eval/utils.DatasetConfig.pack=False"
+ARTIFACT_NAME: "rosetta-"
+secrets: inherit
+
+test-upstream-pax:
needs: build-pax
uses: ./.github/workflows/_test_pax.yaml
with:
PAX_IMAGE: ${{ needs.build-pax.outputs.DOCKER_TAGS }}
secrets: inherit

+test-rosetta-pax:
+needs: build-rosetta-pax
+uses: ./.github/workflows/_test_pax_rosetta.yaml
+with:
+PAX_IMAGE: ${{ needs.build-rosetta-pax.outputs.DOCKER_TAGS }}
+ARTIFACT_NAME: "rosetta-"
+secrets: inherit
+
test-vit:
needs: build-rosetta-t5x
uses: ./.github/workflows/_test_vit.yaml
@@ -247,7 +266,8 @@ jobs:
finalize:
if: always()
# TODO: use dynamic matrix to make dependencies self-updating
-needs: [build-summary, test-distribution, test-jax, test-te, test-t5x, test-pax]
+# needs: [build-summary, test-distribution, test-jax, test-te, test-t5x, test-pax]
+needs: [build-summary, test-distribution, test-jax, test-upstream-t5x, test-upstream-pax, test-rosetta-t5x, test-rosetta-pax]
uses: ./.github/workflows/_finalize.yaml
with:
PUBLISH_BADGE: false
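In ci.yaml the reusable workflow _test_t5x.yaml is now called twice in the same run: test-upstream-t5x passes no ARTIFACT_NAME (so the default empty string applies), and test-rosetta-t5x passes ARTIFACT_NAME: "rosetta-". A minimal sketch of how the job name and endpoint filename come out for each caller, assuming a made-up run ID and two hypothetical helpers that mirror the string concatenation in the diff:

```shell
#!/bin/sh
# Hypothetical stand-in for ${GITHUB_RUN_ID}; both callers share one run ID.
RUN_ID=6512345678

# Mirrors JOB_NAME=${{ inputs.ARTIFACT_NAME }}${GITHUB_RUN_ID}-${TEST_CASE_NAME}
# $1 = ARTIFACT_NAME (may be empty), $2 = test case name
job_name() {
  echo "$1${RUN_ID}-$2"
}

# Mirrors ENDPOINT_FILENAME: '${{ inputs.ARTIFACT_NAME }}t5x-test-completion-status.json'
endpoint_filename() {
  echo "$1t5x-test-completion-status.json"
}

echo "upstream: $(job_name "" 1P8G) $(endpoint_filename "")"
echo "rosetta:  $(job_name "rosetta-" 1P8G) $(endpoint_filename "rosetta-")"
```

Without the prefix both callers would produce the same names for the same run ID and test case; with it, the rosetta names are distinct while the upstream names stay exactly as they were before the change.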
1 change: 1 addition & 0 deletions .github/workflows/nightly-rosetta-pax-build.yaml
@@ -89,6 +89,7 @@ jobs:
if: (github.event_name == 'workflow_run' && github.event.workflow_run.conclusion == 'success') || github.event_name == 'workflow_dispatch'
with:
PAX_IMAGE: ${{ needs.build.outputs.DOCKER_TAGS }}
+ARTIFACT_NAME: "rosetta-"
secrets: inherit

publish-test:
1 change: 1 addition & 0 deletions .github/workflows/nightly-rosetta-t5x-build-test.yaml
@@ -101,6 +101,7 @@ jobs:
T5X_IMAGE: ${{ needs.build.outputs.DOCKER_TAGS }}
# Disable packing b/c rosetta-t5x images run with TE by default, and TE does not currently support packing
EXTRA_GIN_ARGS: "--gin.train/utils.DatasetConfig.pack=False --gin.train_eval/utils.DatasetConfig.pack=False"
+ARTIFACT_NAME: "rosetta-"
secrets: inherit

test-vit: