We provision GitHub Actions runners for the JAX/Pallas/Levanter A100 unit tests using SLURM jobs that launch ephemeral runner instances inside the job. Currently, the launch-slurm-runner job succeeds as long as it is able to submit the SLURM job and wait for it to finish or terminate. If the runner fails to register, e.g. due to a network issue, launch-slurm-runner still succeeds, while the actual unit test job that needs the runner waits forever. Even cancelling and then restarting the workflow cannot work around this, because the launcher job is marked as successful and won't be rerun.
To fix that, we need the launch-slurm-runner job to query the status of the runner running inside the SLURM job. If the runner errors out without ever being able to handle a job, then the launch-slurm-runner CI job should fail as well.
Note the difference between the following two cases:
launch-slurm-runner successfully starts an A100 runner, which picks up a job that runs and fails, and
launch-slurm-runner fails to start an A100 runner.
The latter is what we need to address in this issue.
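One possible shape for the check, as a minimal sketch and not a proposal for the final implementation: the launcher polls a list of registered runners until the expected runner shows up as online, and fails if it never does within a timeout. The names `wait_for_runner`, `launcher_exit_code`, the runner name, and the timeout values are hypothetical. `list_runners` is an injected callable that is assumed to return entries with `name` and `status` fields, such as those returned by the GitHub REST endpoint for listing a repository's self-hosted runners.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_for_runner(
    runner_name: str,
    list_runners: Callable[[], Iterable[Mapping[str, object]]],
    timeout_s: float = 300.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once `runner_name` is listed as online, False on timeout.

    `list_runners` is a hypothetical hook: it should return one mapping per
    registered runner, each with at least "name" and "status" keys.
    """
    deadline = clock() + timeout_s
    while True:
        for runner in list_runners():
            if runner.get("name") == runner_name and runner.get("status") == "online":
                return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


def launcher_exit_code(runner_registered: bool) -> int:
    """Map the registration result to the launcher's exit code.

    Only a runner that never registered fails the launcher. A runner that
    registered and then ran a failing test job is reported by that job.
    """
    return 0 if runner_registered else 1
```

With this split, the launcher would exit with `launcher_exit_code(wait_for_runner(...))` right after submitting the SLURM job, so a rerun of the workflow would also rerun a launcher that never produced a usable runner. Whether the check should also cover a runner that registers and then dies before picking up a job is left open here.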