execute_k8s_job does not handle watch client stale state #26626
Labels
area: execution (Related to Execution)
deployment: k8s (Related to deploying Dagster to Kubernetes)
type: bug (Something isn't working)

Comments
garethbrickman added the deployment: k8s and area: execution labels on Dec 20, 2024
As a workaround, this seems to be working (again, hard to confirm because I can't easily recreate the issue). I just created a copy of the
What's the issue?
Long calls to execute_k8s_job sometimes fail when reading the logs. The method has retries around next(log_stream), but if the watch client enters a stale state, the code ends up failing. Example log:

I found similar issues reported in ansible-playbook, and the relevant issue in the kubernetes client. The solution is to move the watch client creation (log_stream = watch.stream()) into a loop as well. I'm trying it out in my repo and will post a PR with a fix after I confirm that it's working (or at least not introducing new issues).
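For illustration, here is a minimal sketch of that pattern using the kubernetes Python client, with the Watch created inside the retry loop so a stale client is discarded rather than reused. The function name, retry constants, and exception handling are assumptions for the example, not Dagster's actual execute_k8s_job implementation:

```python
# Illustrative sketch only: recreate the watch client inside the retry loop so a
# stale watch is thrown away instead of being reused. The function name, retry
# constants, and exact exceptions caught are assumptions, not Dagster's code.
import time

from kubernetes import watch as k8s_watch
from kubernetes.client.rest import ApiException

MAX_STREAM_RETRIES = 5  # hypothetical retry budget
RETRY_DELAY_SECONDS = 5


def stream_pod_logs_with_retries(core_api, pod_name, namespace):
    retries = 0
    while True:
        # Key change: the Watch (and therefore the log stream) is created inside
        # the loop, so each retry starts from a fresh client.
        watcher = k8s_watch.Watch()
        log_stream = watcher.stream(
            core_api.read_namespaced_pod_log,
            name=pod_name,
            namespace=namespace,
        )
        try:
            for line in log_stream:
                yield line
            return  # log stream ended normally
        except ApiException:
            # A real fix may also need to catch transport-level errors and avoid
            # re-emitting lines already seen (e.g. via since_seconds).
            retries += 1
            if retries > MAX_STREAM_RETRIES:
                raise
            time.sleep(RETRY_DELAY_SECONDS)
        finally:
            watcher.stop()
```

A caller would simply iterate the generator (for line in stream_pod_logs_with_retries(core_api, pod_name, namespace): ...); the retry and watch-recreation logic stays encapsulated in one place.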
What did you expect to happen?
The code shouldn't fail because of intermittent errors.
How to reproduce?
This is difficult to reproduce. It originates in the underlying k8s client and only happens very rarely (but often enough to fail long-running, expensive training jobs).
Dagster version
1.9.3
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.