Graceful teardown #103
Merged
Conversation
mmkay approved these changes on Nov 20, 2024:
Tested with Tempo Pro:
Case 1, model teardown: all is good ✔️
Case 2, juju refresh for the modified worker with a version of the charm that doesn't have the changes: it seems we still have a restart in the logs; however, the charm never goes to error state:
20 Nov 2024 12:30:17+01:00 juju-unit executing running tempo-cluster-relation-changed hook
20 Nov 2024 12:30:18+01:00 workload active (all roles) ready.
20 Nov 2024 12:30:19+01:00 juju-unit idle
20 Nov 2024 12:32:17+01:00 workload maintenance stopping charm software
20 Nov 2024 12:32:17+01:00 juju-unit executing running stop hook
20 Nov 2024 12:32:18+01:00 workload maintenance restarting... (attempt #1)
20 Nov 2024 12:32:23+01:00 workload maintenance restarting... (attempt #2)
20 Nov 2024 12:32:23+01:00 workload waiting waiting for resources patch to apply
20 Nov 2024 12:32:26+01:00 workload maintenance
20 Nov 2024 12:33:12+01:00 juju-unit executing running upgrade-charm hook
mathmarchand pushed a commit to mathmarchand/cos-lib that referenced this pull request on Nov 21, 2024:
* updated restart logic and added stop if not ready
* fixed utest
* fmt and layer fix
* added layer replace
* vbump
* simplified layer-stop mechanism
PietroPasotti added a commit that referenced this pull request on Nov 22, 2024:
* fix: Adding test_invalid_databag_content for ClusterProvider
  Signed-off-by: Mathieu Marchand <[email protected]>
* fix: Improving test_invalid_databag_content for ClusterProvider.
  Signed-off-by: Mathieu Marchand <[email protected]>
* fix: Improving test_invalid_databag
  - Used directly the Coordinator ClusterProvider object.
  - Added comments.
  - Changed assert for the unit status after the manager ran.
  Co-authored-by: PietroPasotti <[email protected]>
  Signed-off-by: Math Marchand <[email protected]>
  Signed-off-by: Mathieu Marchand <[email protected]>
* Graceful teardown (#103)
  * updated restart logic and added stop if not ready
  * fixed utest
  * fmt and layer fix
  * added layer replace
  * vbump
  * simplified layer-stop mechanism
* Graceful teardown (fix static checks) (#104)
  * updated restart logic and added stop if not ready
  * fixed utest
  * fmt and layer fix
  * added layer replace
  * vbump
  * simplified layer-stop mechanism
  * type ignore
---------
Signed-off-by: Mathieu Marchand <[email protected]>
Signed-off-by: Math Marchand <[email protected]>
Co-authored-by: PietroPasotti <[email protected]>
Co-authored-by: PietroPasotti <[email protected]>
Issue
Fixes canonical/tempo-worker-k8s-operator#50
TLDR:
When the worker is happy and configured but you want it to go, no event will tell it to stop all its services. Instead, it will attempt to restart and keep failing at it, as some of the required resources (s3, a coordinator) might already be gone.
Juju will then see the error status and refuse to clean it up.
Solution
This PR changes the Worker logic to ensure that, in the reconcile flow, if the worker isn't ready, we stop its services instead of attempting to .restart() them, as we know the restart would fail. A minimal sketch of the idea is shown below.
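A minimal sketch of the stop-instead-of-restart idea, using the ops/Pebble API. This is not the actual Worker implementation; the function name and the is_ready flag are placeholders for the worker's real readiness checks (coordinator relation, s3, rendered config, ...):

```python
# Minimal sketch (not the actual cos-lib Worker code) of the stop-instead-of-restart
# idea, using the ops/Pebble API. `is_ready` stands for whatever the worker requires
# to run (a coordinator relation, an s3 bucket, a rendered config, ...).
from ops import Container


def reconcile(container: Container, is_ready: bool) -> None:
    if not container.can_connect():
        return  # Pebble isn't up yet; nothing to manage

    services = list(container.get_services())  # names of all declared services
    if not is_ready:
        # Before this change the worker would attempt a restart here and keep
        # failing once the coordinator/s3 were gone, putting the unit in error
        # status and blocking teardown. Instead, stop everything and return.
        if services:
            container.stop(*services)
        return

    if services:
        container.restart(*services)  # happy path: (re)start with the fresh config
```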
Testing Instructions
Deploy COS with this version of cosl (the quickest way is probably to deploy COS from edge, then scp/sync this version of cosl into their venvs).
What I did: deploy Tempo HA and get it to active/idle.
Case 1: teardown
Then do a destroy-model or remove-application <tempo|loki|mimir> (and workers). Wait for things to go down without ever setting error status, as that will prevent juju from cleaning it up.
Case 2: upgrade
juju refresh any of the HA solutions (the worker alone should do). Wait for things to come back up without ever setting error status, as that will prevent juju from proceeding.
Upgrade Notes
To upgrade from an older revision of any of these charms, the user will need to manually juju resolve all units of tempo, mimir and loki that are in error, until they're gone.
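For illustration, a hypothetical helper (not part of this PR) that keeps resolving errored units until none remain might look like the following; the application names and polling interval are assumptions to adapt to your model:

```python
# Hypothetical upgrade helper (assumption, not part of this PR): repeatedly run
# `juju resolve` on worker units stuck in error until no errored units remain.
import json
import subprocess
import time

APPS = ("tempo", "mimir", "loki")  # adjust to the applications deployed in your model


def units_in_error() -> list[str]:
    """Return the units of APPS whose agent is currently in error status."""
    status = json.loads(subprocess.check_output(["juju", "status", "--format", "json"]))
    errored = []
    for app in APPS:
        units = status.get("applications", {}).get(app, {}).get("units", {}) or {}
        for unit, unit_status in units.items():
            if unit_status.get("juju-status", {}).get("current") == "error":
                errored.append(unit)
    return errored


while units := units_in_error():
    for unit in units:
        subprocess.run(["juju", "resolve", unit], check=False)
    time.sleep(10)  # give hooks a chance to re-run before checking again
```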