
Graceful teardown #103

Merged: 6 commits merged into main from graceful-teardown on Nov 21, 2024

Conversation

@PietroPasotti (Contributor) commented Nov 19, 2024

Issue

Fixes canonical/tempo-worker-k8s-operator#50

TL;DR:
When the worker is happy and configured but is being removed, no event tells it to stop all its services. Instead, it keeps attempting to restart them and failing, because some of the required resources (s3, a coordinator) may already be gone.
Juju then sees the error status and refuses to clean the unit up.

Solution

This PR changes the Worker logic to ensure that, in the reconcile flow, if the worker isn't ready:

  • we wipe all configurations (strictly speaking not necessary, but it felt like the clean thing to do)
  • we avoid triggering .restart(), since we know it would fail
  • we update the layer to disable service autorestart (as there is no way to 'remove the layer' altogether), and we stop the services one by one (see the sketch after this list)
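
For illustration, here is a minimal sketch of that "not ready" branch written against the ops Pebble API. The class shape, the method names (_reconcile, _stop_all_services, _wipe_configs, _worker_ready) and the layer label are hypothetical and only approximate what cosl's Worker does; the point is the pattern: merge an override into the existing layer to switch off autorestart, then stop each service explicitly instead of restarting.

# Illustrative sketch only -- names and structure are assumptions, not the
# actual cosl.coordinated_workers.worker implementation.
from ops import pebble


class WorkerSketch:
    _layer_name = "worker"  # hypothetical label of the Pebble layer this charm owns

    def __init__(self, charm, container_name, service_names):
        self._charm = charm
        self._container = charm.unit.get_container(container_name)
        self._service_names = list(service_names)

    def _reconcile(self):
        if not self._worker_ready():
            # Required resources (s3, a coordinator) may already be gone:
            # don't call .restart(), wind everything down instead.
            self._wipe_configs()
            self._stop_all_services()
            return
        # ... normal path: push config, replan/restart services ...

    def _stop_all_services(self):
        # Pebble offers no way to remove a layer, so merge an override on top
        # of it that disables autorestart and keeps the services from being
        # started again on the next replan.
        override = pebble.Layer(
            {
                "services": {
                    name: {
                        "override": "merge",
                        "startup": "disabled",   # don't start on replan
                        "on-success": "ignore",  # don't auto-restart on exit
                        "on-failure": "ignore",
                    }
                    for name in self._service_names
                }
            }
        )
        self._container.add_layer(self._layer_name, override, combine=True)

        # Stop the services one by one, skipping any that aren't running.
        for name, info in self._container.get_services(*self._service_names).items():
            if info.is_running():
                self._container.stop(name)

    def _worker_ready(self):
        # Placeholder: the real check looks at config, relations, s3, etc.
        raise NotImplementedError

    def _wipe_configs(self):
        # Placeholder: remove the worker's config files from the workload container.
        raise NotImplementedError

Stopping cleanly here is what keeps the stop/remove hooks from ever hitting error status during teardown, which in turn is what lets Juju finish the cleanup.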

Testing Instructions

Deploy COS with this version of cosl (the quickest way is probably to deploy COS from edge, then scp/sync this version of cosl into the charms' venvs).

what I did:

deploy Tempo HA and get it to active/idle

# shell 1
cd cos-lib/src
jhack sync tempo-worker --source ./ --remote-root /var/lib/juju/agents/unit-tempo-worker-0/charm/venv/

# shell 2 
touch cos-lib/src/cosl/coordinated_workers/worker.py

juju destroy-model --destroy-storage --no-prompt

Case 1: teardown
Then do a destroy-model, or remove-application <tempo|loki|mimir> (and the workers).
Wait for things to go down without ever setting error status, as that would prevent juju from cleaning them up.

Case 2: upgrade
juju refresh any of the HA solutions (the worker alone should do).
Wait for things to come back up without ever setting error status, as that would prevent juju from proceeding.

Upgrade Notes

To upgrade from an older revision of any of these charms, the user will need to manually juju resolve all units of tempo, mimir, and loki that are in error, until none are left.

@PietroPasotti PietroPasotti requested a review from a team as a code owner November 19, 2024 09:41
@mmkay (Contributor) left a comment


Tested with Tempo Pro:

Case 1, model teardown: all is good ✔️

Case 2, juju refresh for the modified worker with a version of the charm that doesn't have the changes: it seems we still have a restart in the logs; however, the charm never goes to error state:

20 Nov 2024 12:30:17+01:00  juju-unit  executing    running tempo-cluster-relation-changed hook
20 Nov 2024 12:30:18+01:00  workload   active       (all roles) ready.
20 Nov 2024 12:30:19+01:00  juju-unit  idle         
20 Nov 2024 12:32:17+01:00  workload   maintenance  stopping charm software
20 Nov 2024 12:32:17+01:00  juju-unit  executing    running stop hook
20 Nov 2024 12:32:18+01:00  workload   maintenance  restarting... (attempt #1)
20 Nov 2024 12:32:23+01:00  workload   maintenance  restarting... (attempt #2)
20 Nov 2024 12:32:23+01:00  workload   waiting      waiting for resources patch to apply
20 Nov 2024 12:32:26+01:00  workload   maintenance  
20 Nov 2024 12:33:12+01:00  juju-unit  executing    running upgrade-charm hook

@PietroPasotti merged commit c5abf13 into main Nov 21, 2024
7 checks passed
@PietroPasotti deleted the graceful-teardown branch November 21, 2024 09:46
mathmarchand pushed a commit to mathmarchand/cos-lib that referenced this pull request Nov 21, 2024
* updated restart logic and added stop if not ready

* fixed utest

* fmt and layer fix

* added layer replace

* vbump

* simplified layer-stop mechanism
PietroPasotti added a commit that referenced this pull request Nov 22, 2024
* fix: Adding test_invalid_databag_content for ClusterProvider

Signed-off-by: Mathieu Marchand <[email protected]>

* fix: Improving test_invalid_databag_content for ClusterProvider.

Signed-off-by: Mathieu Marchand <[email protected]>

* fix: Improving test_invalid_databag

- Used directly the Coordinator ClusterProvider object.
- Added comments.
- Changed assert for the unit status after the manager ran.

Co-authored-by: PietroPasotti <[email protected]>
Signed-off-by: Math Marchand <[email protected]>
Signed-off-by: Mathieu Marchand <[email protected]>

* Graceful teardown (#103)

* updated restart logic and added stop if not ready

* fixed utest

* fmt and layer fix

* added layer replace

* vbump

* simplified layer-stop mechanism

* Graceful teardown (fix static checks) (#104)

* updated restart logic and added stop if not ready

* fixed utest

* fmt and layer fix

* added layer replace

* vbump

* simplified layer-stop mechanism

* type ignore

---------

Signed-off-by: Mathieu Marchand <[email protected]>
Signed-off-by: Math Marchand <[email protected]>
Co-authored-by: PietroPasotti <[email protected]>
Co-authored-by: PietroPasotti <[email protected]>

Successfully merging this pull request may close these issues.

tempo-worker stuck on teardown