
Graceful teardown #103

Merged: 6 commits merged into main from graceful-teardown on Nov 21, 2024

Conversation

@PietroPasotti (Contributor) commented Nov 19, 2024

Issue

Fixes canonical/tempo-worker-k8s-operator#50

TL;DR:
When the worker is happy and configured but is being removed, no event tells it to stop all its services. Instead, it keeps attempting to restart them and failing, because some of the required resources (s3, a coordinator) may already be gone.
Juju then sees the error status and refuses to clean the unit up.

Solution

This PR changes the Worker logic to ensure that, in the reconcile flow, if the worker isn't ready:

  • we wipe all configurations (strictly speaking not necessary, but it felt like the clean thing to do)
  • we avoid triggering .restart(), since we know it would fail
  • we update the layer to disable service autorestart (as there is no way to 'remove the layer' altogether), and we stop the services one by one (see the sketch after this list)
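
For illustration, here is a minimal sketch of that "not ready" branch written against the ops Pebble API. The class shape, the method names (_reconcile, _stop_all_services, _wipe_configs, _worker_ready) and the layer label are hypothetical and only approximate what cosl's Worker does; the point is the pattern: merge an override into the existing layer to switch off autorestart, then stop each service explicitly instead of restarting.

# Illustrative sketch only -- names and structure are assumptions, not the
# actual cosl.coordinated_workers.worker implementation.
from ops import pebble


class WorkerSketch:
    _layer_name = "worker"  # hypothetical label of the Pebble layer this charm owns

    def __init__(self, charm, container_name, service_names):
        self._charm = charm
        self._container = charm.unit.get_container(container_name)
        self._service_names = list(service_names)

    def _reconcile(self):
        if not self._worker_ready():
            # Required resources (s3, a coordinator) may already be gone:
            # don't call .restart(), wind everything down instead.
            self._wipe_configs()
            self._stop_all_services()
            return
        # ... normal path: push config, replan/restart services ...

    def _stop_all_services(self):
        # Pebble offers no way to remove a layer, so merge an override on top
        # of it that disables autorestart and keeps the services from being
        # started again on the next replan.
        override = pebble.Layer(
            {
                "services": {
                    name: {
                        "override": "merge",
                        "startup": "disabled",   # don't start on replan
                        "on-success": "ignore",  # don't auto-restart on exit
                        "on-failure": "ignore",
                    }
                    for name in self._service_names
                }
            }
        )
        self._container.add_layer(self._layer_name, override, combine=True)

        # Stop the services one by one, skipping any that aren't running.
        for name, info in self._container.get_services(*self._service_names).items():
            if info.is_running():
                self._container.stop(name)

    def _worker_ready(self):
        # Placeholder: the real check looks at config, relations, s3, etc.
        raise NotImplementedError

    def _wipe_configs(self):
        # Placeholder: remove the worker's config files from the workload container.
        raise NotImplementedError

Stopping cleanly here is what keeps the stop/remove hooks from ever hitting error status during teardown, which in turn is what lets Juju finish the cleanup.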

Testing Instructions

Deploy COS with this version of cosl (the quickest way is probably to deploy COS from edge, then scp/sync this version of cosl into the charms' venvs).

what I did:

deploy Tempo HA and get it to active/idle

# shell 1
cd cos-lib/src
jhack sync tempo-worker --source ./ --remote-root /var/lib/juju/agents/unit-tempo-worker-0/charm/venv/

# shell 2 
touch cos-lib/src/cosl/coordinated_workers/worker.py

juju destroy-model --destroy-storage --no-prompt

Case 1: teardown
Then do a destroy-model, or remove-application <tempo|loki|mimir> (and the workers).
Wait for things to go down without ever setting error status, as that would prevent juju from cleaning them up.

Case 2: upgrade
juju refresh any of the HA solutions (the worker alone should do).
Wait for things to come back up without ever setting error status, as that would prevent juju from proceeding.

Upgrade Notes

To upgrade from an older revision of any of these charms, the user will need to manually juju resolve all units of tempo, mimir, and loki that are in error, until none are left.

@PietroPasotti PietroPasotti requested a review from a team as a code owner November 19, 2024 09:41
@mmkay (Contributor) left a comment


Tested with Tempo Pro:

Case 1, model teardown: all is good ✔️

Case 2, juju refresh for the modified worker with a version of the charm that doesn't have the changes: it seems we still have a restart in the logs; however, the charm never goes to error state:

20 Nov 2024 12:30:17+01:00  juju-unit  executing    running tempo-cluster-relation-changed hook
20 Nov 2024 12:30:18+01:00  workload   active       (all roles) ready.
20 Nov 2024 12:30:19+01:00  juju-unit  idle         
20 Nov 2024 12:32:17+01:00  workload   maintenance  stopping charm software
20 Nov 2024 12:32:17+01:00  juju-unit  executing    running stop hook
20 Nov 2024 12:32:18+01:00  workload   maintenance  restarting... (attempt #1)
20 Nov 2024 12:32:23+01:00  workload   maintenance  restarting... (attempt #2)
20 Nov 2024 12:32:23+01:00  workload   waiting      waiting for resources patch to apply
20 Nov 2024 12:32:26+01:00  workload   maintenance  
20 Nov 2024 12:33:12+01:00  juju-unit  executing    running upgrade-charm hook

@PietroPasotti merged commit c5abf13 into main Nov 21, 2024
7 checks passed
@PietroPasotti deleted the graceful-teardown branch November 21, 2024 09:46
mathmarchand pushed a commit to mathmarchand/cos-lib that referenced this pull request Nov 21, 2024
* updated restart logic and added stop if not ready

* fixed utest

* fmt and layer fix

* added layer replace

* vbump

* simplified layer-stop mechanism
PietroPasotti added a commit that referenced this pull request Nov 22, 2024
* fix: Adding test_invalid_databag_content for ClusterProvider

Signed-off-by: Mathieu Marchand <[email protected]>

* fix: Improving test_invalid_databag_content for ClusterProvider.

Signed-off-by: Mathieu Marchand <[email protected]>

* fix: Improving test_invalid_databag

- Used directly the Coordinator ClusterProvider object.
- Added comments.
- Changed assert for the unit status after the manager ran.

Co-authored-by: PietroPasotti <[email protected]>
Signed-off-by: Math Marchand <[email protected]>
Signed-off-by: Mathieu Marchand <[email protected]>

* Graceful teardown (#103)

* updated restart logic and added stop if not ready

* fixed utest

* fmt and layer fix

* added layer replace

* vbump

* simplified layer-stop mechanism

* Graceful teardown (fix static checks) (#104)

* updated restart logic and added stop if not ready

* fixed utest

* fmt and layer fix

* added layer replace

* vbump

* simplified layer-stop mechanism

* type ignore

---------

Signed-off-by: Mathieu Marchand <[email protected]>
Signed-off-by: Math Marchand <[email protected]>
Co-authored-by: PietroPasotti <[email protected]>
Co-authored-by: PietroPasotti <[email protected]>

Successfully merging this pull request may close these issues.

tempo-worker stuck on teardown