[MISC] Revert upgrade lib #489

Closed · wants to merge 2 commits
27 changes: 7 additions & 20 deletions lib/charms/data_platform_libs/v0/upgrade.py
@@ -285,7 +285,7 @@ def restart(self, event) -> None:
 
 # Increment this PATCH version before using `charmcraft publish-lib` or reset
 # to 0 if you are raising the major API version
-LIBPATCH = 16
+LIBPATCH = 15
 
 PYDEPS = ["pydantic>=1.10,<2", "poetry-core"]
 
@@ -501,7 +501,7 @@ class DataUpgrade(Object, ABC):
 
     STATES = ["recovery", "failed", "idle", "ready", "upgrading", "completed"]
 
-    on = UpgradeEvents()  # pyright: ignore [reportAssignmentType]
+    on = UpgradeEvents()  # pyright: ignore [reportGeneralTypeIssues]
 
     def __init__(
         self,
@@ -606,21 +606,6 @@ def upgrade_stack(self, stack: List[int]) -> None:
         self.peer_relation.data[self.charm.app].update({"upgrade-stack": json.dumps(stack)})
         self._upgrade_stack = stack
 
-    @property
-    def other_unit_states(self) -> list:
-        """Current upgrade state for other units.
-
-        Returns:
-            Unsorted list of upgrade states for other units.
-        """
-        if not self.peer_relation:
-            return []
-
-        return [
-            self.peer_relation.data[unit].get("state", "")
-            for unit in list(self.peer_relation.units)
-        ]
-
     @property
     def unit_states(self) -> list:
         """Current upgrade state for all units.
@@ -941,8 +926,11 @@ def on_upgrade_changed(self, event: EventBase) -> None:
             return
 
         if self.substrate == "vm" and self.cluster_state == "recovery":
-            # skip run while in recovery. The event will be retrigged when the cluster is ready
-            logger.debug("Cluster in recovery, skip...")
+            # Only defer for vm, that will set unit states to "ready" on upgrade-charm
+            # on k8s only the upgrading unit will receive the upgrade-charm event
+            # and deferring will prevent the upgrade stack from being popped
+            logger.debug("Cluster in recovery, deferring...")
+            event.defer()
Comment on lines -944 to +933

Contributor Author:

I think this is why the rollback gets stuck.
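
To make the defer-versus-skip difference concrete, here is a minimal, purely illustrative simulation of dispatch behaviour (not the ops implementation; the queue model and handler names are assumptions): a handler that defers puts the event back for the next dispatch, while a handler that simply returns consumes it, so the unit only acts again if another relation-changed event arrives.

from collections import deque


def dispatch(queue: deque, handler) -> None:
    """Run one dispatch: handle every queued event once, re-queueing deferred ones."""
    requeued = []
    while queue:
        event = queue.popleft()
        if handler(event) == "defer":
            requeued.append(event)
    queue.extend(requeued)


def skip_handler(event):
    # Behaviour removed by this revert: the event is consumed and lost.
    if event["cluster_state"] == "recovery":
        return "skip"
    return "handled"


def defer_handler(event):
    # Behaviour restored by this revert: the event comes back on the next dispatch.
    if event["cluster_state"] == "recovery":
        return "defer"
    return "handled"


event = {"name": "upgrade-relation-changed", "cluster_state": "recovery"}

q = deque([dict(event)])
dispatch(q, skip_handler)
print(len(q))  # 0 -> nothing left to retry once recovery clears

q = deque([dict(event)])
dispatch(q, defer_handler)
print(len(q))  # 1 -> the event will be re-emitted on the next dispatch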

Contributor:

Hey @dragomirp, this was causing concurrent upgrades on units, so I don't think we should revert it in the lib.
Do manual upgrades also get stuck?
Is the test timing out? I've also observed that Juju takes a huge amount of time to trigger upgrade-charm after a refresh is run, eating into the wait period and forcing us to increase the timeouts for upgrade tests.
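
For context, the guard being reverted here (the other_unit_states property together with the extra condition in on_upgrade_changed) amounts to the check sketched below. Peer relation data is modelled as a plain dict, so the helper and parameter names are illustrative rather than the library's API; if cluster_state reflects the furthest-behind unit, it can still read "ready" while another peer is already "upgrading", which is the concurrent-upgrade window the guard closes.

from typing import Dict, List


def other_unit_states(peer_data: Dict[str, Dict[str, str]], this_unit: str) -> List[str]:
    """Upgrade states of every peer except this unit (mirrors the removed property)."""
    return [data.get("state", "") for unit, data in peer_data.items() if unit != this_unit]


def may_emit_upgrade_granted(
    peer_data: Dict[str, Dict[str, str]],
    this_unit: str,
    top_unit: str,
    top_state: str,
    cluster_state: str,
) -> bool:
    """True only when this unit is next in the stack and no peer is mid-upgrade."""
    return (
        this_unit == top_unit
        and top_state in ("ready", "upgrading")
        and cluster_state == "ready"
        and "upgrading" not in other_unit_states(peer_data, this_unit)  # the reverted guard
    )


# unit/1 is top of the stack, but unit/2 is still upgrading, so the guard
# blocks a second, concurrent upgrade from being granted.
peers = {
    "unit/0": {"state": "ready"},
    "unit/1": {"state": "ready"},
    "unit/2": {"state": "upgrading"},
}
print(may_emit_upgrade_granted(peers, "unit/1", "unit/1", "ready", "ready"))  # False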

Member:

The real issue is the following:

  • In the test_upgrade_from_edge test:
    • TimescaleDB in the snap leads to a longer snap refresh (only a few more seconds).
    • Because of that, Patroni performs a failover, so a new unit, different from the Juju leader unit (the former primary), becomes the primary.
    • Also because of that, the former primary unit, the Juju leader unit, becomes a replica (not a synchronous standby).
  • In the test_fail_and_rollback test:
    • After the failed upgrade, the pre-upgrade-check is run, which sets the recovery state for the Juju leader unit (which is now a replica).
    • When the refresh is started, the upgrade-charm event fires on each of the three units, setting the state to ready on all of them.
    • The upgrade relation-changed event fires only on the units other than the Juju leader, because only the relation data of the Juju leader unit changed (its state went from recovery to ready, while on the other units it was already ready, so there was no change and no event fired for the Juju leader unit).
    • As the Juju leader unit is the replica, it is the top-of-stack unit (the one that should be upgraded first). However, since the upgrade relation-changed event hasn't fired for it, it cannot start the upgrade (in this case, a rollback).

The new version of the upgrade library (which is being reverted through this PR) also caused this issue to start happening, because of the missing defer call that Dragomir commented on.

However, the root cause is the failover, which exposed an execution path that wasn't covered by the lib. Before TimescaleDB was added, there was no failover, so the Juju leader unit was the primary during the entire test, and what happened in the test_fail_and_rollback test was the following:
* After the failed upgrade, the pre-upgrade-check is run, which sets the recovery state for the Juju leader unit (which is the primary).
* When the refresh is started, the upgrade-charm event fires on each of the three units, setting the state to ready on all of them.
* As the Juju leader unit was not the top-of-stack unit, it was fine for it not to receive the upgrade relation-changed event at first.

For the Kafka and MySQL operators on VM, it hasn't happened because both always set the upgrade stack with the Juju leader unit as the last unit to be upgraded (remember that build_upgrade_stack is always executed on the Juju leader unit); a rough sketch of that ordering follows below.
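
As an illustration only (the actual Kafka and MySQL build_upgrade_stack code is not shown in this thread), a leader-last ordering could look like the sketch below. It assumes the top of the stack is the last element of the list, popped and upgraded first; the function and parameter names are hypothetical.

from typing import List


def build_upgrade_stack(unit_ids: List[int], leader_id: int) -> List[int]:
    """Return an upgrade stack with the Juju leader at the bottom.

    Assuming the last element (top of the stack) is the next unit to be
    upgraded, placing the leader at index 0 means it is upgraded last.
    """
    stack = sorted(uid for uid in unit_ids if uid != leader_id)
    return [leader_id] + stack


# Example with three units where unit 0 is the Juju leader:
# the stack is [0, 1, 2], so unit 2 upgrades first and the leader last.
print(build_upgrade_stack([0, 1, 2], leader_id=0))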

It may be fixed by canonical/data-platform-libs#176.

             return
 
         # if all units completed, mark as complete
@@ -993,7 +981,6 @@ def on_upgrade_changed(self, event: EventBase) -> None:
             self.charm.unit == top_unit
             and top_state in ["ready", "upgrading"]
             and self.cluster_state == "ready"
-            and "upgrading" not in self.other_unit_states
         ):
             logger.debug(
                 f"{top_unit.name} is next to upgrade, emitting `upgrade_granted` event and upgrading..."