The charm cannot recover from a quorum loss event of 3-node cluster #571
Comments
Hi @nobuto-m, thank you for the well-prepared bug report! After the detailed investigation:
For the history: as discussed at the last sync with @nobuto-m, the Data team has prepared a PoC/fix and shared it privately for testing, to ensure the implementation fully fits the requirements. P.S. We are not sharing the links here, because last time we found such an experimental build in pre-production. ;-)
+1
@taurus-forever The deployment failed straightaway.
Just for the history: the last post reports a duplicate of https://warthogs.atlassian.net/browse/DPE-5648
The CharmHub branch has been updated and is now installable in #611.
Steps to reproduce
juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
Expected behavior
The cluster should stop accepting write requests to PostgreSQL, since this is a quorum loss event. However, the one surviving node out of 3 still holds a valid replica, so the charm should be able to recover the cluster from that replica.
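For context, here is a minimal sketch of the arithmetic behind "quorum loss", assuming a simple majority rule (more than half of the configured members must be reachable). The function names are illustrative only and are not part of the charm or Patroni.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, alive: int) -> bool:
    """True when the surviving members still form a majority."""
    return alive >= quorum_size(cluster_size)


if __name__ == "__main__":
    # 3-node cluster after losing the Leader and the Sync Standby:
    # only one member is left, which is below the majority of 2.
    print(quorum_size(3))
    print(has_quorum(3, 1))
```

Under this assumption, a 3-node cluster needs 2 reachable members, so a single survivor cannot accept writes on its own even though its data is still a valid replica.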
Actual behavior
The charm gets stuck at "waiting for primary to be reachable from this unit" and "awaiting for member to start". Also, the Patroni configuration hasn't been recovered to a functional state.
initial status
after taking down the Leader and Sync Standby
-> the quorum loss is expected here.
cleanup of dead nodes
-> remove-machine --force was used on purpose since remove-unit is a no-op when the agent is not reachable.
after cleanup
-> status looks okay except for the fact that there is no "Primary" line
-> Patroni is still not working
-> there are leftovers of dead unit configurations.
adding two nodes to form the 3-node cluster again
after adding two nodes
-> juju status doesn't settle.
-> Patroni hasn't been recovered
-> Patroni config still has leftovers. It has a 5-node cluster config instead of 3-node cluster.
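One way to reason about the leftovers is as a set difference between the members still listed in the configuration and the units that actually exist. The following is a minimal sketch under stated assumptions: the inputs are plain lists of member names, the unit names are hypothetical examples, and the helper `stale_members` is illustrative. It does not read the real Patroni configuration or call Juju.

```python
from typing import Iterable, List


def stale_members(configured: Iterable[str], live: Iterable[str]) -> List[str]:
    """Return configured member names that no longer match a live unit."""
    live_set = set(live)
    return sorted(name for name in configured if name not in live_set)


if __name__ == "__main__":
    # Hypothetical names: two dead units plus the three current units
    # add up to the 5-member configuration described above.
    configured = [
        "postgresql-0",
        "postgresql-1",
        "postgresql-2",
        "postgresql-3",
        "postgresql-4",
    ]
    live = ["postgresql-2", "postgresql-3", "postgresql-4"]
    print(stale_members(configured, live))
```

With these example inputs the two entries that have no live unit are reported, which corresponds to the difference between the 5-node configuration and the intended 3-node cluster.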
Versions
Operating system: jammy
Juju CLI: 3.5.3-genericlinux-amd64
Juju agent: 3.5.3
Charm revision: 14/stable 429
LXD: N/A
Log output
Juju debug log:
3-node-recovery_debug.log
Additional context