The charm cannot recover from a quorum loss event of 3-node cluster #571

Open · nobuto-m opened this issue Aug 5, 2024 · 6 comments
Labels: bug (Something isn't working)

nobuto-m commented Aug 5, 2024

Steps to reproduce

  1. Prepare a MAAS provider
  2. Deploy a 3-node cluster
    juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
  3. Take down the primary and one more unit to simulate a quorum loss event by losing 2 out of 3 nodes
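    One way to take those two units down (a sketch only, not the reporter's exact commands; any hard power-off works, e.g. via the MAAS UI; in the run below the lost units were postgresql/1 and postgresql/2):
    $ juju ssh postgresql/1 -- sudo poweroff
    $ juju ssh postgresql/2 -- sudo poweroff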

Expected behavior

The cluster should stop accepting write requests to PostgreSQL since it's a quorum loss event. However, the surviving node still holds a valid replica, so the charm should be able to recover the cluster from that replica.

Actual behavior

The charm gets stuck at "waiting for primary to be reachable from this unit" and "awaiting for member to start". The Patroni configuration is also never restored to a functional state.

initial status

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:25:53Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.117  5432/tcp
postgresql/1   active    idle   1        192.168.151.118  5432/tcp  Primary
postgresql/2   active    idle   2        192.168.151.119  5432/tcp

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed
1        started  192.168.151.118  machine-8  ubuntu@22.04  default  Deployed
2        started  192.168.151.119  machine-1  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399642793178039038) ------+-----------+----+-----------+
| Member         | Host            | Role         | State     | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-1   | 192.168.151.118 | Leader       | running   |  1 |           |
| + postgresql-0 | 192.168.151.117 | Sync Standby | streaming |  1 |         0 |
| + postgresql-2 | 192.168.151.119 | Replica      | streaming |  1 |         0 |
+----------------+-----------------+--------------+-----------+----+-----------+

after taking down the Leader and Sync Standby

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:39:18Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    1/3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.117  5432/tcp  
postgresql/1   unknown   lost   1        192.168.151.118  5432/tcp  agent lost, see 'juju show-status-log postgresql/1'
postgresql/2   unknown   lost   2        192.168.151.119  5432/tcp  agent lost, see 'juju show-status-log postgresql/2'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed
1        down     192.168.151.118  machine-8  ubuntu@22.04  default  Deployed
2        down     192.168.151.119  machine-1  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 13:38:25,462 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 13:38:30,529 - INFO - waiting on raft
2024-08-05 13:38:35,530 - INFO - waiting on raft
2024-08-05 13:38:40,530 - INFO - waiting on raft
2024-08-05 13:38:45,531 - INFO - waiting on raft
2024-08-05 13:38:50,532 - INFO - waiting on raft
2024-08-05 13:38:55,532 - INFO - waiting on raft
2024-08-05 13:39:00,533 - INFO - waiting on raft
^C
Aborted!

-> the quorum loss is expected here.

cleanup of dead nodes

$ juju remove-machine --force 1
WARNING This command will perform the following actions:
will remove machine 1
- will remove unit postgresql/1
- will remove storage pgdata/1

Continue [y/N]? y

$ juju remove-machine --force 2
WARNING This command will perform the following actions:
will remove machine 2
- will remove unit postgresql/2
- will remove storage pgdata/2

Continue [y/N]? y

-> remove-machine --force was used on purpose since remove-unit is a no-op when the agent is not reachable.

after cleanup

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:42:14Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      1  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.117  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed

-> status looks okay except that there is no "Primary" line

machine-7:~$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 13:43:18,378 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 13:43:23,445 - INFO - waiting on raft
2024-08-05 13:43:28,446 - INFO - waiting on raft
2024-08-05 13:43:33,446 - INFO - waiting on raft
^C
Aborted!

-> Patroni is still not working

$ sudo cat /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml

...

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.117:2222'
  partner_addrs:
  - 192.168.151.118:2222
  - 192.168.151.119:2222




...

  pg_hba:
    - local all backup peer map=operator
    - local all operator scram-sha-256
    - local all monitoring password
    - host replication replication 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5
    # Allow replications connections from other cluster members.
    - host     replication    replication    192.168.151.118/0    md5
    
    - host     replication    replication    192.168.151.119/0    md5

...

-> there are leftovers of the dead units' configuration.
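
The stuck raft layer can also be probed directly (a diagnostic sketch only; it assumes pysyncobj's syncobj_admin utility is available on the unit, e.g. via pip install pysyncobj, and reuses the self_addr and data_dir from the config above):

$ syncobj_admin -conn 192.168.151.117:2222 -status
# with 2 of the 3 raft members gone there is no quorum, so expect no leader to be reported, or the call to hang
$ ls /var/snap/charmed-postgresql/current/etc/patroni/raft
# on-disk raft state persisted by pySyncObj; stale files here can keep the old membership around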

adding two nodes to form the 3-node cluster again

$ juju add-unit postgresql -n 2

after adding two nodes

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:57:16Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  waiting   idle   0        192.168.151.117  5432/tcp  waiting for primary to be reachable from this unit
postgresql/3   waiting   idle   3        192.168.151.120  5432/tcp  awaiting for member to start
postgresql/4   waiting   idle   4        192.168.151.121  5432/tcp  awaiting for member to start

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed
3        started  192.168.151.120  machine-8  ubuntu@22.04  default  Deployed
4        started  192.168.151.121  machine-1  ubuntu@22.04  default  Deployed

-> juju status doesn't settle.

$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 13:55:31,623 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 13:55:38,696 - INFO - waiting on raft
2024-08-05 13:55:43,696 - INFO - waiting on raft
2024-08-05 13:55:48,697 - INFO - waiting on raft
2024-08-05 13:55:53,697 - INFO - waiting on raft
2024-08-05 13:55:58,698 - INFO - waiting on raft
^C
Aborted!

-> Patroni still has not recovered

$ sudo cat /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml

...

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.117:2222'
  partner_addrs:
  - 192.168.151.119:2222
  - 192.168.151.121:2222
  - 192.168.151.118:2222
  - 192.168.151.120:2222

...

  pg_hba:
    - local all backup peer map=operator
    - local all operator scram-sha-256
    - local all monitoring password
    - host replication replication 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5
    # Allow replications connections from other cluster members.
    - host     replication    replication    192.168.151.119/0    md5
    
    - host     replication    replication    192.168.151.121/0    md5
    
    - host     replication    replication    192.168.151.118/0    md5
    
    - host     replication    replication    192.168.151.120/0    md5

-> The Patroni config still has leftovers: it describes a 5-node cluster instead of a 3-node cluster.
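
A quick way to spot the stale entries is to compare the configured raft partners with the addresses Juju currently knows about (a sketch; it assumes jq is installed and the usual juju status JSON layout):

$ sudo grep -A6 'partner_addrs:' /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
$ juju status postgresql --format=json | jq -r '.applications.postgresql.units[]."public-address"'
# only 192.168.151.117, .120 and .121 belong to live units; the .118 and .119 entries in partner_addrs and pg_hba are leftovers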

Versions

Operating system: jammy

Juju CLI: 3.5.3-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 14/stable 429

LXD: N/A

Log output

Juju debug log:
3-node-recovery_debug.log

Additional context

nobuto-m added the bug label Aug 5, 2024

github-actions bot commented Aug 5, 2024

@taurus-forever

Hi @nobuto-m, thank you for the well-prepared bug report!

After a detailed investigation:

  1. The stuck charm is a known issue (duplicate of "Cluster of 2 units is unable to restore after leader is removed (1 unit remains)" #418 => https://warthogs.atlassian.net/browse/DPE-3684); we should continue the discussion there. TL;DR: the pySyncObj Raft implementation is not fixable. We tried to work around it here, with no luck so far, and are exploring other options right now. In general, a Raft quorum only works with 3+ nodes, since a majority of the members must stay up (see the quorum arithmetic sketch at the end of this comment).

  2. Expected behavior: The cluster should stop accepting write requests to PostgreSQL since it's a quorum loss event. However, the surviving node still holds a valid replica, so the charm should be able to recover the cluster from that replica.

    The initial idea was to elect a new primary, continue writes there, and have all nodes rejoin the cluster.
    It fails due to 1) above: no new primary is elected. This will be addressed in DPE-3684.

    Patroni should stop accepting write requests once the quorum loss event is noticed => that didn't happen because pySyncObj got stuck. Once we fix/replace the library, the behavior should match your expectation.
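
For completeness, the quorum arithmetic behind point 1 (illustration only, not charm code):

$ python3 -c 'n = 3; print("quorum size:", n // 2 + 1)'
quorum size: 2
# Raft needs a strict majority, floor(N/2) + 1, so with 2 of 3 members down the lone survivor (1 < 2) can never re-establish quorum on its own.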

@taurus-forever

For the record: as discussed on the last sync with @nobuto-m, the Data team has prepared a PoC/fix and shared it privately for testing, to make sure the implementation fully fits the requirements.

P.S. We are not sharing the links here, because last time we found such an experimental build running in pre-production. ;-)

@gianlucaperna

+1

@nobuto-m

@taurus-forever The deployment failed straightaway.

$ juju status
Model     Controller  Cloud/Region         Version  SLA          Timestamp
postgres  localhost   localhost/localhost  3.5.4    unsupported  19:59:43Z

App         Version  Status  Scale  Charm       Channel        Rev  Exposed  Message
postgresql           error       1  postgresql  14/edge/pr611  483  no       hook failed: "leader-elected"

Unit           Workload  Agent  Machine  Public address  Ports  Message
postgresql/0*  error     idle   0        10.0.9.136             hook failed: "leader-elected"

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.136  juju-5aece3-0  ubuntu@22.04      Running
$ juju debug-log -i postgresql/0 --replay --color --date --utc | head -n80
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju Starting unit workers for "postgresql/0"
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.apicaller [9ed86c] "unit-postgresql-0" successfully connected to "10.0.9.59:17070"
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.apicaller [9ed86c] password changed for "unit-postgresql-0"
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.apicaller [9ed86c] "unit-postgresql-0" successfully connected to "10.0.9.59:17070"
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.upgrader no waiter, upgrader is done
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.migrationminion migration migration phase is now: NONE
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.logger logger worker started
unit-postgresql-0: 2024-10-22 08:55:45 ERROR juju.worker.meterstatus error running "meter-status-changed": charm missing from disk
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.uniter unit "postgresql/0" started
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.uniter resuming charm install
unit-postgresql-0: 2024-10-22 08:55:45 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/postgresql-483 from API server
unit-postgresql-0: 2024-10-22 08:55:52 INFO juju.worker.uniter hooks are retried true
unit-postgresql-0: 2024-10-22 08:55:54 WARNING unit.postgresql/0.pgdata-storage-attached /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:55:54 WARNING unit.postgresql/0.pgdata-storage-attached   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:55:54 INFO juju.worker.uniter.operation ran "pgdata-storage-attached" hook (via hook dispatching script: dispatch)
unit-postgresql-0: 2024-10-22 08:55:54 INFO juju.worker.uniter.storage initial storage attachments ready
unit-postgresql-0: 2024-10-22 08:55:55 WARNING unit.postgresql/0.install /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:55:55 WARNING unit.postgresql/0.install   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:55:55 INFO unit.postgresql/0.juju-log Running legacy hooks/install.
unit-postgresql-0: 2024-10-22 08:55:55 WARNING unit.postgresql/0.install /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:55:55 WARNING unit.postgresql/0.install   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:55:56 INFO unit.postgresql/0.juju-log Installing snap charmed-postgresql, revision 124, tracking 14/stable
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install error: snap "charmed-postgresql" is not available on 14/stable but is available
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install        to install on the following channels:
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install 
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install        14/edge    snap install --channel=14/edge charmed-postgresql
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install 
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install        Please be mindful pre-release channels may include features not
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install        completely tested or implemented. Get more information with 'snap info
unit-postgresql-0: 2024-10-22 08:55:57 WARNING unit.postgresql/0.install        charmed-postgresql'.
unit-postgresql-0: 2024-10-22 08:55:57 ERROR unit.postgresql/0.juju-log An exception occurred when installing charmed-postgresql. Reason: Snap: 'charmed-postgresql'; command ['snap', 'install', 'charmed-postgresql', '--channel="14/stable"', '--revision="124"'] failed with output = ''
unit-postgresql-0: 2024-10-22 08:55:57 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-postgresql-0: 2024-10-22 08:55:58 WARNING unit.postgresql/0.upgrade-relation-created /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:55:58 WARNING unit.postgresql/0.upgrade-relation-created   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:55:58 INFO juju.worker.uniter.operation ran "upgrade-relation-created" hook (via hook dispatching script: dispatch)
unit-postgresql-0: 2024-10-22 08:55:59 WARNING unit.postgresql/0.database-peers-relation-created /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:55:59 WARNING unit.postgresql/0.database-peers-relation-created   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:55:59 INFO juju.worker.uniter.operation ran "database-peers-relation-created" hook (via hook dispatching script: dispatch)
unit-postgresql-0: 2024-10-22 08:56:00 WARNING unit.postgresql/0.restart-relation-created /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:56:00 WARNING unit.postgresql/0.restart-relation-created   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:56:00 INFO juju.worker.uniter.operation ran "restart-relation-created" hook (via hook dispatching script: dispatch)
unit-postgresql-0: 2024-10-22 08:56:01 INFO juju.worker.uniter found queued "leader-elected" hook
unit-postgresql-0: 2024-10-22 08:56:01 WARNING unit.postgresql/0.leader-elected /var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py:2050: DeprecationWarning: Calling `ops.main.main()` is deprecated, call `ops.main()` instead
unit-postgresql-0: 2024-10-22 08:56:01 WARNING unit.postgresql/0.leader-elected   main(PostgresqlOperatorCharm)
unit-postgresql-0: 2024-10-22 08:56:08 ERROR unit.postgresql/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py", line 2050, in <module>
    main(PostgresqlOperatorCharm)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/main.py", line 45, in main
    return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/_main.py", line 543, in main
    manager.run()
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/_main.py", line 529, in run
    self._emit()
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/_main.py", line 518, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/_main.py", line 134, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py", line 1087, in _on_leader_elected
    self.update_config()
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-0/charm/./src/charm.py", line 1801, in update_config
    self._patroni.render_patroni_yml_file(
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 735, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 627, in render_patroni_yml_file
    version=self.get_postgresql_version().split(".")[0],
AttributeError: 'NoneType' object has no attribute 'split'
unit-postgresql-0: 2024-10-22 08:56:08 ERROR juju.worker.uniter.operation hook "leader-elected" (via hook dispatching script: dispatch) failed: exit status 1
unit-postgresql-0: 2024-10-22 08:56:08 INFO juju.worker.uniter awaiting error resolution for "leader-elected" hook
unit-postgresql-0: 2024-10-22 08:56:13 INFO juju.worker.uniter awaiting error resolution for "leader-elected" hook

@taurus-forever

Just for the record: the last post reports a duplicate of https://warthogs.atlassian.net/browse/DPE-5648

... Installing snap charmed-postgresql, revision 124, tracking 14/stable
... error: snap "charmed-postgresql" is not available on 14/stable but is available

The CharmHub branch has been updated and is now installable in #611.
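
For reference, the channels that currently carry the workload snap can be listed from any machine (a quick check, independent of the charm):

$ snap info charmed-postgresql
# the channels section lists the revisions currently published on 14/stable vs 14/edge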
