Failed to retrieve the PostgreSQL version to initialise/update db-admin relation during failover #566

Closed
nobuto-m opened this issue Aug 2, 2024 · 8 comments
Labels
bug Something isn't working

nobuto-m commented Aug 2, 2024

Steps to reproduce

  1. deploy landscape stable bundle

    $ juju deploy landscape-scalable
    Located bundle "landscape-scalable" in charm-hub, revision 33
    
  2. scale postgresql to 2 units (primary + replica)
    $ juju add-unit postgresql -n 1

  3. take down the primary unit and trigger the failover
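
     For example, by force-powering off the machine hosting the primary to simulate a hardware failure; on an LXD cloud the equivalent is a forced stop of the primary's container (the instance name below is a placeholder):

     $ lxc stop -f <primary-machine-instance>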

Expected behavior

The failover succeeds, so the replica node is promoted to primary. The consumer of postgresql is then notified and writes a new configuration file (e.g. the host = line in /etc/landscape/service.conf) pointing at the new primary node.

Actual behavior

The unit gets stuck in blocked status with the message Failed to retrieve the PostgreSQL version to initialise/update db-admin relation.

$ juju status
Model      Controller            Cloud/Region       Version  SLA          Timestamp
landscape  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  15:37:11Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   75  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  111  no       Unit is ready
postgresql        14.11    active    1/2  postgresql        14/stable      429  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address   Ports           Message
haproxy/0*           active    idle   0        192.168.151.106  80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        192.168.151.107                  Unit is ready
postgresql/0         unknown   lost   2        192.168.151.108  5432/tcp        agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*        blocked   idle   4        192.168.151.110  5432/tcp        Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
rabbitmq-server/0*   active    idle   3        192.168.151.109  5672,15672/tcp  Unit is ready

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.106  machine-7  ubuntu@22.04  default  Deployed
1        started  192.168.151.107  machine-8  ubuntu@22.04  default  Deployed
2        down     192.168.151.108  machine-1  ubuntu@22.04  default  Deployed
3        started  192.168.151.109  machine-2  ubuntu@22.04  default  Deployed
4        started  192.168.151.110  machine-3  ubuntu@22.04  default  Deployed

The Landscape app still holds the previous primary PostgreSQL endpoint (192.168.151.108).

$ juju exec --unit landscape-server/0 -- head -n 15 /etc/landscape/service.conf
[stores]
user = landscape
password = VNGIWFwCut3vK6XB
host = 192.168.151.108:5432
main = landscape-standalone-main
account-1 = landscape-standalone-account-1
resource-1 = landscape-standalone-resource-1
package = landscape-standalone-package
session = landscape-standalone-session
session-autocommit = landscape-standalone-session?isolation=autocommit
knowledge = landscape-standalone-knowledge

[global]
oops-path = /var/lib/landscape/landscape-oops
syslog-address = /dev/log

And the charm itself says the primary is the dead node.

$ juju run postgresql/leader get-primary
Running operation 7 with 1 task
  - task 8 on unit-postgresql-1

Waiting for task 8...
primary: postgresql/0
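
As a cross-check, Patroni's REST API on the surviving unit (the same /cluster endpoint on port 8008 that the charm itself polls, see the debug log below) can show the cluster view directly:

$ curl -s http://192.168.151.110:8008/cluster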


$ juju status postgresql/0
Model      Controller            Cloud/Region       Version  SLA          Timestamp
landscape  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  15:47:46Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    0/1  postgresql  14/stable  429  no       

Unit          Workload  Agent  Machine  Public address   Ports     Message
postgresql/0  unknown   lost   2        192.168.151.108  5432/tcp  agent lost, see 'juju show-status-log postgresql/0'

Machine  State  Address          Inst id    Base          AZ       Message
2        down   192.168.151.108  machine-1  ubuntu@22.04  default  Deployed

Versions

Operating system: jammy

Juju CLI: 3.5.3-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 14/stable rev 429

LXD: N/A

Log output

Juju debug log:

landscape_model.log

unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:04 ERROR unit.postgresql/1.juju-log Failed to get PostgreSQL version: connection to server at "192.168.151.108", port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?

unit-postgresql-1: 15:29:04 ERROR unit.postgresql/1.juju-log 
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 419, in get_postgresql_version
    with self._connect_to_database() as connection, connection.cursor() as cursor:
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 127, in _connect_to_database
    connection = psycopg2.connect(
  File "/var/lib/juju/agents/unit-postgresql-1/charm/venv/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "192.168.151.108", port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-1/charm/src/relations/db.py", line 323, in update_endpoints
    postgresql_version = self.charm.postgresql.get_postgresql_version()
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 425, in get_postgresql_version
    raise PostgreSQLGetPostgreSQLVersionError()
charms.postgresql_k8s.v0.postgresql.PostgreSQLGetPostgreSQLVersionError

Additional context
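
For additional context on the failing path: update_endpoints() in src/relations/db.py asks the charm library for the server version, and the library opens a psycopg2 connection to the address still recorded as primary, which is the powered-off node. A minimal sketch of that path (names taken from the traceback above; the bodies are an assumption, not the charm's verbatim code):

import psycopg2


class PostgreSQLGetPostgreSQLVersionError(Exception):
    """Raised when the PostgreSQL version cannot be retrieved."""


def get_postgresql_version(primary_host: str, user: str, password: str) -> str:
    # Stand-in for get_postgresql_version() in
    # lib/charms/postgresql_k8s/v0/postgresql.py; the real code connects
    # through _connect_to_database(), which calls psycopg2.connect().
    try:
        # primary_host is still the powered-off 192.168.151.108 here,
        # so connect() fails with "No route to host".
        connection = psycopg2.connect(
            host=primary_host,
            user=user,
            password=password,
            dbname="postgres",
            connect_timeout=3,
        )
        with connection, connection.cursor() as cursor:
            cursor.execute("SELECT version();")
            # e.g. "PostgreSQL 14.11 (Ubuntu ...)" -> "14.11"
            return cursor.fetchone()[0].split(" ")[1]
    except psycopg2.Error:
        # update_endpoints() in src/relations/db.py does not recover from
        # this, so the unit stays blocked with "Failed to retrieve the
        # PostgreSQL version to initialise/update db-admin relation".
        raise PostgreSQLGetPostgreSQLVersionError()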

@marceloneppel

Hi, @nobuto-m! Thanks for the bug report.

How did you take down the primary unit? Was it through sudo systemctl stop jujud-machine-2.service or by some other means?


nobuto-m commented Aug 5, 2024

How did you take down the primary unit? Was it through sudo systemctl stop jujud-machine-2.service or by some other means?

It was a forced poweroff to simulate a hardware failure, not a graceful shutdown or anything like that.

@taurus-forever

Hi Nobuto,

The blocked-state issue should be addressed by #578, which is available in 14/stable nowadays. Can you please re-check it and confirm the fix from your side?
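
For example, by refreshing the existing deployment to the latest revision in the channel (a suggestion, adjust to your setup):

$ juju refresh postgresql --channel 14/stable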

Thank you!

@taurus-forever

Hi @nobuto-m ,

I would separate this bug report into two parts:

1. As discussed today, I have invested a lot of time trying to reproduce the issue Failed to retrieve the PostgreSQL version to initialise/update db-admin relation using the previous 14/stable revision 429 (as reported here) and the latest revision 468 (which is supposed to have the fix), and I cannot reproduce the error in either case. At the same time, I have filed a follow-up ticket with a UX improvement (based on rev 468): #618.

2. In revision 429 I saw some instability during the initial deployment, which I cannot reproduce in rev 468 at all (tried 3 times). I will keep an eye on it, but at the moment I would say 468 looks more stable, with no issues noticed here. Also, I have requested the bundle update to rev 468: https://chat.canonical.com/canonical/pl/nbwb5udh4jb8tdtiuo1mk6ssda

At this point I believe we can resolve this issue and focus on the Raft/failover-related tickets you have reported separately.
What do you think?

@nobuto-m

For the record, it's straightforward to reproduce
Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
with the original reproduction steps even today.

$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  07:59:00Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   75  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  111  no       Unit is ready
postgresql        14.11    active      2  postgresql        14/stable      429  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.88       80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.115                      Unit is ready
postgresql/0*        active    idle   2        10.0.9.165      5432/tcp        Primary
postgresql/1         active    idle   4        10.0.9.156      5432/tcp        
rabbitmq-server/0*   active    idle   3        10.0.9.149      5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.88   juju-294bf7-0  ubuntu@22.04      Running
1        started  10.0.9.115  juju-294bf7-1  ubuntu@22.04      Running
2        started  10.0.9.165  juju-294bf7-2  ubuntu@22.04      Running
3        started  10.0.9.149  juju-294bf7-3  ubuntu@22.04      Running
4        started  10.0.9.156  juju-294bf7-4  ubuntu@22.04      Running
$ lxc stop -f juju-294bf7-2
$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  08:00:56Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   75  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  111  no       Unit is ready
postgresql        14.11    active    1/2  postgresql        14/stable      429  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.88       80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.115                      Unit is ready
postgresql/0         unknown   lost   2        10.0.9.165      5432/tcp        agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*        blocked   idle   4        10.0.9.156      5432/tcp        Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
rabbitmq-server/0*   active    idle   3        10.0.9.149      5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.88   juju-294bf7-0  ubuntu@22.04      Running
1        started  10.0.9.115  juju-294bf7-1  ubuntu@22.04      Running
2        down     10.0.9.165  juju-294bf7-2  ubuntu@22.04      Running
3        started  10.0.9.149  juju-294bf7-3  ubuntu@22.04      Running
4        started  10.0.9.156  juju-294bf7-4  ubuntu@22.04      Running

@nobuto-m

And by using the unpinned version of the bundle, there is no traceback or the
Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
message, and the second unit gets to active/idle.

Aside from the fact that two-node clusters are neither maintainable nor sustainable, and that the active/idle status is wrong since the cluster failed to pick a new primary, the "issue" is fixed.
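
Whether a new primary was actually elected can be double-checked with the same action used earlier, or by querying Patroni's /cluster endpoint on the surviving unit (10.0.9.141 in this reproduction):

$ juju run postgresql/leader get-primary
$ curl -s http://10.0.9.141:8008/cluster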

$ git diff --no-index landscape-scalable_r33/bundle.yaml bundle.yaml 
diff --git a/landscape-scalable_r33/bundle.yaml b/bundle.yaml
index 68ff865..715dece 100644
--- a/landscape-scalable_r33/bundle.yaml
+++ b/bundle.yaml
@@ -6,7 +6,6 @@ applications:
   haproxy:
     charm: ch:haproxy
     channel: stable
-    revision: 75
     num_units: 1
     expose: true
     options:
@@ -17,7 +16,6 @@ applications:
   landscape-server:
     charm: ch:landscape-server
     channel: stable
-    revision: 111
     num_units: 1
     constraints: mem=4096
     options:
@@ -25,7 +23,6 @@ applications:
   postgresql:
     charm: ch:postgresql
     channel: 14/stable
-    revision: 429
     num_units: 1
     options:
       plugin_plpython3u_enable: true
@@ -38,7 +35,6 @@ applications:
   rabbitmq-server:
     charm: ch:rabbitmq-server
     channel: 3.9/stable
-    revision: 188
     num_units: 1
     options:
       consumer-timeout: 259200000
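
The edited bundle can then be deployed from the local file, e.g.:

$ juju deploy ./bundle.yaml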
$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  08:25:18Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   84  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  121  no       Unit is ready
postgresql        14.12    active      2  postgresql        14/stable      468  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.104      80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.190                      Unit is ready
postgresql/0*        active    idle   2        10.0.9.188      5432/tcp        Primary
postgresql/1         active    idle   4        10.0.9.141      5432/tcp        
rabbitmq-server/0*   active    idle   3        10.0.9.53       5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.104  juju-5d992c-0  ubuntu@22.04      Running
1        started  10.0.9.190  juju-5d992c-1  ubuntu@22.04      Running
2        started  10.0.9.188  juju-5d992c-2  ubuntu@22.04      Running
3        started  10.0.9.53   juju-5d992c-3  ubuntu@22.04      Running
4        started  10.0.9.141  juju-5d992c-4  ubuntu@22.04      Running
$ lxc stop -f juju-5d992c-2
$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  08:29:30Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   84  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  121  no       Unit is ready
postgresql        14.12    active    1/2  postgresql        14/stable      468  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.104      80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.190                      Unit is ready
postgresql/0         unknown   lost   2        10.0.9.188      5432/tcp        agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*        active    idle   4        10.0.9.141      5432/tcp        
rabbitmq-server/0*   active    idle   3        10.0.9.53       5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.104  juju-5d992c-0  ubuntu@22.04      Running
1        started  10.0.9.190  juju-5d992c-1  ubuntu@22.04      Running
2        down     10.0.9.188  juju-5d992c-2  ubuntu@22.04      Running
3        started  10.0.9.53   juju-5d992c-3  ubuntu@22.04      Running
4        started  10.0.9.141  juju-5d992c-4  ubuntu@22.04      Running
unit-postgresql-1: 08:26:30 INFO juju.worker.uniter found queued "leader-elected" hook
unit-postgresql-1: 08:26:34 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:39 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:44 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:49 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:54 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:55 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-postgresql-1: 08:27:53 ERROR unit.postgresql/1.juju-log Failed to list PostgreSQL database users: connection to server at "10.0.9.188", port 5432 failed: timeout expired

unit-postgresql-1: 08:27:54 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-postgresql-1: 08:33:35 ERROR unit.postgresql/1.juju-log Failed to list PostgreSQL database users: connection to server at "10.0.9.188", port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?

unit-postgresql-1: 08:33:36 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)


taurus-forever commented Sep 13, 2024

@nobuto-m just for the record, what was your LXD version?
I was executing the same steps on the same Juju 3.5.3 and did not run into the original 'issue'.

I will update my old LXD version, but the one used in my tests was:

> lxd               5.0.3-80aeff7  29351  5.0/stable/…        canonical✓     -  

Tnx!
