Failed to retrieve the PostgreSQL version to initialise/update db-admin relation during failover #566

Closed
nobuto-m opened this issue Aug 2, 2024 · 8 comments
Labels
bug Something isn't working

nobuto-m commented Aug 2, 2024

Steps to reproduce

  1. deploy landscape stable bundle

    $ juju deploy landscape-scalable
    Located bundle "landscape-scalable" in charm-hub, revision 33
    
  2. scale postgresql to 2 units (primary + replica)
    $ juju add-unit postgresql -n 1

  3. take down the primary unit and trigger the failover
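
     For example, by force-powering off the machine hosting the primary to simulate a hardware failure; on an LXD cloud the equivalent is a forced stop of the primary's container (the instance name below is a placeholder):

     $ lxc stop -f <primary-machine-instance>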

Expected behavior

The failover succeeds, so the replica node is promoted to primary. The consumer of postgresql is then notified and writes a new configuration file (e.g. the host = line in /etc/landscape/service.conf) pointing at the new primary node.

Actual behavior

The unit gets stuck in blocked status with the message Failed to retrieve the PostgreSQL version to initialise/update db-admin relation.

$ juju status
Model      Controller            Cloud/Region       Version  SLA          Timestamp
landscape  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  15:37:11Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   75  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  111  no       Unit is ready
postgresql        14.11    active    1/2  postgresql        14/stable      429  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address   Ports           Message
haproxy/0*           active    idle   0        192.168.151.106  80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        192.168.151.107                  Unit is ready
postgresql/0         unknown   lost   2        192.168.151.108  5432/tcp        agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*        blocked   idle   4        192.168.151.110  5432/tcp        Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
rabbitmq-server/0*   active    idle   3        192.168.151.109  5672,15672/tcp  Unit is ready

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.106  machine-7  ubuntu@22.04  default  Deployed
1        started  192.168.151.107  machine-8  ubuntu@22.04  default  Deployed
2        down     192.168.151.108  machine-1  ubuntu@22.04  default  Deployed
3        started  192.168.151.109  machine-2  ubuntu@22.04  default  Deployed
4        started  192.168.151.110  machine-3  ubuntu@22.04  default  Deployed

The Landscape app still holds the previous primary PostgreSQL endpoint (192.168.151.108).

$ juju exec --unit landscape-server/0 -- head -n 15 /etc/landscape/service.conf
[stores]
user = landscape
password = VNGIWFwCut3vK6XB
host = 192.168.151.108:5432
main = landscape-standalone-main
account-1 = landscape-standalone-account-1
resource-1 = landscape-standalone-resource-1
package = landscape-standalone-package
session = landscape-standalone-session
session-autocommit = landscape-standalone-session?isolation=autocommit
knowledge = landscape-standalone-knowledge

[global]
oops-path = /var/lib/landscape/landscape-oops
syslog-address = /dev/log

And the charm itself says the primary is the dead node.

$ juju run postgresql/leader get-primary
Running operation 7 with 1 task
  - task 8 on unit-postgresql-1

Waiting for task 8...
primary: postgresql/0
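
As a cross-check, Patroni's REST API on the surviving unit (the same /cluster endpoint on port 8008 that the charm itself polls, see the debug log below) can show the cluster view directly:

$ curl -s http://192.168.151.110:8008/cluster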


$ juju status postgresql/0
Model      Controller            Cloud/Region       Version  SLA          Timestamp
landscape  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  15:47:46Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    0/1  postgresql  14/stable  429  no       

Unit          Workload  Agent  Machine  Public address   Ports     Message
postgresql/0  unknown   lost   2        192.168.151.108  5432/tcp  agent lost, see 'juju show-status-log postgresql/0'

Machine  State  Address          Inst id    Base          AZ       Message
2        down   192.168.151.108  machine-1  ubuntu@22.04  default  Deployed

Versions

Operating system: jammy

Juju CLI: 3.5.3-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 14/stable rev 429

LXD: N/A

Log output

Juju debug log:

landscape_model.log

unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:01 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:03 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log Starting new HTTP connection (1): 192.168.151.110:8008
unit-postgresql-1: 15:29:04 DEBUG unit.postgresql/1.juju-log http://192.168.151.110:8008 "GET /cluster HTTP/10" 200 None
unit-postgresql-1: 15:29:04 ERROR unit.postgresql/1.juju-log Failed to get PostgreSQL version: connection to server at "192.168.151.108", port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?

unit-postgresql-1: 15:29:04 ERROR unit.postgresql/1.juju-log 
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 419, in get_postgresql_version
    with self._connect_to_database() as connection, connection.cursor() as cursor:
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 127, in _connect_to_database
    connection = psycopg2.connect(
  File "/var/lib/juju/agents/unit-postgresql-1/charm/venv/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "192.168.151.108", port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-1/charm/src/relations/db.py", line 323, in update_endpoints
    postgresql_version = self.charm.postgresql.get_postgresql_version()
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-postgresql-1/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 425, in get_postgresql_version
    raise PostgreSQLGetPostgreSQLVersionError()
charms.postgresql_k8s.v0.postgresql.PostgreSQLGetPostgreSQLVersionError

Additional context
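
For additional context on the failing path: update_endpoints() in src/relations/db.py asks the charm library for the server version, and the library opens a psycopg2 connection to the address still recorded as primary, which is the powered-off node. A minimal sketch of that path (names taken from the traceback above; the bodies are an assumption, not the charm's verbatim code):

import psycopg2


class PostgreSQLGetPostgreSQLVersionError(Exception):
    """Raised when the PostgreSQL version cannot be retrieved."""


def get_postgresql_version(primary_host: str, user: str, password: str) -> str:
    # Stand-in for get_postgresql_version() in
    # lib/charms/postgresql_k8s/v0/postgresql.py; the real code connects
    # through _connect_to_database(), which calls psycopg2.connect().
    try:
        # primary_host is still the powered-off 192.168.151.108 here,
        # so connect() fails with "No route to host".
        connection = psycopg2.connect(
            host=primary_host,
            user=user,
            password=password,
            dbname="postgres",
            connect_timeout=3,
        )
        with connection, connection.cursor() as cursor:
            cursor.execute("SELECT version();")
            # e.g. "PostgreSQL 14.11 (Ubuntu ...)" -> "14.11"
            return cursor.fetchone()[0].split(" ")[1]
    except psycopg2.Error:
        # update_endpoints() in src/relations/db.py does not recover from
        # this, so the unit stays blocked with "Failed to retrieve the
        # PostgreSQL version to initialise/update db-admin relation".
        raise PostgreSQLGetPostgreSQLVersionError()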

@marceloneppel

Hi, @nobuto-m! Thanks for the bug report.

How did you take down the primary unit? Was it through sudo systemctl stop jujud-machine-2.service or by some other means?


nobuto-m commented Aug 5, 2024

How did you take down the primary unit? Was it through sudo systemctl stop jujud-machine-2.service or by some other means?

It was a forced poweroff to simulate a hardware failure, not a graceful shutdown or anything like that.

@taurus-forever

Hi Nobuto,

The blocked-state issue should be addressed by #578, which is available in 14/stable nowadays. Can you please re-check it and confirm the fix from your side?
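
For example, by refreshing the existing deployment to the latest revision in the channel (a suggestion, adjust to your setup):

$ juju refresh postgresql --channel 14/stable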

Thank you!

@taurus-forever

Hi @nobuto-m ,

I would separate this bug report into two parts:

1. As discussed today, I have invested a lot of time trying to reproduce the issue Failed to retrieve the PostgreSQL version to initialise/update db-admin relation using the previous 14/stable revision 429 (as reported here) and the latest revision 468 (which is supposed to have the fix), and I cannot reproduce the error in either case. At the same time, I have filed a follow-up ticket with a UX improvement (based on rev 468): #618.

2. In revision 429 I saw some instability during the initial deployment, which I cannot reproduce in rev 468 at all (tried 3 times). I will keep an eye on it, but at the moment I would say 468 looks more stable, with no issues noticed here. Also, I have requested the bundle update to rev 468: https://chat.canonical.com/canonical/pl/nbwb5udh4jb8tdtiuo1mk6ssda

At this point I believe we can resolve this issue and focus on the Raft/failover-related tickets you have reported separately.
What do you think?

@nobuto-m

For the record, it's straightforward to reproduce
Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
with the original reproduction steps even today.

$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  07:59:00Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   75  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  111  no       Unit is ready
postgresql        14.11    active      2  postgresql        14/stable      429  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.88       80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.115                      Unit is ready
postgresql/0*        active    idle   2        10.0.9.165      5432/tcp        Primary
postgresql/1         active    idle   4        10.0.9.156      5432/tcp        
rabbitmq-server/0*   active    idle   3        10.0.9.149      5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.88   juju-294bf7-0  ubuntu@22.04      Running
1        started  10.0.9.115  juju-294bf7-1  ubuntu@22.04      Running
2        started  10.0.9.165  juju-294bf7-2  ubuntu@22.04      Running
3        started  10.0.9.149  juju-294bf7-3  ubuntu@22.04      Running
4        started  10.0.9.156  juju-294bf7-4  ubuntu@22.04      Running
$ lxc stop -f juju-294bf7-2
$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  08:00:56Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   75  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  111  no       Unit is ready
postgresql        14.11    active    1/2  postgresql        14/stable      429  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.88       80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.115                      Unit is ready
postgresql/0         unknown   lost   2        10.0.9.165      5432/tcp        agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*        blocked   idle   4        10.0.9.156      5432/tcp        Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
rabbitmq-server/0*   active    idle   3        10.0.9.149      5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.88   juju-294bf7-0  ubuntu@22.04      Running
1        started  10.0.9.115  juju-294bf7-1  ubuntu@22.04      Running
2        down     10.0.9.165  juju-294bf7-2  ubuntu@22.04      Running
3        started  10.0.9.149  juju-294bf7-3  ubuntu@22.04      Running
4        started  10.0.9.156  juju-294bf7-4  ubuntu@22.04      Running

@nobuto-m

And by using the unpinned version of the bundle, there is no traceback or the
Failed to retrieve the PostgreSQL version to initialise/update db-admin relation
message, and the second unit gets to active/idle.

Aside from the fact that two-node clusters are neither maintainable nor sustainable, and that the active/idle status is wrong since the cluster failed to pick a new primary, the "issue" is fixed.
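
Whether a new primary was actually elected can be double-checked with the same action used earlier, or by querying Patroni's /cluster endpoint on the surviving unit (10.0.9.141 in this reproduction):

$ juju run postgresql/leader get-primary
$ curl -s http://10.0.9.141:8008/cluster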

$ git diff --no-index landscape-scalable_r33/bundle.yaml bundle.yaml 
diff --git a/landscape-scalable_r33/bundle.yaml b/bundle.yaml
index 68ff865..715dece 100644
--- a/landscape-scalable_r33/bundle.yaml
+++ b/bundle.yaml
@@ -6,7 +6,6 @@ applications:
   haproxy:
     charm: ch:haproxy
     channel: stable
-    revision: 75
     num_units: 1
     expose: true
     options:
@@ -17,7 +16,6 @@ applications:
   landscape-server:
     charm: ch:landscape-server
     channel: stable
-    revision: 111
     num_units: 1
     constraints: mem=4096
     options:
@@ -25,7 +23,6 @@ applications:
   postgresql:
     charm: ch:postgresql
     channel: 14/stable
-    revision: 429
     num_units: 1
     options:
       plugin_plpython3u_enable: true
@@ -38,7 +35,6 @@ applications:
   rabbitmq-server:
     charm: ch:rabbitmq-server
     channel: 3.9/stable
-    revision: 188
     num_units: 1
     options:
       consumer-timeout: 259200000
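
The edited bundle can then be deployed from the local file, e.g.:

$ juju deploy ./bundle.yaml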
$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  08:25:18Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   84  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  121  no       Unit is ready
postgresql        14.12    active      2  postgresql        14/stable      468  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.104      80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.190                      Unit is ready
postgresql/0*        active    idle   2        10.0.9.188      5432/tcp        Primary
postgresql/1         active    idle   4        10.0.9.141      5432/tcp        
rabbitmq-server/0*   active    idle   3        10.0.9.53       5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.104  juju-5d992c-0  ubuntu@22.04      Running
1        started  10.0.9.190  juju-5d992c-1  ubuntu@22.04      Running
2        started  10.0.9.188  juju-5d992c-2  ubuntu@22.04      Running
3        started  10.0.9.53   juju-5d992c-3  ubuntu@22.04      Running
4        started  10.0.9.141  juju-5d992c-4  ubuntu@22.04      Running
$ lxc stop -f juju-5d992c-2
$ juju status
Model  Controller  Cloud/Region         Version  SLA          Timestamp
psql   localhost   localhost/localhost  3.5.3    unsupported  08:29:30Z

App               Version  Status  Scale  Charm             Channel        Rev  Exposed  Message
haproxy                    active      1  haproxy           latest/stable   84  yes      Unit is ready
landscape-server           active      1  landscape-server  latest/stable  121  no       Unit is ready
postgresql        14.12    active    1/2  postgresql        14/stable      468  no       
rabbitmq-server   3.9.13   active      1  rabbitmq-server   3.9/stable     188  no       Unit is ready

Unit                 Workload  Agent  Machine  Public address  Ports           Message
haproxy/0*           active    idle   0        10.0.9.104      80,443/tcp      Unit is ready
landscape-server/0*  active    idle   1        10.0.9.190                      Unit is ready
postgresql/0         unknown   lost   2        10.0.9.188      5432/tcp        agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*        active    idle   4        10.0.9.141      5432/tcp        
rabbitmq-server/0*   active    idle   3        10.0.9.53       5672,15672/tcp  Unit is ready

Machine  State    Address     Inst id        Base          AZ  Message
0        started  10.0.9.104  juju-5d992c-0  ubuntu@22.04      Running
1        started  10.0.9.190  juju-5d992c-1  ubuntu@22.04      Running
2        down     10.0.9.188  juju-5d992c-2  ubuntu@22.04      Running
3        started  10.0.9.53   juju-5d992c-3  ubuntu@22.04      Running
4        started  10.0.9.141  juju-5d992c-4  ubuntu@22.04      Running
unit-postgresql-1: 08:26:30 INFO juju.worker.uniter found queued "leader-elected" hook
unit-postgresql-1: 08:26:34 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:39 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:44 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:49 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:54 WARNING unit.postgresql/1.juju-log Failed to connect to PostgreSQL.
unit-postgresql-1: 08:26:55 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-postgresql-1: 08:27:53 ERROR unit.postgresql/1.juju-log Failed to list PostgreSQL database users: connection to server at "10.0.9.188", port 5432 failed: timeout expired

unit-postgresql-1: 08:27:54 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-postgresql-1: 08:33:35 ERROR unit.postgresql/1.juju-log Failed to list PostgreSQL database users: connection to server at "10.0.9.188", port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?

unit-postgresql-1: 08:33:36 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)


taurus-forever commented Sep 13, 2024

@nobuto-m just for the record, what was your LXD version?
I was executing the same steps on the same Juju 3.5.3 and did not run into the original 'issue'.

I will update my old LXD version, but the one used in my tests was:

> lxd               5.0.3-80aeff7  29351  5.0/stable/…        canonical✓     -  

Tnx!
