Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MISC] Check for peer relation when getting the primary endpoint #366

Merged
Merged
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
ac24da8
Reenable landscape tests
dragomirp Feb 13, 2024
98e151e
Try GH runners
dragomirp Feb 13, 2024
14b7d3f
Merge branch 'main' into reenable-landscape
dragomirp Feb 21, 2024
5d9da4c
Back on self-hosted runner
dragomirp Feb 21, 2024
375c957
Don't check for primary if peer didn't join yet
dragomirp Feb 21, 2024
8c47669
Don't recheck missing primary ip when looking for primary endpoint
dragomirp Feb 21, 2024
1e0b514
Catch list user exception
dragomirp Feb 21, 2024
fa44790
Catch Psycopg2 error when trying to set config
dragomirp Feb 21, 2024
7fce785
Try to preflight connection on config validation
dragomirp Feb 21, 2024
d50d669
Don't retry db connection check
dragomirp Feb 22, 2024
8184dfa
Set the primary endpoint before starting the primary
dragomirp Feb 22, 2024
5a09131
Revert "Set the primary endpoint before starting the primary"
dragomirp Feb 22, 2024
1eb5529
Wait for peer before starting primary
dragomirp Feb 22, 2024
e14d5a1
Switch to GH runners
dragomirp Feb 22, 2024
dc4588f
Recheck DB connection
dragomirp Feb 22, 2024
b89476f
Merge branch 'check-for-peer-rel' into check-for-peer-rel-gh-runner
dragomirp Feb 22, 2024
e7e8965
Don't check peer on primary startup
dragomirp Feb 22, 2024
dbf7deb
Merge branch 'check-for-peer-rel' into check-for-peer-rel-gh-runner
dragomirp Feb 22, 2024
3f1d62e
Merge branch 'dpe-3591-fix-shared-buffers-validation' into check-for-…
dragomirp Feb 23, 2024
5c78ca8
Try to skip primary IP validation if not all cluster IPs seem to be a…
dragomirp Feb 23, 2024
1ea2833
Merge remote-tracking branch 'origin/reenable-landscape' into check-f…
dragomirp Feb 23, 2024
beea41d
Comment
dragomirp Feb 23, 2024
e520923
Review comments
dragomirp Feb 26, 2024
f3e3d33
Unit tests
dragomirp Feb 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 37 additions & 1 deletion src/charm.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
PostgreSQL,
PostgreSQLCreateUserError,
PostgreSQLEnableDisableExtensionError,
PostgreSQLListUsersError,
PostgreSQLUpdateUserPasswordError,
)
from charms.postgresql_k8s.v0.postgresql_tls import PostgreSQLTLS
Expand Down Expand Up @@ -299,6 +300,9 @@ def postgresql(self) -> PostgreSQL:
@property
def primary_endpoint(self) -> Optional[str]:
"""Returns the endpoint of the primary instance or None when no primary available."""
if not self._peers:
logger.debug("primary endpoint early exit: Peer relation not joined yet.")
return None
Comment on lines +303 to +305
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the peer relation is not yet joined, don't try to find the primary.

try:
for attempt in Retrying(stop=stop_after_delay(60), wait=wait_fixed(3)):
with attempt:
Expand All @@ -308,7 +312,20 @@ def primary_endpoint(self) -> Optional[str]:
# returned is not in the list of the current cluster members
# (like when the cluster was not updated yet after a failed switchover).
if not primary_endpoint or primary_endpoint not in self._units_ips:
raise ValueError()
# TODO figure out why peer data is not available
if (
primary_endpoint
and len(self._units_ips) == 1
and len(self._peers.units) > 1
):
logger.warning(
"Possibly incoplete peer data: Will not map primary IP to unit IP"
)
return primary_endpoint
Comment on lines +315 to +324
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On juju 3 we don't seem to be getting the peer data used to get the IP addr when a new unit is joining (e.g. when scaling in the test_db suite).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It smells dangerous...

logger.debug(
"primary endpoint early exit: Primary IP not in cached peer list."
)
primary_endpoint = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_units_ips is a property and will not update on retry. The Patroni response might change while retrying, but IMHO it's better to leave that to the topology observer script.

except RetryError:
return None
else:
Expand Down Expand Up @@ -1036,6 +1053,10 @@ def _start_primary(self, event: StartEvent) -> None:
logger.exception(e)
self.unit.status = BlockedStatus("Failed to create postgres user")
return
except PostgreSQLListUsersError:
logger.warning("Deferriing on_start: Unable to list users")
event.defer()
return
Comment on lines +1056 to +1059
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping the wait in primary_endpoint could lead o exceptions here.


self.postgresql.set_up_database()

Expand Down Expand Up @@ -1381,6 +1402,17 @@ def _is_workload_running(self) -> bool:

return charmed_postgresql_snap.services["patroni"]["active"]

@property
def _can_connect_to_postgresql(self) -> bool:
try:
for attempt in Retrying(stop=stop_after_delay(30), wait=wait_fixed(3)):
with attempt:
assert self.postgresql.get_postgresql_timezones()
except RetryError:
logger.debug("Cannot connect to database")
return False
return True

def update_config(self, is_creating_backup: bool = False) -> bool:
"""Updates Patroni config file based on the existence of the TLS files."""
if (
Expand Down Expand Up @@ -1424,6 +1456,10 @@ def update_config(self, is_creating_backup: bool = False) -> bool:
logger.debug("Early exit update_config: Patroni not started yet")
return False

# Try to connect
if not self._can_connect_to_postgresql:
logger.warning("Early exit update_config: Cannot connect to Postgresql")
return False
self._validate_config_options()

self._patroni.bulk_update_parameters_controller_by_patroni(
Expand Down
1 change: 0 additions & 1 deletion tests/integration/test_db_admin.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,6 @@


@pytest.mark.group(1)
@pytest.mark.skip(reason="DB Admin tests are currently broken")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have it re-enable. I suspect it is not related to this PR but enabled as coincident. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reenabled it here to reduce the amount CI runs.

async def test_landscape_scalable_bundle_db(ops_test: OpsTest, charm: str) -> None:
"""Deploy Landscape Scalable Bundle to test the 'db-admin' relation."""
await ops_test.model.deploy(
Expand Down
Loading