[DPE-4532] Increase timeout and terminate processes that are still up #514
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #514      +/-   ##
==========================================
+ Coverage   68.66%   68.68%   +0.02%
==========================================
  Files          11       11
  Lines        3003     3015      +12
  Branches      532      535       +3
==========================================
+ Hits         2062     2071       +9
- Misses        822      823       +1
- Partials      119      121       +2

☔ View full report in Codecov by Sentry.
Skip remaining tests if one fails. There are a lot of timeouts in the continuous writes fixture, so if one of the tests breaks the db, we end up waiting for a while.
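For context, a minimal sketch of one way to skip the rest of a module once a test fails, adapted from pytest's documented hook-based recipe; this is illustrative only and not necessarily the mechanism used in this PR:

```python
# conftest.py — skip later tests in a module after an earlier one fails,
# instead of sitting through the continuous-writes fixture timeouts again.
import pytest

_failed_per_module: dict[str, str] = {}


def pytest_runtest_makereport(item, call):
    # Remember the first test that fails in each test module.
    if call.when == "call" and call.excinfo is not None:
        _failed_per_module.setdefault(item.nodeid.rsplit("::", 1)[0], item.name)


def pytest_runtest_setup(item):
    # Skip any test whose module already has a recorded failure.
    failed = _failed_per_module.get(item.nodeid.rsplit("::", 1)[0])
    if failed is not None:
        pytest.skip(f"skipping because {failed} failed earlier in this module")
```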
@@ -68,7 +68,7 @@ async def are_all_db_processes_down(ops_test: OpsTest, process: str) -> bool:
     pgrep_cmd = ("pgrep", "-x", process)

     try:
-        for attempt in Retrying(stop=stop_after_delay(60), wait=wait_fixed(3)):
+        for attempt in Retrying(stop=stop_after_delay(400), wait=wait_fixed(3)):
We set Patroni's loop_wait to 300, so the retry window needs to be longer than that before giving up.
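For reference, a self-contained sketch of the tenacity pattern in this hunk; a local subprocess pgrep check stands in for the per-unit check the test helper does through juju, and the function name is illustrative:

```python
import subprocess

from tenacity import Retrying, RetryError, stop_after_delay, wait_fixed


def wait_for_process_down(process: str) -> bool:
    """Poll every 3 s for up to 400 s, i.e. longer than Patroni's loop_wait of 300 s."""
    pgrep_cmd = ("pgrep", "-x", process)
    try:
        for attempt in Retrying(stop=stop_after_delay(400), wait=wait_fixed(3)):
            with attempt:
                # pgrep exits with a non-zero status when nothing matches.
                if subprocess.run(pgrep_cmd, capture_output=True).returncode == 0:
                    raise Exception(f"{process} is still running")
    except RetryError:
        return False
    return True
```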
logger.info("Unit %s not yet down" % unit.name) | ||
# Try to rekill the unit | ||
await send_signal_to_process(ops_test, unit.name, process, signal) |
Observing the logs, there's usually at least one unit stuck. I guess it manages to escape systemd's restart condition/Patroni's loop_wait and gets revived, so we kill it again to make sure it stays down.
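A hedged, self-contained sketch of that re-kill behaviour, extending the polling sketch above; the local pgrep/pkill calls stand in for the per-unit helpers used in the charm tests (send_signal_to_process in the diff), and the exact placement in the real helper may differ:

```python
import logging
import subprocess

from tenacity import Retrying, stop_after_delay, wait_fixed

logger = logging.getLogger(__name__)


def wait_for_process_down_or_rekill(process: str, signal: str = "SIGKILL") -> None:
    """Poll like above, but re-signal a process that escaped the first kill."""
    for attempt in Retrying(stop=stop_after_delay(400), wait=wait_fixed(3)):
        with attempt:
            if subprocess.run(("pgrep", "-x", process), capture_output=True).returncode == 0:
                logger.info("%s not yet down", process)
                # Re-kill: the process was revived (e.g. by systemd/Patroni),
                # so send the signal again before the next check.
                subprocess.run(("pkill", "--signal", signal, "-x", process))
                raise Exception(f"{process} is still running")
    # If the process never goes down, tenacity raises RetryError to the caller.
```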
LGTM! Thanks a lot!
LGTM!
Try to stabilise full cluster restart tests