Make tests less flaky on CI #716

StephenSorriaux · 2023-03-24T21:39:09Z

Why is this needed?

This makes the tests way less flaky, I hope (and first results confirmed it... but you never know).

Proposed Changes

renamed back test__connection.py to test_connection.py, where it belongs
added client_cluster_health in the setup_zookeeper step that is there to "guarantee" the cluster is up and running, without even considering the configuration that will be used by the test itself. I added a "custom" retry logic without using a KazooRetry since KazooRetry does not handle refused/dropped connections like I would want too: even if a connection_retry is given to KazooClient(), the retry configuration is actually not respected because the interrupt() end up being triggered anyway (ie. the client._stopped event)
now performing a ZK cluster nuking at the end of each test case (not each test, it's different) since keeping the same ZK state from test cases to test cases actually led to some tests randomly failing
fixed the test_read_only test based on what is done in the official java client (https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/test/java/org/apache/zookeeper/test/ReadOnlyModeTest.java), we were missing some things and that led the test to fail from time to time (see comments in the code)
activated the pytest log_cli configuration, as it gives more logs for each tests and I find it useful

Does this PR introduce any breaking change?

Nope.

codecov · 2023-03-24T21:52:12Z

Codecov Report

Patch coverage has no change and project coverage change: +0.36 🎉

Comparison is base (ffa0ae9) 96.28% compared to head (a41f564) 96.65%.

❗ Current head a41f564 differs from pull request most recent head 64af698. Consider uploading reports for the commit 64af698 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #716      +/-   ##
==========================================
+ Coverage   96.28%   96.65%   +0.36%     
==========================================
  Files          27       27              
  Lines        3554     3554              
==========================================
+ Hits         3422     3435      +13     
+ Misses        132      119      -13

see 7 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

kazoo/tests/test_connection.py

kazoo/tests/conftest.py

kazoo/tests/test_connection.py

pyproject.toml

kazoo/tests/test_connection.py

kazoo/testing/harness.py

kazoo/tests/conftest.py

kazoo/tests/test_connection.py

kazoo/tests/test_lock.py

kazoo/tests/test_connection.py

jeffwidman

A few more comments. Overall, I'm still 👍 on this, but realized there's a few more sections where it'd be helpful to have more of the history/rationale of the change documented, not only for me but also future git blame spelunkers.

Would also appreciate if @ceache takes a look, although IMO we shouldn't wait for him before merging this if he doesn't have time.

jeffwidman · 2023-04-11T04:50:33Z

I added a "custom" retry logic without using a KazooRetry since KazooRetry does not handle refused/dropped connections like I would want too: even if a connection_retry is given to KazooClient(), the retry configuration is actually not respected because the interrupt() end up being triggered anyway (ie. the client._stopped event)

This is fine for this PR, but is this something we should consider changing/fixing in KazooClient/KazooRetry long-term?

ceache · 2023-04-11T22:34:00Z

I'll try to review this thoroughly this week. Sorry I have been quite distracted recently. Besides the flakiness within the tests themselves, i was looking at the harness recently and I wanted to know how you felt about swapping our custom zookeeper ensemble orchestration driver for some docker compose driver. This would remove a lot of complexity and allow us to know we are starting the tests only after the ensemble is actually up and ready. I know how to craft such a compose definition and used it numerous times for my own testing. The down side is introducing a dependency on docker for local testing (or we could leave the "native" harness around for docker-less development) What do you think?

…

On Tue, Apr 11, 2023, 18:05 Stephen Sorriaux ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In kazoo/testing/harness.py <#716 (comment)>: > + client_cluster_health.start() + client_cluster_health.ensure_path("/") + client_cluster_health.stop() + self.log(logging.INFO, "cluster looks ready to go") + break + except Exception: + tries += 1 + if tries >= MAX_INIT_TRIES: + raise + if tries > 0 and tries % 2 == 0: + self.log( + logging.WARNING, + "nuking current cluster and making another one", + ) + self.cluster.terminate() + self.cluster.start() Yes, there is already some sleep() hidden in cluster.start() ( https://github.com/python-zk/kazoo/blob/master/kazoo/testing/common.py#L370 ) — Reply to this email directly, view it on GitHub <#716 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAIFTHUSQS3TH5DKY3SMGM3XAXIQ7ANCNFSM6AAAAAAWHB5WWQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Fixed the `test_read_only` test based on what is done in the official Java client, we were missing some things and that led the test to fail from time to time (see comments in the code for full story). Also renamed `test__connection.py` to `test_connection.py`, where it belongs.

Now performing a ZK cluster nuking at the end of each test case since keeping the same ZK state from test cases to test cases actually led to some tests randomly failing. Added `client_cluster_health` ZK client in the `setup_zookeeper()` step that is there to "guarantee" the cluster is up and running, without even considering the configuration that will be used by the test itself.

Activate `pytest`'s `logcli` setting to better read the client lifecycle during testing. Add `log()` function to KazooTestHarness class so that it is possible to log like crazy when something is not working Display the ZK client port in logs when starting a ZK server (useful for test "debug") Be able to get more than the last 100 lines of ZK logs (can be useful, believe me)

StephenSorriaux · 2023-04-11T22:51:37Z

I added a "custom" retry logic without using a KazooRetry since KazooRetry does not handle refused/dropped connections like I would want too: even if a connection_retry is given to KazooClient(), the retry configuration is actually not respected because the interrupt() end up being triggered anyway (ie. the client._stopped event)

This is fine for this PR, but is this something we should consider changing/fixing in KazooClient/KazooRetry long-term?

Yes, I think we should consider looking into it, especially what to expect of the retry when used as a connection_retry parameter for the client

StephenSorriaux · 2023-04-11T22:56:11Z

I'll be all in for a docker based harness! I personally have java installed on my desktop just for kazoo...

jeffwidman · 2023-04-13T00:01:59Z

I would 110% support a docker-based harness. No problem adding that as a test/dev dependency, it's almost certainly more accessible to drive-by devs as well who are going to be familiar with that from other libraries doing similar things.

jeffwidman

This all looks good to me, thank you so much for the detailed explanations, not only for me but also on behalf of other folks who stumble across this later.

Probably best to wait to merge til Charles get a chance to look at it too.

ceache

This looks good to me. Thanks a lot @StephenSorriaux !

StephenSorriaux · 2023-04-24T19:55:41Z

I added a "custom" retry logic without using a KazooRetry since KazooRetry does not handle refused/dropped connections like I would want too: even if a connection_retry is given to KazooClient(), the retry configuration is actually not respected because the interrupt() end up being triggered anyway (ie. the client._stopped event)

This is fine for this PR, but is this something we should consider changing/fixing in KazooClient/KazooRetry long-term?

Yes, I think we should consider looking into it, especially what to expect of the retry when used as a connection_retry parameter for the client

Update: I think #685 tried to tackle the "issue" I mentionned

ceache · 2023-04-24T20:33:52Z

I will try to rebase my PR, now that this is merged. Would there be anything holding it back ? I understood Jeff's comment to mean he thought it was ok.

…

On Mon, Apr 24, 2023, 21:55 Stephen Sorriaux ***@***.***> wrote: - I added a "custom" retry logic without using a KazooRetry since KazooRetry does not handle refused/dropped connections like I would want too: even if a connection_retry is given to KazooClient(), the retry configuration is actually not respected because the interrupt() end up being triggered anyway (ie. the client._stopped event) This is fine for this PR, but is this something we should consider changing/fixing in KazooClient/KazooRetry long-term? Yes, I think we should consider looking into it, especially what to expect of the retry when used as a connection_retry parameter for the client Update: I think #685 <#685> tried to tackle the "issue" I mentionned — Reply to this email directly, view it on GitHub <#716 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAIFTHVM7SYTPRBYKHIHEWTXC3LERANCNFSM6AAAAAAWHB5WWQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

StephenSorriaux · 2023-04-24T20:42:09Z

Sorry for the rebase, test_connection.py might give you some conflict...

I think I owe you a review on #685, will take the time now!

StephenSorriaux force-pushed the fix/flaky-tests branch 2 times, most recently from 8a21392 to 93b7337 Compare March 24, 2023 21:45

StephenSorriaux force-pushed the fix/flaky-tests branch 2 times, most recently from 092eabd to 8e2e933 Compare March 24, 2023 22:06

houndci-bot reviewed Mar 24, 2023

View reviewed changes

kazoo/tests/test_connection.py Outdated Show resolved Hide resolved

StephenSorriaux force-pushed the fix/flaky-tests branch 24 times, most recently from 84a34a4 to 53bc8c1 Compare March 26, 2023 05:35

StephenSorriaux requested review from ceache and a-ungurianu April 10, 2023 21:33

jeffwidman reviewed Apr 10, 2023

View reviewed changes

kazoo/tests/conftest.py Outdated Show resolved Hide resolved

kazoo/tests/test_connection.py Outdated Show resolved Hide resolved

pyproject.toml Outdated Show resolved Hide resolved

kazoo/tests/test_connection.py Show resolved Hide resolved

StephenSorriaux force-pushed the fix/flaky-tests branch 2 times, most recently from cde26c3 to a41f564 Compare April 10, 2023 22:21