I have identified every place in that PR with a non-logging change inside the DKG that may affect behavior:
diff lines 96-100 of state_machine.rs
diff lines 80-106 of remote.rs
diff lines 85-103 of state_machine_wrapper.rs
diff lines 924-928 of worker.rs
diff lines 1023-1027 of worker.rs
While none of these strike me as causing the problem, I will nonetheless keep them here for our reference.
Assuming we are absolutely certain that the testing harness PR is what "caused" the loss in reliability, there are three possibilities here:
1. The above diff lines "caused" the bug.
2. The above diff lines "revealed" the bug.
3. The added logging code revealed (or caused) a bug. I say "revealed" because adding overhead to a simple call can open room for synchronization problems to surface. In this case, instead of just printing an error, we potentially also print to a file. We don't actually do this in the pipeline, because we don't pass the --output-path flag to the binaries, and I suspect we don't do it in our other testing either.
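For context, here is a minimal sketch of the kind of conditional file logging described in #3. The names and structure are hypothetical, not the actual dkg-substrate code: a log call that is normally a cheap stderr write, but performs blocking file I/O on every call when an output path is supplied.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical sketch of conditional file logging. When `output_path`
// is None (as in the pipeline, where --output-path is not passed),
// this is just a stderr write. When it is Some, every log call also
// performs blocking file I/O, adding latency in the caller's hot path.
fn log_error(msg: &str, output_path: Option<&str>) -> std::io::Result<()> {
    eprintln!("{msg}");
    if let Some(path) = output_path {
        let mut file = OpenOptions::new().create(true).append(true).open(path)?;
        writeln!(file, "{msg}")?; // extra blocking syscall per log call
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Without a path: behaves like plain stderr logging.
    log_error("keygen failed", None)?;

    // With a path: the same call now also appends to a file.
    let path = std::env::temp_dir().join("dkg-demo.log");
    let path = path.to_str().unwrap().to_string();
    log_error("keygen failed", Some(&path))?;
    let contents = std::fs::read_to_string(&path)?;
    println!("{contents}");
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The point is not that the file write is expensive in absolute terms, but that it changes the timing of whatever code path calls it.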
As for #1, I am doubtful. We can explore the diff if you want, but it’s hard for me to believe that’s causing a problem.
As for #2, this is possible. There is a big difference between new code "causing" a bug and new code "revealing" one. When new code "reveals" a bug, it means the new code changed behavior enough to reduce or increase overhead; without that timing change, the bug would have remained hidden. Sometimes when programs speed up, concurrency/synchronization problems begin to surface (the same can be said for when they slow down). I've run across these sorts of bugs in the past.
As for #3, I am doubtful it "caused" the bug, but it is possible it "revealed" the bug for the reasons explained above.
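The "timing change reveals a latent bug" scenario can be illustrated with a minimal sketch (hypothetical code, not from the PR): a check-then-act increment that is technically racy, but rarely loses updates until extra work, such as a log write, widens the window between the check and the act.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical check-then-act increment: the load/store pair is not
// atomic as a unit, so another thread can interleave between them and
// an update gets lost.
fn racy_increments(counter: Arc<AtomicU32>, n: u32) {
    for _ in 0..n {
        let v = counter.load(Ordering::Relaxed);
        // Extra work here -- e.g. a blocking write to a log file --
        // widens the window in which another thread can slip in,
        // turning a previously-rare lost update into a frequent one.
        thread::yield_now();
        counter.store(v + 1, Ordering::Relaxed);
    }
}

fn main() {
    let n = 10_000;

    // Racy version: the final count is often well below 2 * n.
    let racy = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let c = Arc::clone(&racy);
            thread::spawn(move || racy_increments(c, n))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Correct version: fetch_add performs the read-modify-write
    // atomically, so no interleaving can lose an update.
    let safe = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let c = Arc::clone(&safe);
            thread::spawn(move || {
                for _ in 0..n {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    println!("racy: {}", racy.load(Ordering::Relaxed));
    println!("safe: {}", safe.load(Ordering::Relaxed));
}
```

The racy version is "correct by accident" when the gap between load and store is tiny; slowing down (or speeding up) that gap changes how often the bug manifests, which is exactly how a logging-only change can appear to "cause" a failure.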
In summary, we should not jump to reverting this PR just because we noticed a change in behavior. This PR may have revealed a bug without causing it, in which case dismissing it would be moving in the wrong direction.
Diff link: https://github.com/webb-tools/dkg-substrate/commit/92ee13401bf45305c7c41f7a22524d71330b188d#diff-981734d6de1b0b3ff2f68[…]c3d270071f0d25ee26fb5893792