Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Loss of reliability correlated with Testing Harness PR #589

Closed
tbraun96 opened this issue May 2, 2023 · 1 comment
Closed

[BUG] Loss of reliability correlated with Testing Harness PR #589

tbraun96 opened this issue May 2, 2023 · 1 comment
Assignees

Comments

@tbraun96
Copy link
Contributor

tbraun96 commented May 2, 2023

Diff link: https://github.com/webb-tools/dkg-substrate/commit/92ee13401bf45305c7c41f7a22524d71330b188d#diff-981734d6de1b0b3ff2f68[…]c3d270071f0d25ee26fb5893792

I have identified every place of that PR that was a non-logging based change inside the DKG that may change behavior:
diff line 96-100 of state_machine.rs
diff line 80-106 of remote.rs
diff line 85-103 of state_machine_wrapper.rs
diff line 924-928 of worker.rs
diff line 1023-1027 of worker.rs

While none of these strike me as causing the problem, I will nonetheless keep them here for our reference.
Assuming we are absolutely certain that the testing harness PR is what “caused” the loss in reliability, there are three possibilities here:

The above diff lines “caused” the bug

The above diff lines “revealed” the bug

The added logging code revealed (or caused) a bug. I use the word “revealed”, because sometimes when adding additional overhead to a simple call (in this case, instead of just printing an error, we potentially also print to a file. Which, we don’t do in the pipeline because we don’t pass the --output-path flag to the binaries. I suspect we don’t do it either in the other testing we’re doing), it can open room for synchronization problems to surface.

As for #1, I am doubtful. We can explore the diff if you want, but it’s hard for me to believe that’s causing a problem.

As for #2, this is possible. There’s a big difference between new code “causing” a bug versus new code “revealing” a bug. When new code “reveals” a bug, that means the new code changed the behavior such that there was a reduction/gain in overhead; without that change in behavior, the bug would have remained hidden. Sometime when programs speed up, concurrency/synchronization problems begin to surface (the same can even be said for when programs slow down). I’ve run across these sorts of bugs in the past.

As for #3, I am doubtful for thinking it “caused” the bug, but it is possible it “revealed” the bug for reasons explained earlier.
In summary, we should not jump to removing this PR just because we noticed a change in behavior. This PR may have revealed to us a bug without causing it, in which case, dismissing the PR would be moving in the wrong direction.

@1xstj
Copy link
Contributor

1xstj commented May 4, 2023

Related : #590
#591

@1xstj 1xstj closed this as completed May 4, 2023
@1xstj 1xstj reopened this May 4, 2023
@1xstj 1xstj self-assigned this May 8, 2023
@github-project-automation github-project-automation bot moved this to Not Started 🕧 in Webb Universe May 8, 2023
@1xstj 1xstj closed this as completed May 22, 2023
@github-project-automation github-project-automation bot moved this from Not Started 🕧 to Completed ✅ in Webb Universe May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

2 participants