I have identified every place in that PR with a non-logging change inside the DKG that may affect behavior:
diff lines 96-100 of state_machine.rs
diff lines 80-106 of remote.rs
diff lines 85-103 of state_machine_wrapper.rs
diff lines 924-928 of worker.rs
diff lines 1023-1027 of worker.rs
While none of these strike me as causing the problem, I will nonetheless keep them here for our reference.
Assuming we are absolutely certain that the testing harness PR is what "caused" the loss in reliability, there are three possibilities here:
1. The above diff lines "caused" the bug.
2. The above diff lines "revealed" the bug.
3. The added logging code revealed (or caused) a bug. I say "revealed" because adding overhead to a simple call can open room for synchronization problems to surface. In this case, instead of just printing an error, we potentially also print to a file. We don't actually do this in the pipeline, because we don't pass the --output-path flag to the binaries, and I suspect we don't do it in our other testing either.
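For context, here is a minimal sketch of the kind of conditional file logging described in #3. The names and structure are hypothetical, not the actual dkg-substrate code: a log call that is normally a cheap stderr write, but performs blocking file I/O on every call when an output path is supplied.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical sketch of conditional file logging. When `output_path`
// is None (as in the pipeline, where --output-path is not passed),
// this is just a stderr write. When it is Some, every log call also
// performs blocking file I/O, adding latency in the caller's hot path.
fn log_error(msg: &str, output_path: Option<&str>) -> std::io::Result<()> {
    eprintln!("{msg}");
    if let Some(path) = output_path {
        let mut file = OpenOptions::new().create(true).append(true).open(path)?;
        writeln!(file, "{msg}")?; // extra blocking syscall per log call
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Without a path: behaves like plain stderr logging.
    log_error("keygen failed", None)?;

    // With a path: the same call now also appends to a file.
    let path = std::env::temp_dir().join("dkg-demo.log");
    let path = path.to_str().unwrap().to_string();
    log_error("keygen failed", Some(&path))?;
    let contents = std::fs::read_to_string(&path)?;
    println!("{contents}");
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The point is not that the file write is expensive in absolute terms, but that it changes the timing of whatever code path calls it.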
As for #1, I am doubtful. We can explore the diff if you want, but it’s hard for me to believe that’s causing a problem.
As for #2, this is possible. There is a big difference between new code "causing" a bug and new code "revealing" one. When new code "reveals" a bug, it means the new code changed behavior enough to reduce or increase overhead; without that timing change, the bug would have remained hidden. Sometimes when programs speed up, concurrency/synchronization problems begin to surface (the same can be said for when they slow down). I've run across these sorts of bugs in the past.
As for #3, I am doubtful it "caused" the bug, but it is possible it "revealed" the bug for the reasons explained above.
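The "timing change reveals a latent bug" scenario can be illustrated with a minimal sketch (hypothetical code, not from the PR): a check-then-act increment that is technically racy, but rarely loses updates until extra work, such as a log write, widens the window between the check and the act.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical check-then-act increment: the load/store pair is not
// atomic as a unit, so another thread can interleave between them and
// an update gets lost.
fn racy_increments(counter: Arc<AtomicU32>, n: u32) {
    for _ in 0..n {
        let v = counter.load(Ordering::Relaxed);
        // Extra work here -- e.g. a blocking write to a log file --
        // widens the window in which another thread can slip in,
        // turning a previously-rare lost update into a frequent one.
        thread::yield_now();
        counter.store(v + 1, Ordering::Relaxed);
    }
}

fn main() {
    let n = 10_000;

    // Racy version: the final count is often well below 2 * n.
    let racy = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let c = Arc::clone(&racy);
            thread::spawn(move || racy_increments(c, n))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Correct version: fetch_add performs the read-modify-write
    // atomically, so no interleaving can lose an update.
    let safe = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let c = Arc::clone(&safe);
            thread::spawn(move || {
                for _ in 0..n {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    println!("racy: {}", racy.load(Ordering::Relaxed));
    println!("safe: {}", safe.load(Ordering::Relaxed));
}
```

The racy version is "correct by accident" when the gap between load and store is tiny; slowing down (or speeding up) that gap changes how often the bug manifests, which is exactly how a logging-only change can appear to "cause" a failure.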
In summary, we should not jump to reverting this PR just because we noticed a change in behavior. This PR may have revealed a bug without causing it, in which case dismissing it would be moving in the wrong direction.
Diff link: https://github.com/webb-tools/dkg-substrate/commit/92ee13401bf45305c7c41f7a22524d71330b188d#diff-981734d6de1b0b3ff2f68[…]c3d270071f0d25ee26fb5893792