-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrapping request client calls with a read/write mutex #3701
Conversation
It happens that keep-core client panics when interacting with the go-electrum library. We noticed "concurrent write to websocket connection" error was thrown when calling GetLatestBlockHeight() function. The stack trace leads to gorilla/websocket/WriteMessage which is called from the go-electrum client. The latest block height call was already wrapped with a read mutex, but we should also wrap it with the write mutex as well to prevent such concurrent errors.
|
In this test we spin up 20 goroutines that hit electrum's client.SubscribeHeaders() concurrently. In the cab1168 commit we introduced a RW lock on getting the latest block height function. Without a W lock, the test would've failed with 'concurrent write to websocket connection' panic. If there's no panic, then the W lock works. 20 goroutine was always enough to see the panic thrown with only R lock.
@@ -771,9 +771,9 @@ func requestWithRetry[K interface{}]( | |||
requestCtx, requestCancel := context.WithTimeout(ctx, c.config.RequestTimeout) | |||
defer requestCancel() | |||
|
|||
c.clientMutex.RLock() | |||
c.clientMutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively creates a semaphore that allows just one request at the same. In theory, this can affect performance if multiple goroutines try to make multiple requests simultaneously. This, in turn, can cause problems in higher layers where some timeouts may be hit unexpectedly. Have we done any benchmarks against real Electrum servers to measure the impact of that change performance-wise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I run the integration tests and system tests, no performace-wise tests. I'm not sure if we have any benchmarks/performace focused tests (please correct me if I'm wrong). Yes, it might affect the performance but I guess only when we expect a lot of write requests at the same time. Not too familiar with the flows to answer this off the top of my head. FWIW we had a Read mutex already, and I'd expect we read more than we write based on the fact that we have way more getter functions and this particular issue came from GetLatestBlockHeight
.
The closest to a real-world testing would be including this fix in a release candidate -> deploy to a tesnet and see if we hit any timeout issues. Open for suggestions regarding the performace-wise testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree regarding the client
instance itself - we rarely reconnect (thus replace this instance) and mostly use it to perform network calls to the Electrum server. But, my whole point was actually about the performance of those network calls. So far, with the read mutex, multiple client goroutines could use the client
instance to make multiple network calls at the same time. Now, only one client goroutine can do so. Others have to wait for their turn. That means the last goroutines in the queue may wait too long if their predecessors stumble upon longer response times from the Electrum server (or expect other network problems). This can cause cascading failures as those client goroutines may have their own timeouts. This is why I'm raising the performance concern.
That said, I'm wondering whether we can limit the impact of this fix. You said (and this is my reasoning as well), the problem is related only to GetLatestBlockHeight
, specifically client.SubscribeHeaders
is problematic. Can't we just wrap the client.SubscribeHeaders
call with a mutex instead of wrapping the whole requestWithRetry
function? This way, only one function is affected and we can easily focus on it during real-world tests. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'd also like to limit the impact of this fix, but limiting this mutex only for GetLatestBlockHeight
might not be enough and we can observe the same issue in other getters. The thing is that most of the getters hit the s.transport.SendMessage(bytes)
at some point and if the transport layer is set to ws
then we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue. I suspect that we observe the problem only for GetLatestBlockHeight
because this function is being hit more frequently than the other getters in a certain flow. Note that this concurrent write to websocket connection
error happen not very frequently, but when it does it affects the client uptime.
So having said that, I guess there's a tradeoff we need to make. Limiting the scope of this mutex has a lower impact on performance, but the error might pop up in a different function. Leaving the fix for a wider scope might affect the performance, but no error would occur. The consequence of this error was lowering the uptime bar to 50% because it affected the rewards distribution. Obviously, stakers wouldn't mind a client being slightly slower, but not losing the rewards. Also, I agree with your point that we want to avoid timeout issues that might be a concern, but we don't know if it is. I also suspect that our client had some slack for such cases like this and the timeouts are not balancing on the edge. I'd propose releasing a client with a wider scope mutex fix and seeing if we hit any timeouts in a testing env. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that most of the getters hit the s.transport.SendMessage(bytes) at some point and if the transport layer is set to ws then we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue
Ok, this is the missing piece I was not aware of. Thanks for clarifying that!
I'd propose releasing a client with a wider scope mutex fix and seeing if we hit any timeouts in a testing env.
That makes sense now but there is one thing I would add in this case: better observability for those calls. It may happen that we won't detect anything on testnet but this problem will manifest in the future and we won't have an easy way to figure out whether it is actually caused by throttled calls to the Electrum server. We can, for example, simply wrap all electrum.Connection
methods with a logger that will display starting [method_name] call to Electrum server
at the beginning and then completed [method_name] to Electrum server with [success/error] after [time]
at the end. This will be really helpful to have such info logs next to timeout errors if they even occur. That's just a proposition so leaving it up to you but I think it will greatly facilitate future debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea. Mind if I open this in a separate PR? This one is for fixing an existing error, another PR will be around improving observability. Both will be included in the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Leaving this discussion open for future reference.
Commenting this test because as a side effect it exposed another known issue which now happens very often because of this test. It's becoming annoying having red builds because of this test. It should be uncommented once the fix for checksum0/go-electrum#10 is in place.
9882119
to
da09f53
Compare
Refs the discussion #3701 (comment) Electrum client calls sometimes are prone to timeouts or other issues. For debugging purposes we wrap electrum client calls with a logger and measure the time it takes to complete a call.
Refs #3699
Depends on #3704
It happens that the keep-core client panics when interacting with the
go-electrum
library. We noticed a "concurrent write to websocket connection" error was thrown when callingGetLatestBlockHeight()
function. The stack trace leads to gorilla/websocket/WriteMessage which is called from thego-electrum
library. The latest block height call was already wrapped with a read mutex, but we should also wrap it with the write mutex to prevent such concurrent errors.