Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrapping request client calls with a read/write mutex #3701

Merged
merged 7 commits into from
Sep 14, 2023

Conversation

dimpar
Copy link
Contributor

@dimpar dimpar commented Aug 24, 2023

Refs #3699

Depends on #3704

It happens that the keep-core client panics when interacting with the go-electrum library. We noticed a "concurrent write to websocket connection" error was thrown when calling GetLatestBlockHeight() function. The stack trace leads to gorilla/websocket/WriteMessage which is called from the go-electrum library. The latest block height call was already wrapped with a read mutex, but we should also wrap it with the write mutex to prevent such concurrent errors.

It happens that keep-core client panics when interacting with the
go-electrum library. We noticed "concurrent write to websocket connection" error
was thrown when calling GetLatestBlockHeight() function. The stack trace
leads to gorilla/websocket/WriteMessage which is called from the
go-electrum client. The latest block height call was already wrapped
with a read mutex, but we should also wrap it with the write mutex as
well to prevent such concurrent errors.
@dimpar dimpar requested review from nkuba and pdyraga August 24, 2023 15:11
@dimpar
Copy link
Contributor Author

dimpar commented Aug 24, 2023

I wasn't able to recreate the concurrent write to websocket connection error in the integration tests.
I've added an integration test that hit the GetLatestBlockHeight() concurrently. Without the W lock I was able to replicate the same panic. cab1168 fixed it.

dimpar added 3 commits August 29, 2023 14:18
In this test we spin up 20 goroutines that hit electrum's
client.SubscribeHeaders() concurrently. In the cab1168 commit we
introduced a RW lock on getting the latest block height function.
Without a W lock, the test would've failed with 'concurrent write to websocket connection' panic.
If there's no panic, then the W lock works. 20 goroutine was always
enough to see the panic thrown with only R lock.
@dimpar dimpar added this to the v2.0.0-m5 milestone Sep 7, 2023
tomaszslabon
tomaszslabon previously approved these changes Sep 12, 2023
pkg/bitcoin/electrum/electrum.go Show resolved Hide resolved
@@ -771,9 +771,9 @@ func requestWithRetry[K interface{}](
requestCtx, requestCancel := context.WithTimeout(ctx, c.config.RequestTimeout)
defer requestCancel()

c.clientMutex.RLock()
c.clientMutex.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively creates a semaphore that allows just one request at the same. In theory, this can affect performance if multiple goroutines try to make multiple requests simultaneously. This, in turn, can cause problems in higher layers where some timeouts may be hit unexpectedly. Have we done any benchmarks against real Electrum servers to measure the impact of that change performance-wise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run the integration tests and system tests, no performace-wise tests. I'm not sure if we have any benchmarks/performace focused tests (please correct me if I'm wrong). Yes, it might affect the performance but I guess only when we expect a lot of write requests at the same time. Not too familiar with the flows to answer this off the top of my head. FWIW we had a Read mutex already, and I'd expect we read more than we write based on the fact that we have way more getter functions and this particular issue came from GetLatestBlockHeight.

The closest to a real-world testing would be including this fix in a release candidate -> deploy to a tesnet and see if we hit any timeout issues. Open for suggestions regarding the performace-wise testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree regarding the client instance itself - we rarely reconnect (thus replace this instance) and mostly use it to perform network calls to the Electrum server. But, my whole point was actually about the performance of those network calls. So far, with the read mutex, multiple client goroutines could use the client instance to make multiple network calls at the same time. Now, only one client goroutine can do so. Others have to wait for their turn. That means the last goroutines in the queue may wait too long if their predecessors stumble upon longer response times from the Electrum server (or expect other network problems). This can cause cascading failures as those client goroutines may have their own timeouts. This is why I'm raising the performance concern.

That said, I'm wondering whether we can limit the impact of this fix. You said (and this is my reasoning as well), the problem is related only to GetLatestBlockHeight, specifically client.SubscribeHeaders is problematic. Can't we just wrap the client.SubscribeHeaders call with a mutex instead of wrapping the whole requestWithRetry function? This way, only one function is affected and we can easily focus on it during real-world tests. WDYT?

Copy link
Contributor Author

@dimpar dimpar Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'd also like to limit the impact of this fix, but limiting this mutex only for GetLatestBlockHeight might not be enough and we can observe the same issue in other getters. The thing is that most of the getters hit the s.transport.SendMessage(bytes) at some point and if the transport layer is set to ws then we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue. I suspect that we observe the problem only for GetLatestBlockHeight because this function is being hit more frequently than the other getters in a certain flow. Note that this concurrent write to websocket connection error happen not very frequently, but when it does it affects the client uptime.
So having said that, I guess there's a tradeoff we need to make. Limiting the scope of this mutex has a lower impact on performance, but the error might pop up in a different function. Leaving the fix for a wider scope might affect the performance, but no error would occur. The consequence of this error was lowering the uptime bar to 50% because it affected the rewards distribution. Obviously, stakers wouldn't mind a client being slightly slower, but not losing the rewards. Also, I agree with your point that we want to avoid timeout issues that might be a concern, but we don't know if it is. I also suspect that our client had some slack for such cases like this and the timeouts are not balancing on the edge. I'd propose releasing a client with a wider scope mutex fix and seeing if we hit any timeouts in a testing env. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that most of the getters hit the s.transport.SendMessage(bytes) at some point and if the transport layer is set to ws then we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue

Ok, this is the missing piece I was not aware of. Thanks for clarifying that!

I'd propose releasing a client with a wider scope mutex fix and seeing if we hit any timeouts in a testing env.

That makes sense now but there is one thing I would add in this case: better observability for those calls. It may happen that we won't detect anything on testnet but this problem will manifest in the future and we won't have an easy way to figure out whether it is actually caused by throttled calls to the Electrum server. We can, for example, simply wrap all electrum.Connection methods with a logger that will display starting [method_name] call to Electrum server at the beginning and then completed [method_name] to Electrum server with [success/error] after [time] at the end. This will be really helpful to have such info logs next to timeout errors if they even occur. That's just a proposition so leaving it up to you but I think it will greatly facilitate future debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea. Mind if I open this in a separate PR? This one is for fixing an existing error, another PR will be around improving observability. Both will be included in the next release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Leaving this discussion open for future reference.

pkg/bitcoin/electrum/electrum_integration_test.go Outdated Show resolved Hide resolved
Commenting this test because as a side effect it exposed another known
issue which now happens very often because of this test. It's becoming
annoying having red builds because of this test. It should be
uncommented once the fix for
checksum0/go-electrum#10 is in place.
@dimpar dimpar force-pushed the concurrent-write-to-websocket branch from 9882119 to da09f53 Compare September 13, 2023 19:41
@lukasz-zimnoch lukasz-zimnoch merged commit 5c2604c into main Sep 14, 2023
@lukasz-zimnoch lukasz-zimnoch deleted the concurrent-write-to-websocket branch September 14, 2023 07:24
lukasz-zimnoch added a commit that referenced this pull request Sep 14, 2023
Refs the discussion
#3701 (comment)

Electrum client calls sometimes are prone to timeouts or other issues.
For debugging purposes we wrap electrum client calls with a logger and
measure the time it takes to complete a call.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants