
NPE in WebSocket client subscriptions #3093

Closed
cthulhu-rider opened this issue Aug 17, 2023 · 5 comments · Fixed by #3532
Assignees
Labels
bug Something isn't working I4 No visible changes S4 Routine U1 Critically important to resolve quickly
Milestone

Comments

@cthulhu-rider
Contributor

cthulhu-rider commented Aug 17, 2023

Caught this suddenly while debugging nspcc-dev/neofs-node#2444, so I don't have any special context. I subscribe to new blocks and notary requests:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x7bc119]

goroutine 14 [running]:
github.com/nspcc-dev/neo-go/pkg/rpcclient.(*WSClient).notifySubscribers(0xc000224480, {0x16?, {0xa11180?, 0xc000283ea0?}})
	/home/ll/go/pkg/mod/github.com/nspcc-dev/[email protected]/pkg/rpcclient/wsclient.go:599 +0x299
github.com/nspcc-dev/neo-go/pkg/rpcclient.(*WSClient).wsReader(0xc000224480)
	/home/ll/go/pkg/mod/github.com/nspcc-dev/[email protected]/pkg/rpcclient/wsclient.go:513 +0x545
created by github.com/nspcc-dev/neo-go/pkg/rpcclient.NewWS
	/home/ll/go/pkg/mod/github.com/nspcc-dev/[email protected]/pkg/rpcclient/wsclient.go:422 +0x638

Maybe I violated some strict requirement on my side, but I didn't find one after a quick look.

@AnnaShaleva AnnaShaleva added this to the v0.102.1 milestone Aug 18, 2023
@AnnaShaleva AnnaShaleva added the bug Something isn't working label Aug 18, 2023
@AnnaShaleva
Member

The issue is caused by the fact that the subscriptions map of the WS client isn't in sync with the receivers map (although it always must be). Currently I can't find where the bug is, and I also can't reproduce the panic. It's possible that a concurrency issue occurs between subsequent event receiving and unsubscription.

@roman-khimov roman-khimov modified the milestones: v0.103.0, v0.104.0 Oct 20, 2023
@AnnaShaleva AnnaShaleva modified the milestones: v0.104.0, v0.105.0 Nov 9, 2023
@roman-khimov roman-khimov added U2 Seriously planned S4 Routine I4 No visible changes labels Dec 21, 2023
@AnnaShaleva AnnaShaleva modified the milestones: v0.105.0, v0.106.0 Dec 28, 2023
@AnnaShaleva AnnaShaleva removed their assignment Jan 19, 2024
@roman-khimov roman-khimov modified the milestones: v0.106.0, v0.107.0 May 13, 2024
@roman-khimov roman-khimov added U1 Critically important to resolve quickly and removed U2 Seriously planned labels Jul 16, 2024
@roman-khimov roman-khimov modified the milestones: v0.107.0, v0.106.3 Jul 16, 2024
@carpawell
Member

100ms long blocks make it fail often.

@carpawell
Member

@roman-khimov, @AnnaShaleva, @AliceInHunterland, there is currently some race-by-design, and I am not sure what can be done here (mostly because I want to keep the fix small and reliable, with no big changes). The key reason is that the subscription lock is always taken after the RPC is done, e.g.:

```go
if err := c.performRequest("subscribe", params, &resp); err != nil {
	return "", err
}
c.subscriptionsLock.Lock()
defer c.subscriptionsLock.Unlock()
```

so it cannot control the order of the client's calls.

This became a frequent issue when we started to wait for some TXs in our test runs: https://github.com/nspcc-dev/neofs-node/blob/24fc815cd4b262b625e24bac62259325aba3444c/pkg/morph/client/meta.go#L32, because it subscribes per every call:

```go
blocksID, err := w.ws.ReceiveHeadersOfAddedBlocks(&neorpc.BlockFilter{Since: &since}, hRcvr)
if err != nil {
	// Falling back to block-based subscription.
	if errors.Is(err, neorpc.ErrInvalidParams) {
		blocksID, err = w.ws.ReceiveBlocks(&neorpc.BlockFilter{Since: &since}, bRcvr)
	}
}
if err != nil {
	wsWaitErr = fmt.Errorf("failed to subscribe for new blocks/headers: %w", err)
} else {
	waitersActive++
	go func() {
		<-exit
		err = w.ws.Unsubscribe(blocksID)
```

(per every object put, which can be frequent and concurrent).

Every client's (un)subscription call does two things: an RPC call and a subscriptions-map lock (two maps, currently). If we imagine one routine that tries to subscribe (A) and one routine that tries to unsubscribe (B), the following sequence can happen:

  1. The current number of subscriptions is X.
  2. B does an RPC and makes the number of subscriptions X-1.
  3. A does an RPC and makes the number of subscriptions X again.
  4. A takes the subscription locks and rewrites the client's subscription state (the subscription with ID X now points to a different channel; the channel that was registered by B is lost and is not related to any real subscription, but it is still included in the receivers map).
  5. B takes the subscription locks and drops subscription X (first, this is an error, and we have just lost a subscription that we think was made successfully; second, we have lost a channel in the receivers map, and no corresponding subscription points to it).
  6. An event for subscription X is received by the WS client (in practice it is a new block every 100 ms, often enough for this issue to happen every hour); we range through the receivers, see no corresponding subscription, and panic:

```go
ok, dropCh := c.subscriptions[id].TrySend(ntf, c.wsOpts.CloseNotificationChannelIfFull)
```
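The interleaving can be replayed deterministically. Below is a minimal sketch with stand-in maps (no real RPC or goroutines, and none of the real WSClient types): the server reuses freed IDs, so B's late cleanup drops A's fresh subscription and orphans A's receiver channel.

```go
package main

import "fmt"

// Deterministic replay of steps 2-6 above using illustrative maps:
// subscriptions maps a server-assigned ID to a receiver name, receivers
// tracks registered channels.
func main() {
	subscriptions := map[string]string{"X": "chanB"} // ID -> receiver
	receivers := map[string]bool{"chanB": true}

	// Steps 2-3: B's unsubscribe RPC frees ID X on the server, then A's
	// subscribe RPC gets the same ID X back.
	// Step 4: A takes the locks first and registers its channel under X.
	subscriptions["X"] = "chanA"
	receivers["chanA"] = true
	// Step 5: B takes the locks second and blindly drops ID X, i.e. A's entry.
	delete(subscriptions, "X")
	delete(receivers, "chanB")

	// Step 6: an event for X arrives; chanA is still in receivers, but no
	// subscription refers to it -> nil dereference in the real client.
	_, known := subscriptions["X"]
	fmt.Println(known, receivers["chanA"]) // false true
}
```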

1

The easiest solution is to put the subscription RPC inside the Lock/Unlock section so no conflicts can ever happen, but I am not sure this is what is expected to be done, and performance may also decrease (still, it seems like a real solution that fits the current design with non-monotonic subscription IDs).
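A minimal sketch of this option, with stand-ins for the real neo-go internals (performRequest here is a hypothetical stub, not the real client method): the whole subscribe path, including the fake RPC, runs under the subscriptions lock, so two calls can never interleave.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

type wsClient struct {
	subscriptionsLock sync.Mutex
	subscriptions     map[string]bool
}

// performRequest stands in for the RPC round-trip; the fake server hands out
// the lowest free numeric ID, mimicking the real ID-reuse behavior.
func (c *wsClient) performRequest() string {
	for i := 0; ; i++ {
		id := strconv.Itoa(i)
		if !c.subscriptions[id] {
			return id
		}
	}
}

func (c *wsClient) subscribe() string {
	c.subscriptionsLock.Lock() // taken BEFORE the RPC, unlike the current code
	defer c.subscriptionsLock.Unlock()
	id := c.performRequest()
	c.subscriptions[id] = true
	return id
}

func main() {
	c := &wsClient{subscriptions: make(map[string]bool)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.subscribe()
		}()
	}
	wg.Wait()
	fmt.Println(len(c.subscriptions)) // all 8 IDs registered, no collisions
}
```

The downside mentioned above is visible here: every subscriber serializes on the lock for the full RPC round-trip.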

2

The second idea is to make the subscription ID something like the request ID (the later a subscription call is made, the bigger its ID). I am not sure this can happen, because there is some logic about the subscription numbers:

```go
var id int
for ; id < len(sub.feeds); id++ {
	if sub.feeds[id].event == neorpc.InvalidEventID {
		break
	}
}
if id == len(sub.feeds) {
	s.subsLock.Unlock()
	return nil, neorpc.NewInternalServerError("maximum number of subscriptions is reached")
}
sub.feeds[id].event = event
sub.feeds[id].filter = filter
s.subsLock.Unlock()
s.subsCounterLock.Lock()
select {
case <-s.shutdown:
	s.subsCounterLock.Unlock()
	return nil, neorpc.NewInternalServerError("server is shutting down")
default:
}
s.subscribeToChannel(event)
s.subsCounterLock.Unlock()
return strconv.FormatInt(int64(id), 10), nil
```

3

Also, I think it is possible to make every subscription ID completely unique rather than a simple number. It may be some hash, or something time-dependent or request-dependent; that is not important, it just has to be unique for every subscription and make conflicts impossible.
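A sketch of this idea on the client side (nextSubKey is an illustration, not part of neo-go): key the client's maps by a locally generated, strictly unique token instead of the raw server-assigned numeric ID, so two subscriptions that reuse the same server ID can never collide client-side.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// subSeq is a process-wide monotonic counter; atomic increments make
// nextSubKey safe to call from concurrent (un)subscribe goroutines.
var subSeq uint64

// nextSubKey combines the server-assigned ID with the local counter, so even
// a reused server ID yields a distinct client-side key.
func nextSubKey(serverID string) string {
	return fmt.Sprintf("%s-%d", serverID, atomic.AddUint64(&subSeq, 1))
}

func main() {
	a := nextSubKey("5")
	b := nextSubKey("5") // same server ID, different local key
	fmt.Println(a, b, a != b)
}
```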


Thoughts?

@roman-khimov
Member

roman-khimov commented Jul 24, 2024

The server can return any IDs it wants and the client has to live with that, so any server-side "fix" is not appropriate.

Simple lock extension can affect event processing, since it has to access the same map as well. What we need, in fact, is to have the same order of events as on the server. And that's not a simple thing, since two concurrent requests can be served by the server in any order. Still, we do have some order in wsReader; we just need to leverage it.

@roman-khimov
Member

The other idea is to fetch the subscriber before unsubscribing and compare the current one with it after the request: if it's still the old one, drop it; if it's something else, OK, we're not touching it. Subscription always overwrites values anyway, and that's correct.
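This compare-before-drop approach can be sketched as follows (the field and method names are illustrative, not the real WSClient API): snapshot the receiver registered for the ID before the unsubscribe RPC, and delete the entry afterwards only if it is still the same one.

```go
package main

import (
	"fmt"
	"sync"
)

type wsClient struct {
	mu        sync.Mutex
	receivers map[string]chan int
}

// snapshotReceiver records which channel owns the ID before the RPC starts.
func (c *wsClient) snapshotReceiver(id string) chan int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.receivers[id]
}

// finishUnsubscribe runs after the unsubscribe RPC; if a concurrent subscribe
// re-registered the same ID with a new channel, leave that entry alone.
func (c *wsClient) finishUnsubscribe(id string, old chan int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.receivers[id] == old {
		delete(c.receivers, id) // still ours: safe to drop
	}
}

func main() {
	c := &wsClient{receivers: map[string]chan int{"1": make(chan int)}}
	old := c.snapshotReceiver("1")

	// ...the unsubscribe RPC would run here; meanwhile a concurrent
	// subscribe re-takes ID "1" with a fresh channel...
	fresh := make(chan int)
	c.receivers["1"] = fresh

	c.finishUnsubscribe("1", old)
	fmt.Println(c.receivers["1"] == fresh) // the newer subscription survives
}
```

This keeps the lock scope small: the RPC stays outside the critical section, and the stale-cleanup problem from step 5 of the sequence above simply becomes a no-op.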

carpawell added a commit that referenced this issue Jul 26, 2024
Every client's (Un)Subscription call does two things: an RPC call and a
subscription map lock (two of maps currently). If we imagine that there is
one routine that tries to subscribe (A) and one routine that tries to
unsubscribe (B), the following sequence can happen:

0. Current number of subscriptions is X
1. B does an RPC and makes number of subscriptions X-1
2. A does an RPC and makes number of subscriptions X again
3. A holds subscription locks and rewrites client's subscription state
   (subscription with ID X now points to a different channel; channel that
   was registered by B is lost and is not related to any real subscription
   but is still included in the `receivers` map)
4. B holds subscription locks and drops subscription X (first, it is an
   error and we have just lost a subscription that we think was made
   successfully second, we have lost a channel in the `receivers` map, and
   no corresponding subscription points to it)
5. X subscription is received by the WS client (in practice it is a new
   block, 100ms, quite often to be sure this issue happens every hour), we
   range through the receivers, see no corresponding subscription, and
   panic.

Closes #3093.

Signed-off-by: Pavel Karpy <[email protected]>