Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race in zk library #35

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

fix race in zk library #35

wants to merge 1 commit into from

Conversation

jhump
Copy link

@jhump jhump commented Jul 20, 2020

We found a race condition in the code and this was a fix we applied internally. Hoping we can get this patch upstreamed.

This repo looks more active than the original (samuel/go-zookeeper). Are you the new maintainers of this library?

@jeffbean
Copy link

Can you add a test to show the race and the now fixed situation.

@jhump
Copy link
Author

jhump commented Jul 21, 2020

I cannot. The issue popped up occasionally, though rarely, in our tests of github.com/fullstorydev/gosolr. That repo contains a solrmonitor package which makes it easy to write solr clients that want to stay in sync with the cluster topology in Zookeeper, for optimal request routing.

We run those tests with -race, and a stack trace popped up that showed unsynchronized access to this field. I'm afraid I do not have a clear way to construct a test that can repro the issue at-will. Also, this patch was originally written over a year ago (March 2019). The only context I still have access to is output from an example failure in our CI test suite:

==================
WARNING: DATA RACE
Write at 0x00c0007bcc80 by goroutine 144:
  github.com/samuel/go-zookeeper/zk.(*Conn).recvLoop()
      /src/mn/projects/fullstory/go/src/github.com/samuel/go-zookeeper/zk/conn.go:820 +0xbb6
  github.com/samuel/go-zookeeper/zk.(*Conn).loop.func2()
      /src/mn/projects/fullstory/go/src/github.com/samuel/go-zookeeper/zk/conn.go:454 +0x62

Previous read at 0x00c0007bcc80 by goroutine 97:
  github.com/samuel/go-zookeeper/zk.(*Conn).sendSetWatches()
      /src/mn/projects/fullstory/go/src/github.com/samuel/go-zookeeper/zk/conn.go:567 +0x4bd
  github.com/samuel/go-zookeeper/zk.(*Conn).loop()
      /src/mn/projects/fullstory/go/src/github.com/samuel/go-zookeeper/zk/conn.go:465 +0x6bf
  github.com/samuel/go-zookeeper/zk.Connect.func1()
      /src/mn/projects/fullstory/go/src/github.com/samuel/go-zookeeper/zk/conn.go:218 +0x3c

The version of the library used in the test at this time was 6f3354f.

@codecov
Copy link

codecov bot commented Jul 21, 2020

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.20%. Comparing base (c5e730e) to head (10d288d).
Report is 6 commits behind head on master.

❗ Current head 10d288d differs from pull request most recent head 072f54a. Consider uploading reports for the commit 072f54a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #35      +/-   ##
==========================================
+ Coverage   75.94%   77.20%   +1.26%     
==========================================
  Files           7        7              
  Lines        1193     1316     +123     
==========================================
+ Hits          906     1016     +110     
- Misses        196      209      +13     
  Partials       91       91              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nemith
Copy link

nemith commented Jul 21, 2020

This repo looks more active than the original (samuel/go-zookeeper). Are you the new maintainers of this library?

We created this fork as there was a number of changes that weren't being upstreamed even after contacting samuel. Since then samuel has accepted some pull requests, but I think the two libraries will start diverging a bit here and there. The advantage here is there is no "one" owner so hopefully we can keep the support going. However we won't be hurt if you did a PR there as well. :)

@nemith
Copy link

nemith commented Jul 22, 2020

Although we should try to recreate this somehow even if it is very synthetic. I am partially worried that just wrapping everything in atomic values is a bit of a bandaid.

@pmazzini
Copy link

The version of the library used in the test at this time was 6f3354f.

Is it still happening with the latest release?

@dragonsinth
Copy link

We can try to put together a test case, to mechanically repro, but honestly it's easier to simply prove the data race by code inspection on this section of conn.loop():

wg.Add(1)
go func() {
	defer close(c.closeChan) // tell send loop to exit
	defer wg.Done()

	var err error
	if c.debugCloseRecvLoop {
		err = errors.New("DEBUG: close recv loop")
	} else {
		err = c.recvLoop(c.conn)
	}
	if err != io.EOF || c.logInfo {
		c.logger.Printf("recv loop terminated: %v", err)
	}
	if err == nil {
		panic("zk: recvLoop should never return nil error")
	}
}()

c.sendSetWatches()
wg.Wait()

c.sendSetWatches() and c.recvLoop(c.conn) definitely run in different goroutines. recvLoop writes that value, and sendSetWatches reads the value, and there's currently no synchronization between the two. Another possible solution would be wrapping the assignment c.lastZxid = res.Zxid inside of c.watchersLock, since that's the lock sendSetWatches uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants