Remove race condition when accessing remote bulker map #4171
Conversation
This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
```go
func (b *Bulker) GetBulkerMap() map[string]Bulk {
    return b.bulkerMap
}
```
From what I can see, we had a flaky test because the policy self-monitor (internal/pkg/policy/self.go) was getting the bulker map in order to check remote output health, but we did not prevent our policy output preparation from creating a new bulker concurrently (internal/pkg/policy/policy_output.go).

I've changed our maps to `sync.Map`, as we weren't properly using the mutex we had, and changed this func to return a copy instead.
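For context, a copy-returning getter over a `sync.Map` could be sketched roughly like this (illustrative only; `Bulker`, `Bulk`, `bulkerMap`, and `GetBulkerMap` are names from this PR, everything else is an assumption):

```go
// Minimal sketch, assuming bulkerMap is a sync.Map whose values are always Bulk.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
    out := make(map[string]Bulk)
    b.bulkerMap.Range(func(key, value any) bool {
        out[key.(string)] = value.(Bulk)
        return true // continue iterating over all entries
    })
    return out // callers receive a snapshot, not the live map
}
```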
internal/pkg/bulk/engine.go (Outdated)
```go
if !ok {
    return nil
}
return o.(Bulk)
```
Can this panic?
It shouldn't; we are only adding Bulkers.
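For reference, the difference between the two assertion forms is what decides whether this can panic: the single-value form panics on a type mismatch, while the comma-ok form does not. A small standalone illustration (not code from this PR; `bulkLike` is a hypothetical stand-in for `Bulk`):

```go
package main

import "fmt"

// bulkLike is a hypothetical stand-in for the Bulk interface, used only for illustration.
type bulkLike interface{ Flush() }

func main() {
    var v any = 42 // deliberately not a bulkLike

    // Comma-ok form: never panics; ok reports whether the assertion held.
    if b, ok := v.(bulkLike); ok {
        fmt.Println("got a bulk:", b)
    } else {
        fmt.Println("not a bulk, no panic")
    }

    // Single-value form: panics at runtime because v's dynamic type is int.
    _ = v.(bulkLike)
}
```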
internal/pkg/bulk/engine.go (Outdated)
```go
// previously: return bulker, false, nil
bulker, ok := b.bulkerMap.Load(outputName)
if ok && !hasConfigChanged {
    return bulker.(Bulk), false, nil
}
```
I'm assuming it is guaranteed that the sync.Map will only hold the bulker type; is that the case?
yes
I added some benchmarks and ran them against main for comparison. Benchmarks were run with:
```
@@ -148,3 +153,57 @@ func Test_CreateAndGetBulkerChanged(t *testing.T) {
    assert.Nil(t, err)
    assert.Equal(t, true, cancelFnCalled)
}

func Benchmark_CreateAndGetBulker(b *testing.B) {
    b.Skip("Crashes on remote runner")
```
I'm not sure why, but this causes issues when `make benchmark` is run; these tests can be run individually without this.
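For reference, a single benchmark can usually be run on its own with the standard go tooling; something like the following (the package path and exact flags here are assumptions, not taken from the project's Makefile):

```sh
go test -run='^$' -bench='Benchmark_CreateAndGetBulker' ./internal/pkg/bulk/
```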
buildkite test this
My take on this is that this is a real bug, but I don't like the use of `sync.Map` here. The `sync.Map` documentation only recommends it for two cases: (1) a given entry is written once and then only read, or (2) multiple goroutines read, write, and overwrite entries for disjoint sets of keys.

We don't satisfy 1 because outputs can be changed, updated, and deleted. We don't satisfy 2 because the entire point of the map is that multiple goroutines access the same keys. It looks like we could solve this by just introducing a RWMutex for all the uses of bulkerMap; there are several uses that are not mutex protected (fleet-server/internal/pkg/bulk/engine.go, lines 127 to 128 in 6b29ab4).

You could potentially rewrite the way the remote ES output healthcheck works to not need a concurrent map at all. The self monitor could listen on a channel for state updates from each attempt to interact with the remote ES output that could fail, or something like that.
I'm not really worried about the performance implications of this after reading the code involved. I think that was just a way to see if the use of sync.Map was justified. I don't think it is, for non-performance reasons (interfaces, ugh).
I'll reimplement the mutex and create an issue to discuss remote output health.
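For illustration, the mutex-based version could look roughly like the following (a sketch only; `remoteOutputMutex`, `bulkerMap`, `Bulker`, `Bulk`, and `GetBulkerMap` are names from this PR, the lock is assumed to be a `sync.RWMutex` as suggested above, and `updateBulkerMap` is a hypothetical helper):

```go
// Sketch: every access to bulkerMap goes through remoteOutputMutex, and readers
// get a snapshot so they never observe concurrent additions/removals.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
    b.remoteOutputMutex.RLock()
    defer b.remoteOutputMutex.RUnlock()

    out := make(map[string]Bulk, len(b.bulkerMap))
    for name, bulker := range b.bulkerMap {
        out[name] = bulker
    }
    return out
}

// updateBulkerMap is a hypothetical helper showing the write side of the same lock.
func (b *Bulker) updateBulkerMap(outputName string, bulker Bulk) {
    b.remoteOutputMutex.Lock()
    defer b.remoteOutputMutex.Unlock()
    b.bulkerMap[outputName] = bulker
}
```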
Issue: #4185
Use the remoteOutputMutex whenever accessing the bulkerMap. Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.

(cherry picked from commit 924ea07)

Co-authored-by: Michel Laterman <[email protected]>
What is the problem this PR solves?
Remove a race condition/bug that may occur when remote ES outputs are used.
How does this PR solve the problem?
Use the `remoteOutputMutex` whenever accessing the `bulkerMap`. Change `GetBulkerMap` to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.

Design Checklist
- I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
- I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
- I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in ./changelog/fragments using the changelog tool

Related issues