
Remove race condition when accessing remote bulker map #4171

Merged · 5 commits merged into elastic:main on Dec 6, 2024

Conversation

@michel-laterman (Contributor) commented Dec 3, 2024

What is the problem this PR solves?

Remove a race condition/bug that may occur when remote ES outputs are used.

How does this PR solve the problem?

Use the remoteOutputMutex whenever accessing the bulkerMap.
Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.
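As a rough illustration of that copy-under-mutex approach (a simplified sketch assuming the existing Bulk interface and the remoteOutputMutex/bulkerMap fields named above; not the exact code in this PR):

import "sync"

type Bulker struct {
	remoteOutputMutex sync.RWMutex
	bulkerMap         map[string]Bulk
}

// GetBulkerMap returns a copy of the map under a read lock, so callers never
// hold a reference to the live map while other goroutines add or remove bulkers.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	b.remoteOutputMutex.RLock()
	defer b.remoteOutputMutex.RUnlock()
	mp := make(map[string]Bulk, len(b.bulkerMap))
	for k, v := range b.bulkerMap {
		mp[k] = v
	}
	return mp
}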

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman added the bug, flaky-test, and Team:Elastic-Agent-Control-Plane labels Dec 3, 2024
@michel-laterman requested a review from a team as a code owner December 3, 2024 00:20
mergify bot commented Dec 3, 2024

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot commented Dec 3, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

mergify bot added the backport-8.x label Dec 3, 2024
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	return b.bulkerMap
michel-laterman (Author) commented:

From what I can see, we had a flaky test because the policy self-monitor (internal/pkg/policy/self.go) was getting the bulker map in order to check remote output health; but we did not prevent our policy output preparation from creating a new bulker concurrently (internal/pkg/policy/policy_output.go).
I've changed our maps to sync.Map as we weren't properly using the mutex we had, and changed this func to return a copy instead.
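For context, a rough sketch of that sync.Map variant (simplified, assuming the existing Bulk interface; not the exact diff):

import "sync"

type Bulker struct {
	bulkerMap sync.Map // holds map[string]Bulk entries
}

// GetBulkerMap returns a snapshot copy, so the policy self-monitor's health
// check cannot race with goroutines that create or remove bulkers concurrently.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	mp := map[string]Bulk{}
	b.bulkerMap.Range(func(key, value any) bool {
		mp[key.(string)] = value.(Bulk)
		return true
	})
	return mp
}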

@michel-laterman added the backport-8.16 and backport-8.17 labels Dec 3, 2024
if !ok {
	return nil
}
return o.(Bulk)
A reviewer (Contributor) commented:

Can this panic?

michel-laterman (Author) replied:

It shouldn't; we are only adding Bulkers.
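(For reference, a comma-ok assertion would make that explicit and can never panic; a hypothetical defensive variant of the same lines:)

// Comma-ok form: reports a type mismatch instead of panicking.
if bulk, ok := o.(Bulk); ok {
	return bulk
}
return nil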

return bulker, false, nil
bulker, ok := b.bulkerMap.Load(outputName)
if ok && !hasConfigChanged {
return bulker.(Bulk), false, nil
A reviewer (Contributor) commented:

I'm assuming it is guaranteed that the sync.Map will only hold bulker type, is that the case?

michel-laterman (Author) replied:

yes

@michel-laterman (Author) commented:

I added some benchmarks, and ran them against main for comparison.
Note that while main has an RWMutex, it's only used in the updateBulkerMap func, not in GetBulker or CreateAndGetBulker.
This PR also uses a sync.Map for remoteOutputConfigMap, and main does not.

Benchmarks were run with go test -bench=Benchmark_CreateAndGetBulker -count 10 .
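(The comparison tables below appear to be benchstat output; presumably each branch's -bench results were saved to main.txt and sync.txt and compared roughly like this:)

go test -bench=Benchmark_CreateAndGetBulker -count 10 . > sync.txt
benchstat main.txt sync.txt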

goos: darwin
goarch: arm64
pkg: github.com/elastic/fleet-server/v7/internal/pkg/bulk
cpu: Apple M3 Pro
                                              │  main.txt   │              sync.txt               │
                                              │   sec/op    │    sec/op     vs base               │
_CreateAndGetBulker/new_remote_bulker-12        44.62µ ± 7%   45.56µ ± 10%       ~ (p=0.529 n=10)
_CreateAndGetBulker/existing_remote_bulker-12   658.5n ± 3%   710.3n ±  6%  +7.86% (p=0.000 n=10)
_CreateAndGetBulker/changed_remote_bulker-12    46.27µ ± 9%   47.56µ ±  5%       ~ (p=0.529 n=10)
geomean                                         11.08µ        11.55µ        +4.22%

                                              │   main.txt   │              sync.txt               │
                                              │     B/op     │     B/op      vs base               │
_CreateAndGetBulker/new_remote_bulker-12        39.30Ki ± 0%   38.79Ki ± 0%  -1.30% (p=0.000 n=10)
_CreateAndGetBulker/existing_remote_bulker-12     591.0 ± 0%     623.0 ± 0%  +5.41% (p=0.000 n=10)
_CreateAndGetBulker/changed_remote_bulker-12    38.78Ki ± 1%   38.78Ki ± 1%       ~ (p=0.896 n=10)
geomean                                         9.582Ki        9.709Ki       +1.32%

                                              │  main.txt   │              sync.txt               │
                                              │  allocs/op  │  allocs/op   vs base                │
_CreateAndGetBulker/new_remote_bulker-12        1.033k ± 0%   1.039k ± 0%   +0.58% (p=0.000 n=10)
_CreateAndGetBulker/existing_remote_bulker-12    11.00 ± 0%    13.00 ± 0%  +18.18% (p=0.000 n=10)
_CreateAndGetBulker/changed_remote_bulker-12    1.038k ± 0%   1.042k ± 0%   +0.39% (p=0.000 n=10)
geomean                                          227.6         241.4        +6.07%

@@ -148,3 +153,57 @@ func Test_CreateAndGetBulkerChanged(t *testing.T) {
assert.Nil(t, err)
assert.Equal(t, true, cancelFnCalled)
}

func Benchmark_CreateAndGetBulker(b *testing.B) {
	b.Skip("Crashes on remote runner")
michel-laterman (Author) commented:

I'm not sure why, but this causes issues when make benchmark is run; these tests can be run individually without the skip.

@michel-laterman (Author) commented:

buildkite test this

@cmacknz (Member) commented Dec 6, 2024

My take on this is that this is a real bug: the GetBulkerMap() method introduced to support remote outputs isn't concurrency safe, and really can't be unless you do something like this.

I don't like the use of sync.Map to solve this, primarily because it removes type safety in a codebase that I already find hard to follow with many levels of interface abstraction. I also don't see us satisfying the two criteria where sync.Map recommends its use: https://pkg.go.dev/sync#Map. I also don't like that it will encourage us to just randomly access the bulker everywhere but TBH that is kind of how it is designed to be used right now.

The Map type is optimized for two common use cases: (1) when the entry for a given key is only ever written once but read many times, as in caches that only grow, or (2) when multiple goroutines read, write, and overwrite entries for disjoint sets of keys. In these two cases, use of a Map may significantly reduce lock contention compared to a Go map paired with a separate Mutex or RWMutex.

We don't satisfy 1 because outputs can be changed, updated, and deleted. We don't satisfy 2 because the entire point of GetBulkerMap is to look at every key and not some non-overlapping set of keys.

It looks like we could solve this by just introducing an RWMutex for all the uses of bulkerMap. There are several that are not mutex protected.

bulkerMap: make(map[string]Bulk),
}

The GetBulkerMap method only has one use and it is to get the output name and client for each existing output. Rather than return a reference to the map, you can just hold a mutex to return a copy of those two things. Assuming the client is safe for concurrent access. GetBulkerMap shouldn't exist in its current form.

You could potentially rewrite the way the remote ES output healthcheck works to not need a concurrent map at all. The self-monitor could listen on a channel for state updates from each attempt to interact with the remote ES output that could fail, or something like that.
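A very rough sketch of that channel-based idea (purely illustrative; none of these names exist in fleet-server):

// Each remote output reports the result of its last interaction with remote
// ES over a channel; the self-monitor consumes the updates instead of walking
// a shared map of bulkers.
type remoteHealth struct {
	outputName string
	err        error // nil means the last interaction succeeded
}

func runSelfMonitor(healthCh <-chan remoteHealth) {
	state := map[string]error{} // owned by this goroutine only, no locking needed
	for h := range healthCh {
		state[h.outputName] = h.err
	}
}

// A producer would send after each attempt to reach the remote output:
//   healthCh <- remoteHealth{outputName: name, err: err}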

@cmacknz (Member) commented Dec 6, 2024

I'm not really worried about the performance implications of this after reading the code involved. I think that was just a way to see if use of sync.Map was justified. I don't think it is, for non-performance reasons (interfaces, ugh).

@michel-laterman (Author) commented:

I'll reimplement the mutex and create an issue to discuss remote output health

@michel-laterman (Author) commented:

Issue: #4185
I think we should remove remote health reporting from the policy self-monitor as it has no effect on the (fleet-server) status, but we can discuss that in the issue.

@michel-laterman merged commit 924ea07 into elastic:main Dec 6, 2024
8 checks passed
@michel-laterman deleted the sync-remote-bulker branch December 6, 2024 23:31
mergify bot pushed three commits that referenced this pull request Dec 6, 2024, and michel-laterman added three commits that referenced it Dec 9, 2024 (co-authored by Michel Laterman <[email protected]>), all cherry picked from commit 924ea07 with the same message:

Use the remoteOutputMutex whenever accessing the bulkerMap.
Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.
Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • backport-8.16: Automated backport with mergify
  • backport-8.17: Automated backport with mergify
  • bug: Something isn't working
  • flaky-test: Unstable or unreliable test cases.
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

Test_Agent_Remote_ES_Output flaky due to race
3 participants