
Remove race condition when accessing remote bulker map #4171

Merged · 5 commits merged into elastic:main on Dec 6, 2024

Conversation

@michel-laterman (Contributor) commented Dec 3, 2024

What is the problem this PR solves?

Remove a race condition/bug that may occur when remote ES outputs are used.

How does this PR solve the problem?

Use the remoteOutputMutex whenever accessing the bulkerMap.
Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.
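As a rough illustration of that copy-under-mutex approach (a simplified sketch assuming the existing Bulk interface and the remoteOutputMutex/bulkerMap fields named above; not the exact code in this PR):

import "sync"

type Bulker struct {
	remoteOutputMutex sync.RWMutex
	bulkerMap         map[string]Bulk
}

// GetBulkerMap returns a copy of the map under a read lock, so callers never
// hold a reference to the live map while other goroutines add or remove bulkers.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	b.remoteOutputMutex.RLock()
	defer b.remoteOutputMutex.RUnlock()
	mp := make(map[string]Bulk, len(b.bulkerMap))
	for k, v := range b.bulkerMap {
		mp[k] = v
	}
	return mp
}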

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman added the bug, flaky-test, and Team:Elastic-Agent-Control-Plane labels Dec 3, 2024
@michel-laterman requested a review from a team as a code owner December 3, 2024 00:20
mergify bot commented Dec 3, 2024

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot commented Dec 3, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

mergify bot added the backport-8.x label Dec 3, 2024
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	return b.bulkerMap
michel-laterman (Author) commented:

From what I can see, we had a flaky test because the policy self-monitor (internal/pkg/policy/self.go) was getting the bulker map in order to check remote output health; but we did not prevent our policy output preparation from creating a new bulker concurrently (internal/pkg/policy/policy_output.go).
I've changed our maps to sync.Map as we weren't properly using the mutex we had, and changed this func to return a copy instead.
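For context, a rough sketch of that sync.Map variant (simplified, assuming the existing Bulk interface; not the exact diff):

import "sync"

type Bulker struct {
	bulkerMap sync.Map // holds map[string]Bulk entries
}

// GetBulkerMap returns a snapshot copy, so the policy self-monitor's health
// check cannot race with goroutines that create or remove bulkers concurrently.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	mp := map[string]Bulk{}
	b.bulkerMap.Range(func(key, value any) bool {
		mp[key.(string)] = value.(Bulk)
		return true
	})
	return mp
}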

@michel-laterman added the backport-8.16 and backport-8.17 labels Dec 3, 2024
if !ok {
	return nil
}
return o.(Bulk)
A reviewer (Contributor) commented:

Can this panic?

michel-laterman (Author) replied:

It shouldn't; we are only adding Bulkers.
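(For reference, a comma-ok assertion would make that explicit and can never panic; a hypothetical defensive variant of the same lines:)

// Comma-ok form: reports a type mismatch instead of panicking.
if bulk, ok := o.(Bulk); ok {
	return bulk
}
return nil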

return bulker, false, nil
bulker, ok := b.bulkerMap.Load(outputName)
if ok && !hasConfigChanged {
return bulker.(Bulk), false, nil
A reviewer (Contributor) commented:

I'm assuming it is guaranteed that the sync.Map will only hold bulker type, is that the case?

michel-laterman (Author) replied:

yes

@michel-laterman (Author) commented:

I added some benchmarks, and ran them against main for comparison.
Note that while main has an RWMutex, it's only used in the updateBulkerMap func, not in GetBulker or CreateAndGetBulker.
This PR also uses a sync.Map for remoteOutputConfigMap, and main does not.

Benchmarks were run with go test -bench=Benchmark_CreateAndGetBulker -count 10 .
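(The comparison tables below appear to be benchstat output; presumably each branch's -bench results were saved to main.txt and sync.txt and compared roughly like this:)

go test -bench=Benchmark_CreateAndGetBulker -count 10 . > sync.txt
benchstat main.txt sync.txt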

goos: darwin
goarch: arm64
pkg: github.com/elastic/fleet-server/v7/internal/pkg/bulk
cpu: Apple M3 Pro
                                              │  main.txt   │              sync.txt               │
                                              │   sec/op    │    sec/op     vs base               │
_CreateAndGetBulker/new_remote_bulker-12        44.62µ ± 7%   45.56µ ± 10%       ~ (p=0.529 n=10)
_CreateAndGetBulker/existing_remote_bulker-12   658.5n ± 3%   710.3n ±  6%  +7.86% (p=0.000 n=10)
_CreateAndGetBulker/changed_remote_bulker-12    46.27µ ± 9%   47.56µ ±  5%       ~ (p=0.529 n=10)
geomean                                         11.08µ        11.55µ        +4.22%

                                              │   main.txt   │              sync.txt               │
                                              │     B/op     │     B/op      vs base               │
_CreateAndGetBulker/new_remote_bulker-12        39.30Ki ± 0%   38.79Ki ± 0%  -1.30% (p=0.000 n=10)
_CreateAndGetBulker/existing_remote_bulker-12     591.0 ± 0%     623.0 ± 0%  +5.41% (p=0.000 n=10)
_CreateAndGetBulker/changed_remote_bulker-12    38.78Ki ± 1%   38.78Ki ± 1%       ~ (p=0.896 n=10)
geomean                                         9.582Ki        9.709Ki       +1.32%

                                              │  main.txt   │              sync.txt               │
                                              │  allocs/op  │  allocs/op   vs base                │
_CreateAndGetBulker/new_remote_bulker-12        1.033k ± 0%   1.039k ± 0%   +0.58% (p=0.000 n=10)
_CreateAndGetBulker/existing_remote_bulker-12    11.00 ± 0%    13.00 ± 0%  +18.18% (p=0.000 n=10)
_CreateAndGetBulker/changed_remote_bulker-12    1.038k ± 0%   1.042k ± 0%   +0.39% (p=0.000 n=10)
geomean                                          227.6         241.4        +6.07%

@@ -148,3 +153,57 @@ func Test_CreateAndGetBulkerChanged(t *testing.T) {
assert.Nil(t, err)
assert.Equal(t, true, cancelFnCalled)
}

func Benchmark_CreateAndGetBulker(b *testing.B) {
	b.Skip("Crashes on remote runner")
michel-laterman (Author) commented:

I'm not sure why, but this causes issues when make benchmark is run; these tests can be run individually without the skip.

@michel-laterman (Author) commented:

buildkite test this

@cmacknz (Member) commented Dec 6, 2024

My take on this is that this is a real bug: the GetBulkerMap() method introduced to support remote outputs isn't concurrency safe, and really can't be unless you do something like this.

I don't like the use of sync.Map to solve this, primarily because it removes type safety in a codebase that I already find hard to follow with many levels of interface abstraction. I also don't see us satisfying the two criteria where sync.Map recommends its use: https://pkg.go.dev/sync#Map. I also don't like that it will encourage us to just randomly access the bulker everywhere but TBH that is kind of how it is designed to be used right now.

The Map type is optimized for two common use cases: (1) when the entry for a given key is only ever written once but read many times, as in caches that only grow, or (2) when multiple goroutines read, write, and overwrite entries for disjoint sets of keys. In these two cases, use of a Map may significantly reduce lock contention compared to a Go map paired with a separate Mutex or RWMutex.

We don't satisfy 1 because outputs can be changed, updated, and deleted. We don't satisfy 2 because the entire point of GetBulkerMap is to look at every key and not some non-overlapping set of keys.

It looks like we could solve this by just introducing an RWMutex for all the uses of bulkerMap. There are several that are not mutex protected.

bulkerMap: make(map[string]Bulk),
}

The GetBulkerMap method only has one use and it is to get the output name and client for each existing output. Rather than return a reference to the map, you can just hold a mutex to return a copy of those two things. Assuming the client is safe for concurrent access. GetBulkerMap shouldn't exist in its current form.

You could potentially rewrite the way the remote ES output healthcheck works to not need a concurrent map at all. The self-monitor could listen on a channel for state updates from each attempt to interact with the remote ES output that could fail, or something like that.
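A very rough sketch of that channel-based idea (purely illustrative; none of these names exist in fleet-server):

// Each remote output reports the result of its last interaction with remote
// ES over a channel; the self-monitor consumes the updates instead of walking
// a shared map of bulkers.
type remoteHealth struct {
	outputName string
	err        error // nil means the last interaction succeeded
}

func runSelfMonitor(healthCh <-chan remoteHealth) {
	state := map[string]error{} // owned by this goroutine only, no locking needed
	for h := range healthCh {
		state[h.outputName] = h.err
	}
}

// A producer would send after each attempt to reach the remote output:
//   healthCh <- remoteHealth{outputName: name, err: err}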

@cmacknz (Member) commented Dec 6, 2024

I'm not really worried about the performance implications of this after reading the code involved. I think that was just a way to see if use of sync.Map was justified. I don't think it is, for non-performance reasons (interfaces, ugh).

@michel-laterman (Author) commented:

I'll reimplement the mutex and create an issue to discuss remote output health

@michel-laterman (Author) commented:

Issue: #4185
I think we should remove remote health reporting from the policy self-monitor as it has no effect on the (fleet-server) status, but we can discuss that in the issue.

@michel-laterman merged commit 924ea07 into elastic:main Dec 6, 2024
8 checks passed
@michel-laterman deleted the sync-remote-bulker branch December 6, 2024 23:31
mergify bot pushed three commits that referenced this pull request Dec 6, 2024, and michel-laterman added three commits that referenced it Dec 9, 2024 (co-authored by Michel Laterman <[email protected]>), all cherry picked from commit 924ea07 with the same message:

Use the remoteOutputMutex whenever accessing the bulkerMap.
Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.
Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • backport-8.16: Automated backport with mergify
  • backport-8.17: Automated backport with mergify
  • bug: Something isn't working
  • flaky-test: Unstable or unreliable test cases.
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

Test_Agent_Remote_ES_Output flaky due to race
3 participants