Remove race condition when accessing remote bulker map #4171

Open
wants to merge 1 commit into
base: main

Conversation

michel-laterman
Contributor

What is the problem this PR solves?

Remove a race condition/bug that may occur when remote ES outputs are used.

How does this PR solve the problem?

Change the bulkerMap and remoteOutputConfigMap attributes to sync.Map so concurrency controls are used for all interactions.
Change GetBulkerMap to return a copy of the map so that remote output health will not conflict with adding/removing a bulker from the map.
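The two changes above could look roughly like the following sketch. It is not the PR's actual diff: the `Bulk` interface is reduced to a single hypothetical `Name` method, `fakeBulk` is an invented stand-in, and only the `bulkerMap` field is shown. The point is that `GetBulkerMap` builds a fresh map with `sync.Map.Range`, so a caller iterating the result never touches the shared map.

```go
package main

import (
	"fmt"
	"sync"
)

// Bulk is a stand-in for the real bulker interface (hypothetical, one method only).
type Bulk interface {
	Name() string
}

// fakeBulk is an invented implementation used only for this sketch.
type fakeBulk struct{ name string }

func (f fakeBulk) Name() string { return f.name }

// Bulker keeps child bulkers keyed by output name in a sync.Map,
// so stores, loads, and deletes are safe across goroutines.
type Bulker struct {
	bulkerMap sync.Map // effectively map[string]Bulk
}

// GetBulkerMap returns a snapshot copy of the current entries.
// Later stores or deletes on b.bulkerMap do not affect the returned map.
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	out := make(map[string]Bulk)
	b.bulkerMap.Range(func(k, v any) bool {
		name, okKey := k.(string)
		bulk, okVal := v.(Bulk)
		if okKey && okVal {
			out[name] = bulk
		}
		return true // keep iterating
	})
	return out
}

func main() {
	b := &Bulker{}
	b.bulkerMap.Store("remote-1", fakeBulk{name: "remote-1"})

	// The health check works on a private snapshot.
	snapshot := b.GetBulkerMap()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Simulates output preparation adding a bulker concurrently.
		b.bulkerMap.Store("remote-2", fakeBulk{name: "remote-2"})
	}()

	for name := range snapshot {
		fmt.Println("checking health of", name)
	}
	wg.Wait()
	fmt.Println(len(snapshot), len(b.GetBulkerMap()))
}
```

The trade-off is that the snapshot can be slightly stale: a bulker added after the copy is taken is only seen on the next call.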

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added bug Something isn't working flaky-test Unstable or unreliable test cases. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Dec 3, 2024
@michel-laterman michel-laterman requested a review from a team as a code owner December 3, 2024 00:20
Contributor

mergify bot commented Dec 3, 2024

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Dec 3, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 3, 2024
func (b *Bulker) GetBulkerMap() map[string]Bulk {
	return b.bulkerMap
Contributor Author

From what I can see, we had a flaky test because the policy self-monitor (internal/pkg/policy/self.go) was reading the bulker map to check remote output health, while nothing prevented policy output preparation (internal/pkg/policy/policy_output.go) from creating a new bulker concurrently.
I've changed our maps to sync.Map because we weren't properly using the mutex we had, and changed this func to return a copy instead.
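For comparison, here is a sketch of what consistent mutex use around a plain map might look like. This is not the approach taken in this PR (which switches to sync.Map); the `lockedRegistry` type, its methods, and the placeholder host value are all hypothetical. Every write takes the write lock, and readers copy the map under the read lock instead of receiving the shared map.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedRegistry is a hypothetical plain-map registry guarded by an RWMutex.
type lockedRegistry struct {
	mu      sync.RWMutex
	outputs map[string]string // output name -> host (illustrative value type)
}

func newLockedRegistry() *lockedRegistry {
	return &lockedRegistry{outputs: make(map[string]string)}
}

// Add inserts or replaces an entry while holding the write lock.
func (r *lockedRegistry) Add(name, host string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.outputs[name] = host
}

// Snapshot copies the map while holding the read lock, so callers can
// iterate the result without racing against Add.
func (r *lockedRegistry) Snapshot() map[string]string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]string, len(r.outputs))
	for k, v := range r.outputs {
		out[k] = v
	}
	return out
}

func main() {
	r := newLockedRegistry()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Writers and readers interleave; the lock serializes map access.
			r.Add(fmt.Sprintf("remote-%d", i), "https://example.invalid")
			_ = r.Snapshot()
		}(i)
	}
	wg.Wait()
	fmt.Println(len(r.Snapshot()))
}
```

Running either variant under `go test -race` or `go run -race` is a reasonable way to check that no unsynchronized map access remains.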

@michel-laterman michel-laterman added backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify labels Dec 3, 2024
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify bug Something isn't working flaky-test Unstable or unreliable test cases. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test_Agent_Remote_ES_Output flaky due to race
1 participant