Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make otel scheduler sync #2262

Merged
merged 5 commits into from
Jan 2, 2025
Merged

Make otel scheduler sync #2262

merged 5 commits into from
Jan 2, 2025

Conversation

wildum
Copy link
Contributor

@wildum wildum commented Dec 11, 2024

The fix in #2027 did not address all the race conditions that could happen with the scheduler. We noticed it when getting rid of our Otel fork brought back the flakiness of a batch processor test.

With this change, the scheduling of otel components is not an async operation anymore. This simplifies the logic and ensures that the components are started before they can start consuming.

@wildum wildum requested a review from a team as a code owner December 11, 2024 16:50
Copy link
Contributor

@ptodev ptodev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTel contract states that Start must return quickly, so I suppose sync would be fine. Do we know if the Collector does this async or sync? It'd be best to just do what it does, just to be on the safe side.

Regarding the issue with the batch processor - does it happen because something is wrong on the OTel side, or because #2027 maybe doesn't work for some edge case?

@ptodev
Copy link
Contributor

ptodev commented Dec 11, 2024

It might be good for @thampiotr to opine here, given that this PR reverts changes in #2027.

@wildum
Copy link
Contributor Author

wildum commented Dec 12, 2024

The OTel contract states that Start must return quickly, so I suppose sync would be fine. Do we know if the Collector does this async or sync? It'd be best to just do what it does, just to be on the safe side.

It's done sync in the otel collector. The service starts everything sync (https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/service.go#L230) and it's called sync here (https://github.com/open-telemetry/opentelemetry-collector/blob/main/otelcol/collector.go#L228)

Regarding the issue with the batch processor - does it happen because something is wrong on the OTel side, or because #2027 maybe doesn't work for some edge case?

Piotr and I had a discussion about it here: https://raintank-corp.slack.com/archives/C02GSU8SHBN/p1733846257348439
TLDR is that on Update we will create a new underlying unstarted Otel component. Because we start it async, there is always a window where the lazy consumer is unpaused and the unstarted component might start consuming before being started.

Copy link
Contributor

@ptodev ptodev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a changelog entry? If it's a problem end users could see, then mentioning the error in the changelog could indicate to them that they shouldn't see it again.

@thampiotr
Copy link
Contributor

It might be good for @thampiotr to opine here, given that this PR reverts changes in #2027.

It doesn't revert all the changes, it's reusing the approach.

Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I like it more. Thanks for fixing @wildum 🚀

@wildum wildum mentioned this pull request Dec 12, 2024
3 tasks
* Don't start components until Run is called

* Update consumers after stopping the component

* Minor fixes
Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments. Do we have sufficient test coverage? Cause it's a bit hard to ensure there are no bugs by just reading 😅

func (cs *Scheduler) Schedule(h otelcomponent.Host, cc ...otelcomponent.Component) {
// Schedule() completely overrides the set of previously running components.
// Components which have been removed since the last call to Schedule will be stopped.
func (cs *Scheduler) Schedule(ctx context.Context, updateConsumers func(), h otelcomponent.Host, cc ...otelcomponent.Component) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document updateConsumers?

Also I wonder if we should use the "consumer" terminology in Scheduler... before these changes it was somewhat decoupled from what you schedule - just some component that and onPause/onResume callbacks, but now the comments and the updateConsumers variable name talk directly about scheduling the consumers. Maybe we should go all-in and make this a ConsumerScheduler?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I documented the updateConsumers function

It's true that the scheduler functionality is a bit extended with the fact that it calls a function that's expected to set consumers. But its primary purpose is still to schedule the otel components. I find ConsumerScheduler a bit confusing because it missing the "component" part in the name. I would prefer to keep Scheduler. I think that in the context of otel, the fact that it also expects a function to update consumers is ok

internal/component/otelcol/internal/scheduler/scheduler.go Outdated Show resolved Hide resolved
@wildum
Copy link
Contributor Author

wildum commented Jan 2, 2025

Do we have sufficient test coverage?

We have tests at the scheduler and at the lazyconsumer level + the batch processor "Test_Update" is really good at spotting race conditions (thanks for adding all those tests when you did the change last time btw). Changes to the components themselves are really small, I think that we are ok

@wildum wildum merged commit 8568931 into main Jan 2, 2025
17 of 18 checks passed
@wildum wildum deleted the fix-otel-scheduler branch January 2, 2025 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants