Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider running without syscall tracepoints #173

Closed
umanwizard opened this issue Oct 8, 2024 · 5 comments
Closed

Consider running without syscall tracepoints #173

umanwizard opened this issue Oct 8, 2024 · 5 comments

Comments

@umanwizard
Copy link
Contributor

This discussion began in Slack: https://cloud-native.slack.com/archives/C03J794L0BV/p1728340374847259

The issue is that the agent refuses to run when syscall tracepoints are disabled; that is, when the path /sys/kernel/debug/tracing/events/syscalls does not exist. This was reported by a user of parca-agent, which is based on a fork of opentelemetry-ebpf-profiler: parca-dev/parca-agent#2976 . However, I don't think the issue is unique to Parca, and exactly the same issue would happen in the stock OTel profiler.

The original reporter is running Unraid, but similar issues can happen in various other distributions that might have CONFIG_FTRACE_SYSCALLS or other relevant kernel configuration options disabled. For example, evilsocket/opensnitch#774 documents similar brokenness in OpenSnitch when running on Liquorix kernels.

If I understand the OTel code correctly, we rely on the ability to attach syscall tracepoints in three places:

  1. In ProbeTracepoint, called here. This is just a check that syscall tracepoints work, so we can exit early in main if they are unavailable. It isn't actually functionally required.
  2. In loadKernelCode, called from loadTPBaseOffset, called from loadSystemConfig. Note that this is only called if parseBTF fails. So it is unnecessarily in kernels where BTF-based offset discovery is available.
  3. In loadKernelCode, called from checkForMaccessPatch. This is only necessary for the specific versions of the kernel that might have the copy_from_user_nofault bug which causes freezes.

So, there is a configuration that we could support, but don't now: if syscall tracepoints are unavailable, but BTF-based offset discovery is possible and we're in a kernel version that is known not to be impacted by the copy_from_user_nofault bug, we will currently fail to run due to the lack of syscall tracepoints, even though logically the agent should be able to work.

Thus I propose that we don't require syscall tracepoints in the case when we're in such a configuration.

@umanwizard
Copy link
Contributor Author

CC: @florianl

@rockdaboot
Copy link
Contributor

If I understand correctly, it would boil down to something like

if !tracer.BtfAvailable() || (kernelVersionCheck && tracer.HasProbeReadBug()) {
	if err = tracer.ProbeTracepoint(); err != nil {
		return failure("Failed to probe tracepoint: %v", err)
	}
}

BtfAvailable() straight forward and HasProbeReadBug() can be extracted from tracer.go, L421ff.

@fabled
Copy link
Contributor

fabled commented Oct 8, 2024

I think we could just remove tracer.ProbeTracepoint and its call completely. The subsequent attempts will fail anyway. I think that's a redundant check that just slow downs startup on current setup. @florianl or was there some strange reason why this has to be done ahead of time?

@umanwizard
Copy link
Contributor Author

umanwizard commented Oct 8, 2024

I agree with @fabled 's suggested approach and this is what I proposed for Parca: https://github.com/parca-dev/parca-agent/pull/2990/files (not removing the call, but just printing a warning when it fails)

There is one minor change that I would also suggest: making it clear in the error messages when checkForMaccessPatch that it's not necessarily the case that the user is running on an unpatched kernel, but rather that we don't know because we couldn't attach the tracepoint.

umanwizard added a commit to umanwizard/otel-profiling-agent that referenced this issue Oct 9, 2024
This implements the proposal in
open-telemetry#173
. Please see there for full details.
umanwizard added a commit to umanwizard/otel-profiling-agent that referenced this issue Oct 9, 2024
This implements the proposal in
open-telemetry#173
. Please see there for full details.
umanwizard added a commit to umanwizard/otel-profiling-agent that referenced this issue Oct 9, 2024
This implements the proposal in
open-telemetry#173
. Please see there for full details.
umanwizard added a commit to umanwizard/otel-profiling-agent that referenced this issue Oct 30, 2024
This implements the proposal in
open-telemetry#173
. Please see there for full details.
@umanwizard
Copy link
Contributor Author

Implemented in #175

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants