Running samply 'in-proc' #158

Open
bruno-garcia opened this issue Apr 26, 2024 · 4 comments

Comments

@bruno-garcia

After a conversation earlier this week, I wonder if folks familiar with the code base know what it would take to run samply in-process, at a reduced sample rate to manage the overhead so that it can run in production.

@vvuk
Collaborator

vvuk commented May 21, 2024

This is tricky. For macOS, this is doable because samply does its own capture (by suspending threads and grabbing a call stack).

For Linux and Windows, samply relies on system-provided functionality (perf and ETW) for profiler capture. So it's not actually clear what running it "in process" really would mean there, since the in process profile capture piece would just be setting up the system facilities and processing events. But doing that in-proc is nearly identical to doing it out of process.

What you could do though is implement the macOS approach for in-proc usage for all platforms. This would be a stub that knows how to suspend its own process' threads and capture a stack, and then forward it to the rest of samply's machinery for processing and converting into a format that the front end can consume. This wouldn't be a huge amount of effort to get something basic running (this is basically what Firefox does, I believe).

The macOS code in mac/thread_profiler.rs, specifically ThreadProfiler::sample, is what the core of this would look like: capture a stack and add it to the set of unresolved_stacks/samples, which get processed and flushed to a Profile at the end.
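
As a rough illustration of that capture-then-flush shape (not samply's actual code), here is a minimal sketch. `SampleBuffer`, `RawSample`, and `FlushedProfile` are hypothetical names invented for this example, and the stacks are plain lists of addresses rather than anything samply-specific. Samples are buffered cheaply while profiling, and identical stacks are deduplicated only when the buffer is flushed.

```rust
use std::collections::HashMap;

/// One captured sample: a timestamp plus raw return addresses,
/// innermost frame first. (Hypothetical type, not samply's.)
pub struct RawSample {
    pub timestamp_ns: u64,
    pub frames: Vec<u64>,
}

/// Result of a flush: each unique stack is stored once, and every
/// sample refers to its stack by index.
pub struct FlushedProfile {
    pub stacks: Vec<Vec<u64>>,
    pub samples: Vec<(u64, usize)>,
}

/// Hypothetical per-thread buffer of samples that have not been processed yet.
#[derive(Default)]
pub struct SampleBuffer {
    unresolved: Vec<RawSample>,
}

impl SampleBuffer {
    /// Called once per sample; does no processing beyond storing the stack.
    pub fn record(&mut self, timestamp_ns: u64, frames: Vec<u64>) {
        self.unresolved.push(RawSample { timestamp_ns, frames });
    }

    /// Drains the buffer and interns identical stacks.
    pub fn flush(&mut self) -> FlushedProfile {
        let mut index_by_stack: HashMap<Vec<u64>, usize> = HashMap::new();
        let mut stacks: Vec<Vec<u64>> = Vec::new();
        let mut samples = Vec::new();
        for sample in self.unresolved.drain(..) {
            let index = match index_by_stack.get(&sample.frames) {
                Some(&existing) => existing,
                None => {
                    let new_index = stacks.len();
                    stacks.push(sample.frames.clone());
                    index_by_stack.insert(sample.frames, new_index);
                    new_index
                }
            };
            samples.push((sample.timestamp_ns, index));
        }
        FlushedProfile { stacks, samples }
    }
}
```

Deferring the deduplication to `flush` keeps the per-sample work small, which matters when the sampled thread is suspended while its stack is being read.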

@mstange
Owner

mstange commented May 21, 2024

What Vlad said is exactly right - for macOS, an in-process implementation would be relatively straightforward, but on Windows and Linux it would be a fully separate implementation.

That said, the Gecko profiler that is built into Firefox is an in-process implementation.

The two hard bits are:

  • Get information about loaded shared libraries, and update this information when new libraries are loaded or old libraries are unloaded.
  • Interrupt threads and get their stacks.

The first one is tricky because libraries can be unloaded in a racy manner from different threads, and getting library information often involves groveling around the library's loaded bytes, so you must be sure that those bytes don't go away while you're looking.
On macOS, Firefox uses _dyld_register_func_for_add_image and _dyld_register_func_for_remove_image.
On Linux, Firefox doesn't track newly-added libraries and just gets a snapshot of the libraries that have been loaded by the time at which the profiler is initialized, using /proc/self/maps.
On Windows, Firefox also only gets the list of shared libraries once, and it has to increment their reference count using LoadLibraryExW while getting the library information, and has a hack to skip certain libraries for which incrementing the reference count by 1 isn't enough because of extra unbalanced unloads.
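
To make the Linux snapshot approach concrete, here is a minimal sketch of parsing text in the /proc/self/maps format into a list of executable, file-backed mappings. This is not Firefox's or samply's code; `Mapping`, `parse_maps_line`, and `executable_file_mappings` are hypothetical names, and the sketch assumes the usual six-column layout (address range, permissions, offset, device, inode, optional path).

```rust
/// One line of /proc/<pid>/maps, reduced to the fields a profiler
/// typically needs. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub struct Mapping {
    pub start: u64,
    pub end: u64,
    pub executable: bool,
    pub file_offset: u64,
    pub path: String,
}

/// Parses a single maps line; returns None if it does not have the
/// expected columns or the hex fields fail to parse.
pub fn parse_maps_line(line: &str) -> Option<Mapping> {
    let mut fields = line.split_whitespace();
    let range = fields.next()?;
    let perms = fields.next()?;
    let offset = fields.next()?;
    let _device = fields.next()?;
    let _inode = fields.next()?;
    // The path is optional and may contain spaces. Re-joining with a
    // single space is a simplification: runs of spaces are collapsed.
    let path = fields.collect::<Vec<_>>().join(" ");
    let (start, end) = range.split_once('-')?;
    Some(Mapping {
        start: u64::from_str_radix(start, 16).ok()?,
        end: u64::from_str_radix(end, 16).ok()?,
        // Permissions look like "r-xp"; the third character is the execute bit.
        executable: perms.as_bytes().get(2) == Some(&b'x'),
        file_offset: u64::from_str_radix(offset, 16).ok()?,
        path,
    })
}

/// Keeps only executable mappings backed by a file path, skipping
/// anonymous memory and pseudo-entries such as [vdso] or [stack].
pub fn executable_file_mappings(maps: &str) -> Vec<Mapping> {
    maps.lines()
        .filter_map(parse_maps_line)
        .filter(|m| m.executable && m.path.starts_with('/'))
        .collect()
}
```

On Linux, a one-time snapshot would pass the result of `std::fs::read_to_string("/proc/self/maps")` to `executable_file_mappings` at profiler start-up. Libraries loaded after that point are missed, which is the limitation described above.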

Interrupting threads and getting their state is done with SuspendThread / GetThreadContext / ResumeThread on Windows, and with a signal handler on Linux - the sampler thread sends a SIGPROF to the sampled thread for each sample. This post by Nikhil has more information.
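
The sampler-thread side of this can be sketched without any platform APIs. In the sketch below, `Sampler` is a hypothetical type, and the `capture` closure stands in for the platform-specific step (SuspendThread / GetThreadContext / ResumeThread on Windows, or sending SIGPROF on Linux), which is deliberately not implemented here. The interval parameter is where a reduced sample rate for production use would be set.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical handle to a background sampler thread.
pub struct Sampler {
    stop: Arc<AtomicBool>,
    handle: JoinHandle<u64>,
}

impl Sampler {
    /// Spawns a thread that calls `capture` once per `interval`.
    /// `capture` is a placeholder for "interrupt the target threads
    /// and record their stacks".
    pub fn start<F>(interval: Duration, mut capture: F) -> Sampler
    where
        F: FnMut(Instant) + Send + 'static,
    {
        let stop = Arc::new(AtomicBool::new(false));
        let stop_flag = Arc::clone(&stop);
        let handle = thread::spawn(move || {
            let mut taken = 0u64;
            let mut next = Instant::now();
            loop {
                capture(Instant::now());
                taken += 1;
                // Schedule against a fixed timeline so that the time spent
                // capturing does not accumulate as drift.
                next += interval;
                if let Some(wait) = next.checked_duration_since(Instant::now()) {
                    thread::sleep(wait);
                }
                if stop_flag.load(Ordering::Relaxed) {
                    break;
                }
            }
            taken
        });
        Sampler { stop, handle }
    }

    /// Asks the sampler thread to stop, waits for it to exit, and
    /// returns the number of samples taken.
    pub fn stop(self) -> u64 {
        self.stop.store(true, Ordering::Relaxed);
        self.handle.join().expect("sampler thread panicked")
    }
}
```

For example, `Sampler::start(Duration::from_millis(100), ...)` would sample at roughly 10 Hz. The loop always captures at least once before it checks the stop flag.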

@roblabla

> For Linux and Windows, samply relies on system-provided functionality (perf and ETW) for profiler capture. So it's not actually clear what running it "in process" really would mean there, since the in process profile capture piece would just be setting up the system facilities and processing events. But doing that in-proc is nearly identical to doing it out of process.

Heyo, I'm also interested in "in-proc" functionality. However, I'm mostly interested in a way to profile my binary without having to ship a separate binary for the sampling - I don't actually care if the sampling is truly in-process or through an OS subsystem. Is it possible to use ETW/perf to self-sample, or is there some fundamental reason why that couldn't work?

@vvuk
Collaborator

vvuk commented May 28, 2024

> Is it possible to use ETW/perf to self-sample, or is there some fundamental reason why that couldn't work?

It's possible -- I'm not as familiar with perf, but with ETW, any captured etl file with the same providers that samply sets up (see xperf.rs) would work. However, ETW needs an elevated process/admin privileges. I think perf does as well, but there's also a group that users can be added to.
