Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libssh2 issues with Ruby profilers -- would a workaround PR be acceptable? #959

Open
ivoanjo opened this issue Apr 12, 2023 · 15 comments
Open

Comments

@ivoanjo
Copy link

ivoanjo commented Apr 12, 2023

Hello there :)

I work on Datadog's Ruby profiler and I arrived here while investigating a customer issue where they were using the rugged gem and it breaks in combination with our profiler due to a libssh2 bug.

The TL;DR is that libssh2 doesn't yet correctly handle the interruption of system calls caused by profilers that use unix signals, and thus having such a profiler running breaks network calls using rugged.

This issue is not specific to the Datadog Ruby profiler 😭; the also great stackprof profiler gem is unfortunately affected by this issue:

require 'rugged'
require 'stackprof'

puts "Cloning..."
creds = Rugged::Credentials::SshKey.new(username: 'git', publickey: '...', privatekey: '...')

StackProf.run(mode: :wall, out: 'tmp/stackprof-cpu-myapp.dump') do
  Rugged::Repository.clone_at('ssh://[email protected]/libgit2/rugged.git', '/tmp/some-directory', credentials: creds)
end

puts "Cloned!"

gets this output:

Cloning...
Traceback (most recent call last):
	3: from repro-rugged.rb:8:in `<main>'
	2: from repro-rugged.rb:8:in `run'
	1: from repro-rugged.rb:9:in `block in <main>'
repro-rugged.rb:9:in `clone_at': remote rejected authentication: Error waiting on socket (Rugged::SshError)

...but removing StackProf makes the clone work fine -- it's not actually an authentication error, it's the system call interruption at work.

Ok so why am I double-reporting this issue when I've already reported it to the libssh2 developers as well?

Since it's common to use a system libssh2 with rugged, even if the fix to libssh2 was released today, it'll take months/years to arrive on Linux distros.

Furthermore, it's hard to detect from Ruby code if rugged is linked with a broken libssh2, because while rugged provides a Rugged.libgit2_version, there's no corresponding API to probe the libssh2 version (at least I didn't find one in libgit2 directly or rugged).

Currently, when rugged is detected, the Datadog profiler needs to fall back to an alternative code path that yields lower-quality data.
I would love for this to not be the case!

Thus my question is: Would you consider accepting a pull request modifying rugged calls that can trigger network operations (I think it's only clone/fetch/pull/push/submodule stuff?) with a method that temporarily disables signal handling (for SIGPROF and SIGALARM) during that call?

Something similar to:

 static VALUE rb_git_repo_clone_at(int argc, VALUE *argv, VALUE klass)
 {
        VALUE url, local_path, rb_options_hash;
        git_clone_options options = GIT_CLONE_OPTIONS_INIT;
        struct rugged_remote_cb_payload remote_payload = { Qnil, Qnil, Qnil, Qnil, Qnil, Qnil, Qnil, 0 };
        git_repository *repo;
        int error;
 
        rb_scan_args(argc, argv, "21", &url, &local_path, &rb_options_hash);
        Check_Type(url, T_STRING);
        FilePathValue(local_path);
 
        parse_clone_options(&options, rb_options_hash, &remote_payload);
 
+        block_profiling_signals();
        error = git_clone(&repo, StringValueCStr(url), StringValueCStr(local_path), &options);
+        unblock_profiling_signals();
 
        if (RTEST(remote_payload.exception))
                rb_jump_tag(remote_payload.exception);
        rugged_exception_check(error);
 
        return rugged_repo_new(klass, repo);
 }

This would make rugged work great under stackprof / the Datadog profiler and as a bonus we could detect the fixed version and avoid the fall back to an alternative code path that yields lower-quality data that we currently have.

Thoughts? I'm shamelessly tagging @tenderlove here since you're also a maintainer of stackprof ;)

@carlosmn
Copy link
Member

Well this is annoying. TIL there's a specific signal for profiling. Ruby profilers aren't the only ones doing this though, right? This feels like a libgit2 thing rather than a rugged thing as it wouldn't be working around an issue that happens in Ruby.

@ethomson
Copy link
Member

I think so, too. I'm not entirely sure what that API would look like - presumably a runtime opt of some sort - but it makes sense to do this in libgit2.

@ivoanjo
Copy link
Author

ivoanjo commented Apr 17, 2023

Well this is annoying. TIL there's a specific signal for profiling. Ruby profilers aren't the only ones doing this though, right? This feels like a libgit2 thing rather than a rugged thing as it wouldn't be working around an issue that happens in Ruby.

Right! In libssh2/libssh2#955 I list a few other profilers that use this (dart, Java, etc) and I know of more (Google's profiler for Python -- for instance).

I suggested the workaround because while definitely being a bit ugly, it's not that hard to implement -- e.g. block_profiling_signals() can be implemented by calling pthread_sigmask which is only a handful of lines including error handling.

I'd be happy to open a PR with an actual implementation and a testcase showing how it fixes things for stackprof/the Datadog profiler -- let me know if there's interest :)

@ivoanjo
Copy link
Author

ivoanjo commented Apr 17, 2023

Edit: To clarify, I don't mind taking a stab at opening a PR to do this in libgit2 directly either :)

@carlosmn
Copy link
Member

carlosmn commented Apr 18, 2023

Messing with signals should be the app's responsibility but as a bug work-around, maybe libgit2 can get an option to compile with sigmasks in the libssh2 backend so it's limited to that instead of any network function. Ed would really be the one to have the final say, but I guess a define you can set wouldn't be the worst.

@ivoanjo
Copy link
Author

ivoanjo commented Apr 18, 2023

As long as there's a way to query "is this version of rugged safe or not" I'll take it, as I'd be able to tweak the Datadog profiler to use signals on the safe version 😄

For other profilers, one part of this issue is that it's not very obvious what's happening -- e.g. if you have rugged and then you added a profiler it's immediately obvious (it's the profiler/something that the profiler triggered!); but if you had a profiler and then you added rugged/libgit2 then it may be really unclear why network requests are failing.

@ethomson
Copy link
Member

Messing with signals should be the app's responsibility but as a bug work-around, maybe libgit2 can get an option to compile with sigmasks in the libssh2 backend so it's limited to that instead of any network function. Ed would really be the one to have the final say, but I guess a define you can set wouldn't be the worst.

Yeah, I wonder if the API should be "give me a callback that runs at the start and end of network functions" and the usage here is signal handling. Or if we should just give you a signal handler. I lead toward the first one.

@ethomson
Copy link
Member

OK, revisiting this thought -- should libgit2 really have a callback for "network function about to start" and "network function about to end"... ? What's the granularity? Is it git_clone / git_fetch / etc. If I'm writing C that callback is ... deeply pointless... though it has value for people using bindings...

Is it around every read / write / etc call? I'm skeptical that this is really what people want.

@ivoanjo
Copy link
Author

ivoanjo commented Apr 25, 2023

Yeah, I'm not sure it's worth expanding the libgit2 API in a permanent way for this versus having a more temporary solution (with the intention of removing the temporary solution at some point).

@ivoanjo
Copy link
Author

ivoanjo commented Oct 16, 2024

Hey folks! Poking this zombie ticket into life as the just-released upstream libssh2 1.11.1 includes a fix for this issue ( libssh2/libssh2#1058 ).

To be able to selectively work around the issue would you accept a PR that exposes the LIBSSH2_VERSION define from libssh2.h? (As in, puts that into a constant that Ruby code can check?)

Currently, the Datadog Ruby profiler applies a workaround whenever it sees rugged installed, which reduces profiler data quality (because it stops interrupting threads with a signal).

If the profiler had a way to detect the version of the underlying libssh2, we could use that to conditionally apply our workaround, rather than doing it always.

Would that be a reasonable addition?

@ethomson
Copy link
Member

Basically, you want to be able to query the libssh2 version that libgit2 is compiled against via libgit2? I'm okay with that, but I'd like us to be able to make sure that it's a relatively holistic API.

There's another issue (I can't seem to find it right now) which is interested in more information about how libgit2 was built / configured. Right now we have git_libgit2_features() which returns things like GIT_FEATURE_HTTPS or GIT_FEATURE_SSH, but ... that's pretty weak. I think that I would want an API that told me which SSH support was compiled in and - extending that out with your request - the version.

Now that I've increased the scope, let me know if this is still something that you'd be interested in tackling. :D

@ivoanjo
Copy link
Author

ivoanjo commented Oct 17, 2024

Hmm... actually, you got me thinking, maybe I asked for the version as a proxy for "is the fix in", but in the end, what I want is to know "is it safe to use signals/profiling".

So looking at git_libgit2_features/rb_git_features we could represent this as a feature as well, e.g. libssh_signal_safety or something as well.

I think that I would want an API that told me which SSH support was compiled in and - extending that out with your request - the version.

I'm not deeply familiar with some of the details here; by "which ssh", do you mean if libgi2 is using libssh2 or shelling out to openssh?

Now that I've increased the scope, let me know if this is still something that you'd be interested in tackling. :D

As long as it doesn't get too crazy, I have some space to work on this as it can affect Datadog customers :)

@ivoanjo
Copy link
Author

ivoanjo commented Nov 4, 2024

Hey @ethomson any thoughts on my suggestion above?

@pjmelling
Copy link

I'm a Ruby user and DataDog customer and would love to have the full feature set of the DD profiler working correctly in my apps.

@ivoanjo
Copy link
Author

ivoanjo commented Nov 25, 2024

@pjmelling for what it's worth, if you know you're using a fixed version of libssh2, you can
manually set the "no signals workaround" to false.

But yeah it'd be great if rugged exposed this tiny bit of extra info to allow the profiler to automatically be able to do the right thing :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants