-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240
[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240
Conversation
This checker is used to detect accidental thread scheduling switching points happening during profiling sampling. See the bigger comment in unsafe_api_calls_check.h . I was able to check that this checker correctly triggers for the bug in #4195, and also the bug I'm going to fix next, which is the use of `rb_hash_lookup` in the otel context reading code.
`rb_hash_lookup` calls `#hash` on the key being looked up so it's safe to use unless during sampling. This can cause the same issue as we saw in #4195 leading to ``` [BUG] unexpected situation - recordd:1 current:0 -- C level backtrace information ------------------------------------------- ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820 ruby(rb_vm_bugreport) vm_dump.c:1151 ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042 ruby(rb_bug_without_die) error.c:1042 ruby(die+0x0) [0x55ba03ac0998] error.c:1050 ruby(rb_bug) error.c:1052 ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226 ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144 ruby(rb_ec_tag_state) eval_intern.h:155 ruby(rb_vm_exec) vm.c:2484 ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509 ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728 ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598 ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615 ruby(thread_start_func_2) thread.c:672 ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187 /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353] ```
During my experiments to reproduce issues around allocation profiling, I've noted that the VM is in an especially delicate state during exception raising, so let's just decline to sample in this situation.
It occurs to me that if a symbol is dynamic, we were causing it to become a static symbol (e.g. making it never be able to be garbage collected). This can be very bad! And also, we know the symbol we're looking for must be a static symbol because if nothing else, our initialization caused it to become a static symbol. Thus, if we see a dynamic symbol, we can stop there, since by definition it won't be the symbol we're looking after. This is... really awkward to add a specific unit test for, so I've just relied on our existing test coverage to show that this has not affected the correctness of our otel code.
Thank you for updating Change log entry section 👏 Visited at: 2024-12-19 17:13:07 UTC |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4240 +/- ##
==========================================
- Coverage 97.74% 97.74% -0.01%
==========================================
Files 1355 1355
Lines 82366 82404 +38
Branches 4230 4230
==========================================
+ Hits 80512 80544 +32
- Misses 1854 1860 +6 ☔ View full report in Codecov by Sentry. |
} | ||
|
||
static void check_for_unsafe_api_calls(DDTRACE_UNUSED void *_unused) { | ||
if (inside_unsafe_context) rb_bug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a hard crash right? I see how it would make issues more visible but won't it also lead the profiler to crash client apps more often than what may be caused by some "racy" issues? Are we trading off potential profiling accuracy with increased crashing or would the unsafe calls potentially manifest in issues more dangerous than a simple crash (e.g. a data corruption)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, but there's a key detail: I've only placed calls to debug_enter_unsafe_context
and debug_leave_unsafe_context
in methods that are in the testing_module
, so this code never gets exercised in the "normal" in-production profiling path.
But re-reading the diff this is a subtle detail, so I'll push a commit to document and make this more clear that these calls should not be added outside of testing code paths.
(I did consider introducing an option to enable this behavior + adding it also to the normal profiling path, so we could tell folks that ran into issues "try with this setting"; but to be honest I wasn't sure it was worth the time spent into it and additional complexity, given we already have some coverage for this in the tests + it's not particularly something that has happened in the past.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added big scary comment in 7e49324
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only placed calls to debug_enter_unsafe_context and debug_leave_unsafe_context in methods that are in the testing_module, so this code never gets exercised in the "normal" in-production profiling path.
Ah I did wonder if that was true given the first few places you added it to had the "don't use this anywhere else" comment. But some of the other methods didn't and I was lazy to check where they were registered or called from :shame:
@@ -236,7 +236,7 @@ def safely_synchronize | |||
rescue ThreadError => e | |||
logger_without_components.error( | |||
'Detected deadlock during datadog initialization. ' \ | |||
'Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug' \ | |||
'Please report this at https://github.com/datadog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug' \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Intended? I mean it works but the official casing is the previous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually try to lower-case whenever I see it linked like this, trying to convince the world it's not DataDog ;)
Datadog ReportBranch report: ✅ 0 Failed, 22130 Passed, 1476 Skipped, 5m 28.92s Total Time |
Is the iteration limit for hash lookup added in this PR customer visible? Seems to me that it would be? If yes should it be noted somewhere? |
It's a good call-out but... I'm somewhat hesitant to introduce too many knobs of this kind if we can avoid it. This whole feature itself is still under preview (and off by default), so my thinking is: let's see if any customer reports any issue with the current hardcoded value for this setting, and in that case I'll work with them do decide what value would be needed. After that, we can decide if we bump the default, or do something else. Does this sound reasonable, or are you not convinced yet? |
Ok so the failure in the benchmarks is expected and an unfortunate effect of how the benchmarks are currently ran. In particular, our harness tries to run the benchmarks from this branch with the main branch, which is an interesting idea BUT forces the benchmark code to be backwards-compatible, which is really annoying for some changes such as this one. The master branch is currently red due to unrelated issues; my current plan for this PR is to wait until master is green again, then merge it into this branch, and then merge it once everything is green other than the benchmarks. |
What does this PR do?
This PR is a follow-up to #4195 and fixes a similar issue to that earlier PR.
Specifically, when the profiler is trying to read from the opentelemetry ("otel") context, we would try to look up a value from a Ruby hash using
rb_hash_lookup(...)
. This innocent-looking method would call#hash
on the key being used, which meant that a Ruby method could start running in the middle of a profiler sample.When combined with allocation profiling, we had reports of Ruby crashing with:
This PR fixes this by replacing the
rb_hash_lookup
with a native-level linear iteration on the hash contents, which allows us to skip invoking any kind of Ruby methods.Additionally, this PR:
Adds a checker mechanism to detect these kinds of problem conditions when executing our test suite. This mechanism was able to flag both the issue in [PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195 as well as the one in
rb_hash_lookup
Adds a tweak to avoid allocation sampling while Ruby is raising an exception, as in my testing this was another way of very easily causing Ruby to crash or misbehave
Fixes a potential issue with static vs dynamic Ruby symbols when reading from the otel context code.
Motivation:
Fix the crash.
Change log entry
Yes. Fix allocation profiling + otel tracing causing Ruby crash
Additional Notes:
As a reminder, reading from the opentelemetry context is disabled by default, so only customers setting
DD_PROFILING_PREVIEW_OTEL_CONTEXT_ENABLED
totrue/both/only
could be affected by this issue.Also, the crash was only triggered when allocation profiling was in use.
This PR may make a bit more sense to read commit-by-commit.
How to test the change?
This PR adds code coverage + a new checker to find the fixed issues.