Erlang VM segfaults in get_process_info () from prometheus_process_collector-1.3.1/priv/prometheus_process_collector.so at startup
#9
Comments
What's the libc version? Did you try to recompile and archive it yourself?
No, didn't try to recompile. It worked fine until the RabbitMQ node got restarted. Nothing changed on the OS.
This only happens when the VM starts with the
yea, this likely means binary incompatibility. Can you clone https://github.com/deadtrickster/prometheus_process_collector and run
OK, will do, most likely next week. Thanks!
Since this plugin is a NIF, we must ensure that it's linked against the correct libraries, in this case libc. re deadtrickster/prometheus_process_collector#9 re #48 [#157015426]
I'm now compiling prometheus_process_collector using Is there anything else that we can do to address this?
my gdb complains about missing file, symbols, etc. could you please post gdb output here? all these bt, bt full.
-O3 can be too much, maybe you could also try -O0
We will try to get the debug symbols to have a better view.
Running into various issues so far. Anyway. Details are fuzzy, but the crash only occurs when there is a uid change just before starting RabbitMQ, and doesn't occur without the rabbitmq-server script (or at least I couldn't reproduce yet). Whatever happens, I believe catching exceptions around here[1] and returning an empty list to Erlang would solve our issue. Not enough time today but worth experimenting on this next because if this is a transient issue at startup then catching is a good idea, and if the issue is different then being able to inspect the VM state would help.
hmm, I wonder what's so special about your environment.
How do you mean? Also, about this being a start-up only issue, why is this function called on startup? Or does it just coincide with scraping?
It uses start-stop-daemon to start it as user vcap instead of user root. For the other questions I don't know yet. I think it coincides with scraping, yes.
anything special about this vcap user? start-stop-daemon simply calls setuid AFAIK, I could try to check this by starting as root and calling setuid in on_load
Nothing special about the vcap user, it's the equivalent of ubuntu or debian:
We are using vcap as a good practice, which is to not run services as root. This is the entire process tree:
This is how we start RabbitMQ: https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/master/jobs/rabbitmq-server/templates/bin/_start_rabbitmq-server
This is the exact start-stop-daemon.c that we use: https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/master/src/start-stop-daemon-1.9.18/start-stop-daemon.c
Can you see a problem with this approach of starting rabbitmq-server for
no, I should probably try changing uid myself. The problem with the suggested catch is that SIGSEGV is a signal, not a C++ exception, so try...catch won't work. A signal handler can be set up, but the recovery strategy is unclear. Now I'm really intrigued what's going on
Thanks for digging into this, let me know if there is anything that I can help with.
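To illustrate the point about signals versus exceptions, here is a minimal C++ sketch. It is not code from prometheus_process_collector; the names `on_signal`, `g_signal_seen` and `catch_block_sees_signal` are hypothetical. The signal is raised artificially with `std::raise`, which is the only reason the handler can safely return; after a real invalid memory access there is no such clean way to resume.

```cpp
#include <cassert>
#include <csignal>

// Flag set by the handler; sig_atomic_t is safe to write from a signal handler.
volatile std::sig_atomic_t g_signal_seen = 0;

// Hypothetical handler: only records that the signal arrived.
extern "C" void on_signal(int) { g_signal_seen = 1; }

// Raises SIGSEGV inside a try block and reports whether the catch block ran.
// Assumption: the signal comes from std::raise, not from a real bad memory
// access, so returning from the handler is well defined here.
bool catch_block_sees_signal() {
    g_signal_seen = 0;
    std::signal(SIGSEGV, on_signal);
    bool caught = false;
    try {
        std::raise(SIGSEGV);   // delivered to on_signal, not to catch (...)
    } catch (...) {
        caught = true;         // not reached: a signal is not a C++ exception
    }
    std::signal(SIGSEGV, SIG_DFL);  // restore the default disposition
    return caught;
}
```

Calling `catch_block_sees_signal()` should return `false` while leaving `g_signal_seen` set, which is why a plain try...catch around the crashing call would not help with a segfault.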
prometheus_process_collector is crashing the entire Erlang VM under the following scenarios:

* when the Erlang VM is restarted
* when disabling rabbitmq_management plugin
* when disabling prometheus_rabbitmq_exporter

An issue is open deadtrickster/prometheus_process_collector#9, will consider re-adding when/if this gets addressed, or when switching to bpm-release, whichever happens first.

prometheus.erl is being compiled explicitly, since the latest v3 exposes extra information about Erlang VM allocators (see deadtrickster/prometheus.erl#75). prometheus.erl v3 is required since it's compatible with both RabbitMQ 3.6.x & 3.7.x, as well as Erlang 20.0 and above. For more details, see deadtrickster/prometheus.erl#75 (comment)

For Erlang versions prior to 20, a pre-compiled version of prometheus.erl v3.4.x will be used instead.
I've stopped using
Yeah, I see. looks like I just don't have time for this, leave this issue open
Not a problem, didn't want this to become a drag, happy for you to get around to this in your own time.
I believe the root cause is throwing C++ exceptions. This is not what a well-behaved NIF should do. There are other places in the code that do the same.
I think returning some well-defined number (like -1) in such situations would be preferable to crashing the VM.
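The suggestion above could look roughly like the following C++ sketch. It is an illustration only, not the collector's actual code: `read_metric_or_throw` and `read_metric_safe` are hypothetical names, and the value 42 is a placeholder. The idea is that every exception is caught at the boundary and turned into the sentinel -1, so nothing propagates into the caller.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a metric reader that may throw on failure.
long read_metric_or_throw(bool fail) {
    if (fail) throw std::runtime_error("could not read process info");
    return 42;  // placeholder value for illustration
}

// Boundary wrapper: no exception leaves this function.
// Any failure is reported as the sentinel -1 instead.
long read_metric_safe(bool fail) noexcept {
    try {
        return read_metric_or_throw(fail);
    } catch (...) {
        return -1;
    }
}
```

With this shape, `read_metric_safe(false)` yields the metric and `read_metric_safe(true)` yields -1. This only covers C++ exceptions; it does nothing for a genuine SIGSEGV.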
System information
Backtrace
Artefacts