
Erlang VM segfaults in get_process_info () from prometheus_process_collector-1.3.1/priv/prometheus_process_collector.so at startup #9

Open
gerhard opened this issue Apr 19, 2018 · 21 comments


@gerhard

gerhard commented Apr 19, 2018

System information

--------------- System Information ---------------
OTP release: 20
ERTS version: 9.3
Compile date: Tue Apr  3 08:53:44 2018
Arch: x86_64-unknown-linux-gnu
Endianness: Little
Word size: 64-bit
HiPE support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported and used
Debug compiled: no
Lock checking: no
Lock counting: no
Node name: 'rabbit@rmq0-memory-alloc-a'
Number of schedulers: 2
Number of async-threads: 64

Backtrace

#0  0x0000000000000cd6 in ?? ()
#1  0x00007f43ce87b157 in get_process_info () from /var/vcap/store/rabbitmq-server/mnesia/rabbit@rmq0-memory-alloc-a-plugins-expand/prometheus_process_collector-1.3.1/priv/prometheus_process_collector.so
#2  0x000000000044b5c4 in process_main (x_reg_array=0x7f440f1a3df0, f_reg_array=0x0) at beam/beam_emu.c:3601
#3  0x00000000004f5489 in sched_thread_func (vesdp=0x7f4410dc2100) at beam/erl_process.c:8906
#4  0x000000000067806f in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
#5  0x00007f4452c61184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007f445278603d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Artefacts

@deadtrickster
Owner

What's the libc version? Did you try to recompile and archive it yourself?

@gerhard
Author

gerhard commented Apr 19, 2018

No, didn't try to recompile. It worked fine until the RabbitMQ node got restarted. Nothing changed on the OS.

ii  libc-bin                            2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Binaries
ii  libc-dev-bin                        2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Development binaries
ii  libc6:amd64                         2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Shared libraries
ii  libc6-dev:amd64                     2.19-0ubuntu6.14                           amd64        Embedded GNU C Library: Development Libraries and Header Files

@gerhard
Author

gerhard commented Apr 19, 2018

This only happens when the VM starts with the prometheus_process_collector enabled. If the plugin is disabled, the Erlang VM restarts successfully.

@deadtrickster
Owner

Yeah, this likely means a binary incompatibility. Can you clone https://github.com/deadtrickster/prometheus_process_collector and run `rebar3 archive` (preferably on the target machine)? That will generate an .ez archive.

@gerhard
Author

gerhard commented Apr 19, 2018

OK, will do, most likely next week. Thanks!

gerhard added a commit to rabbitmq/rabbitmq-server-boshrelease that referenced this issue May 1, 2018
Since this plugin is a NIF, we must ensure that it's linked against the
correct libraries, in this case libc.

re deadtrickster/prometheus_process_collector#9
re #48

[#157015426]
@gerhard
Author

gerhard commented May 1, 2018

I've now compiled prometheus_process_collector using `rebar3 archive`, and the same segfault is still there.

Is there anything else that we can do to address this?

@deadtrickster
Owner

My gdb complains about missing files, symbols, etc. Could you please post the gdb output here? All of it: `bt`, `bt full`.

@deadtrickster
Owner

`-O3` can be too aggressive; maybe you could also try `-O0`.

@essen

essen commented May 1, 2018

#0  0x0000000000000cd6 in ?? ()
No symbol table info available.
#1  0x00007febd737c137 in get_process_info ()
   from /var/vcap/store/rabbitmq-server/mnesia/rabbit@rmq0-memory-alloc-a-plugins-expand/prometheus_process_collector-1.3.1/priv/prometheus_process_collector.so
No symbol table info available.
#2  0x000000000044b5c4 in process_main (x_reg_array=0x7febdb9e2df0, f_reg_array=0x0)
    at beam/beam_emu.c:3601
        fp = 0x7febd737c120 <get_process_info>
        env = {mod_nif = 0x7fec1a601050, proc = 0x7febd9eea968, hp = 0x7febd84f1ee0, 
          hp_end = 0x7febd84f27a8, heap_frag = 0x0, fpe_was_unmasked = 0, tmp_obj_list = 0x0, 
          exception_thrown = 0, tracee = 0x0, exiting = 0}
        live_hf_end = 0x0
        nif_bif_result = 140650912738208
        bif_nif_arity = 0
        init_done = 1
        c_p = 0x7febd9eea968
        reds_used = 0
        reg = 0x7fec1a980100
        opcodes = {0x44d086 <process_main+25782>, 0x448f5f <process_main+9103>, 
          0x448fbc <process_main+9196>, 0x449039 <process_main+9321>, 0x44d0d0 <process_main+25856>, 
          0x44a8d1 <process_main+15617>, 0x44ccca <process_main+24826>, 0x44a939 <process_main+15721>, 
          0x44dad6 <process_main+28422>, 0x44a9e8 <process_main+15896>, 0x44b046 <process_main+17526>, 
          0x44b0c4 <process_main+17652>, 0x44bc5f <process_main+20623>, 0x44c4f9 <process_main+22825>, 
          0x44cbba <process_main+24554>, 0x44bd0d <process_main+20797>, 0x44cd8f <process_main+25023>, 
          0x44c808 <process_main+23608>, 0x44c831 <process_main+23649>, 0x44c48e <process_main+22718>, 
          0x44c85f <process_main+23695>, 0x449f2a <process_main+13146>, 0x44aa08 <process_main+15928>, 
          0x44b56f <process_main+18847>, 0x44cbfc <process_main+24620>, 0x449cad <process_main+12509>, 
          0x4499f3 <process_main+11811>, 0x44aabd <process_main+16109>, 0x44d59c <process_main+27084>, 
          0x44d13e <process_main+25966>, 0x44aa52 <process_main+16002>, 0x446fbd <process_main+1005>, 
          0x44d5bb <process_main+27115>, 0x44d759 <process_main+27529>, 0x44d6fc <process_main+27436>, 
          0x44d6c8 <process_main+27384>, 0x44d8e2 <process_main+27922>, 0x447458 <process_main+2184>, 
          0x44747b <process_main+2219>, 0x4474a6 <process_main+2262>, 0x4474d0 <process_main+2304>, 
          0x4474f2 <process_main+2338>, 0x44751e <process_main+2382>, 0x447555 <process_main+2437>, 
          0x44758b <process_main+2491>, 0x4475c1 <process_main+2545>, 0x4475f6 <process_main+2598>, 
          0x44762c <process_main+2652>, 0x447661 <process_main+2705>, 0x447696 <process_main+2758>, 
          0x44d8d4 <process_main+27908>, 0x44b8e7 <process_main+19735>, 0x44b991 <process_main+19905>, 
          0x44d940 <process_main+28016>, 0x44ba2d <process_main+20061>, 0x44d905 <process_main+27957>, 
          0x44afd0 <process_main+17408>, 0x44a73f <process_main+15215>, 0x44a7a6 <process_main+15318>, 
          0x44a811 <process_main+15425>, 0x44a28c <process_main+14012>, 0x44a306 <process_main+14134>, 
          0x44a6b1 <process_main+15073>, 0x44a3ae <process_main+14302>, 0x44b4c2 <process_main+18674>, 
          0x449ea0 <process_main+13008>, 0x44a3f7 <process_main+14375>, 0x44bd37 <process_main+20839>, 
          0x44beb9 <process_main+21225>, 0x44bf80 <process_main+21424>, 0x44c535 <process_main+22885>, 
          0x44bfd7 <process_main+21511>, 0x44c027 <process_main+21591>, 0x44c65c <process_main+23180>, 
          0x44c884 <process_main+23732>, 0x44c8ff <process_main+23855>, 0x44c98d <process_main+23997>, 
          0x44c40a <process_main+22586>, 0x44c9fc <process_main+24108>, 0x44c4b5 <process_main+22757>, 
          0x44c7c8 <process_main+23544>, 0x44bbaa <process_main+20442>, 0x44cb04 <process_main+24372>, 
          0x44cd75 <process_main+24997>, 0x44cb88 <process_main+24504>, 0x44cb12 <process_main+24386>, 
          0x44ca12 <process_main+24130>, 0x44cfea <process_main+25626>, 0x44ce30 <process_main+25184>, 
          0x44cf11 <process_main+25409>, 0x44ba68 <process_main+20120>, 0x44bc4e <process_main+20606>, 
          0x44bbc0 <process_main+20464>, 0x44c5dd <process_main+23053>, 0x44bdcd <process_main+20989>, 
          0x44ce80 <process_main+25264>, 0x44be99 <process_main+21193>, 0x44be79 <process_main+21161>, 
          0x44c0c8 <process_main+21752>, 0x44c140 <process_main+21872>, 0x44c1c0 <process_main+22000>, 
          0x44c1f5 <process_main+22053>, 0x44d074 <process_main+25764>, 0x44b75c <process_main+19340>, 
          0x44cebe <process_main+25326>, 0x44d00b <process_main+25659>, 0x44cdd1 <process_main+25089>, 
          0x44cf96 <process_main+25542>, 0x44aedc <process_main+17164>, 0x44ab19 <process_main+16201>, 
          0x44ac2b <process_main+16475>, 0x44712c <process_main+1372>, 0x44719c <process_main+1484>, 
          0x44715b <process_main+1419>, 0x446fd6 <process_main+1030>, 0x44a86d <process_main+15517>, 
          0x44a53a <process_main+14698>, 0x4470f5 <process_main+1317>, 0x4470d1 <process_main+1281>, 
          0x44d97b <process_main+28075>, 0x449b4d <process_main+12157>, 0x449ba0 <process_main+12240>, 
          0x44d630 <process_main+27232>, 0x449974 <process_main+11684>, 0x449ca0 <process_main+12496>, 
          0x446fbd <process_main+1005>, 0x44b85c <process_main+19596>, 0x44b80a <process_main+19514>, 
          0x44b8a0 <process_main+19664>, 0x44d67c <process_main+27308>, 0x44cc2c <process_main+24668>, 
          0x44ace2 <process_main+16658>, 0x44b316 <process_main+18246>, 0x44b196 <process_main+17862>, 
          0x44d778 <process_main+27560>, 0x44cc80 <process_main+24752>, 0x44cd28 <process_main+24920>, 
          0x4476ca <process_main+2810>, 0x4476fa <process_main+2858>, 0x447728 <process_main+2904>, 
          0x447756 <process_main+2950>, 0x447787 <process_main+2999>, 0x4477b8 <process_main+3048>, 
          0x4477e8 <process_main+3096>, 0x447818 <process_main+3144>, 0x44b3ea <process_main+18458>, 
          0x4478f3 <process_main+3363>, 0x44c222 <process_main+22098>, 0x44791b <process_main+3403>, 
          0x44c246 <process_main+22134>, 0x447847 <process_main+3191>, 0x44787b <process_main+3243>, 
          0x4478b7 <process_main+3303>, 0x44d9e3 <process_main+28179>, 0x44962c <process_main+10844>, 
          0x4496dc <process_main+11020>, 0x4496ea <process_main+11034>, 0x44ad7b <process_main+16811>, 
          0x44a179 <process_main+13737>, 0x447942 <process_main+3442>, 0x44795a <process_main+3466>, 
          0x447977 <process_main+3495>, 0x4498fd <process_main+11565>, 0x449921 <process_main+11601>, 
          0x447993 <process_main+3523>, 0x4479b0 <process_main+3552>, 0x4498cf <process_main+11519>, 
          0x4498f3 <process_main+11555>, 0x44a711 <process_main+15169>, 0x44a6f3 <process_main+15139>, 
          0x44aeb2 <process_main+17122>, 0x44ae94 <process_main+17092>, 0x447017 <process_main+1095>, 
          0x44a0cf <process_main+13567>, 0x44c269 <process_main+22169>, 0x449807 <process_main+11319>, 
          0x44992b <process_main+11611>, 0x447120 <process_main+1360>, 0x447190 <process_main+1472>, 
          0x447153 <process_main+1411>, 0x446fce <process_main+1022>, 0x4470ed <process_main+1309>, 
          0x4470c9 <process_main+1273>, 0x44c360 <process_main+22416>, 0x44c2c0 <process_main+22256>, 
          0x44c310 <process_main+22336>, 0x44b6b1 <process_main+19169>, 0x44b660 <process_main+19088>, 
          0x44d4ba <process_main+26858>, 0x44d467 <process_main+26775>, 0x44da32 <process_main+28258>, 
          0x4496f7 <process_main+11047>, 0x4497a8 <process_main+11224>, 0x4497be <process_main+11246>, 
          0x4479cc <process_main+3580>, 0x447a59 <process_main+3721>, 0x44a9a7 <process_main+15831>, 
          0x446ffc <process_main+1068>, 0x44abc8 <process_main+16376>, 0x44a653 <process_main+14979>...}
#3  0x00000000004f5489 in sched_thread_func (vesdp=0x7fec19642100) at beam/erl_process.c:8906
        callbacks = {arg = 0x7fec19641c00, wakeup = 0x4fa030 <thr_prgr_wakeup>, 
          prepare_wait = 0x4f50a0 <thr_prgr_prep_wait>, wait = 0x4f6350 <thr_prgr_wait>, 
          finalize_wait = 0x4f5080 <thr_prgr_fin_wait>}
        esdp = 0x7fec19642100
        no = 1
#4  0x000000000067806f in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
        result = 0
        c = 0 '\000'
        res = <optimized out>
        twd = <optimized out>
        thr_func = 0x4f5370 <sched_thread_func>
        arg = 0x7fec19642100
        tsep = 0x7fec1a400100
#5  0x00007fec5b4be184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#6  0x00007fec5afe303d in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

We will try to get the debug symbols to have a better view.

@essen

essen commented May 1, 2018

Running into various issues so far. Anyway, details are fuzzy, but the crash only occurs when there is a uid change just before starting RabbitMQ; it doesn't occur without the rabbitmq-server script (or at least I couldn't reproduce it that way yet).

Whatever happens, I believe catching exceptions around here [1] and returning an empty list to Erlang would solve our issue. Not enough time today, but this is worth experimenting with next: if this is a transient issue at startup then catching is a good idea, and if the issue is something different then being able to inspect the VM state would help.

[1] https://github.com/deadtrickster/prometheus_process_collector/blob/master/c_src/prometheus_process_collector_nif.cc#L43

@deadtrickster
Owner

hmm, I wonder what's so special about your environment.

when there is a uid change just before starting RabbitMQ

How do you mean?

Also, about this being a startup-only issue: why is this function called on startup? Or does it just coincide with scraping?

@essen

essen commented May 1, 2018

It uses start-stop-daemon to start it as user vcap instead of user root.

For the other questions I don't know yet. I think it coincides with scraping yes.

@deadtrickster
Owner

Anything special about this vcap user? start-stop-daemon simply calls setuid AFAIK; I could try to check this by starting as root and calling setuid in on_load.

@gerhard
Author

gerhard commented May 2, 2018

Nothing special about the vcap user, it's the equivalent of ubuntu or debian:

id vcap
uid=1000(vcap) gid=1000(vcap) groups=1000(vcap),4(adm),30(dip),44(video),46(plugdev),1003(google-sudoers)

We use vcap following the good practice of not running services as root. This is the entire process tree:

init─┬─auditd─┬─audispd───{audispd}
     │        └─{auditd}
     ├─beam.smp(vcap)─┬─erl_child_setup
     │                └─81*[{beam.smp}]
     ├─cron
     ├─dhclient
     ├─epmd(vcap)
     ├─6*[getty]
     ├─google_accounts
     ├─google_clock_sk
     ├─google_ip_forwa
     ├─netdata(netdata)─┬─apps.plugin(root)
     │                  ├─bash
     │                  ├─python───{python}
     │                  └─13*[{netdata}]
     ├─route-registrar(vcap)─┬─2*[route_registrar─┬─route_registrar───perl]
     │                       │                    └─tee───route_registrar───logger]
     │                       └─8*[{route-registrar}]
     ├─rpc.idmapd
     ├─rpc.statd(statd)
     ├─rpcbind
     ├─rsyslogd(syslog)───3*[{rsyslogd}]
     ├─runsvdir─┬─runsv─┬─bosh-agent───13*[{bosh-agent}]
     │          │       └─svlogd
     │          └─runsv─┬─monit───{monit}
     │                  └─svlogd
     ├─sshd───sshd───sshd(bosh_514cb0a8bcd048a)───bash───pstree
     ├─systemd-udevd
     ├─upstart-file-br
     ├─upstart-socket-
     └─upstart-udev-br

monit runs as root and in this case supervises route-registrar & beam.smp. epmd starts implicitly, but its uid is already vcap.

This is how we start RabbitMQ: https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/master/jobs/rabbitmq-server/templates/bin/_start_rabbitmq-server

This is the exact start-stop-daemon.c that we use: https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/master/src/start-stop-daemon-1.9.18/start-stop-daemon.c

Can you see a problem with this approach of starting rabbitmq-server for prometheus_process_collector?

@deadtrickster
Owner

No, I should probably try changing the uid myself. The problem with the suggested catch is that SIGSEGV is a signal, not a C++ exception, so try...catch won't work. A signal handler can be set up, but the recovery strategy is unclear. Now I'm really intrigued about what's going on.

@gerhard
Author

gerhard commented May 2, 2018

Thanks for digging into this, let me know if there is anything that I can help with.

gerhard added a commit to rabbitmq/rabbitmq-server-boshrelease that referenced this issue May 18, 2018
prometheus_process_collector is crashing the entire Erlang VM under the
following scenarios:
* when the Erlang VM is restarted
* when disabling rabbitmq_management plugin
* when disabling prometheus_rabbitmq_exporter

An issue is open (deadtrickster/prometheus_process_collector#9); we will
consider re-adding it when/if that gets addressed, or when switching to
bpm-release, whichever happens first.

prometheus.erl is being compiled explicitly, since the latest v3 exposes
extra information about Erlang VM allocators (see
deadtrickster/prometheus.erl#75). prometheus.erl v3 is required since
it's compatible with both RabbitMQ 3.6.x & 3.7.x, as well as Erlang 20.0
and above. For more details, see
deadtrickster/prometheus.erl#75 (comment)

For Erlang versions prior to 20, a pre-compiled version of
prometheus.erl v3.4.x will be used instead.
@gerhard
Author

gerhard commented May 18, 2018

I've stopped using prometheus_process_collector for now, I'll be looking into bridging netdata with prometheus instead. Thanks for your help!

@deadtrickster
Owner

Yeah, I see. Looks like I just don't have time for this; let's leave this issue open.

@gerhard
Author

gerhard commented May 18, 2018

Not a problem, didn't want this to become a drag, happy for you to get around to this in your own time.

@letrec

letrec commented Nov 4, 2022

I believe the root cause is throwing C++ exceptions. This is not what a well-behaved NIF should do.

https://github.com/deadtrickster/prometheus_process_collector/blob/master/c_src/prometheus_process_info_linux.cc#L76

There are other places in the code that do the same.

@letrec

letrec commented Nov 4, 2022

I think returning some well-defined value (like -1) in such situations would be preferable to crashing the VM.
