Question: How to download and use gcore extension to view user space stack? #1

Open
sandeepmvd opened this issue Dec 18, 2020 · 15 comments

@sandeepmvd

I am looking for the gcore extension to use alongside crash, so that I can generate a core dump of a process and use gdb to analyse its user-space stack. If gcore has been replaced by a newer extension, please let me know.

@k-hagio
Contributor

k-hagio commented Dec 18, 2020

You can download the latest gcore (crash-gcore-command-1.6.1.tar.gz) from:
https://github.com/crash-utility/crash-extensions
-> crash-gcore-command-1.6.1.tar.gz -> Download
or
https://crash-utility.github.io/extensions.html
-> crash-gcore-command-1.6.1.tar.gz

The latter page also has usage instructions. If you have any questions, please let us know.

@sandeepmvd
Author

sandeepmvd commented Dec 18, 2020

Thanks a lot.
However, when I try to use it, crash aborts with the following error:

crash> extend ~/kernel/crash-gcore-command-1.6.1/gcore.so
/home/mk/kernel/crash-gcore-command-1.6.1/gcore.so: shared object loaded
crash> gcore 4725
*** Error in `crash': free(): invalid next size (normal): 0x0000559cecf22b50 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7f83f8614bfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7f83f861afc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x7780e)[0x7f83f861b80e]
crash(freebuf+0x1b1)[0x559ce820c881]
/home/mk/kernel/crash-gcore-command-1.6.1/gcore.so(+0x53eb)[0x7f83e954c3eb]
/home/mk/kernel/crash-gcore-command-1.6.1/gcore.so(+0x56cd)[0x7f83e954c6cd]
/home/mk/kernel/crash-gcore-command-1.6.1/gcore.so(gcore_coredump+0x39a)[0x7f83e954abb4]
/home/mk/kernel/crash-gcore-command-1.6.1/gcore.so(+0xf18b)[0x7f83e955618b]
/home/mk/kernel/crash-gcore-command-1.6.1/gcore.so(cmd_gcore+0x272)[0x7f83e9555f43]
crash(exec_command+0x33a)[0x559ce8200dca]
crash(main_loop+0x1aa)[0x559ce8200fca]
crash(+0x356683)[0x559ce8453683]
crash(catch_errors+0x8a)[0x559ce845221a]
crash(+0x3577a6)[0x559ce84547a6]
crash(catch_errors+0x8a)[0x559ce845221a]
crash(gdb_main_entry+0x5e)[0x559ce8454b9e]
crash(main+0x8e1)[0x559ce81ff2b1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f83f85c42e1]
crash(_start+0x2a)[0x559ce820057a]
======= Memory map: ========
...
...
Aborted

It generates a partial core file, but the file is not usable.

GDB says:

warning: Couldn't find general-purpose registers in core file.

#0  <unavailable> in ?? ()
(gdb) bt
#0  <unavailable> in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
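
GDB's warning typically means it found no NT_PRSTATUS note, which is where a Linux core file keeps each thread's general-purpose registers. A minimal sketch of checking a note segment for that note follows. It assumes the segment's bytes have already been read into memory and that notes use the 4-byte padding Linux core files use; `notes_contain_type` and `note_align` are hypothetical names, not part of crash, gcore, or GDB.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the 4-byte boundary used for note names and descriptors. */
static size_t note_align(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Hypothetical helper: return 1 if the note segment `notes` of `len`
 * bytes contains a complete note of the given type, 0 otherwise.
 * A note whose name or descriptor is cut off by the end of the buffer
 * ends the scan and is not counted.
 */
static int notes_contain_type(const unsigned char *notes, size_t len,
                              unsigned type)
{
    size_t off = 0;

    assert(notes != NULL || len == 0);

    while (len - off >= sizeof(Elf64_Nhdr)) {
        Elf64_Nhdr nh;
        size_t body;

        memcpy(&nh, notes + off, sizeof nh);   /* avoid unaligned access */
        body = note_align(nh.n_namesz) + note_align(nh.n_descsz);
        off += sizeof nh;

        if (body > len - off)
            return 0;                          /* truncated note */
        if (nh.n_type == type)
            return 1;
        off += body;
    }
    return 0;
}
```

Calling `notes_contain_type(buf, len, NT_PRSTATUS)` on the PT_NOTE contents of the partial core would show whether the register note was written before the abort.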

@k-hagio
Contributor

k-hagio commented Dec 18, 2020

@d-hatayama san, this looks like a gcore-side issue to me. (buffer overflow?)
Should we discuss it on the crash utility mailing list?

@d-hatayama

> @d-hatayama san, this looks like a gcore-side issue to me. (buffer overflow?)

Me too.

> Should we discuss it on the crash utility mailing list?

Either is fine with me 😄

@Binary-Nerd, could you tell me the following? It would help me reproduce the issue.

  • distribution
  • kernel version (rpm/deb package version is preferable)
  • target process (rpm/deb package version is preferable)

@sandeepmvd
Author

@d-hatayama san,
Here are the details:

distribution: Debian stretch
kernel version: 4.9-rt
target process: a user-space process specific to my application.

Please let me know if there is a test or anything else I can do.

@d-hatayama

> target process: a user-space process specific to my application.

Hmm, so it is your own application.

@Binary-Nerd The best option would be for you to share the problematic vmcore. If I can look into the vmcore directly, I can figure out what is going on quickly. Of course, please avoid including sensitive data in any vmcore you share.

@sandeepmvd
Author

@d-hatayama san,
Sorry, but I am not allowed to share the vmcore. I can certainly assist in debugging, though.

@d-hatayama

Sorry, I had not left any notes here.

I tried to check whether the issue is reproducible on Debian stretch with a 4.9-rt kernel, but I did not see it. I also examined crash and the crash gcore command with valgrind, looking at basic malloc/free memory management, and found nothing that appears relevant to the issue.

I think the issue likely depends on characteristics of the system and application from which the problematic kernel crash dump was collected; for example, a multi-threaded program and/or some particular use of floating-point registers?

Anyway, for additional information, could you provide the debug messages printed while the gcore command runs? You can enable debug messages as follows:

crash> set debug 15
crash> gcore
<readmem: ffff8bb64ff05a00, KVADDR, "fill_task_struct", 5760, (ROE), 558dc981fa90>
<read_diskdump: addr: ffff8bb64ff05a00 paddr: ff05a00 cnt: 1536>
read_diskdump: paddr/pfn: ff05a00/ff05 -> cache physical page: ff05000
...<snip; many debug messages>...

Also, I added valgrind support to crash during the investigation of this issue. If possible, could you try valgrind and send me the result?

To enable the feature, build the crash command as:

# make valgrind

Then run the crash command under valgrind as:

# valgrind ./crash vmlinux vmcore

This feature has not been officially released, so you need to use the current master version.

@k-hagio
Contributor

k-hagio commented Feb 18, 2021

> This feature has not been officially released, so you need to use the current master version.

Ah, I gave it my ack, but it is still under review, so you need to apply the patchset:
https://listman.redhat.com/archives/crash-utility/2021-January/msg00002.html

@sandeepmvd
Author

sandeepmvd commented Jun 14, 2021

@d-hatayama san

The issue is still present on the latest master.
I tried to run gcore on a process and it aborts.

I ran crash under valgrind; the output follows:

crash> gcore 14264
==22826== Invalid write of size 2
==22826==    at 0x4C3269B: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22826==    by 0x1D485F4F: fill_files_note (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48439F: fill_write_thread_core_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D4846CC: fill_write_note_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D482BB3: gcore_coredump (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48E18A: do_gcore (gcore.c:317)
==22826==    by 0x1D48DF42: cmd_gcore (gcore.c:258)
==22826==    by 0x211619: exec_command (main.c:892)
==22826==    by 0x211819: main_loop (main.c:839)
==22826==    by 0x476442: captured_command_loop (main.c:258)
==22826==    by 0x475079: catch_errors (exceptions.c:557)
==22826==    by 0x477515: captured_main (main.c:1064)
==22826==  Address 0x27a5a9c0 is 0 bytes after a block of size 135,168 alloc'd
==22826==    at 0x4C2DBC5: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22826==    by 0x21E596: getbuf (tools.c:6062)
==22826==    by 0x1D485C3E: fill_files_note (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48439F: fill_write_thread_core_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D4846CC: fill_write_note_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D482BB3: gcore_coredump (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48E18A: do_gcore (gcore.c:317)
==22826==    by 0x1D48DF42: cmd_gcore (gcore.c:258)
==22826==    by 0x211619: exec_command (main.c:892)
==22826==    by 0x211819: main_loop (main.c:839)
==22826==    by 0x476442: captured_command_loop (main.c:258)
==22826==    by 0x475079: catch_errors (exceptions.c:557)
==22826== 
==22826== Invalid write of size 1
==22826==    at 0x4C326CB: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22826==    by 0x1D485F4F: fill_files_note (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48439F: fill_write_thread_core_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D4846CC: fill_write_note_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D482BB3: gcore_coredump (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48E18A: do_gcore (gcore.c:317)
==22826==    by 0x1D48DF42: cmd_gcore (gcore.c:258)
==22826==    by 0x211619: exec_command (main.c:892)
==22826==    by 0x211819: main_loop (main.c:839)
==22826==    by 0x476442: captured_command_loop (main.c:258)
==22826==    by 0x475079: catch_errors (exceptions.c:557)
==22826==    by 0x477515: captured_main (main.c:1064)
==22826==  Address 0x27a5a9d8 is 24 bytes after a block of size 135,168 in arena "client"
==22826== 

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 135232, hi = 7596498840077020928.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==22826==    at 0x38083828: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x38083944: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x38083AD1: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x38091394: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x3807CF23: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x3807B7A3: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x3807F9DA: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x3807AD3A: ??? (in /usr/lib/valgrind/memcheck-amd64-linux)
==22826==    by 0x8043189CA: ???
==22826==    by 0x802FB5F2F: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 22826)
==22826==    at 0x4C32643: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22826==    by 0x1D485F4F: fill_files_note (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48439F: fill_write_thread_core_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D4846CC: fill_write_note_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D482BB3: gcore_coredump (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826==    by 0x1D48E18A: do_gcore (gcore.c:317)
==22826==    by 0x1D48DF42: cmd_gcore (gcore.c:258)
==22826==    by 0x211619: exec_command (main.c:892)
==22826==    by 0x211819: main_loop (main.c:839)
==22826==    by 0x476442: captured_command_loop (main.c:258)
==22826==    by 0x475079: catch_errors (exceptions.c:557)
==22826==    by 0x477515: captured_main (main.c:1064)
==22826==    by 0x475079: catch_errors (exceptions.c:557)
==22826==    by 0x4778FD: gdb_main (main.c:1079)
==22826==    by 0x4778FD: gdb_main_entry (main.c:1099)
==22826==    by 0x20FAC0: main (main.c:720)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

With set debug 15 there are too many log lines to paste here. If the above is not sufficient, I will share the debug logs as well.
Hopefully we can fix this quickly.

@sandeepmvd
Author

Attaching gcore debug log file
gcore_debug_15.log

@sandeepmvd
Author

@d-hatayama san,
Sorry to bother you. I was just wondering whether you have been able to look into the issue.
Thank you in advance.

@d-hatayama

@Binary-Nerd @k-hagio Sorry for the delayed response. I'm a little tied up now but I think I can look into the issue this weekend.

@sandeepmvd
Author

@d-hatayama Thank you for looking into it.
If you need additional debug logs or anything else, please let me know.

@d-hatayama

crash> gcore 14264
==22826== Invalid write of size 2
==22826== at 0x4C3269B: memmove (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22826== by 0x1D485F4F: fill_files_note (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D48439F: fill_write_thread_core_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D4846CC: fill_write_note_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D482BB3: gcore_coredump (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D48E18A: do_gcore (gcore.c:317)
==22826== by 0x1D48DF42: cmd_gcore (gcore.c:258)

fill_files_note() creates the NT_FILE note. There are two
memmove() calls in fill_files_note(), and one of them performs the
invalid write.
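
For context, a minimal sketch of the NT_FILE payload layout, assuming the format described in the Linux kernel's ELF core-dump code: a count, a page size, then one (start, end, file offset) triple per mapping, followed by the NUL-terminated path names. The sizing rule shown (64 bytes per mapping, rounded up to a page) is an assumption made here for illustration; whether gcore.so sizes its buffer this way is not confirmed in this thread, and all helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset at which the path-name area starts
 * when the header is laid out for `map_count` mappings. */
static size_t nt_file_names_offset(size_t map_count)
{
    return (2 + 3 * map_count) * sizeof(long);
}

/* Hypothetical helper: assumed initial buffer size, i.e. 64 bytes per
 * mapping rounded up to a multiple of page_size (a power of two). */
static size_t nt_file_initial_size(size_t map_count, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    size_t size = map_count * 64;
    return (size + page_size - 1) & ~(page_size - 1);
}

/* Hypothetical helper: bytes left for path names once the header and
 * the triples are accounted for; 0 if the header alone does not fit. */
static size_t nt_file_name_space(size_t map_count, size_t page_size)
{
    size_t size = nt_file_initial_size(map_count, page_size);
    size_t ofs = nt_file_names_offset(map_count);

    return size > ofs ? size - ofs : 0;
}
```

Under this assumed rule, 2,100 mappings with 4 KiB pages would give a 135,168-byte buffer, the block size Valgrind reported, and leave roughly 40 bytes per path name where long is 8 bytes. That may or may not be how the buffer in this report was sized.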

==22826== Address 0x27a5a9c0 is 0 bytes after a block of size 135,168 alloc'd
==22826== at 0x4C2DBC5: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22826== by 0x21E596: getbuf (tools.c:6062)
==22826== by 0x1D485C3E: fill_files_note (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D48439F: fill_write_thread_core_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D4846CC: fill_write_note_info (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D482BB3: gcore_coredump (in /home/manty/kernel/crash-gcore-command-1.6.1/gcore.so)
==22826== by 0x1D48E18A: do_gcore (gcore.c:317)
==22826== by 0x1D48DF42: cmd_gcore (gcore.c:258)

The message "0 bytes after a block of size 135,168" means the write
landed immediately past the end of the 135,168-byte buffer that
fill_files_note() allocated via getbuf(). The memory just beyond a
heap block typically holds the allocator's bookkeeping data, so such
a write can corrupt it.

Based on this information, there appears to be a buffer overrun at the
memmove() above that has broken some of malloc()'s data
structures. This is consistent with the initial report that gcore
ended in abort() via free() due to "invalid next size".

The simplest workaround for @Binary-Nerd is to revert the commit that
implemented NT_FILE:

# git revert c52b6ed92937bec783174474dd069926fc4aedd4

There should be no problem without NT_FILE, because gdb does not need
it to unwind the user-space stack, and you can still see the files
mapped into a given task with crash's vm command.

I cannot tell what the root cause is from the current
information. My local test set contains a test case where many
files are mapped, up to 0xffff, which is defined as PN_XNUM in elf.h,
and it passes. This means that merely mapping many files is
not sufficient to reproduce this issue.

There are two memmove() calls in fill_files_note(). Could you check
which one results in the invalid write?

 921 static int
 922 fill_files_note(struct elf_note_info *info, struct task_context *tc,
 923                struct memelfnote *memnote)
 924 {
...snip...
 980                file_buf = fill_file_cache(vm_file);
 981                dentry = ULONG(file_buf + OFFSET(file_f_dentry));
 982                if (dentry) {
 983                        fill_dentry_cache(dentry);
 984                        if (VALID_MEMBER(file_f_vfsmnt)) {
 985                                vfsmnt = ULONG(file_buf + OFFSET(file_f_vfsmnt));
 986                                get_pathname(dentry, buf, BUFSIZE, 1, vfsmnt);
 987                        } else {
 988                                get_pathname(dentry, buf, BUFSIZE, 1, 0);
 989                        }
 990                }
 991
 992                /* get_pathname() fills at the end, move name down */
 993                n = strlen(buf)*sizeof(char) + 1;
 994                remaining -= n;
 995                memmove(name_curpos, buf, n);
 996                progressf("FILE %s\n", name_curpos);
 997                name_curpos += n;
 998
 999                *start_end_ofs++ = vm_start;
1000                *start_end_ofs++ = vm_end;
1001                *start_end_ofs++ = vm_pgoff;
1002                count++;
1003        }
1004
1005        /* Now we know exact count of files, can store it */
1006        data[0] = count;
1007        data[1] = size;
1008
1009        /*
1010         * Count usually is less than map_count,
1011         * we need to move filenames down.
1012         */
1013        n = map_count - count;
1014        if (n != 0) {
1015                unsigned shift_bytes = n * 3 * sizeof(data[0]);
1016                memmove(name_base - shift_bytes, name_base,
1017                        name_curpos - name_base);
1018                name_curpos -= shift_bytes;
1019        }

If it is the first one, does strlen() at line 993 return a large value?
If so, could you show me the output of crash's vm command?
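
One way to tell the two cases apart, or to guard against the overrun while investigating, is to compare the name length with the space left before the first memmove(). The sketch below is a hypothetical stand-alone helper, not a patch to gcore.so; `append_name` and its parameters are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the NUL-terminated `name` to *curpos if it
 * fits in the *remaining bytes of the buffer. On success, advance
 * *curpos, shrink *remaining, and return 0. If the name does not fit,
 * write nothing and return -1 so the caller can grow the buffer or stop.
 */
static int append_name(char **curpos, size_t *remaining, const char *name)
{
    size_t n;

    assert(curpos && *curpos && remaining && name);

    n = strlen(name) + 1;       /* include the terminating NUL */
    if (n > *remaining)
        return -1;              /* would write past the end of the buffer */

    memmove(*curpos, name, n);
    *curpos += n;
    *remaining -= n;
    return 0;
}
```

A -1 return on some mapping would point at the first memmove(); if every name fits, the shift performed by the second memmove() becomes the likelier suspect.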
