Buffer overflow in PotentialTree with OpenMP #93

Closed
rtobar opened this issue Jun 23, 2021 · 3 comments
Labels
bug Something isn't working

Comments


rtobar commented Jun 23, 2021

When running with the inputs and configuration from #87/#88, using 4 OpenMP threads and 1 MPI rank, and compiling with clang's address sanitizer, I got the following output:

[0000] [  87.191] [debug] unbind.cxx:284 Unbinding 1521 groups ...
=================================================================
==182466==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020001e99a8 at pc 0x000000a457dc bp 0x7ff6f2c55470 sp 0x7ff6f2c55468
WRITE of size 8 at 0x6020001e99a8 thread T5
    #0 0xa457db in .omp_outlined._debug__.7 /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:1240:19
    #1 0xa46cfb in .omp_outlined..8 /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:1228:1
    #2 0x7ff7114e3ca2 in __kmp_invoke_microtask (/usr/lib/x86_64-linux-gnu/libomp.so.5+0xabca2)
    #3 0x7ff7114789c2  (/usr/lib/x86_64-linux-gnu/libomp.so.5+0x409c2)
    #4 0x7ff7114775f9  (/usr/lib/x86_64-linux-gnu/libomp.so.5+0x3f5f9)
    #5 0x7ff7114cb149  (/usr/lib/x86_64-linux-gnu/libomp.so.5+0x93149)
    #6 0x7ff71153f58f in start_thread nptl/pthread_create.c:463:8
    #7 0x7ff71134c222 in clone misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95

0x6020001e99a8 is located 16 bytes to the right of 8-byte region [0x6020001e9990,0x6020001e9998)
allocated by thread T0 here:
    #0 0x53ca8d in operator new[](unsigned long) (/home/rtobar/scm/git/VELOCIraptor-STF/builds/88-debug-addrsan/stf+0x53ca8d)
    #1 0xa40b32 in PotentialTree(Options&, long long, NBody::Particle*&, NBody::KDTree*&) /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:1185:11
    #2 0xa3db16 in Potential(Options&, long long, NBody::Particle*) /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:960:5
    #3 0xa5357e in CalculatePotentials(Options&, NBody::Particle**, long long&, long long*) /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:434:13
    #4 0xa38e98 in Unbind(Options&, NBody::Particle**, long long&, long long*, long long*, long long**, int) /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:793:5
    #5 0xa48efd in CheckUnboundGroups(Options, long long, NBody::Particle*, long long&, long long*&, long long*, long long**, int, long long*) /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:350:18
    #6 0x8b040f in SearchBaryons(Options&, long long&, NBody::Particle*&, long long, std::vector<NBody::Particle, std::allocator<NBody::Particle> >&, long long*&, long long&, long long&, int, int, PropData*) /home/rtobar/scm/git/VELOCIraptor-STF/src/search.cxx:3840:13
    #7 0x547208 in main /home/rtobar/scm/git/VELOCIraptor-STF/src/main.cxx:463:13
    #8 0x7ff71125bcb1 in __libc_start_main csu/../csu/libc-start.c:314:16

Thread T5 created by T0 here:
    #0 0x4f770a in pthread_create (/home/rtobar/scm/git/VELOCIraptor-STF/builds/88-debug-addrsan/stf+0x4f770a)
    #1 0x7ff7114ca823  (/usr/lib/x86_64-linux-gnu/libomp.so.5+0x92823)

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/rtobar/scm/git/VELOCIraptor-STF/src/unbind.cxx:1240:19 in .omp_outlined._debug__.7
Shadow bytes around the buggy address:
  0x0c04800352e0: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c04800352f0: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c0480035300: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c0480035310: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c0480035320: fa fa fd fa fa fa fd fa fa fa fd fa fa fa 00 fa
=>0x0c0480035330: fa fa 00 fa fa[fa]fa fa fa fa fa fa fa fa fa fa
  0x0c0480035340: fa fa fa fa fa fa fa fa fa fa fa fa fa fa 00 fa
  0x0c0480035350: fa fa 00 fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480035360: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480035370: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480035380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==182466==ABORTING
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57249,1],0]
  Exit code:    1
--------------------------------------------------------------------------
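For illustration: the 8-byte allocation and the 8-byte write landing 16 bytes past its end are consistent with a one-element per-thread array of 8-byte elements being indexed with a thread id of 3. A minimal, hypothetical sketch of that failure mode (names, sizes and the num_threads clause are illustrative, not the actual PotentialTree code):

#include <omp.h>

int main() {
    int nthreads = 1;                        // wrong: the parallel region below runs 4 threads
    double *scratch = new double[nthreads];  // 8-byte allocation, as in the ASan report above

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        scratch[tid] = tid;                  // tid 1..3 write past the end of the buffer
    }

    delete[] scratch;
    return 0;
}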
rtobar added a commit that referenced this issue Jun 23, 2021
There were a couple of problems with the old version of this code.
Firstly, it had an unused maxthreads variable, which has now been
removed. Secondly, and more importantly, it used a complex and incorrect
way of calculating the number of threads (stored in the nthreads
variable) that would be used in new "omp parallel" blocks: instead of
directly calling omp_get_max_threads and writing its result into
nthreads (the new behavior) it opened an "omp parallel" block and called
omp_get_num_threads, only on the first thread of the newly created
thread team. The result was then written back to nthreads.

Although this seems fine, it creates a race condition, as there is no
guarantee that thread #0 in the newly formed team is the same thread
from which the team was created. This in turn means that the write to
nthreads is not necessarily visible immediately from the parent thread.
The nthreads value is later used to allocate memory to be used by
"omp parallel" regions; having a wrong value (indeed the original 1 to
which nthreads is initialised at declaration time) leads to
wrongly-sized memory blocks, and to invalid writes down the line.

We saw this problem when running with clang's address sanitizer, which
led us to find this particular bug in the code.

This problem was reported in #93.

Signed-off-by: Rodrigo Tobar <[email protected]>
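As a concrete illustration of the two patterns described in the commit message, here is a minimal sketch (not the actual unbind.cxx code; function names are invented for the example) of the old, indirect thread-count calculation next to the direct call that replaces it:

#include <omp.h>

// Old pattern: nthreads is only updated by thread 0 of a newly created team,
// from inside a parallel region.
int count_threads_old() {
    int nthreads = 1;
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) nthreads = omp_get_num_threads();
    }
    return nthreads;
}

// New behavior: ask the runtime directly, on the calling thread, how many
// threads a subsequent "omp parallel" region may use.
int count_threads_new() {
    return omp_get_max_threads();
}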

rtobar commented Jun 23, 2021

As mentioned in 511767e, this was a race condition in the calculation of the available OpenMP threads. With the commit above I don't get the errors anymore, so I'll merge to master straight away after double-checking that the Travis runs don't fail either.

rtobar added the bug label on Jun 23, 2021

rtobar commented Jun 23, 2021

BTW, I also thought that this might have been related to the unsolved problem of #53, but that's not the case: I can still reproduce #53 after applying this fix.


rtobar commented Jun 23, 2021

This has been solved now and the fix has been merged to the master branch.

rtobar closed this as completed Jun 23, 2021
rtobar added a commit that referenced this issue Jun 24, 2021
This is similar to the error detected in #93 and fixed in 511767e, but
with the difference that in this case it doesn't seem to cause issues in
practice: at this point of the execution, thread #0 of the newly created
team most probably corresponds to the main thread executing the rest of
the code, so the write to nthreads *is* visible.

Signed-off-by: Rodrigo Tobar <[email protected]>