Tool-induced Fence function pointer is null #233
Comments
Please improve the description and use some judgment about what is actually needed for a minimal reproducer.
Got it, thanks. I have improved the description and will add action items as well. For a minimal reproducer, yes: I will look into an even simpler Kokkos program. I think this problem occurs with any Kokkos program that has a single invocation of Kokkos::parallel_for(): the tool-induced fence function gets called from the kokkosp_begin_parallel_for() function in the Kokkos Tools sampler. (I could trigger it by setting the number of outer iterations in stream to 1 and using just one kernel in stream, e.g., "add", but I will look at making a standalone reproducer.)
What does the backtrace look like?
@masterleinad Yes, getting this soon - thanks!
@masterleinad Here is the backtrace of the stream benchmark's copy parallel_for, taken with gdb on Perlmutter. I got this by removing the exit(1) taken in the null-pointer case (i.e., just letting the code run until it fails with a segmentation fault). The run uses only the Kokkos serial backend with 200 iterations of the stream benchmark restricted to the copy kernel, i.e., a single Kokkos::parallel_for iterating 200 times. I am using the Kokkos develop branch, together with the use-probability-sampling branch on my fork (github.com/vlkale/kokkos-tools/tree/use-probability-sampling), i.e., Kokkos Tools PR #181. From the backtrace I cannot see in depth what the problem could be, but one can see that it goes through Kokkos core. Having looked at this, I now think the problem is in my tool connector and the way I have initialized the pointers, and not in Kokkos core. I have reviewed the basics of C++ function pointers, and I think David Poliakoff's code for the tool_invoked fence in Kokkos_Profiling.hpp and Kokkos_Profiling_C_Interface.h is right (though a bit more documentation might help). Before I show the run with gdb output, here are the modules loaded on Perlmutter.
This got resolved today with @crtrott. The problem is not in Kokkos core, in particular, As a separate note: it could be useful to have better documentation in Kokkos Tools to explain to a Kokkos Tools tool connector developer that the type See PR #213.
The tool-induced fence function pointer is null when it is accessed through the ToolsProgrammingInterface from within a Kokkos Tools connector.
This is currently a problem only for the randomized sampler (a new feature to be added, shown in PR #213), so it is not a critical bug per se. However, leaving it unfixed limits new Kokkos Tools functionality in which Kokkos Tools callbacks invoke Kokkos core functionality. It does not affect Kokkos core by itself, and it does not affect any other existing tool connectors. If you want to use tool_invoked_fence() in your Kokkos Tools connector, please be aware that it may not work properly at present.
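Until this is fixed, a connector can at least avoid the segmentation fault by checking the stored pointer before calling it. The following is a minimal sketch of that guard, not the Kokkos Tools API: `FenceFn`, `ToolInterface`, `provide_interface`, and `invoke_tool_fence` are hypothetical stand-ins for the types and callbacks declared in Kokkos_Profiling_C_Interface.h, and the device id argument is an assumption.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the tool-invoked fence function pointer type.
using FenceFn = void (*)(std::uint32_t);

// Hypothetical stand-in for the interface struct handed to a connector.
struct ToolInterface {
  FenceFn fence = nullptr;
};

// Connector-side copy of the interface; the pointer stays null until
// provide_interface() has been called with a valid function.
static ToolInterface g_interface;

// Assumed to be called once, when the interface is handed to the tool.
void provide_interface(const ToolInterface& iface) { g_interface = iface; }

// Calls the fence only if a function was actually provided.
// Returns false (and does nothing else) when the pointer is still null.
bool invoke_tool_fence(std::uint32_t device_id) {
  if (g_interface.fence == nullptr) {
    std::fprintf(stderr, "tool-invoked fence is null; skipping fence\n");
    return false;
  }
  g_interface.fence(device_id);
  return true;
}
```

A guard like this only turns the crash into a diagnosable message; it does not explain why the pointer was never set.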
The problem appears when I run the stream benchmark with the serial backend on Perlmutter or on a 2017 MacBook Pro running macOS, but not on a 2022 MacBook Pro. It also doesn't seem to arise for the OpenMP+CUDA backend on Perlmutter, though this may be serendipitous and I am looking into it.
Here is the output from the reproducer.
When only the sampling skip rate is set and no randomized sampling is done (so the tool-invoked fence is not used), the sampler works correctly.
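To make the two modes concrete, here is a rough sketch of the difference between skip-rate sampling and randomized sampling. It is illustrative only and is not the sampler code from the PR: the function names and the choice of `std::mt19937` are assumptions.

```cpp
#include <cstdint>
#include <random>

// Hypothetical: deterministic sampling, taking every skip_rate-th kernel
// invocation (invocation 0, skip_rate, 2 * skip_rate, ...).
// A skip_rate of 0 disables sampling.
bool sample_by_skip_rate(std::uint64_t invocation, std::uint64_t skip_rate) {
  return skip_rate != 0 && invocation % skip_rate == 0;
}

// Hypothetical: randomized sampling, taking each invocation independently
// with probability p, where p is expected to lie in [0, 1].
bool sample_by_probability(std::mt19937& rng, double p) {
  std::bernoulli_distribution coin(p);
  return coin(rng);
}
```

In this issue, only the randomized path goes on to call the tool-invoked fence, which is why the skip-rate path is unaffected.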
Currently, I think the problem comes from Kokkos_Profiling.hpp, where one can see that there is no tool-induced fence function, as shown in the screenshot of the code file below.
I am looking into this and will update the Issue as I find more.