Dynamic instrumentation and binary re-writing command-line tool.
timemory-run requires DynInst, which must be installed externally. DynInst has several third-party library dependencies, so it is highly recommended to use a package manager such as spack to install it.
Quick start to installing DynInst via spack:
git clone https://github.com/spack/spack.git
source ${PWD}/spack/share/spack/setup-env.sh
spack compiler find
spack external find
spack install dyninst
spack load -r dyninst
Once DynInst is installed, enable -DTIMEMORY_BUILD_TOOLS=ON -DTIMEMORY_USE_DYNINST=ON in CMake.
There are two execution modes: (1) runtime instrumentation and (2) binary rewriting. Runtime instrumentation temporarily patches an executable with timemory instrumentation; timemory-run can either attach to an existing process or launch the executable as a subprocess. Binary rewriting generates a new executable from an existing executable and cannot be applied to an existing process.
Runtime instrumentation generates more profiling info because the entire executable, along with its linked libraries, is fully loaded into memory, so instrumentation can be generated for function calls which exist in linked libraries. In binary rewriting mode, however, only the executable itself is loaded and functions which exist in a linked library cannot be instrumented directly: the executable only has a reference to those functions and cannot modify the symbol. Thus, one should choose runtime instrumentation when detailed profiles are desired, and binary rewriting for targeted analysis of a specific executable and/or library.
In general, binary rewriting is an excellent choice if you are only interested in profiling the function calls in your executable and/or library. Runtime instrumentation is an excellent choice for detailed profiling of a single process. It is generally not the ideal choice for distributed memory parallelism, e.g. MPI, UPC, UPC++.
Development Note: Currently, binary rewriting is slightly more stable than runtime instrumentation.
# general form to run exe as a subprocess
timemory-run <OPTIONS> -- <EXECUTABLE> <ARGS>
# example running exe "foo"
timemory-run -- ./foo
# general form to attach to a running executable
timemory-run <OPTIONS> -p <PID> -- <EXECUTABLE>
# example attaching to exe with PID of 3252
timemory-run -p 3252
In order to use the binary rewriting mode, specify an output file via the -o short option or the --output long option, followed by the name of the instrumented file to be generated. The target executable/library for instrumentation will be the first argument after the --.
timemory-run <OPTIONS> -o <OUTPUT_EXECUTABLE> -- <EXECUTABLE>
The example below creates a new instrumented executable (foo.inst) from an existing executable foo.
timemory-run -o foo.inst -- ./foo
The example below creates a new instrumented library (libomp.so) in the current working directory from the system /usr/lib/libomp.so.
timemory-run -o libomp.so -- /usr/lib/libomp.so
Default components:
wall_clock
The timemory-run executable has a -d/--default-components option for specifying which components to use for analysis. The available components can be viewed via the timemory-avail command line tool, and the -s option to that tool will display the valid string identifiers for these components. The -d/--default-components option is overridden by the environment variables TIMEMORY_TRACE_COMPONENTS in trace mode and TIMEMORY_COMPONENTS in region mode (see Region vs. Trace). The option can also be left blank or set to none, in which case the environment variable TIMEMORY_GLOBAL_COMPONENTS can be used to control the components in both trace and region mode. However, TIMEMORY_GLOBAL_COMPONENTS is a fallback environment variable and will be superseded by nearly any other component environment variable. When using the --mpip and/or --ompt command line options, these tools check for TIMEMORY_MPIP_COMPONENTS and TIMEMORY_OMPT_COMPONENTS, respectively, and in the absence of the corresponding environment variable, use TIMEMORY_GLOBAL_COMPONENTS. In other words, the modularity of timemory allows specific tools to collect their own sets of metrics, so each tool generally checks an environment variable unique to the tool and then searches a series of generic environment variables.
# binary rewrite w/ cpu_clock
timemory-run -d cpu_clock -o foo.inst -- ./foo
./foo.inst
# set the PAPI hardware counters
export TIMEMORY_PAPI_EVENTS="PAPI_TOT_CYC,PAPI_TOT_INS,PAPI_LST_INS"
# binary rewrite w/ wall_clock, peak_rss, and PAPI hardware counters
timemory-run -d wall_clock peak_rss papi_vector -o foo.inst -- ./foo
./foo.inst
# override default components
export TIMEMORY_TRACE_COMPONENTS="wall_clock, cpu_clock"
./foo.inst
# runtime instrumentation with a trip counter
timemory-run -d trip_count -- ./foo
# runtime instrumentation using TIMEMORY_GLOBAL_COMPONENTS environment variable
export TIMEMORY_GLOBAL_COMPONENTS="wall_clock, thread_cpu_clock"
timemory-run -d -- ./foo
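The fallback order described above (a tool-specific variable first, then TIMEMORY_GLOBAL_COMPONENTS) can be pictured with a small shell helper. This is only a sketch of the documented precedence: first_nonempty is a hypothetical helper, not part of timemory, and the real lookup inside the library may consult additional variables.

```shell
# Hypothetical helper (not part of timemory): print the first non-empty
# argument, mimicking "check the specific variable, then fall back".
first_nonempty() {
    for candidate in "$@"; do
        if [ -n "${candidate}" ]; then
            printf '%s\n' "${candidate}"
            return 0
        fi
    done
    # nothing was set
    return 1
}

# Example: which component list would the --mpip tool end up with,
# assuming only the two variables named in the text are consulted?
# first_nonempty "${TIMEMORY_MPIP_COMPONENTS:-}" "${TIMEMORY_GLOBAL_COMPONENTS:-}"
```

Passing the values as arguments keeps the helper free of indirect variable expansion, so it behaves the same in any POSIX shell.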
The toolkit used for dynamic instrumentation acts as a supervisor for the process when the toolkit launches an executable as a subprocess. The toolkit is not designed to forward the communicators, so no communication would occur if timemory-run were launched via mpirun, e.g. mpirun -np 2 timemory-run -- ./foo, because timemory-run launches foo via a fork/join operation and all communicator info is lost. Similarly, timemory-run -- mpirun -np 2 ./foo does not work because mpirun would be instrumented instead of foo, and even if foo could be identified as the executable to modify, the instrumentation would be quite complicated for numerous reasons. If runtime instrumentation is desired for an MPI process, the only current solution is to launch the MPI job, e.g. mpirun -np 2 ./foo, and attach to one of the processes.
# launch the MPI executable
mpirun -np 2 ./foo &
# get the process ID of one of the MPI ranks
PID=$(pgrep foo | head -n 1)
# attach timemory-run to this process ID
timemory-run -p ${PID}
NOTE: In the example above, we attach to the first of the two PIDs generated by mpirun. The rank which timemory-run attached to (rank 0) will be stopped, instrumented, and then will resume execution once instrumentation is complete. The second process (rank 1) will continue to execute until a synchronization is required with the instrumented rank. Thus, communication wait times measured between the instrumented rank and the non-instrumented rank(s) will be misleading.
Binary rewriting is the preferred method for instrumenting an executable or library which will be utilizing distributed memory parallelism. In order to use binary rewriting with an MPI process, use the command line option --mpi on any executable or library targeted for instrumentation. This option enables a GOTCHA wrapper around MPI_Init or MPI_Init_thread in order to delay the initialization of the timemory library until after one of these functions has been invoked in the application. If the executable/library dynamically links to the MPI library, binary rewriting will not instrument the MPI functions. In order to instrument dynamically linked function calls, one must either create a locally instrumented copy of libmpich.so or libopenmpi.so (as demonstrated above with libomp.so) or use the --mpip command line option. The --mpip command line option uses a pre-compiled set of MPI GOTCHA wrappers which are activated when the executable is launched.
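# binary rewrite using the pre-compiled MPI GOTCHA wrappers
timemory-run -o foo.inst --mpi --mpip -- ./foo
mpirun -np 2 ./foo.inst
# binary rewrite using a locally instrumented copy of the MPI library
timemory-run -o foo.inst --mpi -- ./foo
timemory-run -o libmpich.so --mpi -- /usr/lib/libmpich.so
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
mpirun -np 2 ./foo.inst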
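The export of LD_LIBRARY_PATH above is what lets the dynamic loader find the instrumented libmpich.so in the working directory before the system copy. If LD_LIBRARY_PATH happens to be unset, the plain $PWD:$LD_LIBRARY_PATH form leaves a trailing colon, i.e. an empty entry, which the loader treats as the current directory. A slightly more careful variant is sketched below; prepend_library_path is a hypothetical helper name, not something timemory provides.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH without
# leaving an empty entry when the variable was previously unset or empty.
prepend_library_path() {
    dir="$1"
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        LD_LIBRARY_PATH="${dir}:${LD_LIBRARY_PATH}"
    else
        LD_LIBRARY_PATH="${dir}"
    fi
    export LD_LIBRARY_PATH
}

# Example use before launching the rewritten executable:
# prepend_library_path "${PWD}"
# mpirun -np 2 ./foo.inst
```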
timemory-run --help will provide a help menu with regard to all the possible options. There are a few key concepts to understand, however.
Identifying a function for instrumentation has two components: a module name and a function name. The function name as an identifier is self-explanatory. The module is either the name of the file which contained the definition of the function when it was compiled or the name of the library which contains the symbol for the function.
In general:
- If function foo was compiled in foo.c and linked into libfoo.so (dynamic library)
  - Module name: libfoo.so
- If function foo was compiled in foo.c and linked into libfoo.a (static library)
  - Module name: foo.c
- If function foo was compiled in foo.c and linked directly into the executable
  - Module name: foo.c
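The naming rules in the list above can be condensed into a small lookup. This is only a restatement of those three cases as a sketch; module_name and its link-kind labels (shared, static, direct) are hypothetical and not part of timemory-run.

```shell
# Hypothetical helper: return the module name implied by the rules above.
#   $1 = how the function was linked: shared | static | direct
#   $2 = source file containing the definition (e.g. foo.c)
#   $3 = library name, only meaningful for "shared" (e.g. libfoo.so)
module_name() {
    link_kind="$1"
    source_file="$2"
    library="${3:-}"
    case "${link_kind}" in
        shared)        printf '%s\n' "${library}" ;;      # dynamic library: module is the library
        static|direct) printf '%s\n' "${source_file}" ;;  # otherwise: module is the source file
        *)             return 1 ;;                        # unknown link kind
    esac
}

# Example: module_name static foo.c libfoo.a
```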
Thus, timemory-run provides command line options which use regular expressions (regex) to permit explicit selection of which modules/functions to include, exclude, or the union of an exclude and include based on module names and/or function names. When an include option is present for either category (function or module), timemory-run defaults to excluding any functions or modules which do not match the include expression. In general, the exclude option should be used to remove instrumentation from unwanted functions/modules and the include option should be used for selecting specific functions/modules.
# include only function names which start with 'foo' or end with 'bar'
timemory-run -I '(^foo|bar$)' -- ./foo
# include only functions defined in modules 'libfoo.so' and 'bar.cpp'
timemory-run -MI libfoo.so bar.cpp -- ./foo
# exclude any functions starting with 'ompt_'
timemory-run -E '^ompt_' -- ./foo
# exclude any functions in libomptarget.so
timemory-run -ME 'libomptarget.so' -- ./foo
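Before running a rewrite, it can help to preview what an include or exclude expression would select from a list of candidate names. The sketch below uses grep -E as a stand-in matcher; this is an assumption, since the regex dialect and matching rules inside timemory-run may differ from POSIX extended regular expressions, although the simple anchored patterns used above behave the same way in both. The function names preview_include and preview_exclude are hypothetical.

```shell
# Hypothetical helpers: filter names (one per line on stdin) the way an
# include or exclude expression is described to behave above.
preview_include() {
    # keep only names matching the expression
    grep -E -e "$1"
}

preview_exclude() {
    # drop names matching the expression
    grep -E -v -e "$1"
}

# Example with a made-up list of function names:
# printf 'foo_init\ncompute_bar\nbar_setup\nompt_start\nmain\n' | preview_include '(^foo|bar$)'
# printf 'foo_init\ncompute_bar\nbar_setup\nompt_start\nmain\n' | preview_exclude '^ompt_'
```

Candidate names could come from a symbol listing tool such as nm, keeping in mind that C++ symbols appear mangled there.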
timemory-run can accept "collection" files, which are explicit lists of the function names to be instrumented. Several pre-defined collection sets are provided for popular libraries (e.g. BLAS, CUDA, FFTW, GMP, HDF5, HIP, LAPACK, MPI, OMP, OPENCL, PETSc, UPC), along with category sets of library functions (e.g. memory contains memcmp, memcpy, etc.).
timemory-run has a command line option (-M/--mode) which designates whether to synchronize the instrumentation with the timemory library API.
When --mode=trace (default), the components used by the dynamic instrumentation will be independent of any changes made to the components via the library API. Thus, the library interface can enable/disable components freely without affecting the dynamic instrumentation, and the dynamic instrumentation can be used for detailed analysis of certain components.
When --mode=region, the dynamic instrumentation uses the timemory_push_region and timemory_pop_region function calls exposed by the library API. Any changes to the measurement components via timemory_set_default, timemory_push_components, and timemory_pop_components will also modify the components used by the dynamic instrumentation.
timemory-run provides options to enable OpenMP tools (--ompt) and MPI (--mpip) instrumentation in binary rewrite mode. These options are useful if generic OpenMP and/or MPI performance info is desired instead of the detailed instrumentation that would arise from creating an instrumented version of these libraries.
A generic set of options is provided to add instrumentation from custom instrumentation libraries: --load takes a list of libraries, e.g. --load libfoo libbar, and the default behavior provided by timemory-run is to search for two symbols: void timemory_register_<NAME>() and void timemory_deregister_<NAME>(), where <NAME> is the name of the package, i.e. foo and bar for libfoo and libbar, respectively. However, one can also specify a list of initialization and finalization functions via --init-functions and --fini-functions. These symbols are inserted before and after main in an executable and within _init and _fini in a library. Although it might seem more intuitive for these libraries to be injected into the instrumentation around the function calls, new components can be added easily:
#include "timemory/library.h"
#include "timemory/timemory.hpp"

extern "C" void
timemory_register_ex_custom_dynamic_instr()
{
    using namespace tim::component;

    // insert monotonic clock component into structure
    // used by timemory-run in --mode=trace
    user_trace_bundle::global_init(nullptr);
    user_trace_bundle::configure<monotonic_clock>();

    // insert monotonic clock component into structure
    // used by timemory-run in --mode=region
    timemory_add_components("monotonic_clock");
}

extern "C" void
timemory_deregister_ex_custom_dynamic_instr()
{}
timemory-run -o foo.inst --load libex_custom_dynamic_instr.so -- ./foo
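The pairing between a --load argument and the symbols timemory-run searches for follows the pattern seen in the examples above: libfoo maps to timemory_register_foo, and libex_custom_dynamic_instr.so maps to timemory_register_ex_custom_dynamic_instr. The helper below derives the expected symbol names from a library name so they can be checked against a custom library before rewriting. It is a sketch inferred from those examples only; register_symbols is a hypothetical name and the exact derivation used by timemory-run may differ.

```shell
# Hypothetical helper: print the register/deregister symbol names that
# the text above associates with a library passed to --load.
register_symbols() {
    name="$(basename "$1")"   # drop any directory prefix
    name="${name#lib}"        # drop a leading "lib"
    name="${name%%.*}"        # drop ".so", ".so.1", etc.
    printf 'timemory_register_%s\n' "${name}"
    printf 'timemory_deregister_%s\n' "${name}"
}

# Example:
# register_symbols libex_custom_dynamic_instr.so
```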