This is an archived copy of [https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view], written by eddyb from the rustc team.
This report is a companion to these GitHub PRs:

- rust-lang/measureme#143 for the bulk of the implementation
- rust-lang/rust#78781 for integration into rustc
  - adds a -Z self-profile-counter flag to choose from measureme's named counters

Some details are only found in those PRs, while the rest is below.
We were attempting to gather detailed information about the impact of Rust PRs on the performance of specific parts of the compiler, such as the trait system.
The -Z self-profile data, however, was far too noisy for this: it relies on measuring time, which factors in far more of the system than the code being measured, and making those measurements reliable would be an uphill battle, with a setup everyone would then have to replicate in order to get the same results.
You can see the noise in the -Z self-profile data here (third graph): rust-lang/rustc-perf#661 - it's several times larger than the actual changes we wanted to see.
Hardware performance counters offer, in theory, an out: the CPU itself can accurately keep track of events like "Instructions retired", and even allow the OS to isolate individual userspace threads from anything else running on the same machine, including the kernel itself.
perf.rust-lang.org
already tracks such counters, via perf stat
, but there an entire process is measured, as opposed to the query-level granularity of -Z self-profile
.
While tracking only the number of executed instructions might hide certain effects, the long-term plan is to experiment with tapping into some parts of the cache hierarchy (likely L1d and L2) and using cache misses as a proxy for the cost of non-locality.
To determine the cost of sampling the instruction counter (via rdpmc
), as opposed to measuring time (via std::time::Instant
, which uses clock_gettime(CLOCK_MONOTONIC)
), the total number of instructions was tracked while compiling libcore in "check mode".
By gathering groups of 10 runs per counter type (including a "zero" counter that always produces the 0
constant, for the baseline), and dividing the difference between counter types' totals by the number of counter reads, we can reliably know the per-read cost (in the number of instructions).
Note that the totals are measured by instructions-minus-irqs:u
, regardless of which counter is being used by the profiler, so the noise here is not indicative of the quality of the profile data, but rather the determinism of the whole execution (including sampling the counter used by the profiler).
More details, and a revision history, are available in a gist, but these are the final results:
| Counter | Total instructions-minus-irqs:u | Overhead from "Baseline" (for all 1903881 counter reads) | Overhead from "Baseline" (per each counter read) |
|---|---|---|---|
| Baseline | 63637621286 ±6 | | |
| instructions:u | 63658815885 ±2 | +21194599 ±8 | +11 |
| instructions-minus-irqs:u | 63680307361 ±13 | +42686075 ±19 | +22 |
| wall-time | 63951958376 ±10275 | +314337090 ±10281 | +165 |
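To make the last column concrete, the per-read cost is just the overhead divided by the number of reads: for instructions:u, (63658815885 - 63637621286) / 1903881 = 21194599 / 1903881 ≈ 11.1 extra instructions per read, which is the +11 shown above (and likewise ≈22.4 and ≈165.1 for the other two rows).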
From this we can gather that counting instructions, even when subtracting IRQs, can be an order of magnitude faster than measuring time (though this may vary depending on the actual instructions executed).
You may also notice measuring time is non-deterministic in the total number of instructions (around ±5 instructions for every 1000 clock_gettime(CLOCK_MONOTONIC)
calls, though keep in mind we don't know the distribution), likely due to the synchronization mechanism between libc and the kernel requiring an extra loop iteration.
Using the profile data from the runs above (ignoring the "Baseline"), we've looked at the 10 largest queries, and the variance in their "Self time" (or "Self count") between runs, for each counter:
| Query | wall-time (ns) | instructions:u | instructions-minus-irqs:u |
|---|---|---|---|
| typeck | 5478261360 ±283933373 (±~5.2%) | 17350144522 ±6392 (±~0.00004%) | 17351035832.5 ±4.5 (±~0.00000003%) |
| expand_crate | 2342096719 ±110465856 (±~4.7%) | 8263777916 ±2937 (±~0.00004%) | 8263708389 ±0 (±~0%) |
| mir_borrowck | 2216149671 ±119458444 (±~5.4%) | 8340920100 ±2794 (±~0.00003%) | 8341613983.5 ±2.5 (±~0.00000003%) |
| mir_built | 1269059734 ±91514604 (±~7.2%) | 4454959122 ±1618 (±~0.00004%) | 4455303811 ±1 (±~0.00000002%) |
| resolve_crate | 942154987.5 ±53068423.5 (±~5.6%) | 3951197709 ±39 (±~0.000001%) | 3951196865 ±0 (±~0%) |
| hir_lowering | 530050063.5 ±30695921.5 (±~5.8%) | 1381364181 ±22 (±~0.000002%) | 1381363724 ±0 (±~0%) |
| build_hir_map | 367098179.5 ±18410803.5 (±~5%) | 2099522033.5 ±12.5 (±~0.0000006%) | 2099521684 ±0 (±~0%) |
| check_mod_item_types | 354434450.5 ±21655950.5 (±~6.1%) | 898871840.5 ±21.5 (±~0.000002%) | 899018351.5 ±0.5 (±~0.00000006%) |
| check_impl_item_well_formed | 323550938.5 ±16711421.5 (±~5.2%) | 805382346 ±42 (±~0.000005%) | 805841502 ±1 (±~0.0000001%) |
| check_item_well_formed | 321212644.5 ±18409991.5 (±~5.7%) | 898599944 ±22 (±~0.000002%) | 899122104.5 ±1.5 (±~0.0000002%) |
From this, the noise reduction becomes pretty clear, as even the worst case (thanks to unpredictable IRQs) for instruction-counting is 100,000 times (i.e. 5 orders of magnitude) smaller than the 5% wall-time
noise.
See Caveats/Subtracting IRQs below for why instructions-minus-irqs:u
is much better but still not perfect.
Using summarize aggregate (see also an explanation below) on the same profile data again, we get the 5 smallest and 5 largest variances (for "sampling intervals", i.e. the distance between consecutive pairs of counter reads):
wall-time
±0 ns: 44 occurrences, or 0.00%
±4.5 ns: `hir_owner()`
±5 ns: 10085 occurrences, or 0.53%
±5.5 ns: 99 occurrences, or 0.01%
±9.5 ns: 18 occurrences, or 0.00%
...
±16510104 ns: in `typeck()`, between `parent_module_from_def_id()` and `upvars_mentioned()`
±18410803.5 ns: `build_hir_map()`
±30695921.5 ns: `hir_lowering()`
±53068423.5 ns: `resolve_crate()`
±106282082.5 ns: in `expand_crate()`, after `pre_AST_expansion_lint_checks()`
instructions:u
±0 instructions: 1830563 occurrences, or 96.15%
±0.5 instructions: 69126 occurrences, or 3.63%
±1 instructions: 2466 occurrences, or 0.13%
±1.5 instructions: 715 occurrences, or 0.04%
±2 instructions: 219 occurrences, or 0.01%
...
±175.5 instructions: in `typeck()`, between `parent_module_from_def_id()` and `upvars_mentioned()`
±214 instructions: `free_global_ctxt()`
±226 instructions: in `typeck()`, between `opt_const_param_of()` and `in_scope_traits_map()`
±238.5 instructions: in `typeck()`, between `opt_const_param_of()` and `in_scope_traits_map()`
±2278.5 instructions: in `expand_crate()`, after `pre_AST_expansion_lint_checks()`
instructions-minus-irqs:u
±0 instructions: 1903452 occurrences, or 99.98%
±0.5 instructions: 423 occurrences, or 0.02%
±1 instructions: 3 occurrences, or 0.00%
±4.5 instructions: `write_crate_metadata()`
These are similar to the previous "Results" section, but instead of libcore, they're for the polkadot-runtime-common
crate, at tag v0.8.26
(also in "check mode").
See Caveats/Non-deterministic proc macros below for an ASLR-like effect we encountered here (but not for libcore, as it uses no proc macros), and how we worked around it.
| Query | wall-time (ns) | instructions:u | instructions-minus-irqs:u |
|---|---|---|---|
| expand_crate | 1255640143 ±62610049 (±~5%) | 3646160968.5 ±1940.5 (±~0.00005%) | 3646160240 ±0 (±~0%) |
| typeck | 678068989 ±39533817 (±~5.8%) | 1664414250.5 ±1110.5 (±~0.00007%) | 1664595300.5 ±2.5 (±~0.0000002%) |
| mir_borrowck | 286718041 ±19533032 (±~6.8%) | 837858656 ±502 (±~0.00006%) | 837973316 ±1 (±~0.0000001%) |
| metadata_register_crate | 131774774 ±11642890 (±~8.8%) | 537739743.5 ±203.5 (±~0.00004%) | 537746710 ±0 (±~0%) |
| evaluate_obligation | 98972055 ±5956386 (±~6%) | 245732462.5 ±184.5 (±~0.00008%) | 245852905.5 ±2.5 (±~0.000001%) |
| resolve_crate | 95908303.5 ±8120924.5 (±~8.5%) | 281321111 ±108 (±~0.00004%) | 281329454 ±0 (±~0%) |
| mir_built | 91490959.5 ±6749247.5 (±~7.4%) | 255743130.5 ±160.5 (±~0.00006%) | 255789874.5 ±1.5 (±~0.0000006%) |
| type_op_prove_predicate | 70004550.5 ±5436502.5 (±~7.8%) | 175309312.5 ±128.5 (±~0.00007%) | 175357244 ±0 (±~0%) |
| check_item_well_formed | 50069210.5 ±4185722.5 (±~8.4%) | 144582558 ±83 (±~0.00006%) | 144642844 ±1 (±~0.0000007%) |
| check_impl_item_well_formed | 40564784.5 ±2546996.5 (±~6.3%) | 107322787.5 ±82.5 (±~0.00008%) | 107359747.5 ±0.5 (±~0.0000005%) |
wall-time
±0 ns: 228 occurrences, or 0.03%
±0.5 ns: 4 occurrences, or 0.00%
±4.5 ns: in `implementations_of_trait()`, after `metadata_decode_entry_implementations_of_trait()`
±5 ns: 13744 occurrences, or 1.63%
±5.5 ns: 119 occurrences, or 0.01%
...
±4402505.5 ns: `free_global_ctxt()`
±8120924.5 ns: `resolve_crate()`
±10358487 ns: in `expand_crate()`, between `metadata_load_macro()` and `metadata_load_macro()`
±11484159 ns: in `expand_crate()`, between `metadata_load_macro()` and `metadata_load_macro()`
±44234813 ns: in `expand_crate()`, between `metadata_load_macro()` and `metadata_load_macro()`
instructions:u
±0 instructions: 828814 occurrences, or 98.00%
±0.5 instructions: 15484 occurrences, or 1.83%
±1 instructions: 899 occurrences, or 0.11%
±1.5 instructions: 291 occurrences, or 0.03%
±2 instructions: 63 occurrences, or 0.01%
...
±125.5 instructions: in `expand_crate()`, between `metadata_load_macro()` and `metadata_load_macro()`
±169 instructions: between `codegen_crate()` and `free_global_ctxt()`
±242 instructions: `free_global_ctxt()`
±1428 instructions: in `expand_crate()`, between `metadata_load_macro()` and `metadata_load_macro()`
±7588.5 instructions: `self_profile_alloc_query_strings()`
instructions-minus-irqs:u
±0 instructions: 845456 occurrences, or 99.97%
±0.5 instructions: 229 occurrences, or 0.03%
±1 instructions: 3 occurrences, or 0.00%
±4.5 instructions: `write_crate_metadata()`
±5 instructions: in `link()`, between `finish_ongoing_codegen()` and `llvm_dump_timing_file()`
±18 instructions: in `link()`, after `link_crate()`
±20 instructions: in `codegen_crate()`, between `assert_dep_graph()` and `serialize_dep_graph()`
±188.5 instructions: between `codegen_crate()` and `free_global_ctxt()`
±332 instructions: `free_global_ctxt()`
±9430.5 instructions: `self_profile_alloc_query_strings()`
One thing to note here is that even though instructions-minus-irqs:u
is worse here than it was for libcore, all of the extra noise is contained in IO (during/after writing the .rmeta
file).
And because it's isolated in those parts of the compiler, it has no effect whatsoever on queries.
See also Challenges/ASLR: it’s free entropy for some backstory and more details.
In order to get the best results, disabling ASLR by e.g. running rustc
under setarch -R
is necessary.
Note that the rustup
wrapper binary which allows rustc +foo ...
does not currently appear to propagate ASLR being disabled to the actual rustc
binary (e.g. ~/.rustup/toolchains/.../bin/rustc
), so you will need to refer to it directly, for now.
Long-term we may want to make rustc
's use of pointer hashing more intrinsically deterministic, by using special arenas which prefer known addresses, instead of letting the OS pick them.
When profiling a rustc
compilation which uses proc macros, they may contain arbitrary user code, including sources of randomness and IO. The easiest way this can impact the rest of rustc
is through introducing ASLR-like effects with non-deterministic allocation sizes.
For example, the very-widely-used serde_derive
proc macro happens to use HashSet
, which by default uses a randomized hash (just like HashMap
), and that will result in different (re)allocation patterns of the HashSet
heap data, effectively randomizing any heap addresses allocated afterwards (by other parts of rustc
).
In order to fully eliminate ASLR-like effects due to the use of HashMap
/HashSet
in proc macros, during testing and data collection, we had to temporarily change the behavior of std::collections::hash_map::RandomState
to not use the getrandom
syscall.
Long-term we'd want either rustc
/proc_macro
to hook std
and disable getrandom
for HashMap
/HashSet
, or some sort of tool that intercepts the getrandom
syscall using ptrace
and/or SECCOMP filters. Alternatively/additionally, we could try getting std
to use the glibc wrapper when available, making it possible to use LD_PRELOAD
to defuse getrandom
.
See also Challenges/Getting constantly interrupted for some backstory and more details.
instructions-minus-irqs:u
exists to account for/work around overcounting (in instructions:u
) associated with interrupts.
Our hypothesis is that the iret
instruction, which the kernel uses to return execution from an interrupt handler, back to the (interrupted) userspace code, is counted as "retiring in userspace".
While this (if true) would be pedantically/technically correct, it's practically undesirable, as the otherwise userspace-only count is now tainted by 1 extra instruction per (unpredictable) interrupt.
Additionally, subtracting IRQs requires CPU family/model detection logic to determine the correct configuration for the IRQs counter.
This is the most ad-hoc part of the entire measureme
PR, and some of it relies on undocumented/redacted-out information (verified through testing) - but we also later learned that projects like rr do something similar, so long-term we probably want to split this into a general-purpose library.
instructions-minus-irqs:u
is still not perfect as it requires two rdpmc
s, and interrupts can come in between them, but the resulting noise is usually limited to 1-2 instructions in the "total count" for any measureme
event, and relatively rare, so much better than the interrupt noise it's removing.
There is a way (CONFIG_NO_HZ_FULL
) to get the Linux kernel to avoid using timers on some cores, which would further decrease the noise (or perhaps even allow using instructions:u
directly), but we've decided against experimenting with it, as it impacts scheduling system-wide and so would require a dedicated machine for profiling - the documentation even says:
Unless you are running realtime applications or certain types of HPC workloads, you will normally -not- want this option.
The initial implementation doesn't support reading the counter from threads other than the original one it was created on. rdpmc
will be attempted, but no useful value can be read, as the counter is only running for the original thread.
This works for cargo check
(i.e. rustc --emit=metadata
) compilations, and those were our main focus, as -Z self-profile
is most fine-grained (i.e. at the query level) in the "check" part of the compilation. It could be possible to measure LLVM as well using -Z no-llvm-threads
, but we haven't confirmed this works reliably.
Additionally, it's harder to get deterministic and comparable results if multiple threads are involved, though queries themselves could remain pretty stable (except for unpredictable allocation in other threads inducing an ASLR-like effect).
Long-term, we should try to do (at least) one of:
- refactor rustc to hold one "profiler handle" (with its own counter) per-thread
- use a Vec<Counter> in measureme, and index it by std::thread::current().id() (roughly sketched below)
  - or rather, the thread index passed to Profiler methods by rustc
- enable inherit when creating the counter with perf_event_open
  - this might not do anything useful for measureme, further experimentation needed
  - if it does work, it would be the most convenient way
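As a rough illustration of the Vec<Counter> option, here is a minimal sketch - this is not measureme's actual API (Counter below is just a stand-in for its counter type, and the thread index is whatever rustc already passes to Profiler methods):

```rust
// Stand-in for measureme's hardware counter type; the real one wraps the
// `perf_event_open` file descriptor and the `rdpmc` index for one thread.
struct Counter;

impl Counter {
    fn read(&self) -> u64 {
        // Would issue `rdpmc` for the counter created on this thread.
        0
    }
}

// One counter per rustc thread, indexed by the thread index rustc already
// passes around, so `rdpmc` only ever runs on the thread that owns the counter.
struct PerThreadCounters {
    counters: Vec<Counter>,
}

impl PerThreadCounters {
    fn read(&self, thread_index: usize) -> u64 {
        self.counters[thread_index].read()
    }
}
```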
The Linux API for interfacing with both "software events" (tracked manually in the kernel) and hardware performance counters, is perf_event_open
, which returns a file descriptor for a pseudo-file that can then be read from to sample the counter.
However, we wanted to avoid the cost of a read
syscall, especially as x86 has the rdpmc
instruction, and Linux claims to allow it to be used in userspace, once perf_event_open
sets up a counter.
The information necessary for using rdpmc
is documented as being available by mmap
-ing the pseudo-file perf_event_open
returns, creating a region of shared memory (which can also be used as a ring buffer for software events, sampling profiling, etc. - though none of that is relevant here).
We started out by looking at examples which used a combination of perf_event_open
, mmap
and asm("rdpmc")
, and while there was some variation (such as also keeping track of time), they all seemed to share this rough structure:
```c
p = mmap(perf_event_open(...));

// Every time you want to read the counter:
do {
    // The kernel increments `lock` when it changes anything, hence
    // the do-while loop to retry if the "sequence number" changed.
    seq = p->lock;
    barrier(); // Avoids reading stale values below.

    // Read the counter (not shown here: the `rdpmc` function
    // using `asm("rdpmc")`) and add a kernel-supplied "offset".
    count = rdpmc(p->index - 1) + p->offset;

    // Sign-extend from `p->pmc_width` bits to 64 bits.
    shift = 64 - p->pmc_width;
    count = count << shift >> shift;

    barrier(); // Avoids reading stale `lock` in condition below.
} while (p->lock != seq);
```
You can see an example of this in deater/perf_event_tests
(with several copies of that function existing across the repository).
The Linux kernel doc comments have a simpler example with the sign-extension aspect documented separately.
There are a few downsides to this "standard" approach:
- the synchronization mechanism seems somewhat sketchy, at least for Rust (unsynchronized reads + manual barriers, instead of atomic reads)
- if the kernel does update any of the fields, there is non-determinism introduced, as the loop will execute more than once (and therefore count more instructions)
- for the fields that aren't updated (like index), or which may be constant (like pmc_width), treating them like they could change is less efficient than it needs to be
Taking a closer look at each of the 3 fields:
- index: can't avoid reading it at least once (e.g. if we're running under perf stat -e ..., we wouldn't get the first counter)
  - as long as we don't stop the counter, its index shouldn't change, so we could keep our own copy of it
- offset: this was the most worrying, as the documentation makes it seem like it could be updated by the kernel during e.g. context switching
  - in practice, it seemed to remain at 2^(pmc_width-1) - 1
  - the specific value would suggest it's a "bias", so that counters start in the middle of their range, as opposed to around 0
  - this is presumably related to the ability to wait for the counter to increment by a certain amount, detected through the counter overflowing, but thankfully we don't need any of that
  - additionally, as we wanted to subtract some initial counter value to get a relative count, the offset field could be completely ignored (i.e. (b + offset) - (a + offset) is just b - a)
- pmc_width: it seems possible that different CPUs may have different physical counter widths (up to the 64 bits rdpmc can provide)
  - on every x86 CPU we've tested, it was always 48
  - we could therefore check it's what we expect at the start, and hardcode it for the reads themselves
In the end, we were able to avoid accessing any of the shared memory and dealing with the associated synchronization, when reading the counter, but kept the synchronization loop for the initial setup.
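The resulting read path looks roughly like the sketch below (assuming x86_64 Linux, a counter already set up via perf_event_open, and the index and pmc_width fields captured once during setup - this is a simplified sketch, not measureme's exact code):

```rust
use std::arch::asm;

// Checked once against the mmap'd page during setup, then hardcoded.
const PMC_WIDTH: u32 = 48;

/// Reads the hardware counter selected by `index` (the mmap'd `index` field
/// minus 1, also captured once during setup).
unsafe fn read_counter(index: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    asm!(
        "rdpmc",
        in("ecx") index,
        out("eax") lo,
        out("edx") hi,
        options(nostack, preserves_flags),
    );
    let raw = ((hi as u64) << 32) | lo as u64;
    // Sign-extend from PMC_WIDTH bits to 64 bits, like the standard approach,
    // but without re-reading the width on every sample; the kernel-supplied
    // `offset` is skipped entirely, since only differences are ever used.
    let shift = 64 - PMC_WIDTH;
    (((raw << shift) as i64) >> shift) as u64
}
```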
From previous experience, we were aware of the impact ASLR ("Address Space Layout Randomization", i.e. placing the program, and its stack, heap, etc. at random addresses, as a security measure) can have on rustc
's instruction-level determinism.
While rustc
's external outputs should never change on repeated executions, some internal data structures may change performance characteristics due to rustc
's reliance on using pointer addresses (after having deduplicated the data being referenced) for faster equality and hashing, such as when used as HashMap
keys.
In order to run e.g. some program ./foo
, with ASLR disabled, on Linux, one can use the setarch -R ./foo ...
command (or setarch x86_64 -R ./foo ...
in older versions), where -R
specifically is the flag for disabling ASLR.
Going this route avoids needing root access, and won't affect the rest of the system.
It can also be done with personality(ADDR_NO_RANDOMIZE)
, if you already have a rustc
wrapper, but for manual testing setarch -R
works great.
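For reference, here is a hedged sketch (Linux-only, assuming the libc crate's personality binding) of what such a wrapper would do before exec'ing the real rustc - essentially the same thing setarch -R does:

```rust
fn disable_aslr() -> std::io::Result<()> {
    // Query the current persona (0xffffffff is the conventional "query only"
    // argument), then set it again with ADDR_NO_RANDOMIZE added on top.
    let current = unsafe { libc::personality(0xffff_ffff) };
    if current == -1 {
        return Err(std::io::Error::last_os_error());
    }
    let wanted = current as libc::c_ulong | libc::ADDR_NO_RANDOMIZE as libc::c_ulong;
    if unsafe { libc::personality(wanted) } == -1 {
        return Err(std::io::Error::last_os_error());
    }
    // Any process exec'd after this point inherits the ASLR-disabled persona.
    Ok(())
}
```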
We did run into a problem, however, where rustup
's wrapper commands that allow picking toolchains (e.g. rustc +nightly ...
vs rustc +stable ...
) don't seem to propagate the personality flags (when ran as setarch -R rustc +local ...
), so for a while disabling ASLR didn't appear to have an effect.
The solution was to run the locally-built rustc
directly, i.e. setarch -R ./build/x86_64-unknown-linux-gnu/stage1/bin/rustc
, which did show a significant difference.
With ASLR enabled, the variance in the totals (43 billion at the time) was around ±2.5 million, but disabling ASLR brought it to under ±10k (which we were later able to also eliminate).
That makes "non-deterministic addresses" (whether from ASLR or something else) by far the largest source of noise we've observed, and we're able to rely on the large magnitude to tell it apart.
One example of such an effect, produced without ASLR itself, was a correlation between the length of PIDs (in base 10) and the total number of instructions, which we accidentally noticed across a set of 100 runs.
That turned out to be from format!("{}-{}", crate_name, std::process::id())
, used by rustc -Z self-profile
as the prefix for the profile data file(s). The size of that single allocation was able to act as a (very weak) pseudorandom form of ASLR, after amplifying the deviation, through cloning and concatenation.
Our workaround was to pad the PID with zeros (up to 6 digits), but similar sources of non-deterministic allocation sizes could still be hiding elsewhere, and we will have to address them as they show up in the data.
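The workaround amounts to something like this (illustrative, not the exact rustc change):

```rust
fn profile_file_prefix(crate_name: &str) -> String {
    // Zero-pad the PID to a fixed width, so the length of this allocation
    // (and thus the heap layout after it) no longer depends on how many
    // digits the PID happens to have.
    format!("{}-{:06}", crate_name, std::process::id())
}
```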
Both Intel and AMD warn about using the rdpmc
instruction without some sort of "serializing instruction", and how it could lead to inaccurate values being read out.
In this context, "serialized execution" refers to executing instructions "serially", i.e. in the order they exist in memory, as opposed to the CPU reordering them based on dependencies between them and available hardware resources.
As a more concrete example, for the "Instructions retired" counter, rdpmc
may show a lower value as a few earlier instructions might not have "retired" yet, by the time rdpmc
executed.
While not making rdpmc
self-serializing is a counterproductive design decision, it does simplify the CPU (and as we found out later, this is a bit of a theme).
Coming back to the perf_event_open
+ mmap
+ asm("rdpmc")
examples we found, none of them were using a "serializing instruction", so that's another downside to add, on top of the synchronized loop.
As Intel and AMD do not entirely agree on which instructions count as "fully serializing", we had to go with the clunkier cpuid
(which both Intel and AMD document as "fully serializing"), even if we experimentally saw that e.g. lfence
has a similar noise-reducing effect on AMD Zen (an effect which is likely only enabled as part of mitigations for speculative execution vulnerabilities such as Spectre).
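In code, the fix is simply a fully serializing instruction immediately before the counter read - roughly like this sketch (using the cpuid intrinsic from std::arch and the read_counter helper sketched earlier; not measureme's exact code):

```rust
fn serialized_read_counter(index: u32) -> u64 {
    unsafe {
        // `cpuid` is documented as fully serializing by both Intel and AMD, so
        // every earlier instruction has retired before the `rdpmc` inside
        // `read_counter` executes, avoiding the undercounting described above.
        core::arch::x86_64::__cpuid(0);
        read_counter(index)
    }
}
```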
This was before we developed the summarize aggregate
tool, so we had limited insight into fine-grained noise, but the difference between serialized and unserialized was visible in the variance of per-query statistics, i.e. in the summarize summarize --json
data.
There is news of a possible serialize instruction which does exactly what's needed, and no more, but it looks like using it would require CPU detection and dynamically picking it (which would also allow using lfence, where it happens to also be fully serializing).
Out of all the noise prevention measures we've taken, this one has the least impact, as there aren't that many instruction reordering opportunities just before the rdpmc
, so it's more of a preventative measure, and we may end up relaxing it in the future.
By this point, we were getting variance (in the total number of instructions) anywhere between ±5k and ±10k, and started investigating it, as there didn't appear to be any obvious source for that noise.
In order to validate some of our basic assumptions, we combined the relevant parts of measureme
and some dependencies into a self-contained rdpmc-bench.rs
, and had the initial version measure trivial loops, printing the deviation from the expected number of instructions.
So for each of the measured loops it could show:
- negative values: undercounting (i.e. some instructions weren't counted)
- exactly 0: perfect (and our original expectation)
- positive values: overcounting (i.e. either some instructions were counted more than once, or unrelated instructions were counted)
What we saw was overcounting: something else was getting into the counter, and it appeared to be proportional to the number of instructions being executed (but relatively small).
One of our early suspicions was that perhaps the context-switching associated with multithreading (especially on laptop CPUs with few cores) was failing to correctly save and restore the counters. We'd later learn that the counters are largely impervious to such problems, thanks to the hardware being able to automatically pause them when outside of user mode, but at the time we started looking at various metrics available through perf stat
, such as "software events" or even Linux kernel "tracepoints".
Eventually we noticed that the total overcounting correlated strongly with the time the process was running for, even when that varied between different runs, and @m-ou-se realized that on her machine it was ~300 extra instructions per second and her Linux kernel was configured with CONFIG_HZ=300
.
We were then able to confirm that the overcounting did match CONFIG_HZ
across several different machines (with different Linux distros and varying CONFIG_HZ
). On some machines we were able to confirm the value by simply executing zgrep CONFIG_HZ /proc/config.gz
, but on others the kernel configuration was only found under /boot
.
So what is this CONFIG_HZ
? On the average Linux system, it's the frequency at which a (per-core) periodic timer will fire an interrupt (or "IRQ"), letting the kernel perform scheduling of all the active threads, and other periodic bookkeeping.
Soon thereafter @m-ou-se found a paper on non-determinism in hardware performance counters, which pretty much confirmed the overcounting is directly tied to interrupts, and also offered a potential solution: subtract the number of interrupts, assuming you can count them.
One small note about the paper: it doesn't propose a mechanism for how the overcounting happens, and we ended up assuming it's the iret
instruction the kernel uses to leave an interrupt handler and return to the code which was interrupted.
The reason we went with that explanation is that an iret
back to userspace technically retires in userspace - or in other words, the "Instructions retired" counter gets effectively unpaused half-way through executing iret
, and extra silicon would be required to avoid that.
While it should be possible to confirm this is the case, with some custom baremetal code, or by getting the kernel itself to read the performance counter while in an interrupt handler (e.g. using the read
syscall via the x86 int 0x80
syscall interface, on a perf_event_open
pseudo-file), we haven't done so yet.
Now that we knew how we might remove the interrupt noise, we started looking for a counter that would at least include the timer IRQs, and since perf list
didn't seem to show anything of the sort on the machines we were looking at (Intel Ivy Bridge and AMD Zen 1 EPYC), we went to the (Intel and AMD) manuals, and found:
- Intel had HW_INTERRUPTS.RECEIVED, documented since Skylake as 01cb
  - there was also a similar counter documented on some Atom microarchitectures, but given the lack of Atom test hardware, we decided against supporting them for now
- AMD had LsIntTaken (or "Interrupts Taken"), documented in two ways:
  - pre-Zen, it was 00cf (going at least as far back as K8, the first AMD64 microarchitecture)
  - for Zen, it was 002c
    - while it's missing from the older 1st gen EPYC documentation, which likely explains why perf list doesn't show it, it is documented nowadays
Despite Intel not documenting the counter on anything older than Skylake, we started trying it (via e.g. perf stat -e r01cb:u
), and immediately confirmed that it works on Ivy Bridge.
Over the following days, we were able to also confirm it on Sandy Bridge (thanks to @alyssais) and Haswell (thanks to @m-ou-se, @cuviper and @nagisa).
We later noticed that the aforementioned paper shows, on page 5, HW_INTERRUPTS.RECEIVED
for Sandy Bridge and Ivy Bridge (and similar counters for even older CPUs, on both pages 4 and 5), leading us to conclude that Intel had "undocumented" these counters, instead of publishing errata and explaining the potential associated caveats.
Once we had the above information for the "hardware interrupts" (or "IRQs") counters we wanted to use, the cpuid
instruction was used to determine which generation of CPU we're running on, whether we know of a counter to use, and which exact configuration is needed.
We then enhanced our instructions:u
measurement to also rdpmc
the IRQs counter, and subtract it, creating what we would later dub instructions-minus-irqs:u
(to distinguish it from e.g. perf stat -e instructions:u
, which doesn't account for IRQs at all).
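A rough sketch of that selection logic is below - this is not measureme's actual code (the real family/model checks are more involved), and the AMD family cutoff is an assumption on our part:

```rust
/// Picks the raw event configuration (mask << 8 | event selector) for the
/// "hardware interrupts" counter, or `None` if no usable counter is known.
fn irqs_counter_config(vendor: &str, family: u32) -> Option<u16> {
    match vendor {
        // HW_INTERRUPTS.RECEIVED: documented from Skylake onwards, but
        // verified to work at least as far back as Sandy Bridge.
        "GenuineIntel" => Some(0x01cb),
        // LsIntTaken ("Interrupts Taken"): 002c on Zen (family 0x17 onwards
        // here, as an assumption), 00cf on earlier families back to K8.
        "AuthenticAMD" if family >= 0x17 => Some(0x002c),
        "AuthenticAMD" => Some(0x00cf),
        _ => None,
    }
}
```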
For the rdpmc-bench.rs
synthetic benchmark, after subtracting IRQs we stopped seeing overcounting at all, and instead had:
- some undercounting, on most machines, especially Intel Ivy Bridge and Haswell
  - perhaps these were the hyperthreading issues Intel "undocumented" the counters for?
- perfect results (0 deviation from expected) on the AMD Zen 1 EPYC server
  - this machine may have an unfair advantage, i.e. its 48 hardware threads, 6-12x more than the Intel CPUs we looked at
  - long term we could look at intensive stress testing, using several times more userspace threads than hardware threads, to hunt down remaining non-determinism, but we haven't seen any evidence to suggest there could be any left
However, subtracting IRQs mostly didn't impact the (up to ±10k) noise we were seeing when compiling libcore, as the IRQ-driven overcounting stayed within a few hundred instructions between different runs, most of the time (though it did help with the occasional outlier).
And that's because...
Having not made as much progress in removing noise by subtracting interrupts as we were hoping, focus shifted to building the tools necessary to delve deeper into the -Z self-profile
data, to narrow down which parts of rustc
were being affected.
What we came up with is summarize aggregate
(summarize
being the main tool for inspecting measureme
profiles). The pull request has more details and examples, but in short, it:
- takes a group of profiles collected from identical
rustc
executions (we were using groups of 10 runs) - lines up every event endpoint (i.e. starts and ends) across the entire group, making sure the stream of events has identical details (other than the timestamps/instruction counts)
- compares the distance (in time/instruction count) between every consecutive pair of event endpoints, and shows the best and worst such intervals (in terms of variation between runs)
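A much-simplified sketch of that last step (assuming each run's event endpoints have already been lined up into equally long, ordered lists of counts - the real tool also tracks which events each interval belongs to):

```rust
// For every consecutive pair of event endpoints, compute the spread of the
// interval length across runs, reported as ±(max - min) / 2, matching the
// listings in this report.
fn interval_spreads(runs: &[Vec<u64>]) -> Vec<f64> {
    let endpoints = runs[0].len();
    (1..endpoints)
        .map(|i| {
            let deltas: Vec<u64> = runs.iter().map(|run| run[i] - run[i - 1]).collect();
            let min = *deltas.iter().min().unwrap() as f64;
            let max = *deltas.iter().max().unwrap() as f64;
            (max - min) / 2.0
        })
        .collect()
}
```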
From the instructions-counting example in the PR, these were the 2 smallest and 2 largest variances, at the time:
±0 instructions: 1701795 occurrences, or 99.61%
±0.5 instructions: 519 occurrences, or 0.03%
±659 instructions: `free_global_ctxt()`
±1147 instructions: in `expand_crate()`, after `pre_AST_expansion_lint_checks()`
The important aspect here is that every interval corresponds to a consecutive pair of time/counter samples, so even if e.g. reading the counter was imprecise by a few instructions, it could not add up, as it might when looking at summarize summarize
output.
Additionally, we had two avenues for learning more about what was executing in those high-noise intervals:
- adding more fine-grained events to
rustc
, e.g. for only a few statements in a function - enabling the serialization of function arguments, via
-Zself-profile-events=default,args
, for the events that support it
The main outcome of employing these techniques was learning that the noise is quite rare, and highly correlated with heap allocation/deallocation (i.e. malloc
/realloc
/free
) of rustc
's AST/IRs (macro expansion on the AST, AST->HIR lowering, MIR transformations).
We couldn't find any obvious sources of non-determinism in glibc
's heap allocator, and strace
wasn't showing a difference in the order, arguments, or return values of syscalls (other than process/thread IDs), so we decided to try and replace the allocator used by rustc
with a simpler one written in Rust (such as wee_alloc
, commonly used for WASM).
As #[global_allocator]
wasn't compatible with rustc
's reliance on dylib
crates (used to avoid duplication on disk between rustc
, rustdoc
, clippy
, miri
, etc.), we had to write an adapter between C heap functions and a Rust allocator, but once we had that we could start comparing simpler allocators to glibc
's, in terms of noise.
Surprisingly, both wee_alloc
and simpler allocators like bump_alloc
(which never deallocates any memory), had similar noise to glibc
, though deallocation became deterministic with bump_alloc
(as free
was a noop).
So we turned our attention to synthetic benchmarking again, and added bump_alloc
allocation to rdpmc-bench.rs
, which produced occasional overcounting (even if it could take a while for it to happen).
One important detail here is that the heap allocator noise we were seeing was on the AMD Zen 1 EPYC server (we were relying on it as it was otherwise the least noisy machine we had access to), and we could only reproduce the occasional overcounting (in the synthetic benchmark) on AMD Zen CPUs (including Zen 2).
At this point our assumption became that this was a Zen-specific bug, and we quickly reduced the bump_alloc
allocation function down to a single atomic fetch_add
(i.e. the lock xadd
instruction).
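The reduced benchmark body looked roughly like this (illustrative only; the real rdpmc-bench.rs wraps it in counter reads and nop padding):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static BUMP: AtomicU64 = AtomicU64::new(0);

// On x86-64, `fetch_add` compiles down to a single `lock xadd`, so each
// iteration retires a small, fixed number of instructions - the measured
// total should be an exact multiple of `iterations`, and any excess is
// overcounting.
fn atomic_loop(iterations: u64) {
    for _ in 0..iterations {
        BUMP.fetch_add(1, Ordering::Relaxed);
    }
}
```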
The lock xadd
overcounting suspiciously came in exact multiples of the number of instructions in the loop body, as if the benchmark loop was doing extra iterations - e.g. instead of 5000
instructions for 1000
iterations, we might see 5005
or 5010
, when the loop body was 5
instructions - and this effect remained as we added extra nop
s to the loop body.
This was our first indication that something much stranger than a mere faulty counter was happening here: it shouldn't be possible for the hardware to redo whole loop iterations, especially with a conditional branch (the loop backedge) in the overcounted instructions.
But @m-ou-se said the unthinkable out loud anyway:
is it speculatively executing until the next
lock
which then conflicts with the existinglock
and throws the whole thing out?
After a bit more experimentation to confirm that the distance between consecutive lock xadd
instructions mattered, rather than the loop itself, we started looking for any mentions of such a behavior, and the Google search query instructions retired lock
had (at the time) this as one of the results: https://www.freepatentsonline.com/y2018/0074977.html
That is AMD's US20180074977A1 (read on Google Patents), which has this in its abstract:
A lock instruction and younger instructions are allowed to speculatively retire prior to the store portion of the lock instruction committing its value to memory. (...) In the event that the processor detects a violation (...), the processor rolls back state and executes the lock instruction in a slow mode in which younger instructions are not allowed to retire until the stored value of the lock instruction is committed.
(Here "younger instructions" refers to instructions after the atomic, i.e. lock
, instruction, which would always execute later in a simple "in-order" CPU, but modern "out-of-order" CPUs do some amount of on-the-fly reordering of instructions, hence the specific terminology)
This was the confirmation we needed: AMD decided to "transactionally" execute (or rather, speculate) instructions past an atomic, without any special opt-in from software, and while the CPU was able to "time-travel" back to the atomic instruction, it wasn't saving performance counters in order to restore them as well.
What could we even do about missing silicon like this? Well, it so happens that we live in a post-Spectre world, and while the mitigations for it (and related speculative execution vulnerabilities) have had a negative impact on performance, there is an upside: both Intel and AMD have revealed previously undocumented special registers, and bits in them which can turn off various forms of speculation.
Having that ability all along (for anything non-essential in a CPU) makes sense given silicon development cycles, and it would be especially beneficial for AMD to have such a disable bit for their new "speculate past atomics, possibly having to roll all that back later" optimization, just in case there are flaws found later (at which point an UEFI "BIOS" update would set the bit).
So we were expecting there would be a disable bit and that we might be able to find it, especially given our synthetic benchmark, though randomly changing undocumented special registers (aka MSRs, or "model-specific registers") can be somewhat dangerous, mostly in terms of corrupting the OS and/or running processes (but it shouldn't be easy/possible to permanently damage the hardware itself).
Regarding the synthetic benchmark, @puckipedia came up with the idea of using a child thread to constantly interfere with the same cache line the main thread was lock xadd
-ing to, and this led to constant overcounting, instead of merely occasional, which made it much easier to confirm whether an undocumented bit had any relevant effect.
Before we were able to ever try a single MSR bit, however, @jix already tried a few and found that MSRC001_102C[55]
(i.e. bit 55
of MSR 0xC001_102C
) solved the overcounting problem, when set. That was already great, but then they found that the ASRock "B450M Pro4" motherboard manual contains this description for "Enable IBS":
Enables IBS through
MSRC001_1005[42]
and disablesSpecLockMap
throughMSRC001_1020[54]
.
Not only did MSRC001_1020[54]
also solve our overcounting problem, we now had a name for this "post-lock
speculation" feature: SpecLockMap
. More recently, someone from AMD implied that ASRock effectively leaked the correct name, so we'll keep using it.
While not officially documented by AMD, the fact that "UEFI Setup" (aka "BIOS") toggles may set this bit, meant it was less likely to do anything else, and we decided to stick with it (rather than the other, completely unknown, bit, or any others).
By this point, we had heard that the rr
("Record and Replay") tool didn't work on AMD Zen CPUs because of some accuracy issue with hardware performance counters, so we left a comment about SpecLockMap
and how to disable it, in case that might help them. @glandium quickly confirmed disabling SpecLockMap
did the trick, and started looking into remaining unrelated issues blocking rr
on AMD Zen CPUs (now that the performance counters could be made reliable).
While almost everyone testing our synthetic benchmark or rr
, was using an AMD Zen 2 (Ryzen 3xxx
) CPU, and could use tools like rdmsr
/wrmsr
to test MSR bits, our EPYC server was (and remains) Zen 1, and so we ran into another problem because of that: the mitigation for "Speculative Store Bypass" (i.e. its disable bit) is MSRC001_1020[10]
on Zen CPUs predating Zen 2, and the Linux kernel caches the value of the MSR in order to set/unset that one bit, meaning any other changes (such as our bit 54
to disable SpecLockMap
) to the same MSR would get periodically reverted.
To work around this problem, we developed a small kernel module to override the cached MSR value, which @glandium later replaced with an improved version that instead hooked any writes to MSRC001_1020
, via the kernel's own "tracepoints" system, to also set bit 54
.
Given the SpecLockMap
name, we later stumbled over a once-documented (only in an older Zen 1 EPYC manual) performance counter named SpecLockMapCommit
(r0825:u
), and @jix came up with the idea of using it to detect whether SpecLockMap
was active (and warn the user that they should try to disable it). All it took was one (uncontended) atomic instruction for SpecLockMapCommit
to increment - presumably made more reliable by us pairing rdpmc
(used both before and after the atomic) with a serializing instruction.
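The detection boils down to something like the sketch below (not the exact measureme code - read_spec_lock_map_commit stands in for reading a counter configured as r0825:u):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static DUMMY: AtomicU64 = AtomicU64::new(0);

/// Returns true if `SpecLockMap` appears to be active, by checking whether a
/// single uncontended atomic bumps the `SpecLockMapCommit` counter.
fn spec_lock_map_is_active(read_spec_lock_map_commit: impl Fn() -> u64) -> bool {
    let before = read_spec_lock_map_commit();
    DUMMY.fetch_add(1, Ordering::Relaxed); // one uncontended `lock xadd`
    let after = read_spec_lock_map_commit();
    after > before
}
```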
Digging through various manuals, we found several ways in which the lock
performance counters were documented, but the main takeaway is that SpecLockMapCommit
was renamed to SpecLockHiSpec
for Zen 2, while serving the same function.
When all was said and done:
- rr was now working on AMD CPUs (and we'd heard multiple people claim that the only reason they didn't go with an AMD CPU was the lack of rr support)
- the variance in instructions-minus-irqs:u totals went away (except non-determinism sources we could quantify, e.g. creating a temp dir with a random name)
- both rr and our implementation could detect SpecLockMap and link the user to https://github.com/mozilla/rr/wiki/Zen for more information on how to disable it
Up until this point, we'd focused on testing with the default glibc allocator, but we had seen an ASLR-like effect when using jemalloc, which we couldn't track down originally. But official builds use jemalloc, so we had to take another look.
With the SpecLockMap
noise gone from non-jemalloc
results, we were now sure some kind of non-determinism was being introduced by jemalloc
, and strace
confirmed some memory ranges are released back to the OS at different times, resulting in unpredictable addresses for later memory allocations (and therefore act like a weak form of ASLR).
By looking at all the places in jemalloc
which can call madvise(..., MADV_FREE)
(the initial non-deterministic syscall), we came across a time-delayed "purging" mechanism, which perfectly explained what we were seeing: exactly when (in the execution of the program) the timer would fire is unpredictable, and the ordering of the madvise(..., MADV_FREE) syscall relative to all the mmap syscalls would determine all later memory addresses.
However, this "purging" would only happen every 10 seconds by default, so anything taking less time to compile should be just as deterministic instruction-wise, with jemalloc
as with glibc
's allocator. It just so happened that our testcase would take around 13 seconds.
We were able to disable the time-delayed mechanism either by setting an environment variable (MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0
) or by placing that same value in a static
with the symbol name malloc_conf
, inside rustc
.
Some experiments were attempted using Rust PR #77162, but as most perf.rust-lang.org
benchmarks take less than 10 seconds, jemalloc
wouldn't behave differently for them.
Sadly, losing this "purging" feature might come with some performance loss, but there may be a way to disable it as soon as rustc knows it's being run with -Z self-profile
, if necessary.
In order to be able to submit the measureme
and rustc
PRs, we had to first update our changes to latest Rust (i.e. git rebase
onto latest rust-lang/rust
), and after doing so we got another set of measurements, just to make sure everything still works.
To our disappointment, there was now noise that wasn't there before the rebase, and we decided to track it down, just in case we made a significantly flawed assumption along the way.
Using binary search on the PR merge commits (as listed by git log --first-parent
) between our starting point, and the latest rust-lang/rust
, we found 3 relevant PRs:
- #75600 "Improve codegen for
align_offset
" by @nagisa: ~±2 instructions- seems to have went away since
- #75642 "Move doc comment parsing to rustc_lexer" by @matklad: ~±100 instructions
- specifically this commit turned out to be responsible
- we were able to work around it by using
match_indices
to further simplifycook_doc_comment
(removing thestr::contains
call in the process)
- #73905 "Separate projection bounds and predicates" by @matthewjasper: ~±100 instructions
We haven't been able to identify an explanation as to what is actually happening in any of these cases, so for now we've worked around them, but number 3 was much harder.
By introducing more fine-grained measureme
intervals, and repeatedly splitting them into smaller and smaller intervals, we were able to shrink the source of the noise to one loop.
However, any attempt at measuring candidate_should_be_dropped_in_favor_of
itself resulted in the noise going away. While this stopped us from narrowing down the source of the noise further, it did suggest the serializing instruction (which we use for rdpmc
) was helping.
And indeed, cpuid
(i.e. what we use as the serializing instruction for rdpmc
) alone is enough to stop the noise, and so is mfence
. We kept the latter because it's simpler, though it's not clear which one would be cheaper, or what impact on performance it even has.
Our theory right now is that LLVM (rarely) generates an instruction which has a performance counting bug (unless serialized), but further work is necessary to confirm or disprove that.
In the process of investigating the post-rebase issues above, we were reminded of page faults, and how, while we never observed perf stat -e page-faults:u ...
produce different totals across runs, they still overcount the instruction counter the same way IRQs do, so non-deterministic page faults would result in instructions-counting noise.
But unlike with IRQs, there is no documented hardware performance counter for "page faults", or "CPU exceptions" in general, so there is no easy way to undo the associated overcounting.
(page-faults:u
is a "software event", i.e. the kernel manually tracks it, so while we used it in our investigations, it's no substitute for a fault/exception counter implemented in hardware)
However, at least for AMD Zen CPUs, there are obvious small gaps in the numbering of the documented counters, so we decided to try and find an undocumented counter of the right magnitude (that of page-faults:u
), hiding in those gaps.
Most performance counters are configured with a 16-bit 0xMM_XX
(rMMXX:u
in perf
notation) value, where XX
and MM
are:
- XX aka "event selector": a numeric index (sequential except for undocumented gaps) corresponding to one event, or a group of related events
- MM aka "mask": a bitset of which events, sharing the same event selector, are enabled, or entirely ignored for simpler events
Because of how forgiving the hardware is, one can just iterate through event selectors, and always set the mask to 0xff
to get the maximal number of events counted, no matter how many bits are actually supported for that event selector.
The main limitation is that there are only 6 counters that can be active at any time, so the test workload has to be run for batches of 6, e.g. for 0x06..=0x0b
event selectors:
perf stat -e rff06:u,rff07:u,rff08:u,rff09:u,rff0a:u,rff0b:u ...
Once an event selector is known to produce any (non-zero) results, or if it specifically results in an interesting value, each bit of the mask can be checked separately, i.e. r01XX:u,r02XX:u,r04XX:u,r08XX:u,...
, in order to determine which bits are functional.
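Generating those probing command lines can be automated with something like this (illustrative only):

```rust
// Print `perf stat` invocations that probe 6 event selectors at a time, with
// the full 0xff mask, as described above.
fn print_probe_batches(selectors: std::ops::RangeInclusive<u8>) {
    let selectors: Vec<u8> = selectors.collect();
    for batch in selectors.chunks(6) {
        let events: Vec<String> = batch
            .iter()
            .map(|xx| format!("rff{:02x}:u", xx))
            .collect();
        println!("perf stat -e {} <workload>", events.join(","));
    }
}
```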
We went through a few dozen counters before stumbling over one (rff20:u
) which was much closer in magnitude to page-faults:u
than any others, and only one of the mask bits seemed active, resulting in r0420:u
as the simplest way to refer to the counter.
Looking at page-faults:u
, IRQs (r002c:u
) and r0420:u
together, it became clear that r0420:u
was only barely larger than the sum of page-faults:u
and IRQs, leading us to speculate that it's either counting both hardware interrupts and CPU exceptions, or perhaps even the iret
s the kernel uses to return from handling any of those interrupts/exceptions.
We might be able to figure out what it actually is by using the 32-bit x86 syscall interface (int 0x80
) to read
the perf_event_open
pseudo-file, which would cause the kernel to rdpmc
between the interrupt firing and the iret
back to userspace, and by comparing the read value with what userspace rdpmc
reports, just after the syscall returns.
However, we have not attempted this yet, and so it remains as possible future work (if we want to start using this counter).
The main usefulness of a general counter like this (as opposed to just IRQs) is being able to account for non-deterministic page faults, which may be more of a problem on machines where context-switching, or even swapping parts of RAM to disk, are more common.