Skip to content

Commit

Permalink
i#6471 idle: Add better idle time modeling (#6505)
Browse files Browse the repository at this point in the history
Changes instruction quantum idle handling to use wall-clock time,
instead of a counter decremented in queue pops. The counter was skewed
unfairly by the rate of queue queries. This results in all quanta using
a unified time-based approach for blocked time.

Changes block_time_scale to remove the division by the latency
threshold. Now the scale is directly multiplied by the latency to result
in the blocked time units. The default scale is set to 1000 which
matches the wall-clock time to process an instruction by the
schedule_stats tool (about 1 instruction per microsecond) and which
should be a rough match for many simulators passing one nanosecond cycle
per instruction as the time unit.

Adds a new scheduler option block_time_max and drcachesim option
-sched_block_max which caps the blocked time for a syscall (default 25
seconds) to avoid outliers being scaled to extreme amounts of time.

Adds a heartbeat of queue lookups, currently only in debug build, to
help understand behavior over time.

Sets the input ordinal to invalid while idle. This also causes
schedule_stats to consider a transition from idle to a valid input to be
a context switch, which matches how the Linux kernel counts switches.

Augments schedule_stats with a wall-clock measure of cpu and idle time
for a much fairer %cpu metric than the previous one based purely on
record counts, as schedule_stats processes an instruction much more
quickly than an idle record. Adds two %cpu metrics: one that does not
include idle time past the final instruction (for skewed finishes across
cores) and one that includes idle time until all cpus are done.

Increases the default -schedule_stats_print_every to 500K to avoid
extremely long strings for larger workloads.

Updates the scheduler unit tests for the timing changes, changing the
ones testing idle time to use a deterministic mock time quantum to avoid
wall-clock time flakiness.

Tested on large traces on dozens of cores with known idle
characteristics where by tweaking the parameters I was able to get
reasonably representative idle times.

Issue: #6471
  • Loading branch information
derekbruening authored Dec 14, 2023
1 parent dede2ba commit 6d1912f
Show file tree
Hide file tree
Showing 10 changed files with 395 additions and 157 deletions.
1 change: 1 addition & 0 deletions clients/drcachesim/analyzer_multi.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,7 @@ analyzer_multi_t::init_dynamic_schedule()
sched_ops.syscall_switch_threshold = op_sched_syscall_switch_us.get_value();
sched_ops.blocking_switch_threshold = op_sched_blocking_switch_us.get_value();
sched_ops.block_time_scale = op_sched_block_scale.get_value();
sched_ops.block_time_max = op_sched_block_max_us.get_value();
#ifdef HAS_ZIP
if (!op_record_file.get_value().empty()) {
record_schedule_zip_.reset(new zipfile_ostream_t(op_record_file.get_value()));
Expand Down
30 changes: 23 additions & 7 deletions clients/drcachesim/common/options.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -851,11 +851,27 @@ droption_t<uint64_t> op_sched_blocking_switch_us(
"maybe-blocking to incur a context switch. Applies to -core_sharded and "
"-core_serial. ");

droption_t<double>
op_sched_block_scale(DROPTION_SCOPE_ALL, "sched_block_scale", 1.,
"Input block time scale factor",
"A higher value here results in blocking syscalls "
"keeping inputs unscheduled for longer.");
droption_t<double> op_sched_block_scale(
DROPTION_SCOPE_ALL, "sched_block_scale", 1000., "Input block time scale factor",
"The scale applied to the microsecond latency of blocking system calls. A higher "
"value here results in blocking syscalls keeping inputs unscheduled for longer. "
"This should roughly equal the slowdown of instruction record processing versus the "
"original (untraced) application execution.");

// We have a max to avoid outlier latencies that are already a second or more from
// scaling up to tens of minutes. We assume a cap is representative as the outliers
// likely were not part of key dependence chains. Without a cap the other threads all
// finish and the simulation waits for tens of minutes further for a couple of outliers.
// The cap remains a flag and not a constant as different length traces and different
// speed simulators need different idle time ranges, so we need to be able to tune this
// to achieve desired cpu usage targets. The default value was selected while tuning
// a 1-minute-long schedule_stats run on a 112-core 500-thread large application
// to produce good cpu usage without unduly increasing tool runtime.
droption_t<uint64_t> op_sched_block_max_us(DROPTION_SCOPE_ALL, "sched_block_max_us",
25000000,
"Maximum blocked input time, in microseconds",
"The maximum blocked time, after scaling with "
"-sched_block_scale.");
#ifdef HAS_ZIP
droption_t<std::string> op_record_file(DROPTION_SCOPE_FRONTEND, "record_file", "",
"Path for storing record of schedule",
Expand All @@ -875,8 +891,8 @@ droption_t<std::string>

// Schedule_stats options.
droption_t<uint64_t>
op_schedule_stats_print_every(DROPTION_SCOPE_ALL, "schedule_stats_print_every", 5000,
"A letter is printed every N instrs",
op_schedule_stats_print_every(DROPTION_SCOPE_ALL, "schedule_stats_print_every",
500000, "A letter is printed every N instrs",
"A letter is printed every N instrs or N waits");

droption_t<std::string> op_syscall_template_file(
Expand Down
1 change: 1 addition & 0 deletions clients/drcachesim/common/options.h
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,7 @@ extern dynamorio::droption::droption_t<bool> op_sched_order_time;
extern dynamorio::droption::droption_t<uint64_t> op_sched_syscall_switch_us;
extern dynamorio::droption::droption_t<uint64_t> op_sched_blocking_switch_us;
extern dynamorio::droption::droption_t<double> op_sched_block_scale;
extern dynamorio::droption::droption_t<uint64_t> op_sched_block_max_us;
#ifdef HAS_ZIP
extern dynamorio::droption::droption_t<std::string> op_record_file;
extern dynamorio::droption::droption_t<std::string> op_replay_file;
Expand Down
153 changes: 87 additions & 66 deletions clients/drcachesim/scheduler/scheduler.cpp

Large diffs are not rendered by default.

40 changes: 27 additions & 13 deletions clients/drcachesim/scheduler/scheduler.h
Original file line number Diff line number Diff line change
Expand Up @@ -524,16 +524,27 @@ template <typename RecordType, typename ReaderType> class scheduler_tmpl_t {
uint64_t blocking_switch_threshold = 100;
/**
* Controls the amount of time inputs are considered blocked at a syscall whose
* latency exceeds #syscall_switch_threshold or #blocking_switch_threshold. A
* "block time factor" is computed from the syscall latency divided by either
* #syscall_switch_threshold or #blocking_switch_threshold. This factor is
* multiplied by this field #block_time_scale to produce a final value. For
* #QUANTUM_TIME, that final value's amount of time, as reported by the time
* parameter to next_record(), must pass before the input is no longer considered
* blocked. For instruction quanta, that final value's count of scheduler
* selections must occur before the input is actually selected.
* latency exceeds #syscall_switch_threshold or #blocking_switch_threshold. The
* syscall latency (in microseconds) is multiplied by this field to produce the
* blocked time. For #QUANTUM_TIME, that blocked time in the units reported by
* the time parameter to next_record() must pass before the input is no longer
* considered blocked. Since the system call latencies are in microseconds, this
* #block_time_scale should be set to the number of next_record() time units in
* one simulated microsecond. For #QUANTUM_INSTRUCTIONS, the blocked time in
* wall-clock microseconds must pass before the input is actually selected
* (wall-clock time is used as there is no reasonable alternative with no other
* uniform notion of time); thus, the #block_time_scale value here should equal
* the slowdown of the instruction record processing versus the original
* (untraced) application execution. The blocked time is clamped to a maximum
* value controlled by #block_time_max.
*/
double block_time_scale = 1.;
double block_time_scale = 1000.;
/**
* The maximum time, in microseconds, for an input to be considered blocked
* for any one system call. This is applied after multiplying by
* #block_time_scale.
*/
uint64_t block_time_max = 25000000;
};

/**
Expand Down Expand Up @@ -1018,8 +1029,9 @@ template <typename RecordType, typename ReaderType> class scheduler_tmpl_t {
uint64_t prev_time_in_quantum = 0;
uint64_t time_spent_in_quantum = 0;
// These fields model waiting at a blocking syscall.
double block_time_factor = 0.;
uint64_t blocked_start_time = 0; // For QUANTUM_TIME only.
// The units are us for instr quanta and simuilation time for time quanta.
uint64_t blocked_time = 0;
uint64_t blocked_start_time = 0;
};

// Format for recording a schedule to disk. A separate sequence of these records
Expand Down Expand Up @@ -1232,7 +1244,7 @@ template <typename RecordType, typename ReaderType> class scheduler_tmpl_t {
// Finds the next input stream for the 'output_ordinal'-th output stream.
// No input_info_t lock can be held on entry.
stream_status_t
pick_next_input(output_ordinal_t output, double block_time_factor);
pick_next_input(output_ordinal_t output, uint64_t blocked_time);

// Helper for pick_next_input() for MAP_AS_PREVIOUSLY.
// No input_info_t lock can be held on entry.
Expand Down Expand Up @@ -1354,7 +1366,7 @@ template <typename RecordType, typename ReaderType> class scheduler_tmpl_t {
// The input's lock must be held by the caller.
// Returns a multiplier for how long the input should be considered blocked.
bool
syscall_incurs_switch(input_info_t *input, double &block_time_factor);
syscall_incurs_switch(input_info_t *input, uint64_t &blocked_time);

// sched_lock_ must be held by the caller.
// "for_output" is which output stream is looking for a new input; only an
Expand Down Expand Up @@ -1385,6 +1397,8 @@ template <typename RecordType, typename ReaderType> class scheduler_tmpl_t {
// timestamp in each workload in order to mix inputs from different workloads in the
// same queue. FIFO ordering is used for same-priority entries.
flexible_queue_t<input_info_t *, InputTimestampComparator> ready_priority_;
// Trackes the count of blocked inputs. Protected by sched_lock_.
int num_blocked_ = 0;
// Global ready queue counter used to provide FIFO for same-priority inputs.
uint64_t ready_counter_ = 0;
// Count of inputs not yet at eof.
Expand Down
43 changes: 34 additions & 9 deletions clients/drcachesim/tests/schedule_stats_nopreempt.templatex
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,10 @@ Total counts:
4 cores
8 threads
638938 instructions
5 total context switches
0\.0078255 CSPKI \(context switches per 1000 instructions\)
127788 instructions per context switch
5 voluntary context switches
6 total context switches
0\.0093906 CSPKI \(context switches per 1000 instructions\)
106490 instructions per context switch
6 voluntary context switches
0 direct context switches
100\.00% voluntary switches
0\.00% direct switches
Expand All @@ -15,7 +15,12 @@ Total counts:
0 direct switch requests
0 waits
*[0-9]* idles
*[0-9\.]*% cpu busy
*[0-9\.]*% cpu busy by record count
*[0-9]* cpu microseconds
*[0-9]* idle microseconds
*[0-9]* idle microseconds at last instr
*[0-9\.]*% cpu busy by time
*[0-9\.]*% cpu busy by time, ignoring idle past last instr
Core #0 counts:
. threads
*[0-9]* instructions
Expand All @@ -31,7 +36,12 @@ Core #0 counts:
0 direct switch requests
0 waits
*[0-9]* idles
*[0-9\.]*% cpu busy
*[0-9\.]*% cpu busy by record count
*[0-9]* cpu microseconds
*[0-9]* idle microseconds
*[0-9]* idle microseconds at last instr
*[0-9\.]*% cpu busy by time
*[0-9\.]*% cpu busy by time, ignoring idle past last instr
Core #1 counts:
. threads
*[0-9]* instructions
Expand All @@ -47,7 +57,12 @@ Core #1 counts:
0 direct switch requests
0 waits
*[0-9]* idles
*[0-9\.]*% cpu busy
*[0-9\.]*% cpu busy by record count
*[0-9]* cpu microseconds
*[0-9]* idle microseconds
*[0-9]* idle microseconds at last instr
*[0-9\.]*% cpu busy by time
*[0-9\.]*% cpu busy by time, ignoring idle past last instr
Core #2 counts:
. threads
*[0-9]* instructions
Expand All @@ -63,7 +78,12 @@ Core #2 counts:
0 direct switch requests
0 waits
*[0-9]* idles
*[0-9\.]*% cpu busy
*[0-9\.]*% cpu busy by record count
*[0-9]* cpu microseconds
*[0-9]* idle microseconds
*[0-9]* idle microseconds at last instr
*[0-9\.]*% cpu busy by time
*[0-9\.]*% cpu busy by time, ignoring idle past last instr
Core #3 counts:
. threads
*[0-9]* instructions
Expand All @@ -79,7 +99,12 @@ Core #3 counts:
0 direct switch requests
0 waits
*[0-9]* idles
*[0-9\.]*% cpu busy
*[0-9\.]*% cpu busy by record count
*[0-9]* cpu microseconds
*[0-9]* idle microseconds
*[0-9]* idle microseconds at last instr
*[0-9\.]*% cpu busy by time
*[0-9\.]*% cpu busy by time, ignoring idle past last instr
Core #0 schedule: [A-Ha-h_]*
Core #1 schedule: [A-Ha-h_]*
Core #2 schedule: [A-Ha-h_]*
Expand Down
19 changes: 18 additions & 1 deletion clients/drcachesim/tests/schedule_stats_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -196,6 +196,11 @@ test_basic_stats()
assert(result.maybe_blocking_syscalls == 3);
assert(result.direct_switch_requests == 2);
assert(result.waits == 3);
assert(result.idle_microseconds == 0);
// XXX: For Windows test VMs we see coarse time updates resulting in 0's.
#ifndef WIN32
assert(result.cpu_microseconds > 0);
#endif
return true;
}

Expand All @@ -216,6 +221,7 @@ test_idle()
gen_marker(TID_B, TRACE_MARKER_TYPE_CORE_IDLE, 0),
gen_marker(TID_B, TRACE_MARKER_TYPE_CORE_IDLE, 0),
gen_marker(TID_B, TRACE_MARKER_TYPE_CORE_IDLE, 0),
// A switch from idle w/ no syscall is an involuntary switch.
gen_instr(TID_B),
gen_instr(TID_B),
gen_instr(TID_B),
Expand All @@ -229,6 +235,7 @@ test_idle()
gen_marker(TID_C, TRACE_MARKER_TYPE_CORE_IDLE, 0),
gen_marker(TID_C, TRACE_MARKER_TYPE_CORE_IDLE, 0),
gen_marker(TID_C, TRACE_MARKER_TYPE_CORE_IDLE, 0),
// A switch from idle w/ no syscall is an involuntary switch.
gen_instr(TID_C),
gen_instr(TID_C),
// Wait.
Expand All @@ -243,14 +250,24 @@ test_idle()
};
auto result = run_schedule_stats(memrefs, tid2ord);
assert(result.instrs == 13);
assert(result.total_switches == 3);
assert(result.total_switches == 5);
assert(result.voluntary_switches == 0);
assert(result.direct_switches == 0);
assert(result.syscalls == 0);
assert(result.maybe_blocking_syscalls == 0);
assert(result.direct_switch_requests == 0);
assert(result.waits == 3);
assert(result.idles == 6);
// It is hard to test wall-clock time precise values so we have sanity checks.
std::cerr << "got idle " << result.idle_microseconds << "us, cpu "
<< result.cpu_microseconds << "us\n"; // NOCHECK
// XXX: For Windows test VMs we see coarse time updates resulting in 0's.
#ifndef WIN32
assert(result.idle_microseconds > 0);
assert(result.idle_micros_at_last_instr > 0 &&
result.idle_micros_at_last_instr <= result.idle_microseconds);
assert(result.cpu_microseconds > 0);
#endif
return true;
}

Expand Down
Loading

0 comments on commit 6d1912f

Please sign in to comment.