\section{Profiling Arbitrary Basic Blocks}
The majority of basic blocks extracted from real-world programs cannot be profiled directly
using the standard profiling methodology,
as a basic block executed out of its original program context is likely to access memory addresses illegally.
The standard methodology for measuring the throughput of a basic block
is to measure the latency of $n$ unrolled copies of the basic block
and divide that latency by $n$.
One of the commonly used latency measurement tools is Agner Fog's measurement script~\cite{agner}.
This script is straightforward.
It uses the core cycle counter to measure the latency of an unrolled basic block and
does not modify the basic block's execution context in any way.
The first row of Table~\ref{tab:full-ablation} presents the results of applying
this methodology to our benchmark suite:
This methodology---which assumes that it is the user's responsibility to ensure a basic block's successful execution---is only able to {\em successfully profile}\footnote{
We say a basic block is successfully profiled when it meets all of the following criteria:
the basic block can be executed by our profiler without crashing;
the execution does not incur any cache misses (data or instruction);
and we are able to reproduce the measurement.
} 16.65\% of our basic blocks, failing on the rest due to segmentation faults.
Our goal is to maximize the number of basic blocks that execute without crashing, subject to the constraint that their executions satisfy our performance modeling assumptions (Section~\ref{sec:perf-models}).
We first describe a technique to map all virtual memory pages accessed by a basic block
to a single physical page before the execution of this basic block;
this achieves two objectives: the basic blocks can then be executed without crashing,
and all of the memory accesses hit the L1 data cache.
This allows us to execute most basic blocks (see Table~\ref{tab:full-ablation})
but does not guarantee that there is no instruction cache
miss (an assumption of all existing performance models),
which can become an issue when one attempts to profile large basic blocks using
the standard unrolling-based throughput profiling.
In particular, the issue with the standard approach is that the unroll factor
has to be large enough to hide the latency of the first and last few iterations,
but such an unroll factor can cause instruction cache misses for large basic blocks,
violating our modeling assumptions.
We describe our technique to profile basic blocks with smaller
unroll factors while providing accurate steady-state throughput estimation.
Table~\ref{tab:full-ablation} shows the percentage of basic blocks
we are able to profile using these two techniques.
We additionally discuss three rare events---gradual underflow of floating-point values, unaligned memory accesses,
and performance loss due to virtual address aliasing---that
can significantly affect profile fidelity.
We avoid the first event entirely and detect (and filter) the last two.
\subsection{Handling memory accesses}\label{sec:mapping}
We now describe our technique to execute arbitrary, memory-accessing basic blocks
without user intervention.
We first show a motivating example.
Consider the basic block in Figure~\ref{fig:mem-ex},
which Gzip uses to compute a CRC checksum.
Highlighted instructions show the flow of pointer values:
bits of \verb|%rdx| are used to index into a lookup table,
and the loaded table entry is then used in the next iteration to
update \verb|%rdx|.
How would one profile the throughput of this basic block?
Without the original application context,
this basic block \textit{cannot} be directly executed.
Developers of Gzip familiar with this piece of code could
properly set up a micro-benchmark by reproducing the
relevant lookup table used by this code snippet,
but such an approach would not scale to the hundreds of
thousands of basic blocks in our data set, most of which contain memory accesses.
We needed a way to automatically generate the benchmarking environment
without \textit{a priori} knowledge of or assumptions about the code being profiled.
\begin{figure}[h]
\begin{Verbatim}[commandchars=\#\{\}]
add $1, %rdi
#hl{mov %edx, %eax}
shr $8, %rdx
#hl{xor -1(%rdi), %al}
movzx %al, %eax
#hl{xor 0x4110a(, %rax, 8), %rdx}
cmp %rcx, %rdi
\end{Verbatim}
\caption{Inner loop body of updcrc from Gzip.
This basic block cannot be directly executed because
of its memory accesses.}
\label{fig:mem-ex}
\end{figure}
\begin{figure}
\begin{algorithmic}
\Function{measure}{$pageToMap$}
\State $\Call{mmapToPhysPage}{pageToMap, ...}$
\State $\Call{initialize}{}$
\State $\Call{executeUnrolledBasicBlock}{}$
\Comment{After this point all memory accesses made by the basic block are legal
and go to the L1 cache}
\State $begin \gets \Call{readPerformanceCounters}{}$
\State $\Call{initialize}{}$
\State $\Call{executeUnrolledBasicBlock}{}$
\State $end \gets \Call{readPerformanceCounters}{}$
\State $\Call{analyzeAndReportCounters}{begin, end}$
\EndFunction
\Function{monitor}{}
\State $pid \gets \Call{launchAndMonitor}{measure}$
\State $numFaults \gets 0$
\While{$true$}
\State $memAddr \gets \Call{interceptSegFault}{pid}$
\If{$\Call{isValidAddr}{memAddr}$}
\State $p \gets \Call{getPageAddr}{memAddr}$
\State $\Call{resumeProcessExecution}{pid, measure(p)}$
\State $numFaults \gets numFaults + 1$
\EndIf
\If{$numFaults > maxNumFaults$}
\State $\Call{kill}{pid}$
\State \textbf{break}
\EndIf
\EndWhile
\EndFunction
\end{algorithmic}
\caption{Pseudocode of the profiling routines.
The function \textit{measure} runs in a separate process monitored by
\textit{monitor}.
We execute an unrolled basic block twice.
The first execution is used to discover virtual pages accessed
by the basic block, which are then mapped to a single physical page.
We collect performance statistics in the second execution.}
\label{fig:code}
\end{figure}
We now describe our solution to profile arbitrary basic blocks with memory accesses.
Our profiling algorithm consists of two steps.
\begin{enumerate*}
\item We first identify the virtual pages accessed by
the basic block
and map them to a single physical page.
\item After the appropriate mapping is created,
the unrolled basic block is then profiled normally.
\end{enumerate*}
\textbf{Mapping pages to avoid SegFaults}
In addition to making all the memory accesses made by the basic block valid,
the mapping stage also ensures that all memory accesses go to the same physical page and hence stay in the L1 data cache.
Figure~\ref{fig:code} shows the algorithm we use to obtain this mapping.
Mapping is done by executing the unrolled basic block in a forked process
monitored by a parent process using \verb|ptrace|.
Each access to an unmapped virtual page triggers a segmentation fault, which is intercepted by
the monitoring process. The monitoring process then instructs the
executing process to create the appropriate mapping
and to restart execution from the \textit{beginning}
with all registers, memory values,
and flags reinitialized.
Re-initialization ensures that the memory addresses accessed in the final profiling
stage are \textit{identical} to those from the mapping stage.
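As a concrete illustration of the aliasing mechanism (a hedged sketch, not the profiler's actual implementation), the following C snippet maps two distinct virtual pages onto one shared physical page using \verb|memfd_create| and \verb|mmap|; a store through one alias is then visible through the other, and both accesses touch the same L1-resident page. The two virtual addresses are arbitrary illustrative choices.
\begin{Verbatim}
/* Sketch: alias two virtual pages to one physical page (Linux, x86-64). */
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define VA1 ((void *)0x100000000UL)   /* illustrative virtual addresses */
#define VA2 ((void *)0x200000000UL)

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("bb-phys-page", 0);   /* one backing physical page */
    ftruncate(fd, page);

    uint32_t *p1 = mmap(VA1, page, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
    uint32_t *p2 = mmap(VA2, page, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);

    p1[0] = 0x12345600;            /* store through one alias ...            */
    assert(p2[0] == 0x12345600);   /* ... is observed through the other one. */
    return 0;
}
\end{Verbatim}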
An alternative we considered was using a single process for both execution
and monitoring, but we ultimately discarded this alternative because
it leads to the possibility of the basic block inadvertently overwriting
the monitoring and bookkeeping code.
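The fault-interception loop on the monitor side can be sketched as follows; this is a simplified illustration of the scheme in Figure~\ref{fig:code}, in which the step that tells the child to create the mapping and restart \textit{measure} (e.g., over a pipe) is elided.
\begin{Verbatim}
/* Sketch of the monitor loop: intercept SIGSEGV in the traced child and
 * recover the faulting virtual page.  The child is assumed to have called
 * ptrace(PTRACE_TRACEME, ...) before executing the unrolled basic block. */
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

#define PAGE_MASK  (~0xfffUL)
#define MAX_FAULTS 64

static void monitor(pid_t pid) {
    int status, num_faults = 0;
    while (waitpid(pid, &status, 0) > 0 && !WIFEXITED(status)) {
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSEGV) {
            siginfo_t si;
            ptrace(PTRACE_GETSIGINFO, pid, 0, &si);
            unsigned long page = (unsigned long)si.si_addr & PAGE_MASK;
            /* Here the real profiler instructs the child to map `page`
             * onto the shared physical page and to restart measure()
             * with registers and memory reinitialized. */
            (void)page;
            if (++num_faults > MAX_FAULTS) {
                kill(pid, SIGKILL);
                break;
            }
            ptrace(PTRACE_CONT, pid, 0, 0);  /* resume; suppress the signal */
        } else {
            ptrace(PTRACE_CONT, pid, 0, 0);
        }
    }
}
\end{Verbatim}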
\begin{table}
\begin{tabular}{
|p{0.54\columnwidth}|p{0.39\columnwidth}|}
\hline
\textbf{(Additional) Technique} & \textbf{\% of Basic Blocks Profiled} \\
\hline
None & 16.65\% \\
\hline
Mapping all accessed pages (Section~\ref{sec:mapping}) & 91.28\%\\
\hline
More intelligent unrolling (Section~\ref{sec:unrolling}) & 94.24\%\\
\hline
\end{tabular}
\\
\caption{Percentage of basic blocks we are able to
profile as our techniques are incrementally applied.
The baseline (None) directly runs an off-the-shelf profiler such as Agner Fog's script.}
\label{tab:full-ablation}
\end{table}
\subsection{Deriving throughput from measurement}\label{sec:unrolling}
We use IACA's definition of throughput:
the average number of cycles required to execute a basic block
at steady state.\footnote{Note that this definition is the inverse of the textbook definition of throughput.}
The basic strategy one usually takes to measure basic block throughput
is unrolling a basic block multiple times and dividing the latency of the
unrolled basic block by the unroll factor.
The formula for estimating the throughput using this approach is shown
in equation~\ref{eq:naive}, where $u$ is a large unroll factor
and $\mathit{cycles}(b,u)$ is the number of cycles taken to execute basic block $b$
unrolled $u$ times.
\begin{equation}
\mathit{throughput}(b) \approx \frac{\mathit{cycles}(b,u)}{u}
\label{eq:naive}
\end{equation}
Compared to running the basic block inside a loop,
unrolling has an advantage in that the measurement is not polluted by
the control overhead incurred by looping.
A typical unroll factor is 100~\cite{ithemal,uops}.
Such a high unroll factor can, however, incur significant
L1 instruction cache misses for large basic blocks
(e.g., the already unrolled inner-loop body of a GEMM kernel).
The issue with unrolling naively is that
an unroll factor small enough to
avoid instruction cache misses is
insufficient to hide the latency of the first few iterations.
We thus adopted a different strategy to cope with large basic blocks.
Equation~\ref{eq:new} shows the formula we use to estimate throughput.
We unroll each basic block with two different unroll factors.
These factors should be large enough
to warm up the processor into a steady state.
We then measure the latency of the two unrolled basic blocks,
calculate the difference in the measurements, and divide it
by the difference of the unroll factors.
The resulting number is the throughput of the basic block.
\begin{equation}
\mathit{throughput}(b) \approx
\frac{\mathit{cycles}(b, u) - \mathit{cycles}(b, u')}{u-u'}
\label{eq:new}
\end{equation}
\begin{equation}
\mathit{cycles}(b, u') \approx
\mathit{cycles}(b, u) + (u' - u) \cdot \mathit{throughput}(b)
\label{eq:new-rearranged}
\end{equation}
Equation~\ref{eq:new} is a re-arrangement of equation~\ref{eq:new-rearranged},
which is derived from the following observation:
if the unroll factors $u$ and $u'$ are large enough
to get the processor into a steady state,
then the additional $u'-u$ (supposing $u' > u$) iterations will be processed at
the rate of the steady-state throughput.
The only requirement on the unroll factors $u$ and $u'$ in equation~\ref{eq:new}
is that they are large enough to get the processor into a steady state,
whereas the unroll factor $u$ in equation~\ref{eq:naive} additionally has to be
large enough to amortize the cost of the first few iterations.
Being able to profile with smaller unroll factors is essential
for large basic blocks that typically show up in numerical kernels where
hot inner loops are unrolled.
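In code, the estimate of equation~\ref{eq:new} amounts to a single difference quotient; the sketch below assumes the two cycle counts have already been measured with the core cycle counter.
\begin{Verbatim}
/* Sketch: steady-state throughput from two unroll factors.
 * cycles_u and cycles_u2 are core-cycle measurements of the basic block
 * unrolled u and u2 times; both factors must reach steady state. */
static double estimate_throughput(double cycles_u, double cycles_u2,
                                  unsigned u, unsigned u2) {
    return (cycles_u2 - cycles_u) / (double)(u2 - u);
}

/* Example: estimate_throughput(6490.0, 12390.0, 100, 200) == 59.0,
 * i.e., 59 cycles per basic-block iteration at steady state. */
\end{Verbatim}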
\subsection{Enforcing Modeling Invariants}\label{sec:invariants}
We guarantee that each measurement satisfies the following modeling assumptions:
\begin{enumerate*}
\item Every memory access leads to an L1 data cache hit.
\item Measurements are not affected by interrupts, context switches,
or hyper-threading.
\end{enumerate*}
These are the same assumptions made by existing models.
Admittedly, many memory accesses in deployed programs do not hit the L1 data cache;
we nonetheless ensure that all memory accesses hit the L1 data cache during our measurements, for compatibility with existing models.
To enforce these invariants, we monitor the following statistics in addition to core cycle counts;
we reject a measurement if one of these statistics is non-zero.
\begin{enumerate}
\item L1 Data Cache Read Misses
\item L1 Data Cache Write Misses
\item L1 Instruction Cache Misses
\item Context Switches
\end{enumerate}
We measure the cache misses with hardware performance counters,
and we use a system call provided by Linux to track the number of context switches.
Each \textit{unrolled} basic block is timed 16 times by default,
and we include a basic block for further analysis only if the basic block has at least
8 \textit{clean}, identical timings.
All basic blocks are measured on machines with hyper-threading disabled.
We count the elapsed cycles by reading the core cycle counter.
Unlike the reference cycle counter or the time-stamp counter, the core cycle counter yields
measurements that are unaffected by frequency scaling.
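As an illustration of how such validity counters can be read on Linux (a hedged sketch; the profiler's exact event selection and counter interface may differ), the following opens an L1 data-cache read-miss counter and a context-switch counter with \verb|perf_event_open|.
\begin{Verbatim}
/* Sketch: validity counters via perf_event_open.  Error handling elided. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_counter(uint32_t type, uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

static long long read_counter(int fd) {
    long long value = 0;
    read(fd, &value, sizeof(value));
    return value;
}

/* Usage: reset the counters, run the unrolled basic block, then reject
 * the measurement if either counter is non-zero. */
static int counters_ok(void (*run_unrolled_block)(void)) {
    int l1d_miss = open_counter(PERF_TYPE_HW_CACHE,
                                PERF_COUNT_HW_CACHE_L1D |
                                (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                                (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    int ctx_switch = open_counter(PERF_TYPE_SOFTWARE,
                                  PERF_COUNT_SW_CONTEXT_SWITCHES);
    ioctl(l1d_miss, PERF_EVENT_IOC_RESET, 0);
    ioctl(ctx_switch, PERF_EVENT_IOC_RESET, 0);
    run_unrolled_block();
    return read_counter(l1d_miss) == 0 && read_counter(ctx_switch) == 0;
}
\end{Verbatim}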
As discussed in Section~\ref{sec:mapping}, we map all virtual pages accessed by
a basic block to a single physical page.
We thus ensure that the entire range of physical memory accessed fits in the L1 cache: The L1 caches of recent Intel processors are physically tagged, and
they are sized such that the indexing bits of an address are unaffected by address translation.
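For concreteness, with typical parameters of a recent Intel L1 data cache (32\,KB, 8-way set-associative, 64-byte lines), there are $32768 / (8 \times 64) = 64$ sets, so the set index occupies address bits 6--11; with 4\,KB pages, these bits lie entirely within the page offset and are therefore unchanged by address translation.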
\subsection{Additional measurement optimizations}
We make additional optimizations to increase our chance
of successfully obtaining clean measurements.
\textbf{Initialization}
Prior to the mapping run,
we unmap all pages (except those containing the basic block).
This ensures that all memory accesses go to our single
physical page and not to some previously mapped address (e.g., in libc).
The physical page is
always filled with a ``moderately sized'' constant
(we used \verb|0x12345600| in our experiments)
to accommodate indirect memory accesses.
To see why this is necessary,
consider a basic block that first loads a pointer $p$ from memory
and then dereferences $p$.
If the value of $p$ is too low (e.g., 0)
or too high
(i.e., larger than the highest address that
a user-space program is allowed to address),
we will not be able to map the virtual page pointed to by $p$.
All general-purpose registers are initialized similarly.
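A minimal sketch of this initialization follows; the 8-byte fill granularity is an illustrative assumption, the constant itself is the one used in our experiments.
\begin{Verbatim}
/* Sketch: fill the single physical page with the constant 0x12345600 so
 * that any value later loaded and dereferenced by the basic block points
 * at a moderately sized, mappable user-space address. */
#include <stddef.h>
#include <stdint.h>

#define INIT_CONSTANT 0x12345600ULL

static void initialize_page(uint64_t *page, size_t page_bytes) {
    for (size_t i = 0; i < page_bytes / sizeof(uint64_t); i++)
        page[i] = INIT_CONSTANT;
}
\end{Verbatim}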
\textbf{Handling Subnormal Numbers}
Generally speaking, aside from division
and specialized instructions such as \verb|sqrt|,
floating-point instructions have constant timings,
except when the inputs of the operations are subnormal numbers.
In our experiments, floating-point operations whose inputs included subnormal numbers
were slowed down by up to 20x.
To ``normalize'' our timing, we configured the \verb|MXCSR| register
to disable gradual underflow.
We found 334 ($0.1\%$) basic blocks that would have been affected by
gradual underflow if we had not taken this measure.
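On x86-64, gradual underflow can be disabled by setting the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits of \verb|MXCSR|; a hedged sketch using the SSE intrinsic headers is shown below (the profiler may instead write \verb|MXCSR| directly).
\begin{Verbatim}
/* Sketch: disable gradual underflow by setting MXCSR.FTZ and MXCSR.DAZ. */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

static void disable_gradual_underflow(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* subnormal results -> 0 */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* subnormal inputs  -> 0 */
}
\end{Verbatim}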
\textbf{Detecting unaligned accesses}
Unaligned memory accesses can be slower than aligned accesses; in particular,
non-cacheline-aligned accesses can cause
an order of magnitude slowdown.
We exclude basic blocks that perform unaligned loads from our analysis.
To track the number of unaligned stores and loads that occur during profiling,
we use the
\verb|MISALIGNED_MEM_REFERENCE| hardware counter.
There are 553 (0.183\%) basic blocks that are affected by unaligned accesses.
\textbf{Virtual page aliasing}
Under our page mapping scheme,
two virtual addresses that differ by a multiple of the page size get mapped to
the same physical address. This can introduce a slowdown when the aliasing creates
an extra memory dependence. Consider the snippet \verb|*p = x; y = *(p + page_size)|:
\textit{ordinarily} the load can execute independently of the store
(because there is no dependence),
but due to the page aliasing introduced by our profiler, the load can only be issued
after the store finishes.
We remove basic blocks whose execution \textit{could} be affected by
page aliasing from our analysis;
there are 20,729 (6.28\%) such basic blocks.
Because there is no hardware counter that tracks stores serialized due
to such aliasing, we instead trace all addresses accessed during profiling and
conservatively mark a basic block if we cannot prove that its execution time is not
affected by page aliasing.
We note that we can reduce the occurrence of page aliasing by mapping the set of virtual
pages to a larger range of physical memory (e.g., two pages instead of one).
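The conservative check can be sketched as follows: a block is flagged when a load could depend on an earlier store only because two distinct virtual addresses were folded onto the same physical page. The trace format and the byte-granularity comparison are illustrative assumptions; the real analysis also has to account for access widths.
\begin{Verbatim}
/* Sketch: conservatively flag blocks whose timing may be perturbed by
 * page aliasing.  An earlier store and a later load collide if they share
 * a page offset but not a virtual address. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

struct access { bool is_store; uint64_t addr; };

static bool affected_by_page_aliasing(const struct access *trace, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!trace[i].is_store)
            continue;
        for (size_t j = i + 1; j < n; j++) {
            if (trace[j].is_store)
                continue;
            bool same_offset =
                trace[i].addr % PAGE_SIZE == trace[j].addr % PAGE_SIZE;
            if (same_offset && trace[i].addr != trace[j].addr)
                return true;   /* dependence introduced only by aliasing */
        }
    }
    return false;
}
\end{Verbatim}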
\begin{table}
\begin{tabular}{
|p{0.2\columnwidth}|p{0.18\columnwidth}|p{0.18\columnwidth}|p{0.18\columnwidth}|}
\hline \textbf{(Additional) Optimizations} &
\textbf{Measured Throughput} &
\textbf{L1 D-Cache Misses} &
\textbf{L1 I-Cache Misses} \\
\hline
None & Crashed & N/A & N/A \\
\hline
Page mapping & 6377.0 & 956 & 0 \\
\hline
Single physical page & 2273.7 & 0 & 0 \\
\hline
Disabling gradual underflow & 65.0 & 0 & 35 \\
\hline
Using smaller unroll factor & 59.0 & 0 & 0\\
\hline
\end{tabular}
\\
\caption{Measured throughput for a sample basic block when
different measurement optimizations are incrementally applied.
The basic block is extracted from one of the critical
inner-loop bodies of TensorFlow~\cite{tensorflow}'s CNN training benchmark.}
\label{tab:ablation}
\end{table}
\subsection{Generalization to Other Architectures}\label{sec:generalization}
We briefly discuss how to generalize our profiling technique to other architectures.
At a high level, there are three requirements for implementing our profiler:
\begin{itemize}
\item Being able to alias multiple virtual pages to a few physical pages with
(on average) zero overhead.
\item Performance counters that detect cache misses.
\item Being able to either detect or disable floating-point underflow.
\end{itemize}
Our technique assumes that, on average, virtual page aliasing does not incur a performance penalty---in particular, the penalty
caused by otherwise unnecessary cache invalidation
(i.e., when two aliased virtual addresses map to different cache entries,
the cache evicts one entry to ensure coherence).
We therefore assume that the target processor has a physically tagged
data cache
and that, if the cache is virtually indexed,
it is sized such that the indexing bits fall within the page offset and are thus unaffected by address translation.
These two conditions are sufficient to establish that any two aliased virtual addresses
are always mapped to the same cache entries;
to our knowledge, this applies to recent implementations of x86-64 and ARM.
The latter two requirements regarding
the portability of our technique---being able to
count cache misses and detect floating-point underflows---depend
on hardware implementations. The latest ARM microarchitectures, to our knowledge,
support both types of hardware counters.
Recent AMD x86-64 implementations also
have similar support.