From e786d216bacfdd42bddda67737b11725900ba4d5 Mon Sep 17 00:00:00 2001 From: Patrick Lam Date: Wed, 11 Sep 2024 12:48:38 +1200 Subject: [PATCH] L13 cosmetic, plus mention TPUs --- lectures/L13-slides.tex | 6 +++--- lectures/L13.tex | 16 ++++++++-------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/lectures/L13-slides.tex b/lectures/L13-slides.tex index f9a07b0c..18958f0f 100644 --- a/lectures/L13-slides.tex +++ b/lectures/L13-slides.tex @@ -270,7 +270,7 @@ We do more with less! -Well, you can use \texttt{float/f32} instead of \texttt{double/f64}. +Well, you can use \texttt{float/f32} instead of \texttt{double/f64} (or smaller, on TPUs). But you can also work with integers to represent floating point numbers. @@ -355,7 +355,7 @@ \frametitle{Oh, never mind...} -The clever hack is somewhat obsoleted now by the fact that CPU instructions now exist to give you fast inverse square root. +The clever hack is somewhat obsoleted now by the fact that CPU instructions now exist to give you fast inverse square root (CISC-y!) This was obviously not something you could rely on in 1999. @@ -504,7 +504,7 @@ \item With approximations: 147 seconds. \end{itemize} -Of course, parallelizing this helps even more... +Of course, parallelizing this helps even more\ldots (but, why not both?) \end{frame} diff --git a/lectures/L13.tex b/lectures/L13.tex index 878f11e7..fae0d765 100644 --- a/lectures/L13.tex +++ b/lectures/L13.tex @@ -12,7 +12,7 @@ \section*{Trading Accuracy for Time} \subsection*{Early Phase Termination} -The formal name for the first idea, quitting early, is is early phase +A formal name for the first idea, quitting early, is early phase termination~\cite{Rinard:2007:EarlyPhaseTermination}. So, to apply it to a concrete idea: we've talked about barriers quite a bit. Recall that the idea is that no thread may proceed past a barrier until all of the threads reach the barrier. Waiting for other threads causes delays. @@ -35,7 +35,7 @@ \subsection*{Early Phase Termination} One way we can know if we're wasting time is to remember previous outcomes. The solution we're evaluating will have some travel cost in units (maybe kms). If the currently-accumulated cost in kms is larger than the total of the thus-far best solution, give up. To be specific, if we have a route that has 400 km of driving and we are partway through building a solution and we have already got 412 km of driving, we can give up on this option (and not evaluate the rest of it) because we already know it won't be the best. -Another approach is to stop as soon as you have a solution that's reasonable. If our target is to get total travel under 500 km then we can stop searching as soon as we find one that satisfies this constraint. Yes, we might stop at 499 km and the optimal solution might be 400 (25\% more driving for the poor peon) -- but it does not have to be perfect; it just has to be acceptable. And if traffic in the hypothetical region is anything like that of the GTA, the route that is shortest in kilometres may not be the shortest in terms of time anyway. +Another approach is to stop as soon as you have a solution that's reasonable. If our target is to get total travel under 500 km then we can stop searching as soon as we find one that satisfies this constraint. Yes, we might stop at 499 km and the optimal solution might be 400 (25\% more driving for the poor peon)---but it does not have to be perfect; it just has to be acceptable. And if traffic in the hypothetical region is anything like that of the GTA, the route that is shortest in kilometres may not be the shortest in terms of time anyway. You can also choose to reduce the amount of effort by trying, say, five or ten different possibilities and seeing which of those is the best. There's no guarantee you'll get an optimal solution: you might have randomly chosen the ten worst options you could choose. @@ -58,7 +58,7 @@ \subsection*{Early Phase Termination} In other cases, some threads simply take too long, but we don't need all of them to produce a result. If we are evaluating some protocol where the majority wins, we can stop as soon as sufficient results have been returned; either an outright majority for an option or that the remaining votes couldn't change the outcome. This happens to some extent with election projections: even if not all polling stations are reporting a result, news channels will declare a winner if the remaining votes would not be enough to change the outcome. Actually, news channels probably take it a bit too far in that they will declare a winner even if the outstanding votes exceed the margin, on a theory that it probably won't be the case that they are 100\% for the candidate who is in second place. But they can be wrong. \paragraph{Slow Road...} -For some categories of problem, we know not only that a solution will exist, but also how many steps it takes to solve (optimally). Consider the Rubik's Cube -- it's much easier to explain if you have seen one. It'll appear in the slides, but if you're just reading the note(book/s) then I suggest you google it\footnote{If you've waited for the exam to read this and you can't google... whoops!}. +For some categories of problem, we know not only that a solution will exist, but also how many steps it takes to solve (optimally). Consider the Rubik's Cube---it's much easier to explain if you have seen one. It'll appear in the slides, but if you're just reading the note(book/s) then I suggest you google it\footnote{If you've waited for the exam to read this and you can't google... whoops!}. This is a problem with a huge number of possible permutations and brute force isn't going to work. However, research has proven that no matter what the state of the cube is, it can be transitioned to a solved state in 20 moves or fewer. This number is called God's Number, presumably because it is the maximum number of moves it would take an all-knowing deity to solve the puzzle. So if you have a solver for a Rubik's cube, and if you don't find a solution in (fewer than) 20 moves, you should cancel this solution attempt and try another one. @@ -66,7 +66,7 @@ \subsection*{Early Phase Termination} \subsection*{Reduced-Resource Computation} -The formal name for the second idea is ``reduced resource computation'' -- that is to say, we do more with less! Austerity programs for our computer programs. Well, you can use \texttt{float} instead of \texttt{double}. But you can also work with integers to represent floating point numbers (e.g., representing money in an integer number of cents). But let's really think about when this is appropriate. +The formal name for the second idea is ``reduced resource computation''---that is to say, we do more with less! Austerity programs for our computer programs. Well, you can use \texttt{float} instead of \texttt{double}. Indeed, Google's Tensor Processing Units appear to use perhaps even fewer than 8 bits to represent a floating-point number. Or, you can also work with integers to represent floating point numbers (e.g., representing money in an integer number of cents). But let's really think about when this is appropriate. \paragraph{Circuit... Analysis!} Recall that, in scientific computations, you're entering points that were measured (with some error) and that you're computing using machine numbers (also with some error). Computers are only providing simulations, not the ground truth; the question is whether the simulation is good enough. @@ -92,7 +92,7 @@ \subsection*{Reduced-Resource Computation} There's a lot of explanation and a lot of math in the source material but it comes down to how the float is stored. The float starts with a sign bit, then the exponent, and then the mantissa (math reminder: in $1.95 \times 10^{3}$, the exponent is 3 and the mantissa is 1.95). -The clever hack is somewhat obsoleted now by the fact that CPU instructions now exist to give you fast inverse square root. This was obviously not something you could rely on in 1999, but we're going to revisit the idea of using clever CPU instructions to speed things along in the next lecture. So if we say pretend this float is an integer we end up with this~\cite{fisqrt2}: +The clever hack is somewhat obsoleted now by the fact that CPU instructions now exist to give you fast inverse square root. This was obviously not something you could rely on in 1999 (and is pretty CISC-y), but we're going to revisit the idea of using clever CPU instructions to speed things along in the next lecture. So if we say pretend this float is an integer we end up with this~\cite{fisqrt2}: \begin{center} \includegraphics[width=0.4\textwidth]{images/float-to-int.png} @@ -100,7 +100,7 @@ \subsection*{Reduced-Resource Computation} If it's not obvious, this plot rather resembles the plot of $-1/\sqrt{x}$. So we are pretty close to getting where we need to go. All we need is to invert it and then do a little bit of an offset. The seemingly magic number of \texttt{0x5f3759df} is not a bit pattern, but just a calculated offset to make the approximation a little bit better. Then we turn it back into a float. -The last step is then to do one quick iteration of Newton's method to refine the calculation a little bit and we have a great solution: it is a fast, constant-time calculation for something that normally would be difficult, and it's very accurate, something like 0.175\% error at the most. And in a 3D game a tiny inaccuracy is not a big deal! Especially in one from 1999. It wasn't exactly photorealistic to begin with, now was it...? +The last step is then to do one quick iteration of Newton's method to refine the calculation a little bit and we have a great solution: it is a fast, constant-time calculation for something that normally would be difficult, and it's very accurate, something like 0.175\% error at the most. And in a 3D game a tiny inaccuracy is not a big deal! Especially in one from 1999. It wasn't exactly photorealistic to begin with, now was it\ldots? This is the best case scenario: the accuracy that we trade for speed is both very small and its application is one in which a small difference is not noticeable. This is beyond ``close enough is good enough'', this is hardly any tradeoff at all. @@ -108,7 +108,7 @@ \subsection*{Reduced-Resource Computation} \paragraph{N-Body Problem} A common physics problem that programmers are asked to simulate is the N-Body problem: you have some number of bodies (N, obviously) and they interact via gravitational forces. The program needs to compute the movements of the bodies over time. This is a typical example of a program that is well suited to parallelization: you can compute the forces on each body $n$ from all other bodies in parallel. This was even at one time an OpenCL assignment in this course, although now there are too many good solutions on the internet so it was replaced. Bummer. -What can you do here if you want to speed it up even more? You could look for optimizations that trade off accuracy for performance. As you might imagine, using \texttt{float} instead of \texttt{double} can save half the space which should make things quite a bit faster. But you want more... +What can you do here if you want to speed it up even more? You could look for optimizations that trade off accuracy for performance. As you might imagine, using \texttt{float} instead of \texttt{double} can save half the space which should make things quite a bit faster. But you want more\ldots Then we need some domain knowledge. That is, we need to think about what we know about the problem and we can make a decision about what is important and what is not. If we thought about what's important for determining the forces, what would we consider to be the most important? @@ -154,7 +154,7 @@ \subsection*{Reduced-Resource Computation} Also, this is before any parallelization (no threads are spawned). We can calculate forces on each point pretty effectively in parallel; we can also parallelize the calculations of the centre of mass quite easily. Both would speed up the program quite a lot! -If I just parallelize version without approximations (using the rayon parallel iterator), it takes about 25 seconds to run, and parallelizing the version with bins (using the same in a very naive parallelization) gets the execution time for 100~000 points down to about the same 25 seconds. It is clear that parallelizing the problem has a much greater effect than the tradeoff of accuracy for time (at least in this implementation), but on a sufficiently large problem, everything counts. +If I just parallelize the version without approximations (using the rayon parallel iterator), it takes about 25 seconds to run, and parallelizing the version with bins (using the same in a very naive parallelization) gets the execution time for 100~000 points down to about the same 25 seconds. It is clear that parallelizing the problem has a much greater effect than the tradeoff of accuracy for time (at least in this implementation), but on a sufficiently large problem, everything counts. \subsection*{Loop perforation}