
L19 cosmetic
patricklam committed Sep 13, 2024
1 parent f105751 commit f88fe65
Showing 2 changed files with 15 additions and 15 deletions.
14 changes: 7 additions & 7 deletions lectures/L19-slides.tex
@@ -52,7 +52,7 @@
\begin{frame}
\frametitle{Query Processing}

- The procedure for the database server to carry out the query are the same
+ The procedure for the database server to carry out the query is the same:

\begin{enumerate}
\item Parsing and translation
@@ -285,7 +285,7 @@

We aren't going to learn the rules here.

- Analogy: in math class you would have learned...\\
+ Analogy: in math class you would have learned (at least for $\mathbb{N}, \mathbb{Z}, \mathbb{R}\ldots$)\\
\quad $3 \times 10 \times 7$ is the same as $7 \times 10 \times 3$;\\
\quad $14 \times 13$ is the same as $14 \times 10 + 14 \times 3$.

@@ -344,7 +344,7 @@


\begin{frame}
- \frametitle{Evaluation Plan Selection - Join Focus}
+ \frametitle{Evaluation Plan Selection---Join Focus}

A simplified approach, then, focuses just on the order in which join operations are done and then on how those joins are carried out.

@@ -380,7 +380,7 @@
\end{center}


- It cannot examine all (non-symmetric) approaches and choose the optimal one. It would take too long.
+ It cannot examine all (non-symmetric) approaches and choose the optimal one. That would take too long.

\end{frame}

@@ -479,7 +479,7 @@
\begin{frame}
\frametitle{Cost Centres}

- Areas where costs for performing a query accumulates:
+ Areas where costs for performing a query accumulate:

\begin{enumerate}
\item \textbf{Disk I/O}
@@ -636,7 +636,7 @@
\begin{frame}
\frametitle{Or Don't...?}

- If, however, both constraints are removed and we cannot be sure that there is at most one address corresponding to a customer.
+ If, however, both constraints are removed, and we cannot be sure that there is at most one address corresponding to a customer\ldots

Then we have to do the join.

@@ -738,7 +738,7 @@

If we're not implementing a database, is this still useful?

- Yes -- the database is just an example!
+ Yes---the database is just an example!

Real lesson: how to programmatically generate, evaluate, and choose amongst alternatives.

16 changes: 8 additions & 8 deletions lectures/L19.tex
@@ -18,21 +18,21 @@ \section*{Optimizing Database Queries}
\item Evaluation---execution of the query according to the plan just developed.
\end{enumerate}

- The new and interesting part here is that the database server does not just execute the a pre-planned series of steps to get the result, but will adapt its approach at run-time based on what it thinks will be most efficient. It is, yes, still executing the executable code of its binary file and that does not change, but the path taken for a given request can and does vary wildly based on factors known only at run-time. How does that happen?
+ The new and interesting part here is that the database server does not just execute a pre-planned series of steps to get the result, but will adapt its approach at run-time based on what it thinks will be most efficient. It is, yes, still executing the executable code of its binary file and that does not change, but the path taken for a given request can and does vary wildly based on factors known only at run-time. How does that happen?

Usually a query is expressed in SQL, and that must then be translated into an equivalent internal expression using relational algebra. Relational algebra, super briefly, is just the set theory representation of database operations. Complex SQL queries are typically turned into \textit{query blocks}, which are translatable into relational algebra expressions. A query block has a single select-from-where expression, as well as related group-by and having clauses; nested queries are a separate query block~\cite{fds}.

A query like \texttt{SELECT salary FROM employee WHERE salary > 100000;} consists of one query block because it has only one part to it. We have possibilities. We can select all tuples where salary is more than 100~000 and then perform a projection of the salary field of that result (i.e., throw away the fields we do not need). The alternative is to do the projection of salary first and then perform the selection on the cut-down intermediate relation.
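
In relational algebra terms (a small illustrative aside; the relation and attribute names are just those from the example above), the two orderings are equivalent precisely because the selection condition only mentions the attribute being kept: $\Pi_{\text{salary}}(\sigma_{\text{salary} > 100000}(\text{employee}))$ gives the same result as $\sigma_{\text{salary} > 100000}(\Pi_{\text{salary}}(\text{employee}))$.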

Suppose there is a subquery, like \texttt{SELECT name, street, city, province, postalCode FROM address WHERE id IN (SELECT addressID FROM employee WHERE department = 'Development');}. Then there are two query blocks, one for the subquery and one for the outer query. If there are multiple query blocks, then the server does not have to follow the same strategy for both.

- What we need instead is a \textit{query execution plan}\footnote{\url{https://www.youtube.com/watch?v=fQk_832EAx4}, or \url{https://www.youtube.com/watch?v=l3FcbZXn4jM}}. To build that up, each step of the plan needs annotations that specify how to evaluate the operation, including information such as what algorithm or what index to use. An algebraic operation with the associated annotations about how to get it done is called an \textit{evaluation primitive}. The sequence of these primitives forms the plan, that is, how exactly to execute the query~\cite{dsc}.
+ What we need instead is a \textit{query execution plan}\footnote{\url{https://www.youtube.com/watch?v=fQk_832EAx4}, or \url{https://www.youtube.com/watch?v=l3FcbZXn4jM}}. To build that, each step of the plan needs annotations that specify how to evaluate the operation, including information such as what algorithm or what index to use. An algebraic operation with the associated annotations about how to get it done is called an \textit{evaluation primitive}. The sequence of these primitives forms the plan, that is, how exactly to execute the query~\cite{dsc}.

If there are multiple possible ways to carry out the plan, which there very often are, then the system will need to make some assessment about which plan is the best. It is not expected that users will write optimal queries; instead the database server should choose the best approach via \textit{query optimization}. Optimization is perhaps the wrong name for this because we are not choosing the \textit{optimal} approach; instead we will make some estimates about the query plans and try to choose the one that is most likely to be best. This suggests, as you may have guessed, we're going to use heuristics and consider trading accuracy for time.

\subsection*{Measures of Query Cost}

- If you are asked to drive a car from point A to point B and there are multiple routes, you can evaluate your choices. To do so you need to break it down into different sections, such as drive along University Avenue, then get on Highway 85, then merge onto 401... Each segment has a length and a speed, such as knowing that you will drive 4 km along University Avenue and it is signed at 50 km/h (although with red lights and traffic and whatnot the actual average speed may be more like 30 km/h). By combining all of the segments, you get an estimate of how long that particular route will take. If you do this for all routes, you can see which route is the best.
+ If you are asked to drive a car from point A to point B and there are multiple routes, you can evaluate your choices. To do so you need to break it down into different sections, such as drive along University Avenue, then get on Highway 85, then merge onto 401... Each segment has a length and a speed, such as knowing that you will drive 4 km along University Avenue and it is signed at 50 km/h (although with red lights and traffic and whatnot the actual average speed may be more like 30 km/h, or even slower than bicycle speed if you time it right). By combining all of the segments, you get an estimate of how long that particular route will take. If you do this for all routes, you can see which route is the best.
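
As a quick worked example with the numbers above: 4~km at an average of 30~km/h is $\frac{4}{30} \times 60 = 8$ minutes for that segment, and summing the corresponding estimates for the other segments gives the estimate for the whole route.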

Of course, it may turn out that real life gets in the way: if there is a crash on the highway, traffic really sucks and your conclusion that taking this particular route would be fastest turns out to be wrong. Short of being able to see into the future, this is more or less inevitable: estimates are just informed opinions, and things may be worse (or better) than expected.

@@ -64,19 +64,19 @@ \subsection*{Alternative Routes}

A simplified approach, then, focuses just on the order in which join operations are done and then on how those joins are carried out. The theory is that the join operations are likely to be the slowest and take the longest, so any optimization here is going to have the most potential benefit.

- We already know that the order of joins in a statement like $r_{1} \bowtie r_{2} \bowtie r_{3}$ (the bowtie symbol means join) is something the optimizer can choose. In this case there are 3 relations and there are 12 different join orderings. In fact, for $n$ relations there are $\dfrac{(2(n-1))!}{(n-1)!}$ possible orderings~\cite{dsc}. Some of them, are symmetric, which reduces the number that we have to calculate, since $r_{1} \bowtie r_{2}$ is not different from $r_{2} \bowtie r_{1}$ (in relational algebra). In any case, even if we can cut down the symmetrical cases the problem grows out of hand very quickly when $n$ gets larger.
+ We already know that the order of joins in a statement like $r_{1} \bowtie r_{2} \bowtie r_{3}$ (the bowtie symbol means join) is something the optimizer can choose. In this case there are 3 relations and there are 12 different join orderings. In fact, for $n$ relations there are $\dfrac{(2(n-1))!}{(n-1)!}$ possible orderings~\cite{dsc}. Some of them are symmetric, which reduces the number that we have to calculate, since $r_{1} \bowtie r_{2}$ is not different from $r_{2} \bowtie r_{1}$ (in relational algebra). In any case, even if we can cut down the symmetrical cases, the problem grows out of hand very quickly as $n$ gets larger.
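
As a quick check on how fast that count grows (just plugging into the formula above): for $n = 3$ it gives $4!/2! = 12$; for $n = 7$ it is already $12!/6! = 665\,280$; and for $n = 10$ it exceeds 17.6 billion orderings.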

Once more than three relations are affected by a join query it may be an opportunity to stop and think very hard about what is going on here, because this is quite unusual if the database design is good. The database server may want to ask why do you have a join query that goes across six or eight or twelve relations, but the database server (sadly) does not get to write the developers a nasty resignation letter saying that it can't continue to work this hard due to the negative effects on its health. It will dutifully do the work you asked it to and even try to make the best of this inefficient situation by optimizing it. But clearly it cannot examine all (non-symmetric) approaches and choose the optimal one. It would take too long.

Fortunately, we can create an algorithm that can ``remember'' subsets of the choices. If we have, for example, $r_{1} \bowtie r_{2} \bowtie r_{3} \bowtie r_{4} \bowtie r_{5}$ and the database server does not segmentation fault in disgust, we can break that down a bit. We could compute the best order for a subpart, say $(r_{1} \bowtie r_{2} \bowtie r_{3})$ and then re-use that repeatedly for any further joins with $r_{4}$ and $r_{5}$~\cite{dsc}. This ``saved'' result can be re-used repeatedly turning our problem from five relations into two three-relation problems.

This is a really big improvement, actually, considering how quickly the factorial term scales up. The trade-off for this approach is that the resultant approach may not be globally optimal (but instead just locally optimal). If $r_{1} \bowtie r_{4}$ produces very few tuples, it may be maximally efficient to do that join computation first, a strategy that will never be tried in an algorithm where $r_{1}$, $r_{2}$, and $r_{3}$ are combined to a subexpression for evaluation.
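
As a rough sketch of the memoization idea just described (a toy illustration, not any database's actual implementation; the relation names and the cost function are made-up placeholders), one could memoize the best plan for every subset of relations. Note that this exhaustive variant tries every split and is therefore optimal with respect to its own estimates, whereas the cheaper heuristic above fixes one subexpression and may miss the global optimum:

\begin{verbatim}
# Toy sketch (Python): memoize the best join plan for each subset of relations.
from functools import lru_cache

def best_join_order(relations, pair_cost):
    # pair_cost(left, right) is a placeholder estimate of the cost of
    # joining the two intermediate results.
    @lru_cache(maxsize=None)
    def best(subset):
        if len(subset) == 1:
            (only,) = subset
            return 0, only
        members = sorted(subset)
        best_cost, best_plan = float("inf"), None
        # Try every split into two non-empty halves, reusing memoized answers.
        for mask in range(1, 2 ** len(members) - 1):
            left = frozenset(m for i, m in enumerate(members) if mask & (1 << i))
            right = subset - left
            left_cost, left_plan = best(left)
            right_cost, right_plan = best(right)
            cost = left_cost + right_cost + pair_cost(left, right)
            if cost < best_cost:
                best_cost, best_plan = cost, (left_plan, right_plan)
        return best_cost, best_plan

    return best(frozenset(relations))

# Example with a fake cost model (product of subset sizes):
print(best_join_order(("r1", "r2", "r3", "r4", "r5"),
                      lambda l, r: len(l) * len(r)))
\end{verbatim}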

- Remember though, this is an estimating process. The previous statement that said $r_{1} \bowtie r_{4}$ produces very few tuples as if it is a fact. The optimizer does not know that for sure and must rely on estimates where available. So even though the optimizer may, if it had tried all possibilities, have determined that $r_{1} \bowtie r_{4}$ produces the fewest tuples and should be joined first, it is possible that estimate was off and the actual cost of a different plan was lower.
+ Remember, though, this is an estimation process. The previous statement presented it as a fact that $r_{1} \bowtie r_{4}$ produces very few tuples. The optimizer does not know that for sure and must rely on estimates where available. So even though the optimizer may, if it had tried all possibilities, have determined that $r_{1} \bowtie r_{4}$ produces the fewest tuples and should be joined first, it is possible that estimate was off and the actual cost of a different plan was lower.

The sort order in which tuples are generated is important if the result will be used in another join. A sort order is called \textit{interesting} if it is useful in a later operation. If $r_{1}$ and $r_{2}$ are being computed for a join with $r_{3}$ it is advantageous if the combined result $r_{1} \bowtie r_{2}$ is sorted on attributes that match to $r_{3}$ to make that join more efficient; if it is sorted by some attribute not in $r_{3}$ that means an additional sort will be necessary~\cite{dsc}.

- With this in mind it means that the best plan for computing a particular subset of the join query is not necessarily the best plan overall, because that extra sort may cost more than was saved by doing the join itself faster. This increases the complexity, obviously, of deciding what is optimal. Fortunately there are, usually anyway, not too many interesting sort orders~\cite{dsc}.
+ With this in mind, it means that the best plan for computing a particular subset of the join query is not necessarily the best plan overall, because that extra sort may cost more than was saved by doing the join itself faster. This increases the complexity, obviously, of deciding what is optimal. Fortunately there are, usually anyway, not too many interesting sort orders~\cite{dsc}.

\subsection*{Estimating Statistics}

@@ -107,7 +107,7 @@ \subsection*{Estimating Statistics}

The above numbers are exact values which we can know and, hopefully, trust, although they could be slightly out of date depending on when exactly metadata updates are performed. The more exact values we have, the better our guesses. But things start to get interesting when, in the previous example, we ask something that does not have a category, such as how many people have a salary larger than \$150~000, where there isn't an obvious answer found in the metadata.
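
One common textbook-style estimate, assuming (purely for illustration) that salaries are uniformly distributed between a known minimum and maximum: the fraction of tuples with salary greater than $v$ is roughly $(\max - v)/(\max - \min)$. With a hypothetical minimum of \$40~000 and maximum of \$200~000, a predicate of more than \$150~000 would be estimated to match about $50\,000/160\,000 \approx 31\%$ of employees; a histogram refines this by applying the same idea within each bucket.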

- \subsection*{Join Elimination and Making a Nested Suqbery: I know a shortcut}
+ \subsection*{Join Elimination and Making a Nested Subquery: I know a shortcut}
Join elimination is simply the idea of replacing a query that has a join (expected to be expensive) with an equivalent that does not (expected to be better). It can also turn a join query into one with a nested subquery, on the theory that two smaller queries might be easier to carry out than a big join. This is a small extension of the idea of choosing the best route to complete the request, because it's more like rewriting the original request to be a little different.
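
As a tiny illustration, using the employee and address tables from earlier: a query like \texttt{SELECT e.department FROM employee e JOIN address a ON e.addressID = a.id} asks for no columns of \texttt{address} at all, so if constraints guarantee that every employee's \texttt{addressID} is non-null and matches exactly one address row, the join adds nothing and the optimizer could answer it with just \texttt{SELECT e.department FROM employee e}. (Whether a given database actually performs this rewrite depends on the product; this is only meant to show the shape of the transformation.)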

You may ask, of course, why should the optimizer do this work at all? Why not simply count on the developers who wrote the SQL in the first place to refactor/change it so that it is no longer so inefficient? Grind leetcode\footnote{For the record, I don't think grind leetcode to get hired is a great plan, and I don't like it when companies expect that of you. It's very artificial. In my experience, most of the time, the challenge lies in understanding the requirements of the work and delivering a good experience (to users in the UI, other developers via API, to your future self/team if you want to build on this, etc...), not writing a provably optimal implementation. I get the impression leetcode interviews are as much hazing as actual assessment of your skills.} and use a better algorithm.
@@ -126,7 +126,7 @@ \subsubsection*{Shortcuts}

There are exceptions, however. One from~\cite{dsc}: suppose the query is a selection on $r \bowtie s$ where we only want attributes that are in $s$. If we do the selection first and (1) $r$ is small compared to $s$ and (2) there is an index on the join attributes of $s$, but not on any of the columns we want, then the selection is not so nice. It would throw away some useful information and force a scan on $s$; it may be better to do the join using the index and then remove the rows we don't want afterwards.
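
For reference, the standard relational-algebra equivalence behind the ``do selection early'' heuristic is $\sigma_{\theta}(r \bowtie s) \equiv \sigma_{\theta}(r) \bowtie s$ whenever the condition $\theta$ mentions only attributes of $r$; the exception above is about cases where applying it is legal but not profitable.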

- \paragraph{Perform projection early.} Analogous to the idea of doing selection early, performing projection early is good because it tosses away information we do not need and means less input to the next operations. Just like selection, however, it is possible the projection throws away an attribute that will be useful. If the query does not ask for the join attribute in the output (e.g., does it matter what a person's address ID is?) then that join attribute will need to be removed from the output but if removed too soon it makes it impossible to do the join.
+ \paragraph{Perform projection early.} Analogous to the idea of doing selection early, performing projection early is good because it tosses away information we do not need and means less input to the next operations. Just like selection, however, it is possible the projection throws away an attribute that will be useful. If the query does not ask for the join attribute in the output (e.g., does it matter what a person's address ID is?) then that join attribute will need to be removed from the output. But if removed too soon, it makes it impossible to do the join.

\paragraph{Set limits.} Another strategy for making sure we choose something appropriate within a reasonable amount of time is to set a time limit. Optimization has a certain cost and once this cost is exceeded, the process of trying to find something better stops. But how much time do we decide to allocate? A suggested strategy from~\cite{dsc} says to use some heuristic rules to very quickly guess at how long it will be. If it will be very quick then don't bother doing any further searching, just do it. If it will be moderate in cost, then a moderate optimization budget should be allocated. If it is expensive then a larger optimization budget is warranted.

Expand Down
