
Commit

L30 minor
patricklam committed Sep 17, 2024
1 parent 9e73e8e commit ce5d39e
Showing 2 changed files with 12 additions and 13 deletions.
7 changes: 3 additions & 4 deletions lectures/L30-slides.tex
@@ -223,8 +223,8 @@
\begin{itemize}
\item
Once upon a time: physical machine; shared hosting.
- \item Virtualization:
- \item Clouds
+ \item Virtualization.
+ \item Clouds.
\end{itemize}

Servers typically share persistent storage, also in
@@ -384,8 +384,7 @@
big data systems.

Domain: graph processing
- algorithms---
- PageRank and graph connectivity \\
+ algorithms---PageRank and graph connectivity \\
(bottleneck is label propagation).

Subjects: graphs with billions of edges\\
18 changes: 9 additions & 9 deletions lectures/L30.tex
@@ -13,7 +13,7 @@ \section*{Clusters and Cloud Computing}
multiple threads or multiple processes, you can do the same with multiple
computers. We'll survey techniques for programming for
performance using multiple computers; although there's overlap with
- distributed systems, we're looking more at calculations here.
+ distributed systems, we're looking more at calculations here than at coordination mechanisms.

\paragraph*{Message Passing.} Rust encourages message-passing, but
a lot of your previous experience when working with C may have centred around
@@ -30,12 +30,12 @@ \section*{Clusters and Cloud Computing}
Interface}, a de facto standard for programming message-passing multi-
computer systems. This is, unfortunately, no longer the way.
MPI sounds good, but in practice people tend to use other things.
- Here's a detailed piece about the relevance of MPI today:~\cite{hpcmpi}, if
+ Here's a detailed piece about the relevance of MPI as of 10 years ago:~\cite{hpcmpi}, if
you are curious.
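
Since these notes keep coming back to message passing, here's a minimal sketch of the idea within a single machine, using Rust's standard-library channels (\texttt{std::sync::mpsc}). This is the in-process analogue of what MPI does across machines (workers compute independently and send results to a coordinator); it is not an MPI program itself.

\begin{lstlisting}
use std::sync::mpsc;
use std::thread;

fn main() {
    // Channel that all workers use to report back to the coordinator.
    let (tx, rx) = mpsc::channel();
    for rank in 0..4u64 {
        let tx = tx.clone();
        thread::spawn(move || {
            // Stand-in for an expensive local computation.
            let partial: u64 = (0..1_000_000u64).map(|x| x + rank).sum();
            tx.send((rank, partial)).unwrap();
        });
    }
    drop(tx); // drop the original sender so the receive loop can end

    // The coordinator gathers results, much like a gather/reduce step.
    for (rank, partial) in rx {
        println!("worker {rank} produced {partial}");
    }
}
\end{lstlisting}

The same shape, independent workers plus explicit sends and receives, is what message passing gives you across machine boundaries.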

\paragraph{REST}
We've already seen asynchronous I/O using HTTP (curl) which we could use to
- consume a REST API as one mechanism for multi-computer communication. You
+ interact with a REST API as one mechanism for multi-computer communication. You
may have also learned about sockets and know how to use those, which would
underlie a lot of the mechanisms we're discussing. The socket approach is too
low-level for what we want to discuss, while the REST API approach is at a
@@ -54,11 +54,12 @@
Communication is based around the idea of producers writing a record (some data element, like an invoice) into a topic (categorizing messages) and consumers taking the item from the topic and doing something useful with it. A message remains available for a fixed period of time and can be replayed if needed. I think at this point you have enough familiarity with the concept of the producer-consumer problem and channels/topics/subscriptions that we don't need to spend a lot of time on it.


- Kafka's basic strategy is to write things into an immutable log. The log is split into different partitions; you choose how many when creating the topic, where more partitions equals higher parallelism. The producer writes something and it goes into one of the partitions. Consumers read from each one of the partitions and writes down its progress (``commit its offset'') to keep track of how much of the topic it has consumed. See this image from \url{kafka.apache.org}:
+ Kafka's basic strategy is to write things into an immutable log. The log is split into different partitions; you choose how many when creating the topic, and more partitions means higher parallelism. The producer writes something and it goes into one of the partitions. Consumers read from the partitions, and each consumer writes down its progress (``commits its offset'') to keep track of how much of the topic it has consumed. See this image from \url{kafka.apache.org}:

\begin{center}
\includegraphics[width=0.4\textwidth]{images/kafka-partition.png}
\end{center}
+ \vspace*{-1.5em}

The nice part about such an architecture is that we can provision the parallelism that we want, and the logic for the broker (the system between the producer and the consumer, that is, Kafka) is simple. Also, consumers can take items and deal with them at their own speed and there's no need for consumers to coordinate; they manage their own offsets. Messages are removed from the topic based on their expiry, so it's not important for consumers to get them out of the queue as quickly as possible.
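
To make the partition-and-offset idea concrete, here's a tiny in-memory sketch in Rust. This is emphatically not the Kafka client API (for that you would reach for a crate such as \texttt{rdkafka}); it just models a topic as a set of append-only logs and a consumer that tracks, and ``commits'', its own offsets.

\begin{lstlisting}
use std::collections::HashMap;

/// A toy "topic": a fixed number of append-only partition logs.
struct Topic {
    partitions: Vec<Vec<String>>,
}

impl Topic {
    fn new(num_partitions: usize) -> Self {
        Topic { partitions: vec![Vec::new(); num_partitions] }
    }

    /// Producer side: pick a partition from the key, then append.
    fn produce(&mut self, key: &str, record: String) {
        let p = key.len() % self.partitions.len(); // stand-in for a real hash
        self.partitions[p].push(record);
    }
}

/// A toy consumer that remembers how far it has read in each partition.
struct Consumer {
    offsets: HashMap<usize, usize>, // partition -> next offset to read
}

impl Consumer {
    fn new() -> Self {
        Consumer { offsets: HashMap::new() }
    }

    /// Read any records past our committed offset, then "commit" the new offset.
    fn poll(&mut self, topic: &Topic) -> Vec<String> {
        let mut out = Vec::new();
        for (p, log) in topic.partitions.iter().enumerate() {
            let offset = self.offsets.entry(p).or_insert(0);
            while *offset < log.len() {
                out.push(log[*offset].clone());
                *offset += 1; // committing the offset is just bookkeeping
            }
        }
        out
    }
}

fn main() {
    let mut topic = Topic::new(3);
    topic.produce("invoice-17", "amount=25".to_string());
    topic.produce("invoice-18", "amount=40".to_string());

    let mut consumer = Consumer::new();
    println!("{:?}", consumer.poll(&topic)); // both records
    println!("{:?}", consumer.poll(&topic)); // empty: offsets already committed
}
\end{lstlisting}

The property to notice is the same one that keeps the real broker simple: the log just accumulates records, and each consumer owns its offsets, so consumers never have to coordinate with one another.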

@@ -104,8 +105,7 @@ \subsection*{Cloud Computing}
instances, that you've started up. Providers offer different instance
sizes, where the sizes vary according to the number of cores, local
storage, and memory. Some instances even have GPUs, but it seemed
- uneconomic to use this for Assignment 3, at least in previous years (I
- have not done the calculation this year).
+ uneconomic to use this for Assignment 3.
Instead we have the {\tt ecetesla} machines.

\paragraph{Launching Instances.} When you need more compute power,
@@ -152,7 +152,7 @@ \section*{Clusters versus Laptops}
\paragraph{Results.} 128 cores don't consistently beat a laptop at PageRank: e.g. 249--857s on the twitter\_rv dataset for the big data system vs 300s for the laptop, and they are 2$\times$ slower for label
propagation, at 251--1784s for the big data system vs 153s on
twitter\_rv. From the blogpost:

+ \vspace*{-1.5em}
\begin{center}
\includegraphics[width=0.60\textwidth]{images/pagerank.png}
\end{center}
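
For a sense of what the laptop is actually computing, here is a rough single-threaded label-propagation sketch in Rust (illustrative only; it is not the code from the study). Every vertex starts with its own label and repeatedly adopts the smallest label among its neighbours; at the fixed point, vertices with equal labels are in the same connected component.

\begin{lstlisting}
/// Toy label propagation for connected components.
/// `edges` is an undirected edge list over vertices 0..n.
fn label_propagation(n: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    // Each vertex starts labelled with its own id.
    let mut labels: Vec<usize> = (0..n).collect();
    loop {
        let mut changed = false;
        for &(u, v) in edges {
            // Propagate the smaller label across the edge, in both directions.
            let m = labels[u].min(labels[v]);
            if labels[u] != m { labels[u] = m; changed = true; }
            if labels[v] != m { labels[v] = m; changed = true; }
        }
        if !changed { break; } // fixed point: labels identify components
    }
    labels
}

fn main() {
    // Two components: {0,1,2} and {3,4}.
    let labels = label_propagation(5, &[(0, 1), (1, 2), (3, 4)]);
    println!("{:?}", labels); // [0, 0, 0, 3, 3]
}
\end{lstlisting}

The point of the comparison in this section is that, for workloads of this shape, a tuned single-threaded pass over the edge list is already a strong baseline.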
@@ -164,7 +164,7 @@ \section*{Clusters versus Laptops}
$2\times$ speedup for PageRank and $10\times$ speedup for label propagation.

\paragraph{Takeaways.} Some thoughts to keep in mind, from the authors:
- \begin{itemize}
+ \begin{itemize}[noitemsep]
\item ``If you are going to use a big data system for yourself, see if it is faster than your laptop.''
\item ``If you are going to build a big data system for others, see that it is faster than my laptop.''
\end{itemize}
@@ -173,7 +173,7 @@ \section*{Clusters versus Laptops}

\section*{Movie Hour}
Let's take a humorous look at cloud computing: James Mickens' session from Monitorama PDX 2014.

+ \vspace*{-1.5em}
\begin{center}
\url{https://vimeo.com/95066828}
\end{center}
