add visualization figures and tools for chapter 4
Zishen Wan authored and committed on Jan 5, 2025
1 parent 8f29053 commit e57b3b5
Showing 4 changed files with 9 additions and 3 deletions.
12 changes: 9 additions & 3 deletions contents/core/dnn_architectures/dnn_architectures.qmd
@@ -193,7 +193,7 @@ Here, $(i,j)$ represents spatial positions, $k$ indexes output channels, $c$ ind

For a concrete example, consider our MNIST digit classification task with 28×28 grayscale images. Each convolutional layer applies a set of filters (say 3×3) that slide across the image, computing local weighted sums. If we use 32 filters and pad the borders to preserve the spatial dimensions, the layer produces a 28×28×32 output, where each spatial position contains 32 different feature measurements of its local neighborhood. This stands in stark contrast to our MLP approach, where we flattened the entire image into a single 784-dimensional vector.
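
As a concrete sketch of this computation (illustrative only: the shapes, zero-padding, and random filter values below are assumptions, not the chapter's reference code), the following NumPy snippet slides 32 random 3×3 filters over a padded 28×28 image and produces the 28×28×32 feature map described above:

```python
import numpy as np

# Assumed setup for illustration: one 28x28 grayscale image and 32 filters of size 3x3.
image = np.random.rand(28, 28)
filters = np.random.rand(32, 3, 3)

# Zero-pad by one pixel on each side so the output keeps the 28x28 spatial size.
padded = np.pad(image, pad_width=1)

output = np.zeros((28, 28, 32))
for k in range(32):          # each output channel (one per filter)
    for i in range(28):      # each spatial row
        for j in range(28):  # each spatial column
            # Local weighted sum over the 3x3 neighborhood centered at (i, j).
            output[i, j, k] = np.sum(padded[i:i + 3, j:j + 3] * filters[k])

print(output.shape)  # (28, 28, 32)
```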

This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design.
This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design. [CNN Explainer](https://poloclub.github.io/cnn-explainer/) offers a visual representation of the structure of various convolutional neural networks.

::: {.content-visible when-format="html"}
![Convolution operation, image data (blue) and 3x3 filter (green). Source: V. Dumoulin, F. Visin, MIT](images/gif/cnn.gif){#fig-cnn}
@@ -313,6 +313,8 @@ For example, in processing a sequence of words, each word might be represented a

This recurrent structure directly implements our requirements for sequential processing through the introduction of recurrent connections, which maintain internal state and allow the network to carry information forward in time. Instead of processing all inputs independently, RNNs process sequences of data by iteratively updating a hidden state based on the current input and the previous hidden state. This makes RNNs well-suited for tasks such as language modeling, speech recognition, and time-series forecasting.

![RNN architecture. Source: A. Amidi, S. Amidi, Stanford](images/png/rnn.png){#fig-rnn}
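
To make the recurrent update concrete, here is a minimal NumPy sketch of a vanilla RNN cell; the dimensions, tanh nonlinearity, and random weights are illustrative assumptions rather than the chapter's reference implementation:

```python
import numpy as np

input_size, hidden_size, seq_len = 100, 64, 10   # assumed sizes for illustration

# Randomly initialized weights: input-to-hidden, hidden-to-hidden, and a bias.
W_xh = np.random.randn(hidden_size, input_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
b_h = np.zeros(hidden_size)

inputs = np.random.randn(seq_len, input_size)    # one word vector per time step
h = np.zeros(hidden_size)                        # initial hidden state

for x_t in inputs:
    # The same weights are reused at every step; only the hidden state changes,
    # carrying information from earlier inputs forward in time.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)  # (64,) -- a summary of the sequence processed so far
```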

### Computational Mapping

The sequential structure of RNNs maps to computational patterns quite different from both MLPs and CNNs. Let's examine how this mapping progresses from mathematical abstraction to computational reality.
@@ -413,6 +415,8 @@ In this equation, Q (queries), K (keys), and V (values) represent learned projec

The attention operation involves several key steps. First, it computes query, key, and value projections for each position in the sequence. Next, it generates an N×N attention matrix through query-key interactions. Finally, it uses these attention weights to combine value vectors, producing the output. Unlike the fixed weight matrices found in previous architectures, these attention weights are computed dynamically for each input, allowing the model to adapt its processing based on the content at hand.

![(left) Scaled dot-product attention. (right) Multi-head attention consists of several attention layers running in parallel. Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)](images/png/attention.png){#fig-attention}
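
The NumPy sketch below walks through those steps for a single attention head; the sequence length, widths, and random projection matrices are assumptions chosen purely for illustration:

```python
import numpy as np

N, d_model, d_k = 8, 32, 32              # assumed sequence length and dimensions

X = np.random.randn(N, d_model)          # one embedding per sequence position

# Learned projection matrices (random here) produce queries, keys, and values.
W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# N x N attention matrix from query-key interactions, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

# The attention weights combine the value vectors into the output.
output = weights @ V
print(weights.shape, output.shape)  # (8, 8) (8, 32)
```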

#### Computational Mapping

The dynamic structure of attention operations maps to computational patterns that differ significantly from those of previous architectures. To understand this mapping, let's examine how it progresses from mathematical abstraction to computational reality:
@@ -494,7 +498,9 @@
$$

Here, X is the input sequence, and $W_Q$, $W_K$, and $W_V$ are learned weight matrices for queries, keys, and values respectively. This formulation highlights how self-attention derives all its components from the same input, creating a dynamic, content-dependent processing pattern.

The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures.
The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures. [Transformer Explainer](https://poloclub.github.io/transformer-explainer/) offers a visual representation of the structure of Transformer models.

![The Transformer model architecture. Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)](images/png/transformer.png){#fig-transformer}
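
As a rough sketch of how these pieces fit together, the following single-head, NumPy-only encoder block combines self-attention, a position-wise feed-forward layer, residual connections, and layer normalization (dimensions and initialization are assumptions; real implementations add multi-head projections, learned biases, and dropout):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product attention over all positions in parallel.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def transformer_block(X, p):
    # Self-attention sublayer with residual connection and layer normalization.
    X = layer_norm(X + self_attention(X, p["W_Q"], p["W_K"], p["W_V"]))
    # Position-wise feed-forward sublayer, also wrapped in residual + norm.
    hidden = np.maximum(0, X @ p["W_1"])     # ReLU
    return layer_norm(X + hidden @ p["W_2"])

N, d_model, d_ff = 8, 32, 64                 # assumed sizes for illustration
params = {name: np.random.randn(*shape) * 0.1
          for name, shape in [("W_Q", (d_model, d_model)),
                              ("W_K", (d_model, d_model)),
                              ("W_V", (d_model, d_model)),
                              ("W_1", (d_model, d_ff)),
                              ("W_2", (d_ff, d_model))]}
X = np.random.randn(N, d_model)
print(transformer_block(X, params).shape)    # (8, 32)
```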

#### Computational Mapping

@@ -732,7 +738,7 @@ The data movement primitives have particularly influenced the design of intercon
+-----------------------+---------------------------+--------------------------+----------------------------+
| Dynamic Computation | Flexible routing | Dynamic graph execution | Load balancing |
+-----------------------+---------------------------+--------------------------+----------------------------+
| Sequential Access | Burst mode DRAM | Contiguous allocation | |
| Sequential Access | Burst mode DRAM | Contiguous allocation | Access latency |
+-----------------------+---------------------------+--------------------------+----------------------------+
| Random Access | Large caches | Memory-aware scheduling | Cache misses |
+-----------------------+---------------------------+--------------------------+----------------------------+
(The remaining three changed files could not be displayed.)
