add visualization figures and tools for chapter 4
Zishen Wan authored and committed on Jan 5, 2025
1 parent 8f29053 commit e57b3b5
Showing 4 changed files with 9 additions and 3 deletions.
12 changes: 9 additions & 3 deletions contents/core/dnn_architectures/dnn_architectures.qmd
@@ -193,7 +193,7 @@ Here, $(i,j)$ represents spatial positions, $k$ indexes output channels, $c$ ind

For a concrete example, consider our MNIST digit classification task with 28×28 grayscale images. Each convolutional layer applies a set of filters (say 3×3) that slide across the image, computing local weighted sums. If we use 32 filters and pad the borders to preserve the spatial dimensions, the layer produces a 28×28×32 output, where each spatial position contains 32 different feature measurements of its local neighborhood. This stands in stark contrast to our MLP approach, where we flattened the entire image into a single 784-dimensional vector.
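
As a concrete sketch of this computation (illustrative only: the shapes, zero-padding, and random filter values below are assumptions, not the chapter's reference code), the following NumPy snippet slides 32 random 3×3 filters over a padded 28×28 image and produces the 28×28×32 feature map described above:

```python
import numpy as np

# Assumed setup for illustration: one 28x28 grayscale image and 32 filters of size 3x3.
image = np.random.rand(28, 28)
filters = np.random.rand(32, 3, 3)

# Zero-pad by one pixel on each side so the output keeps the 28x28 spatial size.
padded = np.pad(image, pad_width=1)

output = np.zeros((28, 28, 32))
for k in range(32):          # each output channel (one per filter)
    for i in range(28):      # each spatial row
        for j in range(28):  # each spatial column
            # Local weighted sum over the 3x3 neighborhood centered at (i, j).
            output[i, j, k] = np.sum(padded[i:i + 3, j:j + 3] * filters[k])

print(output.shape)  # (28, 28, 32)
```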

This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design.
This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design. [CNN Explainer](https://poloclub.github.io/cnn-explainer/) offers a visual representation of the structure of various convolutional neural networks.

::: {.content-visible when-format="html"}
![Convolution operation, image data (blue) and 3x3 filter (green). Source: V. Dumoulin, F. Visin, MIT](images/gif/cnn.gif){#fig-cnn}
@@ -313,6 +313,8 @@ For example, in processing a sequence of words, each word might be represented a

This recurrent structure directly implements our requirements for sequential processing through the introduction of recurrent connections, which maintain internal state and allow the network to carry information forward in time. Instead of processing all inputs independently, RNNs process sequences of data by iteratively updating a hidden state based on the current input and the previous hidden state. This makes RNNs well-suited for tasks such as language modeling, speech recognition, and time-series forecasting.

![RNN architecture. Source: A. Amidi, S. Amidi, Stanford](images/png/rnn.png){#fig-rnn}
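
To make the recurrent update concrete, here is a minimal NumPy sketch of a vanilla RNN cell; the dimensions, tanh nonlinearity, and random weights are illustrative assumptions rather than the chapter's reference implementation:

```python
import numpy as np

input_size, hidden_size, seq_len = 100, 64, 10   # assumed sizes for illustration

# Randomly initialized weights: input-to-hidden, hidden-to-hidden, and a bias.
W_xh = np.random.randn(hidden_size, input_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
b_h = np.zeros(hidden_size)

inputs = np.random.randn(seq_len, input_size)    # one word vector per time step
h = np.zeros(hidden_size)                        # initial hidden state

for x_t in inputs:
    # The same weights are reused at every step; only the hidden state changes,
    # carrying information from earlier inputs forward in time.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)  # (64,) -- a summary of the sequence processed so far
```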

### Computational Mapping

The sequential structure of RNNs maps to computational patterns quite different from both MLPs and CNNs. Let's examine how this mapping progresses from mathematical abstraction to computational reality.
@@ -413,6 +415,8 @@ In this equation, Q (queries), K (keys), and V (values) represent learned projec

The attention operation involves several key steps. First, it computes query, key, and value projections for each position in the sequence. Next, it generates an N×N attention matrix through query-key interactions. Finally, it uses these attention weights to combine value vectors, producing the output. Unlike the fixed weight matrices found in previous architectures, these attention weights are computed dynamically for each input, allowing the model to adapt its processing based on the content at hand.

![(left) Scaled dot-product attention. (right) Multi-head attention consists of several attention layers running in parallel. Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)](images/png/attention.png){#fig-attention}
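
The NumPy sketch below walks through those steps for a single attention head; the sequence length, widths, and random projection matrices are assumptions chosen purely for illustration:

```python
import numpy as np

N, d_model, d_k = 8, 32, 32              # assumed sequence length and dimensions

X = np.random.randn(N, d_model)          # one embedding per sequence position

# Learned projection matrices (random here) produce queries, keys, and values.
W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# N x N attention matrix from query-key interactions, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

# The attention weights combine the value vectors into the output.
output = weights @ V
print(weights.shape, output.shape)  # (8, 8) (8, 32)
```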

#### Computational Mapping

The dynamic structure of attention operations maps to computational patterns that differ significantly from those of previous architectures. To understand this mapping, let's examine how it progresses from mathematical abstraction to computational reality:
@@ -494,7 +498,9 @@
$$

Here, X is the input sequence, and $W_Q$, $W_K$, and $W_V$ are learned weight matrices for queries, keys, and values respectively. This formulation highlights how self-attention derives all its components from the same input, creating a dynamic, content-dependent processing pattern.

The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures.
The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections. This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures. [Transformer Explainer](https://poloclub.github.io/transformer-explainer/) offers a visual representation of the structure of Transformer models.

![The Transformer model architecture. Source: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)](images/png/transformer.png){#fig-transformer}
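
As a rough sketch of how these pieces fit together, the following single-head, NumPy-only encoder block combines self-attention, a position-wise feed-forward layer, residual connections, and layer normalization (dimensions and initialization are assumptions; real implementations add multi-head projections, learned biases, and dropout):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product attention over all positions in parallel.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def transformer_block(X, p):
    # Self-attention sublayer with residual connection and layer normalization.
    X = layer_norm(X + self_attention(X, p["W_Q"], p["W_K"], p["W_V"]))
    # Position-wise feed-forward sublayer, also wrapped in residual + norm.
    hidden = np.maximum(0, X @ p["W_1"])     # ReLU
    return layer_norm(X + hidden @ p["W_2"])

N, d_model, d_ff = 8, 32, 64                 # assumed sizes for illustration
params = {name: np.random.randn(*shape) * 0.1
          for name, shape in [("W_Q", (d_model, d_model)),
                              ("W_K", (d_model, d_model)),
                              ("W_V", (d_model, d_model)),
                              ("W_1", (d_model, d_ff)),
                              ("W_2", (d_ff, d_model))]}
X = np.random.randn(N, d_model)
print(transformer_block(X, params).shape)    # (8, 32)
```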

#### Computational Mapping

@@ -732,7 +738,7 @@ The data movement primitives have particularly influenced the design of intercon
+-----------------------+---------------------------+--------------------------+----------------------------+
| Dynamic Computation | Flexible routing | Dynamic graph execution | Load balancing |
+-----------------------+---------------------------+--------------------------+----------------------------+
| Sequential Access | Burst mode DRAM | Contiguous allocation | |
| Sequential Access | Burst mode DRAM | Contiguous allocation | Access latency |
+-----------------------+---------------------------+--------------------------+----------------------------+
| Random Access | Large caches | Memory-aware scheduling | Cache misses |
+-----------------------+---------------------------+--------------------------+----------------------------+
(The remaining three changed files could not be displayed.)
