From 40381bcb70ef71ad2fc195a5c1fd48ed0977a640 Mon Sep 17 00:00:00 2001 From: wizardforcel <562826179@qq.com> Date: Thu, 8 Feb 2024 19:10:21 +0800 Subject: [PATCH] 2024-02-08 19:10:19 --- totrans/gen-dl_12.yaml | 17 +++ totrans/gen-dl_13.yaml | 335 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 352 insertions(+) diff --git a/totrans/gen-dl_12.yaml b/totrans/gen-dl_12.yaml index 6b273c1..f8d2f66 100644 --- a/totrans/gen-dl_12.yaml +++ b/totrans/gen-dl_12.yaml @@ -1,51 +1,68 @@ - en: Part III. Applications + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 第三部分. 应用 - en: In [Part III](#part_applications), we will explore some of the key applications of the generative modeling techniques that we have seen so far, across images, text, music, and games. We will also see how these domains can be traversed using state-of-the-art multimodal models. + id: totrans-1 prefs: [] type: TYPE_NORMAL + zh: 在第三部分中,我们将探索迄今为止所见的生成建模技术在图像、文本、音乐和游戏等领域的一些关键应用。我们还将看到如何使用最先进的多模态模型穿越这些领域。 - en: In [Chapter 9](ch09.xhtml#chapter_transformer) we shall turn our attention to Transformers, a start-of-the-art architecture that powers most modern-day text generation models. In particular, we shall explore the inner workings of GPT and build our own version using Keras, and we’ll see how it forms the foundation of tools such as ChatGPT. + id: totrans-2 prefs: [] type: TYPE_NORMAL + zh: 在第9章中,我们将把注意力转向Transformers,这是一种现代文本生成模型的先进架构。特别是,我们将探索GPT的内部工作原理,并使用Keras构建我们自己的版本,我们将看到它如何构建了诸如ChatGPT之类的工具的基础。 - en: In [Chapter 10](ch10.xhtml#chapter_image_generation) we will look at some of the most important GAN architectures that have influenced image generation, including ProGAN, StyleGAN, StyleGAN2, SAGAN, BigGAN, VQ-GAN, and ViT VQ-GAN. We shall explore the key contributions of each and look to understand how the technique has evolved over time. + id: totrans-3 prefs: [] type: TYPE_NORMAL + zh: 在第10章中,我们将看一些对图像生成产生影响的最重要的GAN架构,包括ProGAN、StyleGAN、StyleGAN2、SAGAN、BigGAN、VQ-GAN和ViT + VQ-GAN。我们将探索每个架构的关键贡献,并了解这种技术如何随着时间的推移而发展。 - en: '[Chapter 11](ch11.xhtml#chapter_music) looks at music generation, which presents additional challenges such as modeling musical pitch and rhythm. We’ll see that many of the techniques that work for text generation (such as Transformers) can also be applied in this domain, but we’ll also explore a deep learning architecture known as MuseGAN that applies a GAN-based approach to generating music.' + id: totrans-4 prefs: [] type: TYPE_NORMAL + zh: 第11章探讨音乐生成,这带来了额外的挑战,比如对音乐音高和节奏进行建模。我们将看到许多适用于文本生成的技术(如Transformers)也可以应用于这个领域,但我们还将探索一种称为MuseGAN的深度学习架构,该架构应用了基于GAN的方法来生成音乐。 - en: '[Chapter 12](ch12.xhtml#chapter_world_models) shows how generative models can be used within other machine learning domains, such as reinforcement learning. We will focus on the “World Models” paper, which shows how a generative model can be used as the environment in which the agent trains, allowing it to train within a hallucinated dream version of the environment rather than the real thing.' + id: totrans-5 prefs: [] type: TYPE_NORMAL + zh: 第12章展示了生成模型如何在其他机器学习领域中使用,比如强化学习。我们将重点关注“世界模型”论文,该论文展示了如何将生成模型用作代理训练的环境,使其能够在幻想的梦境版本的环境中进行训练,而不是真实环境。 - en: In [Chapter 13](ch13.xhtml#chapter_multimodal) we will explore state-of-the-art multimodal models that cross over domains such as images and text. This includes text-to-image models such as DALL.E 2, Imagen, and Stable Diffusion, as well as visual language models such as Flamingo. 
+ id: totrans-6 prefs: [] type: TYPE_NORMAL + zh: 在第13章中,我们将探索跨越图像和文本等领域的最先进的多模态模型。这包括文本到图像模型,如DALL.E 2、Imagen和Stable Diffusion,以及视觉语言模型,如Flamingo。 - en: Finally, [Chapter 14](ch14.xhtml#chapter_conclusion) summarizes the generative AI journey so far, the current generative AI landscape, and where we may be heading in the future. We will explore how generative AI may change the way we live and work, as well as considering whether it has the potential to unlock deeper forms of artificial intelligence in the years to come. + id: totrans-7 prefs: [] type: TYPE_NORMAL + zh: 最后,在第14章中总结了迄今为止的生成人工智能之旅,当前的生成人工智能格局,以及我们未来可能走向何方。我们将探讨生成人工智能如何改变我们的生活和工作方式,以及考虑它是否有潜力在未来几年解锁更深层次的人工智能形式。 diff --git a/totrans/gen-dl_13.yaml b/totrans/gen-dl_13.yaml index c9c7b37..f519abe 100644 --- a/totrans/gen-dl_13.yaml +++ b/totrans/gen-dl_13.yaml @@ -1,4 +1,5 @@ - en: Chapter 9\. Transformers + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL @@ -10,15 +11,18 @@ applying a dense layer and softmax activation over the hidden vector. This was considered the most sophisticated way to generatively produce text until 2017, when one paper changed the landscape of text generation forever. + id: totrans-1 prefs: [] type: TYPE_NORMAL - en: Introduction + id: totrans-2 prefs: - PREF_H1 type: TYPE_NORMAL - en: The Google Brain paper, confidently entitled “Attention Is All You Need,”^([1](ch09.xhtml#idm45387006840576)) is famous for popularizing the concept of *attention*—a mechanism that now powers most state-of-the-art text generation models. + id: totrans-3 prefs: [] type: TYPE_NORMAL - en: The authors show how it is possible to create powerful neural networks called @@ -27,6 +31,7 @@ approach overcomes a key downside to the RNN approach, which is that it is challenging to parallelize, as it must process sequences one token as a time. Transformers are highly paralellizable, allowing them to be trained on massive datasets. + id: totrans-4 prefs: [] type: TYPE_NORMAL - en: In this chapter, we are going to delve into how modern text generation models @@ -34,20 +39,24 @@ on text generation challenges. In particular, we will explore a type of autoregressive model known as the *generative pre-trained transformer* (GPT), which powers OpenAI’s GPT-4 model, widely considered to be the current state of the art for text generation. + id: totrans-5 prefs: [] type: TYPE_NORMAL - en: GPT + id: totrans-6 prefs: - PREF_H1 type: TYPE_NORMAL - en: OpenAI introduced GPT in June 2018, in the paper “Improving Language Understanding by Generative Pre-Training,”^([2](ch09.xhtml#idm45387006828736)) almost exactly a year after the appearance of the original Transformer paper. + id: totrans-7 prefs: [] type: TYPE_NORMAL - en: In this paper, the authors show how a Transformer architecture can be trained on a huge amount of text data to predict the next word in a sequence and then subsequently fine-tuned to specific downstream tasks. + id: totrans-8 prefs: [] type: TYPE_NORMAL - en: The pre-training process of GPT involves training the model on a large corpus @@ -56,12 +65,14 @@ a sequence given the previous words. This process is known as *language modeling* and is used to teach the model to understand the structure and patterns of natural language. + id: totrans-9 prefs: [] type: TYPE_NORMAL - en: After pre-training, the GPT model can be fine-tuned for a specific task by providing it with a smaller, task-specific dataset. Fine-tuning involves adjusting the parameters of the model to better fit the task at hand. 
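In Keras terms, this fine-tuning pattern can be sketched roughly as follows: reuse the pre-trained Transformer body and train a small task-specific head on top of it. Here `pretrained_body`, the two-class head, and the hyperparameters are illustrative placeholders rather than code from the paper or from this book.

```python
from tensorflow.keras import layers, models, optimizers

# `pretrained_body` is assumed to be the pre-trained Transformer stack
# (embeddings plus Transformer blocks), with its next-word prediction head removed.
inputs = layers.Input(shape=(None,), dtype="int32")
hidden = pretrained_body(inputs)                  # (batch, seq_len, embed_dim)
pooled = layers.GlobalAveragePooling1D()(hidden)  # summarize the whole sequence
outputs = layers.Dense(2, activation="softmax")(pooled)  # e.g., a two-class task head

classifier = models.Model(inputs, outputs)
classifier.compile(
    optimizer=optimizers.Adam(1e-5),  # small learning rate: adjust, don't overwrite
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# classifier.fit(task_inputs, task_labels, epochs=3)  # small, task-specific dataset
```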
For example, the model can be fine-tuned for tasks such as classification, similarity scoring, or question answering. + id: totrans-10 prefs: [] type: TYPE_NORMAL - en: The GPT architecture has since been improved and extended by OpenAI with the @@ -70,96 +81,132 @@ generate more complex and coherent text. The GPT models have been widely adopted by researchers and industry practitioners and have contributed to significant advancements in natural language processing tasks. + id: totrans-11 prefs: [] type: TYPE_NORMAL - en: In this chapter, we will build our own variation of the original GPT model, trained on less data, but still utilizing the same components and underlying principles. + id: totrans-12 prefs: [] type: TYPE_NORMAL - en: Running the Code for This Example + id: totrans-13 prefs: - PREF_H1 type: TYPE_NORMAL - en: The code for this example can be found in the Jupyter notebook located at *notebooks/09_transformer/01_gpt/gpt.ipynb* in the book repository. + id: totrans-14 prefs: [] type: TYPE_NORMAL - en: The code is adapted from the excellent [GPT tutorial](https://oreil.ly/J86pg) created by Apoorv Nandan available on the Keras website. + id: totrans-15 prefs: [] type: TYPE_NORMAL - en: The Wine Reviews Dataset + id: totrans-16 prefs: - PREF_H2 type: TYPE_NORMAL - en: We’ll be using the [Wine Reviews dataset](https://oreil.ly/DC9EG) that is available through Kaggle. This is a set of over 130,000 reviews of wines, with accompanying metadata such as description and price. + id: totrans-17 prefs: [] type: TYPE_NORMAL + zh: 我们将使用通过Kaggle提供的[Wine Reviews数据集](https://oreil.ly/DC9EG)。这是一个包含超过130,000条葡萄酒评论的数据集,附带元数据,如描述和价格。 - en: You can download the dataset by running the Kaggle dataset downloader script in the book repository, as shown in [Example 9-1](#downloading-wine-dataset). This will save the wine reviews and accompanying metadata locally to the */data* folder. + id: totrans-18 prefs: [] type: TYPE_NORMAL + zh: 您可以通过在书库中运行Kaggle数据集下载脚本来下载数据集,如[示例9-1](#downloading-wine-dataset)所示。这将把葡萄酒评论和相关元数据保存在本地的*/data*文件夹中。 - en: Example 9-1\. Downloading the Wine Reviews dataset + id: totrans-19 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 示例9-1\. 下载葡萄酒评论数据集 - en: '[PRE0]' + id: totrans-20 prefs: [] type: TYPE_PRE + zh: '[PRE0]' - en: '`The data preparation steps are identical to the steps used in [Chapter 5](ch05.xhtml#chapter_autoregressive) for preparing data for input into an LSTM, so we will not repeat them in detail here. The steps, as shown in [Figure 9-1](#transformer_data_prep), are as follows:' + id: totrans-21 prefs: [] type: TYPE_NORMAL + zh: '`数据准备步骤与[第5章](ch05.xhtml#chapter_autoregressive)中用于准备输入到LSTM的数据的步骤是相同的,因此我们不会在这里详细重复它们。如[图9-1](#transformer_data_prep)所示,步骤如下:' - en: Load the data and create a list of text string descriptions of each wine. + id: totrans-22 prefs: - PREF_OL type: TYPE_NORMAL + zh: 加载数据并创建每种葡萄酒的文本字符串描述列表。 - en: Pad punctuation with spaces, so that each punctuation mark is treated as a separate word. + id: totrans-23 prefs: - PREF_OL type: TYPE_NORMAL + zh: 用空格填充标点符号,以便每个标点符号被视为一个单独的单词。 - en: Pass the strings through a `TextVectorization` layer that tokenizes the data and pads/clips each string to a fixed length. + id: totrans-24 prefs: - PREF_OL type: TYPE_NORMAL + zh: 通过`TextVectorization`层将字符串传递,对数据进行标记化,并将每个字符串填充/裁剪到固定长度。 - en: Create a training set where the inputs are the tokenized text strings and the outputs to predict are the same strings shifted by one token. 
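A rough sketch of these four steps is shown below. This is not the book's exact listing — the constants and the `wine_descriptions` list of review strings are illustrative — but it follows the same pattern as the Keras GPT tutorial that this example adapts.

```python
import string
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # illustrative hyperparameters
MAX_LEN = 80
BATCH_SIZE = 32

def pad_punctuation(s):
    # Surround each punctuation mark with spaces so it becomes its own token
    s = tf.strings.regex_replace(s, f"([{string.punctuation}])", r" \1 ")
    return tf.strings.regex_replace(s, " +", " ")

# `wine_descriptions` is assumed to be the list of review strings loaded in step 1
text_ds = (
    tf.data.Dataset.from_tensor_slices(wine_descriptions)
    .batch(BATCH_SIZE)
    .shuffle(1000)
    .map(pad_punctuation)
)

vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,  # one extra token so we can shift by one
)
vectorize_layer.adapt(text_ds)

def prepare_inputs(text):
    tokens = vectorize_layer(tf.expand_dims(text, -1))
    return tokens[:, :-1], tokens[:, 1:]  # inputs, and the same strings shifted by one token

train_ds = text_ds.map(prepare_inputs)
```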
+ id: totrans-25 prefs: - PREF_OL type: TYPE_NORMAL + zh: 创建一个训练集,其中输入是标记化的文本字符串,输出是预测的相同字符串向后移动一个标记。 - en: '![](Images/gdl2_0901.png)' + id: totrans-26 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0901.png)' - en: Figure 9-1\. Data processing for the Transformer` `## Attention + id: totrans-27 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-1\. Transformer的数据处理` `## 注意力 - en: The first step to understanding how GPT works is to understand how the *attention mechanism* works. This mechanism is what makes the Transformer architecture unique and distinct from recurrent approaches to language modeling. When we have developed a solid understanding of attention, we will then see how it is used within Transformer architectures such as GPT. + id: totrans-28 prefs: [] type: TYPE_NORMAL + zh: 了解GPT如何工作的第一步是了解*注意力机制*的工作原理。这个机制是使Transformer架构与循环方法在语言建模方面独特和不同的地方。当我们对注意力有了扎实的理解后,我们将看到它如何在GPT等Transformer架构中使用。 - en: 'When you write, the choice that you make for the next word in the sentence is influenced by other words that you have already written. For example, suppose you start a sentence as follows:' + id: totrans-29 prefs: [] type: TYPE_NORMAL + zh: 当您写作时,句子中下一个词的选择受到您已经写过的其他单词的影响。例如,假设您开始一个句子如下: - en: '[PRE1]' + id: totrans-30 prefs: [] type: TYPE_PRE + zh: '[PRE1]' - en: Clearly, the next word should be something synonymous with *big*. How do we know this? + id: totrans-31 prefs: [] type: TYPE_NORMAL + zh: 显然,下一个词应该是与*big*同义的。我们怎么知道这一点? - en: Certain other words in the sentence are important for helping us to make our decision. For example, the fact that it is an elephant, rather than a sloth, means that we prefer *big* rather than *slow*. If it were a swimming pool, rather than @@ -167,26 +214,35 @@ action of *getting into* the car implies that size is the problem—if the elephant was trying to *squash* the car instead, we might choose *fast* as the final word, with *it* now referring to the car. + id: totrans-32 prefs: [] type: TYPE_NORMAL + zh: 句子中的某些其他单词对帮助我们做出决定很重要。例如,它是大象而不是树懒,意味着我们更喜欢*big*而不是*slow*。如果它是游泳池而不是汽车,我们可能会选择*scared*作为*big*的一个可能替代。最后,*getting + into*汽车的行为意味着大小是问题所在——如果大象试图*压扁*汽车,我们可能会选择*fast*作为最后一个词,现在*it*指的是汽车。 - en: Other words in the sentence are not important at all. For example, the fact that the elephant is pink has no influence on our choice of final word. Equally, the minor words in the sentence (*the*, *but*, *it*, etc.) give the sentence grammatical form, but here aren’t important to determine the required adjective. + id: totrans-33 prefs: [] type: TYPE_NORMAL + zh: 句子中的其他单词一点都不重要。例如,大象是粉红色这个事实对我们选择最终词汇没有影响。同样,句子中的次要单词(*the*、*but*、*it*等)给句子以语法形式,但在这里并不重要,以确定所需形容词。 - en: In other words, we are *paying attention* to certain words in the sentence and largely ignoring others. Wouldn’t it be great if our model could do the same thing? + id: totrans-34 prefs: [] type: TYPE_NORMAL + zh: 换句话说,我们正在*关注*句子中的某些单词,而基本上忽略其他单词。如果我们的模型也能做同样的事情,那不是很好吗? - en: An attention mechanism (also know as an *attention head*) in a Transformer is designed to do exactly this. It is able to decide where in the input it wants to pull information from, in order to efficiently extract useful information without being clouded by irrelevant details. This makes it highly adaptable to a range of circumstances, as it can decide where it wants to look for information at inference time. 
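Before we walk through the terminology in detail, here is a minimal NumPy sketch of the core computation that the next sections unpack: a query is compared against a set of keys, and the resulting weights are used to blend the corresponding values. All of the names and sizes here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k, d_v, seq_len = 4, 3, 5  # illustrative dimensions
rng = np.random.default_rng(0)

query = rng.normal(size=(d_k,))            # "what am I looking for?"
keys = rng.normal(size=(seq_len, d_k))     # one key per word in the sentence
values = rng.normal(size=(seq_len, d_v))   # one value per word in the sentence

# Compare the query with every key, scale, and normalize into attention weights
weights = softmax(keys @ query / np.sqrt(d_k))

# The output is a weighted blend of the values
context = weights @ values
print(weights.round(2), context.round(2))
```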
+ id: totrans-35 prefs: [] type: TYPE_NORMAL + zh: Transformer中的注意力机制(也称为*注意力头*)旨在做到这一点。它能够决定从输入的哪个位置提取信息,以有效地提取有用信息而不被无关细节混淆。这使得它非常适应各种情况,因为它可以在推断时决定在哪里寻找信息。 - en: In contrast, a recurrent layer tries to build up a generic hidden state that captures an overall representation of the input at each timestep. A weakness of this approach is that many of the words that have already been incorporated into @@ -194,15 +250,20 @@ (e.g., predicting the next word), as we have just seen. Attention heads do not suffer from this problem, because they can pick and choose how to combine information from nearby words, depending on the context. + id: totrans-36 prefs: [] type: TYPE_NORMAL + zh: 相比之下,循环层试图建立一个捕捉每个时间步输入的整体表示的通用隐藏状态。这种方法的一个弱点是,已经合并到隐藏向量中的许多单词对当前任务(例如,预测下一个单词)并不直接相关,正如我们刚刚看到的。注意力头不会遇到这个问题,因为它们可以选择如何从附近的单词中组合信息,具体取决于上下文。 - en: Queries, Keys, and Values + id: totrans-37 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 查询、键和值 - en: So how does an attention head decide where it wants to look for information? Before we get into the details, let’s explore how it works at a high level, using our *pink elephant* example. + id: totrans-38 prefs: [] type: TYPE_NORMAL - en: Imagine that we want to predict what follows the word *too*. To help with this @@ -211,22 +272,27 @@ that follow *too*. For example, the word *elephant* might confidently contribute that it is more likely to be a word related to size or loudness, whereas the word *was* doesn’t have much to offer to narrow down the possibilities. + id: totrans-39 prefs: [] type: TYPE_NORMAL - en: In other words, we can think of an attention head as a kind of information retrieval system, where a *query* (“What word follows *too*?”) is made into a *key/value* store (other words in the sentence) and the resulting output is a sum of the values, weighted by the *resonance* between the query and each key. + id: totrans-40 prefs: [] type: TYPE_NORMAL - en: We will now walk through the process in detail ([Figure 9-2](#attention_head)), again with reference to our *pink elephant* sentence. + id: totrans-41 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0902.png)' + id: totrans-42 prefs: [] type: TYPE_IMG - en: Figure 9-2\. The mechanics of an attention head + id: totrans-43 prefs: - PREF_H6 type: TYPE_NORMAL @@ -237,6 +303,7 @@ Q to change the dimensionality of the vector from d e to d k . + id: totrans-44 prefs: [] type: TYPE_NORMAL - en: The *key* vectors ( K ) are representations @@ -248,6 +315,7 @@ e to d k . Notice that the keys and the query are the same length ( d k ). + id: totrans-45 prefs: [] type: TYPE_NORMAL - en: Inside the attention head, each key is compared to the query using a dot product @@ -261,6 +329,7 @@ keep the variance of the vector sum stable (approximately equal to 1), and a softmax is applied to ensure the contributions sum to 1\. This is a vector of *attention weights*. + id: totrans-46 prefs: [] type: TYPE_NORMAL - en: The *value* vectors ( V ) are also representations @@ -271,17 +340,23 @@ e to d v . Notice that the value vectors do not necessarily have to have the same length as the keys and query (but often do, for simplicity). + id: totrans-47 prefs: [] type: TYPE_NORMAL - en: The value vectors are multiplied by the attention weights to give the *attention* for a given Q , K , and V , as shown in [Equation 9-1](#attention_equation). + id: totrans-48 prefs: [] type: TYPE_NORMAL + zh: 值向量乘以注意力权重,给出给定QKV的*注意力*,如[方程9-1](#attention_equation)所示。 - en: Equation 9-1\. 
Attention equation
+ id: totrans-49
  prefs:
  - PREF_H5
  type: TYPE_NORMAL
+ zh: 方程9-1。注意力方程
- en: '$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$'
+ id: totrans-50
  prefs: []
  type: TYPE_NORMAL
- en: To obtain the final output vector from the attention head, the attention is
    summed to give a vector of length $d_v$. This *context vector* captures a blended
    opinion from words in the sentence on the task of predicting what word follows
    *too*.
+ id: totrans-51
  prefs: []
  type: TYPE_NORMAL
+ zh: 从注意力头中获取最终输出向量,将注意力求和得到长度为d v的向量。这个*上下文向量*捕捉了句子中单词对于预测接下来的单词是什么的任务的混合意见。
- en: Multihead Attention
+ id: totrans-52
  prefs:
  - PREF_H2
  type: TYPE_NORMAL
+ zh: 多头注意力
- en: There’s no reason to stop at just one attention head! In Keras, we can build
    a `MultiHeadAttention` layer that concatenates the output from multiple attention
    heads, allowing each to learn a distinct attention mechanism so that the layer
    as a whole can learn more complex relationships.
+ id: totrans-53
  prefs: []
  type: TYPE_NORMAL
+ zh: 没有理由只停留在一个注意力头上!在Keras中,我们可以构建一个`MultiHeadAttention`层,将多个注意力头的输出连接起来,使每个头学习不同的注意力机制,从而使整个层能够学习更复杂的关系。
- en: The concatenated outputs are passed through one final weights matrix $W_O$
    to project the vector into the desired output dimension, which in our case is
    the same as the input dimension of the query ($d_e$), so that the layers can
    be stacked sequentially on top of each other.
+ id: totrans-54
  prefs: []
  type: TYPE_NORMAL
+ zh: 连接的输出通过一个最终的权重矩阵W O传递,将向量投影到所需的输出维度,这在我们的情况下与查询的输入维度相同(d e),以便层可以顺序堆叠在一起。
- en: '[Figure 9-3](#multi_attention_layer) shows how the output from a `MultiHeadAttention`
    layer is constructed. In Keras we can simply write the line shown in [Example 9-2](#multihead_attention_keras)
    to create such a layer.'
+ id: totrans-55
  prefs: []
  type: TYPE_NORMAL
+ zh: '[图9-3](#multi_attention_layer)展示了一个`MultiHeadAttention`层的输出是如何构建的。在Keras中,我们可以简单地写下[示例9-2](#multihead_attention_keras)中显示的代码来创建这样一个层。'
- en: Example 9-2\. Creating a `MultiHeadAttention` layer in Keras
+ id: totrans-56
  prefs:
  - PREF_H5
  type: TYPE_NORMAL
+ zh: 示例9-2。在Keras中创建一个`MultiHeadAttention`层
- en: '[PRE2]'
+ id: totrans-57
  prefs: []
  type: TYPE_PRE
+ zh: '[PRE2]'
- en: '[![1](Images/1.png)](#co_transformers_CO1-1)'
+ id: totrans-58
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![1](Images/1.png)](#co_transformers_CO1-1)'
- en: This multihead attention layer has four heads.
+ id: totrans-59
  prefs: []
  type: TYPE_NORMAL
+ zh: 这个多头注意力层有四个头。
- en: '[![2](Images/2.png)](#co_transformers_CO1-2)'
+ id: totrans-60
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![2](Images/2.png)](#co_transformers_CO1-2)'
- en: The keys (and query) are vectors of length 128.
+ id: totrans-61
  prefs: []
  type: TYPE_NORMAL
+ zh: 键(和查询)是长度为128的向量。
- en: '[![3](Images/3.png)](#co_transformers_CO1-3)'
+ id: totrans-62
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![3](Images/3.png)](#co_transformers_CO1-3)'
- en: The values (and therefore also the output from each head) are vectors of length
    64.
+ id: totrans-63
  prefs: []
  type: TYPE_NORMAL
+ zh: 值(因此也是每个头的输出)是长度为64的向量。
- en: '[![4](Images/4.png)](#co_transformers_CO1-4)'
+ id: totrans-64
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![4](Images/4.png)](#co_transformers_CO1-4)'
- en: The output vector has length 256.
+ id: totrans-65
  prefs: []
  type: TYPE_NORMAL
+ zh: 输出向量的长度为256。
- en: '![](Images/gdl2_0903.png)'
+ id: totrans-66
  prefs: []
  type: TYPE_IMG
+ zh: '![](Images/gdl2_0903.png)'
- en: Figure 9-3\.
A multihead attention layer with four heads + id: totrans-67 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-3。一个具有四个头的多头注意力层 - en: Causal Masking + id: totrans-68 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 因果掩码 - en: So far, we have assumed that the query input to our attention head is a single vector. However, for efficiency during training, we would ideally like the attention layer to be able to operate on every word in the input at once, predicting for each what the subsequent word will be. In other words, we want our GPT model to be able to handle a group of query vectors in parallel (i.e., a matrix). + id: totrans-69 prefs: [] type: TYPE_NORMAL + zh: 到目前为止,我们假设我们的注意力头的查询输入是一个单一的向量。然而,在训练期间为了效率,我们理想情况下希望注意力层能够一次操作输入中的每个单词,为每个单词预测接下来的单词。换句话说,我们希望我们的GPT模型能够并行处理一组查询向量(即一个矩阵)。 - en: You might think that we can just batch the vectors together into a matrix and let linear algebra handle the rest. This is true, but we need one extra step—we need to apply a mask to the query/key dot product, to avoid information from future words leaking through. This is known as *causal masking* and is shown in [Figure 9-4](#causal_mask). + id: totrans-70 prefs: [] type: TYPE_NORMAL + zh: 您可能会认为我们可以将向量批量处理成一个矩阵,让线性代数处理剩下的部分。这是正确的,但我们需要一个额外的步骤——我们需要对查询/键的点积应用一个掩码,以避免未来单词的信息泄漏。这被称为*因果掩码*,在[图9-4](#causal_mask)中显示。 - en: '![](Images/gdl2_0904.png)' + id: totrans-71 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0904.png)' - en: Figure 9-4\. Matrix calculation of the attention scores for a batch of input queries, using a causal attention mask to hide keys that are not available to the query (because they come later in the sentence) + id: totrans-72 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-4。对一批输入查询计算注意力分数的矩阵,使用因果注意力掩码隐藏对查询不可用的键(因为它们在句子中后面) - en: Without this mask, our GPT model would be able to perfectly guess the next word in the sentence, because it would be using the key from the word itself as a feature! The code for creating a causal mask is shown in [Example 9-3](#causal_mask_code), and the resulting `numpy` array (transposed to match the diagram) is shown in [Figure 9-5](#causal_mask_numpy). + id: totrans-73 prefs: [] type: TYPE_NORMAL + zh: 如果没有这个掩码,我们的GPT模型将能够完美地猜测句子中的下一个单词,因为它将使用单词本身的键作为特征!创建因果掩码的代码显示在[示例9-3](#causal_mask_code)中,结果的`numpy`数组(转置以匹配图表)显示在[图9-5](#causal_mask_numpy)中。 - en: Example 9-3\. The causal mask function + id: totrans-74 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 示例9-3。因果掩码函数 - en: '[PRE3]' + id: totrans-75 prefs: [] type: TYPE_PRE + zh: '[PRE3]' - en: '![](Images/gdl2_0905.png)' + id: totrans-76 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0905.png)' - en: Figure 9-5\. The causal mask as a `numpy` array—1 means unmasked and 0 means masked + id: totrans-77 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-5。作为`numpy`数组的因果掩码——1表示未掩码,0表示掩码 - en: Tip + id: totrans-78 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 提示 - en: Causal masking is only required in *decoder Transformers* such as GPT, where the task is to sequentially generate tokens given previous tokens. Masking out future tokens during training is therefore essential. + id: totrans-79 prefs: [] type: TYPE_NORMAL + zh: 因果掩码仅在*解码器Transformer*(如GPT)中需要,其中任务是根据先前的标记顺序生成标记。在训练期间屏蔽未来标记因此至关重要。 - en: Other flavors of Transformer (e.g., *encoder Transformers*) do not need causal masking, because they are not trained to predict the next token. 
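In code, applying the mask boils down to adding a very large negative number to the disallowed positions of the query/key dot product before the softmax, so that the corresponding weights collapse to zero — the Keras `MultiHeadAttention` layer does this for us internally when we pass the mask via its `attention_mask` argument. A minimal NumPy sketch of the idea, with illustrative names:

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: (seq_len, seq_len) matrix of query/key dot products
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n)))              # 1 = visible, 0 = future token
    scores = np.where(mask == 1, scores, -1e9)   # hide keys from the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
print(causal_attention_weights(scores).round(2))  # upper triangle is ~0
```

Encoder-style Transformers, by contrast, skip this masking step altogether.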
For example Google’s BERT predicts masked words within a given sentence, so it can use context from both before and after the word in question.^([3](ch09.xhtml#idm45387006370384)) + id: totrans-80 prefs: [] type: TYPE_NORMAL - en: We will explore the different types of Transformers in more detail at the end of the chapter. + id: totrans-81 prefs: [] type: TYPE_NORMAL - en: This concludes our explanation of the multihead attention mechanism that is @@ -439,25 +586,31 @@ to reshape the output ( W O ). There are no convolutions or recurrent mechanisms at all in a multihead attention layer! + id: totrans-82 prefs: [] type: TYPE_NORMAL - en: Next, we shall take a step back and see how the multihead attention layer forms just one part of a larger component known as a *Transformer block*. + id: totrans-83 prefs: [] type: TYPE_NORMAL - en: The Transformer Block + id: totrans-84 prefs: - PREF_H2 type: TYPE_NORMAL - en: A *Transformer block* is a single component within a Transformer that applies some skip connections, feed-forward (dense) layers, and normalization around the multihead attention layer. A diagram of a Transformer block is shown in [Figure 9-6](#transformer_block). + id: totrans-85 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0906.png)' + id: totrans-86 prefs: [] type: TYPE_IMG - en: Figure 9-6\. A Transformer block + id: totrans-87 prefs: - PREF_H6 type: TYPE_NORMAL @@ -467,6 +620,7 @@ not suffer as much from the vanishing gradient problem, because the skip connection provides a gradient-free *highway* that allows the network to transfer information forward uninterrupted. + id: totrans-88 prefs: [] type: TYPE_NORMAL - en: Secondly, *layer normalization* is used in the Transformer block to provide @@ -474,6 +628,7 @@ layer in action throughout this book, where the output from each channel is normalized to have a mean of 0 and standard deviation of 1\. The normalization statistics are calculated across the batch and spatial dimensions. + id: totrans-89 prefs: [] type: TYPE_NORMAL - en: In contrast, layer normalization in a Transformer block normalizes each position @@ -481,17 +636,21 @@ the channels. It is the complete opposite of batch normalization, in terms of how the normalization statistics are calculated. A diagram showing the difference between batch normalization and layer normalization is shown in [Figure 9-7](#layer_norm). + id: totrans-90 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0907.png)' + id: totrans-91 prefs: [] type: TYPE_IMG - en: 'Figure 9-7\. Layer normalization versus batch normalization—the normalization statistics are calculated across the blue cells (source: [Sheng et al., 2020](https://arxiv.org/pdf/2003.07845.pdf))^([4](ch09.xhtml#idm45387006340992))' + id: totrans-92 prefs: - PREF_H6 type: TYPE_NORMAL - en: Layer Normalization Versus Batch Normalization + id: totrans-93 prefs: - PREF_H1 type: TYPE_NORMAL @@ -500,61 +659,80 @@ in the batch. However, recent work such as Shen et al.*s* challenges this assumption, showing that with some tweaks a form of batch normalization can still be used within Transformers, outperforming more traditional layer normalization. + id: totrans-94 prefs: [] type: TYPE_NORMAL - en: Lastly, a set of feed-forward (i.e., densely connected) layers is included in the Transformer block, to allow the component to extract higher-level features as we go deeper into the network. + id: totrans-95 prefs: [] type: TYPE_NORMAL - en: A Keras implementation of a Transformer block is shown in [Example 9-4](#transformer_block_code2). 
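As a rough sketch of how these pieces fit together (not the book's exact listing — the layer sizes are illustrative), such a block can be written as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlockSketch(layers.Layer):
    """Illustrative Transformer block: masked multihead attention wrapped in
    skip connections, layer normalization, and a feed-forward sub-network."""

    def __init__(self, num_heads=4, key_dim=128, embed_dim=256, ff_dim=256):
        super().__init__()
        self.attn = layers.MultiHeadAttention(num_heads, key_dim, output_shape=embed_dim)
        self.ln_1 = layers.LayerNormalization()
        self.ln_2 = layers.LayerNormalization()
        self.ffn_1 = layers.Dense(ff_dim, activation="relu")
        self.ffn_2 = layers.Dense(embed_dim)

    def call(self, inputs):
        seq_len = tf.shape(inputs)[1]
        # Lower-triangular causal mask: each position sees itself and earlier tokens only
        causal_mask = tf.cast(
            tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0), tf.bool
        )
        attn_output = self.attn(inputs, inputs, attention_mask=causal_mask)
        x = self.ln_1(inputs + attn_output)        # first add and normalize
        ffn_output = self.ffn_2(self.ffn_1(x))     # feed-forward layers
        return self.ln_2(x + ffn_output)           # second add and normalize
```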
+ id: totrans-96 prefs: [] type: TYPE_NORMAL - en: Example 9-4\. A `TransformerBlock` layer in Keras + id: totrans-97 prefs: - PREF_H5 type: TYPE_NORMAL - en: '[PRE4]' + id: totrans-98 prefs: [] type: TYPE_PRE + zh: '[PRE4]' - en: '[![1](Images/1.png)](#co_transformers_CO2-1)' + id: totrans-99 prefs: [] type: TYPE_NORMAL - en: The sublayers that make up the `TransformerBlock` layer are defined within the initialization function. + id: totrans-100 prefs: [] type: TYPE_NORMAL - en: '[![2](Images/2.png)](#co_transformers_CO2-2)' + id: totrans-101 prefs: [] type: TYPE_NORMAL - en: The causal mask is created to hide future keys from the query. + id: totrans-102 prefs: [] type: TYPE_NORMAL - en: '[![3](Images/3.png)](#co_transformers_CO2-3)' + id: totrans-103 prefs: [] type: TYPE_NORMAL - en: The multihead attention layer is created, with the attention masks specified. + id: totrans-104 prefs: [] type: TYPE_NORMAL - en: '[![4](Images/4.png)](#co_transformers_CO2-4)' + id: totrans-105 prefs: [] type: TYPE_NORMAL - en: The first *add and normalization* layer. + id: totrans-106 prefs: [] type: TYPE_NORMAL - en: '[![5](Images/5.png)](#co_transformers_CO2-5)' + id: totrans-107 prefs: [] type: TYPE_NORMAL - en: The feed-forward layers. + id: totrans-108 prefs: [] type: TYPE_NORMAL - en: '[![6](Images/6.png)](#co_transformers_CO2-6)' + id: totrans-109 prefs: [] type: TYPE_NORMAL - en: The second *add and normalization* layer. + id: totrans-110 prefs: [] type: TYPE_NORMAL - en: Positional Encoding + id: totrans-111 prefs: - PREF_H2 type: TYPE_NORMAL @@ -565,13 +743,16 @@ recurrent neural network. This is a strength (because of the parallelization efficiency gains) but also a problem, because we clearly need the attention layer to be able to predict different outputs for the following two sentences:' + id: totrans-112 prefs: [] type: TYPE_NORMAL - en: The dog looked at the boy and …​ (barked?) + id: totrans-113 prefs: - PREF_UL type: TYPE_NORMAL - en: The boy looked at the dog and …​ (smiled?) + id: totrans-114 prefs: - PREF_UL type: TYPE_NORMAL @@ -579,66 +760,84 @@ creating the inputs to the initial Transformer block. Instead of only encoding each token using a *token embedding*, we also encode the position of the token, using a *position embedding*. + id: totrans-115 prefs: [] type: TYPE_NORMAL - en: The *token embedding* is created using a standard `Embedding` layer to convert each token into a learned vector. We can create the *positional embedding* in the same way, using a standard `Embedding` layer to convert each integer position into a learned vector. + id: totrans-116 prefs: [] type: TYPE_NORMAL - en: Tip + id: totrans-117 prefs: - PREF_H6 type: TYPE_NORMAL - en: While GPT uses an `Embedding` layer to embed the position, the original Transformer paper used trigonometric functions—we’ll cover this alternative in [Chapter 11](ch11.xhtml#chapter_music), when we explore music generation. + id: totrans-118 prefs: [] type: TYPE_NORMAL - en: To construct the joint token–position encoding, the token embedding is added to the positional embedding, as shown in [Figure 9-8](#positional_enc). This way, the meaning and position of each word in the sequence are captured in a single vector. + id: totrans-119 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0908.png)' + id: totrans-120 prefs: [] type: TYPE_IMG - en: Figure 9-8\. 
The token embeddings are added to the positional embeddings to give the token position encoding + id: totrans-121 prefs: - PREF_H6 type: TYPE_NORMAL - en: The code that defines our `TokenAndPositionEmbedding` layer is shown in [Example 9-5](#positional_embedding_code). + id: totrans-122 prefs: [] type: TYPE_NORMAL - en: Example 9-5\. The `TokenAndPositionEmbedding` layer + id: totrans-123 prefs: - PREF_H5 type: TYPE_NORMAL - en: '[PRE5]' + id: totrans-124 prefs: [] type: TYPE_PRE + zh: '[PRE5]' - en: '[![1](Images/1.png)](#co_transformers_CO3-1)' + id: totrans-125 prefs: [] type: TYPE_NORMAL - en: The tokens are embedded using an `Embedding` layer. + id: totrans-126 prefs: [] type: TYPE_NORMAL - en: '[![2](Images/2.png)](#co_transformers_CO3-2)' + id: totrans-127 prefs: [] type: TYPE_NORMAL - en: The positions of the tokens are also embedded using an `Embedding` layer. + id: totrans-128 prefs: [] type: TYPE_NORMAL - en: '[![3](Images/3.png)](#co_transformers_CO3-3)' + id: totrans-129 prefs: [] type: TYPE_NORMAL - en: The output from the layer is the sum of the token and position embeddings. + id: totrans-130 prefs: [] type: TYPE_NORMAL - en: Training GPT + id: totrans-131 prefs: - PREF_H2 type: TYPE_NORMAL @@ -646,117 +845,163 @@ we need to pass our input text through the token and position embedding layer, then through our Transformer block. The final output of the network is a simple `Dense` layer with softmax activation over the number of words in the vocabulary. + id: totrans-132 prefs: [] type: TYPE_NORMAL - en: Tip + id: totrans-133 prefs: - PREF_H6 type: TYPE_NORMAL - en: For simplicity, we will use just one Transformer block, rather than the 12 in the paper. + id: totrans-134 prefs: [] type: TYPE_NORMAL - en: The overall architecture is shown in [Figure 9-9](#transformer) and the equivalent code is provided in [Example 9-6](#transformer_code). + id: totrans-135 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0909.png)' + id: totrans-136 prefs: [] type: TYPE_IMG - en: Figure 9-9\. The simplified GPT model architecture + id: totrans-137 prefs: - PREF_H6 type: TYPE_NORMAL - en: Example 9-6\. A GPT model in Keras + id: totrans-138 prefs: - PREF_H5 type: TYPE_NORMAL - en: '[PRE6]' + id: totrans-139 prefs: [] type: TYPE_PRE + zh: '[PRE6]' - en: '[![1](Images/1.png)](#co_transformers_CO4-1)' + id: totrans-140 prefs: [] type: TYPE_NORMAL - en: The input is padded (with zeros). + id: totrans-141 prefs: [] type: TYPE_NORMAL - en: '[![2](Images/2.png)](#co_transformers_CO4-2)' + id: totrans-142 prefs: [] type: TYPE_NORMAL - en: The text is encoded using a `TokenAndPositionEmbedding` layer. + id: totrans-143 prefs: [] type: TYPE_NORMAL - en: '[![3](Images/3.png)](#co_transformers_CO4-3)' + id: totrans-144 prefs: [] type: TYPE_NORMAL - en: The encoding is passed through a `TransformerBlock`. + id: totrans-145 prefs: [] type: TYPE_NORMAL - en: '[![4](Images/4.png)](#co_transformers_CO4-4)' + id: totrans-146 prefs: [] type: TYPE_NORMAL - en: The transformed output is passed through a `Dense` layer with softmax activation to predict a distribution over the subsequent word. + id: totrans-147 prefs: [] type: TYPE_NORMAL + zh: 转换后的输出通过具有softmax激活的`Dense`层传递,以预测后续单词的分布。 - en: '[![5](Images/5.png)](#co_transformers_CO4-5)' + id: totrans-148 prefs: [] type: TYPE_NORMAL + zh: '[![5](Images/5.png)](#co_transformers_CO4-5)' - en: The `Model` takes a sequence of word tokens as input and outputs the predicted subsequent word distribution. 
The output from the Transformer block is also returned so that we can inspect how the model is directing its attention. + id: totrans-149 prefs: [] type: TYPE_NORMAL + zh: '`Model`以单词标记序列作为输入,并输出预测的后续单词分布。还返回了Transformer块的输出,以便我们可以检查模型如何引导其注意力。' - en: '[![6](Images/6.png)](#co_transformers_CO4-6)' + id: totrans-150 prefs: [] type: TYPE_NORMAL + zh: '[![6](Images/6.png)](#co_transformers_CO4-6)' - en: The model is compiled with `SparseCategoricalCrossentropy` loss over the predicted word distribution. + id: totrans-151 prefs: [] type: TYPE_NORMAL + zh: 模型使用预测的单词分布上的`SparseCategoricalCrossentropy`损失进行编译。 - en: Analysis of GPT + id: totrans-152 prefs: - PREF_H2 type: TYPE_NORMAL + zh: GPT的分析 - en: Now that we have compiled and trained our GPT model, we can start to use it to generate long strings of text. We can also interrogate the attention weights that are output from the `TransformerBlock`, to understand where the Transformer is looking for information at different points in the generation process. + id: totrans-153 prefs: [] type: TYPE_NORMAL + zh: 现在我们已经编译并训练了我们的GPT模型,我们可以开始使用它生成长文本字符串。我们还可以询问从`TransformerBlock`输出的注意权重,以了解Transformer在生成过程中不同点处寻找信息的位置。 - en: Generating text + id: totrans-154 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 生成文本 - en: 'We can generate new text by applying the following process:' + id: totrans-155 prefs: [] type: TYPE_NORMAL + zh: 我们可以通过以下过程生成新文本: - en: Feed the network with an existing sequence of words and ask it to predict the following word. + id: totrans-156 prefs: - PREF_OL type: TYPE_NORMAL + zh: 将现有单词序列馈送到网络中,并要求它预测接下来的单词。 - en: Append this word to the existing sequence and repeat. + id: totrans-157 prefs: - PREF_OL type: TYPE_NORMAL + zh: 将此单词附加到现有序列并重复。 - en: The network will output a set of probabilities for each word that we can sample from, so we can make the text generation stochastic, rather than deterministic. + id: totrans-158 prefs: [] type: TYPE_NORMAL + zh: 网络将为每个单词输出一组概率,我们可以从中进行抽样,因此我们可以使文本生成具有随机性,而不是确定性。 - en: We will use the same `TextGenerator` class introduced in [Chapter 5](ch05.xhtml#chapter_autoregressive) for LSTM text generation, including the `temperature` parameter that specifies how deterministic we would like the sampling process to be. Let’s take a look at this in action, at two different temperature values ([Figure 9-10](#transformer_examples)). + id: totrans-159 prefs: [] type: TYPE_NORMAL + zh: 我们将使用在[第5章](ch05.xhtml#chapter_autoregressive)中引入的相同`TextGenerator`类进行LSTM文本生成,包括指定采样过程的确定性程度的`temperature`参数。让我们看看这在两个不同的温度值([图9-10](#transformer_examples))下是如何运作的。 - en: '![](Images/gdl2_0910.png)' + id: totrans-160 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0910.png)' - en: Figure 9-10\. Generated outputs at `temperature = 1.0` and `temperature = 0.5`. + id: totrans-161 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-10。在`temperature = 1.0`和`temperature = 0.5`时生成的输出。 - en: There are a few things to note about these two passages. First, both are stylistically similar to a wine review from the original training set. They both open with the region and type of wine, and the wine type stays consistent throughout the passage @@ -765,40 +1010,54 @@ accurate than the example with temperature 0.5\. Generating multiple samples with temperature 1.0 will therefore lead to more variety as the model is sampling from a probability distribution with greater variance. 
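The sampling loop behind these examples can be sketched as follows — a simplified stand-in for the `TextGenerator` callback, assuming (as in the model above) that the network returns both the next-word probabilities and the attention scores, and that token 0 is the padding token:

```python
import numpy as np

def sample_token(probs, temperature=1.0):
    # Low temperature sharpens the distribution; high temperature flattens it
    probs = probs ** (1.0 / temperature)
    probs = probs / np.sum(probs)
    return np.random.choice(len(probs), p=probs)

def generate(model, start_tokens, max_tokens=80, temperature=1.0):
    tokens = list(start_tokens)
    while len(tokens) < max_tokens:
        probs, _ = model.predict(np.array([tokens]), verbose=0)
        next_token = sample_token(probs[0, -1], temperature)
        if next_token == 0:  # assumed padding token marks the end
            break
        tokens.append(next_token)
    return tokens
```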
+ id: totrans-162 prefs: [] type: TYPE_NORMAL + zh: 关于这两段文字有几点需要注意。首先,两者在风格上与原始训练集中的葡萄酒评论相似。它们都以葡萄酒的产地和类型开头,而葡萄酒类型在整个段落中保持一致(例如,它不会在中途更换颜色)。正如我们在[第5章](ch05.xhtml#chapter_autoregressive)中看到的,使用温度为1.0生成的文本更加冒险,因此比温度为0.5的示例不够准确。因此,使用温度为1.0生成多个样本将导致更多的变化,因为模型正在从具有更大方差的概率分布中进行抽样。 - en: Viewing the attention scores + id: totrans-163 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 查看注意力分数 - en: We can also ask the model to tell us how much attention is being placed on each word, when deciding on the next word in the sentence. The `TransformerBlock` outputs the attention weights for each head, which are a softmax distribution over the preceding words in the sentence. + id: totrans-164 prefs: [] type: TYPE_NORMAL + zh: 我们还可以要求模型告诉我们在决定句子中的下一个单词时,每个单词放置了多少注意力。`TransformerBlock`输出每个头的注意权重,这是对句子中前面单词的softmax分布。 - en: To demonstrate this, [Figure 9-11](#attention_probs) shows the top five tokens with the highest probabilities for three different input prompts, as well as the average attention across both heads, against each preceding word. The preceding words are colored according to their attention score, averaged across the two attention heads. Darker blue indicates more attention is being placed on the word. + id: totrans-165 prefs: [] type: TYPE_NORMAL + zh: 为了证明这一点,[图9-11](#attention_probs)显示了三个不同输入提示的前五个具有最高概率的标记,以及两个注意力头的平均注意力,针对每个前面的单词。根据其注意力分数对前面的单词进行着色,两个注意力头的平均值。深蓝色表示对该单词放置更多的注意力。 - en: '![](Images/gdl2_0911.png)' + id: totrans-166 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0911.png)' - en: Figure 9-11\. Distribution of word probabilities following various sequences + id: totrans-167 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-11。各种序列后单词概率分布 - en: In the first example, the model attends closely to the country (*germany*) in order to decide on the word that relates to the region. This makes sense! To pick a region, it needs to take lots of information from the words that relate to the country, to ensure they match. It doesn’t need to pay as much attention to the first two tokens (*wine review*) because they don’t hold any useful information regarding the region. + id: totrans-168 prefs: [] type: TYPE_NORMAL + zh: 在第一个示例中,模型密切关注国家(*德国*),以决定与地区相关的单词。这是有道理的!为了选择一个地区,它需要从与国家相关的单词中获取大量信息,以确保它们匹配。它不需要太关注前两个标记(*葡萄酒评论*),因为它们不包含有关地区的任何有用信息。 - en: In the second example, it needs to refer back to the grape (*riesling*), so it pays attention to the first time that it was mentioned. It can pull this information by directly attending to the word, no matter how far back it is in the sentence @@ -806,6 +1065,7 @@ a recurrent neural network, which relies on a hidden state to maintain all interesting information over the length of the sequence so that it can be drawn upon if required—a much less efficient approach. + id: totrans-169 prefs: [] type: TYPE_NORMAL - en: The final sequence shows an example of how our GPT model can choose an appropriate @@ -814,6 +1074,7 @@ As Riesling is typically a sweet wine, and sugar is already mentioned, it makes sense that it should be described as *slightly sweet* rather than *slightly earthy*, for example. 
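A sketch of how such a view can be produced from our model is shown below, assuming (as above) that the model returns the attention scores alongside the word probabilities and that those scores have shape `(batch, heads, query_position, key_position)` — the prompt and formatting are purely illustrative:

```python
import numpy as np

prompt = "wine review : germany :"                        # illustrative prompt
tokens = vectorize_layer([prompt]).numpy()[0]
n = np.count_nonzero(tokens)                              # ignore the padding tokens

probs, attention = model.predict(tokens[np.newaxis, :], verbose=0)
avg_attention = attention[0, :, n - 1, :n].mean(axis=0)   # average the heads at the last position

vocab = vectorize_layer.get_vocabulary()
for token_id, score in zip(tokens[:n], avg_attention):
    print(f"{vocab[token_id]:>12s}  {score:.2f}")          # attention paid to each preceding word

top_5 = np.argsort(probs[0, n - 1])[::-1][:5]              # five most probable next tokens
print([vocab[i] for i in top_5])
```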
+ id: totrans-170 prefs: [] type: TYPE_NORMAL - en: It is incredibly informative to be able to interrogate the network in this way, @@ -822,6 +1083,7 @@ input prompts to see if you can get the model to attend to words really far back in the sentence, to convince yourself of the power of attention-based models over more traditional recurrent models!` `# Other Transformers + id: totrans-171 prefs: [] type: TYPE_NORMAL - en: Our GPT model is a *decoder Transformer*—it generates a text string one token @@ -832,39 +1094,49 @@ are also *encoder-decoder Transformers* that can translate from one text string to another; this type of model contains both encoder Transformer blocks and decoder Transformer blocks. + id: totrans-172 prefs: [] type: TYPE_NORMAL - en: '[Table 9-1](#transformer_types) summarizes the three types of Transformers, with the best examples of each architecture and typical use cases.' + id: totrans-173 prefs: [] type: TYPE_NORMAL - en: Table 9-1\. The three Transformer architectures + id: totrans-174 prefs: [] type: TYPE_NORMAL - en: '| Type | Examples | Use cases |' + id: totrans-175 prefs: [] type: TYPE_TB - en: '| --- | --- | --- |' + id: totrans-176 prefs: [] type: TYPE_TB - en: '| Encoder | BERT (Google) | Sentence classification, named entity recognition, extractive question answering |' + id: totrans-177 prefs: [] type: TYPE_TB - en: '| Encoder-decoder | T5 (Google) | Summarization, translation, question answering |' + id: totrans-178 prefs: [] type: TYPE_TB - en: '| Decoder | GPT-3 (OpenAI) | Text generation |' + id: totrans-179 prefs: [] type: TYPE_TB - en: A well-known example of an encoder Transformer is the *Bidirectional Encoder Representations from Transformers* (BERT) model, developed by Google (Devlin et al., 2018) that predicts missing words from a sentence, given context from both before and after the missing word in all layers. + id: totrans-180 prefs: [] type: TYPE_NORMAL - en: Encoder Transformers + id: totrans-181 prefs: - PREF_H1 type: TYPE_NORMAL @@ -874,14 +1146,17 @@ so we will not explore them in detail in this book—see Lewis Tunstall et al.’s [*Natural Language Processing with Transformers*](https://www.oreilly.com/library/view/natural-language-processing/9781098136789) (O’Reilly) for more information. + id: totrans-182 prefs: [] type: TYPE_NORMAL - en: In the following sections we will explore how encoder-decoder transformers work and discuss extensions of the original GPT model architecture released by OpenAI, including ChatGPT, which has been specifically designed for conversational applications. + id: totrans-183 prefs: [] type: TYPE_NORMAL - en: T5 + id: totrans-184 prefs: - PREF_H2 type: TYPE_NORMAL @@ -889,14 +1164,17 @@ the T5 model from Google.^([5](ch09.xhtml#idm45387005361120)) This model reframes a range of tasks into a text-to-text framework, including translation, linguistic acceptability, sentence similarity, and document summarization, as shown in [Figure 9-12](#t5). + id: totrans-185 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0912.png)' + id: totrans-186 prefs: [] type: TYPE_IMG - en: 'Figure 9-12\. 
Examples of how T5 reframes a range of tasks into a text-to-text framework, including translation, linguistic acceptability, sentence similarity, and document summarization (source: [Raffel et al., 2019](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html))' + id: totrans-187 prefs: - PREF_H6 type: TYPE_NORMAL @@ -906,13 +1184,16 @@ Colossal Clean Crawled Corpus, or C4), whereas the original Transformer paper was focused only on language translation, so it was trained on 1.4 GB of English–German sentence pairs. + id: totrans-188 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0913.png)' + id: totrans-189 prefs: [] type: TYPE_IMG - en: 'Figure 9-13\. An encoder-decoder Transformer model: each gray box is a Transformer block (source: [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))' + id: totrans-190 prefs: - PREF_H6 type: TYPE_NORMAL @@ -920,6 +1201,7 @@ being repeated and positional embedding being used to capture the ordering of the input sequences. The two key differences between this model and the GPT model that we built earlier in the chapter are as follows:' + id: totrans-191 prefs: [] type: TYPE_NORMAL - en: On the lefthand side, a set of *encoder* Transformer blocks encode the sequence @@ -929,6 +1211,7 @@ that can be fed to the decoder. Therefore, the attention layers in the encoder can be completely unmasked to capture all the cross-dependencies between words, no matter the order. + id: totrans-192 prefs: - PREF_UL type: TYPE_NORMAL @@ -941,6 +1224,7 @@ is called *cross-referential* attention and means that the decoder can attend to the encoder representation of the input sequence to be translated. This is how the decoder knows what meaning the translation needs to convey! + id: totrans-193 prefs: - PREF_UL type: TYPE_NORMAL @@ -951,18 +1235,22 @@ on the gender of the noun, but the Transformer knows to choose *die* because one attention head is able to attend to the word *street* (a feminine word in German), while another attends to the word to translate (*the*).' + id: totrans-194 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0914.png)' + id: totrans-195 prefs: [] type: TYPE_IMG - en: Figure 9-14\. An example of how one attention head attends to the word “the” and another attends to the word “street” in order to correctly translate the word “the” to the German word “die” as the feminine definite article of “Straße” + id: totrans-196 prefs: - PREF_H6 type: TYPE_NORMAL - en: Tip + id: totrans-197 prefs: - PREF_H6 type: TYPE_NORMAL @@ -970,43 +1258,53 @@ which contains a Colab notebook that allows you to play around with a trained encoder-decoder Transformer model and see how the attention mechanisms of the encoder and decoder impact the translation of a given sentence into German. + id: totrans-198 prefs: [] type: TYPE_NORMAL - en: GPT-3 and GPT-4 + id: totrans-199 prefs: - PREF_H2 type: TYPE_NORMAL - en: Since the original 2018 publication of GPT, OpenAI has released multiple updated versions that improve upon the original model, as shown in [Table 9-2](#gpt_releases). + id: totrans-200 prefs: [] type: TYPE_NORMAL - en: Table 9-2\. 
The evolution of OpenAI’s GPT collection of models + id: totrans-201 prefs: [] type: TYPE_NORMAL - en: '| Model | Date | Layers | Attention heads | Word embedding size | Context window | # parameters | Training data |' + id: totrans-202 prefs: [] type: TYPE_TB - en: '| --- | --- | --- | --- | --- | --- | --- | --- |' + id: totrans-203 prefs: [] type: TYPE_TB - en: '| [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) | Jun 2018 | 12 | 12 | 768 | 512 | 120,000,000 | BookCorpus: 4.5 GB of text from unpublished books |' + id: totrans-204 prefs: [] type: TYPE_TB - en: '| [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) | Feb 2019 | 48 | 48 | 1,600 | 1,024 | 1,500,000,000 | WebText: 40 GB of text from outbound Reddit links |' + id: totrans-205 prefs: [] type: TYPE_TB - en: '| [GPT-3](https://arxiv.org/abs/2005.14165) | May 2020 | 96 | 96 | 12,888 | 2,048 | 175,000,000,000 | CommonCrawl, WebText, English Wikipedia, book corpora and others: 570 GB |' + id: totrans-206 prefs: [] type: TYPE_TB - en: '| [GPT-4](https://arxiv.org/abs/2303.08774) | Mar 2023 | - | - | - | - | - | - |' + id: totrans-207 prefs: [] type: TYPE_TB - en: The model architecture of GPT-3 is fairly similar to the original GPT model, @@ -1016,6 +1314,7 @@ so crosses over into being a multimodal model for the first time. The model weights of GPT-3 and GPT-4 are not open source, though the models are available through a [commercial tool and API](https://platform.openai.com). + id: totrans-208 prefs: [] type: TYPE_NORMAL - en: GPT-3 can also be [fine-tuned to your own training data](https://oreil.ly/B-Koo)—this @@ -1025,16 +1324,20 @@ simply by providing a few examples in the prompt itself (this is known as *few-shot learning*). The benefit of fine-tuning is that you do not need to provide these examples as part of every single input prompt, saving costs in the long run. + id: totrans-209 prefs: [] type: TYPE_NORMAL - en: An example of the output from GPT-3, given a system prompt sentence, is shown in [Figure 9-15](#gpt3_story). + id: totrans-210 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0915.png)' + id: totrans-211 prefs: [] type: TYPE_IMG - en: Figure 9-15\. An example of how GPT-3 can extend a given system prompt + id: totrans-212 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1042,9 +1345,11 @@ of model weights and dataset size. The ceiling of large language model capability has yet to be reached, with researchers continuing to push the boundaries of what is possible with increasingly larger models and datasets. + id: totrans-213 prefs: [] type: TYPE_NORMAL - en: ChatGPT + id: totrans-214 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1053,18 +1358,22 @@ a conversational interface. The original release in November 2022 was powered by *GPT-3.5*, a version of the model that was more powerful that GPT-3 and was fine-tuned to conversational responses. + id: totrans-215 prefs: [] type: TYPE_NORMAL - en: Example dialogue is shown in [Figure 9-16](#chatgpt_example). Notice how the agent is able to maintain state between inputs, understanding that the *attention* mentioned in the second question refers to attention in the context of Transformers, rather than a person’s ability to focus. + id: totrans-216 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0916.png)' + id: totrans-217 prefs: [] type: TYPE_IMG - en: Figure 9-16\. 
An example of ChatGPT answering questions about Transformers + id: totrans-218 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1075,14 +1384,17 @@ group’s earlier paper^([6](ch09.xhtml#idm45387005277024)) that introduced the *InstructGPT* model, a fine-tuned GPT-3 model that is specifically designed to more accurately follow written instructions. + id: totrans-219 prefs: [] type: TYPE_NORMAL - en: 'The training process for ChatGPT is as follows:' + id: totrans-220 prefs: [] type: TYPE_NORMAL - en: '*Supervised fine-tuning*: Collect a demonstration dataset of conversational inputs (prompts) and desired outputs that have been written by humans. This is used to fine-tune the underlying language model (GPT-3.5) using supervised learning.' + id: totrans-221 prefs: - PREF_OL type: TYPE_NORMAL @@ -1090,6 +1402,7 @@ sampled model outputs and ask them to rank the outputs from best to worst. Train a reward model that predicts the score given to each output, given the conversation history.' + id: totrans-222 prefs: - PREF_OL type: TYPE_NORMAL @@ -1100,26 +1413,32 @@ by the reward model trained in step 2\. A reinforcement learning algorithm—proximal policy optimization (PPO)—can then be trained to maximize the reward, by adjusting the weights of the language model.' + id: totrans-223 prefs: - PREF_OL type: TYPE_NORMAL - en: Reinforcement Learning + id: totrans-224 prefs: - PREF_H1 type: TYPE_NORMAL - en: For an introduction to reinforcement learning see [Chapter 12](ch12.xhtml#chapter_world_models), where we explore how generative models can be used in a reinforcement learning setting. + id: totrans-225 prefs: [] type: TYPE_NORMAL - en: The RLHF process is shown in [Figure 9-17](#rlhf). + id: totrans-226 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0917.png)' + id: totrans-227 prefs: [] type: TYPE_IMG - en: 'Figure 9-17\. The reinforcement learning from human feedback fine-tuning process used in ChatGPT (source: [OpenAI](https://openai.com/blog/chatgpt))' + id: totrans-228 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1129,6 +1448,7 @@ and novel output that is often indistinguishable from human-generated text. The progress made thus far by models like ChatGPT serves as a testament to the potential of AI and its transformative impact on the world. + id: totrans-229 prefs: [] type: TYPE_NORMAL - en: Moreover, it is evident that AI-driven communication and interaction will continue @@ -1138,32 +1458,38 @@ text, but also images. The fusion of linguistic and visual capabilities in projects like Visual ChatGPT and GPT-4 have the potential to herald a new era in human–computer interaction. + id: totrans-230 prefs: [] type: TYPE_NORMAL - en: Summary + id: totrans-231 prefs: - PREF_H1 type: TYPE_NORMAL - en: In this chapter, we explored the Transformer model architecture and built a version of GPT—a model for state-of-the-art text generation. + id: totrans-232 prefs: [] type: TYPE_NORMAL - en: GPT makes use of a mechanism known as attention, which removes the need for recurrent layers (e.g., LSTMs). It works like an information retrieval system, utilizing queries, keys, and values to decide how much information it wants to extract from each input token. + id: totrans-233 prefs: [] type: TYPE_NORMAL - en: Attention heads can be grouped together to form what is known as a multihead attention layer. These are then wrapped up inside a Transformer block, which includes layer normalization and skip connections around the attention layer. Transformer blocks can be stacked to create very deep neural networks. 
+ id: totrans-234 prefs: [] type: TYPE_NORMAL - en: Causal masking is used to ensure that GPT cannot leak information from downstream tokens into the current prediction. Also, a technique known as positional encoding is used to ensure that the ordering of the input sequence is not lost, but instead is baked into the input alongside the traditional word embedding. + id: totrans-235 prefs: [] type: TYPE_NORMAL - en: When analyzing the output from GPT, we saw it was possible not only to generate @@ -1173,41 +1499,50 @@ because the attention scores are calculated in parallel and do not rely on a hidden state that is carried through the network sequentially, as is the case with recurrent neural networks. + id: totrans-236 prefs: [] type: TYPE_NORMAL - en: We saw how there are three families of Transformers (encoder, decoder, and encoder-decoder) and the different tasks that can be accomplished with each. Finally, we explored the structure and training process of other large language models such as Google’s T5 and OpenAI’s ChatGPT. + id: totrans-237 prefs: [] type: TYPE_NORMAL - en: ^([1](ch09.xhtml#idm45387006840576-marker)) Ashish Vaswani et al., “Attention Is All You Need,” June 12, 2017, [*https://arxiv.org/abs/1706.03762*](https://arxiv.org/abs/1706.03762). + id: totrans-238 prefs: [] type: TYPE_NORMAL - en: ^([2](ch09.xhtml#idm45387006828736-marker)) Alec Radford et al., “Improving Language Understanding by Generative Pre-Training,” June 11, 2018, [*https://openai.com/research/language-unsupervised*](https://openai.com/research/language-unsupervised). + id: totrans-239 prefs: [] type: TYPE_NORMAL - en: '^([3](ch09.xhtml#idm45387006370384-marker)) Jacob Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” October 11, 2018, [*https://arxiv.org/abs/1810.04805*](https://arxiv.org/abs/1810.04805).' + id: totrans-240 prefs: [] type: TYPE_NORMAL - en: '^([4](ch09.xhtml#idm45387006340992-marker)) Sheng Shen et al., “PowerNorm: Rethinking Batch Normalization in Transformers,” June 28, 2020, [*https://arxiv.org/abs/2003.07845*](https://arxiv.org/abs/2003.07845).' + id: totrans-241 prefs: [] type: TYPE_NORMAL - en: ^([5](ch09.xhtml#idm45387005361120-marker)) Colin Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” October 23, 2019, [*https://arxiv.org/abs/1910.10683*](https://arxiv.org/abs/1910.10683). + id: totrans-242 prefs: [] type: TYPE_NORMAL - en: ^([6](ch09.xhtml#idm45387005277024-marker)) Long Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback,” March 4, 2022, [*https://arxiv.org/abs/2203.02155*](https://arxiv.org/abs/2203.02155). + id: totrans-243 prefs: [] type: TYPE_NORMAL - en: '^([7](ch09.xhtml#idm45387005252672-marker)) Chenfei Wu et al., “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models,” March 8, 2023, [*https://arxiv.org/abs/2303.04671*](https://arxiv.org/abs/2303.04671).`' + id: totrans-244 prefs: [] type: TYPE_NORMAL