From e4ab2e1a07fd2e03692eee787f3770f4904daf14 Mon Sep 17 00:00:00 2001
From: wizardforcel <562826179@qq.com>
Date: Thu, 8 Feb 2024 19:11:21 +0800
Subject: [PATCH] 2024-02-08 19:11:19
---
totrans/gen-dl_13.yaml | 150 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 150 insertions(+)
diff --git a/totrans/gen-dl_13.yaml b/totrans/gen-dl_13.yaml
index f519abe..33d04c0 100644
--- a/totrans/gen-dl_13.yaml
+++ b/totrans/gen-dl_13.yaml
@@ -3,6 +3,7 @@
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 第9章 Transformer
- en: We saw in [Chapter 5](ch05.xhtml#chapter_autoregressive) how we can build generative
models on text data using recurrent neural networks (RNNs), such as LSTMs and
GRUs. These autoregressive models process sequential data one token at a time,
@@ -14,17 +15,20 @@
id: totrans-1
prefs: []
type: TYPE_NORMAL
+ zh: 我们在[第5章](ch05.xhtml#chapter_autoregressive)中看到,可以使用循环神经网络(RNN),例如LSTM和GRU,在文本数据上构建生成模型。这些自回归模型一次处理一个标记的序列数据,并不断更新一个隐藏向量,该向量捕获了输入当前的潜在表示。通过在隐藏向量上应用密集层和softmax激活,可以让RNN预测序列中的下一个单词。这一直被认为是生成文本最先进的方式,直到2017年一篇论文的发表永久地改变了文本生成的格局。
- en: Introduction
id: totrans-2
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 介绍
- en: The Google Brain paper, confidently entitled “Attention Is All You Need,”^([1](ch09.xhtml#idm45387006840576))
is famous for popularizing the concept of *attention*—a mechanism that now powers
most state-of-the-art text generation models.
id: totrans-3
prefs: []
type: TYPE_NORMAL
+ zh: 谷歌Brain的论文,自信地命名为“注意力就是一切”^([1](ch09.xhtml#idm45387006840576)),因推广*注意力*的概念而闻名,这个概念现在驱动着大多数最先进的文本生成模型。
- en: The authors show how it is possible to create powerful neural networks called
*Transformers* for sequential modeling that do not require complex recurrent or
convolutional architectures but instead only rely on attention mechanisms. This
@@ -34,6 +38,7 @@
id: totrans-4
prefs: []
type: TYPE_NORMAL
+ zh: 作者展示了如何创建称为*Transformer*的强大神经网络来进行序列建模,它们不需要复杂的循环或卷积架构,而只依赖注意力机制。这种方法克服了RNN方法的一个关键缺点:由于必须一次处理一个标记的序列,RNN难以并行化。相比之下,Transformer是高度可并行化的,这使它们能够在海量数据集上进行训练。
- en: In this chapter, we are going to delve into how modern text generation models
make use of the Transformer architecture to reach state-of-the-art performance
on text generation challenges. In particular, we will explore a type of autoregressive
@@ -42,23 +47,27 @@
id: totrans-5
prefs: []
type: TYPE_NORMAL
+ zh: 在本章中,我们将深入探讨现代文本生成模型如何利用Transformer架构,在文本生成挑战中达到最先进的性能。特别是,我们将探索一种称为*生成式预训练Transformer*(GPT)的自回归模型,它驱动着OpenAI的GPT-4模型,而GPT-4被广泛认为是当前文本生成领域的最先进技术。
- en: GPT
id: totrans-6
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: GPT
- en: OpenAI introduced GPT in June 2018, in the paper “Improving Language Understanding
by Generative Pre-Training,”^([2](ch09.xhtml#idm45387006828736)) almost exactly
a year after the appearance of the original Transformer paper.
id: totrans-7
prefs: []
type: TYPE_NORMAL
+ zh: OpenAI于2018年6月在论文“通过生成式预训练改进语言理解”^([2](ch09.xhtml#idm45387006828736))中提出了GPT,几乎恰好是在原始Transformer论文发表一年之后。
- en: In this paper, the authors show how a Transformer architecture can be trained
on a huge amount of text data to predict the next word in a sequence and then
subsequently fine-tuned to specific downstream tasks.
id: totrans-8
prefs: []
type: TYPE_NORMAL
+ zh: 在本文中,作者展示了如何在海量文本数据上训练Transformer架构,以预测序列中的下一个单词,然后再针对特定的下游任务进行微调。
- en: The pre-training process of GPT involves training the model on a large corpus
of text called BookCorpus (4.5 GB of text from 7,000 unpublished books of different
genres). During pre-training, the model is trained to predict the next word in
@@ -68,6 +77,7 @@
id: totrans-9
prefs: []
type: TYPE_NORMAL
+ zh: GPT的预训练过程是在一个名为BookCorpus的大型文本语料库上训练模型(来自7,000本不同体裁的未出版书籍,共4.5 GB文本)。在预训练期间,模型被训练为在给定前面单词的情况下预测序列中的下一个单词。这个过程被称为*语言建模*,用于教会模型理解自然语言的结构和模式。
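To make the language-modeling objective concrete, here is a minimal sketch of next-word prediction with shifted targets and cross-entropy loss. It is not the book's code; the token IDs, vocabulary size, and random logits are placeholders standing in for a real tokenizer and model.

```python
import numpy as np
import tensorflow as tf

# Toy batch of token IDs: each row is one tokenized sequence from the corpus.
token_ids = np.array([[12, 7, 45, 3, 9, 2]])

# Language modeling: the target at every position is simply the next token.
inputs = token_ids[:, :-1]   # [12, 7, 45, 3, 9]
targets = token_ids[:, 1:]   # [ 7, 45, 3, 9, 2]

# Any model that maps token IDs to per-position logits over the vocabulary
# can be trained with ordinary cross-entropy against these shifted targets.
vocab_size = 50
logits = tf.random.normal((1, inputs.shape[1], vocab_size))  # stand-in for model(inputs)

loss = tf.keras.losses.sparse_categorical_crossentropy(
    targets, logits, from_logits=True
)
print(loss.shape)  # one loss value per predicted position: (1, 5)
```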
- en: After pre-training, the GPT model can be fine-tuned for a specific task by providing
it with a smaller, task-specific dataset. Fine-tuning involves adjusting the parameters
of the model to better fit the task at hand. For example, the model can be fine-tuned
@@ -75,6 +85,7 @@
id: totrans-10
prefs: []
type: TYPE_NORMAL
+ zh: 在预训练之后,GPT模型可以通过提供较小的、特定于任务的数据集来进行微调以适应特定任务。微调涉及调整模型的参数以更好地适应手头的任务。例如,模型可以针对分类、相似性评分或问题回答等任务进行微调。
- en: The GPT architecture has since been improved and extended by OpenAI with the
release of subsequent models such as GPT-2, GPT-3, GPT-3.5, and GPT-4\. These
models are trained on larger datasets and have larger capacities, so they can
@@ -84,31 +95,37 @@
id: totrans-11
prefs: []
type: TYPE_NORMAL
+ zh: 自GPT架构推出以来,OpenAI通过发布后续模型如GPT-2、GPT-3、GPT-3.5和GPT-4对其进行了改进和扩展。这些模型在更大的数据集上进行训练,并具有更大的容量,因此可以生成更复杂和连贯的文本。研究人员和行业从业者广泛采用了GPT模型,并为自然语言处理任务的重大进展做出了贡献。
- en: In this chapter, we will build our own variation of the original GPT model,
trained on less data, but still utilizing the same components and underlying principles.
id: totrans-12
prefs: []
type: TYPE_NORMAL
+ zh: 在本章中,我们将构建我们自己的原始GPT模型变体,它在更少的数据上训练,但仍然使用相同的组件和基本原理。
- en: Running the Code for This Example
id: totrans-13
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 运行此示例的代码
- en: The code for this example can be found in the Jupyter notebook located at *notebooks/09_transformer/01_gpt/gpt.ipynb*
in the book repository.
id: totrans-14
prefs: []
type: TYPE_NORMAL
+ zh: 此示例的代码可以在位于书籍存储库中的Jupyter笔记本中找到,位置为*notebooks/09_transformer/01_gpt/gpt.ipynb*。
- en: The code is adapted from the excellent [GPT tutorial](https://oreil.ly/J86pg)
created by Apoorv Nandan available on the Keras website.
id: totrans-15
prefs: []
type: TYPE_NORMAL
+ zh: 该代码改编自由Apoorv Nandan创建的优秀[GPT教程](https://oreil.ly/J86pg),该教程可在Keras网站上找到。
- en: The Wine Reviews Dataset
id: totrans-16
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 葡萄酒评论数据集
- en: We’ll be using the [Wine Reviews dataset](https://oreil.ly/DC9EG) that is available
through Kaggle. This is a set of over 130,000 reviews of wines, with accompanying
metadata such as description and price.
@@ -266,6 +283,7 @@
id: totrans-38
prefs: []
type: TYPE_NORMAL
+ zh: 那么,注意力头如何决定在哪里查找信息呢?在深入细节之前,让我们以高层次的方式探讨它是如何工作的,使用我们的*粉色大象*示例。
- en: Imagine that we want to predict what follows the word *too*. To help with this
task, other preceding words chime in with their opinions, but their contributions
are weighted by how confident they are in their own expertise in predicting words
@@ -275,6 +293,7 @@
id: totrans-39
prefs: []
type: TYPE_NORMAL
+ zh: 想象一下,我们想预测跟在单词*too*后面的是什么。为了帮助完成这个任务,其他前面的单词发表意见,但他们的贡献受到他们对自己预测跟在*too*后面的单词的信心程度的加权。例如,单词*elephant*可能自信地贡献说,它更有可能是与大小或响度相关的单词,而单词*was*没有太多可以提供来缩小可能性。
- en: In other words, we can think of an attention head as a kind of information retrieval
system, where a *query* (“What word follows *too*?”) is made into a *key/value*
store (other words in the sentence) and the resulting output is a sum of the values,
@@ -282,20 +301,24 @@
id: totrans-40
prefs: []
type: TYPE_NORMAL
+ zh: 换句话说,我们可以将注意力头视为一种信息检索系统:一个*查询*(“*too*后面跟着什么词?”)被投向一个*键/值*存储(句子中的其他单词),得到的输出是各个值的加权和,权重由查询与每个键之间的*共鸣程度*决定。
- en: We will now walk through the process in detail ([Figure 9-2](#attention_head)),
again with reference to our *pink elephant* sentence.
id: totrans-41
prefs: []
type: TYPE_NORMAL
+ zh: 我们现在将详细介绍这个过程([图9-2](#attention_head)),再次参考我们的*粉色大象*句子。
- en: '![](Images/gdl2_0902.png)'
id: totrans-42
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0902.png)'
- en: Figure 9-2\. The mechanics of an attention head
id: totrans-43
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-2。注意力头的机制
- en: The *query* ( ) can be thought of
as a representation of the current task at hand (e.g., “What word follows *too*?”).
In this example, it is derived from the embedding of the word *too*, by passing
@@ -306,6 +329,10 @@
id: totrans-44
prefs: []
type: TYPE_NORMAL
+ zh: '*查询*()可以被视为当前手头任务的一种表示(例如,“*too*后面跟着什么词?”)。在这个例子中,它是从单词*too*的嵌入导出的:将该嵌入传入一个权重矩阵,把向量的维度从更改为。'
- en: The *key* vectors ( ) are representations
of each word in the sentence—you can think of these as descriptions of the kinds
of prediction tasks that each word can help with. They are derived in a similar
@@ -318,6 +345,11 @@
id: totrans-45
prefs: []
type: TYPE_NORMAL
+ zh: '*键*向量()是句子中每个单词的表示——您可以将它们视为对每个单词所能帮助的预测任务类型的描述。它们的导出方式与查询类似:将每个嵌入传入一个权重矩阵,把每个向量的维度从更改为。请注意,键和查询具有相同的长度()。'
- en: Inside the attention head, each key is compared to the query using a dot product
between each pair of vectors ( ). This is
@@ -332,6 +364,9 @@
id: totrans-46
prefs: []
type: TYPE_NORMAL
+ zh: 在注意力头内部,每个键都会通过向量点积与查询进行比较()。这就是键和查询必须具有相同长度的原因。对于某个特定的键/查询对,这个数值越高,该键与查询的共鸣就越强,因此它对注意力头输出的贡献就越大。得到的向量会被缩放,以保持向量和的方差稳定(大约等于1),然后应用softmax以确保各项贡献之和为1。这就是*注意力权重*向量。
- en: The *value* vectors ( ) are also representations
of the words in the sentence—you can think of these as the unweighted contributions
of each word. They are derived by passing each embedding through a weights matrix
@@ -343,6 +378,10 @@
id: totrans-47
prefs: []
type: TYPE_NORMAL
+ zh: '*值*向量()同样是句子中各个单词的表示——您可以将它们视为每个单词未加权的贡献。它们是通过将每个嵌入传入一个权重矩阵导出的,以便把每个向量的维度从更改为。请注意,值向量不一定要与键和查询具有相同的长度(但为了简单起见,通常会保持相同)。'
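To tie the query, key, and value descriptions together, here is a minimal NumPy sketch of a single attention head. The dimensions and random weight matrices are illustrative placeholders, not the book's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_embed, d_k, d_v = 6, 32, 16, 16      # illustrative sizes
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_embed))          # embeddings of the sentence tokens

# Learned projection matrices (random here; trained in practice)
W_Q = rng.normal(size=(d_embed, d_k))
W_K = rng.normal(size=(d_embed, d_k))
W_V = rng.normal(size=(d_embed, d_v))

Q = X @ W_Q   # queries: what each position is looking for
K = X @ W_K   # keys: what each position can help predict
V = X @ W_V   # values: each position's unweighted contribution

# Compare every query with every key, scale, and normalize to attention weights
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores)          # each row sums to 1

# Output: attention-weighted sum of the value vectors
attention = weights @ V
print(attention.shape)             # (seq_len, d_v)
```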
- en: The value vectors are multiplied by the attention weights to give the *attention*
for a given ,
, and , as shown in [Equation 9-1](#attention_equation).
@@ -571,11 +610,13 @@
id: totrans-80
prefs: []
type: TYPE_NORMAL
+ zh: 其他类型的Transformer(例如*编码器Transformer*)不需要因果掩码,因为它们不是被训练来预测下一个标记的。例如,Google的BERT是在同时给定缺失单词前后上下文的情况下,预测句子中被掩码的单词。^([3](ch09.xhtml#idm45387006370384))
- en: We will explore the different types of Transformers in more detail at the end
of the chapter.
id: totrans-81
prefs: []
type: TYPE_NORMAL
+ zh: 我们将在本章末尾更详细地探讨不同类型的Transformer。
- en: This concludes our explanation of the multihead attention mechanism that is
present in all Transformers. It is remarkable that the learnable parameters of
such an influential layer consist of nothing more than three densely connected
@@ -589,31 +630,41 @@
id: totrans-82
prefs: []
type: TYPE_NORMAL
+ zh: 这就结束了我们对所有Transformer中都存在的多头注意力机制的讲解。令人惊讶的是,这样一个影响深远的层,其可学习参数仅由每个注意力头的三个密集连接权重矩阵(、、)以及一个用于重塑输出的额外权重矩阵()组成。多头注意力层中完全没有任何卷积或循环机制!
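For reference, Keras ships a built-in layer that bundles exactly these projections. The sketch below (with illustrative sizes) builds the layer with a self-attention call and then lists its trainable weights, which consist only of the query, key, value, and output projections.

```python
import tensorflow as tf

# Illustrative sizes: 4 heads, 16-dimensional keys, 64-dimensional embeddings.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

x = tf.random.normal((1, 10, 64))      # (batch, sequence, embedding)
out = mha(query=x, value=x, key=x)     # self-attention: q, k, v from the same input
print(out.shape)                       # (1, 10, 64)

# The only trainable variables are the dense projection kernels and biases.
for w in mha.trainable_weights:
    print(w.name, w.shape)
```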
- en: Next, we shall take a step back and see how the multihead attention layer forms
just one part of a larger component known as a *Transformer block*.
id: totrans-83
prefs: []
type: TYPE_NORMAL
+ zh: 接下来,我们将退一步,看看多头注意力层如何形成更大组件的一部分,这个组件被称为*Transformer块*。
- en: The Transformer Block
id: totrans-84
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: Transformer块
- en: A *Transformer block* is a single component within a Transformer that applies
some skip connections, feed-forward (dense) layers, and normalization around the
multihead attention layer. A diagram of a Transformer block is shown in [Figure 9-6](#transformer_block).
id: totrans-85
prefs: []
type: TYPE_NORMAL
+ zh: '*Transformer块*是Transformer中的单个组件,它在多头注意力层周围应用跳跃连接、前馈(密集)层和归一化。Transformer块的示意图如[图9-6](#transformer_block)所示。'
- en: '![](Images/gdl2_0906.png)'
id: totrans-86
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0906.png)'
- en: Figure 9-6\. A Transformer block
id: totrans-87
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-6。一个Transformer块
- en: Firstly, notice how the query is passed around the multihead attention layer
to be added to the output—this is a skip connection and is common in modern deep
learning architectures. It means we can build very deep neural networks that do
@@ -623,6 +674,7 @@
id: totrans-88
prefs: []
type: TYPE_NORMAL
+ zh: 首先,请注意查询是如何被绕过多头注意力层传递、再与该层的输出相加的——这是一个跳跃连接,在现代深度学习架构中很常见。这意味着我们可以构建非常深的神经网络,而不会过多地受到梯度消失问题的困扰,因为跳跃连接提供了一条无梯度的*高速公路*,让网络能够不间断地向前传递信息。
- en: Secondly, *layer normalization* is used in the Transformer block to provide
stability to the training process. We have already seen the batch normalization
layer in action throughout this book, where the output from each channel is normalized
@@ -631,6 +683,7 @@
id: totrans-89
prefs: []
type: TYPE_NORMAL
+ zh: 其次,在Transformer块中使用*层归一化*来提供训练过程的稳定性。我们已经在本书中看到了批归一化层的作用,其中每个通道的输出被归一化为均值为0,标准差为1。归一化统计量是跨批次和空间维度计算的。
- en: In contrast, layer normalization in a Transformer block normalizes each position
of each sequence in the batch by calculating the normalizing statistics across
the channels. It is the complete opposite of batch normalization, in terms of
@@ -639,21 +692,25 @@
id: totrans-90
prefs: []
type: TYPE_NORMAL
+ zh: 相比之下,Transformer块中的层归一化是通过在通道维度上计算归一化统计量,对批次中每个序列的每个位置进行归一化。就归一化统计量的计算方式而言,它与批归一化完全相反。批归一化与层归一化之间差异的示意图如[图9-7](#layer_norm)所示。
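A quick way to see the difference is to compare the two Keras layers side by side; the shapes below are illustrative and the snippet is not taken from the book.

```python
import tensorflow as tf

x = tf.random.normal((8, 80, 256))    # (batch, sequence positions, channels)

# Layer normalization: statistics over the channel axis, computed independently
# for every position of every sequence (the choice used inside Transformer blocks).
layer_norm = tf.keras.layers.LayerNormalization(axis=-1)

# Batch normalization: one mean/variance per channel, computed across the batch
# (and sequence) dimensions.
batch_norm = tf.keras.layers.BatchNormalization(axis=-1)

y_ln = layer_norm(x)
y_bn = batch_norm(x, training=True)
print(y_ln.shape, y_bn.shape)         # both (8, 80, 256); only the statistics differ
```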
- en: '![](Images/gdl2_0907.png)'
id: totrans-91
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0907.png)'
- en: 'Figure 9-7\. Layer normalization versus batch normalization—the normalization
statistics are calculated across the blue cells (source: [Sheng et al., 2020](https://arxiv.org/pdf/2003.07845.pdf))^([4](ch09.xhtml#idm45387006340992))'
id: totrans-92
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-7。层归一化与批归一化——归一化统计量是跨蓝色单元计算的(来源:[Sheng等人,2020](https://arxiv.org/pdf/2003.07845.pdf))^([4](ch09.xhtml#idm45387006340992))
- en: Layer Normalization Versus Batch Normalization
id: totrans-93
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 层归一化与批归一化
- en: Layer normalization was used in the original GPT paper and is commonly used
for text-based tasks to avoid creating normalization dependencies across sequences
in the batch. However, recent work such as Shen et al.*s* challenges this assumption,
@@ -662,21 +719,25 @@
id: totrans-94
prefs: []
type: TYPE_NORMAL
+ zh: 层归一化在原始GPT论文中被采用,并且通常用于基于文本的任务,以避免在批次内的序列之间产生归一化依赖。然而,Shen等人等最近的研究对这一假设提出了质疑,表明只需一些调整,某种形式的批归一化仍然可以在Transformer中使用,并且表现优于更传统的层归一化。
- en: Lastly, a set of feed-forward (i.e., densely connected) layers is included in
the Transformer block, to allow the component to extract higher-level features
as we go deeper into the network.
id: totrans-95
prefs: []
type: TYPE_NORMAL
+ zh: 最后,在Transformer块中包含了一组前馈(即密集连接)层,以允许组件在网络深入时提取更高级别的特征。
- en: A Keras implementation of a Transformer block is shown in [Example 9-4](#transformer_block_code2).
id: totrans-96
prefs: []
type: TYPE_NORMAL
+ zh: Transformer块的Keras实现如[示例9-4](#transformer_block_code2)所示。
- en: Example 9-4\. A `TransformerBlock` layer in Keras
id: totrans-97
prefs:
- PREF_H5
type: TYPE_NORMAL
+ zh: 示例9-4。Keras中的`TransformerBlock`层
- en: '[PRE4]'
id: totrans-98
prefs: []
@@ -686,6 +747,7 @@
id: totrans-99
prefs: []
type: TYPE_NORMAL
+ zh: '[![1](Images/1.png)](#co_transformers_CO2-1)'
- en: The sublayers that make up the `TransformerBlock` layer are defined within the
initialization function.
id: totrans-100
@@ -1068,6 +1130,7 @@
id: totrans-169
prefs: []
type: TYPE_NORMAL
+ zh: 在第二个例子中,它需要参考葡萄品种(*雷司令*),因此它会关注第一次提到该词的位置。它可以通过直接关注这个词来提取这一信息,无论该词在句子中有多远(只要在80个单词的上限之内)。请注意,这与循环神经网络非常不同:后者依赖隐藏状态来保存整个序列中所有有用的信息,以便在需要时加以利用,这是一种效率低得多的方法。
- en: The final sequence shows an example of how our GPT model can choose an appropriate
adjective based on a combination of information. Here the attention is again on
the grape (*riesling*), but also on the fact that it contains *residual sugar*.
@@ -1077,6 +1140,7 @@
id: totrans-170
prefs: []
type: TYPE_NORMAL
+ zh: 最终的序列展示了我们的GPT模型如何基于信息的组合选择适当的形容词的例子。这里的注意力再次集中在葡萄(*雷司令*)上,但也集中在它含有*残留糖*的事实上。由于雷司令通常是一种甜酒,而且已经提到了糖,因此将其描述为*略带甜味*而不是*略带泥土味*是有道理的。
- en: It is incredibly informative to be able to interrogate the network in this way,
to understand exactly where it is pulling information from in order to make accurate
decisions about each subsequent word. I highly recommend playing around with the
@@ -1086,6 +1150,8 @@
id: totrans-171
prefs: []
type: TYPE_NORMAL
+ zh: 能够以这种方式审视网络是非常有启发性的,可以准确了解网络从哪里提取信息,从而对每个后续单词做出准确的预测。我强烈建议尝试不同的输入提示,看看能否让模型关注句子中很靠前的单词,以让自己确信基于注意力的模型相比传统的循环模型有多么强大!
+ 其他Transformer
- en: Our GPT model is a *decoder Transformer*—it generates a text string one token
at a time and uses causal masking to only attend to previous words in the input
string. There are also *encoder Transformers*, which do not use causal masking—instead,
@@ -1097,37 +1163,45 @@
id: totrans-172
prefs: []
type: TYPE_NORMAL
+ zh: 我们的GPT模型是一个*解码器Transformer*——它一次生成一个标记的文本字符串,并使用因果屏蔽只关注输入字符串中的先前单词。还有*编码器Transformer*,它不使用因果屏蔽——相反,它关注整个输入字符串以提取输入的有意义的上下文表示。对于其他任务,比如语言翻译,还有*编码器-解码器Transformer*,可以将一个文本字符串翻译成另一个;这种模型包含编码器Transformer块和解码器Transformer块。
- en: '[Table 9-1](#transformer_types) summarizes the three types of Transformers,
with the best examples of each architecture and typical use cases.'
id: totrans-173
prefs: []
type: TYPE_NORMAL
+ zh: '[表9-1](#transformer_types)总结了三种Transformer的类型,以及每种架构的最佳示例和典型用例。'
- en: Table 9-1\. The three Transformer architectures
id: totrans-174
prefs: []
type: TYPE_NORMAL
+ zh: 表9-1。三种Transformer架构
- en: '| Type | Examples | Use cases |'
id: totrans-175
prefs: []
type: TYPE_TB
+ zh: '| 类型 | 示例 | 用例 |'
- en: '| --- | --- | --- |'
id: totrans-176
prefs: []
type: TYPE_TB
+ zh: '| --- | --- | --- |'
- en: '| Encoder | BERT (Google) | Sentence classification, named entity recognition,
extractive question answering |'
id: totrans-177
prefs: []
type: TYPE_TB
+ zh: '| 编码器 | BERT(谷歌) | 句子分类、命名实体识别、抽取式问答 |'
- en: '| Encoder-decoder | T5 (Google) | Summarization, translation, question answering
|'
id: totrans-178
prefs: []
type: TYPE_TB
+ zh: '| 编码器-解码器 | T5(谷歌) | 摘要、翻译、问答 |'
- en: '| Decoder | GPT-3 (OpenAI) | Text generation |'
id: totrans-179
prefs: []
type: TYPE_TB
+ zh: '| 解码器 | GPT-3(OpenAI) | 文本生成 |'
- en: A well-known example of an encoder Transformer is the *Bidirectional Encoder
Representations from Transformers* (BERT) model, developed by Google (Devlin et
al., 2018) that predicts missing words from a sentence, given context from both
@@ -1135,11 +1209,13 @@
id: totrans-180
prefs: []
type: TYPE_NORMAL
+ zh: 一个著名的编码器Transformer例子是谷歌开发的*基于Transformer的双向编码器表示*(BERT)模型(Devlin等人,2018),它根据缺失单词前后两侧的上下文来预测句子中缺失的单词。
- en: Encoder Transformers
id: totrans-181
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 编码器Transformer
- en: Encoder Transformers are typically used for tasks that require an understanding
of the input as a whole, such as sentence classification, named entity recognition,
and extractive question answering. They are not used for text generation tasks,
@@ -1149,17 +1225,21 @@
id: totrans-182
prefs: []
type: TYPE_NORMAL
+ zh: 编码器Transformer通常用于需要全面理解输入的任务,比如句子分类、命名实体识别和抽取式问答。它们不用于文本生成任务,因此我们不会在本书中详细探讨它们——有关更多信息,请参阅Lewis
+ Tunstall等人的[*使用Transformer进行自然语言处理*](https://www.oreilly.com/library/view/natural-language-processing/9781098136789)(O'Reilly)。
- en: In the following sections we will explore how encoder-decoder transformers work
and discuss extensions of the original GPT model architecture released by OpenAI,
including ChatGPT, which has been specifically designed for conversational applications.
id: totrans-183
prefs: []
type: TYPE_NORMAL
+ zh: 在接下来的章节中,我们将探讨编码器-解码器Transformer的工作原理,并讨论OpenAI发布的原始GPT模型架构的扩展,包括专门为对话应用设计的ChatGPT。
- en: T5
id: totrans-184
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: T5
- en: An example of a modern Transformer that uses the encoder-decoder structure is
the T5 model from Google.^([5](ch09.xhtml#idm45387005361120)) This model reframes
a range of tasks into a text-to-text framework, including translation, linguistic
@@ -1167,10 +1247,12 @@
id: totrans-185
prefs: []
type: TYPE_NORMAL
+ zh: 一个使用编码器-解码器结构的现代Transformer例子是谷歌的T5模型。^([5](ch09.xhtml#idm45387005361120))该模型将一系列任务重新构建为文本到文本的框架,包括翻译、语言可接受性判断、句子相似度和文档摘要,如[图9-12](#t5)所示。
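In the text-to-text framing, every task becomes an input string and a target string, with the task signalled by a plain-text prefix. The pairs below are illustrative approximations of the examples shown in Figure 9-12, not exact reproductions.

```python
# Illustrative (input, target) pairs in the T5 text-to-text format.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm in attala county."),
]
for source, target in examples:
    print(f"{source!r}  ->  {target!r}")
```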
- en: '![](Images/gdl2_0912.png)'
id: totrans-186
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0912.png)'
- en: 'Figure 9-12\. Examples of how T5 reframes a range of tasks into a text-to-text
framework, including translation, linguistic acceptability, sentence similarity,
and document summarization (source: [Raffel et al., 2019](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html))'
@@ -1178,6 +1260,7 @@
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-12。T5如何将一系列任务重新构建为文本到文本框架的示例,包括翻译、语言可接受性、句子相似性和文档摘要(来源:[Raffel et al., 2019](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html))
- en: The T5 model architecture closely matches the encoder-decoder architecture used
in the original Transformer paper, shown in [Figure 9-13](#transformer2). The
key difference is that T5 is trained on an enormous 750 GB corpus of text (the
@@ -1187,16 +1270,20 @@
id: totrans-188
prefs: []
type: TYPE_NORMAL
+ zh: T5模型架构与原始Transformer论文中使用的编码器-解码器架构非常相似,如[图9-13](#transformer2)所示。关键区别在于T5是在一个庞大的750GB文本语料库(Colossal
+ Clean Crawled Corpus,或C4)上进行训练的,而原始Transformer论文仅关注语言翻译,因此它是在1.4GB的英德句对上进行训练的。
- en: '![](Images/gdl2_0913.png)'
id: totrans-189
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0913.png)'
- en: 'Figure 9-13\. An encoder-decoder Transformer model: each gray box is a Transformer
block (source: [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))'
id: totrans-190
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-13。编码器-解码器Transformer模型:每个灰色框是一个Transformer块(来源:[Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))
- en: 'Much of this diagram is already familiar to us—we can see the Transformer blocks
being repeated and positional embedding being used to capture the ordering of
the input sequences. The two key differences between this model and the GPT model
@@ -1204,6 +1291,7 @@
id: totrans-191
prefs: []
type: TYPE_NORMAL
+ zh: 这个图表中的大部分内容对我们来说已经很熟悉了——我们可以看到Transformer块被重复,并且使用位置嵌入来捕捉输入序列的顺序。这个模型与我们在本章前面构建的GPT模型之间的两个关键区别如下:
- en: On the lefthand side, a set of *encoder* Transformer blocks encode the sequence
to be translated. Notice that there is no causal masking on the attention layer.
This is because we are not generating further text to extend the sequence to be
@@ -1215,6 +1303,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
+ zh: 在左侧,一组*编码器*Transformer块对待翻译的序列进行编码。请注意,注意力层上没有因果屏蔽。这是因为我们不生成更多文本来扩展要翻译的序列;我们只想学习一个可以提供给解码器的整个序列的良好表示。因此,编码器中的注意力层可以完全不加屏蔽,以捕捉单词之间的所有交叉依赖关系,无论顺序如何。
- en: On the righthand side, a set of *decoder* Transformer blocks generate the translated
text. The initial attention layer is *self-referential* (i.e., the key, value,
and query come from the same input) and causal masking is used to ensure information
@@ -1228,6 +1317,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
+ zh: 在右侧,一组*解码器*Transformer块生成翻译文本。初始注意力层是*自指*的(即,键、值和查询来自相同的输入),并且使用因果屏蔽确保来自未来标记的信息不会泄漏到当前要预测的单词。然而,我们可以看到随后的注意力层从编码器中提取键和值,只留下查询从解码器本身传递。这被称为*交叉引用*注意力,意味着解码器可以关注输入序列的编码器表示。这就是解码器知道翻译需要传达什么含义的方式!
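Below is a minimal Keras sketch of the cross-referential attention step just described, with illustrative shapes: the query comes from the decoder sequence, while the key and value come from the encoder output.

```python
import tensorflow as tf

encoder_output = tf.random.normal((1, 12, 64))   # encoded source sentence
decoder_seq    = tf.random.normal((1, 7, 64))    # target tokens generated so far

cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

# Query from the decoder; key and value from the encoder representation.
context = cross_attention(
    query=decoder_seq,
    value=encoder_output,
    key=encoder_output,
)
print(context.shape)   # (1, 7, 64): one context vector per decoder position
```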
- en: '[Figure 9-14](#attention_example) shows an example of cross-referential attention.
Two attention heads of the decoder layer are able to work together to provide
the correct German translation for the word *the*, when used in the context of
@@ -1238,10 +1328,13 @@
id: totrans-194
prefs: []
type: TYPE_NORMAL
+ zh: '[图9-14](#attention_example)展示了一个交叉引用注意力的例子。当单词*the*出现在*the street*的上下文中时,解码器层的两个注意力头能够协同工作,为它给出正确的德语翻译。在德语中,定冠词根据
+ 名词的性别有三种形式(*der, die, das*),但Transformer知道应选择*die*,因为一个注意力头能够关注单词*street*(在德语中是阴性名词),而另一个注意力头关注要翻译的单词(*the*)。'
- en: '![](Images/gdl2_0914.png)'
id: totrans-195
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0914.png)'
- en: Figure 9-14\. An example of how one attention head attends to the word “the”
and another attends to the word “street” in order to correctly translate the word
“the” to the German word “die” as the feminine definite article of “Straße”
@@ -1249,11 +1342,13 @@
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-14。一个注意力头关注单词“the”、另一个注意力头关注单词“street”,从而将“the”正确翻译为德语单词“die”(即“Straße”的阴性定冠词)的示例
- en: Tip
id: totrans-197
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 提示
- en: This example is from the [Tensor2Tensor GitHub repository](https://oreil.ly/84lIA),
which contains a Colab notebook that allows you to play around with a trained
encoder-decoder Transformer model and see how the attention mechanisms of the
@@ -1261,52 +1356,67 @@
id: totrans-198
prefs: []
type: TYPE_NORMAL
+ zh: 这个例子来自[Tensor2Tensor GitHub存储库](https://oreil.ly/84lIA),其中包含一个Colab笔记本,让您可以玩转一个经过训练的编码器-解码器Transformer模型,并查看编码器和解码器的注意力机制如何影响将给定句子翻译成德语。
- en: GPT-3 and GPT-4
id: totrans-199
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: GPT-3和GPT-4
- en: Since the original 2018 publication of GPT, OpenAI has released multiple updated
versions that improve upon the original model, as shown in [Table 9-2](#gpt_releases).
id: totrans-200
prefs: []
type: TYPE_NORMAL
+ zh: 自2018年GPT的原始出版以来,OpenAI已发布了多个更新版本,改进了原始模型,如[表9-2](#gpt_releases)所示。
- en: Table 9-2\. The evolution of OpenAI’s GPT collection of models
id: totrans-201
prefs: []
type: TYPE_NORMAL
+ zh: 表9-2。OpenAI的GPT系列模型的演变
- en: '| Model | Date | Layers | Attention heads | Word embedding size | Context window
| # parameters | Training data |'
id: totrans-202
prefs: []
type: TYPE_TB
+ zh: '| 模型 | 日期 | 层 | 注意力头 | 词嵌入大小 | 上下文窗口 | 参数数量 | 训练数据 |'
- en: '| --- | --- | --- | --- | --- | --- | --- | --- |'
id: totrans-203
prefs: []
type: TYPE_TB
+ zh: '| --- | --- | --- | --- | --- | --- | --- | --- |'
- en: '| [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
| Jun 2018 | 12 | 12 | 768 | 512 | 120,000,000 | BookCorpus: 4.5 GB of text from
unpublished books |'
id: totrans-204
prefs: []
type: TYPE_TB
+ zh: '| [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
+ | 2018年6月 | 12 | 12 | 768 | 512 | 120,000,000 | BookCorpus:来自未发表书籍的4.5 GB文本 |'
- en: '| [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
| Feb 2019 | 48 | 48 | 1,600 | 1,024 | 1,500,000,000 | WebText: 40 GB of text
from outbound Reddit links |'
id: totrans-205
prefs: []
type: TYPE_TB
+ zh: '| [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
+ | 2019年2月 | 48 | 48 | 1,600 | 1,024 | 1,500,000,000 | WebText:来自Reddit外链的40 GB文本
+ |'
- en: '| [GPT-3](https://arxiv.org/abs/2005.14165) | May 2020 | 96 | 96 | 12,888 |
2,048 | 175,000,000,000 | CommonCrawl, WebText, English Wikipedia, book corpora
and others: 570 GB |'
id: totrans-206
prefs: []
type: TYPE_TB
+ zh: '| [GPT-3](https://arxiv.org/abs/2005.14165) | 2020年5月 | 96 | 96 | 12,888 |
+ 2,048 | 175,000,000,000 | CommonCrawl,WebText,英文维基百科,书籍语料库等:570 GB |'
- en: '| [GPT-4](https://arxiv.org/abs/2303.08774) | Mar 2023 | - | - | - | - | -
| - |'
id: totrans-207
prefs: []
type: TYPE_TB
+ zh: '| [GPT-4](https://arxiv.org/abs/2303.08774) | 2023年3月 | - | - | - | - | - |
+ - |'
- en: The model architecture of GPT-3 is fairly similar to the original GPT model,
except it is much larger and trained on much more data. At the time of writing,
GPT-4 is in limited beta—OpenAI has not publicly released details of the model’s
@@ -1317,6 +1427,7 @@
id: totrans-208
prefs: []
type: TYPE_NORMAL
+ zh: GPT-3的模型架构与原始GPT模型非常相似,只是规模更大,训练数据更多。在撰写本文时,GPT-4处于有限的测试阶段——OpenAI尚未公开发布模型的结构和规模的详细信息,尽管我们知道它能够接受图像作为输入,因此首次跨越成为多模态模型。GPT-3和GPT-4的模型权重不是开源的,尽管这些模型可以通过[商业工具和API](https://platform.openai.com)获得。
- en: GPT-3 can also be [fine-tuned to your own training data](https://oreil.ly/B-Koo)—this
allows you to provide multiple examples of how it should react to a given style
of prompt by physically updating the weights of the network. In many cases this
@@ -1327,20 +1438,25 @@
id: totrans-209
prefs: []
type: TYPE_NORMAL
+ zh: GPT-3也可以[根据您自己的训练数据进行微调](https://oreil.ly/B-Koo)——这使您可以提供多个示例,说明它应该如何对特定风格的提示做出反应,通过物理更新网络的权重。在许多情况下,这可能是不必要的,因为GPT-3可以通过在提示本身提供几个示例来告诉它如何对特定风格的提示做出反应(这被称为*few-shot
+ learning*)。微调的好处在于,您不需要在每个单独的输入提示中提供这些示例,从长远来看可以节省成本。
- en: An example of the output from GPT-3, given a system prompt sentence, is shown
in [Figure 9-15](#gpt3_story).
id: totrans-210
prefs: []
type: TYPE_NORMAL
+ zh: 给定系统提示句子的GPT-3输出示例显示在[图9-15](#gpt3_story)中。
- en: '![](Images/gdl2_0915.png)'
id: totrans-211
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0915.png)'
- en: Figure 9-15\. An example of how GPT-3 can extend a given system prompt
id: totrans-212
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-15。GPT-3如何扩展给定系统提示的示例
- en: Language models such as GPT benefit hugely from scaling—both in terms of number
of model weights and dataset size. The ceiling of large language model capability
has yet to be reached, with researchers continuing to push the boundaries of what
@@ -1348,11 +1464,13 @@
id: totrans-213
prefs: []
type: TYPE_NORMAL
+ zh: 诸如GPT之类的语言模型在规模上受益巨大——无论是模型权重的数量还是数据集的大小。大型语言模型能力的上限尚未达到,研究人员继续推动着使用越来越大的模型和数据集所能实现的边界。
- en: ChatGPT
id: totrans-214
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: ChatGPT
- en: A few months before the beta release of GPT-4, OpenAI announced [*ChatGPT*](https://chat.openai.com)—a
tool that allows users to interact with their suite of large language models through
a conversational interface. The original release in November 2022 was powered
@@ -1361,6 +1479,7 @@
id: totrans-215
prefs: []
type: TYPE_NORMAL
+ zh: 在GPT-4的测试版发布几个月前,OpenAI宣布了[*ChatGPT*](https://chat.openai.com)——这是一个允许用户通过对话界面与其一系列大型语言模型进行交互的工具。2022年11月的原始版本由*GPT-3.5*提供支持,这个版本比GPT-3更强大,经过微调以进行对话回应。
- en: Example dialogue is shown in [Figure 9-16](#chatgpt_example). Notice how the
agent is able to maintain state between inputs, understanding that the *attention*
mentioned in the second question refers to attention in the context of Transformers,
@@ -1368,15 +1487,18 @@
id: totrans-216
prefs: []
type: TYPE_NORMAL
+ zh: 示例对话显示在[图9-16](#chatgpt_example)中。请注意,代理能够在输入之间保持状态,理解第二个问题中提到的*attention*指的是Transformer上下文中的注意力,而不是一个人的专注能力。
- en: '![](Images/gdl2_0916.png)'
id: totrans-217
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0916.png)'
- en: Figure 9-16\. An example of ChatGPT answering questions about Transformers
id: totrans-218
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-16。ChatGPT回答有关Transformer的问题的示例
- en: At the time of writing, there is no official paper that describes how ChatGPT
works in detail, but from the official [blog post](https://openai.com/blog/chatgpt)
we know that it uses a technique called *reinforcement learning from human feedback*
@@ -1387,10 +1509,13 @@
id: totrans-219
prefs: []
type: TYPE_NORMAL
+ zh: 在撰写本文时,尚无描述ChatGPT工作详细信息的官方论文,但根据官方[博客文章](https://openai.com/blog/chatgpt),我们知道它使用一种称为*reinforcement
+ learning from human feedback*(RLHF)的技术来微调GPT-3.5模型。这种技术也在ChatGPT小组早期的论文^([6](ch09.xhtml#idm45387005277024))中使用,该论文介绍了*InstructGPT*模型,这是一个经过微调的GPT-3模型,专门设计用于更准确地遵循书面说明。
- en: 'The training process for ChatGPT is as follows:'
id: totrans-220
prefs: []
type: TYPE_NORMAL
+ zh: ChatGPT的训练过程如下:
- en: '*Supervised fine-tuning*: Collect a demonstration dataset of conversational
inputs (prompts) and desired outputs that have been written by humans. This is
used to fine-tune the underlying language model (GPT-3.5) using supervised learning.'
@@ -1398,6 +1523,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
+ zh: '*监督微调*:收集人类编写的对话输入(提示)和期望输出的演示数据集。这用于使用监督学习微调基础语言模型(GPT-3.5)。'
- en: '*Reward modeling*: Present a human labeler with examples of prompts and several
sampled model outputs and ask them to rank the outputs from best to worst. Train
a reward model that predicts the score given to each output, given the conversation
@@ -1406,6 +1532,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
+ zh: '*奖励建模*:向人类标记者展示提示的示例和几个抽样的模型输出,并要求他们将输出从最好到最差进行排名。训练一个奖励模型,预测给定对话历史的每个输出的得分。'
- en: '*Reinforcement learning*: Treat the conversation as a reinforcement learning
environment where the *policy* is the underlying language model, initialized to
the fine-tuned model from step 1\. Given the current *state* (the conversation
@@ -1417,31 +1544,37 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
+ zh: '*强化学习*:将对话视为一个强化学习环境,其中*策略*是基础语言模型,初始化为从步骤1中微调的模型。给定当前的*状态*(对话历史),策略输出一个*动作*(一系列标记),由在步骤2中训练的奖励模型评分。然后可以训练一个强化学习算法——近端策略优化(PPO),通过调整语言模型的权重来最大化奖励。'
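The reward-modeling step is commonly trained with a pairwise ranking loss, as described in the InstructGPT paper: the reward assigned to the human-preferred response should exceed the reward assigned to the rejected one. The snippet below is a minimal sketch of that loss with made-up reward values; it is not OpenAI's implementation.

```python
import tensorflow as tf

# Scalar rewards produced by the reward model for a batch of prompt/response
# pairs that a human labeller has ranked (chosen vs. rejected).
reward_chosen   = tf.constant([1.3, 0.2, 0.9])
reward_rejected = tf.constant([0.4, 0.5, -0.1])

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
# Minimizing it pushes each chosen response's reward above the rejected one's.
loss = -tf.math.log(tf.sigmoid(reward_chosen - reward_rejected))
print(tf.reduce_mean(loss).numpy())
```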
- en: Reinforcement Learning
id: totrans-224
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 强化学习
- en: For an introduction to reinforcement learning see [Chapter 12](ch12.xhtml#chapter_world_models),
where we explore how generative models can be used in a reinforcement learning
setting.
id: totrans-225
prefs: []
type: TYPE_NORMAL
+ zh: 有关强化学习的介绍,请参阅[第12章](ch12.xhtml#chapter_world_models),在那里我们探讨了生成模型如何在强化学习环境中使用。
- en: The RLHF process is shown in [Figure 9-17](#rlhf).
id: totrans-226
prefs: []
type: TYPE_NORMAL
+ zh: RLHF过程如[图9-17](#rlhf)所示。
- en: '![](Images/gdl2_0917.png)'
id: totrans-227
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0917.png)'
- en: 'Figure 9-17\. The reinforcement learning from human feedback fine-tuning process
used in ChatGPT (source: [OpenAI](https://openai.com/blog/chatgpt))'
id: totrans-228
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-17。ChatGPT中使用的基于人类反馈的强化学习微调过程(来源:[OpenAI](https://openai.com/blog/chatgpt))
- en: While ChatGPT still has many limitations (such as sometimes “hallucinating”
factually incorrect information), it is a powerful example of how Transformers
can be used to build generative models that can produce complex, long-ranging,
@@ -1451,6 +1584,7 @@
id: totrans-229
prefs: []
type: TYPE_NORMAL
+ zh: 虽然ChatGPT仍然有许多局限(例如有时会“幻觉”出事实上不正确的信息),但它是一个强有力的例子,展示了Transformer可以用来构建能产生复杂、长程且新颖输出的生成模型,这些输出往往难以与人类撰写的文本区分开来。像ChatGPT这样的模型迄今取得的进展,证明了人工智能的潜力及其对世界的变革性影响。
- en: Moreover, it is evident that AI-driven communication and interaction will continue
to rapidly evolve in the future. Projects like *Visual ChatGPT*^([7](ch09.xhtml#idm45387005252672))
are now combining the linguistic power of ChatGPT with visual foundation models
@@ -1461,16 +1595,20 @@
id: totrans-230
prefs: []
type: TYPE_NORMAL
+ zh: 此外,显而易见的是,基于人工智能的沟通和互动将继续在未来快速发展。像*Visual ChatGPT*^([7](ch09.xhtml#idm45387005252672))这样的项目现在正在将ChatGPT的语言能力与Stable
+ Diffusion等视觉基础模型相结合,使用户不仅可以通过文本与ChatGPT互动,还可以通过图像。在像Visual ChatGPT和GPT-4这样的项目中融合语言和视觉能力,有望开启人机交互的新时代。
- en: Summary
id: totrans-231
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 总结
- en: In this chapter, we explored the Transformer model architecture and built a
version of GPT—a model for state-of-the-art text generation.
id: totrans-232
prefs: []
type: TYPE_NORMAL
+ zh: 在本章中,我们探讨了Transformer模型架构,并构建了一个GPT的版本——用于最先进文本生成的模型。
- en: GPT makes use of a mechanism known as attention, which removes the need for
recurrent layers (e.g., LSTMs). It works like an information retrieval system,
utilizing queries, keys, and values to decide how much information it wants to
@@ -1478,6 +1616,7 @@
id: totrans-233
prefs: []
type: TYPE_NORMAL
+ zh: GPT利用一种称为注意力的机制,消除了循环层(例如LSTM)的需求。它类似于信息检索系统,利用查询、键和值来决定它想要从每个输入标记中提取多少信息。
- en: Attention heads can be grouped together to form what is known as a multihead
attention layer. These are then wrapped up inside a Transformer block, which includes
layer normalization and skip connections around the attention layer. Transformer
@@ -1485,6 +1624,7 @@
id: totrans-234
prefs: []
type: TYPE_NORMAL
+ zh: 注意力头可以组合在一起,形成所谓的多头注意力层。然后将其包装在一个Transformer块中,该块在注意力层周围加入了层归一化和跳跃连接。Transformer块可以堆叠起来,构建出非常深的神经网络。
- en: Causal masking is used to ensure that GPT cannot leak information from downstream
tokens into the current prediction. Also, a technique known as positional encoding
is used to ensure that the ordering of the input sequence is not lost, but instead
@@ -1492,6 +1632,7 @@
id: totrans-235
prefs: []
type: TYPE_NORMAL
+ zh: 因果屏蔽用于确保GPT不能从下游标记泄漏信息到当前预测中。此外,还使用一种称为位置编码的技术,以确保输入序列的顺序不会丢失,而是与传统的词嵌入一起嵌入到输入中。
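Concretely, a causal mask is just a lower-triangular matrix of ones: position i may attend to positions 0 through i and to nothing later. A minimal NumPy sketch:

```python
import numpy as np

seq_len = 5
# Row i marks the positions that token i is allowed to attend to.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```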
- en: When analyzing the output from GPT, we saw it was possible not only to generate
new text passages, but also to interrogate the attention layer of the network
to understand where in the sentence it is looking to gather information to improve
@@ -1502,6 +1643,7 @@
id: totrans-236
prefs: []
type: TYPE_NORMAL
+ zh: 在分析GPT的输出时,我们看到不仅可以生成新的文本段落,还可以审查网络的注意力层,以了解它在句子中查找信息以改善预测的位置。GPT可以在不丢失信号的情况下访问远处的信息,因为注意力分数是并行计算的,不依赖于通过网络顺序传递的隐藏状态,这与循环神经网络的情况不同。
- en: We saw how there are three families of Transformers (encoder, decoder, and encoder-decoder)
and the different tasks that can be accomplished with each. Finally, we explored
the structure and training process of other large language models such as Google’s
@@ -1509,40 +1651,48 @@
id: totrans-237
prefs: []
type: TYPE_NORMAL
+ zh: 我们看到了Transformer有三个系列(编码器、解码器和编码器-解码器)以及每个系列可以完成的不同任务。最后,我们探讨了其他大型语言模型的结构和训练过程,如谷歌的T5和OpenAI的ChatGPT。
- en: ^([1](ch09.xhtml#idm45387006840576-marker)) Ashish Vaswani et al., “Attention
Is All You Need,” June 12, 2017, [*https://arxiv.org/abs/1706.03762*](https://arxiv.org/abs/1706.03762).
id: totrans-238
prefs: []
type: TYPE_NORMAL
+ zh: ^([1](ch09.xhtml#idm45387006840576-marker)) Ashish Vaswani等人,“注意力就是一切”,2017年6月12日,[*https://arxiv.org/abs/1706.03762*](https://arxiv.org/abs/1706.03762)。
- en: ^([2](ch09.xhtml#idm45387006828736-marker)) Alec Radford et al., “Improving
Language Understanding by Generative Pre-Training,” June 11, 2018, [*https://openai.com/research/language-unsupervised*](https://openai.com/research/language-unsupervised).
id: totrans-239
prefs: []
type: TYPE_NORMAL
+ zh: ^([2](ch09.xhtml#idm45387006828736-marker)) Alec Radford等人,“通过生成式预训练改进语言理解”,2018年6月11日,[*https://openai.com/research/language-unsupervised*](https://openai.com/research/language-unsupervised)。
- en: '^([3](ch09.xhtml#idm45387006370384-marker)) Jacob Devlin et al., “BERT: Pre-Training
of Deep Bidirectional Transformers for Language Understanding,” October 11, 2018,
[*https://arxiv.org/abs/1810.04805*](https://arxiv.org/abs/1810.04805).'
id: totrans-240
prefs: []
type: TYPE_NORMAL
+ zh: '^([3](ch09.xhtml#idm45387006370384-marker)) Jacob Devlin等人,“BERT: 深度双向Transformer的语言理解预训练”,2018年10月11日,[*https://arxiv.org/abs/1810.04805*](https://arxiv.org/abs/1810.04805)。'
- en: '^([4](ch09.xhtml#idm45387006340992-marker)) Sheng Shen et al., “PowerNorm:
Rethinking Batch Normalization in Transformers,” June 28, 2020, [*https://arxiv.org/abs/2003.07845*](https://arxiv.org/abs/2003.07845).'
id: totrans-241
prefs: []
type: TYPE_NORMAL
+ zh: '^([4](ch09.xhtml#idm45387006340992-marker)) Sheng Shen等人,“PowerNorm: 重新思考Transformer中的批归一化”,2020年6月28日,[*https://arxiv.org/abs/2003.07845*](https://arxiv.org/abs/2003.07845)。'
- en: ^([5](ch09.xhtml#idm45387005361120-marker)) Colin Raffel et al., “Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” October
23, 2019, [*https://arxiv.org/abs/1910.10683*](https://arxiv.org/abs/1910.10683).
id: totrans-242
prefs: []
type: TYPE_NORMAL
+ zh: ^([5](ch09.xhtml#idm45387005361120-marker)) Colin Raffel等人,“探索统一文本到文本Transformer的迁移学习极限”,2019年10月23日,[*https://arxiv.org/abs/1910.10683*](https://arxiv.org/abs/1910.10683)。
- en: ^([6](ch09.xhtml#idm45387005277024-marker)) Long Ouyang et al., “Training Language
Models to Follow Instructions with Human Feedback,” March 4, 2022, [*https://arxiv.org/abs/2203.02155*](https://arxiv.org/abs/2203.02155).
id: totrans-243
prefs: []
type: TYPE_NORMAL
+ zh: ^([6](ch09.xhtml#idm45387005277024-marker)) Long Ouyang等人,“使用人类反馈训练语言模型遵循指令”,2022年3月4日,[*https://arxiv.org/abs/2203.02155*](https://arxiv.org/abs/2203.02155)。
- en: '^([7](ch09.xhtml#idm45387005252672-marker)) Chenfei Wu et al., “Visual ChatGPT:
Talking, Drawing and Editing with Visual Foundation Models,” March 8, 2023, [*https://arxiv.org/abs/2303.04671*](https://arxiv.org/abs/2303.04671).`'
id: totrans-244
prefs: []
type: TYPE_NORMAL
+ zh: '^([7](ch09.xhtml#idm45387005252672-marker)) Chenfei Wu等人,“Visual ChatGPT: 使用视觉基础模型进行对话、绘画和编辑”,2023年3月8日,[*https://arxiv.org/abs/2303.04671*](https://arxiv.org/abs/2303.04671)。'