From 40381bcb70ef71ad2fc195a5c1fd48ed0977a640 Mon Sep 17 00:00:00 2001 From: wizardforcel <562826179@qq.com> Date: Thu, 8 Feb 2024 19:10:21 +0800 Subject: [PATCH] 2024-02-08 19:10:19 --- totrans/gen-dl_12.yaml | 17 +++ totrans/gen-dl_13.yaml | 335 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 352 insertions(+) diff --git a/totrans/gen-dl_12.yaml b/totrans/gen-dl_12.yaml index 6b273c1..f8d2f66 100644 --- a/totrans/gen-dl_12.yaml +++ b/totrans/gen-dl_12.yaml @@ -1,51 +1,68 @@ - en: Part III. Applications + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 第三部分. 应用 - en: In [Part III](#part_applications), we will explore some of the key applications of the generative modeling techniques that we have seen so far, across images, text, music, and games. We will also see how these domains can be traversed using state-of-the-art multimodal models. + id: totrans-1 prefs: [] type: TYPE_NORMAL + zh: 在第三部分中,我们将探索迄今为止所见的生成建模技术在图像、文本、音乐和游戏等领域的一些关键应用。我们还将看到如何使用最先进的多模态模型穿越这些领域。 - en: In [Chapter 9](ch09.xhtml#chapter_transformer) we shall turn our attention to Transformers, a start-of-the-art architecture that powers most modern-day text generation models. In particular, we shall explore the inner workings of GPT and build our own version using Keras, and we’ll see how it forms the foundation of tools such as ChatGPT. + id: totrans-2 prefs: [] type: TYPE_NORMAL + zh: 在第9章中,我们将把注意力转向Transformers,这是一种现代文本生成模型的先进架构。特别是,我们将探索GPT的内部工作原理,并使用Keras构建我们自己的版本,我们将看到它如何构建了诸如ChatGPT之类的工具的基础。 - en: In [Chapter 10](ch10.xhtml#chapter_image_generation) we will look at some of the most important GAN architectures that have influenced image generation, including ProGAN, StyleGAN, StyleGAN2, SAGAN, BigGAN, VQ-GAN, and ViT VQ-GAN. We shall explore the key contributions of each and look to understand how the technique has evolved over time. + id: totrans-3 prefs: [] type: TYPE_NORMAL + zh: 在第10章中,我们将看一些对图像生成产生影响的最重要的GAN架构,包括ProGAN、StyleGAN、StyleGAN2、SAGAN、BigGAN、VQ-GAN和ViT + VQ-GAN。我们将探索每个架构的关键贡献,并了解这种技术如何随着时间的推移而发展。 - en: '[Chapter 11](ch11.xhtml#chapter_music) looks at music generation, which presents additional challenges such as modeling musical pitch and rhythm. We’ll see that many of the techniques that work for text generation (such as Transformers) can also be applied in this domain, but we’ll also explore a deep learning architecture known as MuseGAN that applies a GAN-based approach to generating music.' + id: totrans-4 prefs: [] type: TYPE_NORMAL + zh: 第11章探讨音乐生成,这带来了额外的挑战,比如对音乐音高和节奏进行建模。我们将看到许多适用于文本生成的技术(如Transformers)也可以应用于这个领域,但我们还将探索一种称为MuseGAN的深度学习架构,该架构应用了基于GAN的方法来生成音乐。 - en: '[Chapter 12](ch12.xhtml#chapter_world_models) shows how generative models can be used within other machine learning domains, such as reinforcement learning. We will focus on the “World Models” paper, which shows how a generative model can be used as the environment in which the agent trains, allowing it to train within a hallucinated dream version of the environment rather than the real thing.' + id: totrans-5 prefs: [] type: TYPE_NORMAL + zh: 第12章展示了生成模型如何在其他机器学习领域中使用,比如强化学习。我们将重点关注“世界模型”论文,该论文展示了如何将生成模型用作代理训练的环境,使其能够在幻想的梦境版本的环境中进行训练,而不是真实环境。 - en: In [Chapter 13](ch13.xhtml#chapter_multimodal) we will explore state-of-the-art multimodal models that cross over domains such as images and text. This includes text-to-image models such as DALL.E 2, Imagen, and Stable Diffusion, as well as visual language models such as Flamingo. 
+ id: totrans-6 prefs: [] type: TYPE_NORMAL + zh: 在第13章中,我们将探索跨越图像和文本等领域的最先进的多模态模型。这包括文本到图像模型,如DALL.E 2、Imagen和Stable Diffusion,以及视觉语言模型,如Flamingo。 - en: Finally, [Chapter 14](ch14.xhtml#chapter_conclusion) summarizes the generative AI journey so far, the current generative AI landscape, and where we may be heading in the future. We will explore how generative AI may change the way we live and work, as well as considering whether it has the potential to unlock deeper forms of artificial intelligence in the years to come. + id: totrans-7 prefs: [] type: TYPE_NORMAL + zh: 最后,在第14章中总结了迄今为止的生成人工智能之旅,当前的生成人工智能格局,以及我们未来可能走向何方。我们将探讨生成人工智能如何改变我们的生活和工作方式,以及考虑它是否有潜力在未来几年解锁更深层次的人工智能形式。 diff --git a/totrans/gen-dl_13.yaml b/totrans/gen-dl_13.yaml index c9c7b37..f519abe 100644 --- a/totrans/gen-dl_13.yaml +++ b/totrans/gen-dl_13.yaml @@ -1,4 +1,5 @@ - en: Chapter 9\. Transformers + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL @@ -10,15 +11,18 @@ applying a dense layer and softmax activation over the hidden vector. This was considered the most sophisticated way to generatively produce text until 2017, when one paper changed the landscape of text generation forever. + id: totrans-1 prefs: [] type: TYPE_NORMAL - en: Introduction + id: totrans-2 prefs: - PREF_H1 type: TYPE_NORMAL - en: The Google Brain paper, confidently entitled “Attention Is All You Need,”^([1](ch09.xhtml#idm45387006840576)) is famous for popularizing the concept of *attention*—a mechanism that now powers most state-of-the-art text generation models. + id: totrans-3 prefs: [] type: TYPE_NORMAL - en: The authors show how it is possible to create powerful neural networks called @@ -27,6 +31,7 @@ approach overcomes a key downside to the RNN approach, which is that it is challenging to parallelize, as it must process sequences one token as a time. Transformers are highly paralellizable, allowing them to be trained on massive datasets. + id: totrans-4 prefs: [] type: TYPE_NORMAL - en: In this chapter, we are going to delve into how modern text generation models @@ -34,20 +39,24 @@ on text generation challenges. In particular, we will explore a type of autoregressive model known as the *generative pre-trained transformer* (GPT), which powers OpenAI’s GPT-4 model, widely considered to be the current state of the art for text generation. + id: totrans-5 prefs: [] type: TYPE_NORMAL - en: GPT + id: totrans-6 prefs: - PREF_H1 type: TYPE_NORMAL - en: OpenAI introduced GPT in June 2018, in the paper “Improving Language Understanding by Generative Pre-Training,”^([2](ch09.xhtml#idm45387006828736)) almost exactly a year after the appearance of the original Transformer paper. + id: totrans-7 prefs: [] type: TYPE_NORMAL - en: In this paper, the authors show how a Transformer architecture can be trained on a huge amount of text data to predict the next word in a sequence and then subsequently fine-tuned to specific downstream tasks. + id: totrans-8 prefs: [] type: TYPE_NORMAL - en: The pre-training process of GPT involves training the model on a large corpus @@ -56,12 +65,14 @@ a sequence given the previous words. This process is known as *language modeling* and is used to teach the model to understand the structure and patterns of natural language. + id: totrans-9 prefs: [] type: TYPE_NORMAL - en: After pre-training, the GPT model can be fine-tuned for a specific task by providing it with a smaller, task-specific dataset. Fine-tuning involves adjusting the parameters of the model to better fit the task at hand. 
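In Keras terms, this fine-tuning pattern can be sketched roughly as follows: reuse the pre-trained Transformer body and train a small task-specific head on top of it. Here `pretrained_body`, the two-class head, and the hyperparameters are illustrative placeholders rather than code from the paper or from this book.

```python
from tensorflow.keras import layers, models, optimizers

# `pretrained_body` is assumed to be the pre-trained Transformer stack
# (embeddings plus Transformer blocks), with its next-word prediction head removed.
inputs = layers.Input(shape=(None,), dtype="int32")
hidden = pretrained_body(inputs)                  # (batch, seq_len, embed_dim)
pooled = layers.GlobalAveragePooling1D()(hidden)  # summarize the whole sequence
outputs = layers.Dense(2, activation="softmax")(pooled)  # e.g., a two-class task head

classifier = models.Model(inputs, outputs)
classifier.compile(
    optimizer=optimizers.Adam(1e-5),  # small learning rate: adjust, don't overwrite
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# classifier.fit(task_inputs, task_labels, epochs=3)  # small, task-specific dataset
```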
For example, the model can be fine-tuned for tasks such as classification, similarity scoring, or question answering. + id: totrans-10 prefs: [] type: TYPE_NORMAL - en: The GPT architecture has since been improved and extended by OpenAI with the @@ -70,96 +81,132 @@ generate more complex and coherent text. The GPT models have been widely adopted by researchers and industry practitioners and have contributed to significant advancements in natural language processing tasks. + id: totrans-11 prefs: [] type: TYPE_NORMAL - en: In this chapter, we will build our own variation of the original GPT model, trained on less data, but still utilizing the same components and underlying principles. + id: totrans-12 prefs: [] type: TYPE_NORMAL - en: Running the Code for This Example + id: totrans-13 prefs: - PREF_H1 type: TYPE_NORMAL - en: The code for this example can be found in the Jupyter notebook located at *notebooks/09_transformer/01_gpt/gpt.ipynb* in the book repository. + id: totrans-14 prefs: [] type: TYPE_NORMAL - en: The code is adapted from the excellent [GPT tutorial](https://oreil.ly/J86pg) created by Apoorv Nandan available on the Keras website. + id: totrans-15 prefs: [] type: TYPE_NORMAL - en: The Wine Reviews Dataset + id: totrans-16 prefs: - PREF_H2 type: TYPE_NORMAL - en: We’ll be using the [Wine Reviews dataset](https://oreil.ly/DC9EG) that is available through Kaggle. This is a set of over 130,000 reviews of wines, with accompanying metadata such as description and price. + id: totrans-17 prefs: [] type: TYPE_NORMAL + zh: 我们将使用通过Kaggle提供的[Wine Reviews数据集](https://oreil.ly/DC9EG)。这是一个包含超过130,000条葡萄酒评论的数据集,附带元数据,如描述和价格。 - en: You can download the dataset by running the Kaggle dataset downloader script in the book repository, as shown in [Example 9-1](#downloading-wine-dataset). This will save the wine reviews and accompanying metadata locally to the */data* folder. + id: totrans-18 prefs: [] type: TYPE_NORMAL + zh: 您可以通过在书库中运行Kaggle数据集下载脚本来下载数据集,如[示例9-1](#downloading-wine-dataset)所示。这将把葡萄酒评论和相关元数据保存在本地的*/data*文件夹中。 - en: Example 9-1\. Downloading the Wine Reviews dataset + id: totrans-19 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 示例9-1\. 下载葡萄酒评论数据集 - en: '[PRE0]' + id: totrans-20 prefs: [] type: TYPE_PRE + zh: '[PRE0]' - en: '`The data preparation steps are identical to the steps used in [Chapter 5](ch05.xhtml#chapter_autoregressive) for preparing data for input into an LSTM, so we will not repeat them in detail here. The steps, as shown in [Figure 9-1](#transformer_data_prep), are as follows:' + id: totrans-21 prefs: [] type: TYPE_NORMAL + zh: '`数据准备步骤与[第5章](ch05.xhtml#chapter_autoregressive)中用于准备输入到LSTM的数据的步骤是相同的,因此我们不会在这里详细重复它们。如[图9-1](#transformer_data_prep)所示,步骤如下:' - en: Load the data and create a list of text string descriptions of each wine. + id: totrans-22 prefs: - PREF_OL type: TYPE_NORMAL + zh: 加载数据并创建每种葡萄酒的文本字符串描述列表。 - en: Pad punctuation with spaces, so that each punctuation mark is treated as a separate word. + id: totrans-23 prefs: - PREF_OL type: TYPE_NORMAL + zh: 用空格填充标点符号,以便每个标点符号被视为一个单独的单词。 - en: Pass the strings through a `TextVectorization` layer that tokenizes the data and pads/clips each string to a fixed length. + id: totrans-24 prefs: - PREF_OL type: TYPE_NORMAL + zh: 通过`TextVectorization`层将字符串传递,对数据进行标记化,并将每个字符串填充/裁剪到固定长度。 - en: Create a training set where the inputs are the tokenized text strings and the outputs to predict are the same strings shifted by one token. 
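A rough sketch of these four steps is shown below. This is not the book's exact listing — the constants and the `wine_descriptions` list of review strings are illustrative — but it follows the same pattern as the Keras GPT tutorial that this example adapts.

```python
import string
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # illustrative hyperparameters
MAX_LEN = 80
BATCH_SIZE = 32

def pad_punctuation(s):
    # Surround each punctuation mark with spaces so it becomes its own token
    s = tf.strings.regex_replace(s, f"([{string.punctuation}])", r" \1 ")
    return tf.strings.regex_replace(s, " +", " ")

# `wine_descriptions` is assumed to be the list of review strings loaded in step 1
text_ds = (
    tf.data.Dataset.from_tensor_slices(wine_descriptions)
    .batch(BATCH_SIZE)
    .shuffle(1000)
    .map(pad_punctuation)
)

vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,  # one extra token so we can shift by one
)
vectorize_layer.adapt(text_ds)

def prepare_inputs(text):
    tokens = vectorize_layer(tf.expand_dims(text, -1))
    return tokens[:, :-1], tokens[:, 1:]  # inputs, and the same strings shifted by one token

train_ds = text_ds.map(prepare_inputs)
```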
+ id: totrans-25 prefs: - PREF_OL type: TYPE_NORMAL + zh: 创建一个训练集,其中输入是标记化的文本字符串,输出是预测的相同字符串向后移动一个标记。 - en: '![](Images/gdl2_0901.png)' + id: totrans-26 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0901.png)' - en: Figure 9-1\. Data processing for the Transformer` `## Attention + id: totrans-27 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-1\. Transformer的数据处理` `## 注意力 - en: The first step to understanding how GPT works is to understand how the *attention mechanism* works. This mechanism is what makes the Transformer architecture unique and distinct from recurrent approaches to language modeling. When we have developed a solid understanding of attention, we will then see how it is used within Transformer architectures such as GPT. + id: totrans-28 prefs: [] type: TYPE_NORMAL + zh: 了解GPT如何工作的第一步是了解*注意力机制*的工作原理。这个机制是使Transformer架构与循环方法在语言建模方面独特和不同的地方。当我们对注意力有了扎实的理解后,我们将看到它如何在GPT等Transformer架构中使用。 - en: 'When you write, the choice that you make for the next word in the sentence is influenced by other words that you have already written. For example, suppose you start a sentence as follows:' + id: totrans-29 prefs: [] type: TYPE_NORMAL + zh: 当您写作时,句子中下一个词的选择受到您已经写过的其他单词的影响。例如,假设您开始一个句子如下: - en: '[PRE1]' + id: totrans-30 prefs: [] type: TYPE_PRE + zh: '[PRE1]' - en: Clearly, the next word should be something synonymous with *big*. How do we know this? + id: totrans-31 prefs: [] type: TYPE_NORMAL + zh: 显然,下一个词应该是与*big*同义的。我们怎么知道这一点? - en: Certain other words in the sentence are important for helping us to make our decision. For example, the fact that it is an elephant, rather than a sloth, means that we prefer *big* rather than *slow*. If it were a swimming pool, rather than @@ -167,26 +214,35 @@ action of *getting into* the car implies that size is the problem—if the elephant was trying to *squash* the car instead, we might choose *fast* as the final word, with *it* now referring to the car. + id: totrans-32 prefs: [] type: TYPE_NORMAL + zh: 句子中的某些其他单词对帮助我们做出决定很重要。例如,它是大象而不是树懒,意味着我们更喜欢*big*而不是*slow*。如果它是游泳池而不是汽车,我们可能会选择*scared*作为*big*的一个可能替代。最后,*getting + into*汽车的行为意味着大小是问题所在——如果大象试图*压扁*汽车,我们可能会选择*fast*作为最后一个词,现在*it*指的是汽车。 - en: Other words in the sentence are not important at all. For example, the fact that the elephant is pink has no influence on our choice of final word. Equally, the minor words in the sentence (*the*, *but*, *it*, etc.) give the sentence grammatical form, but here aren’t important to determine the required adjective. + id: totrans-33 prefs: [] type: TYPE_NORMAL + zh: 句子中的其他单词一点都不重要。例如,大象是粉红色这个事实对我们选择最终词汇没有影响。同样,句子中的次要单词(*the*、*but*、*it*等)给句子以语法形式,但在这里并不重要,以确定所需形容词。 - en: In other words, we are *paying attention* to certain words in the sentence and largely ignoring others. Wouldn’t it be great if our model could do the same thing? + id: totrans-34 prefs: [] type: TYPE_NORMAL + zh: 换句话说,我们正在*关注*句子中的某些单词,而基本上忽略其他单词。如果我们的模型也能做同样的事情,那不是很好吗? - en: An attention mechanism (also know as an *attention head*) in a Transformer is designed to do exactly this. It is able to decide where in the input it wants to pull information from, in order to efficiently extract useful information without being clouded by irrelevant details. This makes it highly adaptable to a range of circumstances, as it can decide where it wants to look for information at inference time. 
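Before we walk through the terminology in detail, here is a minimal NumPy sketch of the core computation that the next sections unpack: a query is compared against a set of keys, and the resulting weights are used to blend the corresponding values. All of the names and sizes here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k, d_v, seq_len = 4, 3, 5  # illustrative dimensions
rng = np.random.default_rng(0)

query = rng.normal(size=(d_k,))            # "what am I looking for?"
keys = rng.normal(size=(seq_len, d_k))     # one key per word in the sentence
values = rng.normal(size=(seq_len, d_v))   # one value per word in the sentence

# Compare the query with every key, scale, and normalize into attention weights
weights = softmax(keys @ query / np.sqrt(d_k))

# The output is a weighted blend of the values
context = weights @ values
print(weights.round(2), context.round(2))
```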
+ id: totrans-35 prefs: [] type: TYPE_NORMAL + zh: Transformer中的注意力机制(也称为*注意力头*)旨在做到这一点。它能够决定从输入的哪个位置提取信息,以有效地提取有用信息而不被无关细节混淆。这使得它非常适应各种情况,因为它可以在推断时决定在哪里寻找信息。 - en: In contrast, a recurrent layer tries to build up a generic hidden state that captures an overall representation of the input at each timestep. A weakness of this approach is that many of the words that have already been incorporated into @@ -194,15 +250,20 @@ (e.g., predicting the next word), as we have just seen. Attention heads do not suffer from this problem, because they can pick and choose how to combine information from nearby words, depending on the context. + id: totrans-36 prefs: [] type: TYPE_NORMAL + zh: 相比之下,循环层试图建立一个捕捉每个时间步输入的整体表示的通用隐藏状态。这种方法的一个弱点是,已经合并到隐藏向量中的许多单词对当前任务(例如,预测下一个单词)并不直接相关,正如我们刚刚看到的。注意力头不会遇到这个问题,因为它们可以选择如何从附近的单词中组合信息,具体取决于上下文。 - en: Queries, Keys, and Values + id: totrans-37 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 查询、键和值 - en: So how does an attention head decide where it wants to look for information? Before we get into the details, let’s explore how it works at a high level, using our *pink elephant* example. + id: totrans-38 prefs: [] type: TYPE_NORMAL - en: Imagine that we want to predict what follows the word *too*. To help with this @@ -211,22 +272,27 @@ that follow *too*. For example, the word *elephant* might confidently contribute that it is more likely to be a word related to size or loudness, whereas the word *was* doesn’t have much to offer to narrow down the possibilities. + id: totrans-39 prefs: [] type: TYPE_NORMAL - en: In other words, we can think of an attention head as a kind of information retrieval system, where a *query* (“What word follows *too*?”) is made into a *key/value* store (other words in the sentence) and the resulting output is a sum of the values, weighted by the *resonance* between the query and each key. + id: totrans-40 prefs: [] type: TYPE_NORMAL - en: We will now walk through the process in detail ([Figure 9-2](#attention_head)), again with reference to our *pink elephant* sentence. + id: totrans-41 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0902.png)' + id: totrans-42 prefs: [] type: TYPE_IMG - en: Figure 9-2\. The mechanics of an attention head + id: totrans-43 prefs: - PREF_H6 type: TYPE_NORMAL @@ -237,6 +303,7 @@ Q to change the dimensionality of the vector from d e to d k . + id: totrans-44 prefs: [] type: TYPE_NORMAL - en: The *key* vectors ( K ) are representations @@ -248,6 +315,7 @@ e to d k . Notice that the keys and the query are the same length ( d k ). + id: totrans-45 prefs: [] type: TYPE_NORMAL - en: Inside the attention head, each key is compared to the query using a dot product @@ -261,6 +329,7 @@ keep the variance of the vector sum stable (approximately equal to 1), and a softmax is applied to ensure the contributions sum to 1\. This is a vector of *attention weights*. + id: totrans-46 prefs: [] type: TYPE_NORMAL - en: The *value* vectors ( V ) are also representations @@ -271,17 +340,23 @@ e to d v . Notice that the value vectors do not necessarily have to have the same length as the keys and query (but often do, for simplicity). + id: totrans-47 prefs: [] type: TYPE_NORMAL - en: The value vectors are multiplied by the attention weights to give the *attention* for a given Q , K , and V , as shown in [Equation 9-1](#attention_equation). + id: totrans-48 prefs: [] type: TYPE_NORMAL + zh: 值向量乘以注意力权重,给出给定QKV的*注意力*,如[方程9-1](#attention_equation)所示。 - en: Equation 9-1\. 
Attention equation
+ id: totrans-49
  prefs:
  - PREF_H5
  type: TYPE_NORMAL
+ zh: 方程9-1。注意力方程
- en: '$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$'
+ id: totrans-50
  prefs: []
  type: TYPE_NORMAL
- en: To obtain the final output vector from the attention head, the attention is
    summed to give a vector of length $d_v$. This *context vector* captures a blended
    opinion from words in the sentence on the task of predicting what word follows
    *too*.
+ id: totrans-51
  prefs: []
  type: TYPE_NORMAL
+ zh: 从注意力头中获取最终输出向量,将注意力求和得到长度为d v的向量。这个*上下文向量*捕捉了句子中单词对于预测接下来的单词是什么的任务的混合意见。
- en: Multihead Attention
+ id: totrans-52
  prefs:
  - PREF_H2
  type: TYPE_NORMAL
+ zh: 多头注意力
- en: There’s no reason to stop at just one attention head! In Keras, we can build
    a `MultiHeadAttention` layer that concatenates the output from multiple attention
    heads, allowing each to learn a distinct attention mechanism so that the layer
    as a whole can learn more complex relationships.
+ id: totrans-53
  prefs: []
  type: TYPE_NORMAL
+ zh: 没有理由只停留在一个注意力头上!在Keras中,我们可以构建一个`MultiHeadAttention`层,将多个注意力头的输出连接起来,使每个头学习不同的注意力机制,从而使整个层能够学习更复杂的关系。
- en: The concatenated outputs are passed through one final weights matrix $W_O$
    to project the vector into the desired output dimension, which in our case is
    the same as the input dimension of the query ($d_e$), so that the layers can
    be stacked sequentially on top of each other.
+ id: totrans-54
  prefs: []
  type: TYPE_NORMAL
+ zh: 连接的输出通过一个最终的权重矩阵W O传递,将向量投影到所需的输出维度,这在我们的情况下与查询的输入维度相同(d e),以便层可以顺序堆叠在一起。
- en: '[Figure 9-3](#multi_attention_layer) shows how the output from a `MultiHeadAttention`
    layer is constructed. In Keras we can simply write the line shown in [Example 9-2](#multihead_attention_keras)
    to create such a layer.'
+ id: totrans-55
  prefs: []
  type: TYPE_NORMAL
+ zh: '[图9-3](#multi_attention_layer)展示了一个`MultiHeadAttention`层的输出是如何构建的。在Keras中,我们可以简单地写下[示例9-2](#multihead_attention_keras)中显示的代码来创建这样一个层。'
- en: Example 9-2\. Creating a `MultiHeadAttention` layer in Keras
+ id: totrans-56
  prefs:
  - PREF_H5
  type: TYPE_NORMAL
+ zh: 示例9-2。在Keras中创建一个`MultiHeadAttention`层
- en: '[PRE2]'
+ id: totrans-57
  prefs: []
  type: TYPE_PRE
+ zh: '[PRE2]'
- en: '[![1](Images/1.png)](#co_transformers_CO1-1)'
+ id: totrans-58
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![1](Images/1.png)](#co_transformers_CO1-1)'
- en: This multihead attention layer has four heads.
+ id: totrans-59
  prefs: []
  type: TYPE_NORMAL
+ zh: 这个多头注意力层有四个头。
- en: '[![2](Images/2.png)](#co_transformers_CO1-2)'
+ id: totrans-60
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![2](Images/2.png)](#co_transformers_CO1-2)'
- en: The keys (and query) are vectors of length 128.
+ id: totrans-61
  prefs: []
  type: TYPE_NORMAL
+ zh: 键(和查询)是长度为128的向量。
- en: '[![3](Images/3.png)](#co_transformers_CO1-3)'
+ id: totrans-62
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![3](Images/3.png)](#co_transformers_CO1-3)'
- en: The values (and therefore also the output from each head) are vectors of length
    64.
+ id: totrans-63
  prefs: []
  type: TYPE_NORMAL
+ zh: 值(因此也是每个头的输出)是长度为64的向量。
- en: '[![4](Images/4.png)](#co_transformers_CO1-4)'
+ id: totrans-64
  prefs: []
  type: TYPE_NORMAL
+ zh: '[![4](Images/4.png)](#co_transformers_CO1-4)'
- en: The output vector has length 256.
+ id: totrans-65
  prefs: []
  type: TYPE_NORMAL
+ zh: 输出向量的长度为256。
- en: '![](Images/gdl2_0903.png)'
+ id: totrans-66
  prefs: []
  type: TYPE_IMG
+ zh: '![](Images/gdl2_0903.png)'
- en: Figure 9-3\.
A multihead attention layer with four heads + id: totrans-67 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-3。一个具有四个头的多头注意力层 - en: Causal Masking + id: totrans-68 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 因果掩码 - en: So far, we have assumed that the query input to our attention head is a single vector. However, for efficiency during training, we would ideally like the attention layer to be able to operate on every word in the input at once, predicting for each what the subsequent word will be. In other words, we want our GPT model to be able to handle a group of query vectors in parallel (i.e., a matrix). + id: totrans-69 prefs: [] type: TYPE_NORMAL + zh: 到目前为止,我们假设我们的注意力头的查询输入是一个单一的向量。然而,在训练期间为了效率,我们理想情况下希望注意力层能够一次操作输入中的每个单词,为每个单词预测接下来的单词。换句话说,我们希望我们的GPT模型能够并行处理一组查询向量(即一个矩阵)。 - en: You might think that we can just batch the vectors together into a matrix and let linear algebra handle the rest. This is true, but we need one extra step—we need to apply a mask to the query/key dot product, to avoid information from future words leaking through. This is known as *causal masking* and is shown in [Figure 9-4](#causal_mask). + id: totrans-70 prefs: [] type: TYPE_NORMAL + zh: 您可能会认为我们可以将向量批量处理成一个矩阵,让线性代数处理剩下的部分。这是正确的,但我们需要一个额外的步骤——我们需要对查询/键的点积应用一个掩码,以避免未来单词的信息泄漏。这被称为*因果掩码*,在[图9-4](#causal_mask)中显示。 - en: '![](Images/gdl2_0904.png)' + id: totrans-71 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0904.png)' - en: Figure 9-4\. Matrix calculation of the attention scores for a batch of input queries, using a causal attention mask to hide keys that are not available to the query (because they come later in the sentence) + id: totrans-72 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-4。对一批输入查询计算注意力分数的矩阵,使用因果注意力掩码隐藏对查询不可用的键(因为它们在句子中后面) - en: Without this mask, our GPT model would be able to perfectly guess the next word in the sentence, because it would be using the key from the word itself as a feature! The code for creating a causal mask is shown in [Example 9-3](#causal_mask_code), and the resulting `numpy` array (transposed to match the diagram) is shown in [Figure 9-5](#causal_mask_numpy). + id: totrans-73 prefs: [] type: TYPE_NORMAL + zh: 如果没有这个掩码,我们的GPT模型将能够完美地猜测句子中的下一个单词,因为它将使用单词本身的键作为特征!创建因果掩码的代码显示在[示例9-3](#causal_mask_code)中,结果的`numpy`数组(转置以匹配图表)显示在[图9-5](#causal_mask_numpy)中。 - en: Example 9-3\. The causal mask function + id: totrans-74 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 示例9-3。因果掩码函数 - en: '[PRE3]' + id: totrans-75 prefs: [] type: TYPE_PRE + zh: '[PRE3]' - en: '![](Images/gdl2_0905.png)' + id: totrans-76 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0905.png)' - en: Figure 9-5\. The causal mask as a `numpy` array—1 means unmasked and 0 means masked + id: totrans-77 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-5。作为`numpy`数组的因果掩码——1表示未掩码,0表示掩码 - en: Tip + id: totrans-78 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 提示 - en: Causal masking is only required in *decoder Transformers* such as GPT, where the task is to sequentially generate tokens given previous tokens. Masking out future tokens during training is therefore essential. + id: totrans-79 prefs: [] type: TYPE_NORMAL + zh: 因果掩码仅在*解码器Transformer*(如GPT)中需要,其中任务是根据先前的标记顺序生成标记。在训练期间屏蔽未来标记因此至关重要。 - en: Other flavors of Transformer (e.g., *encoder Transformers*) do not need causal masking, because they are not trained to predict the next token. 
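In code, applying the mask boils down to adding a very large negative number to the disallowed positions of the query/key dot product before the softmax, so that the corresponding weights collapse to zero — the Keras `MultiHeadAttention` layer does this for us internally when we pass the mask via its `attention_mask` argument. A minimal NumPy sketch of the idea, with illustrative names:

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: (seq_len, seq_len) matrix of query/key dot products
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n)))              # 1 = visible, 0 = future token
    scores = np.where(mask == 1, scores, -1e9)   # hide keys from the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
print(causal_attention_weights(scores).round(2))  # upper triangle is ~0
```

Encoder-style Transformers, by contrast, skip this masking step altogether.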
For example Google’s BERT predicts masked words within a given sentence, so it can use context from both before and after the word in question.^([3](ch09.xhtml#idm45387006370384)) + id: totrans-80 prefs: [] type: TYPE_NORMAL - en: We will explore the different types of Transformers in more detail at the end of the chapter. + id: totrans-81 prefs: [] type: TYPE_NORMAL - en: This concludes our explanation of the multihead attention mechanism that is @@ -439,25 +586,31 @@ to reshape the output ( W O ). There are no convolutions or recurrent mechanisms at all in a multihead attention layer! + id: totrans-82 prefs: [] type: TYPE_NORMAL - en: Next, we shall take a step back and see how the multihead attention layer forms just one part of a larger component known as a *Transformer block*. + id: totrans-83 prefs: [] type: TYPE_NORMAL - en: The Transformer Block + id: totrans-84 prefs: - PREF_H2 type: TYPE_NORMAL - en: A *Transformer block* is a single component within a Transformer that applies some skip connections, feed-forward (dense) layers, and normalization around the multihead attention layer. A diagram of a Transformer block is shown in [Figure 9-6](#transformer_block). + id: totrans-85 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0906.png)' + id: totrans-86 prefs: [] type: TYPE_IMG - en: Figure 9-6\. A Transformer block + id: totrans-87 prefs: - PREF_H6 type: TYPE_NORMAL @@ -467,6 +620,7 @@ not suffer as much from the vanishing gradient problem, because the skip connection provides a gradient-free *highway* that allows the network to transfer information forward uninterrupted. + id: totrans-88 prefs: [] type: TYPE_NORMAL - en: Secondly, *layer normalization* is used in the Transformer block to provide @@ -474,6 +628,7 @@ layer in action throughout this book, where the output from each channel is normalized to have a mean of 0 and standard deviation of 1\. The normalization statistics are calculated across the batch and spatial dimensions. + id: totrans-89 prefs: [] type: TYPE_NORMAL - en: In contrast, layer normalization in a Transformer block normalizes each position @@ -481,17 +636,21 @@ the channels. It is the complete opposite of batch normalization, in terms of how the normalization statistics are calculated. A diagram showing the difference between batch normalization and layer normalization is shown in [Figure 9-7](#layer_norm). + id: totrans-90 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0907.png)' + id: totrans-91 prefs: [] type: TYPE_IMG - en: 'Figure 9-7\. Layer normalization versus batch normalization—the normalization statistics are calculated across the blue cells (source: [Sheng et al., 2020](https://arxiv.org/pdf/2003.07845.pdf))^([4](ch09.xhtml#idm45387006340992))' + id: totrans-92 prefs: - PREF_H6 type: TYPE_NORMAL - en: Layer Normalization Versus Batch Normalization + id: totrans-93 prefs: - PREF_H1 type: TYPE_NORMAL @@ -500,61 +659,80 @@ in the batch. However, recent work such as Shen et al.*s* challenges this assumption, showing that with some tweaks a form of batch normalization can still be used within Transformers, outperforming more traditional layer normalization. + id: totrans-94 prefs: [] type: TYPE_NORMAL - en: Lastly, a set of feed-forward (i.e., densely connected) layers is included in the Transformer block, to allow the component to extract higher-level features as we go deeper into the network. + id: totrans-95 prefs: [] type: TYPE_NORMAL - en: A Keras implementation of a Transformer block is shown in [Example 9-4](#transformer_block_code2). 
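As a rough sketch of how these pieces fit together (not the book's exact listing — the layer sizes are illustrative), such a block can be written as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlockSketch(layers.Layer):
    """Illustrative Transformer block: masked multihead attention wrapped in
    skip connections, layer normalization, and a feed-forward sub-network."""

    def __init__(self, num_heads=4, key_dim=128, embed_dim=256, ff_dim=256):
        super().__init__()
        self.attn = layers.MultiHeadAttention(num_heads, key_dim, output_shape=embed_dim)
        self.ln_1 = layers.LayerNormalization()
        self.ln_2 = layers.LayerNormalization()
        self.ffn_1 = layers.Dense(ff_dim, activation="relu")
        self.ffn_2 = layers.Dense(embed_dim)

    def call(self, inputs):
        seq_len = tf.shape(inputs)[1]
        # Lower-triangular causal mask: each position sees itself and earlier tokens only
        causal_mask = tf.cast(
            tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0), tf.bool
        )
        attn_output = self.attn(inputs, inputs, attention_mask=causal_mask)
        x = self.ln_1(inputs + attn_output)        # first add and normalize
        ffn_output = self.ffn_2(self.ffn_1(x))     # feed-forward layers
        return self.ln_2(x + ffn_output)           # second add and normalize
```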
+ id: totrans-96 prefs: [] type: TYPE_NORMAL - en: Example 9-4\. A `TransformerBlock` layer in Keras + id: totrans-97 prefs: - PREF_H5 type: TYPE_NORMAL - en: '[PRE4]' + id: totrans-98 prefs: [] type: TYPE_PRE + zh: '[PRE4]' - en: '[![1](Images/1.png)](#co_transformers_CO2-1)' + id: totrans-99 prefs: [] type: TYPE_NORMAL - en: The sublayers that make up the `TransformerBlock` layer are defined within the initialization function. + id: totrans-100 prefs: [] type: TYPE_NORMAL - en: '[![2](Images/2.png)](#co_transformers_CO2-2)' + id: totrans-101 prefs: [] type: TYPE_NORMAL - en: The causal mask is created to hide future keys from the query. + id: totrans-102 prefs: [] type: TYPE_NORMAL - en: '[![3](Images/3.png)](#co_transformers_CO2-3)' + id: totrans-103 prefs: [] type: TYPE_NORMAL - en: The multihead attention layer is created, with the attention masks specified. + id: totrans-104 prefs: [] type: TYPE_NORMAL - en: '[![4](Images/4.png)](#co_transformers_CO2-4)' + id: totrans-105 prefs: [] type: TYPE_NORMAL - en: The first *add and normalization* layer. + id: totrans-106 prefs: [] type: TYPE_NORMAL - en: '[![5](Images/5.png)](#co_transformers_CO2-5)' + id: totrans-107 prefs: [] type: TYPE_NORMAL - en: The feed-forward layers. + id: totrans-108 prefs: [] type: TYPE_NORMAL - en: '[![6](Images/6.png)](#co_transformers_CO2-6)' + id: totrans-109 prefs: [] type: TYPE_NORMAL - en: The second *add and normalization* layer. + id: totrans-110 prefs: [] type: TYPE_NORMAL - en: Positional Encoding + id: totrans-111 prefs: - PREF_H2 type: TYPE_NORMAL @@ -565,13 +743,16 @@ recurrent neural network. This is a strength (because of the parallelization efficiency gains) but also a problem, because we clearly need the attention layer to be able to predict different outputs for the following two sentences:' + id: totrans-112 prefs: [] type: TYPE_NORMAL - en: The dog looked at the boy and …​ (barked?) + id: totrans-113 prefs: - PREF_UL type: TYPE_NORMAL - en: The boy looked at the dog and …​ (smiled?) + id: totrans-114 prefs: - PREF_UL type: TYPE_NORMAL @@ -579,66 +760,84 @@ creating the inputs to the initial Transformer block. Instead of only encoding each token using a *token embedding*, we also encode the position of the token, using a *position embedding*. + id: totrans-115 prefs: [] type: TYPE_NORMAL - en: The *token embedding* is created using a standard `Embedding` layer to convert each token into a learned vector. We can create the *positional embedding* in the same way, using a standard `Embedding` layer to convert each integer position into a learned vector. + id: totrans-116 prefs: [] type: TYPE_NORMAL - en: Tip + id: totrans-117 prefs: - PREF_H6 type: TYPE_NORMAL - en: While GPT uses an `Embedding` layer to embed the position, the original Transformer paper used trigonometric functions—we’ll cover this alternative in [Chapter 11](ch11.xhtml#chapter_music), when we explore music generation. + id: totrans-118 prefs: [] type: TYPE_NORMAL - en: To construct the joint token–position encoding, the token embedding is added to the positional embedding, as shown in [Figure 9-8](#positional_enc). This way, the meaning and position of each word in the sequence are captured in a single vector. + id: totrans-119 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0908.png)' + id: totrans-120 prefs: [] type: TYPE_IMG - en: Figure 9-8\. 
The token embeddings are added to the positional embeddings to give the token position encoding + id: totrans-121 prefs: - PREF_H6 type: TYPE_NORMAL - en: The code that defines our `TokenAndPositionEmbedding` layer is shown in [Example 9-5](#positional_embedding_code). + id: totrans-122 prefs: [] type: TYPE_NORMAL - en: Example 9-5\. The `TokenAndPositionEmbedding` layer + id: totrans-123 prefs: - PREF_H5 type: TYPE_NORMAL - en: '[PRE5]' + id: totrans-124 prefs: [] type: TYPE_PRE + zh: '[PRE5]' - en: '[![1](Images/1.png)](#co_transformers_CO3-1)' + id: totrans-125 prefs: [] type: TYPE_NORMAL - en: The tokens are embedded using an `Embedding` layer. + id: totrans-126 prefs: [] type: TYPE_NORMAL - en: '[![2](Images/2.png)](#co_transformers_CO3-2)' + id: totrans-127 prefs: [] type: TYPE_NORMAL - en: The positions of the tokens are also embedded using an `Embedding` layer. + id: totrans-128 prefs: [] type: TYPE_NORMAL - en: '[![3](Images/3.png)](#co_transformers_CO3-3)' + id: totrans-129 prefs: [] type: TYPE_NORMAL - en: The output from the layer is the sum of the token and position embeddings. + id: totrans-130 prefs: [] type: TYPE_NORMAL - en: Training GPT + id: totrans-131 prefs: - PREF_H2 type: TYPE_NORMAL @@ -646,117 +845,163 @@ we need to pass our input text through the token and position embedding layer, then through our Transformer block. The final output of the network is a simple `Dense` layer with softmax activation over the number of words in the vocabulary. + id: totrans-132 prefs: [] type: TYPE_NORMAL - en: Tip + id: totrans-133 prefs: - PREF_H6 type: TYPE_NORMAL - en: For simplicity, we will use just one Transformer block, rather than the 12 in the paper. + id: totrans-134 prefs: [] type: TYPE_NORMAL - en: The overall architecture is shown in [Figure 9-9](#transformer) and the equivalent code is provided in [Example 9-6](#transformer_code). + id: totrans-135 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0909.png)' + id: totrans-136 prefs: [] type: TYPE_IMG - en: Figure 9-9\. The simplified GPT model architecture + id: totrans-137 prefs: - PREF_H6 type: TYPE_NORMAL - en: Example 9-6\. A GPT model in Keras + id: totrans-138 prefs: - PREF_H5 type: TYPE_NORMAL - en: '[PRE6]' + id: totrans-139 prefs: [] type: TYPE_PRE + zh: '[PRE6]' - en: '[![1](Images/1.png)](#co_transformers_CO4-1)' + id: totrans-140 prefs: [] type: TYPE_NORMAL - en: The input is padded (with zeros). + id: totrans-141 prefs: [] type: TYPE_NORMAL - en: '[![2](Images/2.png)](#co_transformers_CO4-2)' + id: totrans-142 prefs: [] type: TYPE_NORMAL - en: The text is encoded using a `TokenAndPositionEmbedding` layer. + id: totrans-143 prefs: [] type: TYPE_NORMAL - en: '[![3](Images/3.png)](#co_transformers_CO4-3)' + id: totrans-144 prefs: [] type: TYPE_NORMAL - en: The encoding is passed through a `TransformerBlock`. + id: totrans-145 prefs: [] type: TYPE_NORMAL - en: '[![4](Images/4.png)](#co_transformers_CO4-4)' + id: totrans-146 prefs: [] type: TYPE_NORMAL - en: The transformed output is passed through a `Dense` layer with softmax activation to predict a distribution over the subsequent word. + id: totrans-147 prefs: [] type: TYPE_NORMAL + zh: 转换后的输出通过具有softmax激活的`Dense`层传递,以预测后续单词的分布。 - en: '[![5](Images/5.png)](#co_transformers_CO4-5)' + id: totrans-148 prefs: [] type: TYPE_NORMAL + zh: '[![5](Images/5.png)](#co_transformers_CO4-5)' - en: The `Model` takes a sequence of word tokens as input and outputs the predicted subsequent word distribution. 
The output from the Transformer block is also returned so that we can inspect how the model is directing its attention. + id: totrans-149 prefs: [] type: TYPE_NORMAL + zh: '`Model`以单词标记序列作为输入,并输出预测的后续单词分布。还返回了Transformer块的输出,以便我们可以检查模型如何引导其注意力。' - en: '[![6](Images/6.png)](#co_transformers_CO4-6)' + id: totrans-150 prefs: [] type: TYPE_NORMAL + zh: '[![6](Images/6.png)](#co_transformers_CO4-6)' - en: The model is compiled with `SparseCategoricalCrossentropy` loss over the predicted word distribution. + id: totrans-151 prefs: [] type: TYPE_NORMAL + zh: 模型使用预测的单词分布上的`SparseCategoricalCrossentropy`损失进行编译。 - en: Analysis of GPT + id: totrans-152 prefs: - PREF_H2 type: TYPE_NORMAL + zh: GPT的分析 - en: Now that we have compiled and trained our GPT model, we can start to use it to generate long strings of text. We can also interrogate the attention weights that are output from the `TransformerBlock`, to understand where the Transformer is looking for information at different points in the generation process. + id: totrans-153 prefs: [] type: TYPE_NORMAL + zh: 现在我们已经编译并训练了我们的GPT模型,我们可以开始使用它生成长文本字符串。我们还可以询问从`TransformerBlock`输出的注意权重,以了解Transformer在生成过程中不同点处寻找信息的位置。 - en: Generating text + id: totrans-154 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 生成文本 - en: 'We can generate new text by applying the following process:' + id: totrans-155 prefs: [] type: TYPE_NORMAL + zh: 我们可以通过以下过程生成新文本: - en: Feed the network with an existing sequence of words and ask it to predict the following word. + id: totrans-156 prefs: - PREF_OL type: TYPE_NORMAL + zh: 将现有单词序列馈送到网络中,并要求它预测接下来的单词。 - en: Append this word to the existing sequence and repeat. + id: totrans-157 prefs: - PREF_OL type: TYPE_NORMAL + zh: 将此单词附加到现有序列并重复。 - en: The network will output a set of probabilities for each word that we can sample from, so we can make the text generation stochastic, rather than deterministic. + id: totrans-158 prefs: [] type: TYPE_NORMAL + zh: 网络将为每个单词输出一组概率,我们可以从中进行抽样,因此我们可以使文本生成具有随机性,而不是确定性。 - en: We will use the same `TextGenerator` class introduced in [Chapter 5](ch05.xhtml#chapter_autoregressive) for LSTM text generation, including the `temperature` parameter that specifies how deterministic we would like the sampling process to be. Let’s take a look at this in action, at two different temperature values ([Figure 9-10](#transformer_examples)). + id: totrans-159 prefs: [] type: TYPE_NORMAL + zh: 我们将使用在[第5章](ch05.xhtml#chapter_autoregressive)中引入的相同`TextGenerator`类进行LSTM文本生成,包括指定采样过程的确定性程度的`temperature`参数。让我们看看这在两个不同的温度值([图9-10](#transformer_examples))下是如何运作的。 - en: '![](Images/gdl2_0910.png)' + id: totrans-160 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0910.png)' - en: Figure 9-10\. Generated outputs at `temperature = 1.0` and `temperature = 0.5`. + id: totrans-161 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-10。在`temperature = 1.0`和`temperature = 0.5`时生成的输出。 - en: There are a few things to note about these two passages. First, both are stylistically similar to a wine review from the original training set. They both open with the region and type of wine, and the wine type stays consistent throughout the passage @@ -765,40 +1010,54 @@ accurate than the example with temperature 0.5\. Generating multiple samples with temperature 1.0 will therefore lead to more variety as the model is sampling from a probability distribution with greater variance. 
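The sampling loop behind these examples can be sketched as follows — a simplified stand-in for the `TextGenerator` callback, assuming (as in the model above) that the network returns both the next-word probabilities and the attention scores, and that token 0 is the padding token:

```python
import numpy as np

def sample_token(probs, temperature=1.0):
    # Low temperature sharpens the distribution; high temperature flattens it
    probs = probs ** (1.0 / temperature)
    probs = probs / np.sum(probs)
    return np.random.choice(len(probs), p=probs)

def generate(model, start_tokens, max_tokens=80, temperature=1.0):
    tokens = list(start_tokens)
    while len(tokens) < max_tokens:
        probs, _ = model.predict(np.array([tokens]), verbose=0)
        next_token = sample_token(probs[0, -1], temperature)
        if next_token == 0:  # assumed padding token marks the end
            break
        tokens.append(next_token)
    return tokens
```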
+ id: totrans-162 prefs: [] type: TYPE_NORMAL + zh: 关于这两段文字有几点需要注意。首先,两者在风格上与原始训练集中的葡萄酒评论相似。它们都以葡萄酒的产地和类型开头,而葡萄酒类型在整个段落中保持一致(例如,它不会在中途更换颜色)。正如我们在[第5章](ch05.xhtml#chapter_autoregressive)中看到的,使用温度为1.0生成的文本更加冒险,因此比温度为0.5的示例不够准确。因此,使用温度为1.0生成多个样本将导致更多的变化,因为模型正在从具有更大方差的概率分布中进行抽样。 - en: Viewing the attention scores + id: totrans-163 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 查看注意力分数 - en: We can also ask the model to tell us how much attention is being placed on each word, when deciding on the next word in the sentence. The `TransformerBlock` outputs the attention weights for each head, which are a softmax distribution over the preceding words in the sentence. + id: totrans-164 prefs: [] type: TYPE_NORMAL + zh: 我们还可以要求模型告诉我们在决定句子中的下一个单词时,每个单词放置了多少注意力。`TransformerBlock`输出每个头的注意权重,这是对句子中前面单词的softmax分布。 - en: To demonstrate this, [Figure 9-11](#attention_probs) shows the top five tokens with the highest probabilities for three different input prompts, as well as the average attention across both heads, against each preceding word. The preceding words are colored according to their attention score, averaged across the two attention heads. Darker blue indicates more attention is being placed on the word. + id: totrans-165 prefs: [] type: TYPE_NORMAL + zh: 为了证明这一点,[图9-11](#attention_probs)显示了三个不同输入提示的前五个具有最高概率的标记,以及两个注意力头的平均注意力,针对每个前面的单词。根据其注意力分数对前面的单词进行着色,两个注意力头的平均值。深蓝色表示对该单词放置更多的注意力。 - en: '![](Images/gdl2_0911.png)' + id: totrans-166 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0911.png)' - en: Figure 9-11\. Distribution of word probabilities following various sequences + id: totrans-167 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-11。各种序列后单词概率分布 - en: In the first example, the model attends closely to the country (*germany*) in order to decide on the word that relates to the region. This makes sense! To pick a region, it needs to take lots of information from the words that relate to the country, to ensure they match. It doesn’t need to pay as much attention to the first two tokens (*wine review*) because they don’t hold any useful information regarding the region. + id: totrans-168 prefs: [] type: TYPE_NORMAL + zh: 在第一个示例中,模型密切关注国家(*德国*),以决定与地区相关的单词。这是有道理的!为了选择一个地区,它需要从与国家相关的单词中获取大量信息,以确保它们匹配。它不需要太关注前两个标记(*葡萄酒评论*),因为它们不包含有关地区的任何有用信息。 - en: In the second example, it needs to refer back to the grape (*riesling*), so it pays attention to the first time that it was mentioned. It can pull this information by directly attending to the word, no matter how far back it is in the sentence @@ -806,6 +1065,7 @@ a recurrent neural network, which relies on a hidden state to maintain all interesting information over the length of the sequence so that it can be drawn upon if required—a much less efficient approach. + id: totrans-169 prefs: [] type: TYPE_NORMAL - en: The final sequence shows an example of how our GPT model can choose an appropriate @@ -814,6 +1074,7 @@ As Riesling is typically a sweet wine, and sugar is already mentioned, it makes sense that it should be described as *slightly sweet* rather than *slightly earthy*, for example. 
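A sketch of how such a view can be produced from our model is shown below, assuming (as above) that the model returns the attention scores alongside the word probabilities and that those scores have shape `(batch, heads, query_position, key_position)` — the prompt and formatting are purely illustrative:

```python
import numpy as np

prompt = "wine review : germany :"                        # illustrative prompt
tokens = vectorize_layer([prompt]).numpy()[0]
n = np.count_nonzero(tokens)                              # ignore the padding tokens

probs, attention = model.predict(tokens[np.newaxis, :], verbose=0)
avg_attention = attention[0, :, n - 1, :n].mean(axis=0)   # average the heads at the last position

vocab = vectorize_layer.get_vocabulary()
for token_id, score in zip(tokens[:n], avg_attention):
    print(f"{vocab[token_id]:>12s}  {score:.2f}")          # attention paid to each preceding word

top_5 = np.argsort(probs[0, n - 1])[::-1][:5]              # five most probable next tokens
print([vocab[i] for i in top_5])
```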
+ id: totrans-170 prefs: [] type: TYPE_NORMAL - en: It is incredibly informative to be able to interrogate the network in this way, @@ -822,6 +1083,7 @@ input prompts to see if you can get the model to attend to words really far back in the sentence, to convince yourself of the power of attention-based models over more traditional recurrent models!` `# Other Transformers + id: totrans-171 prefs: [] type: TYPE_NORMAL - en: Our GPT model is a *decoder Transformer*—it generates a text string one token @@ -832,39 +1094,49 @@ are also *encoder-decoder Transformers* that can translate from one text string to another; this type of model contains both encoder Transformer blocks and decoder Transformer blocks. + id: totrans-172 prefs: [] type: TYPE_NORMAL - en: '[Table 9-1](#transformer_types) summarizes the three types of Transformers, with the best examples of each architecture and typical use cases.' + id: totrans-173 prefs: [] type: TYPE_NORMAL - en: Table 9-1\. The three Transformer architectures + id: totrans-174 prefs: [] type: TYPE_NORMAL - en: '| Type | Examples | Use cases |' + id: totrans-175 prefs: [] type: TYPE_TB - en: '| --- | --- | --- |' + id: totrans-176 prefs: [] type: TYPE_TB - en: '| Encoder | BERT (Google) | Sentence classification, named entity recognition, extractive question answering |' + id: totrans-177 prefs: [] type: TYPE_TB - en: '| Encoder-decoder | T5 (Google) | Summarization, translation, question answering |' + id: totrans-178 prefs: [] type: TYPE_TB - en: '| Decoder | GPT-3 (OpenAI) | Text generation |' + id: totrans-179 prefs: [] type: TYPE_TB - en: A well-known example of an encoder Transformer is the *Bidirectional Encoder Representations from Transformers* (BERT) model, developed by Google (Devlin et al., 2018) that predicts missing words from a sentence, given context from both before and after the missing word in all layers. + id: totrans-180 prefs: [] type: TYPE_NORMAL - en: Encoder Transformers + id: totrans-181 prefs: - PREF_H1 type: TYPE_NORMAL @@ -874,14 +1146,17 @@ so we will not explore them in detail in this book—see Lewis Tunstall et al.’s [*Natural Language Processing with Transformers*](https://www.oreilly.com/library/view/natural-language-processing/9781098136789) (O’Reilly) for more information. + id: totrans-182 prefs: [] type: TYPE_NORMAL - en: In the following sections we will explore how encoder-decoder transformers work and discuss extensions of the original GPT model architecture released by OpenAI, including ChatGPT, which has been specifically designed for conversational applications. + id: totrans-183 prefs: [] type: TYPE_NORMAL - en: T5 + id: totrans-184 prefs: - PREF_H2 type: TYPE_NORMAL @@ -889,14 +1164,17 @@ the T5 model from Google.^([5](ch09.xhtml#idm45387005361120)) This model reframes a range of tasks into a text-to-text framework, including translation, linguistic acceptability, sentence similarity, and document summarization, as shown in [Figure 9-12](#t5). + id: totrans-185 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0912.png)' + id: totrans-186 prefs: [] type: TYPE_IMG - en: 'Figure 9-12\. 
Examples of how T5 reframes a range of tasks into a text-to-text framework, including translation, linguistic acceptability, sentence similarity, and document summarization (source: [Raffel et al., 2019](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html))' + id: totrans-187 prefs: - PREF_H6 type: TYPE_NORMAL @@ -906,13 +1184,16 @@ Colossal Clean Crawled Corpus, or C4), whereas the original Transformer paper was focused only on language translation, so it was trained on 1.4 GB of English–German sentence pairs. + id: totrans-188 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0913.png)' + id: totrans-189 prefs: [] type: TYPE_IMG - en: 'Figure 9-13\. An encoder-decoder Transformer model: each gray box is a Transformer block (source: [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))' + id: totrans-190 prefs: - PREF_H6 type: TYPE_NORMAL @@ -920,6 +1201,7 @@ being repeated and positional embedding being used to capture the ordering of the input sequences. The two key differences between this model and the GPT model that we built earlier in the chapter are as follows:' + id: totrans-191 prefs: [] type: TYPE_NORMAL - en: On the lefthand side, a set of *encoder* Transformer blocks encode the sequence @@ -929,6 +1211,7 @@ that can be fed to the decoder. Therefore, the attention layers in the encoder can be completely unmasked to capture all the cross-dependencies between words, no matter the order. + id: totrans-192 prefs: - PREF_UL type: TYPE_NORMAL @@ -941,6 +1224,7 @@ is called *cross-referential* attention and means that the decoder can attend to the encoder representation of the input sequence to be translated. This is how the decoder knows what meaning the translation needs to convey! + id: totrans-193 prefs: - PREF_UL type: TYPE_NORMAL @@ -951,18 +1235,22 @@ on the gender of the noun, but the Transformer knows to choose *die* because one attention head is able to attend to the word *street* (a feminine word in German), while another attends to the word to translate (*the*).' + id: totrans-194 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0914.png)' + id: totrans-195 prefs: [] type: TYPE_IMG - en: Figure 9-14\. An example of how one attention head attends to the word “the” and another attends to the word “street” in order to correctly translate the word “the” to the German word “die” as the feminine definite article of “Straße” + id: totrans-196 prefs: - PREF_H6 type: TYPE_NORMAL - en: Tip + id: totrans-197 prefs: - PREF_H6 type: TYPE_NORMAL @@ -970,43 +1258,53 @@ which contains a Colab notebook that allows you to play around with a trained encoder-decoder Transformer model and see how the attention mechanisms of the encoder and decoder impact the translation of a given sentence into German. + id: totrans-198 prefs: [] type: TYPE_NORMAL - en: GPT-3 and GPT-4 + id: totrans-199 prefs: - PREF_H2 type: TYPE_NORMAL - en: Since the original 2018 publication of GPT, OpenAI has released multiple updated versions that improve upon the original model, as shown in [Table 9-2](#gpt_releases). + id: totrans-200 prefs: [] type: TYPE_NORMAL - en: Table 9-2\. 
The evolution of OpenAI’s GPT collection of models + id: totrans-201 prefs: [] type: TYPE_NORMAL - en: '| Model | Date | Layers | Attention heads | Word embedding size | Context window | # parameters | Training data |' + id: totrans-202 prefs: [] type: TYPE_TB - en: '| --- | --- | --- | --- | --- | --- | --- | --- |' + id: totrans-203 prefs: [] type: TYPE_TB - en: '| [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) | Jun 2018 | 12 | 12 | 768 | 512 | 120,000,000 | BookCorpus: 4.5 GB of text from unpublished books |' + id: totrans-204 prefs: [] type: TYPE_TB - en: '| [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) | Feb 2019 | 48 | 48 | 1,600 | 1,024 | 1,500,000,000 | WebText: 40 GB of text from outbound Reddit links |' + id: totrans-205 prefs: [] type: TYPE_TB - en: '| [GPT-3](https://arxiv.org/abs/2005.14165) | May 2020 | 96 | 96 | 12,888 | 2,048 | 175,000,000,000 | CommonCrawl, WebText, English Wikipedia, book corpora and others: 570 GB |' + id: totrans-206 prefs: [] type: TYPE_TB - en: '| [GPT-4](https://arxiv.org/abs/2303.08774) | Mar 2023 | - | - | - | - | - | - |' + id: totrans-207 prefs: [] type: TYPE_TB - en: The model architecture of GPT-3 is fairly similar to the original GPT model, @@ -1016,6 +1314,7 @@ so crosses over into being a multimodal model for the first time. The model weights of GPT-3 and GPT-4 are not open source, though the models are available through a [commercial tool and API](https://platform.openai.com). + id: totrans-208 prefs: [] type: TYPE_NORMAL - en: GPT-3 can also be [fine-tuned to your own training data](https://oreil.ly/B-Koo)—this @@ -1025,16 +1324,20 @@ simply by providing a few examples in the prompt itself (this is known as *few-shot learning*). The benefit of fine-tuning is that you do not need to provide these examples as part of every single input prompt, saving costs in the long run. + id: totrans-209 prefs: [] type: TYPE_NORMAL - en: An example of the output from GPT-3, given a system prompt sentence, is shown in [Figure 9-15](#gpt3_story). + id: totrans-210 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0915.png)' + id: totrans-211 prefs: [] type: TYPE_IMG - en: Figure 9-15\. An example of how GPT-3 can extend a given system prompt + id: totrans-212 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1042,9 +1345,11 @@ of model weights and dataset size. The ceiling of large language model capability has yet to be reached, with researchers continuing to push the boundaries of what is possible with increasingly larger models and datasets. + id: totrans-213 prefs: [] type: TYPE_NORMAL - en: ChatGPT + id: totrans-214 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1053,18 +1358,22 @@ a conversational interface. The original release in November 2022 was powered by *GPT-3.5*, a version of the model that was more powerful that GPT-3 and was fine-tuned to conversational responses. + id: totrans-215 prefs: [] type: TYPE_NORMAL - en: Example dialogue is shown in [Figure 9-16](#chatgpt_example). Notice how the agent is able to maintain state between inputs, understanding that the *attention* mentioned in the second question refers to attention in the context of Transformers, rather than a person’s ability to focus. + id: totrans-216 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0916.png)' + id: totrans-217 prefs: [] type: TYPE_IMG - en: Figure 9-16\. 
An example of ChatGPT answering questions about Transformers + id: totrans-218 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1075,14 +1384,17 @@ group’s earlier paper^([6](ch09.xhtml#idm45387005277024)) that introduced the *InstructGPT* model, a fine-tuned GPT-3 model that is specifically designed to more accurately follow written instructions. + id: totrans-219 prefs: [] type: TYPE_NORMAL - en: 'The training process for ChatGPT is as follows:' + id: totrans-220 prefs: [] type: TYPE_NORMAL - en: '*Supervised fine-tuning*: Collect a demonstration dataset of conversational inputs (prompts) and desired outputs that have been written by humans. This is used to fine-tune the underlying language model (GPT-3.5) using supervised learning.' + id: totrans-221 prefs: - PREF_OL type: TYPE_NORMAL @@ -1090,6 +1402,7 @@ sampled model outputs and ask them to rank the outputs from best to worst. Train a reward model that predicts the score given to each output, given the conversation history.' + id: totrans-222 prefs: - PREF_OL type: TYPE_NORMAL @@ -1100,26 +1413,32 @@ by the reward model trained in step 2\. A reinforcement learning algorithm—proximal policy optimization (PPO)—can then be trained to maximize the reward, by adjusting the weights of the language model.' + id: totrans-223 prefs: - PREF_OL type: TYPE_NORMAL - en: Reinforcement Learning + id: totrans-224 prefs: - PREF_H1 type: TYPE_NORMAL - en: For an introduction to reinforcement learning see [Chapter 12](ch12.xhtml#chapter_world_models), where we explore how generative models can be used in a reinforcement learning setting. + id: totrans-225 prefs: [] type: TYPE_NORMAL - en: The RLHF process is shown in [Figure 9-17](#rlhf). + id: totrans-226 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_0917.png)' + id: totrans-227 prefs: [] type: TYPE_IMG - en: 'Figure 9-17\. The reinforcement learning from human feedback fine-tuning process used in ChatGPT (source: [OpenAI](https://openai.com/blog/chatgpt))' + id: totrans-228 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1129,6 +1448,7 @@ and novel output that is often indistinguishable from human-generated text. The progress made thus far by models like ChatGPT serves as a testament to the potential of AI and its transformative impact on the world. + id: totrans-229 prefs: [] type: TYPE_NORMAL - en: Moreover, it is evident that AI-driven communication and interaction will continue @@ -1138,32 +1458,38 @@ text, but also images. The fusion of linguistic and visual capabilities in projects like Visual ChatGPT and GPT-4 have the potential to herald a new era in human–computer interaction. + id: totrans-230 prefs: [] type: TYPE_NORMAL - en: Summary + id: totrans-231 prefs: - PREF_H1 type: TYPE_NORMAL - en: In this chapter, we explored the Transformer model architecture and built a version of GPT—a model for state-of-the-art text generation. + id: totrans-232 prefs: [] type: TYPE_NORMAL - en: GPT makes use of a mechanism known as attention, which removes the need for recurrent layers (e.g., LSTMs). It works like an information retrieval system, utilizing queries, keys, and values to decide how much information it wants to extract from each input token. + id: totrans-233 prefs: [] type: TYPE_NORMAL - en: Attention heads can be grouped together to form what is known as a multihead attention layer. These are then wrapped up inside a Transformer block, which includes layer normalization and skip connections around the attention layer. Transformer blocks can be stacked to create very deep neural networks. 
+ id: totrans-234 prefs: [] type: TYPE_NORMAL - en: Causal masking is used to ensure that GPT cannot leak information from downstream tokens into the current prediction. Also, a technique known as positional encoding is used to ensure that the ordering of the input sequence is not lost, but instead is baked into the input alongside the traditional word embedding. + id: totrans-235 prefs: [] type: TYPE_NORMAL - en: When analyzing the output from GPT, we saw it was possible not only to generate @@ -1173,41 +1499,50 @@ because the attention scores are calculated in parallel and do not rely on a hidden state that is carried through the network sequentially, as is the case with recurrent neural networks. + id: totrans-236 prefs: [] type: TYPE_NORMAL - en: We saw how there are three families of Transformers (encoder, decoder, and encoder-decoder) and the different tasks that can be accomplished with each. Finally, we explored the structure and training process of other large language models such as Google’s T5 and OpenAI’s ChatGPT. + id: totrans-237 prefs: [] type: TYPE_NORMAL - en: ^([1](ch09.xhtml#idm45387006840576-marker)) Ashish Vaswani et al., “Attention Is All You Need,” June 12, 2017, [*https://arxiv.org/abs/1706.03762*](https://arxiv.org/abs/1706.03762). + id: totrans-238 prefs: [] type: TYPE_NORMAL - en: ^([2](ch09.xhtml#idm45387006828736-marker)) Alec Radford et al., “Improving Language Understanding by Generative Pre-Training,” June 11, 2018, [*https://openai.com/research/language-unsupervised*](https://openai.com/research/language-unsupervised). + id: totrans-239 prefs: [] type: TYPE_NORMAL - en: '^([3](ch09.xhtml#idm45387006370384-marker)) Jacob Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” October 11, 2018, [*https://arxiv.org/abs/1810.04805*](https://arxiv.org/abs/1810.04805).' + id: totrans-240 prefs: [] type: TYPE_NORMAL - en: '^([4](ch09.xhtml#idm45387006340992-marker)) Sheng Shen et al., “PowerNorm: Rethinking Batch Normalization in Transformers,” June 28, 2020, [*https://arxiv.org/abs/2003.07845*](https://arxiv.org/abs/2003.07845).' + id: totrans-241 prefs: [] type: TYPE_NORMAL - en: ^([5](ch09.xhtml#idm45387005361120-marker)) Colin Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” October 23, 2019, [*https://arxiv.org/abs/1910.10683*](https://arxiv.org/abs/1910.10683). + id: totrans-242 prefs: [] type: TYPE_NORMAL - en: ^([6](ch09.xhtml#idm45387005277024-marker)) Long Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback,” March 4, 2022, [*https://arxiv.org/abs/2203.02155*](https://arxiv.org/abs/2203.02155). + id: totrans-243 prefs: [] type: TYPE_NORMAL - en: '^([7](ch09.xhtml#idm45387005252672-marker)) Chenfei Wu et al., “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models,” March 8, 2023, [*https://arxiv.org/abs/2303.04671*](https://arxiv.org/abs/2303.04671).`' + id: totrans-244 prefs: [] type: TYPE_NORMAL