diff --git a/data/bd-llm-scr_1.yaml b/data/bd-llm-scr_1.yaml new file mode 100644 index 0000000..8f455f7 --- /dev/null +++ b/data/bd-llm-scr_1.yaml @@ -0,0 +1,1081 @@ +- en: 1 Understanding Large Language Models + id: totrans-0 + prefs: + - PREF_H1 + type: TYPE_NORMAL + zh: 1 理解大型语言模型 +- en: This chapter covers + id: totrans-1 + prefs: + - PREF_H3 + type: TYPE_NORMAL + zh: 本章包括 +- en: High-level explanations of the fundamental concepts behind large language models + (LLMs) + id: totrans-2 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 大型语言模型(LLM)背后的基本概念的高层次解释 +- en: Insights into the transformer architecture from which ChatGPT-like LLMs are + derived + id: totrans-3 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 探索ChatGPT类LLM源自的Transformer架构的深层次解释 +- en: A plan for building an LLM from scratch + id: totrans-4 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 从零开始构建LLM的计划 +- en: Large language models (LLMs) like ChatGPT are deep neural network models developed + over the last few years. They ushered in a new era for Natural Language Processing + (NLP). Before the advent of large language models, traditional methods excelled + at categorization tasks such as email spam classification and straightforward + pattern recognition that could be captured with handcrafted rules or simpler models. + However, they typically underperformed in language tasks that demanded complex + understanding and generation abilities, such as parsing detailed instructions, + conducting contextual analysis, or creating coherent and contextually appropriate + original text. For example, previous generations of language models could not + write an email from a list of keywords—a task that is trivial for contemporary + LLMs. + id: totrans-5 + prefs: [] + type: TYPE_NORMAL + zh: 像ChatGPT这样的大型语言模型(LLM)是在过去几年中开发的深度神经网络模型。它们引领了自然语言处理(NLP)的新时代。在大型语言模型出现之前,传统方法擅长于分类任务,如电子邮件垃圾分类和可以通过手工制作的规则或简单模型捕获的简单模式识别。然而,在需要复杂理解和生成能力的语言任务方面,例如解析详细说明、进行上下文分析或创建连贯且上下文适当的原始文本时,它们通常表现不佳。例如,以前的语言模型无法根据关键字列表编写电子邮件-这对于当代LLM来说是微不足道的任务。 +- en: LLMs have remarkable capabilities to understand, generate, and interpret human + language. However, it's important to clarify that when we say language models + "understand," we mean that they can process and generate text in ways that appear + coherent and contextually relevant, not that they possess human-like consciousness + or comprehension. + id: totrans-6 + prefs: [] + type: TYPE_NORMAL + zh: LLM具有出色的理解、生成和解释人类语言的能力。然而,重要的是澄清,当我们说语言模型“理解”时,我们指的是它们可以以看起来连贯和上下文相关的方式处理和生成文本,而不是它们具有类似人类的意识或理解能力。 +- en: Enabled by advancements in deep learning, which is a subset of machine learning + and artificial intelligence (AI) focused on neural networks, LLMs are trained + on vast quantities of text data. This allows LLMs to capture deeper contextual + information and subtleties of human language compared to previous approaches. + As a result, LLMs have significantly improved performance in a wide range of NLP + tasks, including text translation, sentiment analysis, question answering, and + many more. + id: totrans-7 + prefs: [] + type: TYPE_NORMAL + zh: 在深度学习的推动下,LLM受益于大量文本数据的训练。这使得LLM能够捕获比以前更深层次的语境信息和人类语言的微妙之处。因此,LLM在各种NLP任务中的性能显着提高,包括文本翻译、情感分析、问答等等。 +- en: Another important distinction between contemporary LLMs and earlier NLP models + is that the latter were typically designed for specific tasks; whereas those earlier + NLP models excelled in their narrow applications, LLMs demonstrate a broader proficiency + across a wide range of NLP tasks. 
+ id: totrans-8 + prefs: [] + type: TYPE_NORMAL + zh: 当代LLM与早期NLP模型之间的另一个重要区别是,后者通常是为特定任务而设计的;而早期的NLP模型在其狭窄应用中表现出色,LLM则在各种NLP任务中展示出更广泛的熟练程度。 +- en: The success behind LLMs can be attributed to the transformer architecture which + underpins many LLMs, and the vast amounts of data LLMs are trained on, allowing + them to capture a wide variety of linguistic nuances, contexts, and patterns that + would be challenging to manually encode. + id: totrans-9 + prefs: [] + type: TYPE_NORMAL + zh: LLM背后的成功归功于Transformer架构,该架构支撑了许多LLM,并且LLM训练的大量数据,使它们能够捕捉到各种语言细微差别、语境和模式,这些模式是难以手工编码的。 +- en: This shift towards implementing models based on the transformer architecture + and using large training datasets to train LLMs has fundamentally transformed + NLP, providing more capable tools for understanding and interacting with human + language. + id: totrans-10 + prefs: [] + type: TYPE_NORMAL + zh: 将模型基于Transformer架构实现,并使用大型训练数据集来训练LLM的这一转变,从根本上改变了NLP,为理解和与人类语言交互提供了更有能力的工具。 +- en: 'Beginning with this chapter, we set the foundation to accomplish the primary + objective of this book: understanding LLMs by implementing a ChatGPT-like LLM + based on the transformer architecture step by step in code.' + id: totrans-11 + prefs: [] + type: TYPE_NORMAL + zh: 从本章开始,我们为实现本书的主要目标奠定基础:通过逐步在代码中实现基于Transformer架构的ChatGPT样式LLM来理解LLM。 +- en: 1.1 What is an LLM? + id: totrans-12 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.1 什么是LLM? +- en: An LLM, a large language model, is a neural network designed to understand, + generate, and respond to human-like text. These models are deep neural networks + trained on massive amounts of text data, sometimes encompassing large portions + of the entire publicly available text on the internet. + id: totrans-13 + prefs: [] + type: TYPE_NORMAL + zh: LLM,即大型语言模型,是一种设计用于理解、生成和回应类似人类文本的神经网络。这些模型是在大量文本数据上训练的深度神经网络,有时包括互联网上整个可公开获取文本的大部分内容。 +- en: The "large" in large language model refers to both the model's size in terms + of parameters and the immense dataset on which it's trained. Models like this + often have tens or even hundreds of billions of parameters, which are the adjustable + weights in the network that are optimized during training to predict the next + word in a sequence. Next-word prediction is sensible because it harnesses the + inherent sequential nature of language to train models on understanding context, + structure, and relationships within text. Yet, it is a very simple task and so + it issurprising to many researchers that it can produce such capable models. We + will discuss and implement the next-word training procedure in later chapters + step by step. + id: totrans-14 + prefs: [] + type: TYPE_NORMAL + zh: '"大型"语言模型中的"大"既指模型在参数方面的规模,也指其所训练的庞大数据集。这样的模型通常具有数百亿甚至数百亿个参数,这些参数是网络中的可调权重,在训练过程中进行优化,以预测序列中的下一个词。下一个词的预测是合理的,因为它利用了语言固有的顺序性质来训练模型,使其理解文本中的上下文、结构和关系。然而,这是一个非常简单的任务,许多研究人员会感到惊讶的是,它能够产生如此有能力的模型。我们将在后续章节逐步讨论并实施下一个词的训练过程。' +- en: LLMs utilize an architecture called the *transformer* (covered in more detail + in section 1.4), which allows them to pay selective attention to different parts + of the input when making predictions, making them especially adept at handling + the nuances and complexities of human language. + id: totrans-15 + prefs: [] + type: TYPE_NORMAL + zh: LLMs利用一种称为*transformer*的架构(在第1.4节中更详细地介绍),使它们在进行预测时能够有选择地关注输入的不同部分,使其特别擅长处理人类语言的细微差别和复杂性。 +- en: Since LLMs are capable of *generating* text, LLMs are also often referred to + as a form of generative artificial intelligence (AI), often abbreviated as *generative + AI* or *GenAI*. 
As illustrated in figure 1.1, AI encompasses the broader field + of creating machines that can perform tasks requiring human-like intelligence, + including understanding language, recognizing patterns, and making decisions, + and includes subfields like machine learning and deep learning. + id: totrans-16 + prefs: [] + type: TYPE_NORMAL + zh: 由于LLMs能够*生成*文本,所以LLMs也经常被称为一种生成人工智能(AI)的形式,通常简称为*生成AI*或*GenAI*。如图1.1所示,AI涵盖了创建能够执行需要类似人类智能的任务的机器的更广泛领域,包括理解语言、识别模式和做决策,并包括诸如机器学习和深度学习之类的子领域。 +- en: Figure 1.1 As this hierarchical depiction of the relationship between the different + fields suggests, LLMs represent a specific application of deep learning techniques, + leveraging their ability to process and generate human-like text. Deep learning + is a specialized branch of machine learning that focuses on using multi-layer + neural networks. And machine learning and deep learning are fields aimed at implementing + algorithms that enable computers to learn from data and perform tasks that typically + require human intelligence. The field of artificial intelligence is nowadays dominated + by machine learning and deep learning but it also includes other approaches, for + example by using rule-based systems, genetic algorithms, expert systems, fuzzy + logic, or symbolic reasoning. + id: totrans-17 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.1 正如这一层次化的关系图所示,LLMs代表了深度学习技术的一种特定应用,利用它们处理和生成类似人类文本的能力。深度学习是一种专门的机器学习分支,专注于使用多层神经网络。机器学习和深度学习是旨在实现使计算机能够从数据中学习并执行通常需要人类智能的任务的算法领域。人工智能领域如今被机器学习和深度学习主导,但也包括其他方法,例如使用基于规则的系统、遗传算法、专家系统、模糊逻辑或符号推理。 +- en: '![](images/ch-01__image002.png)' + id: totrans-18 + prefs: [] + type: TYPE_IMG + zh: '![图1.1](images/ch-01__image002.png)' +- en: The algorithms used to implement AI are the focus of the field of machine learning. + Specifically, machine learning involves the development of algorithms that can + learn from and make predictions or decisions based on data without being explicitly + programmed. To illustrate this, imagine a spam filter as a practical application + of machine learning. Instead of manually writing rules to identify spam emails, + a machine learning algorithm is fed examples of emails labeled as spam and legitimate + emails. By minimizing theerror in its predictions on a training dataset, the model + then learns to recognize patterns and characteristics indicative of spam, enabling + it to classify new emails as either spam or legitimate. + id: totrans-19 + prefs: [] + type: TYPE_NORMAL + zh: 用于实现人工智能的算法是机器学习领域的重点。具体而言,机器学习涉及开发可以从数据中学习并基于数据进行预测或决策而无需明确编程的算法。为了说明这一点,可以将垃圾邮件过滤器作为机器学习的实际应用。与手动编写规则来识别垃圾邮件不同,机器学习算法将被提供标记为垃圾邮件和合法邮件的示例。通过在训练数据集上最小化其预测错误,模型可以学习识别与垃圾邮件相关的模式和特征,从而能够将新邮件分类为垃圾邮件或合法邮件。 +- en: Deep learning is a subset of machine learning that focuses on utilizing neural + networks with three or more layers (also called deep neural networks) to model + complex patterns and abstractions in data. In contrast to deep learning, traditional + machine learning requires manual feature extraction. This means that human experts + need to identify and select the most relevant features for the model. 
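To make the idea of manual feature extraction concrete before the spam example is walked through in prose below, here is a minimal, purely illustrative sketch; the specific features (trigger-word counts, exclamation marks, all-caps words) are hypothetical choices a human expert might make, not a prescribed recipe.

```python
# Illustrative sketch of the manual feature-extraction step that traditional
# machine learning relies on. The chosen features are hypothetical examples of
# what a human expert might hand-pick for a spam filter.

def extract_features(email_text):
    trigger_words = ("prize", "win", "free")
    words = email_text.lower().split()
    return {
        "num_trigger_words": sum(words.count(w) for w in trigger_words),
        "num_exclamation_marks": email_text.count("!"),
        "num_all_caps_words": sum(word.isupper() for word in email_text.split()),
    }

print(extract_features("WIN a FREE prize now!!!"))
# {'num_trigger_words': 3, 'num_exclamation_marks': 3, 'num_all_caps_words': 2}
```

A classifier trained on such hand-picked features can only ever be as good as the features the expert thought to define, which is precisely the manual step that deep learning removes.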
+ id: totrans-20 + prefs: [] + type: TYPE_NORMAL + zh: 深度学习是机器学习的一个子集,专注于利用有三层或更多层的神经网络(也称为深度神经网络)来对数据中的复杂模式和抽象进行建模。与深度学习相反,传统机器学习需要手动提取特征。这意味着人类专家需要识别和选择对模型最相关的特征。 +- en: Returning to the spam classification example, in traditional machine learning, + human experts might manually extract features from email text such as the frequency + of certain trigger words ("prize," "win," "free"), the number of exclamation marks, + use of all uppercase words, or the presence of suspicious links. This dataset, + created based on these expert-defined features, would then be used to train the + model. In contrast to traditional machine learning, deep learning does not require + manual feature extraction. This means that human experts do not need to identify + and select the most relevant features for a deep learning model + id: totrans-21 + prefs: [] + type: TYPE_NORMAL + zh: 回顾垃圾邮件分类的例子,在传统机器学习中,人类专家可能会从电子邮件文本中手动提取特征,例如特定触发词("prize","win","free")的频率,感叹号的数量,使用全大写单词或怀疑链接的存在。基于这些专家定义的特征创建的数据集将用于训练模型。与传统机器学习相比,深度学习不需要手动提取特征。这意味着人类专家不需要为深度学习模型识别和选择最相关的特征。 +- en: The upcoming sections will cover some of the problems LLMs can solve today, + the challenges that LLMs address, and the general LLM architecture, which we will + implement in this book. + id: totrans-22 + prefs: [] + type: TYPE_NORMAL + zh: 接下来的几节将涵盖LLM(大型语言模型)今天可以解决的一些问题,LLM解决的挑战,以及我们将在本书中实现的通用LLM架构。 +- en: 1.2 Applications of LLMs + id: totrans-23 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.2 LLM的应用 +- en: Owing to their advanced capabilities to parse and understand unstructured text + data, LLMs have a broad range of applications across various domains. Today, LLMs + are employed for machine translation, generation of novel texts (see figure 1.2), + sentiment analysis, text summarization, and many other tasks. LLMs have recently + been used for content creation, such as writing fiction, articles, and even computer + code. + id: totrans-24 + prefs: [] + type: TYPE_NORMAL + zh: 由于LLM具有解析和理解非结构化文本数据的高级能力,LLM在各个领域都有广泛的应用。目前,LLM被应用于机器翻译、生成新颖文本(参见图1.2)、情感分析、文本摘要和许多其他任务。LLM最近还用于内容创作,如写小说、文章甚至是计算机代码。 +- en: Figure 1.2 LLM interfaces enable natural language communication between users + and AI systems. This screenshot shows ChatGPT writing a poem that according to + a user's specifications. + id: totrans-25 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.2 LLM接口实现了用户和人工智能系统之间的自然语言交流。该截图显示ChatGPT根据用户的规格要求写诗。 +- en: '![](images/ch-01__image004.png)' + id: totrans-26 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image004.png)' +- en: LLMs can also power sophisticated chatbots and virtual assistants, such as OpenAI's + ChatGPT or Google's Bard, which can answer user queries and augment traditional + search engines such as Google Search or Microsoft Bing. + id: totrans-27 + prefs: [] + type: TYPE_NORMAL + zh: LLM还可以为复杂的聊天机器人和虚拟助手提供动力,例如OpenAI的ChatGPT或Google的Bard,它们可以回答用户提问并增强传统搜索引擎(如Google + Search或Microsoft Bing)。 +- en: Moreover, LLMs may be used for effective knowledge retrieval from vast volumes + of text in specialized areas such as medicine or law. This includes sifting through + documents, summarizing lengthy passages, and answering technical questions. + id: totrans-28 + prefs: [] + type: TYPE_NORMAL + zh: 此外,LLM可以用于有效地从专业领域的大量文本中检索知识,如医学或法律。这包括筛选文件、总结长篇文章和回答技术问题。 +- en: In short, LLMs are invaluable for automating almost any task that involves parsing + and generating text. 
Their applications are virtually endless, and as we continue + to innovate and explore new ways to use these models, it's clear that LLMs have + the potential to redefine our relationship with technology, making it more conversational, + intuitive, and accessible. + id: totrans-29 + prefs: [] + type: TYPE_NORMAL + zh: 简而言之,LLM对于自动化几乎任何涉及解析和生成文本的任务都是无价的。它们的应用几乎是无限的,随着我们不断创新和探索使用这些模型的新方法,很明显,LLM有潜力重新定义我们与技术的关系,使其更具对话性、直观和可访问性。 +- en: In this book, we will focus on understanding how LLMs work from the ground up, + coding an LLM that can generate texts. We will also learn about techniques that + allow LLMs to carry out queries, ranging from answering questions to summarizing + text, translating text into different languages, and more. In other words, in + this book, we will learn how complex LLM assistants such as ChatGPT work by building + one step by step. + id: totrans-30 + prefs: [] + type: TYPE_NORMAL + zh: 在本书中,我们将重点关注从零开始理解LLM(大型语言模型)的工作原理,编写一个能够生成文本的LLM。我们还将学习一些技术,使LLM能够进行各种查询,从回答问题到总结文本、将文本翻译成不同语言等等。换句话说,在本书中,我们将通过逐步构建一个LLM来了解复杂的LLM助手(如ChatGPT)是如何工作的。 +- en: 1.3 Stages of building and using LLMs + id: totrans-31 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.3 构建和使用LLM的阶段 +- en: Why should we build our own LLMs? Coding an LLM from the ground up is an excellent + exercise to understand its mechanics and limitations. Also, it equips us with + the required knowledge for pretaining or finetuning existing open-source LLM architectures + to our own domain-specific datasets or tasks. + id: totrans-32 + prefs: [] + type: TYPE_NORMAL + zh: 我们为什么要构建自己的LLM呢?从零开始编写一个LLM是一个很好的练习,可以理解其机制和局限性。此外,这使我们具备了必要的知识,可以对现有的开源LLM架构进行预训练或微调,以适应我们自己的领域特定数据集或任务。 +- en: Research has shown that when it comes to modeling performance, custom-built + LLMs—those tailored for specific tasks or domains—can outperform general-purpose + LLMs like ChatGPT, which are designed for a wide array of applications. Examples + of this include BloombergGPT, which is specialized for finance, and LLMs that + are tailored for medical question answering (please see the *Further Reading and + References* section at the end of this chapter for more details). + id: totrans-33 + prefs: [] + type: TYPE_NORMAL + zh: 研究表明,就建模性能而言,定制的LLM——针对特定任务或领域定制的LLM——可能会优于ChatGPT等通用LLM,后者设计用于广泛的应用。其中的例子包括专门用于金融领域的BloombergGPT,以及专门用于医学问题回答的LLM(请参阅本章末尾的*进一步阅读和参考*部分了解更多细节)。 +- en: The general process of creating an LLM, including pretraining and finetuning. + The term "pre" in "pretraining" refers to the initial phase where a model like + an LLM is trained on a large, diverse dataset to develop a broad understanding + of language. This pretrained model then serves as a foundational resource that + can be further refined through finetuning, a process where the model is specifically + trained on a narrower dataset that is more specific to particular tasks or domains. + This two-stage training approach consisting of pretraining and finetuning is depicted + in figure 1.3. + id: totrans-34 + prefs: [] + type: TYPE_NORMAL + zh: 创建LLM的一般过程,包括预训练和微调。在“预训练”中的“pre”一词指的是初始阶段,其中像LLM这样的模型在大型、多样的数据集上进行训练,以开发对语言的广泛理解。然后,这个预训练模型作为一个基础资源,可以通过微调进一步完善,微调是指模型在更具体于特定任务或领域的较窄数据集上进行专门训练的过程。图1.3展示了这个包括预训练和微调的两阶段训练方法。 +- en: Figure 1.3 Pretraining an LLM involves next-word prediction on large unlabeled + text corpora (raw text). A pretrained LLM can then be finetuned using a smaller + labeled dataset. 
+ id: totrans-35 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.3 对LLM进行预训练包括对大型未标记文本语料库(原始文本)进行下一个词预测。然后,可以使用较小的标记数据集对预训练的LLM进行微调。 +- en: '![](images/ch-01__image006.png)' + id: totrans-36 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image006.png)' +- en: As illustrated in figure 1.3, the first step in creating an LLM is to train + it in on a large corpus of text data, sometimes referred to as *raw* text. Here, + "raw" refers to the fact that this data is just regular text without any labeling + information[[1]](#_ftn1). (Filtering may be applied, such as removing formatting + characters or documents in unknown languages.) + id: totrans-37 + prefs: [] + type: TYPE_NORMAL + zh: 如图1.3所示,创建LLM的第一步是对大量文本数据进行训练,有时被称为*原始*文本。这里,“原始”指的是这些数据只是普通文本,没有任何标记信息[[1]](#_ftn1)。(可能会应用过滤,如删除格式字符或未知语言的文档。) +- en: This first training stage of an LLM is also known as *pretraining*, creating + an initial pretrained LLM, often called a *bas*e or *foundation* *model*. A typical + example of such a model is the GPT-3 model (the precursor of ChatGPT). This model + is capable of text completion, that is, finishing a half-written sentence provided + by a user. It also has limited few-shot capabilities, which means it can learn + to perform new tasks based on only a few examples instead of needing extensive + training data. This is further illustrated in the next section*, Using transformers + for different tasks*. + id: totrans-38 + prefs: [] + type: TYPE_NORMAL + zh: LLM的第一个训练阶段也称为*预训练*,创建一个初始的预训练LLM,通常称为*基础*模型或*基础*模型。这种模型的典型例子是GPT-3模型(ChatGPT的前身)。该模型能够完成文本,即完成用户提供的半写句。它还具有有限的few-shot能力,这意味着它可以根据少量示例学习执行新任务,而不需要大量训练数据。这在接下来的部分*为不同任务使用变换器*中进一步阐述。 +- en: After obtaining a *pretrained* LLM from training on unlabeled texts, we can + further train the LLM on labeled data, also known as *finetuning*. + id: totrans-39 + prefs: [] + type: TYPE_NORMAL + zh: 从在未标记文本上训练的*预训练*LLM中获得之后,我们可以进一步在标记数据上训练LLM,也称为*微调*。 +- en: The two most popular categories of finetuning LLMs include *instruction-finetuning* + and finetuning for *classification* tasks. In instruction-finetuning, the labeled + dataset consists of instruction and answer pairs, such as a query to translate + a text accompanied by the correctly translated text. In classification finetuning, + the labeled dataset consists of texts and associated class labels, for example, + emails associated with *spam* and *non-spa*m labels. + id: totrans-40 + prefs: [] + type: TYPE_NORMAL + zh: 用于微调LLM的两个最流行的类别包括*指导微调*和用于*分类*任务的微调。在指导微调中,标记的数据集包括指导和答案对,例如需要翻译文本的查询和正确翻译文本。在分类微调中,标记的数据集包括文本和相关的类别标签,例如与*垃圾邮件*和*非垃圾邮件*标签相关联的电子邮件。 +- en: In this book, we will cover both code implementations for pretraining and finetuning + LLM, and we will delve deeper into the specifics of instruction-finetuning and + finetuning for classification later in this book after pretraining a base LLM. + id: totrans-41 + prefs: [] + type: TYPE_NORMAL + zh: 在本书中,我们将涵盖预训练和微调LLM的代码实现,并且我们会更深入地研究指导微调和分类微调的具体内容,这将在本书中在预训练基础LLM后进行。 +- en: 1.4 Using LLMs for different tasks + id: totrans-42 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.4 为不同任务使用LLM +- en: Most modern LLMs rely on the *transformer* architecture, which is a deep neural + network architecture introduced in the 2017 paper *Attention Is All You Need*. + To understand LLMs we briefly have to go over the original transformer, which + was originally developed for machine translation, translating English texts to + German and French. A simplified version of the transformer architecture is depicted + in figure 1.4. 
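Before turning to the transformer architecture in detail, the two finetuning dataset flavors described in section 1.3 can be made concrete with a small, purely illustrative sketch; the field names and example records below are hypothetical and not a required schema.

```python
# Hypothetical examples of the two finetuning dataset flavors described above.

# Instruction finetuning: instruction-answer pairs.
instruction_examples = [
    {"instruction": "Translate the following sentence into German: This is an example.",
     "answer": "Das ist ein Beispiel."},
]

# Classification finetuning: texts paired with class labels.
classification_examples = [
    {"text": "Congratulations, you won a free prize!", "label": "spam"},
    {"text": "Can we move tomorrow's meeting to 3 pm?", "label": "not spam"},
]
```

Both formats supply explicit labels (an answer or a class), which is what distinguishes finetuning data from the unlabeled raw text used for pretraining.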
+ id: totrans-43 + prefs: [] + type: TYPE_NORMAL + zh: 大多数现代LLM依赖*变换器*架构,这是一种深度神经网络架构,首次引入于2017年的论文*Attention Is All You Need*。要理解LLM,我们需要简要回顾原始变换器,它最初用于机器翻译,将英文文本翻译成德语和法语。图1.4描述了变换器架构的简化版本。 +- en: Figure 1.4 A simplified depiction of the original transformer architecture, + which is a deep learning model for language translation. The transformer consists + of two parts, an encoder that processes the input text and produces an embedding + representation (a numerical representation that captures many different factors + in different dimensions) of the text that the decoder can use to generate the + translated text one word at a time. Note that this figure shows the final stage + of the translation process where the decoder has to generate only the final word + ("Beispiel"), given the original input text ("This is an example") and a partially + translated sentence ("Das ist ein"), to complete the translation. The figure numbering + indicates the sequence in which the data is processed and provides guidance on + the optimal order to read the figure. + id: totrans-44 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.4 原始变换器架构的简化描述,这是一个用于语言翻译的深度学习模型。变换器由两部分组成,一个处理输入文本并生成嵌入表示的编码器(捕捉许多不同因素在不同维度中的数字表示)和一个解码器,后者可以使用该表示来逐字生成翻译文本。请注意,该图显示了翻译过程的最终阶段,其中解码器只需生成最终单词("Beispiel"),给定原始输入文本("This + is an example")和部分翻译的句子("Das ist ein"),以完成翻译。图中的编号指示数据处理的顺序,并提供有关最佳阅读图的指导。 +- en: '![](images/ch-01__image008.png)' + id: totrans-45 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image008.png)' +- en: The transformer architecture depicted in figure 1.4 consists of two submodules, + an encoder and a decoder. The encoder module processes the input text and encodes + it into a series of numerical representations or vectors that capture the contextual + information of the input. Then, the decoder module takes these encoded vectors + and generates the output text from them. In a translation task, for example, the + encoder would encode the text from the source language into vectors, and the decoder + would decode these vectors to generate text in the target language.. Both the + encoder and decoder consist of many layers connected by a so-called self-attention + mechanism. You may have many questions regarding how the inputs are preprocessed + and encoded. These will be addressed in a step-by-step implementation in the subsequent + chapters. + id: totrans-46 + prefs: [] + type: TYPE_NORMAL + zh: 图 1.4 中描绘的 transformer 架构由两个子模块组成,一个编码器和一个解码器。编码器模块处理输入文本并将其编码为一系列捕捉输入上下文信息的数值表示或向量。然后,解码器模块会从这些编码向量中生成输出文本。例如,在翻译任务中,编码器会将源语言的文本编码成向量,解码器则会将这些向量解码为目标语言的文本。编码器和解码器都由许多层连接的所谓自注意机制组成。关于如何预处理和编码输入,你可能有很多问题。这些将在随后的章节中逐步实现中得到解答。 +- en: A key component of transformers and LLMs is the self-attention mechanism (not + shown), which allows the model to weigh the importance of different words or tokens + in a sequence relative to each other. This mechanism enables the model to capture + long-range dependencies and contextual relationships within the input data, enhancing + its ability to generate coherent and contextually relevant output. However, due + to its complexity, we will defer the explanation to Chapter 3, where we will discuss + and implement it step by step. Moreover, we will also discuss and implement the + data preprocessing steps to create the model inputs in *Chapter 2, Working with + Text Data*. 
+ id: totrans-47 + prefs: [] + type: TYPE_NORMAL + zh: transformer 和 LLMs 的关键组成部分是自注意机制(未显示),它允许模型权衡序列中不同单词或标记的重要性相对于彼此。这一机制使得模型能够捕捉长距离依赖和输入数据中的上下文关系,增强了生成连贯和有上下文相关性输出的能力。然而,由于其复杂性,我们将在第三章详细讨论并逐步实现这一解释。此外,我们还将在《第二章,处理文本数据》中讨论和实现数据预处理步骤来创建模型输入。 +- en: Later variants of the transformer architecture, such as the so-called BERT (short + for *bidirectional encoder representations from transformers*) and the various + GPT models (short for *generative pretrained transformers*), built on this concept + to adapt this architecture for different tasks. (References can be found in the + *Further Reading* section at the end of this chapter.) + id: totrans-48 + prefs: [] + type: TYPE_NORMAL + zh: 后来的变种 transformer 架构,如所谓的 BERT(*双向编码器表示来自 transformer*)和各种 GPT 模型(*生成式预训练 transformer*),建立在这一概念上,以适应不同任务的体系结构。 + (参考可以在本章结束处的*进一步阅读*部分找到。) +- en: BERT, which is built upon the original transformer's encoder submodule, differs + in its training approach from GPT. While GPT is designed for generative tasks, + BERT and its variants specialize in masked word prediction, where the model predicts + masked or hidden words in a given sentence as illustrated in figure 1.5\. This + unique training strategy equips BERT with strengths in text classification tasks, + including sentiment prediction and document categorization. As an application + of its capabilities, as of this writing, Twitter uses BERT to detect toxic content. + id: totrans-49 + prefs: [] + type: TYPE_NORMAL + zh: BERT,它是建立在原始 transformer 编码器子模块基础之上的,与 GPT 在训练方法上有所不同。虽然 GPT 设计用于生成任务,BERT 及其变体专门用于掩码词预测,模型会预测给定句子中的掩码或隐藏单词,如图 + 1.5 所示。这种独特的训练策略使得 BERT 在文本分类任务中具有优势,包括情绪预测和文档分类。截至目前,Twitter 使用 BERT 来检测有害内容,这是其能力的一个应用。 +- en: Figure 1.5 A visual representation of the transformer's encoder and decoder + submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which + focus on masked word prediction and are primarily used for tasks like text classification. + On the right, the decoder segment showcases GPT-like LLMs, designed for generative + tasks and producing coherent text sequences. + id: totrans-50 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图 1.5 transformer 的编码器和解码器子模块的可视化表示。在左侧,编码器部分举例说明了类似 BERT 的 LLMs,其专注于掩码词预测,主要用于文本分类等任务。在右侧,解码器部分展示了类似 + GPT 的 LLMs,设计用于生成任务并生成连贯的文本序列。 +- en: '![](images/ch-01__image010.png)' + id: totrans-51 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image010.png)' +- en: GPT, on the other hand, focuses on the decoder portion of the original transformer + architecture and is designed for tasks that require generating texts. This includes + machine translation, text summarization, fiction writing, writing computer code, + and more. We will discuss the GPT architecture in more detail in the remaining + sections of this chapter and implement it from scratch in this book. + id: totrans-52 + prefs: [] + type: TYPE_NORMAL + zh: 另一方面,GPT侧重于原始变压器架构的解码器部分,旨在处理需要生成文本的任务。这包括机器翻译、文本摘要、虚构写作、编写计算机代码等。我们将在本章的其余部分更详细地讨论GPT架构,并在本书中从头开始实现它。 +- en: GPT models, primarily designed and trained to perform text completion tasks, + also show remarkable versatility in their capabilities. These models are adept + at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers + to the ability to generalize to completely unseen tasks without any prior specific + examples. On the other hand, few-shot learning involves learning from a minimal + number of examples the user provides as input, as shown in figure 1.6. 
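The difference between the two settings is easiest to see in the prompts themselves; the following strings are hypothetical illustrations of the contrast that figure 1.6 depicts.

```python
# Hypothetical prompts illustrating the two settings; the wording is made up.

# Zero-shot: the task is described, but no worked example is provided.
zero_shot_prompt = "Translate the following word into German: cheese"

# Few-shot: a few input-output examples precede the actual query.
few_shot_prompt = (
    "Translate English to German:\n"
    "sea otter => Seeotter\n"
    "cheese =>"
)
```

In both cases the model simply continues the given text; no retraining, finetuning, or architecture change is involved.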
+ id: totrans-53 + prefs: [] + type: TYPE_NORMAL + zh: GPT模型主要设计和训练用于执行文本完成任务,同时在其能力上显示出出色的多功能性。这些模型擅长执行零样本和少样本学习任务。零样本学习指的是在没有任何先前特定示例的情况下对完全不可见任务进行概括的能力。另一方面,少样本学习涉及从用户提供的最少数量示例中学习,如图1.6所示。 +- en: Figure 1.6 Next to text completion, GPT-like LLMs can solve various tasks based + on their inputs without needing retraining, finetuning, or task-specific model + architecture changes. Sometimes, it is helpful to provide examples of the target + within the input, which is known as a few-shot setting. However, GPT-like LLMs + are also capable of carrying out tasks without a specific example, which is called + zero-shot setting. + id: totrans-54 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.6 除了文本完成之外,类似GPT的LLM可以根据其输入解决各种任务,无需重新训练、微调或特定于任务的模型架构更改。有时,在输入中提供目标示例是有帮助的,这被称为少样本设置。然而,类似GPT的LLM也能够在没有具体示例的情况下执行任务,这被称为零样本设置。 +- en: '![](images/ch-01__image012.png)' + id: totrans-55 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image012.png)' +- en: Transformers versus LLMs + id: totrans-56 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 变压器与LLM +- en: Today's LLMs are based on the transformer architecture introduced in the previous + section. Hence, transformers and LLMs are terms that are often used synonymously + in the literature. However, note that not all transformers are LLMs since transformers + can also be used for computer vision. Also, not all LLMs are transformers, as + there are large language models based on recurrent and convolutional architectures. + The main motivation behind these alternative approaches is to improve the computational + efficiency of LLMs. However, whether these alternative LLM architectures can compete + with the capabilities of transformer-based LLMs and whether they are going to + be adopted in practice remains to be seen. (Interested readers can find literature + references describing these architectures in the *Further Reading* section at + the end of this chapter.) + id: totrans-57 + prefs: [] + type: TYPE_NORMAL + zh: 当前的LLM基于前面介绍的变压器架构。因此,在文献中常常将变压器和LLM等术语用作同义词。然而,需要注意的是,并非所有的变压器都是LLM,因为变压器也可以用于计算机视觉。另外,并非所有的LLM都是变压器,因为还有基于循环和卷积架构的大型语言模型。这些替代方法背后的主要动机是提高LLM的计算效率。然而,这些替代的LLM架构是否能与基于变压器的LLM的能力竞争,并且它们是否会被实际采用还有待观察。(感兴趣的读者可以在本章末尾的*进一步阅读*部分找到描述这些架构的文献引用。) +- en: 1.5 Utilizing large datasets + id: totrans-58 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.5 利用大型数据集 +- en: The large training datasets for popular GPT- and BERT-like models represent + diverse and comprehensive text corpora encompassing billions of words, which include + a vast array of topics and natural and computer languages. To provide a concrete + example, table 1.1 summarizes the dataset used for pretraining GPT-3, which served + as the base model for the first version of ChatGPT. 
+ id: totrans-59 + prefs: [] + type: TYPE_NORMAL + zh: 流行的GPT和BERT等模型的大型训练数据集包含数十亿字的多样化和全面的文本语料库,涵盖了大量主题以及自然语言和计算机语言。为了提供一个具体的例子,表1.1总结了用于预训练GPT-3的数据集,这为ChatGPT的第一个版本提供了基础模型。 +- en: Table 1.1 The pretraining dataset of the popular GPT-3 LLM + id: totrans-60 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 表1.1 流行的GPT-3 LLM的预训练数据集 +- en: '| Dataset name | Dataset description | Number of tokens | Proportion in training + data |' + id: totrans-61 + prefs: [] + type: TYPE_TB + zh: '| 数据集名称 | 数据集描述 | 标记数量 | 在训练数据中的比例 |' +- en: '| CommonCrawl (filtered) | Web crawl data | 410 billion | 60% |' + id: totrans-62 + prefs: [] + type: TYPE_TB + zh: '| CommonCrawl(经过过滤) | 网络爬虫数据 | 4100亿 | 60% |' +- en: '| WebText2 | Web crawl data | 19 billion | 22% |' + id: totrans-63 + prefs: [] + type: TYPE_TB + zh: '| 网页文本2 | 网络爬虫数据 | 190亿 | 22% |' +- en: '| Books1 | Internet-based book corpus | 12 billion | 8% |' + id: totrans-64 + prefs: [] + type: TYPE_TB + zh: '| 图书1 | 基于互联网的图书语料库 | 120亿 | 8% |' +- en: '| Books2 | Internet-based book corpus | 55 billion | 8% |' + id: totrans-65 + prefs: [] + type: TYPE_TB + zh: '| 图书2 | 基于互联网的图书语料库 | 550亿 | 8% |' +- en: '| Wikipedia | High-quality text | 3 billion | 3% |' + id: totrans-66 + prefs: [] + type: TYPE_TB + zh: '| 维基百科 | 高质量文本 | 30亿 | 3% |' +- en: Table 1.1 reports the number of tokens, where a token is a unit of text that + a model reads, and the number of tokens in a dataset is roughly equivalent to + the number of words and punctuation characters in the text. We will cover tokenization, + the process of converting text into tokens, in more detail in the next chapter. + id: totrans-67 + prefs: [] + type: TYPE_NORMAL + zh: 表1.1报告了标记的数量,其中一个标记是模型读取的文本单位,数据集中的标记数量大致相当于文本中的单词和标点符号的数量。我们将在下一章更详细地介绍标记化的过程,即将文本转换为标记的过程。 +- en: The main takeaway is that the scale and diversity of this training dataset allows + these models to perform well on diverse tasks including language syntax, semantics, + and context, and even some requiring general knowledge. + id: totrans-68 + prefs: [] + type: TYPE_NORMAL + zh: 主要的要点是这个训练数据集的规模和多样性,使这些模型在包括语言句法、语义和内容的各种任务上表现良好,甚至包括一些需要一般知识的任务。 +- en: GPT-3 dataset details + id: totrans-69 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: GPT-3数据集详细信息 +- en: Note that each subset in table 1.1 was sampled 300 billion tokens, which implies + that not all datasets were seen completely, and some were seen multiple times. + The proportion column, ignoring rounding, adds to 100%. For reference, the 410 + billion tokens in the CommonCrawl dataset require approximately 570 GB of storage. + Later models based on GPT-3, for example, Meta's LLaMA, also include research + papers from Arxiv (92 GB) and code-related Q&As from StackExchange (78 GB). + id: totrans-70 + prefs: [] + type: TYPE_NORMAL + zh: 请注意,表1.1中的每个子集都是抽样自3000亿个标记,这意味着并非所有数据集都完全被看到,有些甚至被多次看到。除四舍五入之外,比例列加起来为100%。作为参考,CommonCrawl数据集中的4100亿个标记大约需要570GB的存储空间。基于GPT-3的后续模型,如Meta的LLaMA,还包括来自Arxiv的研究论文(92GB)和来自StackExchange的与代码相关的问答(78GB)。 +- en: The Wikipedia corpus consists of English-language Wikipedia. While the authors + of the GPT-3 paper didn't further specify the details, Books1 is likely a sample + from Project Gutenberg ([https://www.gutenberg.org/](www.gutenberg.org.html)), + and Books2 is likely from Libgen ([https://en.wikipedia.org/wiki/Library_Genesis](wiki.html)). + CommonCrawl is a filtered subset of the CommonCrawl database ([https://commoncrawl.org/](commoncrawl.org.html)), + and WebText2 is the text of web pages from all outbound Reddit links from posts + with 3+ upvotes. 
+ id: totrans-71 + prefs: [] + type: TYPE_NORMAL + zh: 维基百科语料库由英语维基百科组成。虽然GPT-3论文的作者没有进一步说明细节,但Books1很可能是从古登堡计划([https://www.gutenberg.org/](www.gutenberg.org.html))中抽样而来,而Books2很可能是来自Libgen([https://en.wikipedia.org/wiki/Library_Genesis](wiki.html))。CommonCrawl是CommonCrawl数据库的筛选子集([https://commoncrawl.org/](commoncrawl.org.html)),而WebText2是来自帖子中出现过3个以上赞的Reddit链接的网页文本。 +- en: The authors of the GPT-3 paper did not share the training dataset but a comparable + dataset that is publicly available is The Pile ([https://pile.eleuther.ai/](pile.eleuther.ai.html)). + However, the collection may contain copyrighted works, and the exact usage terms + may depend on the intended use case and country. For more information, see the + HackerNews discussion at [https://news.ycombinator.com/item?id=25607809](news.ycombinator.com.html). + id: totrans-72 + prefs: [] + type: TYPE_NORMAL + zh: GPT-3论文的作者没有公开训练数据集,但一个可比较的公开可用的数据集是The Pile([https://pile.eleuther.ai/](pile.eleuther.ai.html))。不过,这个收集可能包含有版权作品,而且确切的使用条款可能取决于使用案例和国家。有关更多信息,请参阅HackerNews上的讨论[https://news.ycombinator.com/item?id=25607809](news.ycombinator.com.html)。 +- en: The pretrained nature of these models makes them incredibly versatile for further + finetuning on downstream tasks, which is why they are also known as base or foundation + models. Pretraining LLMs requires access to significant resources and is very + expensive. For example, the GPT-3 pretraining cost is estimated to be $4.6 million + in terms of cloud computing credits[[2]](#_ftn2). + id: totrans-73 + prefs: [] + type: TYPE_NORMAL + zh: 这些模型的预训练性使它们在进一步微调下游任务时变得非常灵活,这也是它们被称为基础模型的原因。预训练LLM需要大量资源,并且代价非常高昂。例如,据估计GPT-3的预训练成本为460万美元的云计算费用[[2]](#_ftn2)。 +- en: The good news is that many pretrained LLMs, available as open-source models, + can be used as general purpose tools to write, extract, and edit texts that were + not part of the training data. Also, LLMs can be finetuned on specific tasks with + relatively smaller datasets, reducing the computational resources needed and improving + performance on the specific task. + id: totrans-74 + prefs: [] + type: TYPE_NORMAL + zh: 好消息是,许多预训练的LLM模型可以作为通用工具用于写作、提取和编辑不属于训练数据的文本,并且这些模型也可以在相对较小的数据集上进行微调,以降低所需的计算资源,并且改善在特定任务上的性能。 +- en: In this book, we will implement the code for pretraining and use it to pretrain + an LLM for educational purposes.. All computations will be executable on consumer + hardware. After implementing the pretraining code we will learn how to reuse openly + available model weights and load them into the architecture we will implement, + allowing us to skip the expensive pretraining stage when we finetune LLMs later + in this book. + id: totrans-75 + prefs: [] + type: TYPE_NORMAL + zh: 在本书中,我们将实现用于预训练的代码,并将其用于教育目的。所有计算都可以在消费者硬件上执行。在实现预训练代码之后,我们将学习如何重用公开可用的模型权重,并将它们加载到我们将要实现的架构中,从而使我们能够在本书后期微调LLM时跳过昂贵的预训练阶段。 +- en: 1.6 A closer look at the GPT architecture + id: totrans-76 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.6 更详细地了解GPT架构 +- en: 'Previously in this chapter, we mentioned the terms GPT-like models, GPT-3, + and ChatGPT. Let''s now take a closer look at the general GPT architecture. 
First, + GPT stands for ***G***enerative ***P***retrained ***T***ransformer and was originally + introduced in the following paper:' + id: totrans-77 + prefs: [] + type: TYPE_NORMAL + zh: 在本章中之前,我们提到了类似GPT模型、GPT-3和ChatGPT的术语。现在让我们更仔细地看一下通用的GPT架构。首先,GPT代表***G***enerative + ***P***retrained ***T***ransformer,最初是在以下论文中介绍的: +- en: '*Improving Language Understanding by Generative Pre-Training* (2018) by *Radford + et al.* from OpenAI, [http://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf](language-unsupervised.html)' + id: totrans-78 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*通过生成预训练提高语言理解* (2018) 由OpenAI的*Radford等人*提出,[http://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf](language-unsupervised.html)' +- en: GPT-3 is a scaled-up version of this model that has more parameters and was + trained on a larger dataset. And the original ChatGPT model was created by finetuning + GPT-3 on a large instruction dataset using a method from OpenAI'sInstructGPT paper, + which we will cover in more detail in *Chapter 8, Finetuning with Human Feedback + To Follow Instructions*. As we have seen earlier in figure 1.6, these models are + competent text completion models and can carry out other tasks such as spelling + correction, classification, or language translation. This is actually very remarkable + given that GPT models are pretrained on a relatively simple next-word prediction + task, as illustrated in figure 1.7. + id: totrans-79 + prefs: [] + type: TYPE_NORMAL + zh: GPT-3是该模型的一个规模扩大版本,具有更多的参数,并且是在一个更大的数据集上训练的。而最初的ChatGPT模型是通过对GPT-3在一个大型指导数据集上进行微调而创建的,使用了OpenAI的InstructGPT论文中的一种方法,我们将在*第8章,通过人类反馈微调以遵循指示*中详细介绍这种方法。正如我们在图1.6中早期看到的那样,这些模型是称职的文本补全模型,并且可以执行其他任务,比如拼写校正、分类或语言翻译。鉴于GPT模型是在一个相对简单的下一个单词预测任务上进行预训练的,正如图1.7所示,这实际上是非常了不起的。 +- en: Figure 1.7 In the next-word pretraining task for GPT models, the system learns + to predict the upcoming word in a sentence by looking at the words that have come + before it. This approach helps the model understand how words and phrases typically + fit together in language, forming a foundation that can be applied to various + other tasks. + id: totrans-80 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.7 在GPT模型的下一个单词预训练任务中,系统通过查看之前出现过的单词来预测句子中即将出现的单词。这种方法有助于模型理解单词和短语在语言中通常是如何配合使用的,形成一个可以应用于各种其他任务的基础。 +- en: '![](images/ch-01__image014.png)' + id: totrans-81 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image014.png)' +- en: 'The next-word prediction task is a form of self-supervised learning, which + is a form of self-labeling. This means that we don''t need to collect labels for + the training data explicitly but can leverage the structure of the data itself: + we can use the next word in a sentence or document as the label that the model + is supposed to predict. Since this next-word prediction task allows us to create + labels "on the fly," it is possible to leverage massive unlabeled text datasets + to train LLMs as previously discussed in section *1.5, Utilizing large datasets*.' + id: totrans-82 + prefs: [] + type: TYPE_NORMAL + zh: 下一个单词预测任务是一种自监督学习形式,是一种自我标记形式。这意味着我们不需要显式地为训练数据收集标签,而是可以利用数据本身的结构:我们可以使用句子或文档中的下一个单词作为模型要预测的标签。由于这个下一个单词预测任务允许我们"即兴"地创建标签,所以可以利用大规模的未标记文本数据集来训练LLM,如前面第1.5节中所讨论的。 +- en: Compared to the original transformer architecture we covered in section 1.4, + *Using LLMs for different tasks*, the general GPT architecture is relatively simple. + Essentially, it's just the decoder part without the encoder as illustrated in + figure 1.8\. 
Since decoder-style models like GPT generate text by predicting text + one word at a time, they are considered a type of autoregressive model. + id: totrans-83 + prefs: [] + type: TYPE_NORMAL + zh: 与我们在第1.4节中介绍的原始Transformer架构相比,*使用LLM执行不同任务*,通用GPT架构相对简单。从本质上讲,它只是解码器部分,没有编码器,如图1.8所示。由于像GPT这样的解码器样式模型通过逐字预测文本生成文本,因此它们被认为是一种自回归模型。 +- en: Architectures such as GPT-3 are also significantly larger than the original + transformer model. For instance, the original transformer repeated the encoder + and decoder blocks six times. GPT-3 has 96 transformer layers and 175 billion + parameters in total. + id: totrans-84 + prefs: [] + type: TYPE_NORMAL + zh: 诸如GPT-3之类的架构也比原始的transformer模型要大得多。例如,原始的transformer将编码器和解码器块重复六次。GPT-3共有96个transformer层和1750亿个参数。 +- en: Figure 1.8 The GPT architecture, employs only the decoder portion of the original + transformer. It is designed for unidirectional, left-to-right processing, making + it well-suited for text generation and next-word prediction tasks to generate + text in iterative fashion one word at a time. + id: totrans-85 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.8 GPT架构仅使用原始transformer的解码器部分。它被设计为单向从左到右的处理,非常适合文本生成和下一个单词预测任务,以逐步生成一次一个单词的文本。 +- en: '![](images/ch-01__image016.png)' + id: totrans-86 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image016.png)' +- en: GPT-3 was introduced in 2020, which is a long time ago by the standard of deep + learning and LLM development, more recent architectures like Meta's Llama models + are still based on the same underlying concepts, introducing only minor modifications. + Hence, understanding GPT remains as relevant as ever, and this book focuses on + implementing the prominent architecture behind GPT while providing pointers to + specific tweaks employed by alternative LLMs. + id: totrans-87 + prefs: [] + type: TYPE_NORMAL + zh: GPT-3是在2020年推出的,从深度学习和LLM发展的标准来看,这已经是很久之前了,而像Meta的Llama模型这样更近期的架构仍然基于相同的基本概念,只是进行了一些细微的修改。因此,理解GPT仍然像以往一样重要,而本书侧重于实现GPT背后突出的架构,并提供指向替代LLMs使用的具体调整。 +- en: Lastly, it's interesting to note that although the original transformer model + was explicitly designed for language translation, GPT models—despite their larger + yet simpler architecture aimed at next-word prediction—are also capable of performing + translation tasks. This capability was initially unexpected to researchers, as + it emerged from a model primarily trained on a next-word prediction task, which + is a task that did not specifically target translation. + id: totrans-88 + prefs: [] + type: TYPE_NORMAL + zh: 最后,有趣的是,尽管原始的transformer模型明确设计用于语言翻译,但GPT模型——尽管其更大但更简单的架构旨在进行下一个单词的预测——也能够执行翻译任务。这种能力最初对研究人员来说是意外的,因为它源自一个主要训练于下一个单词预测任务上的模型,而这是一个并不专门针对翻译的任务。 +- en: The ability to perform tasks that the model wasn't explicitly trained to perform + is called an "emerging property." This capability isn't explicitly taught during + training but emerges as a natural consequence of the model's exposure to vast + quantities of multilingual data in diverse contexts. The fact that GPT models + can "learn" the translation patterns between languages and perform translation + tasks even though they weren't specifically trained for it demonstrates the benefits + and capabilities of these large-scale, generative language models. We can perform + diverse tasks without using diverse models for each. 
+ id: totrans-89 + prefs: [] + type: TYPE_NORMAL + zh: 模型能够执行其未明确接受训练的任务称为“新兴属性”。这种能力在训练期间并未得到明确教导,但是作为模型暴露于各种多语言环境下的大量数据的自然结果而出现。事实上,GPT模型可以“学习”语言之间的翻译模式,并执行翻译任务,即使它们并没有针对此进行专门训练,这显示了这些大规模生成语言模型的优势和能力。我们可以执行各种任务而无需为每个任务使用不同的模型。 +- en: 1.7 Building a large language model + id: totrans-90 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.7 构建大型语言模型 +- en: In this chapter, we laid the groundwork for understanding LLMs. In the remainder + of this book, we will be coding one from scratch. We will take the fundamental + idea behind GPT as a blueprint and tackle this in three stages, as outlined in + figure 1.9. + id: totrans-91 + prefs: [] + type: TYPE_NORMAL + zh: 在本章中,我们为理解LLMs奠定了基础。在本书的剩余部分中,我们将从头开始编写一个LLM。我们将以GPT背后的基本思想作为蓝本,并按照图1.9中的概述分三个阶段来解决这个问题。 +- en: Figure 1.9 The stages of building LLMs covered in this book include implementing + the LLM architecture and data preparation process, pretraining an LLM to create + a foundation model, and finetuning the foundation model to become a personal assistant + or text classifier. + id: totrans-92 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图1.9 本书涵盖的构建LLMs的阶段包括实现LLM架构和数据准备过程,预训练LLM以创建基础模型,以及对基础模型进行微调以成为个人助理或文本分类器。 +- en: '![](images/ch-01__image018.png)' + id: totrans-93 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-01__image018.png)' +- en: First, we will learn about the fundamental data preprocessing steps and code + the attention mechanism that is at the heart of every LLM. + id: totrans-94 + prefs: [] + type: TYPE_NORMAL + zh: 首先,我们将学习基本的数据预处理步骤,并编写是每个LLM核心的注意力机制。 +- en: Next, in stage 2, we will learn how to code and pretrain a GPT-like LLM capable + of generating new texts. And we will also go over the fundamentals of evaluating + LLMs, which is essential for developing capable NLP systems. + id: totrans-95 + prefs: [] + type: TYPE_NORMAL + zh: 接下来,在第2阶段,我们将学习如何编码和预训练一个类似GPT的LLM,能够生成新的文本。并且我们还将深入研究评估LLMs的基础知识,这对于开发功能强大的自然语言处理系统至关重要。 +- en: Note that pretraining a large LLM from scratch is a significant endeavor, demanding + thousands to millions of dollars in computing costs for GPT-like models. Therefore, + the focus of stage 2 is on implementing training for educational purposes using + a small dataset. In addition, the book will also provide code examples for loading + openly available model weights. + id: totrans-96 + prefs: [] + type: TYPE_NORMAL + zh: 请注意,从零开始预训练大型LLM是一项重大工作,在GPT-like模型的计算成本中需要数千到数百万美元。因此,第2阶段的重点是利用小型数据集进行教育目的的训练实施。此外,本书还将提供加载公开可用模型权重的代码示例。 +- en: Finally, in stage 3, we will take a pretrained LLM and finetune it to follow + instructions such as answering queries or classifying texts -- the most common + tasks in many real-world applications and research. + id: totrans-97 + prefs: [] + type: TYPE_NORMAL + zh: 最后,在第3阶段,我们将获取一个预训练的LLM,并对其进行微调,以遵循诸如回答查询或分类文本等指令--这是许多现实应用和研究中最常见的任务。 +- en: I hope you are looking forward to embarking on this exciting journey! + id: totrans-98 + prefs: [] + type: TYPE_NORMAL + zh: 希望您期待着踏上这段令人兴奋的旅程! +- en: 1.8 Summary + id: totrans-99 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.8 总结 +- en: LLMs have transformed the field of natural language processing, which previouslyrelied + on explicit rule-based systems and simpler statistical methods. The advent of + LLMs introduced new deep learning-driven approaches that led to advancements in + understanding, generating, and translating human language. 
+ id: totrans-100 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: LLMs已经改变了自然语言处理领域,之前依赖于显式基于规则的系统和更简单的统计方法。LLMs的出现引入了新的深度学习驱动方法,推动了对人类语言的理解、生成和翻译的进步。 +- en: Modern LLMs are trained in two main steps. + id: totrans-101 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 现代LLMs的训练主要分为两个步骤。 +- en: First, they are pretrained on a large corpus of unlabeled text by using the + prediction of the next word in a sentence as a "label." + id: totrans-102 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 首先,它们通过使用句子中下一个单词的预测作为"标签",在大型未标记文本语料库上进行预训练。 +- en: Then, they are finetuned on a smaller, labeled target dataset to follow instructions + or perform classification tasks. + id: totrans-103 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 然后,它们在较小的、标记的目标数据集上进行微调,以遵循指令或执行分类任务。 +- en: LLMs are based on the transformer architecture. The key idea of the transformer + architecture is an attention mechanism that gives the LLM selective access to + the whole input sequence when generating the output one word at a time. + id: totrans-104 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: LLMs基于Transformer架构。Transformer架构的关键思想是一个注意力机制,在逐词生成输出时,给予LLM对整个输入序列的选择性访问。 +- en: The original transformer architecture consists of an encoder for parsing text + and a decoder for generating text. + id: totrans-105 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 原始的Transformer架构包括一个用于解析文本的编码器和一个用于生成文本的解码器。 +- en: LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, + only implement decoder modules, simplifying the architecture. + id: totrans-106 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 用于生成文本和遵循指令的LLMs,如GPT-3和ChatGPT,仅实现解码器模块,简化了架构。 +- en: Large datasets consisting of billions of words are essential for pretraining + LLMs. in this book, we will implement and train LLMs on small datasets for educational + purposes but also see how we can load openly available model weights. + id: totrans-107 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 由数十亿字组成的大型数据集对于LLMs的预训练至关重要。在本书中,我们将实现并训练LLMs以用于教育目的的小型数据集,还将了解如何加载公开可用的模型权重。 +- en: While the general pretraining task for GPT-like models is to predict the next + word in a sentence, these LLMs exhibit "emergent" properties such as capabilities + to classify, translate, or summarize texts. + id: totrans-108 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 尽管GPT-like模型的一般预训练任务是预测句子中的下一个单词,但这些LLMs展现出"新兴"属性,如分类、翻译或总结文本的能力。 +- en: Once an LLM is pretrained, the resulting foundation model can be finetuned more + efficiently for various downstream tasks. + id: totrans-109 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 一旦LLM被预训练,产生的基础模型可以更高效地针对各种下游任务进行微调。 +- en: LLMs finetuned on custom datasets can outperform general LLMs on specific tasks. + id: totrans-110 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 使用定制数据集进行微调的LLMs可以在特定任务上胜过通用LLMs。 +- en: 1.9 References and further reading + id: totrans-111 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 1.9 参考和进一步阅读 +- en: 'Custom-built LLMs are able to outperform general-purpose LLMs as a team at + Bloomberg showed via a a version of GPT pretrained on finance data from scratch. 
+ The custom LLM outperformed ChatGPT on financial tasks while maintaining good + performance on general LLM benchmarks:' + id: totrans-112 + prefs: [] + type: TYPE_NORMAL + zh: 由于一支彭博团队展示的LLMs在金融数据上从零开始预训练的GPT版本,定制LLMs能够在金融任务上胜过ChatGPT,同时在通用LLM基准测试中表现良好: +- en: '*BloombergGPT: A Large Language Model for Finance* (2023) by Wu *et al.*, [https://arxiv.org/abs/2303.17564](abs.html)' + id: totrans-113 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*BloombergGPT:一种用于金融的大型语言模型* (2023)由吴等人撰写,[https://arxiv.org/abs/2303.17564](abs.html)' +- en: 'Existing LLMs can be adapted and finetuned to outperform general LLMs as well, + which teams from Google Research and Google DeepMind showed in a medical context:' + id: totrans-114 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 现有的LLM也可以被调整和微调,以表现出优于一般LLM的潜力,谷歌研究组和谷歌DeepMind团队在医疗领域展示了这一点: +- en: '*Towards Expert-Level Medical Question Answering with Large Language Models* + (2023) by Singhal *et al.*, [https://arxiv.org/abs/2305.09617](abs.html)' + id: totrans-115 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*通过大型语言模型实现专业水平医学问答* (2023)由辛哈尔等人撰写,[https://arxiv.org/abs/2305.09617](abs.html)' +- en: 'The paper that proposed the original transformer architecture:' + id: totrans-116 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 提出原始变压器架构的论文: +- en: '*Attention Is All You Need* (2017) by Vaswani *et al.*, [https://arxiv.org/abs/1706.03762](abs.html)' + id: totrans-117 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*注意力机制就是一切* (2017)由瓦斯瓦尼等人撰写,[https://arxiv.org/abs/1706.03762](abs.html)' +- en: 'The original encoder-style transformer, called BERT:' + id: totrans-118 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 原始的编码器式变压器,称为BERT: +- en: '*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* + (2018) by Devlin *et al.*, [https://arxiv.org/abs/1810.04805](abs.html).' + id: totrans-119 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*BERT:深度双向变压器进行语言理解的预训练* (2018)由德夫林等人撰写,[https://arxiv.org/abs/1810.04805](abs.html)。' +- en: 'The paper describing the decoder-style GPT-3 model, which inspired modern LLMs + and will be used as a template for implementing an LLM from scratch in this book:' + id: totrans-120 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 描述解码器式GPT-3模型的论文,这激发了现代LLM的开发,并将被用作在本书中从头开始实现LLM的模板: +- en: '*Language Models are Few-Shot Learners* (2020) by Brown *et al.*, [https://arxiv.org/abs/2005.14165](abs.html).' 
+ id: totrans-121 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*语言模型是少样本学习者* (2020)由布朗等人撰写,[https://arxiv.org/abs/2005.14165](abs.html)。' +- en: 'The original vision transformer for classifying images, which illustrates that + transformer architectures are not only restricted to text inputs:' + id: totrans-122 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 用于分类图像的原始视觉变压器,说明变压器架构不仅限于文本输入: +- en: '*An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale* + (2020) by Dosovitskiy *et al.*, [https://arxiv.org/abs/2010.11929](abs.html)' + id: totrans-123 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*一幅图等于16x16个字:大规模图像识别的变压器* (2020)由多索维茨基等人撰写,[https://arxiv.org/abs/2010.11929](abs.html)' +- en: 'Two experimental (but less popular) LLM architectures that serve as examples + that not all LLMs need to be based on the transformer architecture:' + id: totrans-124 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 两种实验性(但较不流行)的LLM架构作为示例,说明不是所有的LLM都必须基于变压器架构: +- en: '*RWKV: Reinventing RNNs for the Transformer Era* (2023) by Peng *et al.*, [https://arxiv.org/abs/2305.13048](abs.html)' + id: totrans-125 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*RWKV:为变压器时代重新设计RNN* (2023)由彭等人撰写,[https://arxiv.org/abs/2305.13048](abs.html)' +- en: '*Hyena Hierarchy: Towards Larger Convolutional Language Models (2023)* by Poli + *et al.,* [https://arxiv.org/abs/2302.10866](abs.html)' + id: totrans-126 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*鬣狗等级结构:向更大的卷积语言模型迈进(2023年)*由波利等人撰写,[https://arxiv.org/abs/2302.10866](abs.html)' +- en: 'Meta AI''s model is a popular implementation of a GPT-like model that is openly + available in contrast to GPT-3 and ChatGPT:' + id: totrans-127 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: Meta AI的模型是一个流行的GPT样式模型的实现,与GPT-3和ChatGPT相比是开放可用的: +- en: '*Llama 2: Open Foundation and Fine-Tuned Chat Models* (2023) by Touvron *et + al.*, [https://arxiv.org/abs/2307.09288](abs.html)[1](abs.html)' + id: totrans-128 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*Llama 2:开放基础和微调的聊天模型* (2023)由特文等人撰写,[https://arxiv.org/abs/2307.09288](abs.html)[1](abs.html)' +- en: 'For readers interested in additional details about the dataset references in + section 1.5, this paper describes the publicly available *The Pile* dataset curated + by Eleuther AI:' + id: totrans-129 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 对于对第1.5节中提到的数据集引用感兴趣的读者,这篇论文描述了由Eleuther AI策划的公开可用的*The Pile*数据集: +- en: '*The Pile: An 800GB Dataset of Diverse Text for Language Modeling* (2020) by + Gao *et al.*, [https://arxiv.org/abs/2101.00027](abs.html).' + id: totrans-130 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*堆叠:一份包含多样文本的800GB数据集用于语言建模* (2020)由高等人撰写,[https://arxiv.org/abs/2101.00027](abs.html)。' +- en: '*Training Language Models to Follow Instructions with Human Feedback* (2022) + by *Ouyang et al.*, [https://arxiv.org/abs/2203.02155](abs.html)' + id: totrans-131 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '*训练语言模型遵循人类反馈指令* (2022)由欧阳等人撰写,[https://arxiv.org/abs/2203.02155](abs.html)' +- en: '[[1]](#_ftnref1) Readers with a background in machine learning may note that + labeling information is typically required for traditional machine learning models + and deep neural networks trained via the conventional supervised learning paradigm. + However, this is not the case for the pretraining stage of LLMs. In this phase, + LLMs leverage self-supervised learning, where the model generates its own labels + from the input data. 
This concept is covered later in this chapter' + id: totrans-132 + prefs: [] + type: TYPE_NORMAL + zh: '[[1]](#_ftnref1) 有机器学习背景的读者可能会注意到,传统机器学习模型和通过传统监督学习范式训练的深度神经网络通常需要标签信息。但是,这并不适用于LLM的预训练阶段。在这个阶段,LLM利用自监督学习,模型从输入数据中生成自己的标签。这个概念稍后在本章中会有介绍' +- en: '[[2]](#_ftnref2) *GPT-3, The $4,600,000 Language Model*, [https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_](h0jwoz.html)[4600000_language_model/](d_gpt3_the_4600000_language_model.html)' + id: totrans-133 + prefs: [] + type: TYPE_NORMAL + zh: '[[2]](#_ftnref2) *GPT-3,460 万美元的语言模型*,[https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_](h0jwoz.html)[4600000_language_model/](d_gpt3_the_4600000_language_model.html)' diff --git a/data/bd-llm-scr_2.yaml b/data/bd-llm-scr_2.yaml new file mode 100644 index 0000000..f4f90ca --- /dev/null +++ b/data/bd-llm-scr_2.yaml @@ -0,0 +1,2181 @@ +- en: 2 Working with Text Data + id: totrans-0 + prefs: + - PREF_H1 + type: TYPE_NORMAL + zh: 2 使用文本数据 +- en: This chapter covers + id: totrans-1 + prefs: + - PREF_H3 + type: TYPE_NORMAL + zh: 本章涵盖内容 +- en: Preparing text for large language model training + id: totrans-2 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 为大型语言模型训练准备文本 +- en: Splitting text into word and subword tokens + id: totrans-3 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 将文本分割成单词和子单词标记 +- en: Byte pair encoding as a more advanced way of tokenizing text + id: totrans-4 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 字节对编码作为一种更高级的文本标记化方式 +- en: Sampling training examples with a sliding window approach + id: totrans-5 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 使用滑动窗口方法对训练样本进行抽样 +- en: Converting tokens into vectors that feed into a large language model + id: totrans-6 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 将标记转换为输入大型语言模型的向量 +- en: In the previous chapter, we delved into the general structure of large language + models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, + our focus was on decoder-only LLMs based on the transformer architecture, which + underlies ChatGPT and other popular GPT-like LLMs. + id: totrans-7 + prefs: [] + type: TYPE_NORMAL + zh: 在上一章中,我们深入探讨了大型语言模型(LLMs)的一般结构,并了解到它们在大量文本上进行了预训练。具体来说,我们关注的是基于变压器架构的解码器专用LLMs,这是ChatGPT和其他流行的类GPT + LLMs的基础。 +- en: During the pretraining stage, LLMs process text one word at a time. Training + LLMs with millions to billions of parameters using a next-word prediction task + yields models with impressive capabilities. These models can then be further finetuned + to follow general instructions or perform specific target tasks. But before we + can implement and train LLMs in the upcoming chapters, we need to prepare the + training dataset, which is the focus of this chapter, as illustrated in figure + 2.1 + id: totrans-8 + prefs: [] + type: TYPE_NORMAL + zh: 在预训练阶段,LLM逐个单词处理文本。利用亿万到数十亿参数的LLM进行下一个词预测任务的训练,可以产生具有令人印象深刻能力的模型。然后可以进一步微调这些模型以遵循一般指示或执行特定目标任务。但是,在接下来的章节中实施和训练LLM之前,我们需要准备训练数据集,这是本章的重点,如图2.1所示 +- en: Figure 2.1 A mental model of the three main stages of coding an LLM, pretraining + the LLM on a general text dataset, and finetuning it on a labeled dataset. This + chapter will explain and code the data preparation and sampling pipeline that + provides the LLM with the text data for pretraining. 
+ id: totrans-9 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.1 LLM编码的三个主要阶段的心理模型,LLM在一般文本数据集上进行预训练,然后在有标签的数据集上进行微调。本章将解释并编写提供LLM预训练文本数据的数据准备和抽样管道。 +- en: '![](images/ch-02__image002.png)' + id: totrans-10 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image002.png)' +- en: In this chapter, you'll learn how to prepare input text for training LLMs. This + involves splitting text into individual word and subword tokens, which can then + be encoded into vector representations for the LLM. You'll also learn about advanced + tokenization schemes like byte pair encoding, which is utilized in popular LLMs + like GPT. Lastly, we'll implement a sampling and data loading strategy to produce + the input-output pairs necessary for training LLMs in subsequent chapters. + id: totrans-11 + prefs: [] + type: TYPE_NORMAL + zh: 在本章中,您将学习如何准备输入文本以进行LLM训练。这涉及将文本拆分为单独的单词和子单词标记,然后将其编码为LLM的向量表示。您还将学习有关高级标记方案,如字节对编码,这在像GPT这样的流行LLM中被使用。最后,我们将实现一种抽样和数据加载策略,以生成后续章节中训练LLM所需的输入-输出对。 +- en: 2.1 Understanding word embeddings + id: totrans-12 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.1 理解词嵌入 +- en: Deep neural network models, including LLMs, cannot process raw text directly. + Since text is categorical, it isn't compatible with the mathematical operations + used to implement and train neural networks. Therefore, we need a way to represent + words as continuous-valued vectors. (Readers unfamiliar with vectors and tensors + in a computational context can learn more in Appendix A, section A2.2 Understanding + tensors.) + id: totrans-13 + prefs: [] + type: TYPE_NORMAL + zh: 深度神经网络模型,包括LLM,无法直接处理原始文本。由于文本是分类的,所以它与用于实现和训练神经网络的数学运算不兼容。因此,我们需要一种将单词表示为连续值向量的方式。(不熟悉计算上下文中向量和张量的读者可以在附录A,A2.2理解张量中了解更多。) +- en: The concept of converting data into a vector format is often referred to as + *embedding*. Using a specific neural network layer or another pretrained neural + network model, we can embed different data types, for example, video, audio, and + text, as illustrated in figure 2.2. + id: totrans-14 + prefs: [] + type: TYPE_NORMAL + zh: 将数据转换为向量格式的概念通常被称为*嵌入*。使用特定的神经网络层或其他预训练的神经网络模型,我们可以嵌入不同的数据类型,例如视频、音频和文本,如图2.2所示。 +- en: Figure 2.2 Deep learning models cannot process data formats like video, audio, + and text in their raw form. Thus, we use an embedding model to transform this + raw data into a dense vector representation that deep learning architectures can + easily understand and process. Specifically, this figure illustrates the process + of converting raw data into a three-dimensional numerical vector. It's important + to note that different data formats require distinct embedding models. For example, + an embedding model designed for text would not be suitable for embedding audio + or video data. + id: totrans-15 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.2 深度学习模型无法直接处理视频、音频和文本等原始格式的数据。因此,我们使用嵌入模型将这些原始数据转换为深度学习架构可以轻松理解和处理的稠密向量表示。具体来说,这张图说明了将原始数据转换为三维数值向量的过程。需要注意的是,不同的数据格式需要不同的嵌入模型。例如,专为文本设计的嵌入模型不适用于嵌入音频或视频数据。 +- en: '![](images/ch-02__image004.png)' + id: totrans-16 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image004.png)' +- en: At its core, an embedding is a mapping from discrete objects, such as words, + images, or even entire documents, to points in a continuous vector space -- the + primary purpose of embeddings is to convert non-numeric data into a format that + neural networks can process. 
+ id: totrans-17 + prefs: [] + type: TYPE_NORMAL + zh: 在其核心,嵌入是从离散对象(如单词、图像,甚至整个文档)到连续向量空间中的点的映射——嵌入的主要目的是将非数值数据转换为神经网络可以处理的格式。 +- en: While word embeddings are the most common form of text embedding, there are + also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph + embeddings are popular choices for *retrieval-augmented generation.* Retrieval-augmented + generation combines generation (like producing text) with retrieval (like searching + an external knowledge base) to pull relevant information when generating text, + which is a technique that is beyond the scope of this book. Since our goal is + to train GPT-like LLMs, which learn to generate text one word at a time, this + chapter focuses on word embeddings. + id: totrans-18 + prefs: [] + type: TYPE_NORMAL + zh: 虽然单词嵌入是文本嵌入的最常见形式,但也有针对句子、段落或整个文档的嵌入。句子或段落嵌入是*检索增强生成*的流行选择。检索增强生成结合了生成(如生成文本)和检索(如搜索外部知识库)以在生成文本时提取相关信息的技术,这是本书讨论范围之外的技术。由于我们的目标是训练类似GPT的LLMs,这些模型学习逐词生成文本,因此本章重点介绍了单词嵌入。 +- en: There are several algorithms and frameworks that have been developed to generate + word embeddings. One of the earlier and most popular examples is the *Word2Vec* + approach. Word2Vec trained neural network architecture to generate word embeddings + by predicting the context of a word given the target word or vice versa. The main + idea behind Word2Vec is that words that appear in similar contexts tend to have + similar meanings. Consequently, when projected into 2-dimensional word embeddings + for visualization purposes, it can be seen that similar terms cluster together, + as shown in figure 2.3. + id: totrans-19 + prefs: [] + type: TYPE_NORMAL + zh: 有几种算法和框架已被开发用于生成单词嵌入。其中一个较早和最流行的示例是*Word2Vec*方法。Word2Vec训练神经网络架构以通过预测给定目标词或反之亦然的单词的上下文来生成单词嵌入。Word2Vec背后的主要思想是在相似上下文中出现的单词往往具有相似的含义。因此,当投影到二维单词嵌入进行可视化时,可以看到相似术语聚集在一起,如图2.3所示。 +- en: Figure 2.3 If word embeddings are two-dimensional, we can plot them in a two-dimensional + scatterplot for visualization purposes as shown here. When using word embedding + techniques, such as Word2Vec, words corresponding to similar concepts often appear + close to each other in the embedding space. For instance, different types of birds + appear closer to each other in the embedding space compared to countries and cities. + id: totrans-20 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.3 如果单词嵌入是二维的,我们可以在二维散点图中绘制它们进行可视化,如此处所示。使用单词嵌入技术(例如Word2Vec),与相似概念对应的单词通常在嵌入空间中彼此靠近。例如,不同类型的鸟类在嵌入空间中彼此比国家和城市更接近。 +- en: '![](images/ch-02__image006.png)' + id: totrans-21 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image006.png)' +- en: Word embeddings can have varying dimensions, from one to thousands. As shown + in figure 2.3, we can choose two-dimensional word embeddings for visualization + purposes. A higher dimensionality might capture more nuanced relationships but + at the cost of computational efficiency. + id: totrans-22 + prefs: [] + type: TYPE_NORMAL + zh: 单词嵌入的维度可以有不同的范围,从一维到数千维不等。如图2.3所示,我们可以选择二维单词嵌入进行可视化。更高的维度可能捕捉到更加微妙的关系,但会牺牲计算效率。 +- en: While we can use pretrained models such as Word2Vec to generate embeddings for + machine learning models, LLMs commonly produce their own embeddings that are part + of the input layer and are updated during training. The advantage of optimizing + the embeddings as part of the LLM training instead of using Word2Vec is that the + embeddings are optimized to the specific task and data at hand. We will implement + such embedding layers later in this chapter. Furthermore, LLMs can also create + contextualized output embeddings, as we discuss in chapter 3. 
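To make the clustering idea behind figure 2.3 a bit more tangible, the following toy snippet compares a few hand-picked two-dimensional vectors with cosine similarity. These vectors are made up purely for illustration; they are not learned Word2Vec or LLM embeddings, and only the intuition that related concepts end up close together carries over:

```python
import torch

# hand-picked, hypothetical 2D "embeddings"; these are not learned vectors
toy_embeddings = {
    "sparrow": torch.tensor([0.9, 0.8]),
    "eagle":   torch.tensor([0.8, 0.9]),
    "berlin":  torch.tensor([-0.7, 0.2]),
}

def cosine(a, b):
    # cosine similarity: values near 1.0 mean very similar directions
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

print(cosine(toy_embeddings["sparrow"], toy_embeddings["eagle"]))   # high: both are birds
print(cosine(toy_embeddings["sparrow"], toy_embeddings["berlin"]))  # much lower: unrelated concepts
```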
+ id: totrans-23 + prefs: [] + type: TYPE_NORMAL + zh: 虽然我们可以使用诸如Word2Vec之类的预训练模型为机器学习模型生成嵌入,但LLMs通常产生自己的嵌入,这些嵌入是输入层的一部分,并在训练过程中更新。优化嵌入作为LLM训练的一部分的优势,而不是使用Word2Vec的优势在于,嵌入被优化为特定的任务和手头的数据。我们将在本章后面实现这样的嵌入层。此外,LLMs还可以创建上下文化的输出嵌入,我们将在第3章中讨论。 +- en: Unfortunately, high-dimensional embeddings present a challenge for visualization + because our sensory perception and common graphical representations are inherently + limited to three dimensions or fewer, which is why figure 2.3 showed two-dimensional + embeddings in a two-dimensional scatterplot. However, when working with LLMs, + we typically use embeddings with a much higher dimensionality than shown in figure + 2.3\. For both GPT-2 and GPT-3, the embedding size (often referred to as the dimensionality + of the model's hidden states) varies based on the specific model variant and size. + It is a trade-off between performance and efficiency. The smallest GPT-2 (117M + parameters) and GPT-3 (125 M parameters) models use an embedding size of 768 dimensions + to provide concrete examples. The largest GPT-3 model (175B parameters) uses an + embedding size of 12,288 dimensions. + id: totrans-24 + prefs: [] + type: TYPE_NORMAL + zh: 不幸的是,高维度嵌入给可视化提出了挑战,因为我们的感知和常见的图形表示固有地受限于三个或更少维度,这就是为什么图2.3展示了在二维散点图中的二维嵌入。然而,当使用LLMs时,我们通常使用比图2.3中所示的更高维度的嵌入。对于GPT-2和GPT-3,嵌入大小(通常称为模型隐藏状态的维度)根据特定模型变体和大小而变化。这是性能和效率之间的权衡。最小的GPT-2(117M参数)和GPT-3(125M参数)模型使用768维度的嵌入大小来提供具体的例子。最大的GPT-3模型(175B参数)使用12288维的嵌入大小。 +- en: The upcoming sections in this chapter will walk through the required steps for + preparing the embeddings used by an LLM, which include splitting text into words, + converting words into tokens, and turning tokens into embedding vectors. + id: totrans-25 + prefs: [] + type: TYPE_NORMAL + zh: 本章的后续部分将介绍准备LLM使用的嵌入所需的步骤,包括将文本分割为单词,将单词转换为标记,并将标记转换为嵌入向量。 +- en: 2.2 Tokenizing text + id: totrans-26 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.2 文本分词 +- en: This section covers how we split input text into individual tokens, a required + preprocessing step for creating embeddings for an LLM. These tokens are either + individual words or special characters, including punctuation characters, as shown + in figure 2.4. + id: totrans-27 + prefs: [] + type: TYPE_NORMAL + zh: 本节介绍了如何将输入文本分割为单个标记,这是为了创建LLM嵌入所必需的预处理步骤。这些标记可以是单独的单词或特殊字符,包括标点符号字符,如图2.4所示。 +- en: Figure 2.4 A view of the text processing steps covered in this section in the + context of an LLM. Here, we split an input text into individual tokens, which + are either words or special characters, such as punctuation characters. In upcoming + sections, we will convert the text into token IDs and create token embeddings. + id: totrans-28 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.4 在LLM上下文中查看本节涵盖的文本处理步骤。在这里,我们将输入文本分割为单个标记,这些标记可以是单词或特殊字符,如标点符号字符。在即将到来的部分中,我们将把文本转换为标记ID并创建标记嵌入。 +- en: '![](images/ch-02__image008.png)' + id: totrans-29 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image008.png)' +- en: 'The text we will tokenize for LLM training is a short story by Edith Wharton + called *The Verdict*, which has been released into the public domain and is thus + permitted to be used for LLM training tasks. 
The text is available on Wikisource + at [https://en.wikisource.org/wiki/The_Verdict](wiki.html), and you can copy and + paste it into a text file, which I copied into a text file "`the-verdict.txt"` + to load using Python''s standard file reading utilities:' + id: totrans-30 + prefs: [] + type: TYPE_NORMAL + zh: 我们将用于LLM训练的文本是Edith Wharton的短篇小说《**The Verdict**》,该小说已进入公有领域,因此可以用于LLM训练任务。文本可在Wikisource上获得,网址为[https://en.wikisource.org/wiki/The_Verdict](wiki.html),您可以将其复制粘贴到文本文件中,我将其复制到一个名为"`the-verdict.txt`"的文本文件中,以便使用Python的标准文件读取实用程序加载: +- en: Listing 2.1 Reading in a short story as text sample into Python + id: totrans-31 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 列表2.1 将短篇小说作为文本示例读入Python +- en: '[PRE0]' + id: totrans-32 + prefs: [] + type: TYPE_PRE + zh: '[PRE0]' +- en: Alternatively, you can find this "`the-verdict.txt"` file in this book's GitHub + repository at [https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01_main-chapter-code](ch02.html). + id: totrans-33 + prefs: [] + type: TYPE_NORMAL + zh: 或者,您可以在本书的GitHub存储库中找到此"`the-verdict.txt`"文件,网址为[https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01_main-chapter-code](ch02.html)。 +- en: 'The print command prints the total number of characters followed by the first + 100 characters of this file for illustration purposes:' + id: totrans-34 + prefs: [] + type: TYPE_NORMAL + zh: 打印命令打印出字符的总数,然后是文件的前100个字符,用于说明目的: +- en: '[PRE1]' + id: totrans-35 + prefs: [] + type: TYPE_PRE + zh: '[PRE1]' +- en: Our goal is to tokenize this 20,479-character short story into individual words + and special characters that we can then turn into embeddings for LLM training + in the upcoming chapters. + id: totrans-36 + prefs: [] + type: TYPE_NORMAL + zh: 我们的目标是将这篇短篇小说的20,479个字符标记成单词和特殊字符,然后将其转换为LLM训练的嵌入。 +- en: Text sample sizes + id: totrans-37 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 文本样本大小 +- en: Note that it's common to process millions of articles and hundreds of thousands + of books -- many gigabytes of text -- when working with LLMs. However, for educational + purposes, it's sufficient to work with smaller text samples like a single book + to illustrate the main ideas behind the text processing steps and to make it possible + to run it in reasonable time on consumer hardware. + id: totrans-38 + prefs: [] + type: TYPE_NORMAL + zh: 请注意,在处理LLM时,处理数百万篇文章和数十万本书——许多吉字节的文本——是很常见的。但是,出于教育目的,使用小型文本样本,如一本书,就足以说明文本处理步骤背后的主要思想,并且可以在消费类硬件上合理的时间内运行。 +- en: How can we best split this text to obtain a list of tokens? For this, we go + on a small excursion and use Python's regular expression library `re` for illustration + purposes. (Note that you don't have to learn or memorize any regular expression + syntax since we will transition to a pre-built tokenizer later in this chapter.) + id: totrans-39 + prefs: [] + type: TYPE_NORMAL + zh: 我们如何最好地分割这段文本以获得标记列表? 
为此,我们进行了小小的探索,并使用Python的正则表达式库`re`进行说明。 (请注意,您无需学习或记忆任何正则表达式语法,因为我们将在本章后面过渡到预构建的标记器。) +- en: 'Using some simple example text, we can use the `re.split` command with the + following syntax to split a text on whitespace characters:' + id: totrans-40 + prefs: [] + type: TYPE_NORMAL + zh: 使用一些简单的示例文本,我们可以使用`re.split`命令及以下语法来在空格字符上拆分文本: +- en: '[PRE2]' + id: totrans-41 + prefs: [] + type: TYPE_PRE + zh: '[PRE2]' +- en: 'The result is a list of individual words, whitespaces, and punctuation characters:' + id: totrans-42 + prefs: [] + type: TYPE_NORMAL + zh: '结果是一系列单词、空格和标点字符:' +- en: '[PRE3]' + id: totrans-43 + prefs: [] + type: TYPE_PRE + zh: '[PRE3]' +- en: Note that the simple tokenization scheme above mostly works for separating the + example text into individual words, however, some words are still connected to + punctuation characters that we want to have as separate list entries. + id: totrans-44 + prefs: [] + type: TYPE_NORMAL + zh: 请注意,上述简单分词方案通常可将示例文本分隔成单词,但是有些单词仍然与我们希望作为单独列表项的标点字符连接在一起。 +- en: 'Let''s modify the regular expression splits on whitespaces (`\s`) and commas, + and periods (`[,.]`):' + id: totrans-45 + prefs: [] + type: TYPE_NORMAL + zh: 让我们修改在空格(`\s`)和逗号、句号(`[,.]`)上的正则表达式分割: +- en: '[PRE4]' + id: totrans-46 + prefs: [] + type: TYPE_PRE + zh: '[PRE4]' +- en: 'We can see that the words and punctuation characters are now separate list + entries just as we wanted:' + id: totrans-47 + prefs: [] + type: TYPE_NORMAL + zh: '我们可以看到单词和标点字符现在是作为我们想要的分开的列表条目:' +- en: '[PRE5]' + id: totrans-48 + prefs: [] + type: TYPE_PRE + zh: '[PRE5]' +- en: 'A small remaining issue is that the list still includes whitespace characters. + Optionally, we can remove these redundant characters safely remove as follows:' + id: totrans-49 + prefs: [] + type: TYPE_NORMAL + zh: 一个小问题是列表仍然包括空白字符。可选地,我们可以安全地按如下方式删除这些多余的字符: +- en: '[PRE6]' + id: totrans-50 + prefs: [] + type: TYPE_PRE + zh: '[PRE6]' +- en: 'The resulting whitespace-free output looks like as follows:' + id: totrans-51 + prefs: [] + type: TYPE_NORMAL + zh: 去除空格字符后的输出如下: +- en: '[PRE7]' + id: totrans-52 + prefs: [] + type: TYPE_PRE + zh: '[PRE7]' +- en: Removing whitespaces or not + id: totrans-53 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 是否去除空白 +- en: When developing a simple tokenizer, whether we should encode whitespaces as + separate characters or just remove them depends on our application and its requirements. + Removing whitespaces reduces the memory and computing requirements. However, keeping + whitespaces can be useful if we train models that are sensitive to the exact structure + of the text (for example, Python code, which is sensitive to indentation and spacing). + Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. + Later, we will switch to a tokenization scheme that includes whitespaces. + id: totrans-54 + prefs: [] + type: TYPE_NORMAL + zh: 在开发简单的标记器时,是否将空白字符编码为单独的字符或仅将其删除取决于我们的应用程序和其要求。去除空格减少了内存和计算需求。但是,如果我们训练的模型对文本的精确结构敏感(例如,对缩进和间距敏感的Python代码),保留空格可能会有用。在这里,为了简化标记化输出的简洁性,我们移除空白。稍后,我们将转换为包括空格的标记方案。 +- en: 'The tokenization scheme we devised above works well on the simple sample text. 
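Putting the splitting steps above together, the following minimal sketch shows the whitespace split, the extended split on commas and periods, and the optional removal of whitespace-only entries. The sample sentence is a hypothetical stand-in, not the short story itself:

```python
import re

text = "Hello, world. This, is a test."   # hypothetical sample text

# 1) split on whitespace only; the capturing group keeps the separators in the result
print(re.split(r'(\s)', text))

# 2) additionally split on commas and periods
result = re.split(r'([,.]|\s)', text)
print(result)

# 3) optionally drop empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print(result)   # ['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
```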
+ Let''s modify it a bit further so that it can also handle other types of punctuation, + such as question marks, quotation marks, and the double-dashes we have seen earlier + in the first 100 characters of Edith Wharton''s short story, along with additional + special characters:' + id: totrans-55 + prefs: [] + type: TYPE_NORMAL + zh: 我们上面设计的标记方案在简单的示例文本上运行良好。让我们进一步修改它,使其还可以处理其他类型的标点符号,例如问号,引号以及我们在Edith Wharton的短篇小说的前100个字符中先前看到的双破折号,以及其他额外的特殊字符。 +- en: '[PRE8]' + id: totrans-56 + prefs: [] + type: TYPE_PRE + zh: '[PRE8]' +- en: 'The resulting output is as follows:' + id: totrans-57 + prefs: [] + type: TYPE_NORMAL + zh: 结果输出如下: +- en: '[PRE9]' + id: totrans-58 + prefs: [] + type: TYPE_PRE + zh: '[PRE9]' +- en: As we can see based on the results summarized in figure 2.5, our tokenization + scheme can now handle the various special characters in the text successfully. + id: totrans-59 + prefs: [] + type: TYPE_NORMAL + zh: 根据总结在图2.5中的结果,我们的标记方案现在可以成功处理文本中的各种特殊字符。 +- en: Figure 2.5 The tokenization scheme we implemented so far splits text into individual + words and punctuation characters. In the specific example shown in this figure, + the sample text gets split into 10 individual tokens. + id: totrans-60 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.5 我们目前实施的标记化方案将文本分割为单个单词和标点字符。在本图所示的特定示例中,样本文本被分割成10个单独的标记。 +- en: '![](images/ch-02__image010.png)' + id: totrans-61 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image010.png)' +- en: 'Now that we got a basic tokenizer working, let''s apply it to Edith Wharton''s + entire short story:' + id: totrans-62 + prefs: [] + type: TYPE_NORMAL + zh: 现在我们已经有了一个基本的标记器工作,让我们将其应用到爱迪丝·沃顿的整个短篇小说中: +- en: '[PRE10]' + id: totrans-63 + prefs: [] + type: TYPE_PRE + zh: '[PRE10]' +- en: The above print statement outputs `4649`, which is the number of tokens in this + text (without whitespaces). + id: totrans-64 + prefs: [] + type: TYPE_NORMAL + zh: 上面的打印语句输出了`4649`,这是这段文本(不包括空格)中的标记数。 +- en: 'Let''s print the first 30 tokens for a quick visual check:' + id: totrans-65 + prefs: [] + type: TYPE_NORMAL + zh: 让我们打印前30个标记进行快速的视觉检查: +- en: '[PRE11]' + id: totrans-66 + prefs: [] + type: TYPE_PRE + zh: '[PRE11]' +- en: 'The resulting output shows that our tokenizer appears to be handling the text + well since all words and special characters are neatly separated:' + id: totrans-67 + prefs: [] + type: TYPE_NORMAL + zh: 结果输出显示,我们的标记器似乎很好地处理了文本,因为所有单词和特殊字符都被很好地分开了: +- en: '[PRE12]' + id: totrans-68 + prefs: [] + type: TYPE_PRE + zh: '[PRE12]' +- en: 2.3 Converting tokens into token IDs + id: totrans-69 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.3 将标记转换为标记ID +- en: In the previous section, we tokenized a short story by Edith Wharton into individual + tokens. In this section, we will convert these tokens from a Python string to + an integer representation to produce the so-called token IDs. This conversion + is an intermediate step before converting the token IDs into embedding vectors. + id: totrans-70 + prefs: [] + type: TYPE_NORMAL + zh: 在上一节中,我们将爱迪丝·沃顿的短篇小说标记化为单个标记。在本节中,我们将这些标记从Python字符串转换为整数表示,以生成所谓的标记ID。这种转换是将标记ID转换为嵌入向量之前的中间步骤。 +- en: To map the previously generated tokens into token IDs, we have to build a so-called + vocabulary first. This vocabulary defines how we map each unique word and special + character to a unique integer, as shown in figure 2.6. 
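The mapping that figure 2.6 describes can be sketched in a few lines. The token list below is a hypothetical stand-in; the chapter builds the real vocabulary from all tokens of the short story in the listing that follows:

```python
# toy token list standing in for the preprocessed short story
tokens = ["the", "fox", "jumps", "over", "the", "lazy", "dog", "."]

unique_tokens = sorted(set(tokens))   # deduplicate and sort alphabetically
vocab = {token: idx for idx, token in enumerate(unique_tokens)}
print(vocab)   # {'.': 0, 'dog': 1, 'fox': 2, 'jumps': 3, 'lazy': 4, 'over': 5, 'the': 6}
```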
+ id: totrans-71 + prefs: [] + type: TYPE_NORMAL + zh: 要将之前生成的标记映射到标记ID中,我们必须首先构建一个所谓的词汇表。这个词汇表定义了我们如何将每个唯一的单词和特殊字符映射到一个唯一的整数,就像图2.6中所示的那样。 +- en: Figure 2.6 We build a vocabulary by tokenizing the entire text in a training + dataset into individual tokens. These individual tokens are then sorted alphabetically, + and unique tokens are removed. The unique tokens are then aggregated into a vocabulary + that defines a mapping from each unique token to a unique integer value. The depicted + vocabulary is purposefully small for illustration purposes and contains no punctuation + or special characters for simplicity. + id: totrans-72 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.6 我们通过对训练数据集中的整个文本进行标记化来构建词汇表,将这些单独的标记按字母顺序排序,并移除唯一的标记。然后将这些唯一标记聚合成一个词汇表,从而定义了从每个唯一标记到唯一整数值的映射。为了说明的目的,所示的词汇表故意较小,并且不包含标点符号或特殊字符。 +- en: '![](images/ch-02__image012.png)' + id: totrans-73 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image012.png)' +- en: 'In the previous section, we tokenized Edith Wharton''s short story and assigned + it to a Python variable called `preprocessed`. Let''s now create a list of all + unique tokens and sort them alphabetically to determine the vocabulary size:' + id: totrans-74 + prefs: [] + type: TYPE_NORMAL + zh: 在前一节中,我们标记化了爱迪丝·沃顿的短篇小说,并将其分配给了一个名为`preprocessed`的Python变量。现在让我们创建一个包含所有唯一标记并按字母顺序排列的列表,以确定词汇表的大小: +- en: '[PRE13]' + id: totrans-75 + prefs: [] + type: TYPE_PRE + zh: '[PRE13]' +- en: 'After determining that the vocabulary size is 1,159 via the above code, we + create the vocabulary and print its first 50 entries for illustration purposes:' + id: totrans-76 + prefs: [] + type: TYPE_NORMAL + zh: 通过上面的代码确定词汇表的大小为1,159后,我们创建词汇表,并打印其前50个条目以作说明: +- en: Listing 2.2 Creating a vocabulary + id: totrans-77 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 列表2.2 创建词汇表 +- en: '[PRE14]' + id: totrans-78 + prefs: [] + type: TYPE_PRE + zh: '[PRE14]' +- en: 'The output is as follows:' + id: totrans-79 + prefs: [] + type: TYPE_NORMAL + zh: 输出如下: +- en: '[PRE15]' + id: totrans-80 + prefs: [] + type: TYPE_PRE + zh: '[PRE15]' +- en: As we can see, based on the output above, the dictionary contains individual + tokens associated with unique integer labels. Our next goal is to apply this vocabulary + to convert new text into token IDs, as illustrated in figure 2.7. + id: totrans-81 + prefs: [] + type: TYPE_NORMAL + zh: 如上面的输出所示,字典包含与唯一整数标签相关联的单独标记。我们的下一个目标是将这个词汇表应用到新文本中,以将其转换为标记ID,就像图2.7中所示的那样。 +- en: Figure 2.7 Starting with a new text sample, we tokenize the text and use the + vocabulary to convert the text tokens into token IDs. The vocabulary is built + from the entire training set and can be applied to the training set itself and + any new text samples. The depicted vocabulary contains no punctuation or special + characters for simplicity. + id: totrans-82 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.7 从新的文本样本开始,我们对文本进行标记化,并使用词汇表将文本标记转换为标记ID。词汇表是从整个训练集构建的,并且可以应用于训练集本身以及任何新的文本样本。为了简单起见,所示的词汇表不包含标点符号或特殊字符。 +- en: '![](images/ch-02__image014.png)' + id: totrans-83 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image014.png)' +- en: Later in this book, when we want to convert the outputs of an LLM from numbers + back into text, we also need a way to turn token IDs into text. For this, we can + create an inverse version of the vocabulary that maps token IDs back to corresponding + text tokens. 
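Such an inverse mapping is a one-liner once the vocabulary exists. A minimal sketch, using a small hypothetical vocabulary:

```python
vocab = {".": 0, "dog": 1, "fox": 2, "jumps": 3, "lazy": 4, "over": 5, "the": 6}   # hypothetical

# invert the string-to-integer mapping so token IDs can be turned back into text tokens
inverse_vocab = {token_id: token for token, token_id in vocab.items()}
print(" ".join(inverse_vocab[i] for i in [6, 2, 3, 5, 6, 4, 1, 0]))   # "the fox jumps over the lazy dog ."
```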
+ id: totrans-84 + prefs: [] + type: TYPE_NORMAL + zh: 在本书的后面,当我们想要将LLM的输出从数字转换回文本时,我们还需要一种将标记ID转换成文本的方法。为此,我们可以创建词汇表的反向版本,将标记ID映射回相应的文本标记。 +- en: Let's implement a complete tokenizer class in Python with an `encode` method + that splits text into tokens and carries out the string-to-integer mapping to + produce token IDs via the vocabulary. In addition, we implement a `decode` method + that carries out the reverse integer-to-string mapping to convert the token IDs + back into text. + id: totrans-85 + prefs: [] + type: TYPE_NORMAL + zh: 让我们在Python中实现一个完整的标记器类,它具有一个`encode`方法,将文本分割成标记,并通过词汇表进行字符串到整数的映射,以产生标记ID。另外,我们实现了一个`decode`方法,进行反向整数到字符串的映射,将标记ID转回文本。 +- en: 'The code for this tokenizer implementation is as in listing 2.3:' + id: totrans-86 + prefs: [] + type: TYPE_NORMAL + zh: 这个标记器实现的代码如下所示,如列表2.3所示: +- en: Listing 2.3 Implementing a simple text tokenizer + id: totrans-87 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 列表2.3 实现一个简单的文本标记器 +- en: '[PRE16]' + id: totrans-88 + prefs: [] + type: TYPE_PRE + zh: '[PRE16]' +- en: Using the `SimpleTokenizerV1` Python class above, we can now instantiate new + tokenizer objects via an existing vocabulary, which we can then use to encode + and decode text, as illustrated in figure 2.8. + id: totrans-89 + prefs: [] + type: TYPE_NORMAL + zh: 使用上述的`SimpleTokenizerV1` Python类,我们现在可以通过现有词汇表实例化新的标记对象,然后可以用于编码和解码文本,如图2.8所示。 +- en: 'Figure 2.8 Tokenizer implementations share two common methods: an encode method + and a decode method. The encode method takes in the sample text, splits it into + individual tokens, and converts the tokens into token IDs via the vocabulary. + The decode method takes in token IDs, converts them back into text tokens, and + concatenates the text tokens into natural text.' + id: totrans-90 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.8 标记器实现共享两个常见方法:一个是编码方法,一个是解码方法。编码方法接受示例文本,将其拆分为单独的标记,并通过词汇表将标记转换为标记ID。解码方法接受标记ID,将其转换回文本标记,并将文本标记连接成自然文本。 +- en: '![](images/ch-02__image016.png)' + id: totrans-91 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image016.png)' +- en: 'Let''s instantiate a new tokenizer object from the `SimpleTokenizerV1` class + and tokenize a passage from Edith Wharton''s short story to try it out in practice:' + id: totrans-92 + prefs: [] + type: TYPE_NORMAL + zh: 让我们从`SimpleTokenizerV1`类中实例化一个新的标记对象,并对爱迪丝·沃顿的短篇小说中的段落进行分词,以尝试实践一下: +- en: '[PRE17]' + id: totrans-93 + prefs: [] + type: TYPE_PRE + zh: '[PRE17]' +- en: 'The code above prints the following token IDs:' + id: totrans-94 + prefs: [] + type: TYPE_NORMAL + zh: 上面的代码打印了以下标记ID: +- en: '[PRE18]' + id: totrans-95 + prefs: [] + type: TYPE_PRE + zh: '[PRE18]' +- en: 'Next, let''s see if we can turn these token IDs back into text using the decode + method:' + id: totrans-96 + prefs: [] + type: TYPE_NORMAL + zh: 接下来,让我们看看是否可以使用解码方法将这些标记ID还原为文本: +- en: '[PRE19]' + id: totrans-97 + prefs: [] + type: TYPE_PRE + zh: '[PRE19]' +- en: 'This outputs the following text:' + id: totrans-98 + prefs: [] + type: TYPE_NORMAL + zh: 这将输出以下文本: +- en: '[PRE20]' + id: totrans-99 + prefs: [] + type: TYPE_PRE + zh: '[PRE20]' +- en: Based on the output above, we can see that the decode method successfully converted + the token IDs back into the original text. + id: totrans-100 + prefs: [] + type: TYPE_NORMAL + zh: 根据上面的输出,我们可以看到解码方法成功地将标记ID转换回原始文本。 +- en: 'So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing + text based on a snippet from the training set. 
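To recap the design just described, here is a minimal sketch of such a tokenizer class. It follows the encode/decode structure of listing 2.3, but the exact splitting pattern and the whitespace cleanup in decode are assumptions rather than the book's own listing:

```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                                # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}     # token ID -> token string

    def encode(self, text):
        tokens = re.split(r'([,.?_!"()\']|--|\s)', text)       # assumed pattern, in the spirit of section 2.2
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)        # cosmetic: remove spaces before punctuation


# toy usage with a tiny, hypothetical vocabulary
vocab = {",": 0, ".": 1, "brown": 2, "dog": 3, "fox": 4, "jumps": 5,
         "lazy": 6, "over": 7, "quick": 8, "the": 9}
tokenizer = SimpleTokenizerV1(vocab)
ids = tokenizer.encode("the quick brown fox jumps over the lazy dog.")
print(ids)                    # [9, 8, 2, 4, 5, 7, 9, 6, 3, 1]
print(tokenizer.decode(ids))  # "the quick brown fox jumps over the lazy dog."
```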
Let''s now apply it to a new text + sample that is not contained in the training set:' + id: totrans-101 + prefs: [] + type: TYPE_NORMAL + zh: 目前为止,我们已经实现了一个能够根据训练集中的片段对文本进行标记化和解标记化的标记器。现在让我们将其应用于训练集中不包含的新文本样本: +- en: '[PRE21]' + id: totrans-102 + prefs: [] + type: TYPE_PRE + zh: '[PRE21]' +- en: 'Executing the code above will result in the following error:' + id: totrans-103 + prefs: [] + type: TYPE_NORMAL + zh: 执行上面的代码将导致以下错误: +- en: '[PRE22]' + id: totrans-104 + prefs: [] + type: TYPE_PRE + zh: '[PRE22]' +- en: The problem is that the word "Hello" was not used in the *The Verdict* short + story. Hence, it is not contained in the vocabulary. This highlights the need + to consider large and diverse training sets to extend the vocabulary when working + on LLMs. + id: totrans-105 + prefs: [] + type: TYPE_NORMAL + zh: 问题在于“Hello”这个词没有在*The Verdict*短篇小说中出现过。因此,它不包含在词汇表中。这突显了在处理LLMs时需要考虑大量和多样的训练集以扩展词汇表的需求。 +- en: In the next section, we will test the tokenizer further on text that contains + unknown words, and we will also discuss additional special tokens that can be + used to provide further context for an LLM during training. + id: totrans-106 + prefs: [] + type: TYPE_NORMAL + zh: 在下一节中,我们将进一步测试标记器对包含未知单词的文本的处理,我们还将讨论在训练期间可以使用的额外特殊标记,以提供LLM更多的上下文信息。 +- en: 2.4 Adding special context tokens + id: totrans-107 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.4 添加特殊上下文标记 +- en: In the previous section, we implemented a simple tokenizer and applied it to + a passage from the training set. In this section, we will modify this tokenizer + to handle unknown words. + id: totrans-108 + prefs: [] + type: TYPE_NORMAL + zh: 在上一节中,我们实现了一个简单的标记器,并将其应用于训练集中的一个段落。在本节中,我们将修改这个标记器来处理未知单词。 +- en: We will also discuss the usage and addition of special context tokens that can + enhance a model's understanding of context or other relevant information in the + text. These special tokens can include markers for unknown words and document + boundaries, for example. + id: totrans-109 + prefs: [] + type: TYPE_NORMAL + zh: 我们还将讨论使用和添加特殊上下文标记的用法,这些标记可以增强模型对文本中上下文或其他相关信息的理解。这些特殊标记可以包括未知单词和文档边界的标记,例如。 +- en: In particular, we will modify the vocabulary and tokenizer we implemented in + the previous section, SimpleTokenizerV2, to support two new tokens, `<|unk|>` + and `<|endoftext|>`, as illustrated in figure 2.8. + id: totrans-110 + prefs: [] + type: TYPE_NORMAL + zh: 具体来说,我们将修改上一节中实现的词汇表和标记器SimpleTokenizerV2,以支持两个新的标记`<|unk|>`和`<|endoftext|>`,如图2.8所示。 +- en: Figure 2.9 We add special tokens to a vocabulary to deal with certain contexts. + For instance, we add an <|unk|> token to represent new and unknown words that + were not part of the training data and thus not part of the existing vocabulary. + Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated + text sources. + id: totrans-111 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.9 我们向词汇表中添加特殊标记来处理特定上下文。 例如,我们添加一个<|unk|>标记来表示训练数据中没有出现过的新单词,因此不是现有词汇表的一部分。 + 此外,我们添加一个<|endoftext|>标记,用于分隔两个无关的文本源。 +- en: '![](images/ch-02__image018.png)' + id: totrans-112 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image018.png)' +- en: As shown in figure 2.9, we can modify the tokenizer to use an `<|unk|>` token + if it encounters a word that is not part of the vocabulary. Furthermore, we add + a token between unrelated texts. For example, when training GPT-like LLMs on multiple + independent documents or books, it is common to insert a token before each document + or book that follows a previous text source, as illustrated in figure 2.10\. 
This helps the LLM understand that, although these text sources are concatenated
    for training, they are, in fact, unrelated.
  id: totrans-113
  prefs: []
  type: TYPE_NORMAL
  zh: 如图2.9所示,我们可以修改标记器,在遇到不在词汇表中的单词时使用`<|unk|>`标记。 此外,我们在无关的文本之间添加一个标记。 例如,在训练多个独立文档或书籍的GPT-like
    LLM时,通常会在每个文档或书籍之前插入一个标记,用于指示这是前一个文本源的后续文档或书籍,如图2.10所示。 这有助于LLM理解,尽管这些文本源被连接起来进行训练,但实际上它们是无关的。
- en: Figure 2.10 When working with multiple independent text sources, we add <|endoftext|>
    tokens between these texts. These <|endoftext|> tokens act as markers, signaling
    the start or end of a particular segment, allowing for more effective processing
    and understanding by the LLM.
  id: totrans-114
  prefs:
  - PREF_H5
  type: TYPE_NORMAL
  zh: 图2.10 当处理多个独立的文本源时,我们在这些文本之间添加`<|endoftext|>`标记。 这些`<|endoftext|>`标记充当标记,标志着特定段落的开始或结束,让LLM更有效地处理和理解。
- en: '![](images/ch-02__image020.png)'
  id: totrans-115
  prefs: []
  type: TYPE_IMG
  zh: '![](images/ch-02__image020.png)'
- en: 'Let''s now modify the vocabulary to include these two special tokens, `<|unk|>`
    and `<|endoftext|>`, by adding these to the list of all unique words that we created
    in the previous section:'
  id: totrans-116
  prefs: []
  type: TYPE_NORMAL
  zh: 现在让我们修改词汇表,以包括这两个特殊标记`<|unk|>`和`<|endoftext|>`,通过将它们添加到我们在上一节中创建的所有唯一单词列表中:
- en: '[PRE23]'
  id: totrans-117
  prefs: []
  type: TYPE_PRE
  zh: '[PRE23]'
- en: Based on the output of the print statement above, the new vocabulary size is
    1161 (the vocabulary size in the previous section was 1159).
  id: totrans-118
  prefs: []
  type: TYPE_NORMAL
  zh: 根据上述打印语句的输出,新的词汇表大小为1161(上一节的词汇表大小为1159)。
- en: 'As an additional quick check, let''s print the last 5 entries of the updated
    vocabulary:'
  id: totrans-119
  prefs: []
  type: TYPE_NORMAL
  zh: 作为额外的快速检查,让我们打印更新后词汇表的最后5个条目:
- en: '[PRE24]'
  id: totrans-120
  prefs: []
  type: TYPE_PRE
  zh: '[PRE24]'
- en: 'The code above prints the following:'
  id: totrans-121
  prefs: []
  type: TYPE_NORMAL
  zh: 上面的代码打印如下所示:
- en: '[PRE25]'
  id: totrans-122
  prefs: []
  type: TYPE_PRE
  zh: '[PRE25]'
- en: 'Based on the code output above, we can confirm that the two new special tokens
    were indeed successfully incorporated into the vocabulary. Next, we adjust the
    tokenizer from code listing 2.3 accordingly, as shown in listing 2.4:'
  id: totrans-123
  prefs: []
  type: TYPE_NORMAL
  zh: 根据上面的代码输出,我们可以确认这两个新的特殊标记确实成功地融入到了词汇表中。 接下来,我们根据代码清单2.3调整标记器,如清单2.4所示:
- en: Listing 2.4 A simple text tokenizer that handles unknown words
  id: totrans-124
  prefs:
  - PREF_H5
  type: TYPE_NORMAL
  zh: 清单2.4 处理未知词的简单文本标记器
- en: '[PRE26]'
  id: totrans-125
  prefs: []
  type: TYPE_PRE
  zh: '[PRE26]'
- en: Compared to the `SimpleTokenizerV1` we implemented in code listing 2.3 in the
    previous section, the new `SimpleTokenizerV2` replaces unknown words with `<|unk|>`
    tokens.
  id: totrans-126
  prefs: []
  type: TYPE_NORMAL
  zh: 与我们在上一节代码清单2.3中实现的`SimpleTokenizerV1`相比,新的`SimpleTokenizerV2`将未知单词替换为`<|unk|>`标记。
- en: 'Let''s now try this new tokenizer out in practice.
For this, we will use a + simple text sample that we concatenate from two independent and unrelated sentences:' + id: totrans-127 + prefs: [] + type: TYPE_NORMAL + zh: 现在让我们尝试实践这种新的标记器。 为此,我们将使用一个简单的文本示例,该文本由两个独立且无关的句子串联而成: +- en: '[PRE27]' + id: totrans-128 + prefs: [] + type: TYPE_PRE + zh: '[PRE27]' +- en: 'The output is as follows:' + id: totrans-129 + prefs: [] + type: TYPE_NORMAL + zh: 输出如下所示: +- en: '[PRE28]' + id: totrans-130 + prefs: [] + type: TYPE_PRE + zh: '[PRE28]' +- en: 'Next, let''s tokenize the sample text using the `SimpleTokenizerV2`:' + id: totrans-131 + prefs: [] + type: TYPE_NORMAL + zh: 接下来,让我们使用`SimpleTokenizerV2`对样本文本进行标记: +- en: '[PRE29]' + id: totrans-132 + prefs: [] + type: TYPE_PRE + zh: '[PRE29]' +- en: 'This prints the following token IDs:' + id: totrans-133 + prefs: [] + type: TYPE_NORMAL + zh: 这打印了以下令牌ID: +- en: '[PRE30]' + id: totrans-134 + prefs: [] + type: TYPE_PRE + zh: '[PRE30]' +- en: Above, we can see that the list of token IDs contains 1159 for the <|endoftext|> + separator token as well as two 160 tokens, which are used for unknown words. + id: totrans-135 + prefs: [] + type: TYPE_NORMAL + zh: 从上面可以看到,令牌ID列表包含1159个<|endoftext|>分隔符令牌,以及两个用于未知单词的160个令牌。 +- en: 'Let''s de-tokenize the text for a quick sanity check:' + id: totrans-136 + prefs: [] + type: TYPE_NORMAL + zh: 让我们对文本进行反标记,做一个快速的检查: +- en: '[PRE31]' + id: totrans-137 + prefs: [] + type: TYPE_PRE + zh: '[PRE31]' +- en: 'The output is as follows:' + id: totrans-138 + prefs: [] + type: TYPE_NORMAL + zh: 输出如下所示: +- en: '[PRE32]' + id: totrans-139 + prefs: [] + type: TYPE_PRE + zh: '[PRE32]' +- en: Based on comparing the de-tokenized text above with the original input text, + we know that the training dataset, Edith Wharton's short story *The Verdict*, + did not contain the words "Hello" and "palace." + id: totrans-140 + prefs: [] + type: TYPE_NORMAL + zh: 根据上述去标记化文本与原始输入文本的比较,我们知道埃迪斯·沃顿(Edith Wharton)的短篇小说*The Verdict*训练数据集中不包含单词“Hello”和“palace”。 +- en: 'So far, we have discussed tokenization as an essential step in processing text + as input to LLMs. Depending on the LLM, some researchers also consider additional + special tokens such as the following:' + id: totrans-141 + prefs: [] + type: TYPE_NORMAL + zh: 到目前为止,我们已经讨论了分词作为将文本处理为LLMs输入的基本步骤。根据LLM,一些研究人员还考虑其他特殊标记,如下所示: +- en: '`[BOS]` (beginning of sequence): This token marks the start of a text. It signifies + to the LLM where a piece of content begins.' + id: totrans-142 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '`[BOS]`(序列开始):该标记标志着文本的开始。它向LLM表示内容的开始位置。' +- en: '`[EOS]` (end of sequence): This token is positioned at the end of a text, and + is especially useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`. + For instance, when combining two different Wikipedia articles or books, the `[EOS]` + token indicates where one article ends and the next one begins.' + id: totrans-143 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '`[EOS]`(序列结束):该标记位于文本末尾,当连接多个不相关的文本时特别有用,类似于`<|endoftext|>`。例如,当合并两篇不同的维基百科文章或书籍时,`[EOS]`标记指示一篇文章的结束和下一篇文章的开始位置。' +- en: '`[PAD]` (padding): When training LLMs with batch sizes larger than one, the + batch might contain texts of varying lengths. To ensure all texts have the same + length, the shorter texts are extended or "padded" using the `[PAD]` token, up + to the length of the longest text in the batch.' 
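To picture what such padding does, here is a tiny sketch with made-up token ID sequences and a hypothetical ID reserved for the `[PAD]` token:

```python
batch = [[5, 2, 9], [7, 1], [3, 8, 6, 4]]   # made-up token ID sequences of unequal length
pad_id = 0                                  # hypothetical ID reserved for "[PAD]"

max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
print(padded)   # [[5, 2, 9, 0], [7, 1, 0, 0], [3, 8, 6, 4]]
```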
+ id: totrans-144 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: '`[PAD]`(填充):当使用大于一的批次大小训练LLMs时,批次可能包含不同长度的文本。为确保所有文本具有相同长度,较短的文本将使用`[PAD]`标记进行扩展或“填充”,直到批次中最长文本的长度。' +- en: Note that the tokenizer used for GPT models does not need any of these tokens + mentioned above but only uses an `<|endoftext|>` token for simplicity. The `<|endoftext|>` + is analogous to the `[EOS]` token mentioned above. Also, `<|endoftext|>` is used + for padding as well. However, as we'll explore in subsequent chapters when training + on batched inputs, we typically use a mask, meaning we don't attend to padded + tokens. Thus, the specific token chosen for padding becomes inconsequential. + id: totrans-145 + prefs: [] + type: TYPE_NORMAL + zh: 请注意,用于GPT模型的分词器不需要上述提到的任何这些标记,而仅使用`<|endoftext|>`标记简化。`<|endoftext|>`类似于上述的`[EOS]`标记。此外,`<|endoftext|>`也用于填充。然而,在后续章节中,当在批量输入上训练时,我们通常使用掩码,意味着我们不关注填充的标记。因此,所选择的特定填充标记变得不重要。 +- en: Moreover, the tokenizer used for GPT models also doesn't use an `<|unk|>` token + for out-of-vocabulary words. Instead, GPT models use a *byte pair encoding* tokenizer, + which breaks down words into subword units, which we will discuss in the next + section. + id: totrans-146 + prefs: [] + type: TYPE_NORMAL + zh: 此外,用于GPT模型的分词器也不使用`<|unk|>`标记来表示词汇表中没有的单词。相反,GPT模型使用字节对编码分词器,将单词拆分为子词单元,我们将在下一节中讨论。 +- en: 2.5 Byte pair encoding + id: totrans-147 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.5 字节对编码 +- en: We implemented a simple tokenization scheme in the previous sections for illustration + purposes. This section covers a more sophisticated tokenization scheme based on + a concept called byte pair encoding (BPE). The BPE tokenizer covered in this section + was used to train LLMs such as GPT-2, GPT-3, and ChatGPT. + id: totrans-148 + prefs: [] + type: TYPE_NORMAL + zh: 我们在前几节中实现了一个简单的分词方案,用于说明目的。本节介绍基于称为字节对编码(BPE)的概念的更复杂的分词方案。本节介绍的BPE分词器用于训练LLMs,如GPT-2、GPT-3和ChatGPT。 +- en: 'Since implementing BPE can be relatively complicated, we will use an existing + Python open-source library called *tiktoken* ([https://github.com/openai/tiktoken](openai.html)), + which implements the BPE algorithm very efficiently based on source code in Rust. + Similar to other Python libraries, we can install the tiktoken library via Python''s + `pip` installer from the terminal:' + id: totrans-149 + prefs: [] + type: TYPE_NORMAL + zh: 由于实现BPE可能相对复杂,我们将使用一个名为*tiktoken*([https://github.com/openai/tiktoken](openai.html))的现有Python开源库,该库基于Rust中的源代码非常有效地实现了BPE算法。与其他Python库类似,我们可以通过Python的终端上的`pip`安装程序安装tiktoken库: +- en: '[PRE33]' + id: totrans-150 + prefs: [] + type: TYPE_PRE + zh: '[PRE33]' +- en: 'The code in this chapter is based on tiktoken 0.5.1\. 
You can use the following + code to check the version you currently have installed:' + id: totrans-151 + prefs: [] + type: TYPE_NORMAL + zh: 本章中的代码基于tiktoken 0.5.1。您可以使用以下代码检查当前安装的版本: +- en: '[PRE34]' + id: totrans-152 + prefs: [] + type: TYPE_PRE + zh: '[PRE34]' +- en: 'Once installed, we can instantiate the BPE tokenizer from tiktoken as follows:' + id: totrans-153 + prefs: [] + type: TYPE_NORMAL + zh: 安装完成后,我们可以如下实例化tiktoken中的BPE分词器: +- en: '[PRE35]' + id: totrans-154 + prefs: [] + type: TYPE_PRE + zh: '[PRE35]' +- en: 'The usage of this tokenizer is similar to SimpleTokenizerV2 we implemented + previously via an `encode` method:' + id: totrans-155 + prefs: [] + type: TYPE_NORMAL + zh: 此分词器的使用方式类似于我们之前通过`encode`方法实现的SimpleTokenizerV2: +- en: '[PRE36]' + id: totrans-156 + prefs: [] + type: TYPE_PRE + zh: '[PRE36]' +- en: 'The code above prints the following token IDs:' + id: totrans-157 + prefs: [] + type: TYPE_NORMAL + zh: 上述代码打印以下标记ID: +- en: '[PRE37]' + id: totrans-158 + prefs: [] + type: TYPE_PRE + zh: '[PRE37]' +- en: 'We can then convert the token IDs back into text using the decode method, similar + to our `SimpleTokenizerV2` earlier:' + id: totrans-159 + prefs: [] + type: TYPE_NORMAL + zh: 然后,我们可以使用解码方法将标记ID转换回文本,类似于我们之前的`SimpleTokenizerV2`: +- en: '[PRE38]' + id: totrans-160 + prefs: [] + type: TYPE_PRE + zh: '[PRE38]' +- en: 'The above code prints the following:' + id: totrans-161 + prefs: [] + type: TYPE_NORMAL + zh: 上述代码打印如下: +- en: '[PRE39]' + id: totrans-162 + prefs: [] + type: TYPE_PRE + zh: '[PRE39]' +- en: We can make two noteworthy observations based on the token IDs and decoded text + above. First, the `<|endoftext|>` token is assigned a relatively large token ID, + namely, 50256\. In fact, the BPE tokenizer that was used to train models such + as GPT-2, GPT-3, and ChatGPT has a total vocabulary size of 50,257, with `<|endoftext|>` + being assigned the largest token ID. + id: totrans-163 + prefs: [] + type: TYPE_NORMAL + zh: 基于上述标记ID和解码文本,我们可以得出两个值得注意的观察结果。首先,`<|endoftext|>`标记被分配了一个相对较大的标记ID,即50256。事实上,用于训练诸如GPT-2、GPT-3和ChatGPT等模型的BPE分词器具有总共50257个词汇,其中`<|endoftext|>`被分配了最大的标记ID。 +- en: Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace" + correctly. The BPE tokenizer can handle any unknown word. How does it achieve + this without using `<|unk|>` tokens? + id: totrans-164 + prefs: [] + type: TYPE_NORMAL + zh: 第二,上述的BPE分词器可以正确地对未知单词进行编码和解码,例如"someunknownPlace"。BPE分词器可以处理任何未知单词。它是如何在不使用`<|unk|>`标记的情况下实现这一点的? +- en: The algorithm underlying BPE breaks down words that aren't in its predefined + vocabulary into smaller subword units or even individual characters, enabling + it to handle out-of-vocabulary words. So, thanks to the BPE algorithm, if the + tokenizer encounters an unfamiliar word during tokenization, it can represent + it as a sequence of subword tokens or characters, as illustrated in figure 2.11. + id: totrans-165 + prefs: [] + type: TYPE_NORMAL + zh: BPE算法的基础是将不在其预定义词汇表中的单词分解为更小的子词单元甚至是单个字符,使其能够处理词汇表之外的词汇。因此,多亏了BPE算法,如果分词器在分词过程中遇到陌生的单词,它可以将其表示为一系列子词标记或字符,如图2.11所示。 +- en: Figure 2.11 BPE tokenizers break down unknown words into subwords and individual + characters. This way, a BPE tokenizer can parse any word and doesn't need to replace + unknown words with special tokens, such as <|unk|>. 
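The encode and decode calls described above boil down to a few lines. The sample string below is hypothetical, but it exercises both behaviors discussed in this section: the `<|endoftext|>` token has to be explicitly allowed, and an out-of-vocabulary word such as `someunknownPlace` is still handled without an `<|unk|>` token:

```python
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))   # one way to check the installed version

tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In someunknownPlace."   # hypothetical sample text
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)                      # <|endoftext|> is encoded as token ID 50256
print(tokenizer.decode(integers))    # reconstructs the original string
```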
+ id: totrans-166 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.11 BPE分词器将未知单词分解为子词和单个字符。这样,BPE分词器可以解析任何单词,无需用特殊标记(如`<|unk|>`)替换未知单词。 +- en: '![](images/ch-02__image022.png)' + id: totrans-167 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image022.png)' +- en: As illustrated in figure 2.11, the ability to break down unknown words into + individual characters ensures that the tokenizer, and consequently the LLM that + is trained with it, can process any text, even if it contains words that were + not present in its training data. + id: totrans-168 + prefs: [] + type: TYPE_NORMAL + zh: 如图 2.11 所示,将未知单词分解为单个字符的能力确保了分词器以及随之训练的LLM可以处理任何文本,即使其中包含了其训练数据中未出现的单词。 +- en: Exercise 2.1 Byte pair encoding of unknown words + id: totrans-169 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 练习 2.1 未知单词的字节对编码 +- en: Try the BPE tokenizer from the tiktoken library on the unknown words "Akwirw + ier" and print the individual token IDs. Then, call the decode function on each + of the resulting integers in this list to reproduce the mapping shown in figure + 2.1\. Lastly, call the decode method on the token IDs to check whether it can + reconstruct the original input, "Akwirw ier". + id: totrans-170 + prefs: [] + type: TYPE_NORMAL + zh: 尝试从tiktoken库中使用BPE分词器对未知单词"Akwirw ier",并打印各个标记的ID。然后,在此列表中的每个生成的整数上调用解码函数,以重现图2.1中显示的映射。最后,在标记ID上调用解码方法以检查是否可以重建原始输入,即"Akwirw + ier"。 +- en: A detailed discussion and implementation of BPE is out of the scope of this + book, but in short, it builds its vocabulary by iteratively merging frequent characters + into subwords and frequent subwords into words. For example, BPE starts with adding + all individual single characters to its vocabulary ("a", "b", ...). In the next + stage, it merges character combinations that frequently occur together into subwords. + For example, "d" and "e" may be merged into the subword "de," which is common + in many English words like "define", "depend", "made", and "hidden". The merges + are determined by a frequency cutoff. + id: totrans-171 + prefs: [] + type: TYPE_NORMAL + zh: 本书不讨论BPE的详细讨论和实现,但简而言之,它通过迭代地将频繁出现的字符合并为子词和频繁出现的子词合并为单词来构建其词汇表。例如,BPE从将所有单个字符添加到其词汇表开始("a","b",...)。在下一阶段,它将经常一起出现的字符组合成子词。例如,"d"和"e"可能会合并成子词"de",在许多英文单词中很常见,如"define","depend","made"和"hidden"。合并是由频率截止确定的。 +- en: 2.6 Data sampling with a sliding window + id: totrans-172 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.6 滑动窗口数据采样 +- en: The previous section covered the tokenization steps and conversion from string + tokens into integer token IDs in great detail. The next step before we can finally + create the embeddings for the LLM is to generate the input-target pairs required + for training an LLM. + id: totrans-173 + prefs: [] + type: TYPE_NORMAL + zh: 前一节详细介绍了标记化步骤以及将字符串标记转换为整数标记ID之后,我们最终可以为LLM生成所需的输入-目标对,以用于训练LLM。 +- en: What do these input-target pairs look like? As we learned in chapter 1, LLMs + are pretrained by predicting the next word in a text, as depicted in figure 2.12. + id: totrans-174 + prefs: [] + type: TYPE_NORMAL + zh: 这些输入-目标对是什么样子?正如我们在第一章中学到的那样,LLMs 是通过预测文本中的下一个单词来进行预训练的,如图 2.12 所示。 +- en: Figure 2.12 Given a text sample, extract input blocks as subsamples that serve + as input to the LLM, and the LLM's prediction task during training is to predict + the next word that follows the input block. During training, we mask out all words + that are past the target. Note that the text shown in this figure would undergo + tokenization before the LLM can process it; however, this figure omits the tokenization + step for clarity. 
+ id: totrans-175 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图 2.12 给定一个文本样本,提取作为 LLM 输入的子样本的输入块,并且在训练期间,LLM 的预测任务是预测跟随输入块的下一个单词。在训练中,我们屏蔽所有超过目标的单词。请注意,在 + LLM 可处理文本之前,此图中显示的文本会进行 tokenization;但为了清晰起见,该图省略了 tokenization 步骤。 +- en: '![](images/ch-02__image024.png)' + id: totrans-176 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image024.png)' +- en: In this section we implement a data loader that fetches the input-target pairs + depicted in figure 2.12 from the training dataset using a sliding window approach. + id: totrans-177 + prefs: [] + type: TYPE_NORMAL + zh: 在此部分中,我们实现了一个数据加载器,使用滑动窗口方法从训练数据集中提取图 2.12 中所示的输入-目标对。 +- en: 'To get started, we will first tokenize the whole The Verdict short story we + worked with earlier using the BPE tokenizer introduced in the previous section:' + id: totrans-178 + prefs: [] + type: TYPE_NORMAL + zh: 为了开始,我们将使用前面介绍的 BPE tokenizer 对我们之前使用的《裁决》短篇小说进行标记化处理: +- en: '[PRE40]' + id: totrans-179 + prefs: [] + type: TYPE_PRE + zh: '[PRE40]' +- en: Executing the code above will return 5145, the total number of tokens in the + training set, after applying the BPE tokenizer. + id: totrans-180 + prefs: [] + type: TYPE_NORMAL + zh: 执行上述代码将返回 5145,应用 BPE tokenizer 后训练集中的总标记数。 +- en: 'Next, we remove the first 50 tokens from the dataset for demonstration purposesas + it results in a slightly more interesting text passage in the next steps:' + id: totrans-181 + prefs: [] + type: TYPE_NORMAL + zh: 接下来,为了演示目的,让我们从数据集中删除前 50 个标记,因为这会使接下来的文本段落稍微有趣一些: +- en: '[PRE41]' + id: totrans-182 + prefs: [] + type: TYPE_PRE + zh: '[PRE41]' +- en: 'One of the easiest and most intuitive ways to create the input-target pairs + for the next-word prediction task is to create two variables, `x` and `y`, where + `x` contains the input tokens and `y` contains the targets, which are the inputs + shifted by 1:' + id: totrans-183 + prefs: [] + type: TYPE_NORMAL + zh: 创建下一个单词预测任务的输入-目标对最简单直观的方法之一是创建两个变量,`x` 和 `y`,其中 `x` 包含输入标记,`y` 包含目标,即将输入向后移动一个位置的输入: +- en: '[PRE42]' + id: totrans-184 + prefs: [] + type: TYPE_PRE + zh: '[PRE42]' +- en: 'Running the above code prints the following output:' + id: totrans-185 + prefs: [] + type: TYPE_NORMAL + zh: 运行上述代码会打印以下输出: +- en: '[PRE43]' + id: totrans-186 + prefs: [] + type: TYPE_PRE + zh: '[PRE43]' +- en: 'Processing the inputs along with the targets, which are the inputs shifted + by one position, we can then create the next-word prediction tasks depicted earlier + in figure 2.12, as follows:' + id: totrans-187 + prefs: [] + type: TYPE_NORMAL + zh: 处理输入以及目标(即向后移动了一个位置的输入),我们可以创建如图 2.12 中所示的下一个单词预测任务: +- en: '[PRE44]' + id: totrans-188 + prefs: [] + type: TYPE_PRE + zh: '[PRE44]' +- en: 'The code above prints the following:' + id: totrans-189 + prefs: [] + type: TYPE_NORMAL + zh: 上述代码会打印以下内容: +- en: '[PRE45]' + id: totrans-190 + prefs: [] + type: TYPE_PRE + zh: '[PRE45]' +- en: Everything left of the arrow (`---->`) refers to the input an LLM would receive, + and the token ID on the right side of the arrow represents the target token ID + that the LLM is supposed to predict. 
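The shifted-by-one construction described above takes only a few lines. The sentence used here is a hypothetical stand-in for the tokenized short story, and the variable names other than `x` and `y` are not taken from the book's listings:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode("In the heart of the city stood the old library, a relic from a bygone era.")

context_size = 4                           # how many tokens the model sees at once
x = token_ids[:context_size]               # inputs
y = token_ids[1:context_size + 1]          # targets: the inputs shifted by one position
print(f"x: {x}")
print(f"y:      {y}")

# the next-word prediction tasks contained in one input window
for i in range(1, context_size + 1):
    context = token_ids[:i]
    desired = token_ids[i]
    print(context, "---->", desired)       # input token IDs on the left, target token ID on the right
```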
+ id: totrans-191 + prefs: [] + type: TYPE_NORMAL + zh: 形如箭头 (`---->`) 左侧的所有内容指的是 LLM 收到的输入,箭头右侧的标记 ID 表示 LLM 应该预测的目标标记 ID。 +- en: 'For illustration purposes, let''s repeat the previous code but convert the + token IDs into text:' + id: totrans-192 + prefs: [] + type: TYPE_NORMAL + zh: 为了说明目的,让我们重复之前的代码但将标记 ID 转换为文本: +- en: '[PRE46]' + id: totrans-193 + prefs: [] + type: TYPE_PRE + zh: '[PRE46]' +- en: 'The following outputs show how the input and outputs look in text format:' + id: totrans-194 + prefs: [] + type: TYPE_NORMAL + zh: 以下输出显示输入和输出以文本格式的样式: +- en: '[PRE47]' + id: totrans-195 + prefs: [] + type: TYPE_PRE + zh: '[PRE47]' +- en: We've now created the input-target pairs that we can turn into use for the LLM + training in upcoming chapters. + id: totrans-196 + prefs: [] + type: TYPE_NORMAL + zh: 我们现在已经创建了输入-目标对,可以在接下来的章节中用于 LLM 训练。 +- en: 'There''s only one more task before we can turn the tokens into embeddings, + as we mentioned at the beginning of this chapter: implementing an efficient data + loader that iterates over the input dataset and returns the inputs and targets + as PyTorch tensors.' + id: totrans-197 + prefs: [] + type: TYPE_NORMAL + zh: 在我们可以将标记转换为嵌入之前,还有最后一个任务,正如我们在本章开头所提到的:实现一个高效的数据加载器,迭代输入数据集并返回 PyTorch 张量作为输入和目标。 +- en: 'In particular, we are interested in returning two tensors: an input tensor + containing the text that the LLM sees and a target tensor that includes the targets + for the LLM to predict, as depicted in figure 2.13.' + id: totrans-198 + prefs: [] + type: TYPE_NORMAL + zh: 特别是,我们有兴趣返回两个张量:一个包含 LLM 看到的文本的输入张量,以及一个包含 LLM 预测目标的目标张量,如图 2.13 所示。 +- en: Figure 2.13 To implement efficient data loaders, we collect the inputs in a + tensor, x, where each row represents one input context. A second tensor, y, contains + the corresponding prediction targets (next words), which are created by shifting + the input by one position. + id: totrans-199 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图 2.13 为了实现高效的数据加载器,我们将输入都收集到一个张量 x 中,其中每一行代表一个输入上下文。第二个张量 y 包含对应的预测目标(下一个单词),它们是通过将输入向后移动一个位置来创建的。 +- en: '![](images/ch-02__image026.png)' + id: totrans-200 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image026.png)' +- en: While figure 2.13 shows the tokens in string format for illustration purposes, + the code implementation will operate on token IDs directly since the encode method + of the BPE tokenizer performs both tokenization and conversion into token IDs + as a single step. + id: totrans-201 + prefs: [] + type: TYPE_NORMAL + zh: 虽然图2.13展示了字符串格式的token以进行说明,但代码实现将直接操作token ID,因为BPE标记器的encode方法执行了tokenization和转换为token + ID为单一步骤。 +- en: For the efficient data loader implementation, we will use PyTorch's built-in + Dataset and DataLoader classes. For additional information and guidance on installing + PyTorch, please see section A.1.3, Installing PyTorch, in Appendix A. 
+ id: totrans-202 + prefs: [] + type: TYPE_NORMAL + zh: 对于高效的数据加载器实现,我们将使用PyTorch内置的Dataset和DataLoader类。有关安装PyTorch的更多信息和指导,请参阅附录A的*A.1.3,安装PyTorch*一节。 +- en: 'The code for the dataset class is shown in code listing 2.5:' + id: totrans-203 + prefs: [] + type: TYPE_NORMAL + zh: 数据集类的代码如图2.5所示: +- en: Listing 2.5 A dataset for batched inputs and targets + id: totrans-204 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.5 一批输入和目标的数据集 +- en: '[PRE48]' + id: totrans-205 + prefs: [] + type: TYPE_PRE + zh: '[PRE48]' +- en: The `GPTDatasetV1` class in listing 2.5 is based on the PyTorch `Dataset` class + and defines how individual rows are fetched from the dataset, where each row consists + of a number of token IDs (based on a `max_length`) assigned to an `input_chunk` + tensor. The `target_chunk` tensor contains the corresponding targets. I recommend + reading on to see how the data returned from this dataset looks like when we combine + the dataset with a PyTorch `DataLoader` -- this will bring additional intuition + and clarity. + id: totrans-206 + prefs: [] + type: TYPE_NORMAL + zh: 图2.5中的`GPTDatasetV1`类基于PyTorch的`Dataset`类,定义了如何从数据集中获取单独的行,其中每一行都包含一系列基于`max_length`分配给`input_chunk`张量的token + ID。`target_chunk`张量包含相应的目标。我建议继续阅读,看看当我们将数据集与PyTorch的`DataLoader`结合使用时,这个数据集返回的数据是什么样的——这将带来额外的直觉和清晰度。 +- en: If you are new to the structure of PyTorch `Dataset` classes, such as shown + in listing 2.5, please read section *A.6, Setting up efficient data loaders*, + in Appendix A, which explains the general structure and usage of PyTorch `Dataset` + and `DataLoader` classes. + id: totrans-207 + prefs: [] + type: TYPE_NORMAL + zh: 如果您对PyTorch的`Dataset`类的结构(如图2.5所示)是新手,请阅读附录A的*A.6,设置高效的数据加载器*一节,其中解释了PyTorch的`Dataset`和`DataLoader`类的一般结构和用法。 +- en: 'The following code will use the `GPTDatasetV1` to load the inputs in batches + via a PyTorch `DataLoader`:' + id: totrans-208 + prefs: [] + type: TYPE_NORMAL + zh: 以下代码将使用`GPTDatasetV1`通过PyTorch的`DataLoader`来批量加载输入: +- en: Listing 2.6 A data loader to generate batches with input-with pairs + id: totrans-209 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.6 用于生成带输入对的批次的数据加载器 +- en: '[PRE49]' + id: totrans-210 + prefs: [] + type: TYPE_PRE + zh: '[PRE49]' +- en: 'Let''s test the `dataloader` with a batch size of 1 for an LLM with a context + size of 4 to develop an intuition of how the `GPTDatasetV1` class from listing + 2.5 and the `create_dataloader` function from listing 2.6 work together:' + id: totrans-211 + prefs: [] + type: TYPE_NORMAL + zh: 让我们测试`dataloader`,将一个上下文大小为4的LLM的批量大小设为1,以便理解图2.5的`GPTDatasetV1`类和图2.6的`create_dataloader`函数如何协同工作。 +- en: '[PRE50]' + id: totrans-212 + prefs: [] + type: TYPE_PRE + zh: '[PRE50]' +- en: 'Executing the preceding code prints the following:' + id: totrans-213 + prefs: [] + type: TYPE_NORMAL + zh: 执行前面的代码将打印以下内容: +- en: '[PRE51]' + id: totrans-214 + prefs: [] + type: TYPE_PRE + zh: '[PRE51]' +- en: 'The `first_batch` variable contains two tensors: the first tensor stores the + input token IDs, and the second tensor stores the target token IDs. Since the + `max_length` is set to 4, each of the two tensors contains 4 token IDs. Note that + an input size of 4 is relatively small and only chosen for illustration purposes. + It is common to train LLMs with input sizes of at least 256.' 
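Since listings 2.5 and 2.6 are only referenced above, here is a hedged sketch of how such a dataset and loader can be put together. The names `GPTDatasetV1` and `create_dataloader` follow the text; the exact signature, default values, and internal details are assumptions rather than the book's listings:

```python
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)
        # slide a window of size max_length over the token IDs, stepping by `stride`
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


# usage mirroring the batch_size=1, max_length=4, stride=1 experiment described above
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
first_batch = next(iter(dataloader))
print(first_batch)   # one input tensor and one target tensor, each holding 4 token IDs
```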
+ id: totrans-215 + prefs: [] + type: TYPE_NORMAL + zh: '`first_batch`变量包含两个张量:第一个张量存储输入 token ID,第二个张量存储目标 token ID。由于`max_length`设置为4,这两个张量每个都包含4个token + ID。值得注意的是,输入大小为4相对较小,仅用于说明目的。通常会用至少256的输入大小来训练LLMs。' +- en: 'To illustrate the meaning of `stride=1`, let''s fetch another batch from this + dataset:' + id: totrans-216 + prefs: [] + type: TYPE_NORMAL + zh: 为了说明`stride=1`的含义,让我们从这个数据集中获取另一个批次: +- en: '[PRE52]' + id: totrans-217 + prefs: [] + type: TYPE_PRE + zh: '[PRE52]' +- en: 'The second batch has the following contents:' + id: totrans-218 + prefs: [] + type: TYPE_NORMAL + zh: 第二批的内容如下: +- en: '[PRE53]' + id: totrans-219 + prefs: [] + type: TYPE_PRE + zh: '[PRE53]' +- en: If we compare the first with the second batch, we can see that the second batch's + token IDs are shifted by one position compared to the first batch (for example, + the second ID in the first batch's input is 367, which is the first ID of the + second batch's input). The `stride` setting dictates the number of positions the + inputs shift across batches, emulating a sliding window approach, as demonstrated + in Figure 2.14. + id: totrans-220 + prefs: [] + type: TYPE_NORMAL + zh: 如果我们比较第一批和第二批,我们会发现相对于第一批,第二批的token ID向后移动了一个位置(例如,第一批输入中的第二个ID是367,这是第二批输入中的第一个ID)。`stride`设置规定了输入在批次之间移动的位置数,模拟了一个滑动窗口的方法,如图2.14所示。 +- en: Figure 2.14 When creating multiple batches from the input dataset, we slide + an input window across the text. If the stride is set to 1, we shift the input + window by 1 position when creating the next batch. If we set the stride equal + to the input window size, we can prevent overlaps between the batches. + id: totrans-221 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.14 在从输入数据集创建多个批次时,我们在文本上滑动一个输入窗口。如果将步幅设置为1,则在创建下一个批次时,将输入窗口向右移动1个位置。如果我们将步幅设置为等于输入窗口大小,我们可以防止批次之间的重叠。 +- en: '![](images/ch-02__image028.png)' + id: totrans-222 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image028.png)' +- en: Exercise 2.2 Data loaders with different strides and context sizes + id: totrans-223 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 练习2.2 具有不同步幅和上下文大小的数据加载器 +- en: To develop more intuition for how the data loader works, try to run it with + different settings such as max_length=2 and stride=2 and max_length=8 and stride=2. + id: totrans-224 + prefs: [] + type: TYPE_NORMAL + zh: 要更好地理解数据加载器的工作原理,请尝试以不同设置运行,如max_length=2和stride=2以及max_length=8和stride=2。 +- en: Batch sizes of 1, such as we have sampled from the data loader so far, are useful + for illustration purposes. If you have previous experience with deep learning, + you may know that small batch sizes require less memory during training but lead + to more noisy model updates. Just like in regular deep learning, the batch size + is a trade-off and hyperparameter to experiment with when training LLMs. 
+ id: totrans-225 + prefs: [] + type: TYPE_NORMAL + zh: 与我们到目前为止从数据加载器中抽样的批次大小为1一样,这对于说明目的非常有用。如果您有深度学习的经验,您可能知道,较小的批次大小在训练期间需要更少的内存,但会导致更多的噪声模型更新。就像在常规深度学习中一样,批次大小是一个需要在训练LLM时进行实验的权衡和超参数。 +- en: 'Before we move on to the two final sections of this chapter that are focused + on creating the embedding vectors from the token IDs, let''s have a brief look + at how we can use the data loader to sample with a batch size greater than 1:' + id: totrans-226 + prefs: [] + type: TYPE_NORMAL + zh: 在我们继续本章的最后两个重点部分,这些部分侧重于从标记ID创建嵌入向量之前,让我们简要了解如何使用数据加载器进行批量大小大于1的抽样: +- en: '[PRE54]' + id: totrans-227 + prefs: [] + type: TYPE_PRE + zh: '[PRE54]' +- en: 'This prints the following:' + id: totrans-228 + prefs: [] + type: TYPE_NORMAL + zh: 这将输出以下内容: +- en: '[PRE55]' + id: totrans-229 + prefs: [] + type: TYPE_PRE + zh: '[PRE55]' +- en: Note that we increase the stride to 5, which is the max length + 1\. This is + to utilize the data set fully (we don't skip a single word) but also avoid any + overlap between the batches, since more overlap could lead to increased overfitting. + For instance, if we set the stride equal to the max length, the target ID for + the last input token ID in each row would become the first input token ID in the + next row. + id: totrans-230 + prefs: [] + type: TYPE_NORMAL + zh: 请注意,我们将步幅增加到5,这是最大长度+1。这是为了充分利用数据集(我们不跳过任何单词),同时避免批次之间的任何重叠,因为更多的重叠可能导致过拟合增加。例如,如果我们将步幅设置为与最大长度相等,那么每行中最后一个输入标记ID的目标ID将成为下一行中第一个输入标记ID。 +- en: In the final two sections of this chapter, we will implement embedding layers + that convert the token IDs into continuous vector representations, which serve + as input data format for LLMs. + id: totrans-231 + prefs: [] + type: TYPE_NORMAL + zh: 在本章的最后两个部分中,我们将实现将标记ID转换为连续向量表示的嵌入层,这将作为LLM的输入数据格式。 +- en: 2.7 Creating token embeddings + id: totrans-232 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.7 创建标记嵌入 +- en: The last step for preparing the input text for LLM training is to convert the + token IDs into embedding vectors, as illustrated in figure 2.15, which will be + the focus of these two last remaining sections of this chapter. + id: totrans-233 + prefs: [] + type: TYPE_NORMAL + zh: 为准备LLM训练的输入文本的最后一步是将标记ID转换为嵌入向量,如图2.15所示,这将是本章最后两个剩余部分的重点。 +- en: Figure 2.15 Preparing the input text for an LLM involves tokenizing text, converting + text tokens to token IDs, and converting token IDs into vector embedding vectors. + In this section, we consider the token IDs created in previous sections to create + the token embedding vectors. + id: totrans-234 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.15 准备LLM输入文本涉及对文本进行标记化、将文本标记转换为标记ID和将标记ID转换为向量嵌入向量。在本节中,我们考虑前几节中创建的标记ID以创建标记嵌入向量。 +- en: '![](images/ch-02__image030.png)' + id: totrans-235 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image030.png)' +- en: A continuous vector representation, or embedding, is necessary since GPT-like + LLMs are deep neural networks trained with the backpropagation algorithm. If you + are unfamiliar with how neural networks are trained with backpropagation, please + read section A.4, *Automatic differentiation made easy*, in Appendix A. + id: totrans-236 + prefs: [] + type: TYPE_NORMAL + zh: 连续向量表示,或嵌入,是必要的,因为类似GPT的LLM是使用反向传播算法训练的深度神经网络。如果您不熟悉神经网络如何使用反向传播进行训练,请阅读附录A中的第A.4节,*简化的自动微分*。 +- en: 'Let''s illustrate how the token ID to embedding vector conversion works with + a hands-on example. 
    Suppose we have the following four input tokens with IDs 5, 1, 3, and 2:'
  id: totrans-237
  prefs: []
  type: TYPE_NORMAL
  zh: 让我们用一个实际例子说明标记ID到嵌入向量转换是如何工作的。假设我们有以下四个带有ID 5、1、3和2的输入标记:
- en: '[PRE56]'
  id: totrans-238
  prefs: []
  type: TYPE_PRE
  zh: '[PRE56]'
- en: 'For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):'
  id: totrans-239
  prefs: []
  type: TYPE_NORMAL
  zh: 为了简单起见和说明目的,假设我们只有一个小的词汇表,其中只有 6 个单词(而不是 BPE 标记器词汇表中的 50,257 个单词),我们想创建大小为 3 的嵌入(在 GPT-3 中,嵌入大小为 12,288 维):
- en: '[PRE57]'
  id: totrans-240
  prefs: []
  type: TYPE_PRE
  zh: '[PRE57]'
- en: 'Using the `vocab_size` and `output_dim`, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility purposes:'
  id: totrans-241
  prefs: []
  type: TYPE_NORMAL
  zh: 使用 `vocab_size` 和 `output_dim`,我们可以在 PyTorch 中实例化一个嵌入层,设置随机种子为 123 以便进行再现性:
- en: '[PRE58]'
  id: totrans-242
  prefs: []
  type: TYPE_PRE
  zh: '[PRE58]'
- en: 'The print statement in the preceding code example prints the embedding layer''s underlying weight matrix:'
  id: totrans-243
  prefs: []
  type: TYPE_NORMAL
  zh: 在上述代码示例中的打印语句打印了嵌入层的底层权重矩阵:
- en: '[PRE59]'
  id: totrans-244
  prefs: []
  type: TYPE_PRE
  zh: '[PRE59]'
- en: We can see that the weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself, as we will see in upcoming chapters. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions.
  id: totrans-245
  prefs: []
  type: TYPE_NORMAL
  zh: 我们可以看到嵌入层的权重矩阵包含了小型的随机值。这些值在 LLM 训练过程中作为 LLM 优化的一部分而被优化,我们将在后续章节中看到。此外,我们可以看到权重矩阵有六行和三列。词汇表中的每个可能的标记都有一行。这三个嵌入维度中的每个维度都有一列。
- en: 'After instantiating the embedding layer, let''s now apply it to a token ID to obtain the embedding vector:'
  id: totrans-246
  prefs: []
  type: TYPE_NORMAL
  zh: 在我们实例化嵌入层之后,现在让我们将其应用到一个标记 ID 上以获取嵌入向量:
- en: '[PRE60]'
  id: totrans-247
  prefs: []
  type: TYPE_PRE
  zh: '[PRE60]'
- en: 'The returned embedding vector is as follows:'
  id: totrans-248
  prefs: []
  type: TYPE_NORMAL
  zh: 返回的嵌入向量如下:
- en: '[PRE61]'
  id: totrans-249
  prefs: []
  type: TYPE_PRE
  zh: '[PRE61]'
- en: If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row (Python starts with a zero index, so it's the row corresponding to index 3). In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.
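- en: 'The behavior described above can be reproduced with a few lines of PyTorch. The sketch below follows the values given in the text (vocabulary size 6, embedding size 3, random seed 123, token IDs 5, 1, 3, and 2), but the variable names are illustrative and the printed numbers may differ from those shown in the listings:'
  prefs: []
  type: TYPE_NORMAL
- en: |-
    import torch

    vocab_size = 6     # toy vocabulary of 6 tokens
    output_dim = 3     # 3-dimensional embeddings

    torch.manual_seed(123)
    embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
    print(embedding_layer.weight)                  # a 6x3 weight matrix of small random values

    # Looking up token ID 3 returns the row at index 3 (the 4th row) of the weight matrix
    print(embedding_layer(torch.tensor([3])))

    # Applying the layer to several IDs at once performs one row lookup per ID
    input_ids = torch.tensor([5, 1, 3, 2])
    print(embedding_layer(input_ids))              # a 4x3 matrix
  prefs: []
  type: TYPE_PRE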
+ id: totrans-250 + prefs: [] + type: TYPE_NORMAL + zh: 如果我们将标记 ID 3 的嵌入向量与先前的嵌入矩阵进行比较,我们会看到它与第四行完全相同(Python 从零索引开始,所以它是与索引 3 对应的行)。换句话说,嵌入层本质上是一个查找操作,它通过标记 + ID 从嵌入层的权重矩阵中检索行。 +- en: Embedding layers versus matrix multiplication + id: totrans-251 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 嵌入层与矩阵乘法 +- en: For those who are familiar with one-hot encoding, the embedding layer approach + above is essentially just a more efficient way of implementing one-hot encoding + followed by matrix multiplication in a fully connected layer, which is illustrated + in the supplementary code on GitHub at [https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/03_bonus_embedding-vs-matmul](ch02.html). + Because the embedding layer is just a more efficient implementation equivalent + to the one-hot encoding and matrix-multiplication approach, it can be seen as + a neural network layer that can be optimized via backpropagation. + id: totrans-252 + prefs: [] + type: TYPE_NORMAL + zh: 对于那些熟悉独热编码的人来说,上面的嵌入层方法实质上只是实施独热编码加上全连接层中的矩阵乘法更高效的一种方式,这在 GitHub 上的补充代码中进行了说明 + [https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/03_bonus_embedding-vs-matmul](ch02.html)。因为嵌入层只是一个更高效的等效实现,等同于独热编码和矩阵乘法方法,它可以看作是一个可以通过反向传播进行优化的神经网络层。 +- en: 'Previously, we have seen how to convert a single token ID into a three-dimensional + embedding vector. Let''s now apply that to all four input IDs we defined earlier + (`torch.tensor([5, 1, 3, 2])`):' + id: totrans-253 + prefs: [] + type: TYPE_NORMAL + zh: 在之前,我们已经看到如何将单个标记 ID 转换为三维嵌入向量。现在让我们将其应用到我们之前定义的四个输入 ID 上 (`torch.tensor([5, + 1, 3, 2])`): +- en: '[PRE62]' + id: totrans-254 + prefs: [] + type: TYPE_PRE + zh: '[PRE62]' +- en: 'The print output reveals that this results in a 4x3 matrix:' + id: totrans-255 + prefs: [] + type: TYPE_NORMAL + zh: 打印输出显示,结果是一个 4x3 的矩阵: +- en: '[PRE63]' + id: totrans-256 + prefs: [] + type: TYPE_PRE + zh: '[PRE63]' +- en: Each row in this output matrix is obtained via a lookup operation from the embedding + weight matrix, as illustrated in figure 2.16. + id: totrans-257 + prefs: [] + type: TYPE_NORMAL + zh: 此输出矩阵中的每一行都是通过从嵌入权重矩阵中进行查找操作得到的,正如图 2.16 所示。 +- en: Figure 2.16 Embedding layers perform a look-up operation, retrieving the embedding + vector corresponding to the token ID from the embedding layer's weight matrix. + For instance, the embedding vector of the token ID 5 is the sixth row of the embedding + layer weight matrix (it is the sixth instead of the fifth row because Python starts + counting at 0). + id: totrans-258 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图 2.16 嵌入层执行查找操作,从嵌入层的权重矩阵中检索与标记 ID 对应的嵌入向量。例如,标记 ID 5 的嵌入向量是嵌入层权重矩阵的第六行(它是第六行而不是第五行,因为 + Python 从 0 开始计数)。 +- en: '![](images/ch-02__image032.png)' + id: totrans-259 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image032.png)' +- en: This section covered how we create embedding vectors from token IDs. The next + and final section of this chapter will add a small modification to these embedding + vectors to encode positional information about a token within a text. + id: totrans-260 + prefs: [] + type: TYPE_NORMAL + zh: 本节介绍了如何从标记ID创建嵌入向量。本章的下一节也是最后一节,将对这些嵌入向量进行一些小的修改,以编码文本中标记的位置信息。 +- en: 2.8 Encoding word positions + id: totrans-261 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.8 编码词的位置 +- en: In the previous section, we converted the token IDs into a continuous vector + representation, the so-called token embeddings. In principle, this is a suitable + input for an LLM. 
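- en: 'To make the equivalence mentioned above concrete, the following sketch (illustrative, not the supplementary notebook itself) checks that an embedding lookup and a one-hot encoding followed by a matrix multiplication with the same weight matrix retrieve identical rows:'
  prefs: []
  type: TYPE_NORMAL
- en: |-
    import torch

    torch.manual_seed(123)
    vocab_size, output_dim = 6, 3
    embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

    input_ids = torch.tensor([5, 1, 3, 2])

    # 1) Direct embedding-layer lookup
    lookup = embedding_layer(input_ids)

    # 2) One-hot encoding followed by a matrix multiplication with the same weights
    onehot = torch.nn.functional.one_hot(input_ids, num_classes=vocab_size).float()
    matmul = onehot @ embedding_layer.weight

    print(torch.allclose(lookup, matmul))   # True: both approaches retrieve the same rows
  prefs: []
  type: TYPE_PRE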
However, a minor shortcoming of LLMs is that their self-attention + mechanism, which will be covered in detail in chapter 3, doesn't have a notion + of position or order for the tokens within a sequence. + id: totrans-262 + prefs: [] + type: TYPE_NORMAL + zh: 在前一节中,我们将标记ID转换为连续的向量表示,即所谓的标记嵌入。从原则上讲,这对于LLM来说是一个合适的输入。然而,LLM的一个小缺陷是,它们的自我注意机制(将详细介绍于第3章中)对于序列中的标记没有位置或顺序的概念。 +- en: The way the previously introduced embedding layer works is that the same token + ID always gets mapped to the same vector representation, regardless of where the + token ID is positioned in the input sequence, as illustrated in figure 2.17. + id: totrans-263 + prefs: [] + type: TYPE_NORMAL + zh: 先前介绍的嵌入层的工作方式是,相同的标记ID始终被映射到相同的向量表示,无论标记ID在输入序列中的位置如何,如图2.17所示。 +- en: Figure 2.17 The embedding layer converts a token ID into the same vector representation + regardless of where it is located in the input sequence. For example, the token + ID 5, whether it's in the first or third position in the token ID input vector, + will result in the same embedding vector. + id: totrans-264 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.17 嵌入层将标记ID转换为相同的向量表示,无论其在输入序列中的位置如何。例如,标记ID 5,无论是在标记ID输入向量的第一个位置还是第三个位置,都会导致相同的嵌入向量。 +- en: '![](images/ch-02__image034.png)' + id: totrans-265 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image034.png)' +- en: In principle, the deterministic, position-independent embedding of the token + ID is good for reproducibility purposes. However, since the self-attention mechanism + of LLMs itself is also position-agnostic, it is helpful to inject additional position + information into the LLM. + id: totrans-266 + prefs: [] + type: TYPE_NORMAL + zh: 从原则上讲,标记ID的确定性、位置无关的嵌入对于可重现性目的很好。然而,由于LLM的自我注意机制本身也是位置不可知的,向LLM注入额外的位置信息是有帮助的。 +- en: 'To achieve this, there are two broad categories of position-aware embeddings: + relative *positional embeddings* and absolute positional embeddings.' + id: totrans-267 + prefs: [] + type: TYPE_NORMAL + zh: 为了实现这一点,位置感知嵌入有两个广泛的类别:相对*位置嵌入*和绝对位置嵌入。 +- en: Absolute positional embeddings are directly associated with specific positions + in a sequence. For each position in the input sequence, a unique embedding is + added to the token's embedding to convey its exact location. For instance, the + first token will have a specific positional embedding, the second token another + distinct embedding, and so on, as illustrated in figure 2.18. + id: totrans-268 + prefs: [] + type: TYPE_NORMAL + zh: 绝对位置嵌入与序列中的特定位置直接相关联。对于输入序列中的每个位置,都会添加一个唯一的嵌入,以传达其确切位置。例如,第一个标记将具有特定的位置嵌入,第二个标记是另一个不同的嵌入,依此类推,如图2.18所示。 +- en: Figure 2.18 Positional embeddings are added to the token embedding vector to + create the input embeddings for an LLM. The positional vectors have the same dimension + as the original token embeddings. The token embeddings are shown with value 1 + for simplicity. + id: totrans-269 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.18 位置嵌入被添加到标记嵌入向量中,用于创建LLM的输入嵌入。位置向量的维度与原始标记嵌入相同。为简单起见,标记嵌入显示为值1。 +- en: '![](images/ch-02__image036.png)' + id: totrans-270 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image036.png)' +- en: Instead of focusing on the absolute position of a token, the emphasis of relative + positional embeddings is on the relative position or distance between tokens. + This means the model learns the relationships in terms of "how far apart" rather + than "at which exact position." The advantage here is that the model can generalize + better to sequences of varying lengths, even if it hasn't seen such lengths during + training. 
+ id: totrans-271 + prefs: [] + type: TYPE_NORMAL + zh: 相对位置嵌入不是关注一个标记的绝对位置,而是关注标记之间的相对位置或距离。这意味着模型学习的是关于“有多远”而不是“在哪个确切位置”。这里的优势在于,即使模型在训练期间没有看到这样的长度,它也能更好地概括不同长度的序列。 +- en: Both types of positional embeddings aim to augment the capacity of LLMs to understand + the order and relationships between tokens, ensuring more accurate and context-aware + predictions. The choice between them often depends on the specific application + and the nature of the data being processed. + id: totrans-272 + prefs: [] + type: TYPE_NORMAL + zh: 这两种位置嵌入的目标都是增强LLM理解标记之间的顺序和关系的能力,确保更准确和能够理解上下文的预测。它们之间的选择通常取决于特定的应用和正在处理的数据的性质。 +- en: OpenAI's GPT models use absolute positional embeddings that are optimized during + the training process rather than being fixed or predefined like the positional + encodings in the original Transformer model. This optimization process is part + of the model training itself, which we will implement later in this book. For + now, let's create the initial positional embeddings to create the LLM inputs for + the upcoming chapters. + id: totrans-273 + prefs: [] + type: TYPE_NORMAL + zh: OpenAI的GPT模型使用的是在训练过程中进行优化的绝对位置嵌入,而不是像原始Transformer模型中的位置编码一样是固定或预定义的。这个优化过程是模型训练本身的一部分,我们稍后会在本书中实现。现在,让我们创建初始位置嵌入以创建即将到来的章节的LLM输入。 +- en: 'Previously, we focused on very small embedding sizes in this chapter for illustration + purposes. We now consider more realistic and useful embedding sizes and encode + the input tokens into a 256-dimensional vector representation. This is smaller + than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 + dimensions) but still reasonable for experimentation. Furthermore, we assume that + the token IDs were created by the BPE tokenizer that we implemented earlier, which + has a vocabulary size of 50,257:' + id: totrans-274 + prefs: [] + type: TYPE_NORMAL + zh: 在本章中,我们之前专注于非常小的嵌入尺寸以进行举例说明。现在我们考虑更现实和有用的嵌入尺寸,并将输入令牌编码为256维向量表示。这比原始的GPT-3模型使用的要小(在GPT-3中,嵌入尺寸是12,288维),但对于实验仍然是合理的。此外,我们假设令牌ID是由我们先前实现的BPE标记器创建的,其词汇量为50,257: +- en: '[PRE64]' + id: totrans-275 + prefs: [] + type: TYPE_PRE + zh: '[PRE64]' +- en: Using the `token_embedding_layer` above, if we sample data from the data loader, + we embed each token in each batch into a 256-dimensional vector. If we have a + batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor. + id: totrans-276 + prefs: [] + type: TYPE_NORMAL + zh: 使用上面的`token_embedding_layer`,如果我们从数据加载器中取样数据,我们将每个批次中的每个令牌嵌入为一个256维的向量。如果我们的批次大小为8,每个有四个令牌,结果将是一个8x4x256的张量。 +- en: 'Let''s instantiate the data loader from section 2.6, *Data sampling with a + sliding window*, first:' + id: totrans-277 + prefs: [] + type: TYPE_NORMAL + zh: 让我们先从第2.6节“使用滑动窗口进行数据抽样”中实例化数据加载器: +- en: '[PRE65]' + id: totrans-278 + prefs: [] + type: TYPE_PRE + zh: '[PRE65]' +- en: 'The preceding code prints the following output:' + id: totrans-279 + prefs: [] + type: TYPE_NORMAL + zh: 前面的代码打印如下输出: +- en: '[PRE66]' + id: totrans-280 + prefs: [] + type: TYPE_PRE + zh: '[PRE66]' +- en: As we can see, the token ID tensor is 8x4-dimensional, meaning that the data + batch consists of 8 text samples with 4 tokens each. 
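- en: 'The shapes described above can be verified with a short sketch. Here, random token IDs stand in for a batch sampled from the data loader, and names such as `token_embedding_layer`, `pos_embeddings`, and `block_size` follow the prose of this chapter, but the snippet itself is illustrative rather than one of the listings:'
  prefs: []
  type: TYPE_NORMAL
- en: |-
    import torch

    vocab_size = 50257   # vocabulary size of the BPE tokenizer
    output_dim = 256     # embedding dimension used in this chapter
    block_size = 4       # maximum input length in this example
    batch_size = 8

    torch.manual_seed(123)
    token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
    pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)

    # Random token IDs as a stand-in for a batch sampled from the data loader
    inputs = torch.randint(0, vocab_size, (batch_size, block_size))

    token_embeddings = token_embedding_layer(inputs)                   # 8 x 4 x 256
    pos_embeddings = pos_embedding_layer(torch.arange(block_size))     # 4 x 256

    # Broadcasting adds the same 4x256 positional embeddings to every sample in the batch
    input_embeddings = token_embeddings + pos_embeddings
    print(token_embeddings.shape, pos_embeddings.shape, input_embeddings.shape)
  prefs: []
  type: TYPE_PRE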
+ id: totrans-281
  prefs: []
  type: TYPE_NORMAL
  zh: 如我们所见,令牌ID张量是8x4维的,这意味着数据批次由8个文本样本组成,每个样本有4个令牌。
- en: 'Let''s now use the embedding layer to embed these token IDs into 256-dimensional vectors:'
  id: totrans-282
  prefs: []
  type: TYPE_NORMAL
  zh: 现在让我们使用嵌入层将这些令牌ID嵌入到256维的向量中:
- en: '[PRE67]'
  id: totrans-283
  prefs: []
  type: TYPE_PRE
  zh: '[PRE67]'
- en: 'The preceding print function call returns the following:'
  id: totrans-284
  prefs: []
  type: TYPE_NORMAL
  zh: 前面的打印函数调用返回以下内容:
- en: '[PRE68]'
  id: totrans-285
  prefs: []
  type: TYPE_PRE
  zh: '[PRE68]'
- en: As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector.
  id: totrans-286
  prefs: []
  type: TYPE_NORMAL
  zh: 根据8x4x256维张量的输出,我们可以看出,现在每个令牌ID都嵌入为一个256维的向量。
- en: 'For a GPT model''s absolute embedding approach, we just need to create another embedding layer that has the same dimension as the `token_embedding_layer`:'
  id: totrans-287
  prefs: []
  type: TYPE_NORMAL
  zh: 对于GPT模型的绝对嵌入方法,我们只需要创建另一个具有与`token_embedding_layer`相同维度的嵌入层:
- en: '[PRE69]'
  id: totrans-288
  prefs: []
  type: TYPE_PRE
  zh: '[PRE69]'
- en: As shown in the preceding code example, the input used to create the `pos_embeddings` is usually a placeholder vector `torch.arange(block_size)`, which contains a sequence of numbers 0, 1, ..., up to the maximum input length - 1. The `block_size` is a variable that represents the supported input size of the LLM. Here, we choose it similar to the maximum length of the input text. In practice, input text can be longer than the supported block size, in which case we have to truncate the text. The text can also be shorter than the block size, in which case we fill in the remaining input with placeholder tokens to match the block size, as we will see in chapter 3.
  id: totrans-289
  prefs: []
  type: TYPE_NORMAL
  zh: 如前面的代码示例所示,pos_embeddings的输入通常是一个占位符向量`torch.arange(block_size)`,其中包含一个数字序列0、1、…、直到最大输入长度减1。`block_size`是代表LLM的支持输入尺寸的变量。在这里,我们选择它类似于输入文本的最大长度。在实践中,输入文本可能比支持的块大小更长,在这种情况下,我们必须截断文本。文本还可以比块大小短,在这种情况下,我们填充剩余的输入以匹配块大小的占位符令牌,正如我们将在第3章中看到的。
- en: 'The output of the print statement is as follows:'
  id: totrans-290
  prefs: []
  type: TYPE_NORMAL
  zh: 打印语句的输出如下所示:
- en: '[PRE70]'
  id: totrans-291
  prefs: []
  type: TYPE_PRE
  zh: '[PRE70]'
- en: 'As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will add the 4x256-dimensional `pos_embeddings` tensor to each 4x256-dimensional token embedding tensor in each of the 8 batches:'
  id: totrans-292
  prefs: []
  type: TYPE_NORMAL
  zh: 如我们所见,位置嵌入张量由四个256维向量组成。我们现在可以直接将它们添加到令牌嵌入中,PyTorch将会将4x256维的`pos_embeddings`张量添加到8个批次中每个4x256维的令牌嵌入张量中:
- en: '[PRE71]'
  id: totrans-293
  prefs: []
  type: TYPE_PRE
  zh: '[PRE71]'
- en: 'The print output is as follows:'
  id: totrans-294
  prefs: []
  type: TYPE_NORMAL
  zh: 打印输出如下:
- en: '[PRE72]'
  id: totrans-295
  prefs: []
  type: TYPE_PRE
  zh: '[PRE72]'
- en: The `input_embeddings` we created, as summarized in figure 2.19, are the embedded input examples that can now be processed by the main LLM modules, which we will begin implementing in chapter 3.
  id: totrans-296
  prefs: []
  type: TYPE_NORMAL
  zh: 我们创建的`input_embeddings`,如图2.19所总结的,是嵌入的输入示例,现在可以被主LLM模块处理,我们将在第3章中开始实施它
- en: Figure 2.19 As part of the input processing pipeline, input text is first broken up into individual tokens.
These tokens are then converted into token IDs using + a vocabulary. The token IDs are converted into embedding vectors to which positional + embeddings of a similar size are added, resulting in input embeddings that are + used as input for the main LLM layers. + id: totrans-297 + prefs: + - PREF_H5 + type: TYPE_NORMAL + zh: 图2.19 作为输入处理流程的一部分,输入文本首先被分解为单独的标记。然后这些标记使用词汇表转换为标记ID。标记ID转换为嵌入向量,与类似大小的位置嵌入相加,产生用作主LLM层输入的输入嵌入。 +- en: '![](images/ch-02__image038.png)' + id: totrans-298 + prefs: [] + type: TYPE_IMG + zh: '![](images/ch-02__image038.png)' +- en: 2.9 Summary + id: totrans-299 + prefs: + - PREF_H2 + type: TYPE_NORMAL + zh: 2.9 总结 +- en: LLMs require textual data to be converted into numerical vectors, known as embeddings + since they can't process raw text. Embeddings transform discrete data (like words + or images) into continuous vector spaces, making them compatible with neural network + operations. + id: totrans-300 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 由于LLM不能处理原始文本,所以需要将文本数据转换为数字向量,这些向量被称为嵌入。嵌入将离散数据(如文字或图像)转换为连续的向量空间,使其与神经网络操作兼容。 +- en: As the first step, raw text is broken into tokens, which can be words or characters. + Then, the tokens are converted into integer representations, termed token IDs. + id: totrans-301 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 作为第一步,原始文本被分解为标记,这些标记可以是单词或字符。然后,这些标记被转换为整数表示,称为标记ID。 +- en: Special tokens, such as `<|unk|>` and `<|endoftext|>`, can be added to enhance + the model's understanding and handle various contexts, such as unknown words or + marking the boundary between unrelated texts. + id: totrans-302 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 特殊标记,比如`<|unk|>`和`<|endoftext|>`,可以增强模型的理解并处理各种上下文,比如未知单词或标记无关文本的边界。 +- en: The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can + efficiently handle unknown words by breaking them down into subword units or individual + characters. + id: totrans-303 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 用于像GPT-2和GPT-3这样的LLM的字节对编码(BPE)分词器可以通过将未知单词分解为子词单元或单个字符来高效地处理未知单词。 +- en: We use a sliding window approach on tokenized data to generate input-target + pairs for LLM training. + id: totrans-304 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: 我们在标记化数据上使用滑动窗口方法生成用于LLM训练的输入-目标对。 +- en: Embedding layers in PyTorch function as a lookup operation, retrieving vectors + corresponding to token IDs. The resulting embedding vectors provide continuous + representations of tokens, which is crucial for training deep learning models + like LLMs. + id: totrans-305 + prefs: + - PREF_UL + type: TYPE_NORMAL + zh: PyTorch中的嵌入层作为查找操作,检索与标记ID相对应的向量。结果嵌入向量提供了标记的连续表示,这对于训练像LLM这样的深度学习模型至关重要。 +- en: 'While token embeddings provide consistent vector representations for each token, + they lack a sense of the token''s position in a sequence. To rectify this, two + main types of positional embeddings exist: absolute and relative. OpenAI''s GPT + models utilize absolute positional embeddings that are added to the token embedding + vectors and are optimized during the model training.' 
+ id: totrans-306
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: 虽然标记嵌入为每个标记提供了一致的向量表示,但它缺乏对标记在序列中位置的感知。为了纠正这一点,存在两种主要类型的位置嵌入:绝对和相对。OpenAI的GPT模型利用绝对位置嵌入,这些嵌入被加到标记嵌入向量中,并在模型训练过程中进行优化。
- en: 2.10 References and further reading
  id: totrans-307
  prefs:
  - PREF_H2
  type: TYPE_NORMAL
  zh: 2.10 参考资料和进一步阅读
- en: 'Readers who are interested in a discussion and comparison of embedding spaces with latent spaces and the general notion of vector representations can find more information in the first chapter of my book Machine Learning Q and AI:'
  id: totrans-308
  prefs: []
  type: TYPE_NORMAL
  zh: 对嵌入空间和潜空间以及向量表达的一般概念感兴趣的读者,可以在我写的书《机器学习 Q 和 AI》的第一章中找到更多信息:
- en: '*Machine Learning Q and AI* (2023) by Sebastian Raschka, [https://leanpub.com/machine-learning-q-and-ai](machine-learning-q-and-ai.html)'
  id: totrans-309
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '*机器学习 Q 和 AI* (2023) 由Sebastian Raschka著作,[https://leanpub.com/machine-learning-q-and-ai](machine-learning-q-and-ai.html)'
- en: 'The following paper provides a more in-depth discussion of how byte pair encoding is used as a tokenization method:'
  id: totrans-310
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: 以下论文更深入地讨论了字节对编码作为分词方法的使用:
- en: Neural Machine Translation of Rare Words with Subword Units (2015) by Sennrich et al., [https://arxiv.org/abs/1508.07909](abs.html)
  id: totrans-311
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: 《稀有词的子词单元神经机器翻译》(2015) 由Sennrich等人编写,[https://arxiv.org/abs/1508.07909](abs.html)
- en: 'The code for the byte pair encoding tokenizer used to train GPT-2 was open-sourced by OpenAI:'
  id: totrans-312
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: 用于训练GPT-2的字节对编码分词器的代码已被OpenAI开源:
- en: '[https://github.com/openai/gpt-2/blob/master/src/encoder.py](src.html)'
  id: totrans-313
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[https://github.com/openai/gpt-2/blob/master/src/encoder.py](src.html)'
- en: 'OpenAI provides an interactive web UI to illustrate how the byte pair tokenizer in GPT models works:'
  id: totrans-314
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: OpenAI 提供了一个交互式 Web UI,以说明 GPT 模型中的字节对分词器的工作原理:
- en: '[https://platform.openai.com/tokenizer](platform.openai.com.html)'
  id: totrans-315
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[https://platform.openai.com/tokenizer](platform.openai.com.html)'
- en: 'Readers who are interested in studying alternative tokenization schemes that are used by some other popular LLMs can find more information in the SentencePiece and WordPiece papers:'
  id: totrans-316
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: 对于对研究其他流行 LLMs 使用的替代分词方案感兴趣的读者,可以在 SentencePiece 和 WordPiece 论文中找到更多信息:
- en: 'SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing (2018) by Kudo and Richardson, [https://aclanthology.org/D18-2012/](D18-2012.html)'
  id: totrans-317
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: SentencePiece:一种简单且语言无关的子词分词器和去分词器,用于神经文本处理(2018),作者 Kudo 和 Richardson,[https://aclanthology.org/D18-2012/](D18-2012.html)
- en: Fast WordPiece Tokenization (2020) by Song et al., [https://arxiv.org/abs/2012.15524](abs.html)
  id: totrans-318
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: 快速 WordPiece 分词(2020),作者 Song 等人,[https://arxiv.org/abs/2012.15524](abs.html)
- en: 2.11 Exercise answers
  id: totrans-319
  prefs:
  - PREF_H2
  type: TYPE_NORMAL
  zh: 2.11 练习答案
- en: The complete code examples for the exercise answers can be found in the supplementary GitHub repository at
[https://github.com/rasbt/LLMs-from-scratch](rasbt.html) + id: totrans-320 + prefs: [] + type: TYPE_NORMAL + zh: 练习答案的完整代码示例可以在补充的 GitHub 仓库中找到:[https://github.com/rasbt/LLMs-from-scratch](rasbt.html) +- en: Exercise 2.1 + id: totrans-321 + prefs: + - PREF_H4 + type: TYPE_NORMAL + zh: 练习 2.1 +- en: 'You can obtain the individual token IDs by prompting the encoder with one string + at a time:' + id: totrans-322 + prefs: [] + type: TYPE_NORMAL + zh: 您可以通过一个字符串逐个提示编码器来获得单个标记 ID: +- en: '[PRE73]' + id: totrans-323 + prefs: [] + type: TYPE_PRE + zh: '[PRE73]' +- en: 'This prints:' + id: totrans-324 + prefs: [] + type: TYPE_NORMAL + zh: 这将打印: +- en: '[PRE74]' + id: totrans-325 + prefs: [] + type: TYPE_PRE + zh: '[PRE74]' +- en: 'You can then use the following code to assemble the original string:' + id: totrans-326 + prefs: [] + type: TYPE_NORMAL + zh: 然后,您可以使用以下代码来组装原始字符串: +- en: '[PRE75]' + id: totrans-327 + prefs: [] + type: TYPE_PRE + zh: '[PRE75]' +- en: 'This returns:' + id: totrans-328 + prefs: [] + type: TYPE_NORMAL + zh: 这将返回: +- en: '[PRE76]' + id: totrans-329 + prefs: [] + type: TYPE_PRE + zh: '[PRE76]'
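- en: 'To reproduce the mechanics of exercise 2.1 without the original listings, the following sketch uses an arbitrary made-up word (not the string from the exercise) to show how the BPE tokenizer can be prompted with one piece of a string at a time and how decoding the full ID list reassembles the original string:'
  prefs: []
  type: TYPE_NORMAL
- en: |-
    import tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")

    # An arbitrary made-up word; the exercise uses a different string of its own.
    word = "Zyqplorp"
    ids = tokenizer.encode(word)
    print(ids)                          # several sub-word token IDs

    # Prompting the encoder with one piece at a time shows how each piece maps to IDs
    for piece in ["Zy", "qp", "lorp"]:
        print(piece, tokenizer.encode(piece))

    # Decoding the full ID list assembles the original string again
    print(tokenizer.decode(ids))
  prefs: []
  type: TYPE_PRE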