From 6f7fc9bfea1863e5a6e62b98304e51c98945f7f2 Mon Sep 17 00:00:00 2001 From: wizardforcel <562826179@qq.com> Date: Thu, 8 Feb 2024 18:09:18 +0800 Subject: [PATCH] 2024-02-08 18:09:16 --- totrans/fund-dl_01.yaml | 9 + totrans/fund-dl_02.yaml | 621 +++++++++++++++++++++++++++++++++++++++- 2 files changed, 626 insertions(+), 4 deletions(-) diff --git a/totrans/fund-dl_01.yaml b/totrans/fund-dl_01.yaml index 4397233..295e259 100644 --- a/totrans/fund-dl_01.yaml +++ b/totrans/fund-dl_01.yaml @@ -610,6 +610,7 @@ id: totrans-67 prefs: [] type: TYPE_NORMAL + zh: 通常,存在矩阵*A*,其中某些列是其他列的线性组合。例如,想象一下,如果在我们的蜥蜴数据集中有一个额外的特征,表示每只蜥蜴的体重,但单位是磅。这是数据中的一个明显冗余,因为这个特征完全由千克体重的特征决定。换句话说,新特征是数据中其他特征的线性组合——只需取千克体重的列,乘以2.2,然后将其与所有其他列乘以零相加,即可得到磅体重的列。从逻辑上讲,如果我们从*A*中去除这些冗余,那么*C(A)*不应该改变。一种方法是首先创建一个包含所有原始列向量的列表*A*,其中顺序是任意指定的。在遍历列表时,检查当前向量是否是所有在它之前的向量的线性组合。如果是,从列表中移除这个向量并继续。很明显,移除的向量没有提供除我们已经看到的信息之外的额外信息。 - en: The resulting list is called the *basis* of *C(A),* and the length of the basis is the *dimension* of *C(A).* We say that the basis of any vector space *spans* the space, which means that all of the elements in the vector space can be formulated @@ -627,6 +628,8 @@ id: totrans-68 prefs: [] type: TYPE_NORMAL + zh: 结果列表称为*C(A)*的*基础*,基础的长度是*C(A)*的*维度*。我们说任何向量空间的基础*跨越*该空间,这意味着向量空间中的所有元素都可以被基础向量的线性组合表示。此外,基础向量是*线性独立*的,这意味着没有一个向量可以被其他向量的线性组合表示,即没有冗余。回到我们定义向量空间的例子,(0,0,1),(0,1,0),(1,0,0)将是空间 3的基础,因为列表中的任何向量都不是其他向量的线性组合,而且这个列表跨越整个空间。相反,列表(0,0,1),(0,1,0),(1,0,0),(2,5,1)跨越整个空间,但不是线性独立的,因为(2,5,1)可以被前三个向量的线性组合表示(我们称这样的向量列表为*跨度列表*,当然,向量空间的基础集合是相同空间的跨度列表集合的子集)。 - en: As we alluded to in the discussion of our lizard dataset, the basis of the column space, given each lizard feature is a column, is a concise representation of the information represented in the feature matrix. In the real world, where we often @@ -639,11 +642,13 @@ id: totrans-69 prefs: [] type: TYPE_NORMAL + zh: 正如我们在讨论蜥蜴数据集时所提到的,给定每个蜥蜴特征作为一列的列空间的基础是特征矩阵中所代表信息的简洁表示。在现实世界中,我们通常有成千上万个特征(例如图像中的每个像素),实现对数据的简洁表示是非常可取的。尽管这是一个良好的开始,但通常仅仅识别数据中的明显冗余是不够的,因为存在于现实世界中的随机性和复杂性往往会掩盖这些冗余。量化特征之间的关系可以为简洁的数据表示提供信息,正如我们在本章末尾和第9章中讨论的那样。 - en: The Null Space id: totrans-70 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 零空间 - en: Another key vector space is the *null space *of a matrix *A,* or *N(A)**. *This space consists of the vectors *v *such that *Av = 0.* We know that *v = 0*, the trivial solution, will always satisfy this property. If only the trivial solution @@ -654,16 +659,20 @@ id: totrans-71 prefs: [] type: TYPE_NORMAL + zh: 另一个关键的向量空间是矩阵*A*的*零空间*,或*N(A)*。这个空间包括向量*v*,使得*Av = 0*。我们知道*v = 0*,平凡解,总是满足这个性质。如果矩阵的零空间中只有平凡解,我们称之为空间平凡。然而,根据*A*的性质,或者非平凡的零空间,可能存在其他解决方案。为了使向量*v*满足*Av + = 0*,*v*必须与*A*的每一行正交,如[图1-11](#the_implication_that_the_dot)所示。 - en: '![](Images/fdl2_0111.png)' id: totrans-72 prefs: [] type: TYPE_IMG + zh: '![](Images/fdl2_0111.png)' - en: Figure 1-11\. The implication that the dot product between each row and the vector v must be equal to 0 id: totrans-73 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图1-11. 每行与向量v之间的点积必须等于0的含义 - en: Let’s assume *A* is of dimension 2 by 3, for example. In our case, *A*’s rows cannot span  3  due to *A *having only two rows (remember from our recent discussion that all diff --git a/totrans/fund-dl_02.yaml b/totrans/fund-dl_02.yaml index a47cde5..ebd9316 100644 --- a/totrans/fund-dl_02.yaml +++ b/totrans/fund-dl_02.yaml @@ -1,7 +1,9 @@ - en: Chapter 2\. Fundamentals of Probability + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 第2章。概率基础 - en: Probability is a field of mathematics that quantifies our uncertainty regarding events. 
For example, when rolling dice or flipping a coin, barring any irregularities in the dice or coin themselves, we are uncertain about the result to come. However, @@ -13,12 +15,18 @@ are the sorts of probabilities we talk about with ease in our daily lives, but how can we define and utilize them effectively? In this chapter we’ll discuss the fundamentals of probability and how they connect to key concepts in deep learning. + id: totrans-1 prefs: [] type: TYPE_NORMAL + zh: 概率是一门量化我们对事件的不确定性的数学领域。例如,当掷骰子或抛硬币时,除非骰子或硬币本身存在任何不规则性,否则我们对即将发生的结果感到不确定。然而,我们可以通过概率来量化我们对每种可能结果的信念。例如,我们说每次抛硬币时硬币出现正面的概率是 + 1 2 。每次掷骰子时,我们说骰子朝上的概率是 + 1 6 。这些是我们在日常生活中轻松谈论的概率,但我们如何定义和有效利用它们呢?在本章中,我们将讨论概率的基础知识以及它们与深度学习中的关键概念的联系。 - en: Events and Probability + id: totrans-2 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 事件和概率 - en: When running a trial such as rolling a dice or tossing a coin, we intuitively assign some belief to the trial’s possible outcomes. In this section, we aim to formalize some of these concepts. In particular, we will begin by working in this @@ -33,8 +41,10 @@ previously. A set of probabilities that sum to one over all outcomes in the sample space is termed a *probability distribution* over that sample space, and these distributions will be the main focus of our discussion. + id: totrans-3 prefs: [] type: TYPE_NORMAL + zh: 当进行像掷骰子或抛硬币这样的试验时,我们直观地对试验的可能结果赋予一些信念。在本节中,我们旨在形式化其中一些概念。特别是,我们将从在这个*离散*空间中工作开始,其中离散表示有限或可数无限的可能性。掷骰子和抛硬币都在离散空间中——掷一个公平的骰子有六种可能结果,抛一个公平的硬币有两种可能。我们将实验的整个可能性集合称为*样本空间*。例如,从一到六的数字将构成掷一个公平骰子的样本空间。我们可以将*事件*定义为样本空间的子集。至少掷出三的事件对应于之前定义的样本空间中三、四、五和六中的任何数字朝上的骰子。一组在样本空间中所有结果上总和为一的概率被称为该样本空间上的*概率分布*,这些分布将是我们讨论的主要焦点。 - en: In general, we won’t worry too much about where exactly these probabilities come from, as that requires a much more rigorous and thorough examination beyond the scope of this text. However, we will give some intuition about the different @@ -47,8 +57,12 @@ fraction. As the number of rolls in the experiment grows, we see that this estimate gets closer and closer to the limit 1 6 , the outcome’s probability. + id: totrans-4 prefs: [] type: TYPE_NORMAL + zh: 一般来说,我们不会过多担心这些概率的确切来源,因为这需要进行更严格和彻底的检查,超出了本文的范围。然而,我们将对不同的解释提供一些直觉。在高层次上,*频率主义*观点认为结果的概率来自于长期实验中的频率。在公平骰子的情况下,这种观点声称我们可以说在给定的投掷中骰子的任何一面出现的概率是 + 1 6 ,因为进行大量投掷并计算每一面出现的次数将给我们一个大致为这个分数的估计。随着实验中投掷次数的增加,我们看到这个估计越来越接近极限 + 1 6 ,结果的概率。 - en: On the other hand, the *Bayesian* view of probability is based more on quantifying our prior belief in hypotheses and how we update our beliefs in light of new data. For a fair dice, the Bayesian view would claim there is no prior information, @@ -65,12 +79,14 @@ the prior associated with each weight accordingly to better fit the data we see. At the end of the training, we are left with a posterior distribution associated with each weight. + id: totrans-5 prefs: [] type: TYPE_NORMAL - en: 'We will assume throughout this chapter that the probabilities associated with any outcome have been determined via reasonable methods, and focus on how we can manipulate these probabilities for use in our analyses. We start with the four tenets of probability, specifically in the discrete space:' + id: totrans-6 prefs: [] type: TYPE_NORMAL - en: The sum of probabilities for all possible outcomes in a sample space must be @@ -84,6 +100,7 @@ o P ( o ) = 1 , where o represents an outcome. + id: totrans-7 prefs: - PREF_OL type: TYPE_NORMAL @@ -102,15 +119,18 @@ tenet. 
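    As a brief aside, the frequentist intuition and the first two tenets are easy to check numerically. The sketch below is a minimal illustration (not from the text); it assumes NumPy and a fair six-sided die, estimates each face's probability from long-run frequencies, and confirms that the estimates sum to one and that an event plus its complement accounts for all of the probability.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # 100,000 rolls of a fair six-sided die

# Frequentist estimate: the long-run frequency of each face approaches 1/6.
freqs = np.array([(rolls == face).mean() for face in range(1, 7)])
print(freqs)           # each entry is close to 1/6, about 0.167
print(freqs.sum())     # first tenet: the estimates sum to one

# Second tenet: an event and its complement together cover the sample space.
at_least_three = (rolls >= 3).mean()   # event: roll at least a three
complement = (rolls < 3).mean()        # complement: roll a one or a two
print(at_least_three + complement)     # exactly 1.0
```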
In [Figure 2-1](#we_see_here_how_the_event_a), we see an example of this, where *S* represents the entire space of outcomes, and the event and its complement together form the entirety of *S*. + id: totrans-8 prefs: - PREF_OL type: TYPE_NORMAL - en: '![](Images/fdl2_0201.png)' + id: totrans-9 prefs: - PREF_IND type: TYPE_IMG - en: Figure 2-1\. Event A and its complement interact to form the entire set of possibilities, S. The complement simply defines all the possibilities not originally in A. + id: totrans-10 prefs: - PREF_IND - PREF_H6 @@ -128,9 +148,18 @@ the first event has since the second is a superset of the first. If this tenet were not true, that would imply the existence of outcomes with negative probability, which is impossible from our definitions. + id: totrans-11 prefs: - PREF_OL type: TYPE_NORMAL + zh: 设 E 1E 2 是两个事件,其中 E 1E + 2 的子集(不一定是严格的)。第三个原则是 P ( E 1 + ) P ( E 2 + ) 。再次强调,这并不会太令人惊讶——第二个事件至少有第一个事件的那么多结果,而且第二个事件是第一个事件的超集,包含了第一个事件的所有结果。如果这个原则不成立,那就意味着存在具有负概率的结果,这在我们的定义中是不可能的。 - en: The fourth and last tenet of probability is the principle of inclusion and exclusion, which states that P ( A + B ) = P ( A ) + + P ( B ) - P ( + A B ) 。对于不熟悉这个术语的人来说, 表示两个事件的*并集*,这是一个集合操作,将两个事件返回一个包含来自两个原始集合的所有元素的事件。而 + ,或*交集*,是一个集合操作,返回一个包含属于两个原始集合的所有元素的事件。所述等式背后的思想是,通过简单地对*A*和*B*的概率求和,我们会重复计算属于两个集合的元素。因此,为了准确地获得并集的概率,我们必须减去交集的概率。在[图2-2](#the_middle_sliver_labeled)中,我们展示了两个事件及其交集在物理上的样子,而并集则是两个事件的组合区域中的所有结果。 - en: '![](Images/fdl2_0202.png)' + id: totrans-13 prefs: - PREF_IND type: TYPE_IMG + zh: '![](Images/fdl2_0202.png)' - en: Figure 2-2\. The middle sliver is the overlap between the two sets, containing all the outcomes that are in both sets. The union is all the events in the combined area of the two circles; if we were to add their probabilities naively, we would double-count all the outcomes in the middle sliver. + id: totrans-14 prefs: - PREF_IND - PREF_H6 type: TYPE_NORMAL + zh: 图2-2。中间的薄片是两个集合之间的重叠部分,包含了同时在两个集合中的所有结果。并集是两个圆圈组合区域中的所有事件;如果我们简单地将它们的概率相加,我们将重复计算中间薄片中的所有结果。 - en: 'These tenets of probability find their way into everything that has to do with the field. For example, in deep learning, most of our problems fall into one of two categories: *regression* and *classification*. In the latter, we train a neural @@ -181,12 +224,16 @@ back-of-the-envelope check to ensure the model isn’t buggy. In the next section, we cover probabilities where we are initially given relevant information that affects our beliefs and how to use that information.' + id: totrans-15 prefs: [] type: TYPE_NORMAL + zh: 这些概率原则渗透到与该领域有关的一切事物中。例如,在深度学习中,我们的大多数问题可以归为两类:*回归*和*分类*。在分类问题中,我们训练一个神经模型,可以预测输入属于一组离散类别中的哪一个的可能性。例如,著名的MNIST数字数据集为我们提供了0到9范围内的数字图片和相关的数字标签。我们的目标是构建一个*分类器*,可以接收这张图片并返回最有可能的标签作为猜测。这自然地被制定为一个概率问题——分类器产生一个关于样本空间0到9的概率分布,对于任何给定的输入,它的最佳猜测是被分配了最高概率的数字。这与我们的原则有什么关系?由于分类器产生一个概率分布,它必须遵循这些原则。例如,与每个数字相关的概率必须相加为一——这是一个快速的粗略检查,以确保模型没有错误。在下一节中,我们将涵盖最初给定相关信息影响我们信念的概率以及如何使用该信息。 - en: Conditional Probability + id: totrans-16 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 条件概率 - en: Knowing information often changes our beliefs, and by consequence, our probabilities. Going back to our classic dice example, we may roll the dice thinking that it’s fair, while in reality there’s a hidden weight at the dice’s core, making it more @@ -204,8 +251,10 @@ i o n ) instead. 
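    To make the idea of "moving to a new universe" concrete, here is a small simulation sketch. It is not from the text: the 50/50 prior and the specific bias are invented purely for illustration. It estimates the probability the dice is biased given that the first three rolls all exceeded three, simply by restricting attention to the simulated trials in which that information actually occurred.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 200_000

# Hypothetical setup: half the dice are fair, half are weighted so that
# faces 4-6 come up 80% of the time (these numbers are invented for illustration).
is_biased = rng.random(n_trials) < 0.5
p_high = np.where(is_biased, 0.8, 0.5)        # P(roll > 3) for each trial's die

# Observe three rolls per trial and record whether every roll exceeded three.
rolls_high = rng.random((n_trials, 3)) < p_high[:, None]
info = rolls_high.all(axis=1)                  # the "information we've seen"

# Conditioning = restricting to the universe where the information occurred.
print(is_biased.mean())        # prior P(biased), approximately 0.5
print(is_biased[info].mean())  # P(biased | all three rolls > 3), noticeably higher than the prior
```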
This quantity, which we term a *conditional probability,* is spoken as “the probability the dice is biased *given* the information we’ve seen.” + id: totrans-17 prefs: [] type: TYPE_NORMAL + zh: 了解信息通常会改变我们的信念,从而改变我们的概率。回到我们经典的骰子示例,我们可能认为掷骰子是公平的,而实际上骰子的核心有一个隐藏的重量,使得它更有可能掷出大于三的数字。当我们掷骰子时,当然会开始注意到这种模式,我们对骰子公平性的信念开始转变。这正是条件概率的核心。我们不再简单地考虑*P(偏向)*或*P(公平)*,而是要考虑像*P(偏向|信息)*这样的概率。这个量,我们称之为*条件概率*,可以理解为“在我们看到的信息的情况下,骰子偏向的概率”。 - en: How do we think about such probabilities intuitively? For starters, we must imagine that we are now in a different universe than the one we started in. The new universe is one that incorporates the information we’ve seen since the start @@ -220,8 +269,10 @@ think about in terms of prior belief. Without any knowledge of the input pixel configuration, we’d have no reason to believe that the possibility of returning a zero is any more or less likely than that of any other digit. + id: totrans-18 prefs: [] type: TYPE_NORMAL + zh: 我们如何直观地思考这些概率?首先,我们必须想象我们现在处于一个不同的宇宙中,而不是我们开始时的那个宇宙。新的宇宙是一个包含了我们自实验开始以来看到的信息的宇宙,例如我们过去的骰子点数。回到我们的MNIST示例,训练好的神经网络产生的概率分布实际上是一个条件概率分布。例如,输入图像为零的概率可以看作是*P(0|input)*。简单来说,我们想找到的是在我们馈送给神经网络的特定输入图像中组成的所有像素的情况下零的概率。我们的新宇宙是输入像素已经具有这种特定值配置的宇宙。这与简单地看*P(0)*,即返回零的概率是不同的,我们可以从先验信念的角度来思考。如果没有任何关于输入像素配置的知识,我们没有理由相信返回零的可能性比其他数字更有可能或更不可能。 - en: 'Sometimes, seeing certain information does not change our probabilities—we call this property *independence.* For example, Tom Brady may have thrown a touchdown pass after the third roll of our experiment, but incorporating that information @@ -245,8 +296,18 @@ E 2 | E 1 ) = P ( E 2 ) .' + id: totrans-19 prefs: [] type: TYPE_NORMAL + zh: 有时,看到某些信息并不会改变我们的概率——我们称之为*独立性*。例如,汤姆·布雷迪可能在我们的实验第三次掷骰子后投出了一个触摸得分,但将这些信息纳入我们的新宇宙中应该(希望如此!)不会对骰子有偏倚的可能性产生影响。我们将这种独立性属性表述为*P(有偏|汤姆·布雷迪投出触摸得分) + = P(有偏)*。请注意,任何满足这一属性的两个事件E 1E 2都是独立的。也许稍微有些违反直觉的是,如果到目前为止我们所有的掷骰子结果在数值上并没有改变我们对骰子公平性的先验信念(也许到目前为止的掷骰子结果在一到六之间均匀出现,而我们最初的先验信念是骰子是公平的),我们仍然会说这些事件是独立的。最后,请注意独立性是对称的:如果P ( E + 1 | E 2 ) + = P ( E 1 ),那么也有P ( E + 2 | E 1 ) + = P ( E 2 )。 - en: 'In the previous section, we introduced intersection and union notation. It turns out that we can break down the intersection operation into a product of probabilities. We have the following equality: P ( E + 1 E 2 ) + = P ( E 1 | + E 2 ) * P ( + E 2 )。让我们解释一下这里的直觉。在左边,我们有两个事件E 1E 2同时发生的概率。在右边,我们有相同的想法,但表达略有不同。在这两个事件都发生的宇宙中,到达这个宇宙的一种方式是首先发生E 2,然后是E 1。将这种直觉转化为数学术语,我们必须首先找到E 2发生的概率,然后是在E 2已经发生的宇宙中E + 1发生的概率。我们如何结合这两个概率?直觉上,将它们相乘是有意义的——我们必须让两个事件都发生,第一个是无条件的,第二个是在第一个已经发生的宇宙中。请注意,这些事件的顺序并不重要,因为这两条路径都将我们带到同一个宇宙。因此,更完整地说,P ( E + 1 E 2 ) + = P ( E 1 | + E 2 ) * P ( + E 2 ) = P ( + E 2 | E 1 + ) * P ( E 1 + )。 - en: However, some of these paths make much more physical sense than others. For example, if we think of *E 1* as the event where someone contracts a disease, and E 2 as the event where the patient shows symptoms of the disease, the path in which the patient contracts the disease and then shows symptoms makes much more physical sense than the reverse. + id: totrans-21 prefs: [] type: TYPE_NORMAL + zh: 然而,其中一些路径比其他路径更有物理意义。例如,如果我们将*E 1*看作某人感染疾病的事件,将E 2看作患者出现疾病症状的事件,那么患者先感染疾病然后出现症状的路径比反过来的路径更有物理意义。 - en: In the case where the two events are independent, we have that P ( E + 1 E 2 ) + = P ( E 1 | + E 2 ) * P ( + E 2 ) = P ( + E 1 ) * P ( + E 2 ) . + 希望这些能有一些直观的理解。在独立的情况下,事件E 2 + 的发生不会影响事件E 1 发生的概率;即,将这些信息纳入新的宇宙中不会影响下一个事件的概率。在接下来的部分中,我们将讨论随机变量,它们是事件的相关总结,也有自己的概率分布。 - en: Random Variables + id: totrans-23 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 随机变量 - en: Once again, let’s consider the coin flipping experiment. 
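    Before doing so, here is a brief numerical check of the product rule and of independence from the discussion above. This is a toy sketch, assuming NumPy and two arbitrarily chosen dice events, not an example from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
d1 = rng.integers(1, 7, size=500_000)   # first die
d2 = rng.integers(1, 7, size=500_000)   # second die

A = d1 > 3                  # event on the first die
B = d2 > 3                  # event on the second die (physically independent of A)
C = (d1 + d2) > 7           # event that depends on both dice

# Product rule: P(A and C) = P(C | A) * P(A); conditioning restricts to rolls where A occurred.
print((A & C).mean())              # P(A intersect C)
print(C[A].mean() * A.mean())      # P(C | A) * P(A), identical to the line above

# Independence: for A and B the conditioning adds nothing, so the product rule
# collapses to P(A intersect B) = P(A) * P(B), up to sampling noise.
print((A & B).mean())
print(A.mean() * B.mean())
```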
If we flip a coin some finite number of times, natural questions start to arise. How many heads did we encounter during our experiment? How many tails? How many tails until the first head? Every outcome in such an experiment has an answer to each of the listed questions. If we flip a coin say, five times, and we receive the sequence TTHHT, we have seen two heads, three tails, and two tails until the first head. + id: totrans-24 prefs: [] type: TYPE_NORMAL + zh: 再次,让我们考虑抛硬币的实验。如果我们抛硬币有限次数,自然会产生一些问题。在我们的实验中遇到了多少次正面?多少次反面?第一个正面前有多少次反面?在这样一个实验中,每个结果都有对应的答案。比如,如果我们抛硬币5次,得到了序列TTHHT,我们看到了两次正面,三次反面,以及第一个正面前有两次反面。 - en: We can think of a *random variable* as a map, or a function, from the sample space to another space, such as the integers in [Figure 2-3](#the_random_variables_x_y_andz). Such a function would take as input the sequence TTHHT and output one of the three @@ -336,32 +446,45 @@ with their output space. This is due to the inherent randomness in the experiment—depending on the probability of the input outcome, its corresponding output may be more or less likely than other outputs. + id: totrans-25 prefs: [] type: TYPE_NORMAL + zh: 我们可以将*随机变量*看作是一个从样本空间到另一个空间的映射或函数,比如[图2-3](#the_random_variables_x_y_andz)中的整数。这样一个函数将以TTHHT作为输入,并根据我们提出的问题输出三个答案中的一个。随机变量取得的值将是与实验结果相关联的输出。虽然随机变量是确定性的,因为它们将给定的输入映射到单个输出,但它们在输出空间中也有与之相关的分布。这是由于实验中固有的随机性——根据输入结果的概率,其相应的输出可能比其他输出更有可能。 - en: '![](Images/fdl2_0203.png)' + id: totrans-26 prefs: [] type: TYPE_IMG + zh: '![](Images/fdl2_0203.png)' - en: Figure 2-3\. Random variables X, Y, and Z all act on the same sample space, but have varying outputs. It’s important to keep in mind what you’re measuring! + id: totrans-27 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图2-3. 随机变量X、Y和Z都作用于相同的样本空间,但具有不同的输出。记住你正在测量什么是很重要的! - en: Note + id: totrans-28 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 注意 - en: Note that multiple inputs could map to the same output. For example, *X(HHH)* = 3 in addition to *X(HHTH)* in [Figure 2-3](#the_random_variables_x_y_andz). + id: totrans-29 prefs: [] type: TYPE_NORMAL + zh: 请注意,多个输入可能映射到相同的输出。例如,*X(HHH)* = 3,除了*X(HHTH)*在[图2-3](#the_random_variables_x_y_andz)中也是如此。 - en: One easy way to begin is to just think of this map as an identity function—whatever we flip or roll, its map in the output space is exactly the same as the input. Encoding a heads as a one and a tails as a zero, we can define a random variable representing the coin flip as whether the coin came up heads, i.e., *C(1) = 1,* where *C* is our random variable. In the dice scenario, the mapped output is the same as whatever we rolled, i.e., *D**(5) = 5,* where *D*is our random variable. + id: totrans-30 prefs: [] type: TYPE_NORMAL + zh: 一个简单的开始方法是将这个映射看作是一个恒等函数——无论我们抛硬币还是掷骰子,它在输出空间中的映射与输入完全相同。将正面编码为1,反面编码为0,我们可以定义一个代表硬币翻转的随机变量,即硬币是正面的情况,即*C(1) + = 1*,其中*C*是我们的随机变量。在掷骰子的情况下,映射的输出与我们掷出的数字相同,即*D(5) = 5*,其中*D*是我们的随机变量。 - en: Why should we care about random variables and their distributions? It turns out they play a vital role in deep learning and machine learning as a whole. For example, in [Chapter 4](ch04.xhtml#training_feed_forward), we will cover the concept @@ -376,8 +499,12 @@ variable *X* associated with it, with input one if the dropout layer decides to mask it and zero otherwise. 
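    As a quick illustration of random variables as maps over outcomes, and of the dropout coin-flip analogy just described, here is a minimal sketch. It assumes NumPy; the layer size and masking probability are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# A random variable is a map from outcomes to numbers. For a sequence of coin
# flips (1 = heads, 0 = tails), "number of heads" is one such map.
def num_heads(flips):
    return int(np.sum(flips))

def tails_before_first_head(flips):
    heads = np.flatnonzero(flips)
    return int(heads[0]) if heads.size else len(flips)

seq = np.array([0, 0, 1, 1, 0])   # the outcome TTHHT from the text
print(num_heads(seq), tails_before_first_head(seq))   # 2 heads, 2 tails before the first head

# Dropout analogy: each neuron's mask indicator is a coin flip with probability p.
p, n = 0.3, 10                     # arbitrary masking probability and layer size
mask = (rng.random(n) < p).astype(int)
print(mask, mask.sum())            # per-neuron indicators and the total number masked
```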
*X* is an identity function from the input space to the output space, i.e., *X(1) = 1* and *X(0) = 0\.* + id: totrans-31 prefs: [] type: TYPE_NORMAL + zh: 为什么我们要关心随机变量及其分布?事实证明,它们在深度学习和机器学习中起着至关重要的作用。例如,在[第4章](ch04.xhtml#training_feed_forward)中,我们将介绍*辍学*的概念,这是一种在神经网络中减少过拟合的技术。辍学层的想法是,在训练期间,它独立且随机地以一定概率屏蔽前一层中的每个神经元。这可以防止网络过度依赖特定连接或子网络。我们可以将前一层中的每个神经元看作代表硬币翻转类型实验。唯一的区别是,我们设置了这个实验的概率,而不是一个公平硬币具有默认概率*mfrac + 1 2*的概率。每个神经元都有与之关联的随机变量*X*,如果辍学层决定屏蔽它,则输入为1,否则为0。*X*是一个从输入空间到输出空间的恒等函数,即*X(1) + = 1*和*X(0) = 0*。 - en: Random variables, in general, need not be the identity map. Most functions you can think of are valid methods of mapping the input space to an output space where the random variable is defined. For example, if the input space were every possible @@ -397,8 +524,12 @@ coin flip. Back to the dropout example, we can think of the random variable representing the total number of masked-out neurons as the sum of binary random variables representing each neuron. + id: totrans-32 prefs: [] type: TYPE_NORMAL + zh: 随机变量,一般来说,不必是恒等映射。您可以想到的大多数函数都是将输入空间映射到定义了随机变量的输出空间的有效方法。例如,如果输入空间是每个可能长度为*n*的硬币翻转序列,函数可以是计算序列中头的数量并对其进行平方。一些随机变量甚至可以表示为其他随机变量的函数,或者是函数的函数,我们稍后会讨论。如果我们再次考虑每个可能长度为*n*的硬币翻转序列的输入空间,那么计算输入序列中头的数量的随机变量与计算每个单独硬币翻转是否为头并将所有这些值求和的随机变量是相同的。在数学术语中,我们说*X等于sigma-summation + Underscript i equals 1 Overscript n Endscripts upper C Subscript i*,其中*X*是表示头的总数的随机变量,*C + i*是与第*i*次硬币翻转相关的二进制随机变量。回到辍学的例子,我们可以将代表被屏蔽神经元总数的随机变量看作是代表每个神经元的二进制随机变量之和。 - en: In the future, when we want to refer to the event where the random variable takes on a specific value *c* (the domain being the output space we’ve been referring to, e.g., the number of heads in a sequence of coin flips), we will write this @@ -411,12 +542,17 @@ inputs. Note that *P(X)* itself is also a probability distribution that follows all the basic tenets of probability described in the first section. In the next section, we consider statistics regarding random variables. + id: totrans-33 prefs: [] type: TYPE_NORMAL + zh: 将来,当我们想要提及随机变量取特定值*c*的事件时(域是我们一直在提到的输出空间,例如硬币翻转序列中的头数),我们将简洁地写为*X = c*。我们将随机变量取特定值的概率表示为*P(X + = c)*,例如。随机变量在输出空间中取任何给定值的概率只是映射到它的输入的概率之和。这应该有一些直观的意义,因为这基本上是概率的第四原则,其中任何两个事件之间的交集是空集,因为我们从的所有事件都是独立的、不同的输入。请注意,*P(X)*本身也是一个遵循第一节描述的概率的所有基本原则的概率分布。在下一节中,我们将考虑关于随机变量的统计数据。 - en: Expectation + id: totrans-34 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 期望值 - en: As we discussed, a random variable is a map from input space to output space, where inputs are generated according to some probability distribution. The random variable can be thought of as a relevant summary of the input, and can take on @@ -426,8 +562,10 @@ see the average number of heads all the time—how much does the number of heads we see tend to vary? The first quantity is what we call the random variable’s *expectation*, and the second is the random variable’s *variance*. + id: totrans-35 prefs: [] type: TYPE_NORMAL + zh: 正如我们讨论的,随机变量是从输入空间到输出空间的映射,其中输入根据某种概率分布生成。随机变量可以被视为输入的相关摘要,并且根据我们提出的问题可以采用多种形式。有时,了解关于随机变量的统计数据是有用的。例如,如果我们抛硬币八次,我们平均期望看到多少次正面?当然,我们并不总是看到平均头数——我们看到的头数会有多大变化?第一个数量是我们称之为随机变量的*期望*,第二个是随机变量的*方差*。 - en: For a random variable *X*, we denote its expectation as 𝔼 [ X ] . We can think of this as the average value that *X* @@ -451,8 +589,22 @@ Of course, this makes no physical sense in that we could never possibly flip half of a head, but this gives you an idea of the proportions we’d expect to see over a long run experiment. 
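    A quick numerical sanity check of the definition above may help; this is a minimal sketch assuming NumPy, not an example from the text. The probability-weighted sum over outcomes matches the long-run empirical average.

```python
import numpy as np

rng = np.random.default_rng(4)

# Expectation as a probability-weighted sum over outcomes.
outcomes = np.arange(1, 7)                # faces of a fair die
probs = np.full(6, 1 / 6)
print(np.sum(outcomes * probs))           # analytic E[X] = 3.5

# The empirical average over many trials approaches the expectation.
print(rng.integers(1, 7, size=100_000).mean())   # approximately 3.5

# Same check for the number of heads in a single fair coin flip.
print(0 * 0.5 + 1 * 0.5)                   # 0.5, as in the text
print((rng.random(100_000) < 0.5).mean())  # approximately 0.5
```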
- prefs: [] - type: TYPE_NORMAL + id: totrans-36 + prefs: [] + type: TYPE_NORMAL + zh: 对于随机变量*X*,我们将其期望表示为𝔼 [ X ]。我们可以将其视为*X*取值的平均值,按照每个结果的概率加权。数学上,这被写为𝔼 [ X ] + = o o * P + ( X = o )。请注意,如果所有结果*o*是等可能的,我们得到所有结果的简单平均值。使用结果的概率作为加权是有意义的,因为某些结果比其他结果更有可能发生,我们观察到的平均值将偏向这些结果。对于一次公平抛硬币,预期的正面数量将是 o{0,1} + o * P ( o ) + = 0 * 0 . 5 + 1 + * 0 . 5 = 0 . 5。换句话说,我们预计在任何给定的公平抛硬币中看到半个正面。当然,这在物理上没有意义,因为我们永远不可能抛出半个正面,但这给了你一个关于我们在长期实验中预期看到的比例的想法。 - en: Returning to our example of length *n* sequences of coin flips, let’s try to find the expected number of heads in such a sequence. We have *n* + 1 possible number of heads, and according to our formula, we’d need to find the probability @@ -464,8 +616,15 @@ ) , where *X* is the random variable representing the total number of heads. However, as *n* gets larger and larger, performing this calculation starts to become more and more complicated. + id: totrans-37 prefs: [] type: TYPE_NORMAL + zh: 回到我们的例子,长度为*n*的硬币序列,让我们尝试找到这样一个序列中预期的正面数量。我们有*n* + 1种可能的正面数量,根据我们的公式,我们需要找到获得每个可能数量的概率作为我们的权重。数学上,我们需要计算 + x{0,...,n} + x * P ( X = x + ),其中*X*是代表正面总数的随机变量。然而,随着*n*变得越来越大,执行这个计算开始变得越来越复杂。 - en: Instead, let’s denote X i as the binary random variable for the *i*th coin flip and use the observation we made in the last section of being able to break up the total number of heads @@ -492,10 +651,12 @@ in a sequence of *n* flips is just 0.5**n*. This is much simpler than going down the previous route, as this approach’s difficulty does not scale with the number of flips. + id: totrans-38 prefs: [] type: TYPE_NORMAL - en: 'Let’s go over the simplification we made in a bit more detail. Mathematically, if we have any two independent random variables *A* and *B*:' + id: totrans-39 prefs: [] type: TYPE_NORMAL - en: 𝔼 [ A + + B ] = a,b + ( a + b ) * + P ( A = a , B + = b ) - en: = @@ -515,8 +685,16 @@ a + b ) * P ( A = a ) * P ( B = b ) + id: totrans-41 prefs: [] type: TYPE_NORMAL + zh: = + a,b ( + a + b ) * P ( + A = a ) * P ( + B = b ) - en: = a,b + a * P ( A = a + ) * P ( B = + b ) + b * P ( + A = a ) * P ( + B = b ) - en: = + a,b a + * P ( A = a ) + * P ( B = b ) + + a,b + b * P ( A = a + ) * P ( B = + b ) - en: = + a a * P ( A + = a ) b P + ( B = b ) + + b b * P ( + B = b ) a + P ( A = a ) - en: = @@ -564,15 +779,29 @@ A = a ) + b b * P ( B = b ) + id: totrans-45 prefs: [] type: TYPE_NORMAL + zh: = + a a * P ( + A = a ) + + b b * P ( B + = b ) - en: = 𝔼 [ A ] + 𝔼 [ B ] + id: totrans-46 prefs: [] type: TYPE_NORMAL + zh: = + 𝔼 [ A ] + 𝔼 [ B + ] - en: Note + id: totrans-47 prefs: - PREF_H6 type: TYPE_NORMAL @@ -582,6 +811,7 @@ doesn’t require additional assumptions, so we recommend working through the algebra on your own. Although we won’t show this for the dependent case, linearity of expectation also holds for dependent random variables. + id: totrans-48 prefs: [] type: TYPE_NORMAL - en: Going back to the dropout example, the expectation of the total number of masked @@ -590,6 +820,7 @@ of coin flips, is *p*n,* where *p* is the probability of being masked (and the expectation of each individual binary random variable representing a neuron) and *n* is the number of neurons. + id: totrans-49 prefs: [] type: TYPE_NORMAL - en: As mentioned, we don’t always see the expected number of occurrences of an event @@ -597,9 +828,11 @@ of heads in a single, fair coin flip from earlier, we never see it! Next, we will quantify the average deviation, or variance, from the expected value we see in repetitions of an experiment. 
+ id: totrans-50 prefs: [] type: TYPE_NORMAL - en: Variance + id: totrans-51 prefs: - PREF_H1 type: TYPE_NORMAL @@ -619,8 +852,19 @@ left-bracket upper X minus mu right-bracket">𝔼 [ X - μ ] instead. To obtain a slightly simpler form for the variance, we can perform the following simplification:' - prefs: [] - type: TYPE_NORMAL + id: totrans-52 + prefs: [] + type: TYPE_NORMAL + zh: 我们定义方差,或Var(*X*),为𝔼 [ + (X-μ) 2 + ],其中我们让μ = 𝔼 [ X + ]。简单来说,这个度量表示值*X*取值与其期望之间的平均平方差。请注意,(X-μ) + 2本身也是一个随机变量,因为它是一个函数的函数(*X*),而函数仍然是一个函数。虽然我们不会详细讨论为什么我们特别使用这个公式,但我们鼓励您思考为什么我们不使用𝔼 + [ X - μ ]这样的公式。为了获得方差的稍微简化形式,我们可以进行以下简化: - en: 𝔼 @@ -628,23 +872,44 @@ 2 ] = 𝔼 [ X 2 - 2 μ X + μ 2 ] + id: totrans-53 prefs: [] type: TYPE_NORMAL + zh: 𝔼 + [ (X-μ) + 2 ] = 𝔼 [ X + 2 - 2 μ X + μ + 2 ] - en: = 𝔼 [ X 2 ] - 𝔼 [ 2 μ X ] + 𝔼 [ μ 2 ] + id: totrans-54 prefs: [] type: TYPE_NORMAL + zh: = 𝔼 [ + X 2 ] - 𝔼 [ + 2 μ X ] + 𝔼 [ + μ 2 ] - en: = 𝔼 [ X 2 ] - 2 μ 𝔼 [ X ] + μ 2 + id: totrans-55 prefs: [] type: TYPE_NORMAL + zh: = + 𝔼 [ X 2 ] + - 2 μ 𝔼 [ X ] + + μ 2 - en: = @@ -652,14 +917,27 @@ - 2 𝔼 [X] 2 + 𝔼 [X] 2 + id: totrans-56 prefs: [] type: TYPE_NORMAL + zh: = + 𝔼 [ X 2 ] + - 2 𝔼 [X] + 2 + 𝔼 [X] + 2 - en: = 𝔼 [ X 2 ] - 𝔼 [X] 2 + id: totrans-57 prefs: [] type: TYPE_NORMAL + zh: = + 𝔼 [ X 2 ] + - 𝔼 [X] 2 - en: 'Let’s take a moment to go through each of these steps. In the first step, we fully express the random variable as all of its component terms via classic binomial expansion. In the second step, we perform linearity of expectation to break out @@ -676,8 +954,12 @@ us to the simplified result. Let’s use this formula to find the variance of the binary random variable representing a single neuron under dropout, and *p* is the probability of the neuron being masked out:' + id: totrans-58 prefs: [] type: TYPE_NORMAL + zh: 让我们花点时间逐步进行这些步骤。在第一步中,我们通过经典的二项式展开完全表达随机变量作为其所有组成项。在第二步中,我们执行期望的线性性,将组成项分解为它们自己的单独期望。在第三步中,我们注意到μ,或者𝔼 [ X ],及其平方都是常数,因此可以从周围的期望中提取出来。它们是常数,因为它们不是值*X*的函数,而是使用整个域(*X*可以取的值集合)进行评估。常数可以看作是只能取一个值的随机变量,即常数本身。因此,它们的期望值,或者随机变量取值的平均值,就是常数本身,因为我们总是看到这个常数。最后的步骤是代数操作,将我们带到简化的结果。让我们使用这个公式来找到表示单个神经元在辍学下的二进制随机变量的方差,*p*是神经元被屏蔽的概率: - en: 𝔼 + [ X 2 ] - + 𝔼 [X] 2 + = x0,1 + x 2 * P ( X + = x ) - ( + x0,1 + x*P(X=x)) + 2 - en: = x0,1 x 2 * P ( X = x ) - p 2 + id: totrans-60 prefs: [] type: TYPE_NORMAL + zh: = x0,1 + x 2 * P ( X + = x ) - p 2 - en: = p - p 2 + id: totrans-61 prefs: [] type: TYPE_NORMAL + zh: = p - + p 2 - en: = p ( 1 - p ) + id: totrans-62 prefs: [] type: TYPE_NORMAL + zh: = + p ( 1 - p ) - en: 'These simplifications should make sense. We know from [“Expectation”](#expectation_sect1) that the expectation of the binary random variable representing a neuron is just *p,* and the rest is algebraic simplifications. We highly encourage you to work @@ -716,8 +1025,10 @@ representing the number of masked neurons in the entire layer, we naturally ask the question of whether there exists a similar linearity property for variance as there does for expectation. 
Unfortunately, the property does not hold in general:' + id: totrans-63 prefs: [] type: TYPE_NORMAL + zh: 这些简化应该是合理的。我们从[“期望”](#expectation_sect1)中知道,表示神经元的二进制随机变量的期望值只是*p*,其余是代数简化。我们强烈鼓励您自己进行这些推导。当我们开始思考代表整个层中被屏蔽神经元数量的随机变量时,我们自然会问是否存在与期望相似的方差线性性质。不幸的是,该性质通常不成立: - en: V a r + ( A + B ) = + 𝔼 [ (A+B) + 2 ] - 𝔼 [A+B] + 2 - en: = + 𝔼 [ A 2 + 2 + * A * B + B 2 + ] - (𝔼[A]+𝔼[B]) + 2 - en: = 𝔼 [ A 2 + ] + 2 𝔼 [ A + * B ] + 𝔼 [ + B 2 ] - 𝔼 [A] + 2 - 2 𝔼 [ A + ] 𝔼 [ B ] - + 𝔼 [B] 2 - en: = + 𝔼 [ A 2 ] + - 𝔼 [A] 2 + + 𝔼 [ B 2 ] + - 𝔼 [B] 2 + + 2 𝔼 [ A * B + ] - 2 𝔼 [ A + ] 𝔼 [ B ] - en: = V + a r ( A ) + V a + r ( B ) + 2 ( 𝔼 + [ A * B ] - 𝔼 [ + A ] 𝔼 [ B ] ) - en: = V a r ( A ) + V a r ( B ) + 2 C o v ( A , B ) + id: totrans-69 prefs: [] type: TYPE_NORMAL + zh: = V + a r ( A ) + V a + r ( B ) + 2 C o + v ( A , B ) - en: As we can see from the last line, the final term in the expression, which we call the *covariance* between the two random variables, ruins our hope for linearity. However, covariance is another key concept in probability—the intuition for covariance @@ -798,20 +1172,26 @@ variables, the covariance between them should be zero, and linearity should hold in this special case. We highly encourage you to work through the math and show this on your own. + id: totrans-70 prefs: [] type: TYPE_NORMAL + zh: 正如我们从最后一行可以看到的那样,表达式中的最后一项,我们称之为两个随机变量之间的*协方差*,破坏了我们对线性的希望。然而,协方差是概率中的另一个关键概念——协方差的直觉是它衡量了两个随机变量之间的依赖关系。当一个随机变量更完全地确定另一个随机变量的值(想象*A*是一系列抛硬币的正面数量,*B*是同一系列抛硬币的反面数量),协方差的大小会增加。因此,可以推断,如果*A*和*B*是独立的随机变量,它们之间的协方差应该为零,在这种特殊情况下线性应该成立。我们强烈鼓励您通过数学来证明这一点。 - en: Back to the dropout example, the variance of the total number of masked neurons can be broken up into a sum of variances over each neuron, since each neuron is masked independently. The variance of the number of masked neurons is *p(1 – p)*n,* where *p(1 – p)* is the variance for any given neuron and *n* is the number of neurons. Expectation and variance in dropout allow us to understand more deeply what we expect to see when applying such a layer in a deep neural network. + id: totrans-71 prefs: [] type: TYPE_NORMAL + zh: 回到辍学的例子,被屏蔽神经元的总数的方差可以分解为每个神经元的方差之和,因为每个神经元都是独立屏蔽的。被屏蔽神经元的数量的方差是*p(1-p)*n*,其中*p(1-p)*是任何给定神经元的方差,*n*是神经元的数量。辍学中的期望和方差使我们能够更深入地理解在深度神经网络中应用这样一个层时我们期望看到什么。 - en: Bayes’ Theorem + id: totrans-72 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 贝叶斯定理 - en: Returning to our discussion on conditional probability, we noted that the probability of intersection between two events could be written as a product of a conditional distribution and a distribution over a single event. Let’s translate this into @@ -828,8 +1208,10 @@ . Note that we generally write the joint probability distribution as *P(A,B),* since this encompasses all possible joint settings of the random variables *A* and *B.* + id: totrans-73 prefs: [] type: TYPE_NORMAL + zh: 回到我们关于条件概率的讨论,我们注意到两个事件之间的交集的概率可以写成条件分布和单个事件的分布的乘积。现在让我们将这个翻译成随机变量的语言,现在我们已经介绍了这个新术语。我们将*A*表示为一个随机变量,*B*表示第二个随机变量。让*a*是*A*可以取的值,*b*是*B*可以取的值。对于随机变量的交集操作的类比是*联合概率分布P(A=a,B=b)*,表示*A=a*和*B=b*的事件。我们可以将*A=a*和*B=b*看作是单独的事件,当我们写*P(A=a,B=b)*时,我们考虑的是两个事件都发生的概率,即它们的交集。请注意,我们通常将联合概率分布写为*P(A,B)*,因为这包含了随机变量*A*和*B*的所有可能的联合设置。 - en: 'We mentioned earlier that intersection operations could be written as the product of a conditional distribution and a distribution over a single event. 
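    To see the covariance identity and the dropout variance in action before continuing, here is a small simulation sketch. It is not from the text; NumPy, a 10-flip coin sequence, and an arbitrary dropout setting are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
flips = rng.integers(0, 2, size=(200_000, 10))   # 200,000 sequences of 10 fair flips

heads = flips.sum(axis=1)   # A: number of heads per sequence
tails = 10 - heads          # B: number of tails, fully determined by A

# For dependent variables the covariance term matters:
# Var(A + B) = Var(A) + Var(B) + 2 Cov(A, B).
cov = np.cov(heads, tails)[0, 1]
print(np.var(heads + tails))                     # 0: A + B is always exactly 10
print(np.var(heads) + np.var(tails) + 2 * cov)   # approximately 0, matching the identity

# For independent variables (e.g., dropout masks on separate neurons) the
# covariance is near zero and variances simply add: n * p * (1 - p).
p, n = 0.3, 50
masks = rng.random((200_000, n)) < p
print(np.var(masks.sum(axis=1)))                 # approximately n * p * (1 - p)
print(n * p * (1 - p))                           # 10.5
```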
Rewriting this in the format for random variables, we have *P(A = a,B = b) = P(A = a|B = @@ -842,8 +1224,12 @@ by *B,* the path in which *B* takes on a value *b*, and then *A* takes on a value *a* in that universe makes much more sense than the reverse since, biologically, people contract a disease first and only then show symptoms for that disease.' + id: totrans-74 prefs: [] type: TYPE_NORMAL + zh: 我们之前提到,交集操作可以写成条件分布和单个事件的分布的乘积。将这个重写为随机变量的格式,我们有*P(A=a,B=b) = P(A=a|B=b)P(B=b)*。更一般地,考虑两个随机变量的所有可能的联合设置,我们有*P(A,B) + = P(A|B)P(B)*。我们还讨论了总是存在第二种写这个联合分布的方法:*P(A=a,B=b) = P(B=b|A=a)P(A=a)*,更一般地,*P(A,B) + = P(B|A)P(A)*。我们注意到有时其中一种路径比另一种更有意义。例如,在症状由*A*表示,疾病由*B*表示的情况下,*B*取一个值*b*,然后*A*在那个宇宙中取一个值*a*的路径比反向更有意义,因为从生物学上讲,人们先感染疾病,然后才表现出该疾病的症状。 - en: 'However, this doesn’t mean that the reverse isn’t useful. It is almost universally the case that people show up at a hospital with mild symptoms, and medical professionals must try to infer the most likely disease from these symptoms to effectively treat @@ -851,6 +1237,7 @@ of a disease given the observed symptoms. Since the same joint probability distribution can be written in the two ways mentioned in the previous paragraph, we have the following equality:' + id: totrans-75 prefs: [] type: TYPE_NORMAL - en: P ( B | A ) = P(A|B)P(B) P(A) + id: totrans-76 prefs: [] type: TYPE_NORMAL + zh: P ( B + | A ) = P(A|B)P(B) + P(A) - en: 'If *B* represents disease, while *A* represents symptoms, this gives us a method for computing the likelihood of any disease given the observed symptoms. Let’s analyze the right side to see if the equality also makes intuitive sense. The @@ -870,6 +1264,7 @@ of the numerator over all possible diseases. This is an instance of a more general process called *marginalization,* or removing a subset of random variables from a joint distribution by summing over all possible configurations of the subset:' + id: totrans-77 prefs: [] type: TYPE_NORMAL - en: P ( A ) = b P ( A , B = b ) + id: totrans-78 prefs: [] type: TYPE_NORMAL + zh: P ( A ) + = b P ( A + , B = b ) - en: 'In more concise terms, we have:' + id: totrans-79 prefs: [] type: TYPE_NORMAL - en: P ( + B = b query + | A ) = P(B=b + query ,A) + b P(B=b,A) - en: Bayes’ Theorem is a very valuable application of probability in the real world, especially in the case of disease prediction. Additionally, if we replace the random variable for symptoms with a random variable representing the result of @@ -901,9 +1313,11 @@ of actually having a specific disease given a positive test for it using Bayes’ Theorem. This is a common problem in most hospitals, and is especially relevant to epidemiology given the outbreak of COVID-19. + id: totrans-81 prefs: [] type: TYPE_NORMAL - en: Entropy, Cross Entropy, and KL Divergence + id: totrans-82 prefs: - PREF_H1 type: TYPE_NORMAL @@ -913,6 +1327,7 @@ sorts of events. In this section, we first consider the problem of defining a single metric that encapsulates all of the uncertainty within a probability distribution, which we will define as the distribution’s *entropy*. + id: totrans-83 prefs: [] type: TYPE_NORMAL - en: Let’s set up the following scenario. I am a researcher who is running an experiment. @@ -924,6 +1339,7 @@ you write down. As a scribe, you are necessary in this situation—I may run hundreds of trials and my memory is limited, so I cannot remember the results of all of my trials. 
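    Referring back to the Bayes' Theorem discussion above, here is a small worked sketch that computes the probability of having a disease given a positive test, using marginalization for the denominator. The prevalence, sensitivity, and false-positive rate below are invented purely for illustration.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' Theorem.
p_disease = 0.01              # prior P(B = disease)
p_pos_given_disease = 0.95    # P(A = positive test | disease)
p_pos_given_healthy = 0.05    # P(A = positive test | no disease)

# Marginalization: P(A) = sum over b of P(A | B = b) * P(B = b).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Theorem: P(B | A) = P(A | B) * P(B) / P(A).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)    # approximately 0.161: still unlikely despite the positive test
```

    Even with a fairly accurate test, a rare disease remains improbable after a single positive result, which is exactly the kind of reasoning Bayes' Theorem makes precise.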
+ id: totrans-84 prefs: [] type: TYPE_NORMAL - en: For example, if I roll a dice and neither of us knows anything about the fairness @@ -933,6 +1349,7 @@ to the end of the string consisting of all results so far. If I were to roll a one, followed by two twos, and finally a one, using the encoding scheme defined so far you would have written down “0110.” + id: totrans-85 prefs: [] type: TYPE_NORMAL - en: After all runs of the experiment have ended, I have a meeting with you and try @@ -942,8 +1359,10 @@ by a two, followed by a three? Or even a one, followed by a four, followed by a one? It seems that there are at least a few possible translations of this string into outcomes using the encoding scheme. + id: totrans-86 prefs: [] type: TYPE_NORMAL + zh: 在所有实验运行结束后,我与您开会,尝试将这个字符串“0110”解密为一系列结果,以供我的研究使用。然而,作为研究人员,我对这个字符串感到困惑——它代表一个一,接着两个二,最后一个一吗?还是代表一个一,接着一个二,再接着一个三?甚至是一个一,接着一个四,再接着一个一?看起来至少有几种可能的翻译方式可以使用编码方案将这个字符串转换为结果。 - en: To prevent this situation from ever occurring again, we decide to enforce some limitations on the binary strings you can use to represent outcomes. We use what is called a *prefix code*, which disallows binary string representations of different @@ -957,8 +1376,10 @@ prefix of the binary string that has been successfully translated to a series of outcomes. We then recursively use this logic until we have reached the end of the string. + id: totrans-87 prefs: [] type: TYPE_NORMAL + zh: 为了防止这种情况再次发生,我们决定对您用于表示结果的二进制字符串施加一些限制。我们使用所谓的*前缀编码*,它不允许不同结果的二进制字符串表示成为彼此的前缀。不难理解为什么这会导致字符串到结果的唯一翻译。假设我们有一个二进制字符串,其中的某个前缀我们已成功解码为一系列结果。要解码剩余的字符串,或者后缀,我们必须首先找到系列中的下一个结果。当我们找到这个后缀的前缀被翻译为一个结果时,我们已经知道,根据定义,没有更小的前缀可以翻译为有效的结果。现在我们有一个更大的前缀二进制字符串已成功翻译为一系列结果。然后我们递归使用这种逻辑,直到达到字符串的末尾。 - en: Now that we have some guidelines on string representations for outcomes, we redo the original experiment with one as “0,” two as “10,” three as “110,” four as “1110,” five as “11110,” and six as “111110.” However, as noted earlier, I @@ -969,8 +1390,10 @@ number of letters you’d need to write down per trial is 3.5\. We could get down to 3 if we set one as “000,” two as “001,” three as “010,” four as “011,” five as “100,” and six as “101,” for example. + id: totrans-88 prefs: [] type: TYPE_NORMAL + zh: 现在我们有了一些关于结果的字符串表示的指导方针,我们使用“0”表示一,使用“10”表示二,使用“110”表示三,使用“1110”表示四,使用“11110”表示五,使用“111110”表示六重新进行原始实验。然而,正如前面提到的,我可能进行数百次试验,作为抄写员,您可能希望限制您需要写入的数量。在没有关于骰子的信息的情况下,我们无法做得比这更好。假设每个结果出现的概率为1/6,您每次试验需要写下的预期字母数量为3.5。例如,如果我们将一设置为“000”,将二设置为“001”,将三设置为“010”,将四设置为“011”,将五设置为“100”,将六设置为“101”,我们可以降至3。 - en: But what if we knew information about the dice? For example, what if it were a weighted dice that showed up six almost all of the time? In that case, you probably want to assign a shorter binary string to six, for example “0” (instead of assigning @@ -978,8 +1401,10 @@ makes intuitive sense that, as the result of any single trial becomes more and more certain, the expected number of characters you’d need to write becomes lower by assigning the shortest binary strings to the most likely outcomes. + id: totrans-89 prefs: [] type: TYPE_NORMAL + zh: 但如果我们知道有关骰子的信息呢?例如,如果它是一个加权骰子,几乎总是出现六点?在这种情况下,您可能希望为六分配一个更短的二进制字符串,例如“0”(而不是将“0”分配给一),这样您就可以限制您需要写入的预期数量。直观地讲,随着任何单次试验的结果变得越来越确定,通过将最可能的结果分配给最短的二进制字符串,您需要写入的预期字符数量就会降低。 - en: 'This raises the question: given a probability distribution over outcomes, what is the optimal encoding scheme, where optimal is defined as the fewest expected number of characters you’d need to write per trial? Although this whole situation @@ -990,8 +1415,10 @@ expectation per trial. 
For example, in the extreme case where we already knew beforehand that a six would always show up, the scribe wouldn’t need to write anything down.' + id: totrans-90 prefs: [] type: TYPE_NORMAL + zh: 这引发了一个问题:在结果上给定一个概率分布,什么是最佳的编码方案,其中最佳被定义为每次试验需要写入的最少预期字符数量?尽管整个情况可能有点刻意,但它为我们提供了一个稍微不同的视角,通过它我们可以理解概率分布中的不确定性。正如我们所指出的,随着实验结果变得越来越确定,最佳的编码方案将允许抄写员每次试验的预期字符数量变得越来越少。例如,在极端情况下,如果我们事先知道六点总是会出现,抄写员就不需要写任何东西。 - en: 'It turns out that, although we won’t show it here, the best you can do is assign a binary string of length p ( x i ) is its probability. The expected string length of any given trial would then be:' + id: totrans-91 prefs: [] type: TYPE_NORMAL + zh: 事实证明,尽管我们在这里不会展示,但你可以做的最好的事情是为每个可能结果*x_i*分配一个长度为*log_2(1/p(x_i))*的二进制字符串,其中*p(x_i)*是其概率。然后,任何给定试验的预期字符串长度将是: - en: log 2 1 p(x i ) + id: totrans-92 prefs: [] type: TYPE_NORMAL + zh: 𝔼 p(x) + [ log 2 1 + p(x) ] + = x i p + ( x i ) log 2 1 p(x + i ) - en: = @@ -1024,14 +1466,24 @@ ( x i ) log 2 p ( x i ) + id: totrans-93 prefs: [] type: TYPE_NORMAL + zh: = + - x i p + ( x i ) log 2 p ( x + i ) - en: This expression is defined as the *entropy* of a probability distribution. In the case where we are completely certain of the final outcome (e.g., the dice always lands up six), we can evaluate the expression for entropy and see that we get a result of 0. + id: totrans-94 prefs: [] type: TYPE_NORMAL + zh: 这个表达式被定义为概率分布的*熵*。在我们完全确定最终结果的情况下(例如,骰子总是掷出六点),我们可以评估熵的表达式,看到我们得到的结果是0。 - en: In the case where we are completely certain of the final outcome (e.g., the dice always lands up six), we can evaluate the expression for entropy and see that we get a result of 0\. Additionally, the probability distribution that has @@ -1039,16 +1491,22 @@ outcomes. This is because, for any given trial, we are no more certain that a particular outcome will appear as opposed to any other outcome. As a result, we cannot use the strategy of assigning a shorter string to any single outcome. + id: totrans-95 prefs: [] type: TYPE_NORMAL + zh: 在我们完全确定最终结果的情况下(例如,骰子总是掷出六点),我们可以评估熵的表达式,看到我们得到的结果是0。此外,具有最高熵的概率分布是将等概率分布在所有可能结果上的分布。这是因为对于任何给定的试验,我们对某个特定结果出现与其他结果出现一样的确定性。因此,我们不能使用将较短的字符串分配给任何单个结果的策略。 - en: Now that we have defined entropy, we can discuss cross entropy, which provides us a way of measuring the distinctness of two distributions. + id: totrans-96 prefs: [] type: TYPE_NORMAL + zh: 现在我们已经定义了熵,我们可以讨论交叉熵,它为我们提供了一种衡量两个分布之间差异的方法。 - en: Equation 2-1\. Cross entropy + id: totrans-97 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 方程2-1. 交叉熵 - en: log 2 q ( x ) + id: totrans-98 prefs: [] type: TYPE_NORMAL + zh: C + E ( p | | q ) + = 𝔼 p(x) + [ log 2 1 + q(x) ] + = x p ( x + ) log 2 1 + q(x) = - + x p ( x ) + log 2 q ( + x ) - en: Note that cross entropy has a log 1 q(x) term, @@ -1083,8 +1560,10 @@ so we assume some distribution *q(x)* to optimize our encoding scheme, but as we carry out trials, we learn more information that gets us closer to the true distribution *p(x)*. + id: totrans-99 prefs: [] type: TYPE_NORMAL + zh: 注意交叉熵中有一个项,可以解释为对每个结果分配的最佳二进制字符串长度,假设结果按概率分布*q(x)*出现。然而,请注意这是相对于*p(x)*的期望,那么我们如何解释整个表达式呢?嗯,我们可以理解交叉熵是指在为分布*q(x)*优化编码方案的情况下,对于任何试验的预期字符串长度,而实际上,所有结果都是根据分布*p(x)*出现的。这在实验中肯定会发生,因为我们对实验的先验信息有限,所以我们假设某个分布*q(x)*来优化我们的编码方案,但随着我们进行试验,我们学到了更多信息,使我们更接近真实分布*p(x)*。 - en: 'The KL divergence takes this logic a bit further. 
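    Before unpacking that, here is a small numerical sketch of entropy and cross entropy, whose gap is precisely the quantity defined next. It assumes NumPy, and the two dice distributions are invented for illustration.

```python
import numpy as np

# Two distributions over the faces of a die: the true, weighted one, p,
# and the fair one, q, that we might have assumed when designing our code.
p = np.array([0.02, 0.02, 0.02, 0.02, 0.02, 0.90])   # heavily weighted toward six
q = np.full(6, 1 / 6)

entropy_p = -np.sum(p * np.log2(p))       # optimal bits per trial if we know p
cross_entropy = -np.sum(p * np.log2(q))   # bits per trial if we encode assuming q
print(entropy_p)                          # approximately 0.70 bits
print(cross_entropy)                      # approximately 2.58 bits
print(cross_entropy - entropy_p)          # the extra bits per trial: KL(p || q)

# A uniform distribution maximizes entropy: log2(6), about 2.58 bits, for a fair die.
print(-np.sum(q * np.log2(q)))
```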
If we take the cross entropy, which tells us the expected number of bits per trial given we have optimized our encoding for the incorrect distribution *q(x),* and subtract from that the entropy, @@ -1092,8 +1571,10 @@ the correct distribution *p(x*), we get the expected number of extra bits required to represent a trial when using *q(x)* compared to *p(x)*. Here is the expression for the KL divergence:' + id: totrans-100 prefs: [] type: TYPE_NORMAL + zh: KL散度将这种逻辑推得更远。如果我们取交叉熵,告诉我们在为不正确的分布*q(x)*优化我们的编码时每次试验预期的比特数,然后从中减去熵,告诉我们在为正确的分布*p(x)*优化时每次试验预期的比特数,我们得到了使用*q(x)*比*p(x)*时表示试验所需的额外比特数的预期值。以下是KL散度的表达式: - en: log 2 p(x) q(x) ] + id: totrans-101 prefs: [] type: TYPE_NORMAL + zh: K + L ( p | | q ) + = 𝔼 p(x) + [ log 2 1 + q(x) - log 2 1 p(x) + ] = 𝔼 p(x) + [ log 2 p(x) + q(x) ] - en: At the unique global minimum *q(x)* = *p(x)*, the KL divergence is exactly zero. Why this is the unique minimum is a bit beyond the scope of this text, so we leave that as an exercise for you. + id: totrans-102 prefs: [] type: TYPE_NORMAL + zh: 在唯一的全局最小值*q(x)* = *p(x)*处,KL散度恰好为零。为什么这是唯一的最小值有点超出了本文的范围,所以我们把它留给你作为一个练习。 - en: In practice, when trying to match the true distribution *p(x)* with a learned distribution *q(x)*, KL divergence is often minimized as an objective function. Most models will actually minimize the cross entropy in place of the KL divergence, @@ -1125,6 +1625,7 @@ is a constant and has no dependence on the weights that parameterize *q(x)*. Thus, the gradient with respect to the weights that parameterize *q(x)* when using either objective is the same. + id: totrans-103 prefs: [] type: TYPE_NORMAL - en: One common example where cross-entropy/KL divergence is optimized is in the @@ -1141,9 +1642,11 @@ of the data. Both are valid interpretations of how neural networks are trained, and lead to the same objective function. We encourage you to try writing out both expressions independently to see this. + id: totrans-104 prefs: [] type: TYPE_NORMAL - en: Continuous Probability Distributions + id: totrans-105 prefs: - PREF_H1 type: TYPE_NORMAL @@ -1153,6 +1656,7 @@ digits. We can define probability distributions over sample spaces of infinite size, such as all the real numbers. In this section, we will extend principles covered in the previous sections to the continuous realm. + id: totrans-106 prefs: [] type: TYPE_NORMAL - en: In the continuous realm, probability distributions are often referred to as @@ -1167,6 +1671,7 @@ 2 right-parenthesis">P ( X 2 ) , all we’d need to do is integrate the PDF of *X* from negative infinity to 2. + id: totrans-107 prefs: [] type: TYPE_NORMAL - en: But how about the probability of any individual outcome, say *P*(*X* = 2)? Since @@ -1178,6 +1683,7 @@ outcomes we are most likely to see when performing an experiment over a continuous space. Going forward, when considering continuous probability distributions, we will only refer to events as having probability, rather than individual outcomes. + id: totrans-108 prefs: [] type: TYPE_NORMAL - en: One famous example of a continuous probability distribution is the *uniform @@ -1188,13 +1694,16 @@ height, or the likelihood for each outcome, is the value that makes the area of the rectangle equal to one. [Figure 2-4](#c299) shows the uniform distribution over the interval [0,0.5]. + id: totrans-109 prefs: [] type: TYPE_NORMAL - en: '![](Images/fdl2_0204.png)' + id: totrans-110 prefs: [] type: TYPE_IMG - en: Figure 2-4\. 
The uniform distribution has uniform height over its entire area, which shows that each value in the domain of the distribution has equal likelihood. + id: totrans-111 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1202,6 +1711,7 @@ and probabilities in the continuous realm. The height of the rectangle being 2 was no error—there is no constraint on the magnitude of the likelihood in continuous distributions, unlike probabilities, which must be less than or equal to 1. + id: totrans-112 prefs: [] type: TYPE_NORMAL - en: 'Another famous example of a continuous probability distribution is the *Gaussian @@ -1209,6 +1719,7 @@ in the real world. The Gaussian distribution is defined by two parameters: its mean μ and its standard deviation σ . The PDF of a Gaussian distribution is:' + id: totrans-113 prefs: [] type: TYPE_NORMAL - en: f ( x ; + μ , σ ) = 1 + σ2π + e -1 2(x-μ + σ) 2 - en: Why this function integrates to 1 over the real domain is beyond the scope of this chapter, but one important characteristic of a Gaussian distribution is that its mean is also its unique mode. In other words, the outcome with the highest @@ -1228,14 +1748,17 @@ For example, [Figure 2-4](#c299) does not have this property. The graph of a standard Gaussian, which has mean zero and unit variance, is shown in [Figure 2-5](#c302) (the PDF asymptotically reaches zero in the limit in both directions). + id: totrans-115 prefs: [] type: TYPE_NORMAL - en: '![](Images/fdl2_0205.png)' + id: totrans-116 prefs: [] type: TYPE_IMG - en: Figure 2-5\. The Gaussian distribution has a bell shape, with highest likelihood in the center and dropping exponentially as the value in question gets farther and farther from the center. + id: totrans-117 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1250,6 +1773,7 @@ when standardized correctly, is approximately distributed as a standard Gaussian distribution. We won’t cover CLT in much depth here, but it has more recently been extended to weakly dependent variables under certain special conditions. + id: totrans-118 prefs: [] type: TYPE_NORMAL - en: Many real-world datasets can be seen as approximately sums of many random variables. @@ -1258,12 +1782,14 @@ many Bernoulli random variables (where each person is a Bernoulli random variable that has a value of 1 if they have the disease and a value of 0 if they do not)—although likely dependent. + id: totrans-119 prefs: [] type: TYPE_NORMAL - en: 'Continuous random variables are still functions, just as we defined discrete random variables. The only difference is that the range of this function is a continuous space. To compute the expectation and variance of a continuous random variable, all we need to do is replace our summations with integrations, as follows:' + id: totrans-120 prefs: [] type: TYPE_NORMAL - en: 𝔼 [ X ] + = x x * f + ( X = x ) d + x - en: V a r + ( X ) = x + (x-𝔼[X]) + 2 * f ( X = + x ) d x - en: 'As an example, let’s evaluate the expectation for our uniform random variable defined earlier. But first, confirm that it makes intuitive sense that the expectation should be 0.25, since the endpoints of the interval are 0 and 0.5 and all values in between are of equal likelihood. Now, let’s evaluate the integral and see if the computation matches our intuition:' + id: totrans-123 prefs: [] type: TYPE_NORMAL - en: 0 0.5 + x * f ( x ) + d x = 0 0.5 + 2 x d x - en: = x 2 | 0 0.5 + id: totrans-125 prefs: [] type: TYPE_NORMAL + zh: = x 2 | + 0 0.5 - en: = 0 . 25 + id: totrans-126 prefs: [] type: TYPE_NORMAL + zh: = 0 . 
25 - en: Where the superscript and the subscript of the | symbol represent the values at which we will evaluate the preceding function, which we will then difference to get the value of the integral. We see that the expectation comes out to the same value as our intuition, which is a great sanity check. + id: totrans-127 prefs: [] type: TYPE_NORMAL - en: 'Bayes’ Theorem also holds for continuous variables. The only major difference @@ -1320,6 +1877,7 @@ example of extending the tenets of probability to the continuous space by replacing summations with integrations. Here is Bayes’ Theorem for continuous probability distributions, following the notation from [“Bayes’ Theorem”](#bayes-theorem-sect):' + id: totrans-128 prefs: [] type: TYPE_NORMAL - en: P + ( B = b query + | A ) = P(A|B=b + query )P(B=b + query ) + P(A) = P(A|B=b + query )P(B=b + query ) + b P(A,B=b)db - en: 'And finally, we have our discussion on entropy, cross entropy, and KL divergence. All three of these extend nicely to the continuous space as well. We replace our summations with integrations and note that the properties introduced in the previous @@ -1348,6 +1924,7 @@ highest entropy is the uniform distribution, and the KL divergence between two distributions is zero if and only if the two distributions are the exact same. Here are the definitions in their continuous form, following [Equation 2-1](#cross-entropy-formula):' + id: totrans-130 prefs: [] type: TYPE_NORMAL - en: log 2 f ( x ) d x + id: totrans-131 prefs: [] type: TYPE_NORMAL + zh: H ( f ( x + ) ) = - x + f ( x ) log + 2 f ( x ) d + x - en: log 2 f(x) g(x) d x + id: totrans-132 prefs: [] type: TYPE_NORMAL + zh: K + L ( f ( x ) + | | g ( x ) + ) = x f ( + x ) log 2 + f(x) g(x) + d x - en: log 2 g ( x ) d x + id: totrans-133 prefs: [] type: TYPE_NORMAL + zh: C + E ( f ( x ) + | | g ( x ) + ) = - x f + ( x ) log + 2 g ( x ) d + x - en: Our extension of these concepts to the continuous space will come in handy in [Chapter 10](ch10.xhtml#ch10), where we model many distributions as Gaussians. Additionally, we use the KL divergence/cross-entropy terms as a regularization @@ -1392,9 +2001,11 @@ is only zero when the query distribution matches the target distribution, setting the target distribution to a Gaussian forces the learned distribution to approximate a Gaussian. + id: totrans-134 prefs: [] type: TYPE_NORMAL - en: Summary + id: totrans-135 prefs: - PREF_H1 type: TYPE_NORMAL @@ -1407,6 +2018,7 @@ technique in neural nets. Finally, we discussed measurements of uncertainty in probability distributions such as entropy, and generalized these concepts to the continuous realm. + id: totrans-136 prefs: [] type: TYPE_NORMAL - en: Probability is a field that affects the choices in our everyday lives, and it’s @@ -1414,5 +2026,6 @@ introduction puts the rest of the book in perspective and allows you to more rigorously understand future concepts. In the next chapter, we will discuss the structure of neural networks, and the motivations behind their design. + id: totrans-137 prefs: [] type: TYPE_NORMAL
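    As a closing illustration of the continuous-distribution material above, here is a minimal sketch, assuming NumPy, that checks the uniform-distribution expectation computed earlier and shows the Central Limit Theorem in miniature. The sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

# Empirical check of the uniform expectation worked out above:
# the mean of Uniform[0, 0.5] is 0.25.
print(rng.uniform(0.0, 0.5, size=200_000).mean())   # approximately 0.25

# Central Limit Theorem in miniature: a standardized sum of many independent
# uniforms behaves approximately like a standard Gaussian.
n = 100
sums = rng.uniform(0.0, 0.5, size=(50_000, n)).sum(axis=1)
mu, var = n * 0.25, n * (0.5 ** 2) / 12   # mean and variance of the sum
z = (sums - mu) / np.sqrt(var)
print(z.mean(), z.var())                  # approximately 0 and 1
print(np.mean(np.abs(z) < 1))             # approximately 0.68, as for a standard Gaussian
```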