diff --git a/totrans/dl-scr_2.yaml b/totrans/dl-scr_2.yaml index ebc4c18..f69bbf6 100644 --- a/totrans/dl-scr_2.yaml +++ b/totrans/dl-scr_2.yaml @@ -1410,6 +1410,9 @@ id: totrans-160 prefs: [] type: TYPE_NORMAL + zh: 做“一堆线性回归”是什么意思?做一个线性回归涉及使用一组参数进行矩阵乘法:如果我们的数据*X*的维度是`[batch_size, num_features]`,那么我们将它乘以一个维度为`[num_features, + 1]`的权重矩阵*W*,得到一个维度为`[batch_size, 1]`的输出;对于批次中的每个观察值,这个输出只是原始特征的一个*加权和*。要做多个线性回归,我们只需将我们的输入乘以一个维度为`[num_features, + num_outputs]`的权重矩阵,得到一个维度为`[batch_size, num_outputs]`的输出;现在,*对于每个观察值*,我们有`num_outputs`个不同的原始特征的加权和。 - en: What are these weighted sums? We should think of each of them as a “learned feature”—a combination of the original features that, once the network is trained, will represent its attempt to learn combinations of features that help it accurately @@ -1418,26 +1421,31 @@ id: totrans-161 prefs: [] type: TYPE_NORMAL + zh: 这些加权和是什么?我们应该将它们中的每一个看作是一个“学习到的特征”——原始特征的组合,一旦网络训练完成,将代表其尝试学习的特征组合,以帮助准确预测房价。我们应该创建多少个学习到的特征?让我们创建13个,因为我们创建了13个原始特征。 - en: 'Step 2: A Nonlinear Function' id: totrans-162 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 步骤2:一个非线性函数 - en: Next, we’ll feed each of these weighted sums through a *non*linear function; the first function we’ll try is the `sigmoid` function that was mentioned in [Chapter 1](ch01.html#foundations). As a refresher, [Figure 2-9](#fig_02-09) plots the `sigmoid` function. id: totrans-163 prefs: [] type: TYPE_NORMAL + zh: 接下来,我们将通过一个非线性函数来处理这些加权和;我们将尝试的第一个函数是在第1章中提到的`sigmoid`函数。作为提醒,[图2-9](#fig_02-09)展示了`sigmoid`函数。 - en: '![Sigmoid](assets/dlfs_0209.png)' id: totrans-164 prefs: [] type: TYPE_IMG + zh: '![Sigmoid](assets/dlfs_0209.png)' - en: Figure 2-9\. Sigmoid function plotted from x = –5 to x = 5 id: totrans-165 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图2-9。从x = -5到x = 5绘制的Sigmoid函数 - en: Why is using this nonlinear function a good idea? Why not the `square` function *f*(*x*) = *x*², for example? There are a couple of reasons. First, we want the function we use here to be *monotonic* so that it “preserves” information about @@ -1450,17 +1458,20 @@ id: totrans-166 prefs: [] type: TYPE_NORMAL + zh: 为什么使用这个非线性函数是个好主意?为什么不使用`square`函数*f*(*x*) = *x*²,例如?有几个原因。首先,我们希望在这里使用的函数是*单调*的,以便“保留”输入的数字的信息。假设,给定输入的日期,我们的两个线性回归分别产生值-3和3。然后通过`square`函数传递这些值将为每个产生一个值9,因此任何接收这些数字作为输入的函数在它们通过`square`函数传递后将“丢失”一个原始为-3,另一个为3的信息。 - en: The second reason, of course, is that the function is nonlinear; this nonlinearity will enable our neural network to model the inherently nonlinear relationship between the features and the target. 
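To make the shapes in the step above concrete, here is a minimal NumPy sketch of “a bunch of linear regressions” followed by the `sigmoid` squashing; the batch size, the random data, and the `sigmoid` helper are illustrative assumptions, not the book’s code:

```python
import numpy as np

np.random.seed(0)
batch_size, num_features, num_outputs = 3, 13, 13

X = np.random.randn(batch_size, num_features)    # data: [batch_size, num_features]
W = np.random.randn(num_features, num_outputs)   # weights: [num_features, num_outputs]

weighted_sums = np.dot(X, W)                     # [batch_size, num_outputs]

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

learned_features = sigmoid(weighted_sums)        # monotonic, nonlinear squash into (0, 1)

print(weighted_sums.shape)                       # (3, 13): 13 weighted sums per observation
```

Each column of `weighted_sums` is one candidate “learned feature” computed for every observation in the batch.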
id: totrans-167 prefs: [] type: TYPE_NORMAL + zh: 当然,第二个原因是这个函数是非线性的;这种非线性将使我们的神经网络能够建模特征和目标之间固有的非线性关系。 - en: 'Finally, the `sigmoid` function has the nice property that its derivative can be expressed in terms of the function itself:' id: totrans-168 prefs: [] type: TYPE_NORMAL + zh: 最后,`sigmoid`函数有一个很好的性质,即它的导数可以用函数本身来表示: - en: σ u ( x ) = σ ( x ) × ( 1 - @@ -1477,11 +1488,13 @@ id: totrans-170 prefs: [] type: TYPE_NORMAL + zh: 我们将很快在神经网络的反向传播中使用`sigmoid`函数时使用它。 - en: 'Step 3: Another Linear Regression' id: totrans-171 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 步骤3:另一个线性回归 - en: Finally, we’ll take the resulting 13 elements—each of which is a combination of the original features, fed through the `sigmoid` function so that they all have values between 0 and 1—and feed them into a regular linear regression, using @@ -1489,6 +1502,7 @@ id: totrans-172 prefs: [] type: TYPE_NORMAL + zh: 最后,我们将得到的13个元素——每个元素都是原始特征的组合,通过`sigmoid`函数传递,使它们的值都在0到1之间——并将它们输入到一个常规线性回归中,使用它们的方式与我们之前使用原始特征的方式相同。 - en: 'Then, we’ll try training the *entire* resulting function in the same way we trained the standard linear regression earlier in this chapter: we’ll feed data through the model, use the chain rule to figure out how much increasing the weights @@ -1499,31 +1513,37 @@ id: totrans-173 prefs: [] type: TYPE_NORMAL + zh: 然后,我们将尝试训练*整个*得到的函数,方式与本章前面训练标准线性回归的方式相同:我们将数据通过模型,使用链式法则来计算增加权重会增加(或减少)损失多少,然后在每次迭代中更新权重,以减少损失。随着时间的推移(我们希望),我们将得到比以前更准确的模型,一个已经“学会”了特征和目标之间固有非线性关系的模型。 - en: It might be tough to wrap your mind around what’s going on based on this description, so let’s look at an illustration. id: totrans-174 prefs: [] type: TYPE_NORMAL + zh: 根据这个描述,可能很难理解正在发生的事情,所以让我们看一个插图。 - en: Diagrams id: totrans-175 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 图表 - en: '[Figure 2-10](#fig_02-10) is a diagram of what our more complicated model now looks like.' id: totrans-176 prefs: [] type: TYPE_NORMAL + zh: '[图2-10](#fig_02-10)是我们更复杂模型的图表。' - en: '![Neural network forward pass](assets/dlfs_0210.png)' id: totrans-177 prefs: [] type: TYPE_IMG + zh: '![神经网络前向传播](assets/dlfs_0210.png)' - en: Figure 2-10\. Steps 1–3 translated into a computational graph of the kind we saw in [Chapter 1](ch01.html#foundations) id: totrans-178 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图2-10。将步骤1-3翻译成我们在第1章中看到的计算图的一种类型 - en: 'You’ll see that we start with matrix multiplication and matrix addition, as before. Now let’s formalize some terminology that was mentioned previously: when we apply these operations in the course of a nested function, we’ll call the first diff --git a/totrans/dl-scr_3.yaml b/totrans/dl-scr_3.yaml index b4cd162..f12057f 100644 --- a/totrans/dl-scr_3.yaml +++ b/totrans/dl-scr_3.yaml @@ -1,4 +1,5 @@ - en: Chapter 3\. Deep Learning from Scratch + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL @@ -13,6 +14,7 @@ learn to represent these building blocks themselves as abstract Python classes and then use these classes to build deep learning models; by the end of this chapter, you will indeed have done “deep learning from scratch”!' + id: totrans-1 prefs: [] type: TYPE_NORMAL - en: 'We’ll also map the descriptions of neural networks in terms of these building @@ -25,9 +27,11 @@ that happen at a low level. In the first part of this chapter, we’ll map this description of models to common higher-level concepts such as “layers” that will ultimately allow us to more easily describe more complex models.' 
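As a quick sanity check of the derivative identity quoted above, σ′(x) = σ(x) × (1 − σ(x)), the following snippet (illustrative only, not from the book) compares it with a central-difference approximation:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central differences
analytic = sigmoid(x) * (1 - sigmoid(x))                      # the identity from the text

print(np.allclose(numeric, analytic))   # True
```

This is the quantity the backward pass will reuse: once σ(x) has been computed on the forward pass, its derivative costs only one extra subtraction and multiplication.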
+ id: totrans-2 prefs: [] type: TYPE_NORMAL - en: 'Deep Learning Definition: A First Pass' + id: totrans-3 prefs: - PREF_H1 type: TYPE_NORMAL @@ -39,26 +43,31 @@ We found that if we defined the model as a function that included *parameters* as inputs to some of its operations, we could “fit” it to optimally describe the data using the following procedure:' + id: totrans-4 prefs: [] type: TYPE_NORMAL - en: Repeatedly feed observations through the model, keeping track of the quantities computed along the way during this “forward pass.” + id: totrans-5 prefs: - PREF_OL type: TYPE_NORMAL - en: Calculate a *loss* representing how far off our model’s predictions were from the desired outputs or *target*. + id: totrans-6 prefs: - PREF_OL type: TYPE_NORMAL - en: Using the quantities computed on the forward pass and the chain rule math worked out in [Chapter 1](ch01.html#foundations), compute how much each of the input *parameters* ultimately affects this loss. + id: totrans-7 prefs: - PREF_OL type: TYPE_NORMAL - en: Update the values of the parameters so that the loss will hopefully be reduced when the next set of observations is passed through the model. + id: totrans-8 prefs: - PREF_OL type: TYPE_NORMAL @@ -67,6 +76,7 @@ linear regression model). This had the expected limitation that, even when fit “optimally,” the model could nevertheless represent only linear relationships between our features and our target. + id: totrans-9 prefs: [] type: TYPE_NORMAL - en: We then defined a function structure that applied these linear operations first, @@ -75,12 +85,14 @@ something closer to the true, nonlinear relationship between input and output, while having the additional benefit that it could learn relationships between *combinations* of our input features and the target. + id: totrans-10 prefs: [] type: TYPE_NORMAL - en: 'What is the connection between models like these and deep learning models? We’ll start with a somewhat clumsy attempt at a definition: deep learning models are represented by series of operations that have *at least two, nonconsecutive* nonlinear functions involved.' + id: totrans-11 prefs: [] type: TYPE_NORMAL - en: I’ll show where this definition comes from shortly, but first note that since @@ -92,6 +104,7 @@ is differentiable, so as long as the individual operations making up the function are differentiable, the whole function will be differentiable, and we’ll be able to train it using the same four-step training procedure just described. + id: totrans-12 prefs: [] type: TYPE_NORMAL - en: However, so far our approach to actually training these models has been to compute @@ -108,14 +121,17 @@ To guide us in the right direction as far as which abstractions to create, we’ll try to map the operations we’ve been using to traditional descriptions of neural networks as being made up of “layers,” “neurons,” and so on. + id: totrans-13 prefs: [] type: TYPE_NORMAL - en: As our first step, we’ll have to create an abstraction to represent the individual operations we’ve been working with so far, instead of continuing to code the same matrix multiplication and bias addition over and over again. + id: totrans-14 prefs: [] type: TYPE_NORMAL - en: 'The Building Blocks of Neural Networks: Operations' + id: totrans-15 prefs: - PREF_H1 type: TYPE_NORMAL @@ -126,6 +142,7 @@ such as matrix multiplication, seem to have *another* special kind of input, also an `ndarray`: the parameters. 
In our `Operation` class—or perhaps in another class that inherits from it—we should allow for `params` as another instance variable.' + id: totrans-16 prefs: [] type: TYPE_NORMAL - en: 'Another insight is that there seem to be two types of `Operation`s: some, such @@ -141,66 +158,84 @@ network). Also on the backward pass, each `Operation` will send an “input gradient” backward, representing the partial derivative of the loss with respect to each element of the input.' + id: totrans-17 prefs: [] type: TYPE_NORMAL - en: 'These facts place a few important restrictions on the workings of our `Operation`s that will help us ensure we’re computing the gradients correctly:' + id: totrans-18 prefs: [] type: TYPE_NORMAL - en: The shape of the *output gradient* `ndarray` must match the shape of the *output*. + id: totrans-19 prefs: - PREF_UL type: TYPE_NORMAL - en: The shape of the *input gradient* that the `Operation` sends backward during the backward pass must match the shape of the `Operation`’s *input*. + id: totrans-20 prefs: - PREF_UL type: TYPE_NORMAL - en: This will all be clearer once you see it in a diagram; let’s look at that next. + id: totrans-21 prefs: [] type: TYPE_NORMAL - en: Diagram + id: totrans-22 prefs: - PREF_H2 type: TYPE_NORMAL - en: This is all summarized in [Figure 3-1](#fig_03-01), for an operation `O` that is receiving inputs from an operation `N` and passing outputs on to another operation `P`. + id: totrans-23 prefs: [] type: TYPE_NORMAL - en: '![Neural net diagram](assets/dlfs_0301.png)' + id: totrans-24 prefs: [] type: TYPE_IMG - en: Figure 3-1\. An Operation, with input and output + id: totrans-25 prefs: - PREF_H6 type: TYPE_NORMAL - en: '[Figure 3-2](#fig_03-02) covers the case of an `Operation` with parameters.' + id: totrans-26 prefs: [] type: TYPE_NORMAL - en: '![Neural net diagram](assets/dlfs_0302.png)' + id: totrans-27 prefs: [] type: TYPE_IMG - en: Figure 3-2\. A ParamOperation, with input and output and parameters + id: totrans-28 prefs: - PREF_H6 type: TYPE_NORMAL - en: Code + id: totrans-29 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'With all this, we can write the fundamental building block for our neural network, an `Operation`, as:' + id: totrans-30 prefs: [] type: TYPE_NORMAL - en: '[PRE0]' + id: totrans-31 prefs: [] type: TYPE_PRE + zh: '[PRE0]' - en: For any individual `Operation` that we define, we’ll have to implement the `_output` and `_input_grad` functions, so named because of the quantities they compute. + id: totrans-32 prefs: [] type: TYPE_NORMAL - en: Note + id: totrans-33 prefs: - PREF_H6 type: TYPE_NORMAL @@ -209,28 +244,35 @@ throughout deep learning fit this blueprint of sending inputs forward and gradients backward, with the shapes of what they receive on the forward pass matching the shapes of what they send backward on the backward pass, and vice versa.' + id: totrans-34 prefs: [] type: TYPE_NORMAL - en: 'We’ll define the specific `Operation`s we’ve used thus far—matrix multiplication and so on—later in this chapter. First we’ll define another class that inherits from `Operation` that we’ll use specifically for `Operation`s that involve parameters:' + id: totrans-35 prefs: [] type: TYPE_NORMAL - en: '[PRE1]' + id: totrans-36 prefs: [] type: TYPE_PRE + zh: '[PRE1]' - en: Similar to the base `Operation`, an individual `ParamOperation` would have to define the `_param_grad` function in addition to the `_output` and `_input_grad` functions. 
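The `[PRE0]` and `[PRE1]` blocks hold the book’s actual definitions. As a rough sketch of the interface described above (the method names follow the text; the exact bodies are an assumption here), the two base classes might look like this:

```python
from numpy import ndarray


class Operation:
    """One node in the computational graph: saves its input on the forward
    pass and enforces the shape rules on the backward pass."""

    def forward(self, input_: ndarray) -> ndarray:
        self.input_ = input_              # keep the input around for the backward pass
        self.output = self._output()
        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        assert self.output.shape == output_grad.shape       # output gradient matches output
        self.input_grad = self._input_grad(output_grad)
        assert self.input_.shape == self.input_grad.shape   # input gradient matches input
        return self.input_grad

    def _output(self) -> ndarray:
        raise NotImplementedError

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError


class ParamOperation(Operation):
    """An Operation whose output also depends on a parameter ndarray."""

    def __init__(self, param: ndarray):
        super().__init__()
        self.param = param

    def backward(self, output_grad: ndarray) -> ndarray:
        input_grad = super().backward(output_grad)
        self.param_grad = self._param_grad(output_grad)
        assert self.param.shape == self.param_grad.shape    # parameter gradient matches parameter
        return input_grad

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError
```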
+ id: totrans-37 prefs: [] type: TYPE_NORMAL - en: 'We have now formalized the neural network building blocks we’ve been using in our models so far. We could skip ahead and define neural networks directly in terms of these `Operation`s, but there is an intermediate class we’ve been dancing around for a chapter and a half that we’ll define first: the `Layer`.' + id: totrans-38 prefs: [] type: TYPE_NORMAL - en: 'The Building Blocks of Neural Networks: Layers' + id: totrans-39 prefs: - PREF_H1 type: TYPE_NORMAL @@ -248,6 +290,7 @@ numbering—also has an important name: it is called a *hidden* layer, since it is the only layer whose values we don’t typically see explicitly during the course of training.' + id: totrans-40 prefs: [] type: TYPE_NORMAL - en: The output layer is an important exception to this definition of layers, in @@ -257,28 +300,34 @@ functions typically “squash down” their input to some subset of that range relevant to the particular problem we’re trying to solve (for example, the `sigmoid` function squashes down its input to between 0 and 1). + id: totrans-41 prefs: [] type: TYPE_NORMAL - en: Diagrams + id: totrans-42 prefs: - PREF_H2 type: TYPE_NORMAL - en: To make the connection explicit, [Figure 3-3](#fig_03-03) shows the diagram of the neural network from the prior chapter with the individual operations grouped into layers. + id: totrans-43 prefs: [] type: TYPE_NORMAL - en: '![Neural net diagram](assets/dlfs_0303.png)' + id: totrans-44 prefs: [] type: TYPE_IMG - en: Figure 3-3\. The neural network from the prior chapter with the operations grouped into layers + id: totrans-45 prefs: - PREF_H6 type: TYPE_NORMAL - en: You can see that the input represents an “input” layer, the next three operations (ending with the `sigmoid` function) represent the next layer, and the last two operations represent the last layer. + id: totrans-46 prefs: [] type: TYPE_NORMAL - en: 'This is, of course, rather cumbersome. And that’s the point: representing neural @@ -286,16 +335,20 @@ networks work and how to train them, is too “low level” for anything more complicated than a two-layer neural network. That’s why the more common way to represent neural networks is in terms of layers, as shown in [Figure 3-4](#fig_03-04).' + id: totrans-47 prefs: [] type: TYPE_NORMAL - en: '![Neural net diagram](assets/dlfs_0304.png)' + id: totrans-48 prefs: [] type: TYPE_IMG - en: Figure 3-4\. The neural network from the prior chapter in terms of layers + id: totrans-49 prefs: - PREF_H6 type: TYPE_NORMAL - en: Connection to the brain + id: totrans-50 prefs: - PREF_H3 type: TYPE_NORMAL @@ -305,6 +358,7 @@ each observation in the layer’s output*. The neural network from the prior example can thus be thought of as having 13 neurons in the input layer, then 13 neurons (again) in the hidden layer, and one neuron in the output layer.' + id: totrans-51 prefs: [] type: TYPE_NORMAL - en: 'Neurons in the brain have the property that they can receive inputs from many @@ -315,127 +369,162 @@ via a nonlinear function. 
Thus, this nonlinear function is called the *activation function*, and the values that come out of it are called the *activations* for that layer.^([1](ch03.html#idm45732624417528))' + id: totrans-52 prefs: [] type: TYPE_NORMAL - en: 'Now that we’ve defined layers, we can state the more conventional definition of deep learning: *deep learning models are neural networks with more than one hidden layer.*' + id: totrans-53 prefs: [] type: TYPE_NORMAL - en: We can see that this is equivalent to the earlier definition that was purely in terms of `Operation`s, since a layer is just a series of `Operation`s with a nonlinear operation at the end. + id: totrans-54 prefs: [] type: TYPE_NORMAL - en: Now that we’ve defined a base class for our `Operation`s, let’s show how it can serve as the fundamental building block of the models we saw in the prior chapter. + id: totrans-55 prefs: [] type: TYPE_NORMAL - en: Building Blocks on Building Blocks + id: totrans-56 prefs: - PREF_H1 type: TYPE_NORMAL - en: 'What specific `Operation`s do we need to implement for the models in the prior chapter to work? Based on our experience of implementing that neural network step by step, we know there are three kinds:' + id: totrans-57 prefs: [] type: TYPE_NORMAL - en: The matrix multiplication of the input with the matrix of parameters + id: totrans-58 prefs: - PREF_UL type: TYPE_NORMAL - en: The addition of a bias term + id: totrans-59 prefs: - PREF_UL type: TYPE_NORMAL - en: The `sigmoid` activation function + id: totrans-60 prefs: - PREF_UL type: TYPE_NORMAL - en: 'Let’s start with the `WeightMultiply` `Operation`:' + id: totrans-61 prefs: [] type: TYPE_NORMAL - en: '[PRE2]' + id: totrans-62 prefs: [] type: TYPE_PRE + zh: '[PRE2]' - en: Here we simply code up the matrix multiplication on the forward pass, as well as the rules for “sending gradients backward” to both the inputs and the parameters on the backward pass (using the rules for doing so that we reasoned through at the end of [Chapter 1](ch01.html#foundations)). As you’ll see shortly, we can now use this as a *building block* that we can simply plug into our `Layer`s. + id: totrans-63 prefs: [] type: TYPE_NORMAL - en: 'Next up is the addition operation, which we’ll call `BiasAdd`:' + id: totrans-64 prefs: [] type: TYPE_NORMAL - en: '[PRE3]' + id: totrans-65 prefs: [] type: TYPE_PRE + zh: '[PRE3]' - en: 'Finally, let’s do `sigmoid`:' + id: totrans-66 prefs: [] type: TYPE_NORMAL - en: '[PRE4]' + id: totrans-67 prefs: [] type: TYPE_PRE + zh: '[PRE4]' - en: This simply implements the math described in the previous chapter. + id: totrans-68 prefs: [] type: TYPE_NORMAL - en: Note + id: totrans-69 prefs: - PREF_H6 type: TYPE_NORMAL - en: 'For both `sigmoid` and the `ParamOperation`, the step during the backward pass where we compute:' + id: totrans-70 prefs: [] type: TYPE_NORMAL - en: '[PRE5]' + id: totrans-71 prefs: [] type: TYPE_PRE + zh: '[PRE5]' - en: 'is the step where we are applying the chain rule, and the corresponding rule for `WeightMultiply`:' + id: totrans-72 prefs: [] type: TYPE_NORMAL - en: '[PRE6]' + id: totrans-73 prefs: [] type: TYPE_PRE + zh: '[PRE6]' - en: is, as I argued in [Chapter 1](ch01.html#foundations), the analogue of the chain rule when the function in question is a matrix multiplication. + id: totrans-74 prefs: [] type: TYPE_NORMAL - en: Now that we’ve defined these `Operation`s precisely, we can use *them* as building blocks to define a `Layer`. 
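The book’s implementations are in the `[PRE2]` to `[PRE4]` blocks. The sketch below follows the gradient rules just described; it assumes the `Operation` and `ParamOperation` sketch from earlier is in scope, and the bodies are an approximation rather than the book’s exact code:

```python
import numpy as np
from numpy import ndarray


class WeightMultiply(ParamOperation):
    """Matrix multiplication of the input with the weight matrix."""

    def __init__(self, W: ndarray):
        super().__init__(W)

    def _output(self) -> ndarray:
        return np.dot(self.input_, self.param)

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return np.dot(output_grad, self.param.T)      # dL/dX = dL/dout . W^T

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        return np.dot(self.input_.T, output_grad)     # dL/dW = X^T . dL/dout


class BiasAdd(ParamOperation):
    """Adds a bias row vector to every observation in the batch."""

    def __init__(self, B: ndarray):
        assert B.shape[0] == 1
        super().__init__(B)

    def _output(self) -> ndarray:
        return self.input_ + self.param               # broadcast over the batch

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad * np.ones_like(self.input_)

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        return np.sum(output_grad, axis=0, keepdims=True)   # sum over the batch dimension


class Sigmoid(Operation):
    """Element-wise sigmoid activation."""

    def _output(self) -> ndarray:
        return 1.0 / (1.0 + np.exp(-self.input_))

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        # sigma'(x) = sigma(x) * (1 - sigma(x)), applied element-wise
        return self.output * (1.0 - self.output) * output_grad
```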
+ id: totrans-75 prefs: [] type: TYPE_NORMAL - en: The Layer Blueprint + id: totrans-76 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'Because of the way we’ve written the `Operation`s, writing the `Layer` class is easy:' + id: totrans-77 prefs: [] type: TYPE_NORMAL - en: 'The `forward` and `backward` methods simply involve sending the input successively forward through a series of `Operation`s—exactly as we’ve been doing in the diagrams all along! This is the most important fact about the working of `Layer`s; the rest of the code is a wrapper around this and mostly involves bookkeeping:' + id: totrans-78 prefs: - PREF_UL type: TYPE_NORMAL - en: Defining the correct series of `Operation`s in the `_setup_layer` function and initializing and storing the parameters in these `Operation`s (which will also take place in the `_setup_layer` function) + id: totrans-79 prefs: - PREF_IND - PREF_UL type: TYPE_NORMAL - en: Storing the correct values in `self.input_` and `self.output` on the `forward` method + id: totrans-80 prefs: - PREF_IND - PREF_UL type: TYPE_NORMAL - en: Performing the correct assertion checking in the `backward` method + id: totrans-81 prefs: - PREF_IND - PREF_UL @@ -443,27 +532,34 @@ - en: Finally, the `_params` and `_param_grads` functions simply extract the parameters and their gradients (with respect to the loss) from the `ParamOperation`s within the layer. + id: totrans-82 prefs: - PREF_UL type: TYPE_NORMAL - en: 'Here’s what all that looks like:' + id: totrans-83 prefs: [] type: TYPE_NORMAL - en: '[PRE7]' + id: totrans-84 prefs: [] type: TYPE_PRE + zh: '[PRE7]' - en: Just as we moved from an abstract definition of an `Operation` to the implementation of specific `Operation`s needed for the neural network from [Chapter 2](ch02.html#fundamentals), let’s now implement the `Layer` from that network as well. + id: totrans-85 prefs: [] type: TYPE_NORMAL - en: The Dense Layer + id: totrans-86 prefs: - PREF_H2 type: TYPE_NORMAL - en: We called the `Operation`s we’ve been dealing with `WeightMultiply`, `BiasAdd`, and so on. What should we call the layer we’ve been using so far? A `LinearNonLinear` layer? + id: totrans-87 prefs: [] type: TYPE_NORMAL - en: 'A defining characteristic of this layer is that *each output neuron is a function @@ -476,20 +572,25 @@ Thus these layers are often called *fully connected* layers; recently, in the popular `Keras` library, they are also often called `Dense` layers, a more concise term that gets across the same idea.' + id: totrans-88 prefs: [] type: TYPE_NORMAL - en: Now that we know what to call it and why, let’s define the `Dense` layer in terms of the operations we’ve already defined—as you’ll see, because of how we defined our `Layer` base class, all we need to do is to put the `Operation`s defined in the previous section in as a list in the `_setup_layer` function. + id: totrans-89 prefs: [] type: TYPE_NORMAL - en: '[PRE8]' + id: totrans-90 prefs: [] type: TYPE_PRE + zh: '[PRE8]' - en: Note that we’ll make the default activation a `Linear` activation, which really means we apply no activation, and simply apply the identity function to the output of the layer. + id: totrans-91 prefs: [] type: TYPE_NORMAL - en: What building blocks should we now add on top of `Operation` and `Layer`? To @@ -497,9 +598,11 @@ just as `Layer`s wrapped around `Operation`s. It isn’t obvious what other classes will be needed, so we’ll just dive in and build `NeuralNetwork` and figure out the other classes we’ll need as we go. 
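The real `Layer` and `Dense` code is in `[PRE7]` and `[PRE8]`. Below is a compressed sketch of the blueprint just described; it assumes the `Operation`, `ParamOperation`, `WeightMultiply`, `BiasAdd`, and `Sigmoid` sketches from earlier are in scope, and the plain random initialization and the `Linear` identity activation are assumptions consistent with the text:

```python
import numpy as np
from numpy import ndarray


class Layer:
    """A series of Operations; `neurons` is the width of the layer's output."""

    def __init__(self, neurons: int):
        self.neurons = neurons
        self.first = True            # build the Operations lazily, on the first batch
        self.params: list = []
        self.param_grads: list = []
        self.operations: list = []

    def _setup_layer(self, input_: ndarray) -> None:
        raise NotImplementedError

    def forward(self, input_: ndarray) -> ndarray:
        if self.first:
            self._setup_layer(input_)
            self.first = False
        self.input_ = input_
        for operation in self.operations:            # send the input forward, Operation by Operation
            input_ = operation.forward(input_)
        self.output = input_
        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        assert self.output.shape == output_grad.shape
        for operation in reversed(self.operations):  # send the gradient backward
            output_grad = operation.backward(output_grad)
        self._param_grads()
        return output_grad

    def _param_grads(self) -> None:
        self.param_grads = [op.param_grad for op in self.operations
                            if isinstance(op, ParamOperation)]

    def _params(self) -> None:
        self.params = [op.param for op in self.operations
                       if isinstance(op, ParamOperation)]


class Linear(Operation):
    """Identity activation: passes values through unchanged."""

    def _output(self) -> ndarray:
        return self.input_

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad


class Dense(Layer):
    """Fully connected layer: weight multiply, bias add, then an activation."""

    def __init__(self, neurons: int, activation=None):
        super().__init__(neurons)
        self.activation = activation if activation is not None else Linear()

    def _setup_layer(self, input_: ndarray) -> None:
        W = np.random.randn(input_.shape[1], self.neurons)
        B = np.random.randn(1, self.neurons)
        self.params = [W, B]
        self.operations = [WeightMultiply(W), BiasAdd(B), self.activation]
```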
+ id: totrans-92 prefs: [] type: TYPE_NORMAL - en: The NeuralNetwork Class, and Maybe Others + id: totrans-93 prefs: - PREF_H1 type: TYPE_NORMAL @@ -508,155 +611,216 @@ of data representing “observations” (`X`) and “correct answers” (`y`) and learn the relationship between `X` and `y`, which means learning a function that can transform `X` into predictions `p` that are very close to `y`.' + id: totrans-94 prefs: [] type: TYPE_NORMAL - en: 'How exactly will this learning take place, given the `Layer` and `Operation` classes just defined? Recalling how the model from the last chapter worked, we’ll implement the following:' + id: totrans-95 prefs: [] type: TYPE_NORMAL - en: The neural network should take `X` and pass it successively forward through each `Layer` (which is really a convenient wrapper around feeding it through many `Operation`s), at which point the result will represent the `prediction`. + id: totrans-96 prefs: - PREF_OL type: TYPE_NORMAL + zh: 神经网络应该接受`X`并将其逐步通过每个`Layer`(实际上是一个方便的包装器,用于通过许多`Operation`进行馈送),此时结果将代表`prediction`。 - en: Next, `prediction` should be compared with the value `y` to calculate the loss and generate the “loss gradient,” which is the partial derivative of the loss with respect to each element in the last layer in the network (namely, the one that generated the `prediction`). + id: totrans-97 prefs: - PREF_OL type: TYPE_NORMAL + zh: 接下来,应该将`prediction`与值`y`进行比较,计算损失并生成“损失梯度”,这是与网络中最后一个层(即生成`prediction`的层)中的每个元素相关的损失的偏导数。 - en: Finally, we’ll send this loss gradient successively backward through each layer, along the way computing the “parameter gradients”—the partial derivative of the loss with respect to each of the parameters—and storing them in the corresponding `Operation`s. + id: totrans-98 prefs: - PREF_OL type: TYPE_NORMAL + zh: 最后,我们将通过每个层将这个损失梯度逐步向后发送,同时计算“参数梯度”——损失对每个参数的偏导数,并将它们存储在相应的`Operation`中。 - en: Diagram + id: totrans-99 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 图 - en: '[Figure 3-5](#backpropagation_now_in_terms) captures this description of a neural network in terms of `Layer`s.' + id: totrans-100 prefs: [] type: TYPE_NORMAL + zh: '[图3-5](#backpropagation_now_in_terms)以`Layer`的术语捕捉了神经网络的描述。' - en: '![Neural net diagram](assets/dlfs_0305.png)' + id: totrans-101 prefs: [] type: TYPE_IMG + zh: '![神经网络图](assets/dlfs_0305.png)' - en: Figure 3-5\. Backpropagation, now in terms of Layers instead of Operations + id: totrans-102 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图3-5。反向传播,现在以Layer而不是Operation的术语 - en: Code + id: totrans-103 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 代码 - en: 'How should we implement this? First, we’ll want our neural network to ultimately deal with `Layer`s the same way our `Layer`s dealt with `Operation`s. For example, we want the `forward` method to receive `X` as input and simply do something like:' + id: totrans-104 prefs: [] type: TYPE_NORMAL + zh: 我们应该如何实现这一点?首先,我们希望我们的神经网络最终处理`Layer`的方式与我们的`Layer`处理`Operation`的方式相同。例如,我们希望`forward`方法接收`X`作为输入,然后简单地执行类似以下的操作: - en: '[PRE9]' + id: totrans-105 prefs: [] type: TYPE_PRE + zh: '[PRE9]' - en: 'Similarly, we’ll want our `backward` method to take in an argument—let’s initially call it `grad`—and do something like:' + id: totrans-106 prefs: [] type: TYPE_NORMAL + zh: 同样,我们希望我们的`backward`方法接收一个参数——我们最初称之为`grad`——然后执行类似以下的操作: - en: '[PRE10]' + id: totrans-107 prefs: [] type: TYPE_PRE + zh: '[PRE10]' - en: 'Where will `grad` come from? 
It has to come from the *loss*, a special function that takes in the `prediction` along with `y` and:' + id: totrans-108 prefs: [] type: TYPE_NORMAL + zh: '`grad`将从哪里来?它必须来自*损失*,一个特殊的函数,它接收`prediction`以及`y`,然后:' - en: Computes a single number representing the “penalty” for the network making that `prediction`. + id: totrans-109 prefs: - PREF_UL type: TYPE_NORMAL + zh: 计算代表网络进行该`prediction`的“惩罚”的单个数字。 - en: Sends backward a gradient for every element of the `prediction` with respect to the loss. This gradient is what the last `Layer` in the network will receive as the input to its `backward` function. + id: totrans-110 prefs: - PREF_UL type: TYPE_NORMAL + zh: 针对每个`prediction`中的元素,发送一个梯度与损失相关的反向梯度。这个梯度是网络中最后一个`Layer`将作为其`backward`函数输入接收的内容。 - en: In the example from the prior chapter, the loss function was the squared difference between the `prediction` and the target, and the gradient of the `prediction` with respect to the loss was computed accordingly. + id: totrans-111 prefs: [] type: TYPE_NORMAL + zh: 在前一章的示例中,损失函数是`prediction`和目标之间的平方差,相应地计算了`prediction`相对于损失的梯度。 - en: How should we implement this? It seems like this concept is important enough to deserve its own class. Furthermore, this class can be implemented similarly to the `Layer` class, except the `forward` method will produce an actual number (a `float`) as the loss, instead of an `ndarray` to be sent forward to the next `Layer`. Let’s formalize this. + id: totrans-112 prefs: [] type: TYPE_NORMAL + zh: 我们应该如何实现这一点?这个概念似乎很重要,值得拥有自己的类。此外,这个类可以类似于`Layer`类实现,只是`forward`方法将产生一个实际数字(一个`float`)作为损失,而不是一个`ndarray`被发送到下一个`Layer`。让我们正式化这一点。 - en: Loss Class + id: totrans-113 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 损失类 - en: 'The `Loss` base class will be similar to `Layer`—the `forward` and `backward` methods will check that the shapes of the appropriate `ndarray`s are identical and define two methods, `_output` and `_input_grad`, that any subclass of `Loss` will have to define:' + id: totrans-114 prefs: [] type: TYPE_NORMAL + zh: '`Loss`基类将类似于`Layer`——`forward`和`backward`方法将检查适当的`ndarray`的形状是否相同,并定义两个方法,`_output`和`_input_grad`,任何`Loss`子类都必须定义:' - en: '[PRE11]' + id: totrans-115 prefs: [] type: TYPE_PRE + zh: '[PRE11]' - en: 'As in the `Operation` class, we check that the gradient that the loss sends backward is the same shape as the `prediction` received as input from the last layer of the network:' + id: totrans-116 prefs: [] type: TYPE_NORMAL + zh: 与`Operation`类一样,我们检查损失向后发送的梯度与从网络的最后一层接收的`prediction`的形状是否相同: - en: '[PRE12]' + id: totrans-117 prefs: [] type: TYPE_PRE + zh: '[PRE12]' - en: Here, we simply code the forward and backward rules of the mean squared error loss formula. + id: totrans-118 prefs: [] type: TYPE_NORMAL + zh: 在这里,我们简单地编写均方误差损失公式的前向和反向规则。 - en: This is the last key building block we need to build deep learning from scratch. Let’s review how these pieces fit together and then proceed with building a model! + id: totrans-119 prefs: [] type: TYPE_NORMAL + zh: 这是我们需要从头开始构建深度学习的最后一个关键构建块。让我们回顾一下这些部分如何组合在一起,然后继续构建模型! - en: Deep Learning from Scratch + id: totrans-120 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 从零开始的深度学习 - en: 'We ultimately want to build a `NeuralNetwork` class, using [Figure 3-5](#backpropagation_now_in_terms) as a guide, that we can use to define and train deep learning models. 
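For completeness, here is a compressed sketch of the `Loss` and `MeanSquaredError` pair just described (the book’s versions are in `[PRE11]` and `[PRE12]`); the per-batch normalization of the loss and of its gradient below is a standard choice and an assumption about the book’s exact formula:

```python
import numpy as np
from numpy import ndarray


class Loss:
    """Turns a prediction and a target into a single float penalty plus the
    gradient of that penalty with respect to every element of the prediction."""

    def forward(self, prediction: ndarray, target: ndarray) -> float:
        assert prediction.shape == target.shape
        self.prediction = prediction
        self.target = target
        return self._output()

    def backward(self) -> ndarray:
        self.input_grad = self._input_grad()
        assert self.prediction.shape == self.input_grad.shape   # gradient matches prediction
        return self.input_grad

    def _output(self) -> float:
        raise NotImplementedError

    def _input_grad(self) -> ndarray:
        raise NotImplementedError


class MeanSquaredError(Loss):

    def _output(self) -> float:
        return float(np.sum((self.prediction - self.target) ** 2)
                     / self.prediction.shape[0])

    def _input_grad(self) -> ndarray:
        return 2.0 * (self.prediction - self.target) / self.prediction.shape[0]
```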
Before we dive in and start coding, let’s describe precisely what such a class would be and how it would interact with the `Operation`, `Layer`, and `Loss` classes we just defined:' + id: totrans-121 prefs: [] type: TYPE_NORMAL + zh: 我们最终希望构建一个`NeuralNetwork`类,使用[图3-5](#backpropagation_now_in_terms)作为指南,我们可以用来定义和训练深度学习模型。在我们深入编码之前,让我们准确描述一下这样一个类会是什么样的,以及它将如何与我们刚刚定义的`Operation`、`Layer`和`Loss`类进行交互: - en: A `NeuralNetwork` will have a list of `Layer`s as an attribute. The `Layer`s would be as defined previously, with `forward` and `backward` methods. These methods take in `ndarray` objects and return `ndarray` objects. + id: totrans-122 prefs: - PREF_OL type: TYPE_NORMAL + zh: '`NeuralNetwork`将具有`Layer`列表作为属性。`Layer`将如先前定义的那样,具有`forward`和`backward`方法。这些方法接受`ndarray`对象并返回`ndarray`对象。' - en: Each `Layer` will have a list of `Operation`s saved in the `operations` attribute of the layer during the `_setup_layer` function. + id: totrans-123 prefs: - PREF_OL type: TYPE_NORMAL + zh: 每个`Layer`在`_setup_layer`函数期间的`operations`属性中保存了一个`Operation`列表。 - en: These `Operation`s, just like the `Layer` itself, have `forward` and `backward` methods that take in `ndarray` objects as arguments and return `ndarray` objects as outputs. + id: totrans-124 prefs: - PREF_OL type: TYPE_NORMAL + zh: 这些`Operation`,就像`Layer`本身一样,有`forward`和`backward`方法,接受`ndarray`对象作为参数并返回`ndarray`对象作为输出。 - en: In each operation, the shape of the `output_grad` received in the `backward` method must be the same as the shape of the `output` attribute of the `Layer`. The same is true for the shapes of the `input_grad` passed backward during the `backward` method and the `input_` attribute. + id: totrans-125 prefs: - PREF_OL type: TYPE_NORMAL @@ -665,6 +829,7 @@ shapes apply to `Layer`s and their `forward` and `backward` methods as well—they take in `ndarray` objects and output `ndarray` objects, and the shapes of the `input` and `output` attributes and their corresponding gradients must match. + id: totrans-126 prefs: - PREF_OL type: TYPE_NORMAL @@ -672,87 +837,109 @@ the last operation from the `NeuralNetwork` and the target, check that their shapes are the same, and calculate both a loss value (a number) and an `ndarray` `loss_grad` that will be fed into the output layer, starting backpropagation. + id: totrans-127 prefs: - PREF_OL type: TYPE_NORMAL - en: Implementing Batch Training + id: totrans-128 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'We’ve covered several times the high-level steps for training a model one batch at a time. They are important and worth repeating:' + id: totrans-129 prefs: [] type: TYPE_NORMAL - en: Feed input through the model function (the “forward pass”) to get a prediction. + id: totrans-130 prefs: - PREF_OL type: TYPE_NORMAL - en: Calculate the number representing the loss. + id: totrans-131 prefs: - PREF_OL type: TYPE_NORMAL - en: Calculate the gradient of the loss with respect to the parameters, using the chain rule and the quantities computed during the forward pass. + id: totrans-132 prefs: - PREF_OL type: TYPE_NORMAL - en: Update the parameters using these gradients. + id: totrans-133 prefs: - PREF_OL type: TYPE_NORMAL - en: We would then feed a new batch of data through and repeat these steps. + id: totrans-134 prefs: [] type: TYPE_NORMAL - en: 'Translating these steps into the `NeuralNetwork` framework just described is straightforward:' + id: totrans-135 prefs: [] type: TYPE_NORMAL - en: Receive `X` and `y` as inputs, both `ndarray`s. 
+ id: totrans-136 prefs: - PREF_OL type: TYPE_NORMAL - en: Feed `X` successively forward through each `Layer`. + id: totrans-137 prefs: - PREF_OL type: TYPE_NORMAL - en: Use the `Loss` to produce loss value and the loss gradient to be sent backward. + id: totrans-138 prefs: - PREF_OL type: TYPE_NORMAL - en: Use the loss gradient as input to the `backward` method for the network, which will calculate the `param_grads` for each layer in the network. + id: totrans-139 prefs: - PREF_OL type: TYPE_NORMAL - en: Call the `update_params` function on each layer, which will use the overall learning rate for the `NeuralNetwork` as well as the newly calculated `param_grads`. + id: totrans-140 prefs: - PREF_OL type: TYPE_NORMAL - en: We finally have our full definition of a neural network that can accommodate batch training. Now let’s code it up. + id: totrans-141 prefs: [] type: TYPE_NORMAL - en: 'NeuralNetwork: Code' + id: totrans-142 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'Coding all of this up is pretty straightforward:' + id: totrans-143 prefs: [] type: TYPE_NORMAL - en: '[PRE13]' + id: totrans-144 prefs: [] type: TYPE_PRE + zh: '[PRE13]' - en: With this `NeuralNetwork` class, we can implement the models from the prior chapter in a more modular, flexible way and define other models to represent complex nonlinear relationships between input and output. For example, here’s how to easily instantiate the two models we covered in the last chapter—the linear regression and the neural network:^([3](ch03.html#idm45732622822120)) + id: totrans-145 prefs: [] type: TYPE_NORMAL - en: '[PRE14]' + id: totrans-146 prefs: [] type: TYPE_PRE + zh: '[PRE14]' - en: We’re basically done; now we just feed data repeatedly through the network in order for it to learn. To make this process cleaner and easier to extend to the more complicated deep learning scenarios we’ll see in the following chapter, however, @@ -760,9 +947,11 @@ as an additional class that carries out the “learning,” or the actual updating of the `NeuralNetwork` parameters given the gradients computed on the backward pass. Let’s quickly define these two classes. + id: totrans-147 prefs: [] type: TYPE_NORMAL - en: Trainer and Optimizer + id: totrans-148 prefs: - PREF_H1 type: TYPE_NORMAL @@ -770,13 +959,17 @@ to train the network in [Chapter 2](ch02.html#fundamentals). There, we used the following code to implement the four steps described earlier for training the model:' + id: totrans-149 prefs: [] type: TYPE_NORMAL - en: '[PRE15]' + id: totrans-150 prefs: [] type: TYPE_PRE + zh: '[PRE15]' - en: This code was within a `for` loop that repeatedly fed data through the function defining and updated our network. + id: totrans-151 prefs: [] type: TYPE_NORMAL - en: 'With the classes we have now, we’ll ultimately do this inside a `fit` function @@ -785,22 +978,28 @@ Notebook](https://oreil.ly/2MV0aZI) on the book’s GitHub page.) The main difference is that inside this new function, the first two lines from the preceding code block will be replaced with this line:' + id: totrans-152 prefs: [] type: TYPE_NORMAL - en: '[PRE16]' + id: totrans-153 prefs: [] type: TYPE_PRE + zh: '[PRE16]' - en: Updating the parameters, which happens in the following two lines, will take place in a separate `Optimizer` class. And finally, the `for` loop that previously wrapped around all of this will take place in the `Trainer` class that wraps around the `NeuralNetwork` and the `Optimizer`. 
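For reference, a bare-bones sketch of the `NeuralNetwork` class described above follows (the book’s full version is in `[PRE13]`); the `train_batch` helper and the generator-based `params`/`param_grads` are assumptions, and the earlier `Layer` and `Loss` sketches are assumed to be in scope:

```python
from numpy import ndarray


class NeuralNetwork:
    """A list of Layers plus a Loss."""

    def __init__(self, layers: list, loss):
        self.layers = layers
        self.loss = loss

    def forward(self, X: ndarray) -> ndarray:
        for layer in self.layers:              # pass the data forward through each Layer
            X = layer.forward(X)
        return X

    def backward(self, loss_grad: ndarray) -> None:
        for layer in reversed(self.layers):    # pass the loss gradient backward
            loss_grad = layer.backward(loss_grad)

    def train_batch(self, X: ndarray, y: ndarray) -> float:
        predictions = self.forward(X)
        loss_value = self.loss.forward(predictions, y)
        self.backward(self.loss.backward())
        return loss_value

    def params(self):
        for layer in self.layers:
            layer._params()
            yield from layer.params

    def param_grads(self):
        for layer in self.layers:
            layer._param_grads()
            yield from layer.param_grads
```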
+ id: totrans-154 prefs: [] type: TYPE_NORMAL - en: Next, let’s discuss why we need an `Optimizer` class and what it should look like. + id: totrans-155 prefs: [] type: TYPE_NORMAL - en: Optimizer + id: totrans-156 prefs: - PREF_H2 type: TYPE_NORMAL @@ -811,9 +1010,11 @@ gradient updates from the specific batch that was fed in at that iteration. Creating a separate `Optimizer` class will give us the flexibility to swap in one update rule for another, something that we’ll explore in more detail in the next chapter. + id: totrans-157 prefs: [] type: TYPE_NORMAL - en: Description and code + id: totrans-158 prefs: - PREF_H3 type: TYPE_NORMAL @@ -821,31 +1022,41 @@ `step` function is called, will update the parameters of the network based on their current values, their gradients, and any other information stored in the `Optimizer`:' + id: totrans-159 prefs: [] type: TYPE_NORMAL - en: '[PRE17]' + id: totrans-160 prefs: [] type: TYPE_PRE + zh: '[PRE17]' - en: 'And here’s how this looks with the straightforward update rule we’ve seen so far, known as *stochastic gradient descent*:' + id: totrans-161 prefs: [] type: TYPE_NORMAL - en: '[PRE18]' + id: totrans-162 prefs: [] type: TYPE_PRE + zh: '[PRE18]' - en: Note + id: totrans-163 prefs: - PREF_H6 type: TYPE_NORMAL - en: Note that while our `NeuralNetwork` class does not have an `_update_params` method, we do rely on the `params()` and `param_grads()` methods to extract the correct `ndarray`s for optimization. + id: totrans-164 prefs: [] type: TYPE_NORMAL - en: That’s the basic `Optimizer` class; let’s cover the `Trainer` class next. + id: totrans-165 prefs: [] type: TYPE_NORMAL - en: Trainer + id: totrans-166 prefs: - PREF_H2 type: TYPE_NORMAL @@ -855,30 +1066,38 @@ we didn’t pass in a `NeuralNetwork` when initializing our `Optimizer`; instead, we’ll assign the `NeuralNetwork` to be an attribute of the `Optimizer` when we initialize the `Trainer` class shortly, with this line:' + id: totrans-167 prefs: [] type: TYPE_NORMAL - en: '[PRE19]' + id: totrans-168 prefs: [] type: TYPE_PRE + zh: '[PRE19]' - en: 'In the following subsection, I show a simplified but working version of the `Trainer` class that for now contains just the `fit` method. This method trains our model for a number of *epochs* and prints out the loss value after each set number of epochs. In each epoch, we:' + id: totrans-169 prefs: [] type: TYPE_NORMAL - en: Shuffle the data at the beginning of the epoch + id: totrans-170 prefs: - PREF_OL type: TYPE_NORMAL - en: Feed the data through the network in batches, updating the parameters after each batch has been fed through + id: totrans-171 prefs: - PREF_OL type: TYPE_NORMAL - en: The epoch ends when we have fed the entire training set through the `Trainer`. + id: totrans-172 prefs: [] type: TYPE_NORMAL - en: Trainer code + id: totrans-173 prefs: - PREF_H3 type: TYPE_NORMAL @@ -889,32 +1108,41 @@ epoch. We also include a `restart` argument in the `train` function: if `True` (default), it will reinitialize the model’s parameters to random values upon calling the `train` function:' + id: totrans-174 prefs: [] type: TYPE_NORMAL - en: '[PRE20]' + id: totrans-175 prefs: [] type: TYPE_PRE + zh: '[PRE20]' - en: 'In the full version of this function in the book’s [GitHub repository](https://oreil.ly/2MV0aZI), we also implement *early stopping*, which does the following:' + id: totrans-176 prefs: [] type: TYPE_NORMAL - en: It saves the loss value every `eval_every` epochs. 
+ id: totrans-177 prefs: - PREF_OL type: TYPE_NORMAL - en: It checks whether the validation loss is lower than the last time it was calculated. + id: totrans-178 prefs: - PREF_OL type: TYPE_NORMAL - en: If the validation loss is *not* lower, it uses the model from `eval_every` epochs ago. + id: totrans-179 prefs: - PREF_OL type: TYPE_NORMAL - en: Finally, we have everything we need to train these models! + id: totrans-180 prefs: [] type: TYPE_NORMAL - en: Putting Everything Together + id: totrans-181 prefs: - PREF_H1 type: TYPE_NORMAL @@ -922,78 +1150,110 @@ classes and the two models defined before—`linear_regression` and `neural_network`. We’ll set the learning rate to `0.01` and the maximum number of epochs to `50` and evaluate our models every `10` epochs:' + id: totrans-182 prefs: [] type: TYPE_NORMAL - en: '[PRE21]' + id: totrans-183 prefs: [] type: TYPE_PRE + zh: '[PRE21]' - en: '[PRE22]' + id: totrans-184 prefs: [] type: TYPE_PRE + zh: '[PRE22]' - en: 'Using the same model-scoring functions from [Chapter 2](ch02.html#fundamentals), and wrapping them inside an `eval_regression_model` function, gives us these results:' + id: totrans-185 prefs: [] type: TYPE_NORMAL - en: '[PRE23]' + id: totrans-186 prefs: [] type: TYPE_PRE + zh: '[PRE23]' - en: '[PRE24]' + id: totrans-187 prefs: [] type: TYPE_PRE + zh: '[PRE24]' - en: These are similar to the results of the linear regression we ran in the last chapter, confirming that our framework is working. + id: totrans-188 prefs: [] type: TYPE_NORMAL - en: 'Running the same code with the `neural_network` model with a hidden layer with 13 neurons, we get the following:' + id: totrans-189 prefs: [] type: TYPE_NORMAL - en: '[PRE25]' + id: totrans-190 prefs: [] type: TYPE_PRE + zh: '[PRE25]' - en: '[PRE26]' + id: totrans-191 prefs: [] type: TYPE_PRE + zh: '[PRE26]' - en: '[PRE27]' + id: totrans-192 prefs: [] type: TYPE_PRE + zh: '[PRE27]' - en: Again, these results are similar to what we saw in the prior chapter, and they’re significantly better than our straightforward linear regression. + id: totrans-193 prefs: [] type: TYPE_NORMAL - en: Our First Deep Learning Model (from Scratch) + id: totrans-194 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'Now that all of that setup is out of the way, defining our first deep learning model is trivial:' + id: totrans-195 prefs: [] type: TYPE_NORMAL - en: '[PRE28]' + id: totrans-196 prefs: [] type: TYPE_PRE + zh: '[PRE28]' - en: We won’t even try to be clever with this (yet). We’ll just add a hidden layer with the same dimensionality as the first layer, so that our network now has two hidden layers, each with 13 neurons. + id: totrans-197 prefs: [] type: TYPE_NORMAL - en: 'Training this using the same learning rate and evaluation schedule as the prior models yields the following result:' + id: totrans-198 prefs: [] type: TYPE_NORMAL - en: '[PRE29]' + id: totrans-199 prefs: [] type: TYPE_PRE + zh: '[PRE29]' - en: '[PRE30]' + id: totrans-200 prefs: [] type: TYPE_PRE + zh: '[PRE30]' - en: '[PRE31]' + id: totrans-201 prefs: [] type: TYPE_PRE + zh: '[PRE31]' - en: We finally worked up to doing deep learning from scratch—and indeed, on this real-world problem, without the use of any tricks (just a bit of learning rate tuning), our deep learning model does perform slightly better than a neural network with just one hidden layer. + id: totrans-202 prefs: [] type: TYPE_NORMAL - en: More importantly, we did so by building a framework that is easily extensible. 
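As a final illustration, the snippet below strings the earlier sketches together on synthetic data and applies the plain stochastic gradient descent update by hand, the same update that the `Optimizer` and `Trainer` classes factor out. Everything here (the synthetic data, the seed, the layer sizes, the 500 full-batch steps) is an illustrative assumption, not the housing experiment reported above:

```python
import numpy as np

# Assumes the NeuralNetwork, Dense, Sigmoid, Linear, and MeanSquaredError
# sketches from earlier are in scope; the data below is synthetic.
np.random.seed(42)

X = np.random.randn(200, 13)                      # 200 observations, 13 features
true_w = np.random.randn(13, 1)
y = np.dot(X, true_w) + 0.1 * np.random.randn(200, 1)

net = NeuralNetwork(
    layers=[Dense(13, activation=Sigmoid()),      # one hidden layer of 13 "learned features"
            Dense(1, activation=Linear())],       # linear output layer for regression
    loss=MeanSquaredError(),
)

learning_rate = 0.01
first_loss = None
for step in range(500):
    loss = net.train_batch(X, y)                  # forward pass, loss, backward pass
    if first_loss is None:
        first_loss = loss
    for param, grad in zip(net.params(), net.param_grads()):
        param -= learning_rate * grad             # in-place SGD update

print(first_loss > loss)                          # True: the loss has gone down
```

Swapping the hand-written update loop for an `SGD` optimizer and a `Trainer.fit` call, as described in this chapter, changes none of the gradient math; it only relocates the bookkeeping.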
@@ -1004,9 +1264,11 @@ activation functions into our existing layers and see if it decreases our error metrics; I encourage you to clone the book’s [GitHub repo](https://oreil.ly/deep-learning-github) and try this! + id: totrans-203 prefs: [] type: TYPE_NORMAL - en: Conclusion and Next Steps + id: totrans-204 prefs: - PREF_H1 type: TYPE_NORMAL @@ -1018,27 +1280,32 @@ into the `Optimizer` and `Trainer` classes. Finally, we’ll see Dropout, a new kind of `Operation` that has proven essential for increasing the training stability of deep learning models. Onward! + id: totrans-205 prefs: [] type: TYPE_NORMAL - en: ^([1](ch03.html#idm45732624417528-marker)) Among all activation functions, the `sigmoid` function, which maps inputs to between 0 and 1, most closely mimics the actual activation of neurons in the brain, but in general activation functions can be any monotonic, nonlinear function. + id: totrans-206 prefs: [] type: TYPE_NORMAL - en: '^([2](ch03.html#idm45732623512888-marker)) As we’ll see in [Chapter 5](ch05.html#convolution), this is not true of all layers: in *convolutional* layers, for example, each output feature is a combination of *only a small subset* of the input features.' + id: totrans-207 prefs: [] type: TYPE_NORMAL - en: ^([3](ch03.html#idm45732622822120-marker)) The learning rate of 0.01 isn’t special; we simply found it to be optimal in the course of experimenting while writing the prior chapter. + id: totrans-208 prefs: [] type: TYPE_NORMAL - en: ^([4](ch03.html#idm45732621371848-marker)) Even on this simple problem, changing the hyperparameters slightly can cause the deep learning model to fail to beat the two-layer neural network. Clone the [GitHub repo](https://oreil.ly/deep-learning-github) and try it yourself! + id: totrans-209 prefs: [] type: TYPE_NORMAL