diff --git a/totrans/prac-dl-cld_18.yaml b/totrans/prac-dl-cld_18.yaml index 7585728..8e27b44 100644 --- a/totrans/prac-dl-cld_18.yaml +++ b/totrans/prac-dl-cld_18.yaml @@ -1,11 +1,15 @@ - en: 'Chapter 17\. Building an Autonomous Car in Under an Hour: Reinforcement Learning with AWS DeepRacer' + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 第17章。在不到一个小时内构建自动驾驶汽车:使用AWS DeepRacer进行强化学习 - en: 'Contributed by guest author: Sunil Mallya' + id: totrans-1 prefs: [] type: TYPE_NORMAL + zh: 由客座作者Sunil Mallya撰写 - en: If you follow technology news, you probably have seen a resurgence in debates about when computers are going to take over the world. Although that’s a fun thought exercise, what’s triggered the resurgence in these debates? A large part of the @@ -15,15 +19,20 @@ at Defence of the Ancients (Dota) 2 in 2017\. The most astonishing thing about these successes is that the “bots” learned the games by playing against one another and reinforcing the strategies that they found to bring them success. + id: totrans-2 prefs: [] type: TYPE_NORMAL + zh: 如果你关注科技新闻,你可能已经看到了关于计算机何时将接管世界的辩论再次兴起。尽管这是一个有趣的思考练习,但是是什么引发了这些辩论的再次兴起呢?这种辩论再次兴起的很大一部分原因归功于计算机在决策任务中击败人类的消息——在国际象棋中获胜,在视频游戏中取得高分,如《Atari》(2013),在复杂的围棋比赛中击败人类(2016),最后,在2017年击败人类团队在《Defense + of the Ancients》(Dota)2中。这些成功最令人惊讶的事情是,“机器人”通过相互对抗并强化他们发现的成功策略来学习这些游戏。 - en: If we think more broadly on this concept, it’s no different than how humans teach their pets. To train a dog, every good behavior is reinforced by rewarding the dog with a treat and lots of hugs, and every undesired behavior is discouraged by asserting “bad doggie.” This concept of reinforcing good behaviors and discouraging bad ones essentially forms the crux of *reinforcement learning*. 
+ id: totrans-3 prefs: [] type: TYPE_NORMAL + zh: 如果我们更广泛地思考这个概念,这与人类教导他们的宠物没有什么不同。为了训练一只狗,每一种好行为都会通过奖励狗狗一块零食和许多拥抱来加强,而每一种不良行为都会通过断言“坏狗狗”来加以阻止。强化好行为和阻止不良行为的概念基本上构成了*强化学习*的核心。 - en: 'Computer games, or games in general, require a sequence of decisions to be made, so traditional supervised methods aren’t well suited, because they often focus on making a single decision (e.g., is this an image of a cat or a dog?). @@ -37,12 +46,16 @@ we focus on learning this paradigm of machine learning and applying it to a real-world problem: building a one-eighteenth-scale, self-driving autonomous car in less than an hour.' + id: totrans-4 prefs: [] type: TYPE_NORMAL + zh: 计算机游戏,或者说一般的游戏,需要做出一系列决策,因此传统的监督方法并不适用,因为它们通常专注于做出单一决策(例如,这是一张猫还是狗的图片?)。强化学习社区内部的一个笑话是我们整天都在玩视频游戏(剧透:这是真的!)。目前,强化学习正在被应用于各行各业,以优化股票交易,管理大型建筑和数据中心的供暖和制冷,进行实时广告竞价,优化视频流质量,甚至优化实验室中的化学反应。鉴于这些生产系统的例子,我们强烈建议在顺序决策制定和优化问题中使用强化学习。在本章中,我们专注于学习这种机器学习范式,并将其应用于一个真实世界问题:在不到一个小时内构建一个1/18比例的自动驾驶汽车。 - en: A Brief Introduction to Reinforcement Learning + id: totrans-5 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 强化学习简介 - en: Similar to deep learning, reinforcement learning has seen a renaissance in the past few years ever since previously held human video game records were broken. Reinforcement learning theory had its heyday in the 1990s, but didn’t break in @@ -55,8 +68,10 @@ learning and deep reinforcement learning interchangeably, but in almost all cases if not stated otherwise, when we refer to reinforcement learning, we are talking about deep reinforcement learning. + id: totrans-6 prefs: [] type: TYPE_NORMAL + zh: 与深度学习类似,强化学习在过去几年中经历了复兴,自从以前保持的人类视频游戏记录被打破以来。强化学习理论在上世纪90年代达到鼎盛时期,但由于计算要求和训练这些系统的困难,它没有进入大规模生产系统。传统上,强化学习被认为是计算密集型的;相比之下,神经网络是数据密集型的。但是深度神经网络的进步也使强化学习受益。神经网络现在被用来表示强化学习模型,从而诞生了深度强化学习。在本章中,我们将强化学习和深度强化学习这两个术语互换使用,但在几乎所有情况下,如果没有另有说明,当我们提到强化学习时,我们指的是深度强化学习。 - en: 'Despite recent advancements, the landscape for reinforcement learning has not been developer friendly. 
Interfaces for training deep learning models have progressively become simpler, but this hasn’t quite caught up in the reinforcement learning @@ -72,19 +87,28 @@ learning accessible to developers. It is powered by Amazon SageMaker reinforcement learning, which is a general-purpose reinforcement learning platform. And, let’s get real: who doesn’t like self-driving cars?' + id: totrans-7 prefs: [] type: TYPE_NORMAL + zh: 尽管最近有所进展,但强化学习的领域并不友好。用于训练深度学习模型的界面逐渐变得简单,但在强化学习社区中还没有跟上。强化学习的另一个具有挑战性的方面是显著的计算要求和模型收敛所需的时间(学习完成)——要创建一个收敛的模型实际上需要几天,甚至几周的时间。现在,假设我们有耐心、神经网络知识和金钱带宽,关于强化学习的教育资源是少之又少的。大多数资源都针对高级数据科学家,有时对开发人员来说难以触及。我们之前提到的那个1/18比例的自动驾驶汽车?那就是AWS的DeepRacer。AWS + DeepRacer背后最大的动机之一是让强化学习对开发人员更加可访问。它由亚马逊SageMaker强化学习提供支持,这是一个通用的强化学习平台。而且,让我们真实一点:谁不喜欢自动驾驶汽车呢? - en: '![The AWS DeepRacer one-eighteenth-scale autonomous car](../images/00305.jpeg)' + id: totrans-8 prefs: [] type: TYPE_IMG + zh: '![AWS DeepRacer的1/18比例自主汽车](../images/00305.jpeg) ' - en: Figure 17-1\. The AWS DeepRacer one-eighteenth-scale autonomous car + id: totrans-9 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-1。AWS DeepRacer的1/18比例自主汽车 - en: Why Learn Reinforcement Learning with an Autonomous Car? + id: totrans-10 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 为什么要通过自动驾驶汽车学习强化学习? - en: In recent years, self-driving technology has seen significant investments and success. DIY self-driving and radio-controlled (RC) car-racing communities have since become popular. This unparalleled enthusiasm in developers building scaled @@ -93,8 +117,10 @@ Although there existed other algorithms to build self-driving cars, like traditional computer vision or supervised learning (behavioral cloning), we believe reinforcement learning has an edge over these. 
+ id: totrans-11 prefs: [] type: TYPE_NORMAL + zh: 近年来,自动驾驶技术得到了重大投资和成功。DIY自动驾驶和无线电控制(RC)汽车竞速社区因此变得流行。开发人员在真实硬件上构建按比例缩小的自主汽车,并在真实场景中进行测试的热情空前,这促使我们使用“车辆”(字面上)来教育开发人员学习强化学习。尽管存在其他算法来构建自动驾驶汽车,如传统计算机视觉或监督学习(行为克隆),但我们认为强化学习比这些算法更具优势。 - en: '[Table 17-1](part0020.html#landscape_of_autonomous_self-driving_tec) summarizes some of the popular self-driving kits available for developers, and the technologies that enable them. One of the key benefits of reinforcement learning is that models @@ -106,65 +132,102 @@ first six months following the launch of DeepRacer in November 2018, close to 9,000 developers trained their models in the simulator and successfully tested them on a real-world track.' + id: totrans-12 prefs: [] type: TYPE_NORMAL + zh: '[表17-1](part0020.html#landscape_of_autonomous_self-driving_tec)总结了一些供开发人员使用的热门自动驾驶套件,以及支持它们的技术。强化学习的一个关键优势是模型可以在模拟器中进行专门训练。但强化学习系统也带来了一系列挑战,其中最大的挑战之一是从模拟到真实(sim2real)问题。在将完全在模拟中训练的模型部署到真实环境中时,总是存在挑战。DeepRacer通过一些简单而有效的解决方案来解决这个问题,我们稍后在本章中讨论。在2018年11月推出DeepRacer后的前六个月内,近9000名开发人员在模拟器中训练了他们的模型,并成功在真实赛道上进行了测试。' - en: Table 17-1\. 
Landscape of autonomous self-driving technology + id: totrans-13 prefs: [] type: TYPE_NORMAL + zh: 表17-1。自主自动驾驶技术的概况 - en: '| | **Hardware** | **Assembly** | **Technology** | **Cost** | |' + id: totrans-14 prefs: [] type: TYPE_TB + zh: '| | **硬件** | **组装** | **技术** | **成本** | |' - en: '| --- | --- | --- | --- | --- | --- |' + id: totrans-15 prefs: [] type: TYPE_TB + zh: '| --- | --- | --- | --- | --- | --- |' - en: '| AWS DeepRacer | Intel Atom with 100 GFLOPS GPU | Preassembled | Reinforcement learning | $399 | ![](../images/00263.jpeg) |' + id: totrans-16 prefs: [] type: TYPE_TB + zh: '| AWS DeepRacer | Intel Atom with 100 GFLOPS GPU | 预装 | 强化学习 | $399 | ![](../images/00263.jpeg) + |' - en: '| OpenMV | OpenMV H7 | DIY (two hours) | Traditional computer vision | $90 | ![](../images/00203.jpeg) |' + id: totrans-17 prefs: [] type: TYPE_TB + zh: '| OpenMV | OpenMV H7 | DIY (两小时) | 传统计算机视觉 | $90 | ![](../images/00203.jpeg) + |' - en: '| Duckietown | Raspberry Pi | Preassembled | Reinforcement learning, behavioral cloning | $279–$350 | ![](../images/00079.jpeg) |' + id: totrans-18 prefs: [] type: TYPE_TB + zh: '| Duckietown | Raspberry Pi | 预装 | 强化学习,行为克隆 | $279–$350 | ![](../images/00079.jpeg) + |' - en: '| DonkeyCar | Raspberry Pi | DIY (two to three hours) | Behavioral cloning | $250 | ![](../images/00266.jpeg) |' + id: totrans-19 prefs: [] type: TYPE_TB + zh: '| DonkeyCar | Raspberry Pi | DIY (两到三小时) | 行为克隆 | $250 | ![](../images/00266.jpeg) + |' - en: '| NVIDIA JetRacer | Jetson Nano | DIY (three to five hours) | Supervised learning | ~$400 | ![](../images/00225.jpeg) |' + id: totrans-20 prefs: [] type: TYPE_TB + zh: '| NVIDIA JetRacer | Jetson Nano | DIY (三到五小时) | 监督学习 | 约$400 | ![](../images/00225.jpeg) + |' - en: Practical Deep Reinforcement Learning with DeepRacer + id: totrans-21 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 使用DeepRacer进行实用的深度强化学习 - en: 'Now for the most exciting part of this chapter: building our first reinforcement learning-based autonomous 
racing model. Before we embark on the journey, let’s build a quick cheat sheet of terms that will help you to become familiar with important reinforcement learning terminology:' + id: totrans-22 prefs: [] type: TYPE_NORMAL + zh: 现在是本章最令人兴奋的部分:构建我们的第一个基于强化学习的自主赛车模型。在我们踏上这个旅程之前,让我们建立一个快速的术语备忘单,帮助您熟悉重要的强化学习术语: - en: Goal + id: totrans-23 prefs: [] type: TYPE_NORMAL + zh: 目标 - en: Finishing a lap around the track without going off track. + id: totrans-24 prefs: [] type: TYPE_NORMAL + zh: 完成绕过赛道一圈而不偏离赛道。 - en: Input + id: totrans-25 prefs: [] type: TYPE_NORMAL + zh: 输入 - en: In a human driven car, the human visualizes the environment and uses driving knowledge to make decisions and navigate the road. DeepRacer is also a vision-driven system and so we use a single camera image as our input into the system. Specifically, we use a grayscale 120x160 image as the input. + id: totrans-26 prefs: [] type: TYPE_NORMAL + zh: 在人类驾驶的汽车中,人类通过可视化环境并利用驾驶知识做出决策并驾驶车辆。DeepRacer也是一个视觉驱动系统,因此我们将单个摄像头图像作为系统的输入。具体来说,我们使用灰度120x160图像作为输入。 - en: Output (Actions) + id: totrans-27 prefs: [] type: TYPE_NORMAL + zh: 输出(动作) - en: 'In the real world, we drive the car by using the throttle (gas), brake, and steering wheel. DeepRacer, which is built on top of a RC car, has two control signals: the throttle and the steering, both of which are controlled by traditional @@ -176,93 +239,141 @@ and steering. Our reinforcement learning model, after it’s trained, will make decisions on which action to take such that it can navigate the track successfully. We’ll have the flexibility to define these actions as we create the model.' 
+ id: totrans-28 prefs: [] type: TYPE_NORMAL + zh: 在现实世界中,我们通过油门(油门)、刹车和方向盘来驾驶汽车。建立在遥控车上的DeepRacer有两个控制信号:油门和方向盘,两者都由传统的脉冲宽度调制(PWM)信号控制。将驾驶映射到PWM信号可能不直观,因此我们将汽车可以采取的驾驶动作离散化。还记得我们在电脑上玩的那些老式赛车游戏吗?其中最简单的使用箭头键——左、右和上——来驾驶汽车。同样,我们可以定义汽车可以采取的一组固定动作,但对油门和方向盘有更精细的控制。我们的强化学习模型在训练后将决定采取哪种动作,以便成功地驾驶赛道。在创建模型时,我们将有灵活性定义这些动作。 - en: Note + id: totrans-29 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 注意 - en: A servo in an RC hobby car is generally controlled by a PWM signal, which is a series of pulses of varying width. The position in which the servo needs to end up is achieved by sending a particular width of the pulse signal. The parameters for the pulses are min pulse width, max pulse width, and repetition rate. + id: totrans-30 prefs: [] type: TYPE_NORMAL + zh: 遥控爱好车中的舵机通常由PWM信号控制,这是一系列脉冲信号,脉冲宽度不同。舵机需要到达的位置是通过发送特定宽度的脉冲信号来实现的。脉冲的参数是最小脉冲宽度、最大脉冲宽度和重复率。 - en: Agent + id: totrans-31 prefs: [] type: TYPE_NORMAL + zh: 代理 - en: The system that learns and makes decisions. In our case, it’s the car that’s learning to navigate the environment (the track). + id: totrans-32 prefs: [] type: TYPE_NORMAL + zh: 学习并做出决策的系统。在我们的情况下,是汽车学习如何驾驶环境(赛道)。 - en: Environment + id: totrans-33 prefs: [] type: TYPE_NORMAL + zh: 环境 - en: Where the agent learns by interacting with actions. In DeepRacer, the environment contains a track that defines where the agent can go and be in. The agent explores the environment to collect data to train the underlying deep reinforcement learning neural network. + id: totrans-34 prefs: [] type: TYPE_NORMAL + zh: 代理通过与动作的交互学习。在DeepRacer中,环境包含一个定义代理可以前往和停留的赛道。代理探索环境以收集数据,以训练基础的深度强化学习神经网络。 - en: State (s) + id: totrans-35 prefs: [] type: TYPE_NORMAL + zh: 状态(s) - en: The representation of where the agent is in an environment. It’s a point-in-time snapshot of the agent. For DeepRacer, we use an image as the state. 
+ id: totrans-36 prefs: [] type: TYPE_NORMAL + zh: 代理在环境中的位置表示。这是代理的一个瞬时快照。对于DeepRacer,我们使用图像作为状态。 - en: Actions (a) + id: totrans-37 prefs: [] type: TYPE_NORMAL + zh: 动作(a) - en: Set of decisions that the agent can make. + id: totrans-38 prefs: [] type: TYPE_NORMAL + zh: 代理可以做出的决策集。 - en: Step + id: totrans-39 prefs: [] type: TYPE_NORMAL + zh: 步骤 - en: Discrete transition from one state to the next. + id: totrans-40 prefs: [] type: TYPE_NORMAL + zh: 从一个状态离散过渡到下一个状态。 - en: Episode + id: totrans-41 prefs: [] type: TYPE_NORMAL + zh: 情节 - en: This refers to an attempt by the car to achieve its goal; that is, complete a lap on the track. Thus, an episode is a sequence of steps, or experience. Different episodes can have different lengths. + id: totrans-42 prefs: [] type: TYPE_NORMAL + zh: 这指的是汽车为实现目标而尝试的努力;即,在赛道上完成一圈。因此,一个情节是一系列步骤或经验。不同的情节可能有不同的长度。 - en: Reward (r) + id: totrans-43 prefs: [] type: TYPE_NORMAL + zh: 奖励(r) - en: The value for the action that the agent took given an input state. + id: totrans-44 prefs: [] type: TYPE_NORMAL + zh: 给定输入状态,代理采取的动作的值。 - en: Policy (π) + id: totrans-45 prefs: [] type: TYPE_NORMAL + zh: 策略(π) - en: Decision-making strategy or function; a mapping from state to actions. + id: totrans-46 prefs: [] type: TYPE_NORMAL + zh: 决策策略或函数;从状态到动作的映射。 - en: Value function (V) + id: totrans-47 prefs: [] type: TYPE_NORMAL + zh: 价值函数(V) - en: The mapping of state to values, in which value represents the expected reward for an action given the state. 
+ id: totrans-48 prefs: [] type: TYPE_NORMAL + zh: 状态到值的映射,其中值代表给定状态下对动作的预期奖励。 - en: Replay or experience buffer + id: totrans-49 prefs: [] type: TYPE_NORMAL + zh: 重播或经验缓冲区 - en: Temporary storage buffer that stores the experience, which is a tuple of (s,a,r,s`'`), where “s” stands for an observation (or state) captured by the camera, “a” for an action taken by the vehicle, “r” for the expected reward incurred by the said action, and “s`'`” for the new observation (or new state) after the action is taken. + id: totrans-50 prefs: [] type: TYPE_NORMAL + zh: 临时存储缓冲区,存储经验,这是一个元组(s,a,r,s`'`),其中“s”代表摄像头捕获的观察(或状态),“a”代表车辆采取的动作,“r”代表该动作产生的预期奖励,“s`'`”代表采取动作后的新观察(或新状态)。 - en: Reward function + id: totrans-51 prefs: [] type: TYPE_NORMAL + zh: 奖励函数 - en: Any reinforcement learning system needs a guide, something that tells the model as it learns what’s a good or a bad action given the situation. The reward function acts as this guide, which evaluates the actions taken by the car and gives it @@ -272,18 +383,26 @@ be bad (reward = 0). The reinforcement learning system eventually collects this guidance based on the reward function and trains the model. This is the most critical piece in training the car and the part that we’ll focus on. + id: totrans-52 prefs: [] type: TYPE_NORMAL + zh: 任何强化学习系统都需要一个指导,告诉模型在学习过程中在特定情况下什么是好的或坏的动作。奖励函数充当这个指导,评估汽车采取的动作并给予奖励(标量值),指示该动作在该情况下的可取性。例如,在左转时,采取“左转”动作将被认为是最佳的(例如,奖励=1;在0-1范围内),但采取“右转”动作将是不好的(奖励=0)。强化学习系统最终根据奖励函数收集这些指导并训练模型。这是训练汽车的最关键部分,也是我们将重点关注的部分。 - en: 'Finally, when we put together the system, the schematic flow looks as follows:' + id: totrans-53 prefs: [] type: TYPE_NORMAL + zh: 最后,当我们组装系统时,原理图流程如下: - en: '[PRE0]' + id: totrans-54 prefs: [] type: TYPE_PRE + zh: '[PRE0]' - en: In AWS DeepRacer, the reward function is an important part of the model building process. We must provide it when training our AWS DeepRacer model. 
+ id: totrans-55 prefs: [] type: TYPE_NORMAL + zh: 在AWS DeepRacer中,奖励函数是模型构建过程中的重要部分。在训练AWS DeepRacer模型时,我们必须提供它。 - en: In an episode, the agent interacts with the track to learn the optimal set of actions it needs to take to maximize the expected cumulative reward. But a single episode doesn’t produce enough data for us to train the agent. So, we end up collecting @@ -295,83 +414,120 @@ given an image input. The evaluation of the model can be done in either the simulated environment with a virtual agent or a real-world environment with a physical AWS DeepRacer car. + id: totrans-56 prefs: [] type: TYPE_NORMAL + zh: 在一个episode中,代理与赛道互动,学习最优动作集,以最大化预期累积奖励。但是,单个episode并不产生足够的数据来训练代理。因此,我们最终会收集许多episode的数据。定期,在每个第n个episode结束时,我们启动一个训练过程,生成一个强化学习模型的迭代。我们运行许多迭代来生成我们能够的最佳模型。这个过程在下一节中详细解释。训练结束后,代理通过在模型上运行推理来执行自主驾驶,以根据图像输入采取最佳动作。模型的评估可以在模拟环境中使用虚拟代理或在物理AWS + DeepRacer汽车的真实环境中进行。 - en: It’s finally time to create our first model. Because the input into the car is fixed, which is a single image from the camera, we need to focus only on the output (actions) and the reward function. We can follow the steps below to begin training the model. + id: totrans-57 prefs: [] type: TYPE_NORMAL + zh: 终于是时候创建我们的第一个模型了。因为汽车的输入是固定的,即来自摄像头的单个图像,所以我们只需要关注输出(动作)和奖励函数。我们可以按照以下步骤开始训练模型。 - en: Building Our First Reinforcement Learning + id: totrans-58 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 构建我们的第一个强化学习 - en: To do this exercise, you will need an AWS account. Log into the AWS Console using your account credentials, as shown in [Figure 17-2](part0020.html#the_aws_login_console). + id: totrans-59 prefs: [] type: TYPE_NORMAL + zh: 要进行这个练习,您需要一个AWS账户。使用您的账户凭据登录AWS控制台,如[图17-2](part0020.html#the_aws_login_console)所示。 - en: '![The AWS login console](../images/00171.jpeg)' + id: totrans-60 prefs: [] type: TYPE_IMG + zh: '![AWS登录控制台](../images/00171.jpeg)' - en: Figure 17-2\. 
The AWS login console + id: totrans-61 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-2。AWS登录控制台 - en: 'First, let’s make sure we are in the North Virginia region, given that the service is available only in that region, and then navigate to the DeepRacer console page: [*https://console.aws.amazon.com/deepracer/home?region=us-east-1#getStarted*](https://console.aws.amazon.com/deepracer/home?region=us-east-1#getStarted).' + id: totrans-62 prefs: [] type: TYPE_NORMAL + zh: 首先,让我们确保我们在北弗吉尼亚地区,因为该服务仅在该地区提供,并转到DeepRacer控制台页面:[*https://console.aws.amazon.com/deepracer/home?region=us-east-1#getStarted*](https://console.aws.amazon.com/deepracer/home?region=us-east-1#getStarted)。 - en: After you select “Reinforcement learning,” the model page opens. This page shows a list of all the models that have been created and the status of each model. To create a model, start the process here. + id: totrans-63 prefs: [] type: TYPE_NORMAL + zh: 在选择“强化学习”后,模型页面会打开。该页面显示了所有已创建模型的列表以及每个模型的状态。要创建模型,请从这里开始该过程。 - en: '![Workflow for training the AWS DeepRacer model](../images/00189.jpeg)' + id: totrans-64 prefs: [] type: TYPE_IMG + zh: '![训练AWS DeepRacer模型的工作流程](../images/00189.jpeg)' - en: Figure 17-3\. Workflow for training the AWS DeepRacer model + id: totrans-65 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-3。训练AWS DeepRacer模型的工作流程 - en: 'Step 1: Create Model' + id: totrans-66 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 步骤1:创建模型 - en: We are going to create a model that can be used by the AWS DeepRacer car to autonomously drive (take actions) around a race track. We need to select the specific race track, provide the actions that our model can choose from, provide a reward function that will be used to incentivize our desired driving behavior, and configure the hyperparameters used during training. 
+ id: totrans-67 prefs: [] type: TYPE_NORMAL + zh: 我们将创建一个模型,供AWS DeepRacer汽车在赛道上自主驾驶(采取动作)。我们需要选择特定的赛道,提供我们的模型可以选择的动作,提供一个奖励函数,用于激励我们期望的驾驶行为,并配置训练期间使用的超参数。 - en: '![Creating a model on the AWS DeepRacer console](../images/00124.jpeg)' + id: totrans-68 prefs: [] type: TYPE_IMG + zh: 在AWS DeepRacer控制台上创建模型 - en: Figure 17-4\. Creating a model on the AWS DeepRacer console + id: totrans-69 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-4。在AWS DeepRacer控制台上创建模型 - en: 'Step 2: Configure Training' + id: totrans-70 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 步骤2:配置训练 - en: In this step, we select our training environment, configure action space, write our reward function, and adjust other training related settings before kicking off our training job. + id: totrans-71 prefs: [] type: TYPE_NORMAL + zh: 在这一步中,我们选择我们的训练环境,配置动作空间,编写奖励函数,并在启动训练作业之前调整其他与训练相关的设置。 - en: Configure the simulation environment + id: totrans-72 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 配置模拟环境 - en: Training our reinforcement learning model takes place on a simulated race track, and we can choose the track to train our model. We’ll use AWS RoboMaker, a cloud service that makes building robotic applications easy, to spin up the simulation environment. + id: totrans-73 prefs: [] type: TYPE_NORMAL + zh: 我们的强化学习模型训练发生在模拟赛道上,我们可以选择赛道来训练我们的模型。我们将使用AWS RoboMaker,这是一个使构建机器人应用程序变得简单的云服务,来启动模拟环境。 - en: When training a model, we pick the track most similar to the final track we intend to race on. As of July 2019, AWS DeepRacer provides seven tracks that we can train on. While configuring such a complementary environment isn’t required @@ -382,8 +538,11 @@ that’s not part of the training data, in reinforcement learning, the agent is unlikely to learn anything out of scope from the training environment. For our first exercise, select the re:Invent 2018 track, as shown in [Figure 17-5](part0020.html#track_selection_on_the_aws_deepracer_con). 
+ id: totrans-74 prefs: [] type: TYPE_NORMAL + zh: 在训练模型时,我们选择与我们打算在赛道上比赛的最终赛道最相似的赛道。截至2019年7月,AWS DeepRacer提供了七个可以进行训练的赛道。虽然配置这样一个辅助环境并不是必需的,也不能保证一个好的模型,但它将最大化我们的模型在赛道上表现最好的可能性。此外,如果我们在一条直线赛道上训练,那么我们的模型很可能不会学会如何转弯。就像在监督学习的情况下,模型不太可能学会不属于训练数据的内容一样,在强化学习中,代理人不太可能从训练环境中学到超出范围的内容。对于我们的第一个练习,选择re:Invent + 2018赛道,如[图17-5](part0020.html#track_selection_on_the_aws_deepracer_con)所示。 - en: To train a reinforcement learning model, we must choose a learning algorithm. Currently, the AWS DeepRacer console supports only the proximal policy optimization (PPO) algorithm. The team eventually will support more algorithms, but PPO was @@ -400,19 +559,28 @@ is look at the logs and see whether the car is going past the finish line; in other words, whether progress is 100%. Alternatively, we could visually observe the car’s behavior and confirm that it goes past the finish line. + id: totrans-75 prefs: [] type: TYPE_NORMAL + zh: 要训练一个强化学习模型,我们必须选择一个学习算法。目前,AWS DeepRacer控制台仅支持近端策略优化(PPO)算法。团队最终将支持更多的算法,但选择PPO是为了更快的训练时间和更优越的收敛性能。训练一个强化学习模型是一个迭代的过程。首先,定义一个奖励函数来覆盖代理在环境中的所有重要行为是一个挑战。其次,通常需要调整超参数以确保令人满意的训练性能。这两者都需要实验。一个谨慎的方法是从一个简单的奖励函数开始,这将是本章的方法,然后逐步增强它。AWS + DeepRacer通过允许我们克隆一个训练好的模型来促进这个迭代过程,在这个模型中,我们可以增强奖励函数以处理之前被忽略的变量,或者我们可以系统地调整超参数直到结果收敛。检测这种收敛的最简单方法是查看日志,看看汽车是否超过了终点线;换句话说,进展是否达到了100%。或者,我们可以直观地观察汽车的行为,并确认它是否超过了终点线。 - en: '![Track selection on the AWS DeepRacer console](../images/00085.jpeg)' + id: totrans-76 prefs: [] type: TYPE_IMG + zh: '![在AWS DeepRacer控制台上选择赛道](../images/00085.jpeg)' - en: Figure 17-5\. Track selection on the AWS DeepRacer console + id: totrans-77 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-5。在AWS DeepRacer控制台上选择赛道 - en: Configure the action space + id: totrans-78 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 配置动作空间 - en: Next, we configure the action space that our model will select both during and after training. An action (output) is a combination of speed and steering angle. 
Currently in AWS DeepRacer, we use a discrete action space (a fixed set of actions) @@ -421,29 +589,44 @@ the physical car, which we dive into later in the [“Racing the AWS DeepRacer Car”](part0020.html#J2BTE-13fa565533764549a6f0ab7f11eed62b) To build this discrete action space, we specify the maximum speed, the speed levels, the maximum steering angle, and the steering levels, as depicted in [Figure 17-6](part0020.html#defining_the_action_space_on_the_aws_dee). + id: totrans-79 prefs: [] type: TYPE_NORMAL + zh: 接下来,我们配置我们的模型在训练期间和训练后选择的动作空间。一个动作(输出)是速度和转向角的组合。目前在AWS DeepRacer中,我们使用离散动作空间(固定的动作集)而不是连续动作空间(以*x*速度转动*x*度,其中*x*和*y*取实值)。这是因为更容易映射到物理汽车上的值,我们稍后会深入探讨这一点在[“驾驶AWS + DeepRacer汽车”](part0020.html#J2BTE-13fa565533764549a6f0ab7f11eed62b)。为了构建这个离散动作空间,我们指定了最大速度、速度级别、最大转向角和转向级别,如[图17-6](part0020.html#defining_the_action_space_on_the_aws_dee)所示。 - en: '![Defining the action space on the AWS DeepRacer console](../images/00041.jpeg)' + id: totrans-80 prefs: [] type: TYPE_IMG + zh: '![在AWS DeepRacer控制台上定义动作空间](../images/00041.jpeg)' - en: Figure 17-6\. Defining the action space on the AWS DeepRacer console + id: totrans-81 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-6。在AWS DeepRacer控制台上定义动作空间 - en: 'Following are the configuration parameters for the action space:' + id: totrans-82 prefs: [] type: TYPE_NORMAL + zh: 以下是动作空间的配置参数: - en: Maximum steering angle + id: totrans-83 prefs: [] type: TYPE_NORMAL + zh: 最大转向角度 - en: This is the maximum angle in degrees that the front wheels of the car can turn to the left and to the right. There is a limit as to how far the wheels can turn, and so the maximum turning angle is 30 degrees. + id: totrans-84 prefs: [] type: TYPE_NORMAL + zh: 这是汽车前轮可以向左和向右转动的最大角度。轮子可以转动的角度是有限的,因此最大转向角度为30度。 - en: Steering angle granularity + id: totrans-85 prefs: [] type: TYPE_NORMAL + zh: 转向角度粒度 - en: 'Refers to the number of steering intervals between the maximum steering angle on either side. 
Thus, if our maximum steering angle is 30 degrees, +30 degrees is to the left and –30 degrees is to the right. With a steering granularity of @@ -451,70 +634,96 @@ from left to right, will be in the action space: 30 degrees, 15 degrees, 0 degrees, –15 degrees, and –30 degrees. Steering angles are always symmetrical around 0 degrees.' + id: totrans-86 prefs: [] type: TYPE_NORMAL + zh: 指的是最大转向角两侧的转向间隔数。因此,如果我们的最大转向角为30度,+30度是向左,-30度是向右。具有5个转向粒度时,从左到右的转向角如[图17-6](part0020.html#defining_the_action_space_on_the_aws_dee)所示,将在行动空间中:30度,15度,0度,-15度和-30度。转向角始终围绕0度对称。 - en: Maximum speed + id: totrans-87 prefs: [] type: TYPE_NORMAL + zh: 最大速度 - en: Refers to the maximum speed the car will drive in the simulator as measured in meters per second (m/s). + id: totrans-88 prefs: [] type: TYPE_NORMAL + zh: 指的是模拟器中车辆将以米/秒(m/s)为单位测量的最大速度驾驶的速度。 - en: Speed levels + id: totrans-89 prefs: [] type: TYPE_NORMAL + zh: 速度级别 - en: Refers to the number of speed levels from the maximum speed (including) to zero (excluding). So, if our maximum speed is 3 m/s and our speed granularity is 3, our action space will contain speed settings of 1 m/s, 2 m/s, and 3 m/s. Simply put, 3 m/s divided by 3 = 1 m/s, so go from 0 m/s to 3 m/s in increments of 1 m/s. 0 m/s is not included in the action space. + id: totrans-90 prefs: [] type: TYPE_NORMAL + zh: 指的是从最大速度(包括)到零(不包括)的速度级别数。因此,如果我们的最大速度是3m/s,速度粒度为3,那么我们的行动空间将包含1m/s、2m/s和3m/s的速度设置。简单来说,3m/s除以3等于1m/s,所以从0m/s到3m/s以1m/s的增量进行。0m/s不包括在行动空间中。 - en: Based on the previous example the final action space will include 15 discrete actions (three speeds x five steering angles), which should be listed in the AWS DeepRacer service. Feel free to tinker with other options, just remember that larger action spaces might take a bit longer to train. 
+ id: totrans-91 prefs: [] type: TYPE_NORMAL + zh: 根据前面的例子,最终的行动空间将包括15个离散行动(三种速度 x 五种转向角),这些应该在AWS DeepRacer服务中列出。随意尝试其他选项,只需记住较大的行动空间可能需要更长时间进行训练。 - en: Tip + id: totrans-92 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 提示 - en: 'Based on our experience, here are some tips on how to configure the action space:' + id: totrans-93 prefs: [] type: TYPE_NORMAL + zh: 根据我们的经验,以下是一些建议如何配置行动空间: - en: Our experiments have shown that models with a faster maximum speed take longer to converge than those with a slower maximum speed. In some cases (reward function and track dependent), it can take longer than 12 hours for a 5 m/s model to converge. + id: totrans-94 prefs: - PREF_UL type: TYPE_NORMAL + zh: 我们的实验表明,具有更快最大速度的模型收敛所需的时间比具有较慢最大速度的模型更长。在某些情况下(奖励函数和赛道相关),5m/s模型收敛可能需要超过12小时。 - en: Our model will not perform an action that is not in the action space. Similarly, if the model is trained on a track that never required the use of this action—for example, turning won’t be incentivized on a straight track—the model won’t know how to use this action, because it won’t be incentivized to turn. As you begin thinking about building a robust model, make sure that you keep the action space and training track in mind. + id: totrans-95 prefs: - PREF_UL type: TYPE_NORMAL + zh: 我们的模型不会执行不在行动空间中的行动。同样,如果模型在从未需要使用此行动的赛道上进行训练,例如,在直道上不会激励转弯,那么模型将不知道如何使用此行动,因为它不会被激励转弯。在开始考虑构建强大模型时,请确保记住行动空间和训练赛道。 - en: Specifying a fast speed or a wide steering angle is great, but we still need to think about our reward function and whether it makes sense to drive full speed into a turn, or exhibit zigzag behavior on a straight section of the track. + id: totrans-96 prefs: - PREF_UL type: TYPE_NORMAL + zh: 指定快速速度或大转向角是很好的,但我们仍然需要考虑我们的奖励函数,以及是否有意义全速驶入转弯,或在赛道的直线段上展示之字形行为。 - en: We also need to keep physics in mind. If we try to train a model at faster than 5 m/s, we might see our car spin out on corners, which will probably increase the time to convergence of our model. 
+ id: totrans-97 prefs: - PREF_UL type: TYPE_NORMAL + zh: 我们还需要牢记物理学。如果我们尝试以超过5m/s的速度训练模型,我们可能会看到我们的车在拐弯时打滑,这可能会增加模型收敛的时间。 - en: Configure reward function + id: totrans-98 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 配置奖励函数 - en: As we explained earlier, the reward function evaluates the quality of an action’s outcome given the situation, and rewards the action accordingly. In practice the reward is calculated during training after each action is taken, and forms a key @@ -525,29 +734,42 @@ relation to the racetrack, such as (x, y) coordinates; and the racetrack, such as waypoints (milestone markers on the track). We can use these measurements to build our reward function logic in Python 3 syntax. + id: totrans-99 prefs: [] type: TYPE_NORMAL + zh: 正如我们之前解释的那样,奖励函数评估了在给定情况下行动结果的质量,并相应地奖励该行动。在实践中,奖励是在每次行动后进行训练时计算的,并且构成了用于训练模型的经验的关键部分。然后我们将元组(状态,行动,下一个状态,奖励)存储在内存缓冲区中。我们可以使用模拟器提供的多个变量来构建奖励函数逻辑。这些变量代表了车辆的测量,如转向角和速度;车辆与赛道的关系,如(x,y)坐标;以及赛道,如路标(赛道上的里程碑标记)。我们可以使用这些测量值来在Python + 3语法中构建我们的奖励函数逻辑。 - en: All of the parameters are available as a dictionary to the reward function. Their keys, data types, and descriptions are documented in [Figure 17-7](part0020.html#reward_function_parameters_left_parenthe), and some of the more nuanced ones are further illustrated in [Figure 17-8](part0020.html#visual_explanation_of_some_of_the_reward). + id: totrans-100 prefs: [] type: TYPE_NORMAL + zh: 所有参数都作为字典提供给奖励函数。它们的键,数据类型和描述在[图17-7](part0020.html#reward_function_parameters_left_parenthe)中有文档记录,一些更微妙的参数在[图17-8](part0020.html#visual_explanation_of_some_of_the_reward)中有进一步说明。 - en: '![Reward function parameters (a more in-depth review of these parameters is available in the documentation)](../images/00002.jpeg)' + id: totrans-101 prefs: [] type: TYPE_IMG + zh: 奖励函数参数(这些参数的更深入审查可在文档中找到) - en: Figure 17-7\. Reward function parameters (a more in-depth review of these parameters is available in the documentation) + id: totrans-102 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-7. 
奖励函数参数(这些参数的更深入审查可在文档中找到) - en: '![Visual explanation of some of the reward function parameters](../images/00289.jpeg)' + id: totrans-103 prefs: [] type: TYPE_IMG + zh: 图解释了一些奖励函数参数 - en: Figure 17-8\. Visual explanation of some of the reward function parameters + id: totrans-104 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-8. 奖励函数参数的可视化解释 - en: 'To build our first model, let’s pick an example reward function and train our model. Let’s use the default template, shown in [Figure 17-9](part0020.html#an_example_reward_function), in which the car tries to follow the center dashed lines. The intuition behind @@ -563,35 +785,51 @@ Remember that time you did your driving test? The examiner probably did the same and cut off points as you got closer to the curb or the lane makers. This could come in handy, especially when there are sharp corners.' + id: totrans-105 prefs: [] type: TYPE_NORMAL + zh: 为了构建我们的第一个模型,让我们选择一个示例奖励函数并训练我们的模型。让我们使用默认模板,在其中汽车试图跟随中心虚线,如[图17-9](part0020.html#an_example_reward_function)所示。这个奖励函数背后的直觉是沿着赛道采取最安全的导航路径,因为保持在中心位置可以使汽车远离赛道外。奖励函数的作用是:在赛道周围创建三个层次,使用三个标记,然后为在第二层次驾驶的汽车提供更多奖励,而不是在中心或最后一层次驾驶。还要注意奖励的大小差异。我们为保持在狭窄的中心层次提供1的奖励,为保持在第二(偏离中心)层次提供0.5的奖励,为保持在最后一层次提供0.1的奖励。如果我们减少中心层次的奖励,或增加第二层次的奖励,实质上我们在激励汽车使用更大的赛道表面。记得你考驾照的时候吗?考官可能也是这样做的,当你靠近路缘或车道标志时扣分。这可能会很有用,特别是在有急转弯的情况下。 - en: '![An example reward function](../images/00249.jpeg)' + id: totrans-106 prefs: [] type: TYPE_IMG + zh: '![一个示例奖励函数](../images/00249.jpeg)' - en: Figure 17-9\. An example reward function + id: totrans-107 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-9. 一个示例奖励函数 - en: 'Here’s the code that sets this up:' + id: totrans-108 prefs: [] type: TYPE_NORMAL + zh: 以下是设置这一切的代码: - en: '[PRE1]' + id: totrans-109 prefs: [] type: TYPE_PRE + zh: '[PRE1]' - en: Because this is the first training run, let’s focus on understanding the process of creating and evaluating a basic model, and then focus on optimizing it. 
In this case, we skip the algorithm settings and hyperparameter sections and use the defaults. + id: totrans-110 prefs: [] type: TYPE_NORMAL + zh: 因为这是第一次训练运行,让我们专注于理解创建和评估基本模型的过程,然后专注于优化它。在这种情况下,我们跳过算法设置和超参数部分,使用默认设置。 - en: Note + id: totrans-111 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 注意 - en: '*FAQ: Should the rewards be in a certain range, and also can I give negative rewards?*' + id: totrans-112 prefs: [] type: TYPE_NORMAL + zh: '*常见问题:奖励应该在某个范围内吗,我可以给负奖励吗?*' - en: There are no real constraints on what we can reward and not reward, but as a good practice it’s easier to understand rewards when they are in a 0–1 or 0–100 scale. What’s more important is that our reward scale gives relative rewards for @@ -599,56 +837,77 @@ right action with high reward, a left action with close to 0 reward, perhaps a straight action with an in-between reward or higher than the left action because it might not be a completely bad action to take. + id: totrans-113 prefs: [] type: TYPE_NORMAL + zh: 没有真正的约束来决定我们可以奖励和不奖励什么,但作为一个良好的实践,更容易理解奖励的方式是将它们放在0-1或0-100的范围内。更重要的是,我们的奖励尺度应该适当地为行为提供相对奖励。例如,在右转时,我们应该用高奖励奖励正确的行为,用接近0的奖励奖励左转行为,也许用介于两者之间的奖励或高于左转行为的奖励奖励直行行为,因为这可能不是完全错误的行为。 - en: Configure stop conditions + id: totrans-114 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 配置停止条件 - en: This is the last section before we begin training. Here we specify the maximum time our model will train for. This is provided as a convenient mechanism for us to terminate our training given that we are billed for the amount of training time. + id: totrans-115 prefs: [] type: TYPE_NORMAL + zh: 这是我们开始训练之前的最后一节。在这里,我们指定模型将训练的最长时间。这是一个方便的机制,让我们可以终止训练,因为我们将根据训练时间计费。 - en: Specify 60 minutes and then select “Start training.” If there is an error, we will be taken to the error location. After we start training, it can take up to six minutes to spin up the services (such as Amazon SageMaker, AWS Robomaker, AWS Lambda, AWS Step Function) needed to start training. 
Remember, we can always stop training early if we determine that the model has converged (as explained in the next section) by clicking the “stop” button. + id: totrans-116 prefs: [] type: TYPE_NORMAL + zh: 指定60分钟,然后选择“开始训练”。如果出现错误,我们将被带到错误位置。开始训练后,可能需要长达六分钟的时间来启动所需的服务(如Amazon SageMaker、AWS + Robomaker、AWS Lambda、AWS Step Function)来开始训练。请记住,如果我们确定模型已经收敛(如下一节所述),我们随时可以通过点击“停止”按钮提前停止训练。 - en: 'Step 3: Model Training' + id: totrans-117 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 第三步:模型训练 - en: After our model has begun training, we can select it from the listed models on the DeepRacer console. We can then see quantitatively how the training is progressing by looking at the total reward over time graph and also qualitatively from a first-person view from the car in the simulator (see [Figure 17-10](part0020.html#training_graph_and_simulation_video_stre)). + id: totrans-118 prefs: [] type: TYPE_NORMAL + zh: 当我们的模型开始训练后,我们可以从DeepRacer控制台上列出的模型中选择它。然后,我们可以通过查看随时间变化的总奖励图以及从模拟器中汽车的第一人称视角来定量地了解训练的进展情况(参见[图17-10](part0020.html#training_graph_and_simulation_video_stre))。 - en: At first, our car will not be able to drive on a straight road, but as it learns better driving behavior, we should see its performance improving and the reward graph increasing. Furthermore, when our car drives off of the track it will be reset on the track. We might observe that the reward graph is spiky. + id: totrans-119 prefs: [] type: TYPE_NORMAL + zh: 起初,我们的车无法在直路上行驶,但随着它学习到更好的驾驶行为,我们应该看到它的表现提高,奖励图增加。此外,当我们的车驶离赛道时,它将被重置在赛道上。我们可能会观察到奖励图呈波动状态。 - en: Note + id: totrans-120 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 注意 - en: '*FAQ: Why is the reward graph spiky?*' + id: totrans-121 prefs: [] type: TYPE_NORMAL + zh: '*常见问题:为什么奖励图呈波动状态?*' - en: The agent starts with high exploration, and gradually begins to exploit the trained model. 
Because the agent always takes random actions for a fraction of its decisions, there might be occasions for which it totally makes the wrong decision and ends up going off track. This is typically high at the beginning of training, but eventually the spikiness should reduce as the model begins to learn. + id: totrans-122 prefs: [] type: TYPE_NORMAL + zh: 代理从高探索开始,并逐渐开始利用训练模型。因为代理始终对其决策的一部分采取随机动作,所以可能会有时候完全做出错误决定并最终偏离轨道。这通常在训练开始时很高,但随着模型开始学习,这种尖锐性应该会减少。 - en: Logs are always a good source of more granular information regarding our model’s training. Later in the chapter, we examine how we can use the logs programmatically to gain a deeper understanding of our model training. In the meantime, we can @@ -657,20 +916,29 @@ graph and select the three dots that appear below the refresh button. Then, select “View logs.” Because the model training will take an hour, this might be a good time to skip to the next section and learn a bit more about reinforcement learning. + id: totrans-123 prefs: [] type: TYPE_NORMAL + zh: 日志始终是关于我们模型训练更细粒度信息的良好来源。在本章后面,我们将探讨如何以编程方式使用日志来更深入地了解我们模型的训练。与此同时,我们可以查看Amazon + SageMaker和AWS RoboMaker的日志文件。日志输出到Amazon CloudWatch。要查看日志,请将鼠标悬停在奖励图上,并选择刷新按钮下方出现的三个点。然后,选择“查看日志”。因为模型训练需要一个小时,这可能是一个好时机跳到下一节,了解更多关于强化学习的知识。 - en: '![Training graph and simulation video stream on the AWS DeepRacer console](../images/00209.jpeg)' + id: totrans-124 prefs: [] type: TYPE_IMG + zh: '![AWS DeepRacer控制台上的训练图和模拟视频流](../images/00209.jpeg)' - en: Figure 17-10\. Training graph and simulation video stream on the AWS DeepRacer console + id: totrans-125 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-10。AWS DeepRacer控制台上的训练图和模拟视频流 - en: 'Step 4: Evaluating the Performance of the Model' + id: totrans-126 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 第4步:评估模型的性能 - en: In reinforcement learning, the best way to gauge the ability of the model is to run it such that it only exploits—that is, it doesn’t take random actions. 
In our case, first test it on a track similar to the one it’s trained on to see @@ -683,26 +951,39 @@ page as that shown in [Figure 17-11](part0020.html#model_evaluation_page_on_the_aws_deeprac), which summarizes the results of our model’s attempts to go around the track and complete the lap. + id: totrans-127 prefs: [] type: TYPE_NORMAL + zh: 在强化学习中,评估模型能力的最佳方法是运行它,使其仅利用——也就是说,它不会采取随机动作。在我们的情况下,首先在类似于训练的赛道上测试它,看看它是否能复制训练行为。接下来,尝试在不同的赛道上测试泛化能力。在我们的模型训练完成后,我们可以开始模型评估。从我们观察到训练的模型详细信息页面中,选择“开始评估”。现在我们可以选择要评估我们模型性能的赛道以及圈数。选择“re:Invent + 2018”赛道和5圈,然后选择开始。完成后,我们应该看到与[图17-11](part0020.html#model_evaluation_page_on_the_aws_deeprac)中显示的类似的页面,总结了我们模型尝试绕过赛道并完成圈数的结果。 - en: '![Model evaluation page on the AWS DeepRacer console](../images/00176.jpeg)' + id: totrans-128 prefs: [] type: TYPE_IMG + zh: '![AWS DeepRacer控制台上的模型评估页面](../images/00176.jpeg)' - en: Figure 17-11\. Model evaluation page on the AWS DeepRacer console + id: totrans-129 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-11。AWS DeepRacer控制台上的模型评估页面 - en: Great job! We have successfully built our first reinforcement learning–enabled autonomous car. + id: totrans-130 prefs: [] type: TYPE_NORMAL + zh: 干得好!我们已经成功构建了我们的第一个强化学习启用的自动驾驶汽车。 - en: Note + id: totrans-131 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 注意 - en: '*FAQ: When I run evaluation, I see only 100% completion rates sometimes?*' + id: totrans-132 prefs: [] type: TYPE_NORMAL + zh: '*常见问题:当我运行评估时,有时只看到100%的完成率?*' - en: As the model runs inference and navigates around the track in the simulator, due to the fidelity of the simulator, the car can end up in slightly different positions for the same action—in other words, a 15 degree left turn action might @@ -710,36 +991,50 @@ in the simulator, but these small deviations can add up over time. A well-trained model is able to recover from close-to-off-track positions, but an undertrained model is likely to not recover from close calls. 
+ id: totrans-133 prefs: [] type: TYPE_NORMAL + zh: 当模型在模拟器中运行推理并在赛道周围导航时,由于模拟器的保真度,汽车可能会以稍微不同的位置结束相同的动作——换句话说,15度左转动作可能只会导致14.9度。实际上,我们在模拟器中观察到的只是非常小的偏差,但这些小偏差会随着时间的推移而累积。训练良好的模型能够从接近越野位置恢复,但训练不足的模型可能无法从接近事故的位置恢复。 - en: Now that our model has been successfully evaluated, we move on to improving it and learn to achieve better lap times. But before that, it’s necessary for you to understand more about the theory behind reinforcement learning and dig deeper into the learning process. + id: totrans-134 prefs: [] type: TYPE_NORMAL + zh: 现在我们的模型已经成功评估,我们继续改进它,并学习如何实现更好的圈速。但在此之前,您需要更多地了解强化学习背后的理论,并深入了解学习过程。 - en: Note + id: totrans-135 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 注意 - en: We can get started with the AWS DeepRacer console service to create our first model, train it (for up to six hours), evaluate it, and submit it to the AWS DeepRacer League for free. + id: totrans-136 prefs: [] type: TYPE_NORMAL + zh: 我们可以开始使用AWS DeepRacer控制台服务创建我们的第一个模型,对其进行训练(最多六个小时),评估它,并免费提交给AWS DeepRacer联赛。 - en: Reinforcement Learning in Action + id: totrans-137 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 行动中的强化学习 - en: Let’s take a closer look at reinforcement learning in action. Here, we discuss some theory, the inner workings, and lots of practical insights related to our autonomous car project. + id: totrans-138 prefs: [] type: TYPE_NORMAL + zh: 让我们更仔细地看看强化学习的实际应用。在这里,我们讨论一些理论、内部工作原理以及与我们自动驾驶汽车项目相关的许多实用见解。 - en: How Does a Reinforcement Learning System Learn? + id: totrans-139 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 强化学习系统是如何学习的? - en: First, we must understand *explore* versus *exploit*. Similar to how a child learns, a reinforcement learning system learns from *exploring* and finding out what’s good and bad. In the case of the child, the parent guides or informs the @@ -750,6 +1045,7 @@ in a child’s life, it’s more attentive to the suggestions from its parents, thus creating opportunities to learn. 
And later in life, as adults who seldom listen to their parents, exploit the concepts they have learned to make decisions. + id: totrans-140 prefs: [] type: TYPE_NORMAL - en: As shown in [Figure 17-12](part0020.html#reinforcement_learning_theory_basics_in), @@ -757,12 +1053,15 @@ the state (S[t]), and on that basis the car selects an action (A[t]). As a result of the action the agent took, it gets a reward R[t+1] and moves to state S[t+1,] and this process continues throughout the episode. + id: totrans-141 prefs: [] type: TYPE_NORMAL - en: '![Reinforcement learning theory basics in a nutshell](../images/00131.jpeg)' + id: totrans-142 prefs: [] type: TYPE_IMG - en: Figure 17-12\. Reinforcement learning theory basics in a nutshell + id: totrans-143 prefs: - PREF_H6 type: TYPE_NORMAL @@ -777,6 +1076,7 @@ here being that the car can learn from what decisions were good and other decisions that were bad. The key point is that we *start with a high degree of exploration, and slowly increase exploitation.*' + id: totrans-144 prefs: [] type: TYPE_NORMAL - en: In the DeepRacer simulator, we sample the input image state at 15 frames per @@ -792,12 +1092,15 @@ or any similar strategy that’s usually tuned based on the degree of learning we hypothesize. Typically, an exponential decay is used in practical reinforcement learning training. + id: totrans-145 prefs: [] type: TYPE_NORMAL - en: '![The DeepRacer training flow](../images/00089.jpeg)' + id: totrans-146 prefs: [] type: TYPE_IMG - en: Figure 17-13\. The DeepRacer training flow + id: totrans-147 prefs: - PREF_H6 type: TYPE_NORMAL @@ -809,6 +1112,7 @@ from one state to another, and at each step a (state, action, new state, reward) tuple is recorded. The collection of steps from the reset point until the terminal state is called an *episode*. 
+ id: totrans-148 prefs: [] type: TYPE_NORMAL - en: 'To illustrate an episode, let’s look at the example of a miniature race track @@ -817,12 +1121,15 @@ and quickest path from start to finish. This episode consists of four steps: at step 1 and 2 the car follows the center line, and then it turns left 45 degrees at step 3, and continues in that direction only to finally crash at step 4.' + id: totrans-149 prefs: [] type: TYPE_NORMAL - en: '![Illustration of an agent exploring during an episode](../images/00046.jpeg)' + id: totrans-150 prefs: [] type: TYPE_IMG - en: Figure 17-14\. Illustration of an agent exploring during an episode + id: totrans-151 prefs: - PREF_H6 type: TYPE_NORMAL @@ -835,6 +1142,7 @@ Of course, the supercritical caveat being that the reward function is well defined and directing the agent toward the goal. If our reward function is bad, our model will not learn the correct behavior. + id: totrans-152 prefs: [] type: TYPE_NORMAL - en: Let’s take a slight diversion to understand a bad reward function. As we were @@ -845,6 +1153,7 @@ the car to go forward. Luckily for developers now, the DeepRacer team made it simpler by allowing the car to move only in the forward direction, so we don’t even need to think of such behavior. + id: totrans-153 prefs: [] type: TYPE_NORMAL - en: Now back to reinforcement learning. *How do we know if the model is getting @@ -863,6 +1172,7 @@ is the optimal action for that state. Over time, the model will explore less and exploit more, and this percentage allocation is changed either linearly or exponentially to allow for the best learning. + id: totrans-154 prefs: [] type: TYPE_NORMAL - en: The key here is that if the model is taking optimal actions, the system learns @@ -881,37 +1191,51 @@ path and converge to an optimal policy to reach the goal with total cumulative rewards of 18, as illustrated in [Figure 17-15](part0020.html#illustration_of_different_paths_to_the_g) (right). 
+ id: totrans-155 prefs: [] type: TYPE_NORMAL + zh: 关键在于,如果模型正在采取最佳行动,系统会学会继续选择这些行动;如果没有选择最佳行动,系统将继续尝试学习给定状态的最佳行动。从实际角度来看,可能存在许多到达目标的路径,但对于[图17-13](part0020.html#the_deepracer_training_flow)中的小型赛道示例来说,最快的路径是沿着赛道中间直线前进,因为实际上任何转弯都会减慢汽车速度。在训练过程中,在模型开始学习的早期阶段,我们观察到它可能到达终点线(目标),但可能没有选择最佳路径,如[图17-15](part0020.html#illustration_of_different_paths_to_the_g)(左)所示,汽车来回穿梭,因此需要更长时间到达终点线,累积奖励仅为9。但因为系统仍然继续探索一部分时间,代理给自己机会找到更好的路径。随着经验的积累,代理学会找到最佳路径并收敛到一个最佳策略,以达到总累积奖励18,如[图17-15](part0020.html#illustration_of_different_paths_to_the_g)(右)所示。 - en: '![Illustration of different paths to the goal](../images/00009.jpeg)' + id: totrans-156 prefs: [] type: TYPE_IMG + zh: '![达到目标的不同路径示例](../images/00009.jpeg)' - en: Figure 17-15\. Illustration of different paths to the goal + id: totrans-157 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-15. 达到目标的不同路径示例 - en: Quantitatively speaking, for each episode, you should see a trend of increasing rewards. If the model is making those optimal decisions, the car must be on track and navigating the course, resulting in accumulating rewards. However, you might see even after high reward episodes the graph dipping, which can happen in early episodes because the car might still have a high degree of exploration, as mentioned earlier. + id: totrans-158 prefs: [] type: TYPE_NORMAL + zh: 从数量上来说,对于每一集,你应该看到奖励逐渐增加的趋势。如果模型做出了最佳决策,汽车必须在赛道上并且在课程中导航,从而累积奖励。然而,你可能会看到即使在高奖励的集数之后,图表也会下降,这可能是因为汽车仍然具有较高程度的探索,正如前面提到的那样。 - en: Reinforcement Learning Theory + id: totrans-159 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 强化学习理论 - en: Now that we understand how a reinforcement learning system learns and works, especially in the context of AWS DeepRacer, let’s look at some formal definitions and general reinforcement learning theory. This background will be handy when we solve other problems using reinforcement learning. 
+ id: totrans-160 prefs: [] type: TYPE_NORMAL + zh: 现在我们了解了强化学习系统是如何学习和工作的,特别是在AWS DeepRacer的背景下,让我们来看一些正式的定义和一般的强化学习理论。当我们使用强化学习解决其他问题时,这些背景知识将会很有用。 - en: The Markov decision process + id: totrans-161 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 马尔可夫决策过程 - en: Markov decision process (MDP) is a discrete stochastic state-transition process framework that is used for modeling decision making in a control process. The markov property defines that each state is solely dependent on the previous state. @@ -920,12 +1244,16 @@ results in reinforcement learning rely on the problem being formulated as an MDP, hence it’s important to understand how to model a problem as an MDP to be solved using reinforcement learning. + id: totrans-162 prefs: [] type: TYPE_NORMAL + zh: 马尔可夫决策过程(MDP)是一个用于建模控制过程中决策制定的离散随机状态转移过程框架。马尔可夫性质定义了每个状态仅仅依赖于前一个状态。这是一个方便的性质,因为这意味着要进行状态转移,所有必要的信息必须在当前状态中可用。强化学习中的理论结果依赖于问题被制定为一个MDP,因此重要的是要理解如何将问题建模为一个MDP,以便使用强化学习来解决。 - en: Model free versus model based + id: totrans-163 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 无模型与基于模型 - en: Model, in this context, refers to the learned representation of the environment. This is useful because we can potentially learn the dynamics in the environment and train our agent using the model rather than having to use the real environment @@ -934,12 +1262,16 @@ jointly learn both perception and dynamics as part of the agent navigation rather than one after the other in a sequence. In this chapter, we focus only on model-free reinforcement learning. + id: totrans-164 prefs: [] type: TYPE_NORMAL + zh: 在这个背景下,模型指的是对环境的学习表示。这是有用的,因为我们可以潜在地学习环境中的动态并使用模型训练我们的代理,而不必每次都使用真实环境。然而,在现实中,学习环境并不容易,通常更容易在模拟中拥有真实世界的表示,然后联合学习感知和动态作为代理导航的一部分,而不是按顺序一个接一个地学习。在本章中,我们只关注无模型的强化学习。 - en: Value based + id: totrans-165 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 基于价值 - en: For every action taken by the agent, there’s a corresponding reward assigned by the reward function. For any given state-action pair, it’s helpful to know its value (reward). 
If such a function were to exist, we could compute the maximum @@ -954,12 +1286,16 @@ by parameterizing the value function and using a neural network to approximate the value for every action given a state observation. An example of a value-based algorithm is Deep Q-Learning. + id: totrans-166 prefs: [] type: TYPE_NORMAL + zh: 对于代理采取的每个动作,奖励函数都会分配相应的奖励。对于任何给定的状态-动作对,了解其价值(奖励)是有帮助的。如果这样的函数存在,我们可以计算在任何状态下可以实现的最大奖励,并简单地选择相应的动作来导航环境。例如,在一个3x3的井字棋游戏中,游戏情况的数量是有限的,因此我们可以建立一个查找表,以便在给定情况下给出最佳移动。但是在国际象棋游戏中,考虑到棋盘的大小和游戏的复杂性,这样的查找表将是计算昂贵的,并且存储空间将会很大。因此,在复杂的环境中,很难列出状态-动作值对或定义一个可以将状态-动作对映射到值的函数。因此,我们尝试使用神经网络通过对价值函数进行参数化,并使用神经网络来近似给定状态观察下每个动作的价值。一个基于值的算法的例子是深度Q学习。 - en: Policy based + id: totrans-167 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 基于策略 - en: A policy is a set of rules that the agent learns to navigate the environment. Simply put the policy function tells the agent which action is the best action to take from its current state. Policy-based reinforcement learning algorithms @@ -967,18 +1303,24 @@ map values to states. In reinforcement learning, we parameterize the policy. In other words, we allow a neural network to learn what the optimal policy function is. + id: totrans-168 prefs: [] type: TYPE_NORMAL + zh: 策略是代理学习如何在环境中导航的一组规则。简单来说,策略函数告诉代理从当前状态中采取哪个动作是最佳动作。基于策略的强化学习算法,如REINFORCE和策略梯度,找到最佳策略,无需将值映射到状态。在强化学习中,我们对策略进行参数化。换句话说,我们允许神经网络学习什么是最佳策略函数。 - en: Policy based or value based—why not both? + id: totrans-169 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 基于策略还是基于值——为什么不两者兼而有之? - en: There’s always been a debate about whether to use policy-based or value-based reinforcement learning. Newer architectures try to learn both the value function and policy function together rather than keeping one fixed. This approach in reinforcement learning is called *actor critic*. 
+ id: totrans-170 prefs: [] type: TYPE_NORMAL + zh: 一直存在关于使用基于策略还是基于值的强化学习的争论。新的架构尝试同时学习价值函数和策略函数,而不是保持其中一个固定。这种在强化学习中的方法被称为*演员评论家*。 - en: You can associate the actor with the policy and the critic with the value function. The actor is responsible for taking actions, and the critic is responsible for estimating the “goodness,” or the value of those actions. The actor maps states @@ -990,8 +1332,10 @@ reward states, and the critic also becomes better at estimating the values of those actions. The signal for both the actor and critic to learn comes purely from the reward function. + id: totrans-171 prefs: [] type: TYPE_NORMAL + zh: 您可以将演员与策略关联起来,将评论家与价值函数关联起来。演员负责采取行动,评论家负责估计这些行动的“好坏”或价值。演员将状态映射到动作,评论家将状态-动作对映射到值。在演员评论家范式中,这两个网络(演员和评论家)使用梯度上升分别进行训练,以更新我们深度神经网络的权重。(请记住,我们的目标是最大化累积奖励;因此,我们需要找到全局最大值。)随着情节的推移,演员变得更擅长采取导致更高奖励状态的行动,评论家也变得更擅长估计这些行动的价值。演员和评论家学习的信号纯粹来自奖励函数。 - en: 'The value for a given state-action pair is called *Q value*, denoted by Q(s,a). We can decompose the Q value into two parts: the estimated value and a quantitative measure of the factor by which the action is better than others. This measure @@ -999,20 +1343,26 @@ the actual reward for a given state-action pair and the expected reward at that state. The higher the difference, the farther we are from selecting an optimal action.' + id: totrans-172 prefs: [] type: TYPE_NORMAL + zh: 给定状态-动作对的价值称为*Q值*,表示为Q(s,a)。我们可以将Q值分解为两部分:估计值和衡量动作优于其他动作的因素的量化度量。这个度量被称为*优势*函数。我们可以将优势视为给定状态-动作对的实际奖励与该状态的预期奖励之间的差异。差异越大,我们离选择最佳动作就越远。 - en: Given that estimating the value of the state could end up being a difficult problem, we can focus on learning the advantage function. This allows us to evaluate the action not only based on how good it is, but also based on how much better it could be. This allows us to converge to an optimal policy much easier than other simpler policy gradient-based methods because, generally, policy networks have high variance. 
+ id: totrans-173 prefs: [] type: TYPE_NORMAL + zh: 考虑到估计状态的价值可能会成为一个困难的问题,我们可以专注于学习优势函数。这使我们能够评估动作不仅基于其有多好,还基于它可能有多好。这使我们比其他简单的基于策略梯度的方法更容易收敛到最佳策略,因为通常策略网络具有很高的方差。 - en: Delayed rewards and discount factor (γ) + id: totrans-174 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 延迟奖励和折扣因子(γ) - en: Rewards are given for every state transition based on the action taken. But the impact of these rewards might be nonlinear. For certain problems the immediate rewards might be more important, and in some cases the future rewards more important. @@ -1022,23 +1372,33 @@ to zero indicates rewards in the immediate future are more important. With a high value that’s closer to 1, the agent will focus on taking actions that maximize future rewards. + id: totrans-175 prefs: [] type: TYPE_NORMAL + zh: 根据采取的动作,每个状态转换都会获得奖励。但这些奖励的影响可能是非线性的。对于某些问题,即时奖励可能更重要,在某些情况下,未来奖励可能更重要。例如,如果我们要构建一个股票交易算法,未来奖励可能具有更高的不确定性;因此,我们需要适当地进行折现。折现因子(γ)是介于[0,1]之间的乘法因子,接近零表示即时未来奖励更重要。对于接近1的高值,代理将专注于采取最大化未来奖励的行动。 - en: Reinforcement Learning Algorithm in AWS DeepRacer + id: totrans-176 prefs: - PREF_H2 type: TYPE_NORMAL + zh: AWS DeepRacer中的强化学习算法 - en: 'First, let’s look at one of the simplest examples of a policy optimization reinforcement learning algorithm: vanilla policy gradient.' + id: totrans-177 prefs: [] type: TYPE_NORMAL + zh: 首先,让我们看一个最简单的策略优化强化学习算法的例子:香草策略梯度。 - en: '![Training process for the vanilla policy gradient algorithm](../images/00296.jpeg)' + id: totrans-178 prefs: [] type: TYPE_IMG + zh: '![香草策略梯度算法的训练过程](../images/00296.jpeg)' - en: Figure 17-16\. Training process for the vanilla policy gradient algorithm + id: totrans-179 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-16. 香草策略梯度算法的训练过程 - en: 'We can think of a deep reinforcement learning model as consisting of two parts: the input embedder and policy network. 
The input embedder will extract features from the image input and pass it to the policy network, which makes decisions; @@ -1058,25 +1418,35 @@ the cumulative reward; so, instead of minimization, we look to maximize. So, we use gradient ascent to move the weights in the direction of the steepest reward signal.' + id: totrans-180 prefs: [] type: TYPE_NORMAL + zh: 我们可以将深度强化学习模型看作由两部分组成:输入嵌入器和策略网络。输入嵌入器将从图像输入中提取特征并将其传递给策略网络,策略网络做出决策;例如,预测对于给定输入状态哪个动作是最佳的。鉴于我们的输入是图像,我们使用卷积层(CNNs)来提取特征。因为策略是我们想要学习的内容,我们对策略函数进行参数化,最简单的方法是使用全连接层进行学习。输入CNN层接收图像,然后策略网络使用图像特征作为输入并输出一个动作。因此,将状态映射到动作。随着模型的训练,我们变得更擅长映射输入空间和提取相关特征,同时优化策略以获得每个状态的最佳动作。我们的目标是收集最大的累积奖励。为了实现这一目标,我们更新模型权重以最大化累积未来奖励,通过这样做,我们给导致更高累积未来奖励的动作赋予更高的概率。在以前训练神经网络时,我们使用随机梯度下降或其变体;在训练强化学习系统时,我们寻求最大化累积奖励;因此,我们不是最小化,而是最大化。因此,我们使用梯度上升来将权重移动到最陡峭奖励信号的方向。 - en: DeepRacer uses an advanced variation of policy optimization, called Proximal Policy Optimization (PPO), summarized in [Figure 17-17](part0020.html#training_using_the_ppo_algorithm). + id: totrans-181 prefs: [] type: TYPE_NORMAL + zh: DeepRacer使用一种高级的策略优化变体,称为Proximal Policy Optimization(PPO),在[图17-17](part0020.html#training_using_the_ppo_algorithm)中进行了总结。 - en: '![Training using the PPO algorithm](../images/00043.jpeg)' + id: totrans-182 prefs: [] type: TYPE_IMG + zh: '![使用PPO算法进行训练](../images/00043.jpeg)' - en: Figure 17-17\. Training using the PPO algorithm + id: totrans-183 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-17. 使用PPO算法进行训练 - en: On the left side of [Figure 17-17](part0020.html#training_using_the_ppo_algorithm), we have our simulator using the latest policy (model) to get new experience (s,a,r,s'). The experience is fed into an experience replay buffer that feeds our PPO algorithm after we have a set number of episodes. 
+ id: totrans-184 prefs: [] type: TYPE_NORMAL + zh: 在[图17-17](part0020.html#training_using_the_ppo_algorithm)的左侧,我们的模拟器使用最新的策略(模型)获取新的经验(s,a,r,s')。经验被馈送到经验重放缓冲区中,在我们完成一定数量的周期后,将经验馈送给我们的PPO算法。 - en: On the right side of [Figure 17-17](part0020.html#training_using_the_ppo_algorithm), we update our model using PPO. Although PPO is a policy optimization method, it uses the advantage actor-critic method, which we described earlier. We compute @@ -1090,40 +1460,59 @@ of new and old policy at [0.8, 1.2]. The critic tells the actor how good the action taken was, and how the actor should adjust its network. After the policy is updated the new model is sent to the simulator to get more experience. + id: totrans-185 prefs: [] type: TYPE_NORMAL + zh: 在[图17-17](part0020.html#training_using_the_ppo_algorithm)的右侧,我们使用PPO更新我们的模型。尽管PPO是一种策略优化方法,但它使用了我们之前描述的优势演员-评论家方法。我们计算PPO梯度并将策略移动到我们获得最高奖励的方向。盲目地朝这个方向迈大步可能会导致训练中的变化过大;如果我们迈小步,训练可能会持续很长时间。PPO通过限制每个训练步骤中策略可以更新的程度来改善策略(演员)的稳定性。这是通过使用剪切的替代目标函数来实现的,它防止策略更新过多,从而解决了策略优化方法中常见的大方差问题。通常情况下,对于PPO,我们保持新旧策略的比率在[0.8, + 1.2]。评论家告诉演员采取的行动有多好,以及演员应该如何调整其网络。在策略更新后,新模型被发送到模拟器以获取更多经验。 - en: Deep Reinforcement Learning Summary with DeepRacer as an Example + id: totrans-186 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 以DeepRacer为例的深度强化学习总结 - en: 'To solve any problem with reinforcement learning, we need to work through the following steps:' + id: totrans-187 prefs: [] type: TYPE_NORMAL + zh: 要使用强化学习解决任何问题,我们需要按照以下步骤进行: - en: Define the goal. + id: totrans-188 prefs: - PREF_OL type: TYPE_NORMAL + zh: 定义目标。 - en: Select the input state. + id: totrans-189 prefs: - PREF_OL type: TYPE_NORMAL + zh: 选择输入状态。 - en: Define the action space. + id: totrans-190 prefs: - PREF_OL type: TYPE_NORMAL + zh: 定义动作空间。 - en: Construct the reward function. + id: totrans-191 prefs: - PREF_OL type: TYPE_NORMAL + zh: 构建奖励函数。 - en: Define the DNN architecture. 
+ id: totrans-192 prefs: - PREF_OL type: TYPE_NORMAL + zh: 定义DNN架构。 - en: Pick the reinforcement learning optimization algorithm (DQN, PPO, etc.). + id: totrans-193 prefs: - PREF_OL type: TYPE_NORMAL + zh: 选择强化学习优化算法(DQN、PPO等)。 - en: The fundamental manner in which a reinforcement learning model trains doesn’t change if we are building a self-driving car or building a robotic arm to grasp objects. This is a huge benefit of the paradigm because it allows us to focus @@ -1141,76 +1530,106 @@ of the internal combustion engine doesn’t influence too much the way we drive. Similarly, as long as we understand the knobs that each algorithm exposes, we can train a reinforcement learning model. + id: totrans-194 prefs: [] type: TYPE_NORMAL + zh: 训练强化学习模型的基本方式在构建自动驾驶汽车或构建机器人手臂抓取物体时并没有改变。这是该范式的一个巨大优势,因为它允许我们专注于更高层次的抽象。要使用强化学习解决问题,首要任务是将问题定义为MDP,然后定义输入状态和代理在给定环境中可以采取的一组动作,以及奖励函数。实际上,奖励函数可能是最难定义的部分之一,通常也是最重要的,因为这会影响我们的代理学习的策略。在定义了与环境相关的因素之后,我们可以专注于深度神经网络架构应该如何将输入映射到动作,然后选择强化学习算法(基于价值、基于策略、演员-评论家)进行学习。选择算法后,我们可以专注于控制算法行为的高级旋钮。当我们驾驶汽车时,我们倾向于关注控制,对内燃机的理解并不会太大程度上影响我们的驾驶方式。同样,只要我们了解每个算法暴露的旋钮,我们就可以训练强化学习模型。 - en: 'It’s time to bring this home. 
Let’s formulate the DeepRacer racing problem:' + id: totrans-195 prefs: [] type: TYPE_NORMAL + zh: 现在是时候结束了。让我们制定DeepRacer赛车问题: - en: 'Goal: To finish a lap by going around the track in the least amount of time' + id: totrans-196 prefs: - PREF_OL type: TYPE_NORMAL + zh: 目标:在最短时间内绕过赛道完成一圈 - en: 'Input: Grayscale 120x160 image' + id: totrans-197 prefs: - PREF_OL type: TYPE_NORMAL + zh: 输入:灰度 120x160 图像 - en: 'Actions: Discrete actions with combined speed and steering angle values' + id: totrans-198 prefs: - PREF_OL type: TYPE_NORMAL + zh: 动作:具有组合速度和转向角值的离散动作 - en: 'Rewards: Reward the car for being on the track, incentivize going faster, and prevent from doing a lot of corrections or zigzag behavior' + id: totrans-199 prefs: - PREF_OL type: TYPE_NORMAL + zh: 奖励:奖励汽车在赛道上行驶,鼓励更快行驶,并防止进行大量校正或曲线行为 - en: 'DNN architecture: Three-layer CNN + fully connected layer (Input → CNN → CNN → CNN → FC → Output)' + id: totrans-200 prefs: - PREF_OL type: TYPE_NORMAL + zh: DNN 架构:三层CNN + 全连接层(输入 → CNN → CNN → CNN → FC → 输出) - en: 'Optimization algorithm: PPO' + id: totrans-201 prefs: - PREF_OL type: TYPE_NORMAL + zh: 优化算法:PPO - en: 'Step 5: Improving Reinforcement Learning Models' + id: totrans-202 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 步骤5:改进强化学习模型 - en: We can now look to improve our models and also get insights into our model training. First, we focus on training improvements in the console. We have at our disposal the ability to change the reinforcement learning algorithm settings and neural network hyperparameters. + id: totrans-203 prefs: [] type: TYPE_NORMAL + zh: 我们现在可以着手改进我们的模型,并了解我们模型训练的见解。首先,我们专注于在控制台中进行训练改进。我们可以改变强化学习算法设置和神经网络超参数。 - en: Algorithm settings + id: totrans-204 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 算法设置 - en: This section specifies the hyperparameters that will be used by the reinforcement learning algorithm during training. Hyperparameters are used to improve training performance. 
+ id: totrans-205 prefs: [] type: TYPE_NORMAL + zh: 这一部分指定了在训练过程中强化学习算法将使用的超参数。超参数用于提高训练性能。 - en: Hyperparameters for the neural network + id: totrans-206 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 神经网络的超参数 - en: '[Table 17-2](part0020.html#description_and_guidance_for_tuneable_hy) presents the hyperparameters that are available to tune the neural network. Even though the default values have experimentally proven to be good, practically speaking the developer should focus on the batch size, number of epochs, and the learning rate as they were found to be the most influential in producing high-quality models; that is, getting the best out of our reward function.' + id: totrans-207 prefs: [] type: TYPE_NORMAL - en: Table 17-2\. Description and guidance for tuneable hyperparameters for the deep neural network + id: totrans-208 prefs: [] type: TYPE_NORMAL - en: '| **Parameter** | **Description** | **Tips** |' + id: totrans-209 prefs: [] type: TYPE_TB - en: '| --- | --- | --- |' + id: totrans-210 prefs: [] type: TYPE_TB - en: '| Batch size | The number recent of vehicle experiences sampled at random from @@ -1221,6 +1640,7 @@ training. | Use a larger batch size to promote more stable and smooth updates to the neural network weights, but be aware of the possibility that the training may be slower. |' + id: totrans-211 prefs: [] type: TYPE_TB - en: '| Number of epochs | An epoch represents one pass through all batches, where @@ -1229,6 +1649,7 @@ all batches one at a time, but repeat this process 10 times. | Use a larger number of epochs to promote more stable updates, but expect slower training. When the batch size is small, we can use a smaller number of epochs. |' + id: totrans-212 prefs: [] type: TYPE_TB - en: '| Learning rate | The learning rate controls how big the updates to the neural @@ -1237,6 +1658,7 @@ | A larger learning rate will lead to faster training, but it may struggle to converge. 
Smaller learning rates lead to stable convergence, but can take a long time to train. |' + id: totrans-213 prefs: [] type: TYPE_TB - en: '| Exploration | This refers to the method used to determine the trade-off between @@ -1244,11 +1666,13 @@ when we should stop exploring (randomly choosing actions) and when should we exploit the experience we have built up. | Since we will be using a discrete action space, we should always select “CategoricalParameters.” |' + id: totrans-214 prefs: [] type: TYPE_TB - en: '| Entropy | A degree of uncertainty, or randomness, added to the probability distribution of the action space. This helps promote the selection of random actions to explore the state/action space more broadly. |   |' + id: totrans-215 prefs: [] type: TYPE_TB - en: '| Discount factor | A factor that specifies how much the future rewards contribute @@ -1258,6 +1682,7 @@ an order of 10 future steps to make a move. With a discount factor of 0.999, the vehicle considers rewards from an order of 1,000 future steps to make a move. | The recommended discount factor values are 0.99, 0.999, and 0.9999. |' + id: totrans-216 prefs: [] type: TYPE_TB - en: '| Loss type | The loss type specifies the type of the objective function (cost @@ -1266,6 +1691,7 @@ the Huber loss takes smaller increments compared to the mean squared error loss. | When we have convergence problems, use the Huber loss type. When convergence is good and we want to train faster, use the mean squared error loss type. |' + id: totrans-217 prefs: [] type: TYPE_TB - en: '| Number of episodes between each training | This parameter controls how much @@ -1273,12 +1699,17 @@ complex problems that have more local maxima, a larger experience buffer is necessary to provide more uncorrelated data points. In this case, training will be slower but more stable. | The recommended values are 10, 20, and 40. 
|' + id: totrans-218 prefs: [] type: TYPE_TB + zh: '| 每次训练之间的剧集数量 | 此参数控制汽车在每次模型训练迭代之间应获取多少经验。对于具有更多局部最大值的更复杂问题,需要更大的经验缓冲区,以提供更多不相关的数据点。在这种情况下,训练会更慢但更稳定。| + 推荐值为10、20和40。|' - en: Insights into model training + id: totrans-219 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 模型训练见解 - en: After the model is trained, at a macro level the rewards over time graph, like the one in [Figure 17-10](part0020.html#training_graph_and_simulation_video_stre), gives us an idea of how the training progressed and the point at which the model @@ -1288,18 +1719,25 @@ that analyzes the training logs, and provides suggestions. In this section, we look at some of the more useful visualization tools that can be used to gain insight into our model’s training. + id: totrans-220 prefs: [] type: TYPE_NORMAL + zh: 在模型训练完成后,从宏观角度来看,随时间变化的奖励图表,就像[图17-10](part0020.html#training_graph_and_simulation_video_stre)中的图表,让我们了解了训练的进展以及模型开始收敛的点。但它并没有给我们一个收敛策略的指示,也没有让我们了解我们的奖励函数的行为,或者汽车速度可以改进的地方。为了获得更多见解,我们开发了一个[Jupyter + Notebook](https://oreil.ly/OWw_E),分析训练日志,并提供建议。在本节中,我们将看一些更有用的可视化工具,可以用来深入了解我们模型的训练。 - en: The log file records every step that the car takes. At each step it records the x,y location of the car, yaw (rotation), the steering angle, throttle, the progress from the start line, action taken, reward, the closest waypoint, and so on. + id: totrans-221 prefs: [] type: TYPE_NORMAL + zh: 日志文件记录了汽车所采取的每一步。在每一步中,它记录了汽车的x、y位置,偏航(旋转),转向角,油门,从起点开始的进度,采取的行动,奖励,最近的航路点等等。 - en: Heatmap visualization + id: totrans-222 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 热图可视化 - en: For complex reward functions, we might want to understand reward distribution on the track; that is, where did the reward function give rewards to the car on the track, and its magnitude. To visualize this, we can generate a heatmap, as @@ -1310,19 +1748,27 @@ the rest of the track is dark, indicating no rewards or rewards close to 0 were given when the car was in those locations. 
We can follow the [code](https://oreil.ly/n-w7G) to generate our own heatmap and investigate our reward distribution. + id: totrans-223 prefs: [] type: TYPE_NORMAL + zh: 对于复杂的奖励函数,我们可能想要了解赛道上的奖励分布;也就是说,奖励函数在赛道上给汽车奖励的位置以及大小。为了可视化这一点,我们可以生成一个热图,如[图17-18](part0020.html#heatmap_visualization_for_the_example_ce)所示。鉴于我们使用的奖励函数在赛道中心附近给出最大奖励,我们看到该区域很亮,中心线两侧的一个小带是红色的,表示奖励较少。最后,赛道的其余部分是黑暗的,表示当汽车在这些位置时没有奖励或接近0的奖励。我们可以按照[代码](https://oreil.ly/n-w7G)生成自己的热图,并调查我们的奖励分布。 - en: '![Heatmap visualization for the example centerline reward function](../images/00214.jpeg)' + id: totrans-224 prefs: [] type: TYPE_IMG + zh: '![示例中心线奖励函数的热图可视化](../images/00214.jpeg)' - en: Figure 17-18\. Heatmap visualization for the example centerline reward function + id: totrans-225 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-18. 示例中心线奖励函数的热图可视化 - en: Improving the speed of our model + id: totrans-226 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 改进我们模型的速度 - en: After we run an evaluation, we get the result with lap times. At this point, we might be curious as to the path taken by the car or where it failed, or perhaps the points where it slowed down. To understand all of these, we can use the code @@ -1334,17 +1780,23 @@ (left) indicates that the car didn’t really go fast at the straight part of the track; this provides us with an opportunity. We could incentivize the model by giving more rewards to go faster at this part of the track. 
+ id: totrans-227 prefs: [] type: TYPE_NORMAL + zh: 在我们运行评估之后,我们得到了圈速的结果。此时,我们可能会对汽车所采取的路径或失败的地方感到好奇,或者它减速的地方。为了了解所有这些,我们可以使用这个[笔记本](https://oreil.ly/tLOtk)中的代码来绘制一张赛道热图。在接下来的示例中,我们可以观察汽车在赛道周围导航时所采取的路径,并可视化它通过赛道上各个点的速度。这让我们了解了我们可以优化的部分。快速查看[图17-19](part0020.html#speed_heatmap_of_an_evaluation_runsemico)(左)表明汽车在赛道的直线部分并没有真正快;这为我们提供了一个机会。我们可以通过给予更多奖励来激励模型在赛道的这一部分更快地行驶。 - en: '![Speed heatmap of an evaluation run; (left) evaluation lap with the basic example reward function, (right) faster lap with modified reward function](../images/00180.jpeg)' + id: totrans-228 prefs: [] type: TYPE_IMG + zh: '![评估运行的速度热图;(左)使用基本示例奖励函数的评估圈,(右)使用修改后的奖励函数更快的圈](../images/00180.jpeg)' - en: Figure 17-19\. Speed heatmap of an evaluation run; (left) evaluation lap with the basic example reward function, (right) faster lap with modified reward function + id: totrans-229 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图17-19. 评估运行的速度热图;(左)使用基本示例奖励函数的评估圈,(右)使用修改后的奖励函数更快的圈 - en: 'In [Figure 17-19](part0020.html#speed_heatmap_of_an_evaluation_runsemico) (left), it seems like the car has a small zigzag pattern at times, so one improvement here could be that we penalize the car when it turns too much. In the code example @@ -1357,25 +1809,32 @@ just a glimpse in to improving the model. We can continue to iterate on our models and clock better lap times using these tools. All of the suggestions are incorporated in to the reward function example shown here:' + id: totrans-230 prefs: [] type: TYPE_NORMAL - en: '[PRE2]' + id: totrans-231 prefs: [] type: TYPE_PRE + zh: '[PRE2]' - en: Racing the AWS DeepRacer Car + id: totrans-232 prefs: - PREF_H1 type: TYPE_NORMAL - en: It’s time to bring what we’ve learned so far from the virtual to the physical world and race a real autonomous car. Toy sized, of course! + id: totrans-233 prefs: [] type: TYPE_NORMAL - en: If you own an AWS DeepRacer car, follow the instructions provided to test your model on a physical car. 
For those interested in buying the car, AWS DeepRacer is available for purchase on Amazon. + id: totrans-234 prefs: [] type: TYPE_NORMAL - en: Building the Track + id: totrans-235 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1383,12 +1842,15 @@ and with a physical AWS DeepRacer car. First, let’s build a makeshift track at home to race our model. For simplicity, we’ll build only part of a racetrack, but instructions on how to build an entire track are provided [here](https://oreil.ly/2xZQA). + id: totrans-236 prefs: [] type: TYPE_NORMAL - en: 'To build a track, you need the following materials:' + id: totrans-237 prefs: [] type: TYPE_NORMAL - en: 'For track borders:' + id: totrans-238 prefs: [] type: TYPE_NORMAL - en: We can create a track with tape that is about two-inches wide and white or off-white @@ -1396,17 +1858,21 @@ thickness of the track markers is set to be two inches. For a dark surface, use a white or off-white tape. For example, [1.88-inch width, pearl white duct tape](https://oreil.ly/2x6dl) or [1.88-inch (less sticky) masking tape](https://oreil.ly/Hn0AD). + id: totrans-239 prefs: [] type: TYPE_NORMAL - en: 'For track surface:' + id: totrans-240 prefs: [] type: TYPE_NORMAL - en: We can create a track on a dark-colored hard floor such as hardwood, carpet, concrete, or [asphalt felt](https://oreil.ly/7Q1ae). The latter mimics the real-world road surface with minimal reflection. + id: totrans-241 prefs: [] type: TYPE_NORMAL - en: AWS DeepRacer Single-Turn Track Template + id: totrans-242 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1416,84 +1882,103 @@ a straight line or make turns in one direction. *The angular dimensions for the turns specified are suggestive; we can use approximate measurements when laying down the track.* + id: totrans-243 prefs: [] type: TYPE_NORMAL - en: '![The test track layout](../images/00137.jpeg)' + id: totrans-244 prefs: [] type: TYPE_IMG - en: Figure 17-20\. 
The test track layout + id: totrans-245 prefs: - PREF_H6 type: TYPE_NORMAL - en: Running the Model on AWS DeepRacer + id: totrans-246 prefs: - PREF_H2 type: TYPE_NORMAL - en: To start the AWS DeepRacer vehicle on autonomous driving, we must upload at least one AWS DeepRacer model to our AWS DeepRacer vehicle. + id: totrans-247 prefs: [] type: TYPE_NORMAL - en: To upload a model, pick our trained model from the AWS DeepRacer console and then download the model artifacts from its Amazon S3 storage to a (local or network) drive that can be accessed from the computer. There’s an easy download model button provided on the model page. + id: totrans-248 prefs: [] type: TYPE_NORMAL - en: 'To upload a trained model to the vehicle, do the following:' + id: totrans-249 prefs: [] type: TYPE_NORMAL - en: From the device console’s main navigation pane, choose Models, as shown in [Figure 17-21](part0020.html#the_model_upload_menu_on_aws_deepracer_c). + id: totrans-250 prefs: - PREF_OL type: TYPE_NORMAL - en: '![The Model upload menu on the AWS DeepRacer car web console](../images/00101.jpeg)' + id: totrans-251 prefs: - PREF_IND type: TYPE_IMG - en: Figure 17-21\. The Model upload menu on the AWS DeepRacer car web console + id: totrans-252 prefs: - PREF_IND - PREF_H6 type: TYPE_NORMAL - en: On the Models page, above the Models list, choose Upload. + id: totrans-253 prefs: - PREF_OL type: TYPE_NORMAL - en: From the file picker, navigate to the drive or share where you downloaded the model artifacts and choose the model for upload. + id: totrans-254 prefs: - PREF_OL type: TYPE_NORMAL - en: When the model is uploaded successfully, it will be added to the Models list and can be loaded into the vehicle’s inference engine. 
+ id: totrans-255 prefs: - PREF_OL type: TYPE_NORMAL - en: Driving the AWS DeepRacer Vehicle Autonomously + id: totrans-256 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'To start autonomous driving, place the vehicle on a physical track and do the following:' + id: totrans-257 prefs: [] type: TYPE_NORMAL - en: 'Follow [the instructions](https://oreil.ly/OwzSz) to sign in to the vehicle’s device console, and then do the following for autonomous driving:' + id: totrans-258 prefs: - PREF_OL type: TYPE_NORMAL - en: On the “Control vehicle” page, in the Controls section, choose “Autonomous driving,” as shown in [Figure 17-22](part0020.html#driving_mode_selection_menu_on_aws_deepr). + id: totrans-259 prefs: - PREF_IND - PREF_OL type: TYPE_NORMAL - en: '![Driving mode selection menu on the AWS DeepRacer car web console](../images/00053.jpeg)' + id: totrans-260 prefs: - PREF_IND - PREF_IND type: TYPE_IMG - en: Figure 17-22\. Driving mode selection menu on the AWS DeepRacer car web console + id: totrans-261 prefs: - PREF_IND - PREF_IND @@ -1502,6 +1987,7 @@ - en: On the “Select a model” drop-down list ([Figure 17-23](part0020.html#model_selection_menu_on_aws_deepracer_ca)), choose an uploaded model, and then choose “Load model.” This will start loading the model into the inference engine. The process takes about 10 seconds to complete. + id: totrans-262 prefs: - PREF_OL type: TYPE_NORMAL @@ -1509,32 +1995,39 @@ maximum speed used in training the model. (Certain factors such as surface friction of the real track can reduce the maximum speed of the vehicle from the maximum speed used in the training. You’ll need to experiment to find the optimal setting.) + id: totrans-263 prefs: - PREF_OL type: TYPE_NORMAL - en: '![Model selection menu on AWS DeepRacer car web console](../images/00027.jpeg)' + id: totrans-264 prefs: - PREF_IND type: TYPE_IMG - en: Figure 17-23\. 
Model selection menu on AWS DeepRacer car web console + id: totrans-265 prefs: - PREF_IND - PREF_H6 type: TYPE_NORMAL - en: Choose “Start vehicle” to set the vehicle to drive autonomously. + id: totrans-266 prefs: - PREF_OL type: TYPE_NORMAL - en: Watch the vehicle drive on the physical track or the streaming video player on the device console. + id: totrans-267 prefs: - PREF_OL type: TYPE_NORMAL - en: To stop the vehicle, choose “Stop vehicle.” + id: totrans-268 prefs: - PREF_OL type: TYPE_NORMAL - en: Sim2Real transfer + id: totrans-269 prefs: - PREF_H3 type: TYPE_NORMAL @@ -1547,27 +2040,34 @@ even with great success in simulation, when the agent runs in the real world, we can experience failures. Here are some of the common approaches to handle the limitations of the simulator:' + id: totrans-270 prefs: [] type: TYPE_NORMAL - en: System identification + id: totrans-271 prefs: [] type: TYPE_NORMAL - en: Build a mathematical model of the real environment and calibrate the physical system to be as realistic as possible. + id: totrans-272 prefs: [] type: TYPE_NORMAL - en: Domain adaptation + id: totrans-273 prefs: [] type: TYPE_NORMAL - en: Map the simulation domain to the real environment, or vice versa, using techniques such as regularization, GANs, or transfer learning. + id: totrans-274 prefs: [] type: TYPE_NORMAL - en: Domain randomization + id: totrans-275 prefs: [] type: TYPE_NORMAL - en: Create a variety of simulation environments with randomized properties and train a model on data from all these environments. + id: totrans-276 prefs: [] type: TYPE_NORMAL - en: 'In the context of DeepRacer, the simulation fidelity is an approximate representation @@ -1583,32 +2083,40 @@ navigating using the white track extremity markers. 
Take a look at [Figure 17-24](part0020.html#gradcam_heatmaps_for_aws_deepracer_navig), which uses a technique called GradCAM to generate a heatmap of the most influential parts of the image to understand where the car is looking for navigation.' + id: totrans-277 prefs: [] type: TYPE_NORMAL - en: '![GradCAM heatmaps for AWS DeepRacer navigation](../images/00302.jpeg)' + id: totrans-278 prefs: [] type: TYPE_IMG - en: Figure 17-24\. GradCAM heatmaps for AWS DeepRacer navigation + id: totrans-279 prefs: - PREF_H6 type: TYPE_NORMAL - en: Further Exploration + id: totrans-280 prefs: - PREF_H1 type: TYPE_NORMAL - en: To continue with the adventure, you can become involved in various virtual and physical racing leagues. Following are a few options to explore. + id: totrans-281 prefs: [] type: TYPE_NORMAL - en: DeepRacer League + id: totrans-282 prefs: - PREF_H2 type: TYPE_NORMAL - en: 'AWS DeepRacer has a physical league at AWS summits and monthly virtual leagues. To race in the current virtual track and win prizes, visit the league page: [*https://console.aws.amazon.com/deepracer/home?region=us-east-1#leaderboards*](https://console.aws.amazon.com/deepracer/home?region=us-east-1#leaderboards).' + id: totrans-283 prefs: [] type: TYPE_NORMAL - en: Advanced AWS DeepRacer + id: totrans-284 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1617,9 +2125,11 @@ architectures. To enable this experience, we provide a [Jupyter Notebook](https://oreil.ly/Xto3S)–based setup with which you can provision the components needed to train your custom AWS DeepRacer models. + id: totrans-285 prefs: [] type: TYPE_NORMAL - en: AI Driving Olympics + id: totrans-286 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1634,16 +2144,20 @@ to get their code running relatively quickly. Since the first edition of the AI-DO, this competition has expanded to other top AI academic conferences and now includes challenges featuring both Duckietown and DeepRacer platforms. 
+ id: totrans-287 prefs: [] type: TYPE_NORMAL - en: '![Duckietown at the AI Driving Olympics](../images/00259.jpeg)' + id: totrans-288 prefs: [] type: TYPE_IMG - en: Figure 17-25\. Duckietown at the AI Driving Olympics + id: totrans-289 prefs: - PREF_H6 type: TYPE_NORMAL - en: DIY Robocars + id: totrans-290 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1652,9 +2166,11 @@ These are fun and engaging communities to try and work with other enthusiasts in the self-driving and autonomous robocar space. Many of them run monthly races and are great venues to physically race your robocar. + id: totrans-291 prefs: [] type: TYPE_NORMAL - en: Roborace + id: totrans-292 prefs: - PREF_H2 type: TYPE_NORMAL @@ -1666,6 +2182,7 @@ We are not talking about miniature-scaled cars anymore. These are full-sized, 1,350 kg, 4.8 meters, capable of reaching 200 mph (320 kph). The best part? We don’t need intricate hardware knowledge to race them. + id: totrans-293 prefs: [] type: TYPE_NORMAL - en: The real stars here are AI developers. Each team gets an identical car, so that @@ -1675,6 +2192,7 @@ with the powerful NVIDIA DRIVE platform capable of processing several teraflops per second. All we need to build is the algorithm to let the car stay on track, avoid getting into an accident, and of course, get ahead as fast as possible. + id: totrans-294 prefs: [] type: TYPE_NORMAL - en: To get started, Roborace provides a race simulation environment where precise @@ -1686,17 +2204,21 @@ lead to innovations and hopefully the learnings can be translated back to the autonomous car industry, making them safer, more reliable, and performant at the same time. + id: totrans-295 prefs: [] type: TYPE_NORMAL - en: '![Robocar from Roborace designed by Daniel Simon](../images/00221.jpeg)' + id: totrans-296 prefs: [] type: TYPE_IMG - en: Figure 17-26\. 
Robocar from Roborace designed by Daniel Simon (image courtesy of Roborace) + id: totrans-297 prefs: - PREF_H6 type: TYPE_NORMAL - en: Summary + id: totrans-298 prefs: - PREF_H1 type: TYPE_NORMAL @@ -1708,6 +2230,7 @@ teaching the car to drive in a simulator on its own. But why limit it to the virtual world? We brought the learnings into the physical world and raced a real car. And all it took was an hour!' + id: totrans-299 prefs: [] type: TYPE_NORMAL - en: Deep reinforcement learning is a relatively new field but an exciting one to @@ -1719,5 +2242,6 @@ It’s no surprise that a lot of machine learning researchers believe reinforcement learning has the potential to get us closer to *artificial general intelligence* and open up avenues that were previously considered science fiction. + id: totrans-300 prefs: [] type: TYPE_NORMAL