From a1ebb4dfac5493f1f16944b94808f6a149965792 Mon Sep 17 00:00:00 2001 From: wizardforcel <562826179@qq.com> Date: Thu, 8 Feb 2024 19:12:21 +0800 Subject: [PATCH] 2024-02-08 19:12:19 --- totrans/gen-dl_13.yaml | 45 ++++++ totrans/gen-dl_14.yaml | 342 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 387 insertions(+) diff --git a/totrans/gen-dl_13.yaml b/totrans/gen-dl_13.yaml index 33d04c0..6745916 100644 --- a/totrans/gen-dl_13.yaml +++ b/totrans/gen-dl_13.yaml @@ -753,51 +753,63 @@ id: totrans-100 prefs: [] type: TYPE_NORMAL + zh: 构成`TransformerBlock`层的子层在初始化函数中定义。 - en: '[![2](Images/2.png)](#co_transformers_CO2-2)' id: totrans-101 prefs: [] type: TYPE_NORMAL + zh: '[![2](Images/2.png)](#co_transformers_CO2-2)' - en: The causal mask is created to hide future keys from the query. id: totrans-102 prefs: [] type: TYPE_NORMAL + zh: 因果掩码被创建用来隐藏查询中的未来键。 - en: '[![3](Images/3.png)](#co_transformers_CO2-3)' id: totrans-103 prefs: [] type: TYPE_NORMAL + zh: '[![3](Images/3.png)](#co_transformers_CO2-3)' - en: The multihead attention layer is created, with the attention masks specified. id: totrans-104 prefs: [] type: TYPE_NORMAL + zh: 创建了多头注意力层,并指定了注意力掩码。 - en: '[![4](Images/4.png)](#co_transformers_CO2-4)' id: totrans-105 prefs: [] type: TYPE_NORMAL + zh: '[![4](Images/4.png)](#co_transformers_CO2-4)' - en: The first *add and normalization* layer. id: totrans-106 prefs: [] type: TYPE_NORMAL + zh: 第一个*加和归一化*层。 - en: '[![5](Images/5.png)](#co_transformers_CO2-5)' id: totrans-107 prefs: [] type: TYPE_NORMAL + zh: '[![5](Images/5.png)](#co_transformers_CO2-5)' - en: The feed-forward layers. id: totrans-108 prefs: [] type: TYPE_NORMAL + zh: 前馈层。 - en: '[![6](Images/6.png)](#co_transformers_CO2-6)' id: totrans-109 prefs: [] type: TYPE_NORMAL + zh: '[![6](Images/6.png)](#co_transformers_CO2-6)' - en: The second *add and normalization* layer. id: totrans-110 prefs: [] type: TYPE_NORMAL + zh: 第二个*加和归一化*层。 - en: Positional Encoding id: totrans-111 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 位置编码 - en: 'There is one final step to cover before we can put everything together to train our GPT model. You may have noticed that in the multihead attention layer, there is nothing that cares about the ordering of the keys. The dot product between @@ -808,16 +820,19 @@ id: totrans-112 prefs: [] type: TYPE_NORMAL + zh: 在我们能够将所有内容整合在一起训练我们的GPT模型之前,还有一个最后的步骤要解决。您可能已经注意到,在多头注意力层中,没有任何关心键的顺序的内容。每个键和查询之间的点积是并行计算的,而不是像递归神经网络那样顺序计算。这是一种优势(因为并行化效率提高),但也是一个问题,因为我们显然需要注意力层能够预测以下两个句子的不同输出: - en: The dog looked at the boy and …​ (barked?) id: totrans-113 prefs: - PREF_UL type: TYPE_NORMAL + zh: 狗看着男孩然后…(叫?) - en: The boy looked at the dog and …​ (smiled?) id: totrans-114 prefs: - PREF_UL type: TYPE_NORMAL + zh: 男孩看着狗然后…(微笑?) - en: To solve this problem, we use a technique called *positional encoding* when creating the inputs to the initial Transformer block. Instead of only encoding each token using a *token embedding*, we also encode the position of the token, @@ -825,6 +840,7 @@ id: totrans-115 prefs: [] type: TYPE_NORMAL + zh: 为了解决这个问题,我们在创建初始Transformer块的输入时使用一种称为*位置编码*的技术。我们不仅使用*标记嵌入*对每个标记进行编码,还使用*位置嵌入*对标记的位置进行编码。 - en: The *token embedding* is created using a standard `Embedding` layer to convert each token into a learned vector. 
We can create the *positional embedding* in the same way, using a standard `Embedding` layer to convert each integer position @@ -832,17 +848,20 @@ id: totrans-116 prefs: [] type: TYPE_NORMAL + zh: '*标记嵌入*是使用标准的`Embedding`层创建的,将每个标记转换为一个学习到的向量。我们可以以相同的方式创建*位置嵌入*,使用标准的`Embedding`层将每个整数位置转换为一个学习到的向量。' - en: Tip id: totrans-117 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 提示 - en: While GPT uses an `Embedding` layer to embed the position, the original Transformer paper used trigonometric functions—we’ll cover this alternative in [Chapter 11](ch11.xhtml#chapter_music), when we explore music generation. id: totrans-118 prefs: [] type: TYPE_NORMAL + zh: 虽然GPT使用`Embedding`层来嵌入位置,但原始Transformer论文使用三角函数——我们将在[第11章](ch11.xhtml#chapter_music)中介绍这种替代方法,当我们探索音乐生成时。 - en: To construct the joint token–position encoding, the token embedding is added to the positional embedding, as shown in [Figure 9-8](#positional_enc). This way, the meaning and position of each word in the sequence are captured in a single @@ -850,25 +869,30 @@ id: totrans-119 prefs: [] type: TYPE_NORMAL + zh: 为构建联合标记-位置编码,将标记嵌入加到位置嵌入中,如[图9-8](#positional_enc)所示。这样,序列中每个单词的含义和位置都被捕捉在一个向量中。 - en: '![](Images/gdl2_0908.png)' id: totrans-120 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0908.png)' - en: Figure 9-8\. The token embeddings are added to the positional embeddings to give the token position encoding id: totrans-121 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-8\. 将标记嵌入添加到位置嵌入以给出标记位置编码 - en: The code that defines our `TokenAndPositionEmbedding` layer is shown in [Example 9-5](#positional_embedding_code). id: totrans-122 prefs: [] type: TYPE_NORMAL + zh: 定义我们的`TokenAndPositionEmbedding`层的代码显示在[示例9-5](#positional_embedding_code)中。 - en: Example 9-5\. The `TokenAndPositionEmbedding` layer id: totrans-123 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 示例9-5\. `TokenAndPositionEmbedding`层 - en: '[PRE5]' id: totrans-124 prefs: [] @@ -878,31 +902,38 @@ id: totrans-125 prefs: [] type: TYPE_NORMAL + zh: '[![1](Images/1.png)](#co_transformers_CO3-1)' - en: The tokens are embedded using an `Embedding` layer. id: totrans-126 prefs: [] type: TYPE_NORMAL + zh: 标记使用`Embedding`层进行嵌入。 - en: '[![2](Images/2.png)](#co_transformers_CO3-2)' id: totrans-127 prefs: [] type: TYPE_NORMAL + zh: '[![2](Images/2.png)](#co_transformers_CO3-2)' - en: The positions of the tokens are also embedded using an `Embedding` layer. id: totrans-128 prefs: [] type: TYPE_NORMAL + zh: 标记的位置也使用`Embedding`层进行嵌入。 - en: '[![3](Images/3.png)](#co_transformers_CO3-3)' id: totrans-129 prefs: [] type: TYPE_NORMAL + zh: '[![3](Images/3.png)](#co_transformers_CO3-3)' - en: The output from the layer is the sum of the token and position embeddings. id: totrans-130 prefs: [] type: TYPE_NORMAL + zh: 该层的输出是标记和位置嵌入的总和。 - en: Training GPT id: totrans-131 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 训练GPT - en: Now we are ready to build and train our GPT model! To put everything together, we need to pass our input text through the token and position embedding layer, then through our Transformer block. The final output of the network is a simple @@ -910,35 +941,42 @@ id: totrans-132 prefs: [] type: TYPE_NORMAL + zh: 现在我们准备构建和训练我们的GPT模型!为了将所有内容整合在一起,我们需要将输入文本通过标记和位置嵌入层,然后通过我们的Transformer块。网络的最终输出是一个简单的具有softmax激活函数的`Dense`层,覆盖词汇表中的单词数量。 - en: Tip id: totrans-133 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 提示 - en: For simplicity, we will use just one Transformer block, rather than the 12 in the paper. 
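The `[PRE5]` placeholder above stands in for the book's actual listing. Purely as an illustration of the three callouts that accompany it (a token `Embedding`, a position `Embedding`, and their elementwise sum), a minimal Keras layer might be sketched as follows; the argument names `max_len`, `vocab_size`, and `embed_dim` are assumptions of this sketch, not the book's exact signature.

```python
import tensorflow as tf
from tensorflow.keras import layers


class TokenAndPositionEmbedding(layers.Layer):
    """Sketch: sum of a learned token embedding and a learned position embedding."""

    def __init__(self, max_len, vocab_size, embed_dim):  # hypothetical argument names
        super().__init__()
        # Learned vector for every token in the vocabulary
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        # Learned vector for every integer position 0 .. max_len - 1
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=embed_dim)

    def call(self, x):
        seq_len = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        # The layer output is the sum of the token and position embeddings
        return self.token_emb(x) + self.pos_emb(positions)
```

A layer along these lines is then the first block of the GPT model assembled in the next section.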
id: totrans-134 prefs: [] type: TYPE_NORMAL + zh: 为简单起见,我们将只使用一个Transformer块,而不是论文中的12个。 - en: The overall architecture is shown in [Figure 9-9](#transformer) and the equivalent code is provided in [Example 9-6](#transformer_code). id: totrans-135 prefs: [] type: TYPE_NORMAL + zh: 整体架构显示在[图9-9](#transformer)中,相应的代码在[示例9-6](#transformer_code)中提供。 - en: '![](Images/gdl2_0909.png)' id: totrans-136 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_0909.png)' - en: Figure 9-9\. The simplified GPT model architecture id: totrans-137 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图9-9\. 简化的GPT模型架构 - en: Example 9-6\. A GPT model in Keras id: totrans-138 prefs: - PREF_H5 type: TYPE_NORMAL + zh: 示例9-6\. 在Keras中的GPT模型 - en: '[PRE6]' id: totrans-139 prefs: [] @@ -948,30 +986,37 @@ id: totrans-140 prefs: [] type: TYPE_NORMAL + zh: '[![1](Images/1.png)](#co_transformers_CO4-1)' - en: The input is padded (with zeros). id: totrans-141 prefs: [] type: TYPE_NORMAL + zh: 输入被填充(用零填充)。 - en: '[![2](Images/2.png)](#co_transformers_CO4-2)' id: totrans-142 prefs: [] type: TYPE_NORMAL + zh: '[![2](Images/2.png)](#co_transformers_CO4-2)' - en: The text is encoded using a `TokenAndPositionEmbedding` layer. id: totrans-143 prefs: [] type: TYPE_NORMAL + zh: 文本使用`TokenAndPositionEmbedding`层进行编码。 - en: '[![3](Images/3.png)](#co_transformers_CO4-3)' id: totrans-144 prefs: [] type: TYPE_NORMAL + zh: '[![3](Images/3.png)](#co_transformers_CO4-3)' - en: The encoding is passed through a `TransformerBlock`. id: totrans-145 prefs: [] type: TYPE_NORMAL + zh: 编码通过`TransformerBlock`传递。 - en: '[![4](Images/4.png)](#co_transformers_CO4-4)' id: totrans-146 prefs: [] type: TYPE_NORMAL + zh: '[![4](Images/4.png)](#co_transformers_CO4-4)' - en: The transformed output is passed through a `Dense` layer with softmax activation to predict a distribution over the subsequent word. id: totrans-147 diff --git a/totrans/gen-dl_14.yaml b/totrans/gen-dl_14.yaml index 9b577d3..7bc60e3 100644 --- a/totrans/gen-dl_14.yaml +++ b/totrans/gen-dl_14.yaml @@ -1,92 +1,126 @@ - en: Chapter 10\. Advanced GANs + id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 第10章. 高级GANs - en: '[Chapter 4](ch04.xhtml#chapter_gan) introduced generative adversarial networks (GANs), a class of generative model that has produced state-of-the-art results across a wide variety of image generation tasks. The flexibility in the model architecture and training process has led academics and deep learning practitioners to find new ways to design and train GANs, leading to many different advanced *flavors* of the architecture that we shall explore in this chapter.' + id: totrans-1 prefs: [] type: TYPE_NORMAL + zh: '[第4章](ch04.xhtml#chapter_gan)介绍了生成对抗网络(GANs),这是一类生成模型,在各种图像生成任务中取得了最先进的结果。模型架构和训练过程的灵活性导致学术界和深度学习从业者找到了设计和训练GAN的新方法,从而产生了许多不同的高级架构,我们将在本章中探讨。' - en: Introduction + id: totrans-2 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 介绍 - en: Explaining all GAN developments and their repercussions in detail could easily fill another book. The [GAN Zoo repository](https://oreil.ly/Oy6bR) on GitHub contains over 500 distinct examples of GANs with linked papers, ranging from ABC-GAN to ZipNet-GAN! + id: totrans-3 prefs: [] type: TYPE_NORMAL + zh: 详细解释所有GAN发展及其影响可能需要另一本书。GitHub上的[GAN Zoo代码库](https://oreil.ly/Oy6bR)包含了500多个不同的GAN示例,涵盖了从ABC-GAN到ZipNet-GAN的各种GAN,并附有相关论文链接! - en: In this chapter we will cover the main GANs that have been influential in the field, including a detailed explanation of the model architecture and training process for each. 
+ id: totrans-4 prefs: [] type: TYPE_NORMAL + zh: 在本章中,我们将介绍对该领域产生影响的主要GANs,包括对每个模型的模型架构和训练过程的详细解释。 - en: 'We will first explore three important models from NVIDIA that have pushed the boundaries of image generation: ProGAN, StyleGAN, and StyleGAN2\. We will analyze each of these models in enough detail to understand the fundamental concepts that underpin the architectures and see how they have each built on ideas from earlier papers.' + id: totrans-5 prefs: [] type: TYPE_NORMAL + zh: 我们将首先探讨NVIDIA推动图像生成边界的三个重要模型:ProGAN、StyleGAN和StyleGAN2。我们将对每个模型进行足够详细的分析,以理解支撑架构的基本概念,并看看它们如何各自建立在早期论文的想法基础上。 - en: 'We will also explore two other important GAN architectures that incorporate attention: the Self-Attention GAN (SAGAN) and BigGAN, which built on many of the ideas in the SAGAN paper. We have already seen the power of the attention mechanism in the context of Transformers in [Chapter 9](ch09.xhtml#chapter_transformer).' + id: totrans-6 prefs: [] type: TYPE_NORMAL + zh: 我们还将探讨另外两种重要的GAN架构,包括引入注意力机制的Self-Attention GAN(SAGAN)和BigGAN,后者在SAGAN论文中的许多想法基础上构建。我们已经在[第9章](ch09.xhtml#chapter_transformer)中看到了注意力机制在变换器中的威力。 - en: Lastly, we will cover VQ-GAN and ViT VQ-GAN, which incorporate a blend of ideas from variational autoencoders, Transformers, and GANs. VQ-GAN is a key component of Google’s state-of-the-art text-to-image generation model Muse.^([1](ch10.xhtml#idm45387005226448)) We will explore so-called multimodal models in more detail in [Chapter 13](ch13.xhtml#chapter_multimodal). + id: totrans-7 prefs: [] type: TYPE_NORMAL + zh: 最后,我们将介绍VQ-GAN和ViT VQ-GAN,它们融合了变分自动编码器、变换器和GAN的思想。VQ-GAN是谷歌最先进的文本到图像生成模型Muse的关键组成部分。我们将在[第13章](ch13.xhtml#chapter_multimodal)中更详细地探讨所谓的多模型。 - en: Training Your Own Models + id: totrans-8 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 训练您自己的模型 - en: For conciseness I have chosen not to include code to directly build these models in the code repository for this book, but instead will point to publicly available implementations where possible, so that you can train your own versions if you wish. + id: totrans-9 prefs: [] type: TYPE_NORMAL + zh: 为了简洁起见,我选择不在本书的代码库中直接构建这些模型的代码,而是将尽可能指向公开可用的实现,以便您可以根据需要训练自己的版本。 - en: ProGAN + id: totrans-10 prefs: - PREF_H1 type: TYPE_NORMAL + zh: ProGAN - en: ProGAN is a technique developed by NVIDIA Labs in 2017^([2](ch10.xhtml#idm45387005216528)) to improve both the speed and stability of GAN training. Instead of immediately training a GAN on full-resolution images, the ProGAN paper suggests first training the generator and discriminator on low-resolution images of, say, 4 × 4 pixels and then incrementally adding layers throughout the training process to increase the resolution. + id: totrans-11 prefs: [] type: TYPE_NORMAL + zh: ProGAN是NVIDIA实验室在2017年开发的一种技术,旨在提高GAN训练的速度和稳定性。ProGAN论文建议,不要立即在全分辨率图像上训练GAN,而是首先在低分辨率图像(例如4×4像素)上训练生成器和鉴别器,然后在训练过程中逐步添加层以增加分辨率。 - en: Let’s take a look at the concept of *progressive training* in more detail. + id: totrans-12 prefs: [] type: TYPE_NORMAL + zh: 让我们更详细地了解*渐进式训练*的概念。 - en: Training Your Own ProGAN + id: totrans-13 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 训练您自己的ProGAN - en: There is an excellent tutorial by Bharath K on training your own ProGAN using Keras available on the [Paperspace blog](https://oreil.ly/b2CJm). Bear in mind that training a ProGAN to achieve the results from the paper requires a significant amount of computing power. 
+ id: totrans-14 prefs: [] type: TYPE_NORMAL + zh: Bharath K在[Paperspace博客](https://oreil.ly/b2CJm)上提供了一个关于使用Keras训练自己的ProGAN的优秀教程。请记住,训练ProGAN以达到论文中的结果需要大量的计算能力。 - en: Progressive Training + id: totrans-15 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 渐进式训练 - en: As always with GANs, we build two independent networks, the generator and discriminator, with a fight for dominance taking place during the training process. + id: totrans-16 prefs: [] type: TYPE_NORMAL + zh: 与GANs一样,我们构建两个独立的网络,生成器和鉴别器,在训练过程中进行统治之争。 - en: In a normal GAN, the generator always outputs full-resolution images, even in the early stages of training. It is reasonable to think that this strategy might not be optimal—the generator might be slow to learn high-level structures in the @@ -94,60 +128,84 @@ images. Wouldn’t it be better to first train a lightweight GAN to output accurate low-resolution images and then see if we can build on this to gradually increase the resolution? + id: totrans-17 prefs: [] type: TYPE_NORMAL + zh: 在普通的GAN中,生成器总是输出全分辨率图像,即使在训练的早期阶段也是如此。可以合理地认为,这种策略可能不是最佳的——生成器可能在训练的早期阶段学习高级结构较慢,因为它立即在复杂的高分辨率图像上操作。首先训练一个轻量级的GAN以输出准确的低分辨率图像,然后逐渐增加分辨率,这样做会更好吗? - en: This simple idea leads us to *progressive training*, one of the key contributions of the ProGAN paper. The ProGAN is trained in stages, starting with a training set that has been condensed down to 4 × 4–pixel images using interpolation, as shown in [Figure 10-1](Images/#condensed_images). + id: totrans-18 prefs: [] type: TYPE_NORMAL + zh: 这个简单的想法引导我们进入*渐进式训练*,这是ProGAN论文的一个关键贡献。ProGAN分阶段训练,从一个已经通过插值压缩到4×4像素图像的训练集开始,如[图10-1](Images/#condensed_images)所示。 - en: '![](Images/gdl2_1001.png)' + id: totrans-19 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1001.png)' - en: Figure 10-1\. Images in the dataset can be compressed to lower resolution using interpolation + id: totrans-20 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-1。数据集中的图像可以使用插值压缩到较低分辨率 - en: We can then initially train the generator to transform a latent input noise vector z (say, of length 512) into an image of shape 4 × 4 × 3\. The matching discriminator will need to transform an input image of size 4 × 4 × 3 into a single scalar prediction. The network architectures for this first step are shown in [Figure 10-2](#progan_4). + id: totrans-21 prefs: [] type: TYPE_NORMAL + zh: 然后,我们可以最初训练生成器,将潜在输入噪声向量z(比如长度为512)转换为形状为4×4×3的图像。匹配的鉴别器需要将大小为4×4×3的输入图像转换为单个标量预测。这第一步的网络架构如[图10-2](#progan_4)所示。 - en: The blue box in the generator represents the convolutional layer that converts the set of feature maps into an RGB image (`toRGB`), and the blue box in the discriminator represents the convolutional layer that converts the RGB images into a set of feature maps (`fromRGB`). + id: totrans-22 prefs: [] type: TYPE_NORMAL + zh: 生成器中的蓝色框表示将特征图转换为RGB图像的卷积层(`toRGB`),鉴别器中的蓝色框表示将RGB图像转换为一组特征图的卷积层(`fromRGB`)。 - en: '![](Images/gdl2_1002.png)' + id: totrans-23 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1002.png)' - en: Figure 10-2\. The generator and discriminator architectures for the first stage of the ProGAN training process + id: totrans-24 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-2。ProGAN训练过程的第一阶段的生成器和鉴别器架构 - en: In the paper, the authors train this pair of networks until the discriminator has seen 800,000 real images. We now need to understand how the generator and discriminator are expanded to work with 8 × 8–pixel images. 
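As a small, hedged illustration of the interpolation step from Figure 10-1 (not code from the ProGAN paper), a training set can be condensed down to the 4 × 4 resolution used in the first stage roughly as below; the function name `condense_images` and the choice of bilinear resizing are assumptions of this sketch.

```python
import tensorflow as tf


def condense_images(images, resolution=4):
    """Sketch: shrink a batch of (batch, height, width, channels) images down to
    resolution x resolution pixels by interpolation, as in Figure 10-1."""
    return tf.image.resize(images, (resolution, resolution), method="bilinear")


# Hypothetical usage: condense a dummy batch for the first 4 x 4 training stage
batch = tf.random.uniform((8, 1024, 1024, 3))
low_res = condense_images(batch, resolution=4)  # shape (8, 4, 4, 3)
```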
+ id: totrans-25 prefs: [] type: TYPE_NORMAL + zh: 在论文中,作者训练这对网络,直到鉴别器看到了800,000张真实图像。现在我们需要了解如何扩展生成器和鉴别器以处理8×8像素图像。 - en: To expand the generator and discriminator, we need to blend in additional layers. This is managed in two phases, transition and stabilization, as shown in [Figure 10-3](#progan_training_gen). + id: totrans-26 prefs: [] type: TYPE_NORMAL + zh: 为了扩展生成器和鉴别器,我们需要融入额外的层。这在两个阶段中进行,过渡和稳定,如[图10-3](#progan_training_gen)所示。 - en: '![](Images/gdl2_1003.png)' + id: totrans-27 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1003.png)' - en: Figure 10-3\. The ProGAN generator training process, expanding the network from 4 × 4 images to 8 × 8 (dotted lines represent the rest of the network, not shown) + id: totrans-28 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-3。ProGAN生成器训练过程,将网络从4×4图像扩展到8×8(虚线代表网络的其余部分,未显示) - en: Let’s first look at the generator. During the *transition phase*, new upsampling and convolutional layers are appended to the existing network, with a residual connection set up to maintain the output from the existing trained `toRGB` layer. @@ -155,70 +213,96 @@ that is gradually increased from 0 to 1 throughout the transition phase to allow more of the new `toRGB` output through and less of the existing `toRGB` layer. This is to avoid a *shock* to the network as the new layers take over. + id: totrans-29 prefs: [] type: TYPE_NORMAL + zh: 让我们首先看一下生成器。在*过渡阶段*中,新的上采样和卷积层被附加到现有网络中,建立了一个残差连接以保持现有训练过的`toRGB`层的输出。关键的是,新层最初使用一个参数α进行掩蔽,该参数在整个过渡阶段逐渐从0增加到1,以允许更多新的`toRGB`输出通过,减少现有的`toRGB`层。这是为了避免网络在新层接管时出现*冲击*。 - en: Eventually, there is no flow through the old `toRGB` layer and the network enters the *stabilization phase*—a further period of training where the network can fine-tune the output, without any flow through the old `toRGB` layer. + id: totrans-30 prefs: [] type: TYPE_NORMAL + zh: 最终,旧的`toRGB`层不再有输出流,网络进入*稳定阶段*——进一步的训练期间,网络可以微调输出,而不经过旧的`toRGB`层。 - en: The discriminator uses a similar process, as shown in [Figure 10-4](#progan_training_dis). + id: totrans-31 prefs: [] type: TYPE_NORMAL + zh: 鉴别器使用类似的过程,如[图10-4](#progan_training_dis)所示。 - en: '![](Images/gdl2_1004.png)' + id: totrans-32 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1004.png)' - en: Figure 10-4\. The ProGAN discriminator training process, expanding the network from 4 × 4 images to 8 × 8 (dotted lines represent the rest of the network, not shown) + id: totrans-33 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-4。ProGAN鉴别器训练过程,将网络从4×4图像扩展到8×8(虚线代表网络的其余部分,未显示) - en: Here, we need to blend in additional downscaling and convolutional layers. Again, the layers are injected into the network—this time at the start of the network, just after the input image. The existing `fromRGB` layer is connected via a residual connection and gradually phased out as the new layers take over during the transition phase. The stabilization phase allows the discriminator to fine-tune using the new layers. + id: totrans-34 prefs: [] type: TYPE_NORMAL + zh: 在这里,我们需要融入额外的降采样和卷积层。同样,这些层被注入到网络中——这次是在网络的开始部分,就在输入图像之后。现有的`fromRGB`层通过残差连接连接,并在过渡阶段逐渐淡出,随着新层在过渡阶段接管时逐渐淡出。稳定阶段允许鉴别器使用新层进行微调。 - en: All transition and stabilization phases last until the discriminator has been shown 800,000 real images. Note that even through the network is trained progressively, no layers are *frozen*. Throughout the training process, all layers remain fully trainable. 
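To make the α fade-in concrete, here is a minimal sketch of how the generator output might be blended during a transition phase. It is written under my own assumptions about the exact layer ordering and upsampling method, and it is not the authors' code.

```python
import tensorflow as tf


def generator_transition_output(x, old_to_rgb, new_block, new_to_rgb, alpha):
    """Sketch of the ProGAN transition-phase blend for the generator.

    x          : feature maps produced by the existing lower-resolution layers
    old_to_rgb : the previously trained toRGB convolution
    new_block  : the newly appended upsampling + convolution layers
    new_to_rgb : the new toRGB convolution at the doubled resolution
    alpha      : blend factor, ramped from 0 to 1 over the transition phase
    """
    new_size = tf.shape(x)[1:3] * 2
    # Old path: existing toRGB output, naively upsampled to the new resolution
    old_rgb = tf.image.resize(old_to_rgb(x), new_size, method="nearest")
    # New path: the freshly added layers followed by the new toRGB convolution
    new_rgb = new_to_rgb(new_block(x))
    # At alpha = 0 only the trained path contributes; at alpha = 1 only the new path
    return (1.0 - alpha) * old_rgb + alpha * new_rgb
```

The discriminator blend in Figure 10-4 mirrors this idea, with the residual `fromRGB` connection applied just after the input image instead.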
+ id: totrans-35 prefs: [] type: TYPE_NORMAL + zh: 所有过渡和稳定阶段持续到鉴别器已经看到了800,000张真实图像。请注意,即使网络是渐进训练的,也没有层被*冻结*。在整个训练过程中,所有层都保持完全可训练。 - en: This process continues, growing the GAN from 4 × 4 images to 8 × 8, then 16 × 16, 32 × 32, and so on, until it reaches full resolution (1,024 × 1,024), as shown in [Figure 10-5](#progan). + id: totrans-36 prefs: [] type: TYPE_NORMAL + zh: 这个过程继续进行,将GAN从4×4图像扩展到8×8,然后16×16,32×32,依此类推,直到达到完整分辨率(1,024×1,024),如[图10-5](#progan)所示。 - en: '![](Images/gdl2_1005.png)' + id: totrans-37 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1005.png)' - en: 'Figure 10-5\. The ProGAN training mechanism, and some example generated faces (source: [Karras et al., 2017](https://arxiv.org/abs/1710.10196))' + id: totrans-38 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-5。ProGAN训练机制,以及一些示例生成的人脸(来源:[Karras等人,2017](https://arxiv.org/abs/1710.10196)) - en: The overall structure of the generator and discriminator after the full progressive training process is complete is shown in [Figure 10-6](#progan_network_diagram). + id: totrans-39 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1006.png)' + id: totrans-40 prefs: [] type: TYPE_IMG - en: 'Figure 10-6\. The ProGAN generator and discriminator used to generate 1,024 × 1,024–pixel CelebA faces (source: [Karras et al., 2018](https://arxiv.org/abs/1812.04948))' + id: totrans-41 prefs: - PREF_H6 type: TYPE_NORMAL - en: The paper also makes several other important contributions, namely minibatch standard deviation, equalized learning rates, and pixelwise normalization, which are described briefly in the following sections. + id: totrans-42 prefs: [] type: TYPE_NORMAL - en: Minibatch standard deviation + id: totrans-43 prefs: - PREF_H3 type: TYPE_NORMAL @@ -230,9 +314,11 @@ can use this feature to distinguish the fake batches from the real batches! Therefore, the generator is incentivized to ensure it generates a similar amount of variety as is present in the real training data. + id: totrans-44 prefs: [] type: TYPE_NORMAL - en: Equalized learning rates + id: totrans-45 prefs: - PREF_H3 type: TYPE_NORMAL @@ -243,6 +329,7 @@ layer. This way, layers with a greater number of inputs will be initialized with weights that have a smaller deviation from zero, which generally improves the stability of the training process. + id: totrans-46 prefs: [] type: TYPE_NORMAL - en: The authors of the ProGAN paper found that this was causing problems when used @@ -254,6 +341,7 @@ more inputs). It was found that this causes an imbalance between the speed of training of the different layers of the generator and discriminator in ProGAN, so they used *equalized learning rates* to solve this problem. + id: totrans-47 prefs: [] type: TYPE_NORMAL - en: In ProGAN, weights are initialized using a simple standard Gaussian, regardless @@ -262,9 +350,11 @@ the optimizer sees each weight as having approximately the same dynamic range, so it applies the same learning rate. It is only when the layer is called that the weight is scaled by the factor from the He initializer. + id: totrans-48 prefs: [] type: TYPE_NORMAL - en: Pixelwise normalization + id: totrans-49 prefs: - PREF_H3 type: TYPE_NORMAL @@ -273,9 +363,11 @@ a unit length and helps to prevent the signal from spiraling out of control as it propagates through the network. The pixelwise normalization layer has no trainable weights. + id: totrans-50 prefs: [] type: TYPE_NORMAL - en: Outputs + id: totrans-51 prefs: - PREF_H2 type: TYPE_NORMAL @@ -284,25 +376,31 @@ in [Figure 10-7](#progan_examples). 
This demonstrated the power of ProGAN over earlier GAN architectures and paved the way for future iterations such as StyleGAN and StyleGAN2, which we shall explore in the next sections. + id: totrans-52 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1007.png)' + id: totrans-53 prefs: [] type: TYPE_IMG - en: 'Figure 10-7\. Generated examples from a ProGAN trained progressively on the LSUN dataset at 256 × 256 resolution (source: [Karras et al., 2017](https://arxiv.org/abs/1710.10196))' + id: totrans-54 prefs: - PREF_H6 type: TYPE_NORMAL - en: StyleGAN + id: totrans-55 prefs: - PREF_H1 type: TYPE_NORMAL - en: StyleGAN^([3](ch10.xhtml#idm45387005140128)) is a GAN architecture from 2018 that builds on the earlier ideas in the ProGAN paper. In fact, the discriminator is identical; only the generator is changed. + id: totrans-56 prefs: [] type: TYPE_NORMAL + zh: StyleGAN^([3](ch10.xhtml#idm45387005140128))是2018年的一个GAN架构,建立在ProGAN论文中的早期思想基础上。实际上,鉴别器是相同的;只有生成器被改变。 - en: Often when training GANs it is difficult to separate out vectors in the latent space corresponding to high-level attributes—they are frequently *entangled*, meaning that adjusting an image in the latent space to give a face more freckles, @@ -310,40 +408,56 @@ generates fantastically realistic images, it is no exception to this general rule. We would ideally like to have full control of the style of the image, and this requires a disentangled separation of features in the latent space. + id: totrans-57 prefs: [] type: TYPE_NORMAL + zh: 通常在训练GAN时,很难将潜在空间中对应于高级属性的向量分离出来——它们经常是*纠缠在一起*,这意味着调整潜在空间中的图像以使脸部更多雀斑,例如,可能也会无意中改变背景颜色。虽然ProGAN生成了极其逼真的图像,但它也不例外。我们理想情况下希望完全控制图像的风格,这需要在潜在空间中对特征进行分离。 - en: 'StyleGAN achieves this by explicitly injecting style vectors into the network at different points: some that control high-level features (e.g., face orientation) and some that control low-level details (e.g., the way the hair falls across the forehead).' + id: totrans-58 prefs: [] type: TYPE_NORMAL + zh: StyleGAN通过在网络的不同点显式注入风格向量来实现这一点:一些控制高级特征(例如,面部方向)的向量,一些控制低级细节(例如,头发如何落在额头上)的向量。 - en: The overall architecture of the StyleGAN generator is shown in [Figure 10-8](#stylegan_arch). Let’s walk through this architecture step by step, starting with the mapping network. + id: totrans-59 prefs: [] type: TYPE_NORMAL + zh: StyleGAN生成器的整体架构如[图10-8](#stylegan_arch)所示。让我们逐步走过这个架构,从映射网络开始。 - en: '![](Images/gdl2_1008.png)' + id: totrans-60 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1008.png)' - en: 'Figure 10-8\. The StyleGAN generator architecture (source: [Karras et al., 2018](https://arxiv.org/abs/1812.04948))' + id: totrans-61 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-8。StyleGAN生成器架构(来源:[Karras et al., 2018](https://arxiv.org/abs/1812.04948)) - en: Training Your Own StyleGAN + id: totrans-62 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 训练您自己的StyleGAN - en: There is an excellent tutorial by Soon-Yau Cheong on training your own StyleGAN using Keras available on the [Keras website](https://oreil.ly/MooSe). Bear in mind that training a StyleGAN to achieve the results from the paper requires a significant amount of computing power. + id: totrans-63 prefs: [] type: TYPE_NORMAL + zh: Soon-Yau Cheong在[Keras网站](https://oreil.ly/MooSe)上提供了一个关于使用Keras训练自己的StyleGAN的优秀教程。请记住,要实现论文中的结果,训练StyleGAN需要大量的计算资源。 - en: The Mapping Network + id: totrans-64 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 映射网络 - en: The *mapping network* f is a simple feed-forward network that converts the input noise 𝐳 𝒵 into a different @@ -351,17 +465,26 @@ 𝒲 . 
This gives the generator the opportunity to disentangle the noisy input vector into distinct factors of variation, which can be easily picked up by the downstream style-generating layers. + id: totrans-65 prefs: [] type: TYPE_NORMAL + zh: '*映射网络* f 是一个简单的前馈网络,将输入噪声 𝐳 𝒵 + 转换为不同的潜在空间 𝐰 + 𝒲。这使得生成器有机会将嘈杂的输入向量分解为不同的变化因素,这些因素可以被下游的风格生成层轻松捕捉到。' - en: The point of doing this is to separate out the process of choosing a style for the image (the mapping network) from the generation of an image with a given style (the synthesis network). + id: totrans-66 prefs: [] type: TYPE_NORMAL + zh: 这样做的目的是将图像的风格选择过程(映射网络)与生成具有给定风格的图像的过程(合成网络)分开。 - en: The Synthesis Network + id: totrans-67 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 合成网络 - en: 'The synthesis network is the generator of the actual image with a given style, as provided by the mapping network. As can be seen from [Figure 10-8](#stylegan_arch), the style vector 𝐰 is injected into the @@ -374,16 +497,26 @@ the specific style that should be injected at this point in the network—that is, they tell the synthesis network how to adjust the feature maps to move the generated image in the direction of the specified style.' + id: totrans-68 prefs: [] type: TYPE_NORMAL + zh: 合成网络是生成具有给定风格的实际图像的生成器,由映射网络提供。如[图10-8](#stylegan_arch)所示,风格向量 𝐰 被注入到合成网络的不同点,每次通过不同的密集连接层 A i,生成两个向量:一个偏置向量 𝐲 b,i + 和一个缩放向量 𝐲 s,i。这些向量定义了应该在网络中的这一点注入的特定风格,也就是告诉合成网络如何调整特征图以使生成的图像朝着指定的风格方向移动。 - en: This adjustment is achieved through *adaptive instance normalization* (AdaIN) layers. + id: totrans-69 prefs: [] type: TYPE_NORMAL + zh: 通过*自适应实例归一化*(AdaIN)层实现这种调整。 - en: Adaptive instance normalization + id: totrans-70 prefs: - PREF_H3 type: TYPE_NORMAL + zh: 自适应实例归一化 - en: 'An AdaIN layer is a type of neural network layer that adjusts the mean and variance of each feature map 𝐱 i with a reference style bias 𝐱 i -μ(𝐱 i ) σ(𝐱 i ) + 𝐲 b,i + id: totrans-72 prefs: [] type: TYPE_NORMAL + zh: AdaIN + ( 𝐱 i , 𝐲 ) + = 𝐲 s,i + 𝐱 i -μ(𝐱 + i ) σ(𝐱 + i ) + 𝐲 b,i - en: The adaptive instance normalization layers ensure that the style vectors that are injected into each layer only affect features at that layer, by preventing any style information from leaking through between layers. The authors show that this results in the latent vectors 𝐰 being significantly more disentangled than the original 𝐳 vectors. + id: totrans-73 prefs: [] type: TYPE_NORMAL - en: Since the synthesis network is based on the ProGAN architecture, it is trained @@ -424,9 +571,11 @@ the latent vector 𝐰 , but we can also switch the 𝐰 vector at different points in the synthesis network to change the style at a variety of levels of detail. + id: totrans-74 prefs: [] type: TYPE_NORMAL - en: Style mixing + id: totrans-75 prefs: - PREF_H3 type: TYPE_NORMAL @@ -445,9 +594,11 @@ w bold 2 right-parenthesis">𝐰 2 ) is chosen at random, to break any possible correlation between the vectors. + id: totrans-76 prefs: [] type: TYPE_NORMAL - en: Stochastic variation + id: totrans-77 prefs: - PREF_H3 type: TYPE_NORMAL @@ -456,26 +607,32 @@ for stochastic details such as the placement of individual hairs, or the background behind the face. Again, the depth at which the noise is injected affects the coarseness of the impact on the image. + id: totrans-78 prefs: [] type: TYPE_NORMAL - en: This also means that the initial input to the synthesis network can simply be a learned constant, rather than additional noise. There is enough stochasticity already present in the style inputs and the noise inputs to generate sufficient variation in the images. 
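The AdaIN equation given earlier translates almost line for line into code. The following is a hedged sketch rather than the StyleGAN reference implementation; `y_s` and `y_b` stand for the per-channel scale and bias vectors produced by the densely connected A layers, and the channels-last tensor layout is an assumption of the sketch.

```python
import tensorflow as tf


def adain(x, y_s, y_b, eps=1e-8):
    """Sketch of adaptive instance normalization.

    x   : feature maps of shape (batch, height, width, channels)
    y_s : per-channel style scale of shape (batch, channels)
    y_b : per-channel style bias of shape (batch, channels)
    """
    # Instance statistics: mean and std per sample and per channel, over the spatial dims
    mu = tf.reduce_mean(x, axis=[1, 2], keepdims=True)
    sigma = tf.math.reduce_std(x, axis=[1, 2], keepdims=True)
    normalized = (x - mu) / (sigma + eps)
    # Broadcast the style scale and bias over the spatial dimensions
    y_s = y_s[:, None, None, :]
    y_b = y_b[:, None, None, :]
    return y_s * normalized + y_b
```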
+ id: totrans-79 prefs: [] type: TYPE_NORMAL - en: Outputs from StyleGAN + id: totrans-80 prefs: - PREF_H2 type: TYPE_NORMAL - en: '[Figure 10-9](#stylegan_w) shows StyleGAN in action.' + id: totrans-81 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1009.png)' + id: totrans-82 prefs: [] type: TYPE_IMG - en: 'Figure 10-9\. Merging styles between two generated images at different levels of detail (source: [Karras et al., 2018](https://arxiv.org/abs/1812.04948))' + id: totrans-83 prefs: - PREF_H6 type: TYPE_NORMAL @@ -488,9 +645,11 @@ A. However, if the switch happens later, only fine-grained detail is carried across from source B, such as colors and microstructure of the face, while the coarse features from source A are preserved. + id: totrans-84 prefs: [] type: TYPE_NORMAL - en: StyleGAN2 + id: totrans-85 prefs: - PREF_H1 type: TYPE_NORMAL @@ -500,22 +659,27 @@ do not suffer as greatly from *artifacts*—water droplet–like areas of the image that were found to be caused by the adaptive instance normalization layers in StyleGAN, as shown in [Figure 10-10](#artifacts_stylegan). + id: totrans-86 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1010.png)' + id: totrans-87 prefs: [] type: TYPE_IMG - en: 'Figure 10-10\. An artifact in a StyleGAN-generated image of a face (source: [Karras et al., 2019](https://arxiv.org/abs/1912.04958))' + id: totrans-88 prefs: - PREF_H6 type: TYPE_NORMAL - en: Both the generator and the discriminator in StyleGAN2 are different from the StyleGAN. In the next sections we will explore the key differences between the architectures. + id: totrans-89 prefs: [] type: TYPE_NORMAL - en: Training Your Own StyleGAN2 + id: totrans-90 prefs: - PREF_H1 type: TYPE_NORMAL @@ -523,9 +687,11 @@ on [GitHub](https://oreil.ly/alB6w). Bear in mind that training a StyleGAN2 to achieve the results from the paper requires a significant amount of computing power. + id: totrans-91 prefs: [] type: TYPE_NORMAL - en: Weight Modulation and Demodulation + id: totrans-92 prefs: - PREF_H2 type: TYPE_NORMAL @@ -536,6 +702,7 @@ by the modulation and demodulation steps in StyleGAN2 at runtime. In comparison, the AdaIN layers of StyleGAN operate on the image tensor as it flows through the network. + id: totrans-93 prefs: [] type: TYPE_NORMAL - en: The AdaIN layer in StyleGAN is simply an instance normalization followed by @@ -544,12 +711,15 @@ layers at runtime, rather than the output from the convolutional layers, as shown in [Figure 10-11](#stylegan2_styleblock). The authors show how this removes the artifact issue while retaining control of the image style. + id: totrans-94 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1011.png)' + id: totrans-95 prefs: [] type: TYPE_IMG - en: Figure 10-11\. A comparison between the StyleGAN and StyleGAN2 style blocks + id: totrans-96 prefs: - PREF_H6 type: TYPE_NORMAL @@ -558,22 +728,30 @@ , where i indexes the number of input channels in the corresponding convolutional layer. This style vector is then applied to the weights of the convolutional layer as follows:' + id: totrans-97 prefs: [] type: TYPE_NORMAL - en: w i,j,k ' = s i · w i,j,k + id: totrans-98 prefs: [] type: TYPE_NORMAL + zh: w + i,j,k ' + = s i · w i,j,k - en: Here, j indexes the output channels of the layer and k indexes the spatial dimensions. This is the *modulation* step of the process. + id: totrans-99 prefs: [] type: TYPE_NORMAL - en: 'Then, we need to normalize the weights so that they again have a unit standard deviation, to ensure stability in the training process. 
This is the *demodulation* step:' + id: totrans-100 prefs: [] type: TYPE_NORMAL - en: w i,j,k + '' = w + i,j,k ' + i,k + w i,j,k + ' 2 +ε - en: where ϵ is a small constant value that prevents division by zero. + id: totrans-102 prefs: [] type: TYPE_NORMAL - en: In the paper, the authors show how this simple change is enough to prevent water-droplet artifacts, while retaining control over the generated images via the style vectors and ensuring the quality of the output remains high. + id: totrans-103 prefs: [] type: TYPE_NORMAL - en: Path Length Regularization + id: totrans-104 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 路径长度正则化 - en: Another change made to the StyleGAN architecture is the inclusion of an additional penalty term in the loss function—*this is known as path length regularization*. + id: totrans-105 prefs: [] type: TYPE_NORMAL + zh: StyleGAN架构的另一个变化是在损失函数中包含了额外的惩罚项——*这被称为路径长度正则化*。 - en: We would like the latent space to be as smooth and uniform as possible, so that a fixed-size step in the latent space in any direction results in a fixed-magnitude change in the image. + id: totrans-106 prefs: [] type: TYPE_NORMAL + zh: 我们希望潜在空间尽可能平滑和均匀,这样在任何方向上潜在空间中的固定大小步长会导致图像的固定幅度变化。 - en: 'To encourage this property, StyleGAN2 aims to minimize the following term, alongside the usual Wasserstein loss with gradient penalty:' + id: totrans-107 prefs: [] type: TYPE_NORMAL + zh: 为了鼓励这一属性,StyleGAN2旨在最小化以下术语,以及通常的Wasserstein损失和梯度惩罚: - en: 𝔼 @@ -621,8 +820,16 @@ open="(" close=")">𝐉 𝑤 𝑦 2 -a 2 + id: totrans-108 prefs: [] type: TYPE_NORMAL + zh: 𝔼 + 𝑤,𝑦 𝐉 + 𝑤 𝑦 2 -a + 2 - en: Here, 𝑤 is a set of style vectors created by the mapping network, 𝑦 is a set of noisy images drawn from 𝐉 𝑤 = g 𝑤 is the Jacobian of the generator network with respect to the style vectors. + id: totrans-109 prefs: [] type: TYPE_NORMAL + zh: 在这里,𝑤是由映射网络创建的一组样式向量,𝑦是从𝒩 + ( 0 , 𝐈 )中绘制的一组嘈杂图像,𝐉 𝑤 + = g 𝑤是生成器网络相对于样式向量的雅可比矩阵。 - en: The term 𝐉 𝑤 𝑦 2 @@ -644,36 +858,56 @@ w Superscript down-tack Baseline y parallel-to Subscript 2">𝐉 𝑤 𝑦 2 as the training progresses. + id: totrans-110 prefs: [] type: TYPE_NORMAL + zh: 术语𝐉 + 𝑤 𝑦 2测量了经雅可比矩阵给出的梯度变换后图像𝑦的幅度。我们希望这个值接近一个常数a,这个常数是动态计算的,作为训练进行时𝐉 + 𝑤 𝑦 2的指数移动平均值。 - en: The authors find that this additional term makes exploring the latent space more reliable and consistent. Moreover, the regularization terms in the loss function are only applied once every 16 minibatches, for efficiency. This technique, called *lazy regularization*, does not cause a measurable drop in performance. + id: totrans-111 prefs: [] type: TYPE_NORMAL + zh: 作者发现,这个额外的术语使探索潜在空间更可靠和一致。此外,损失函数中的正则化项仅在每16个小批次中应用一次,以提高效率。这种技术称为*懒惰正则化*,不会导致性能的明显下降。 - en: No Progressive Growing + id: totrans-112 prefs: - PREF_H2 type: TYPE_NORMAL + zh: 没有渐进增长 - en: Another major update is in how StyleGAN2 is trained. Rather than adopting the usual progressive training mechanism, StyleGAN2 utilizes skip connections in the generator and residual connections in the discriminator to train the entire network as one. It no longer requires different resolutions to be trained independently and blended as part of the training process. + id: totrans-113 prefs: [] type: TYPE_NORMAL + zh: StyleGAN2训练的另一个重大更新是在训练方式上。StyleGAN2不再采用通常的渐进式训练机制,而是利用生成器中的跳过连接和鉴别器中的残差连接来将整个网络作为一个整体进行训练。它不再需要独立训练不同分辨率,并将其作为训练过程的一部分混合。 - en: '[Figure 10-12](#stylegan2_gen_dis) shows the generator and discriminator blocks in StyleGAN2.' 
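Reading the modulation and demodulation equations above as code, a hedged sketch of the runtime weight adjustment might look like the following. It is an illustration of the two formulas under an assumed Keras-style kernel layout of (kernel_h, kernel_w, in_channels, out_channels), not the official StyleGAN2 implementation.

```python
import tensorflow as tf


def modulate_demodulate(weights, style, eps=1e-8):
    """Sketch of StyleGAN2 weight modulation and demodulation.

    weights : convolution kernel of shape (kh, kw, in_channels, out_channels)
    style   : per-input-channel style scale s_i of shape (in_channels,)
    """
    # Modulation: w'_{i,j,k} = s_i * w_{i,j,k}
    w = weights * style[None, None, :, None]
    # Demodulation: rescale each output channel j so its weights have unit norm,
    # summing the squared weights over the input channels i and spatial positions k
    norm = tf.sqrt(tf.reduce_sum(tf.square(w), axis=[0, 1, 2], keepdims=True) + eps)
    return w / norm
```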
+ id: totrans-114 prefs: [] type: TYPE_NORMAL + zh: '[图10-12](#stylegan2_gen_dis)展示了StyleGAN2中的生成器和鉴别器块。' - en: '![](Images/gdl2_1012.png)' + id: totrans-115 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1012.png)' - en: Figure 10-12\. The generator and discriminator blocks in StyleGAN2 + id: totrans-116 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-12。StyleGAN2中的生成器和鉴别器块 - en: The crucial property that we would like to be able to preserve is that the StyleGAN2 starts by learning low-resolution features and gradually refines the output as training progresses. The authors show that this property is indeed preserved using @@ -684,43 +918,57 @@ begin to dominate, as the generator discovers more intricate ways to improve the realism of the images in order to fool the discriminator. This process is demonstrated in [Figure 10-13](#stylegan2_contrib). + id: totrans-117 prefs: [] type: TYPE_NORMAL + zh: 我们希望能够保留的关键属性是,StyleGAN2从学习低分辨率特征开始,并随着训练的进行逐渐完善输出。作者表明,使用这种架构确实保留了这一属性。在训练的早期阶段,每个网络都受益于在较低分辨率层中细化卷积权重,而通过跳过和残差连接将输出传递到较高分辨率层的方式基本上不受影响。随着训练的进行,较高分辨率层开始占主导地位,因为生成器发现了更复杂的方法来改善图像的逼真度,以欺骗鉴别器。这个过程在[图10-13](#stylegan2_contrib)中展示。 - en: '![](Images/gdl2_1013.png)' + id: totrans-118 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1013.png)' - en: Figure 10-13\. The contribution of each resolution layer to the output of the generator, by training time (adapted from [Karras et al., 2019](https://arxiv.org/pdf/1912.04958.pdf)) + id: totrans-119 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图10-13。每个分辨率层对生成器输出的贡献,按训练时间(改编自[Karras等人,2019](https://arxiv.org/pdf/1912.04958.pdf)) - en: Outputs from StyleGAN2 + id: totrans-120 prefs: - PREF_H2 type: TYPE_NORMAL + zh: StyleGAN2的输出 - en: Some examples of StyleGAN2 output are shown in [Figure 10-14](#stylegan2_output). To date, the StyleGAN2 architecture (and scaled variations such as StyleGAN-XL^([6](ch10.xhtml#idm45387004898624))) remain state of the art for image generation on datasets such as Flickr-Faces-HQ (FFHQ) and CIFAR-10, according to the benchmarking website [Papers with Code](https://oreil.ly/VwH2r). + id: totrans-121 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1014.png)' + id: totrans-122 prefs: [] type: TYPE_IMG - en: 'Figure 10-14\. Uncurated StyleGAN2 output for the FFHQ face dataset and LSUN car dataset (source: [Karras et al., 2019](https://arxiv.org/pdf/1912.04958.pdf))' + id: totrans-123 prefs: - PREF_H6 type: TYPE_NORMAL - en: Other Important GANs + id: totrans-124 prefs: - PREF_H1 type: TYPE_NORMAL - en: In this section, we will explore two more architectures that have also contributed significantly to the development of GANs—SAGAN and BigGAN. + id: totrans-125 prefs: [] type: TYPE_NORMAL - en: Self-Attention GAN (SAGAN) + id: totrans-126 prefs: - PREF_H2 type: TYPE_NORMAL @@ -729,13 +977,16 @@ models such as the Transformer can also be incorporated into GAN-based models for image generation. [Figure 10-15](#sagan_attention) shows the self-attention mechanism from the paper introducing this architecture. + id: totrans-127 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1015.png)' + id: totrans-128 prefs: [] type: TYPE_IMG - en: 'Figure 10-15\. The self-attention mechanism within the SAGAN model (source: [Zhang et al., 2018](https://arxiv.org/abs/1805.08318))' + id: totrans-129 prefs: - PREF_H6 type: TYPE_NORMAL @@ -749,14 +1000,17 @@ solves this problem by incorporating the attention mechanism that we explored earlier in this chapter into the GAN. The effect of this inclusion is shown in [Figure 10-16](Images/#sagan_images). 
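As a simplified sketch of the kind of self-attention block SAGAN inserts into the generator and discriminator (the real model also uses spectral normalization and an extra channel-reducing convolution, both omitted here), something like the following layer computes attention weights between every pair of spatial positions; the class name and the `channels // 8` query/key reduction are assumptions of this sketch in the spirit of the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers


class SelfAttention2D(layers.Layer):
    """Sketch of a SAGAN-style self-attention block over spatial positions.

    `channels` should match the number of channels in the input feature maps.
    """

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions produce the query, key, and value feature maps
        self.query = layers.Conv2D(channels // 8, 1)
        self.key = layers.Conv2D(channels // 8, 1)
        self.value = layers.Conv2D(channels, 1)
        # Learned scale on the attention output, initialized to zero
        self.gamma = self.add_weight(name="gamma", shape=(), initializer="zeros")

    def call(self, x):
        b = tf.shape(x)[0]
        n = tf.shape(x)[1] * tf.shape(x)[2]  # number of spatial positions
        q = tf.reshape(self.query(x), [b, n, -1])
        k = tf.reshape(self.key(x), [b, n, -1])
        v = tf.reshape(self.value(x), [b, n, -1])
        # Attention weights between every pair of spatial positions
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)
        out = tf.reshape(tf.matmul(attn, v), tf.shape(x))
        # Residual connection scaled by the learned gamma
        return x + self.gamma * out
```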
+ id: totrans-130 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1016.png)' + id: totrans-131 prefs: [] type: TYPE_IMG - en: 'Figure 10-16\. A SAGAN-generated image of a bird (leftmost cell) and the attention maps of the final attention-based generator layer for the pixels covered by the three colored dots (rightmost cells) (source: [Zhang et al., 2018](https://arxiv.org/abs/1805.08318))' + id: totrans-132 prefs: - PREF_H6 type: TYPE_NORMAL @@ -767,31 +1021,38 @@ falls on other tail pixels, some of which are distant from the blue dot. It would be difficult to maintain this long-range dependency for pixels without attention, especially for long, thin structures in the image (such as the tail in this case). + id: totrans-133 prefs: [] type: TYPE_NORMAL - en: Training Your Own SAGAN + id: totrans-134 prefs: - PREF_H1 type: TYPE_NORMAL - en: The official code for training your own SAGAN using TensorFlow is available on [GitHub](https://oreil.ly/rvej0). Bear in mind that training a SAGAN to achieve the results from the paper requires a significant amount of computing power. + id: totrans-135 prefs: [] type: TYPE_NORMAL - en: BigGAN + id: totrans-136 prefs: - PREF_H2 type: TYPE_NORMAL - en: BigGAN,^([8](ch10.xhtml#idm45387004870736)) developed at DeepMind, extends the ideas from the SAGAN paper. [Figure 10-17](#biggan_examples) shows some of the images generated by BigGAN, trained on the ImageNet dataset at 128 × 128 resolution. + id: totrans-137 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1017.png)' + id: totrans-138 prefs: [] type: TYPE_IMG - en: 'Figure 10-17\. Examples of images generated by BigGAN (source: [Brock et al., 2018](https://arxiv.org/abs/1809.11096))' + id: totrans-139 prefs: - PREF_H6 type: TYPE_NORMAL @@ -806,16 +1067,26 @@ that have magnitude greater than a certain threshold). The smaller the truncation threshold, the greater the believability of generated samples, at the expense of reduced variability. This concept is shown in [Figure 10-18](#truncation). + id: totrans-140 prefs: [] type: TYPE_NORMAL + zh: 除了对基本 SAGAN 模型进行一些增量更改外,论文中还概述了将模型提升到更高层次的几项创新。其中一项创新是所谓的“截断技巧”。这是指用于采样的潜在分布与训练期间使用的 + z + 𝒩 ( 0 , 𝐈 ) + 分布不同。具体来说,采样期间使用的分布是“截断正态分布”(重新采样具有大于一定阈值的 z + 值)。截断阈值越小,生成样本的可信度越高,但变异性降低。这个概念在[图 10-18](#truncation)中展示。 - en: '![](Images/gdl2_1018.png)' + id: totrans-141 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1018.png)' - en: 'Figure 10-18\. The truncation trick: from left to right, the threshold is set to 2, 1, 0.5, and 0.04 (source: [Brock et al., 2018](https://arxiv.org/abs/1809.11096))' + id: totrans-142 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图 10-18\. 截断技巧:从左到右,阈值设置为 2、1、0.5 和 0.04(来源:[Brock 等人,2018](https://arxiv.org/abs/1809.11096)) - en: Also, as the name suggests, BigGAN is an improvement over SAGAN in part simply by being *bigger*. BigGAN uses a batch size of 2,048—8 times larger than the batch size of 256 used in SAGAN—and a channel size that is increased by 50% in each @@ -823,24 +1094,36 @@ by the inclusion of a shared embedding, by orthogonal regularization, and by incorporating the latent vector z into each layer of the generator, rather than just the initial layer. 
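The truncation trick itself is easy to sketch: sample from a standard Gaussian and resample any component whose magnitude exceeds the threshold. The following is a hedged illustration rather than the BigGAN authors' code, and the latent size of 128 in the usage lines is an assumption.

```python
import numpy as np


def truncated_z(batch_size, z_dim, threshold, rng=None):
    """Sketch of the truncation trick: sample z ~ N(0, I) and resample any
    component whose magnitude exceeds `threshold`."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((batch_size, z_dim))
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z


# Smaller thresholds give more believable but less varied samples (see Figure 10-18)
z_sharp = truncated_z(batch_size=16, z_dim=128, threshold=0.5)
z_varied = truncated_z(batch_size=16, z_dim=128, threshold=2.0)
```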
+ id: totrans-143 prefs: [] type: TYPE_NORMAL + zh: 正如其名称所示,BigGAN 在某种程度上是对 SAGAN 的改进,仅仅是因为它更“大”。BigGAN 使用的批量大小为 2,048,比 SAGAN 中使用的 + 256 的批量大小大 8 倍,并且每一层的通道大小增加了 50%。然而,BigGAN 还表明,通过包含共享嵌入、正交正则化以及将潜在向量 z + 包含到生成器的每一层中,而不仅仅是初始层,可以在结构上改进 SAGAN。 - en: For a full description of the innovations introduced by BigGAN, I recommend reading the original paper and [accompanying presentation material](https://oreil.ly/vPn8T). + id: totrans-144 prefs: [] type: TYPE_NORMAL + zh: 要全面了解 BigGAN 引入的创新,我建议阅读原始论文和[相关演示材料](https://oreil.ly/vPn8T)。 - en: Using BigGAN + id: totrans-145 prefs: - PREF_H1 type: TYPE_NORMAL + zh: 使用 BigGAN - en: A tutorial for generating images using a pre-trained BigGAN is available on [the TensorFlow website](https://oreil.ly/YLbLb). + id: totrans-146 prefs: [] type: TYPE_NORMAL + zh: 在[ TensorFlow 网站](https://oreil.ly/YLbLb)上提供了一个使用预训练的 BigGAN 生成图像的教程。 - en: VQ-GAN + id: totrans-147 prefs: - PREF_H2 type: TYPE_NORMAL + zh: VQ-GAN - en: Another important type of GAN is the Vector Quantized GAN (VQ-GAN), introduced in 2020.^([9](ch10.xhtml#idm45387004838864)) This model architecture builds upon an idea introduced in the 2017 paper “Neural Discrete Representation Learning”^([10](ch10.xhtml#idm45387004834704))—namely, @@ -849,17 +1132,26 @@ high-quality images while avoiding some of the issues often seen with traditional continuous latent space VAEs, such as *posterior collapse* (where the learned latent space becomes uninformative due to an overly powerful decoder). + id: totrans-148 prefs: [] type: TYPE_NORMAL + zh: 另一种重要的 GAN 类型是 2020 年推出的 Vector Quantized GAN(VQ-GAN)。这种模型架构建立在 2017 年的论文“神经离散表示学习”中提出的一个想法之上,即 + VAE 学习到的表示可以是离散的,而不是连续的。这种新型模型,即 Vector Quantized VAE(VQ-VAE),被证明可以生成高质量的图像,同时避免了传统连续潜在空间 + VAE 经常出现的一些问题,比如“后验坍缩”(学习到的潜在空间由于过于强大的解码器而变得无信息)。 - en: Tip + id: totrans-149 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 提示 - en: The first version of DALL.E, a text-to-image model released by OpenAI in 2021 (see [Chapter 13](ch13.xhtml#chapter_multimodal)), utilized a VAE with a discrete latent space, similar to VQ-VAE. + id: totrans-150 prefs: [] type: TYPE_NORMAL + zh: OpenAI 在 2021 年发布的文本到图像模型 DALL.E 的第一个版本(参见[第 13 章](ch13.xhtml#chapter_multimodal))使用了具有离散潜在空间的 + VAE,类似于 VQ-VAE。 - en: By a *discrete latent space*, we mean a learned list of vectors (the *codebook*), each associated with a corresponding index. The job of the encoder in a VQ-VAE is to collapse the input image to a smaller grid of vectors that can then be compared @@ -869,15 +1161,23 @@ (the embedding size) that matches the number of channels in the output of the encoder and input to the decoder. For example, e 1 is a vector that can be interpreted as *background*. + id: totrans-151 prefs: [] type: TYPE_NORMAL + zh: 通过“离散潜在空间”,我们指的是一个学习到的向量列表(“码书”),每个向量与相应的索引相关联。VQ-VAE 中编码器的工作是将输入图像折叠到一个较小的向量网格中,然后将其与码书进行比较。然后,将每个网格方格向量(通过欧氏距离)最接近的码书向量传递给解码器进行解码,如[图 + 10-19](#vqvae)所示。码书是一个长度为 d(嵌入大小)的学习向量列表,与编码器输出和解码器输入中的通道数相匹配。例如,e 1 是一个可以解释为“背景”的向量。 - en: '![](Images/gdl2_1019.png)' + id: totrans-152 prefs: [] type: TYPE_IMG + zh: '![](Images/gdl2_1019.png)' - en: Figure 10-19\. A diagram of a VQ-VAE + id: totrans-153 prefs: - PREF_H6 type: TYPE_NORMAL + zh: 图 10-19\. VQ-VAE 的示意图 - en: The codebook can be thought of as a set of learned discrete concepts that are shared by the encoder and decoder in order to describe the contents of a given image. The VQ-VAE must find a way to make this set of discrete concepts as informative @@ -888,6 +1188,7 @@ as possible to vectors in the codebook. 
These terms replace the the KL divergence term between the encoded distribution and the standard Gaussian prior in a typical VAE. + id: totrans-154 prefs: [] type: TYPE_NORMAL - en: However, this architecture poses a question—how do we sample novel code grids @@ -900,26 +1201,32 @@ to predict the next code vector in the grid, given previous code vectors. In other words, the prior is learned by the model, rather than static as in the case of the vanilla VAE. + id: totrans-155 prefs: [] type: TYPE_NORMAL - en: Training Your Own VQ-VAE + id: totrans-156 prefs: - PREF_H1 type: TYPE_NORMAL - en: There is an excellent tutorial by Sayak Paul on training your own VQ-VAE using Keras available on the [Keras website](https://oreil.ly/dmcb4). + id: totrans-157 prefs: [] type: TYPE_NORMAL - en: The VQ-GAN paper details several key changes to the VQ-VAE architecture, as shown in [Figure 10-20](#vqgan). + id: totrans-158 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1020.png)' + id: totrans-159 prefs: [] type: TYPE_IMG - en: 'Figure 10-20\. A diagram of a VQ-GAN: the GAN discriminator helps to encourage the VAE to generate less blurry images through an additional adversarial loss term' + id: totrans-160 prefs: - PREF_H6 type: TYPE_NORMAL @@ -931,6 +1238,7 @@ GAN discriminator is an additional component rather than a replacement of the VAE. The idea of combining a VAE with a GAN discriminator (VAE-GAN) was first introduced by Larsen et al. in their 2015 paper.^([11](ch10.xhtml#idm45387004808112)) + id: totrans-161 prefs: [] type: TYPE_NORMAL - en: Secondly, the GAN discriminator predicts if small patches of the images are @@ -948,6 +1256,7 @@ that VAEs produce images that are stylistically more blurry than real images, so the PatchGAN discriminator can encourage the VAE decoder to generate sharper images than it would naturally produce. + id: totrans-162 prefs: [] type: TYPE_NORMAL - en: Thirdly, rather than use a single MSE reconstruction loss that compares the @@ -957,6 +1266,7 @@ idea is from the 2016 paper by Hou et al.,^([14](ch10.xhtml#idm45387004793216)) where the authors show that this change to the loss function results in more realistic image generations. + id: totrans-163 prefs: [] type: TYPE_NORMAL - en: Lastly, instead of PixelCNN, a Transformer is used as the autoregressive part @@ -966,9 +1276,11 @@ use tokens that fall within a sliding window around the token to be predicted. This ensures that the model scales to larger images, which require a larger latent grid size and therefore more tokens to be generated by the Transformer. + id: totrans-164 prefs: [] type: TYPE_NORMAL - en: ViT VQ-GAN + id: totrans-165 prefs: - PREF_H2 type: TYPE_NORMAL @@ -976,6 +1288,7 @@ entitled “Vector-Quantized Image Modeling with Improved VQGAN.”^([15](ch10.xhtml#idm45387004783968)) Here, the authors show how the convolutional encoder and decoder of the VQ-GAN can be replaced with Transformers as shown in [Figure 10-21](#vit_vqgan). + id: totrans-166 prefs: [] type: TYPE_NORMAL - en: For the encoder, the authors use a *Vision Transformer* (ViT).^([16](ch10.xhtml#idm45387004780000)) @@ -983,6 +1296,7 @@ designed for natural language processing, to image data. Instead of using convolutional layers to extract features from an image, a ViT divides the image into a sequence of patches, which are tokenized and then fed as input to an encoder Transformer. 
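To make the nearest-codebook lookup concrete, here is a hedged sketch of the quantization step shared by VQ-VAE and VQ-GAN. It illustrates only the Euclidean-distance matching described above; it omits the two additional codebook loss terms and is not the papers' implementation.

```python
import tensorflow as tf


def quantize(z_e, codebook):
    """Sketch of the VQ quantization step: map each encoder output vector to its
    nearest codebook entry by Euclidean distance.

    z_e      : encoder output of shape (batch, grid_h, grid_w, d)
    codebook : learned embeddings of shape (num_codes, d)
    """
    d = tf.shape(codebook)[-1]
    flat = tf.reshape(z_e, [-1, d])  # (batch * grid_h * grid_w, d)
    # Squared Euclidean distances between every grid vector and every codebook vector
    distances = (
        tf.reduce_sum(flat**2, axis=1, keepdims=True)
        - 2.0 * tf.matmul(flat, codebook, transpose_b=True)
        + tf.reduce_sum(codebook**2, axis=1)
    )
    indices = tf.argmin(distances, axis=1)      # index of the nearest code per vector
    z_q = tf.gather(codebook, indices)          # quantized vectors passed to the decoder
    return tf.reshape(z_q, tf.shape(z_e)), tf.reshape(indices, tf.shape(z_e)[:-1])
```

The grid of `indices` returned here is exactly the sequence of code tokens that the autoregressive prior (PixelCNN in VQ-VAE, a Transformer in VQ-GAN) learns to predict.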
+ id: totrans-167 prefs: [] type: TYPE_NORMAL - en: Specifically, in the ViT VQ-GAN, the nonoverlapping input patches (each of size @@ -993,14 +1307,17 @@ model, with the overall output being a sequence of patches that can be stitched back together to form the original image. The overall encoder-decoder model is trained end-to-end as an autoencoder. + id: totrans-168 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1021.png)' + id: totrans-169 prefs: [] type: TYPE_IMG - en: 'Figure 10-21\. A diagram of a ViT VQ-GAN: the GAN discriminator helps to encourage the VAE to generate less blurry images through an additional adversarial loss term (source: [Yu and Koh, 2022](https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html))^([17](ch10.xhtml#idm45387004774560))' + id: totrans-170 prefs: - PREF_H6 type: TYPE_NORMAL @@ -1009,23 +1326,28 @@ in total, there are three Transformers in a ViT VQ-GAN, in addition to the GAN discriminator and learned codebook. Examples of images generated by the ViT VQ-GAN from the paper are shown in [Figure 10-22](#vit_vqgan_ex). + id: totrans-171 prefs: [] type: TYPE_NORMAL - en: '![](Images/gdl2_1022.png)' + id: totrans-172 prefs: [] type: TYPE_IMG - en: 'Figure 10-22\. Example images generated by a ViT VQ-GAN trained on ImageNet (source: [Yu et al., 2021](https://arxiv.org/pdf/2110.04627.pdf))' + id: totrans-173 prefs: - PREF_H6 type: TYPE_NORMAL - en: Summary + id: totrans-174 prefs: - PREF_H1 type: TYPE_NORMAL - en: In this chapter, we have taken a tour of some of the most important and influential GAN papers since 2017\. In particular, we have explored ProGAN, StyleGAN, StyleGAN2, SAGAN, BigGAN, VQ-GAN, and ViT VQ-GAN. + id: totrans-175 prefs: [] type: TYPE_NORMAL - en: We started by exploring the concept of progressive training that was pioneered @@ -1037,6 +1359,7 @@ alongside additional enhancements such as path regularization. The paper also showed how the desirable property of gradual resolution refinement could be retained without having to the train the network progressively. + id: totrans-176 prefs: [] type: TYPE_NORMAL - en: We also saw how the concept of attention could be built into a GAN, with the @@ -1046,6 +1369,7 @@ spatial dimensions of the image. BigGAN was an extension of this idea that made several key changes and trained a larger network to improve the image quality further. + id: totrans-177 prefs: [] type: TYPE_NORMAL - en: In the VQ-GAN paper, the authors show how several different types of generative @@ -1056,78 +1380,96 @@ used to construct a novel sequence of code tokens that can be decoded by the VAE decoder to produce novel images. The ViT VQ-GAN paper extends this idea even further, by replacing the convolutional encoder and decoder of VQ-GAN with Transformers. + id: totrans-178 prefs: [] type: TYPE_NORMAL - en: '^([1](ch10.xhtml#idm45387005226448-marker)) Huiwen Chang et al., “Muse: Text-to-Image Generation via Masked Generative Transformers,” January 2, 2023, [*https://arxiv.org/abs/2301.00704*](https://arxiv.org/abs/2301.00704).' + id: totrans-179 prefs: [] type: TYPE_NORMAL - en: ^([2](ch10.xhtml#idm45387005216528-marker)) Tero Karras et al., “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” October 27, 2017, [*https://arxiv.org/abs/1710.10196*](https://arxiv.org/abs/1710.10196). 
+ id: totrans-180 prefs: [] type: TYPE_NORMAL - en: ^([3](ch10.xhtml#idm45387005140128-marker)) Tero Karras et al., “A Style-Based Generator Architecture for Generative Adversarial Networks,” December 12, 2018, [*https://arxiv.org/abs/1812.04948*](https://arxiv.org/abs/1812.04948). + id: totrans-181 prefs: [] type: TYPE_NORMAL - en: ^([4](ch10.xhtml#idm45387005090240-marker)) Xun Huang and Serge Belongie, “Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization,” March 20, 2017, [*https://arxiv.org/abs/1703.06868*](https://arxiv.org/abs/1703.06868). + id: totrans-182 prefs: [] type: TYPE_NORMAL - en: ^([5](ch10.xhtml#idm45387005019232-marker)) Tero Karras et al., “Analyzing and Improving the Image Quality of StyleGAN,” December 3, 2019, [*https://arxiv.org/abs/1912.04958*](https://arxiv.org/abs/1912.04958). + id: totrans-183 prefs: [] type: TYPE_NORMAL - en: '^([6](ch10.xhtml#idm45387004898624-marker)) Axel Sauer et al., “StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets,” February 1, 2022, [*https://arxiv.org/abs/2202.00273v2*](https://arxiv.org/abs/2202.00273v2).' + id: totrans-184 prefs: [] type: TYPE_NORMAL - en: ^([7](ch10.xhtml#idm45387004886752-marker)) Han Zhang et al., “Self-Attention Generative Adversarial Networks,” May 21, 2018, [*https://arxiv.org/abs/1805.08318*](https://arxiv.org/abs/1805.08318). + id: totrans-185 prefs: [] type: TYPE_NORMAL - en: ^([8](ch10.xhtml#idm45387004870736-marker)) Andrew Brock et al., “Large Scale GAN Training for High Fidelity Natural Image Synthesis,” September 28, 2018, [*https://arxiv.org/abs/1809.11096*](https://arxiv.org/abs/1809.11096). + id: totrans-186 prefs: [] type: TYPE_NORMAL - en: ^([9](ch10.xhtml#idm45387004838864-marker)) Patrick Esser et al., “Taming Transformers for High-Resolution Image Synthesis,” December 17, 2020, [*https://arxiv.org/abs/2012.09841*](https://arxiv.org/abs/2012.09841). + id: totrans-187 prefs: [] type: TYPE_NORMAL - en: ^([10](ch10.xhtml#idm45387004834704-marker)) Aaron van den Oord et al., “Neural Discrete Representation Learning,” November 2, 2017, [*https://arxiv.org/abs/1711.00937v2*](https://arxiv.org/abs/1711.00937v2). + id: totrans-188 prefs: [] type: TYPE_NORMAL - en: ^([11](ch10.xhtml#idm45387004808112-marker)) Anders Boesen Lindbo Larsen et al., “Autoencoding Beyond Pixels Using a Learned Similarity Metric,” December 31, 2015, [*https://arxiv.org/abs/1512.09300*](https://arxiv.org/abs/1512.09300). + id: totrans-189 prefs: [] type: TYPE_NORMAL - en: ^([12](ch10.xhtml#idm45387004801680-marker)) Phillip Isola et al., “Image-to-Image Translation with Conditional Adversarial Networks,” November 21, 2016, [*https://arxiv.org/abs/1611.07004v3*](https://arxiv.org/abs/1611.07004v3). + id: totrans-190 prefs: [] type: TYPE_NORMAL - en: ^([13](ch10.xhtml#idm45387004798080-marker)) Jun-Yan Zhu et al., “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” March 30, 2017, [*https://arxiv.org/abs/1703.10593*](https://arxiv.org/abs/1703.10593). + id: totrans-191 prefs: [] type: TYPE_NORMAL - en: ^([14](ch10.xhtml#idm45387004793216-marker)) Xianxu Hou et al., “Deep Feature Consistent Variational Autoencoder,” October 2, 2016, [*https://arxiv.org/abs/1610.00291*](https://arxiv.org/abs/1610.00291). + id: totrans-192 prefs: [] type: TYPE_NORMAL - en: ^([15](ch10.xhtml#idm45387004783968-marker)) Jiahui Yu et al., “Vector-Quantized Image Modeling with Improved VQGAN,” October 9, 2021, [*https://arxiv.org/abs/2110.04627*](https://arxiv.org/abs/2110.04627). 
+ id: totrans-193 prefs: [] type: TYPE_NORMAL - en: '^([16](ch10.xhtml#idm45387004780000-marker)) Alexey Dosovitskiy et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale,” October 22, 2020, [*https://arxiv.org/abs/2010.11929v2*](https://arxiv.org/abs/2010.11929v2).' + id: totrans-194 prefs: [] type: TYPE_NORMAL - en: ^([17](ch10.xhtml#idm45387004774560-marker)) Jiahui Yu and Jing Yu Koh, “Vector-Quantized Image Modeling with Improved VQGAN,” May 18, 2022, [*https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html*](https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html). + id: totrans-195 prefs: [] type: TYPE_NORMAL