diff --git a/totrans/gen-dl_13.yaml b/totrans/gen-dl_13.yaml
index 33d04c0..6745916 100644
--- a/totrans/gen-dl_13.yaml
+++ b/totrans/gen-dl_13.yaml
@@ -753,51 +753,63 @@
id: totrans-100
prefs: []
type: TYPE_NORMAL
+ zh: 构成`TransformerBlock`层的子层在初始化函数中定义。
- en: '[![2](Images/2.png)](#co_transformers_CO2-2)'
id: totrans-101
prefs: []
type: TYPE_NORMAL
+ zh: '[![2](Images/2.png)](#co_transformers_CO2-2)'
- en: The causal mask is created to hide future keys from the query.
id: totrans-102
prefs: []
type: TYPE_NORMAL
+ zh: 创建因果掩码,以对查询隐藏未来的键。
- en: '[![3](Images/3.png)](#co_transformers_CO2-3)'
id: totrans-103
prefs: []
type: TYPE_NORMAL
+ zh: '[![3](Images/3.png)](#co_transformers_CO2-3)'
- en: The multihead attention layer is created, with the attention masks specified.
id: totrans-104
prefs: []
type: TYPE_NORMAL
+ zh: 创建了多头注意力层,并指定了注意力掩码。
- en: '[![4](Images/4.png)](#co_transformers_CO2-4)'
id: totrans-105
prefs: []
type: TYPE_NORMAL
+ zh: '[![4](Images/4.png)](#co_transformers_CO2-4)'
- en: The first *add and normalization* layer.
id: totrans-106
prefs: []
type: TYPE_NORMAL
+ zh: 第一个*加和归一化*层。
- en: '[![5](Images/5.png)](#co_transformers_CO2-5)'
id: totrans-107
prefs: []
type: TYPE_NORMAL
+ zh: '[![5](Images/5.png)](#co_transformers_CO2-5)'
- en: The feed-forward layers.
id: totrans-108
prefs: []
type: TYPE_NORMAL
+ zh: 前馈层。
- en: '[![6](Images/6.png)](#co_transformers_CO2-6)'
id: totrans-109
prefs: []
type: TYPE_NORMAL
+ zh: '[![6](Images/6.png)](#co_transformers_CO2-6)'
- en: The second *add and normalization* layer.
id: totrans-110
prefs: []
type: TYPE_NORMAL
+ zh: 第二个*加和归一化*层。
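To make the causal mask in callout 2 above concrete, here is a minimal NumPy sketch of the idea (independent of the book's actual helper in Example 9-4): a lower-triangular matrix in which a query at position *i* can only attend to keys at positions *j ≤ i*.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Lower-triangular boolean matrix: 1 keeps a query-key connection,
    # 0 hides a future key from the query.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```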
- en: Positional Encoding
id: totrans-111
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 位置编码
- en: 'There is one final step to cover before we can put everything together to train
our GPT model. You may have noticed that in the multihead attention layer, there
is nothing that cares about the ordering of the keys. The dot product between
@@ -808,16 +820,19 @@
id: totrans-112
prefs: []
type: TYPE_NORMAL
+ zh: 在我们将所有内容整合起来训练GPT模型之前,还有最后一个步骤需要处理。您可能已经注意到,多头注意力层中没有任何机制关心键的顺序。每个键和查询之间的点积是并行计算的,而不是像循环神经网络那样按顺序计算。这是一种优势(因为并行化提高了效率),但也是一个问题,因为我们显然需要注意力层能够为以下两个句子预测不同的输出:
- en: The dog looked at the boy and … (barked?)
id: totrans-113
prefs:
- PREF_UL
type: TYPE_NORMAL
+ zh: 狗看着男孩然后…(叫?)
- en: The boy looked at the dog and … (smiled?)
id: totrans-114
prefs:
- PREF_UL
type: TYPE_NORMAL
+ zh: 男孩看着狗然后…(微笑?)
- en: To solve this problem, we use a technique called *positional encoding* when
creating the inputs to the initial Transformer block. Instead of only encoding
each token using a *token embedding*, we also encode the position of the token,
@@ -825,6 +840,7 @@
id: totrans-115
prefs: []
type: TYPE_NORMAL
+ zh: 为了解决这个问题,我们在创建初始Transformer块的输入时使用一种称为*位置编码*的技术。我们不仅使用*标记嵌入*对每个标记进行编码,还使用*位置嵌入*对标记的位置进行编码。
- en: The *token embedding* is created using a standard `Embedding` layer to convert
each token into a learned vector. We can create the *positional embedding* in
the same way, using a standard `Embedding` layer to convert each integer position
@@ -832,17 +848,20 @@
id: totrans-116
prefs: []
type: TYPE_NORMAL
+ zh: '*标记嵌入*是使用标准的`Embedding`层创建的,将每个标记转换为一个学习到的向量。我们可以以相同的方式创建*位置嵌入*,使用标准的`Embedding`层将每个整数位置转换为一个学习到的向量。'
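A minimal Keras sketch of this idea, in the spirit of Example 9-5 (the layer and argument names here are illustrative, not necessarily identical to the book's code):

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=embed_dim)

    def call(self, x):
        # x: integer token IDs of shape (batch, seq_len)
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)
```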
- en: Tip
id: totrans-117
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 提示
- en: While GPT uses an `Embedding` layer to embed the position, the original Transformer
paper used trigonometric functions—we’ll cover this alternative in [Chapter 11](ch11.xhtml#chapter_music),
when we explore music generation.
id: totrans-118
prefs: []
type: TYPE_NORMAL
+ zh: 虽然GPT使用`Embedding`层来嵌入位置,但原始Transformer论文使用的是三角函数——我们将在探索音乐生成的[第11章](ch11.xhtml#chapter_music)中介绍这种替代方法。
- en: To construct the joint token–position encoding, the token embedding is added
to the positional embedding, as shown in [Figure 9-8](#positional_enc). This way,
the meaning and position of each word in the sequence are captured in a single
@@ -850,25 +869,30 @@
id: totrans-119
prefs: []
type: TYPE_NORMAL
+ zh: 为构建联合标记-位置编码,将标记嵌入加到位置嵌入中,如[图9-8](#positional_enc)所示。这样,序列中每个单词的含义和位置都被捕捉在一个向量中。
- en: '![](Images/gdl2_0908.png)'
id: totrans-120
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0908.png)'
- en: Figure 9-8\. The token embeddings are added to the positional embeddings to
give the token position encoding
id: totrans-121
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-8\. 将标记嵌入添加到位置嵌入以给出标记位置编码
- en: The code that defines our `TokenAndPositionEmbedding` layer is shown in [Example 9-5](#positional_embedding_code).
id: totrans-122
prefs: []
type: TYPE_NORMAL
+ zh: 定义我们的`TokenAndPositionEmbedding`层的代码显示在[示例9-5](#positional_embedding_code)中。
- en: Example 9-5\. The `TokenAndPositionEmbedding` layer
id: totrans-123
prefs:
- PREF_H5
type: TYPE_NORMAL
+ zh: 示例9-5\. `TokenAndPositionEmbedding`层
- en: '[PRE5]'
id: totrans-124
prefs: []
@@ -878,31 +902,38 @@
id: totrans-125
prefs: []
type: TYPE_NORMAL
+ zh: '[![1](Images/1.png)](#co_transformers_CO3-1)'
- en: The tokens are embedded using an `Embedding` layer.
id: totrans-126
prefs: []
type: TYPE_NORMAL
+ zh: 标记使用`Embedding`层进行嵌入。
- en: '[![2](Images/2.png)](#co_transformers_CO3-2)'
id: totrans-127
prefs: []
type: TYPE_NORMAL
+ zh: '[![2](Images/2.png)](#co_transformers_CO3-2)'
- en: The positions of the tokens are also embedded using an `Embedding` layer.
id: totrans-128
prefs: []
type: TYPE_NORMAL
+ zh: 标记的位置也使用`Embedding`层进行嵌入。
- en: '[![3](Images/3.png)](#co_transformers_CO3-3)'
id: totrans-129
prefs: []
type: TYPE_NORMAL
+ zh: '[![3](Images/3.png)](#co_transformers_CO3-3)'
- en: The output from the layer is the sum of the token and position embeddings.
id: totrans-130
prefs: []
type: TYPE_NORMAL
+ zh: 该层的输出是标记和位置嵌入的总和。
- en: Training GPT
id: totrans-131
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 训练GPT
- en: Now we are ready to build and train our GPT model! To put everything together,
we need to pass our input text through the token and position embedding layer,
then through our Transformer block. The final output of the network is a simple
@@ -910,35 +941,42 @@
id: totrans-132
prefs: []
type: TYPE_NORMAL
+ zh: 现在我们准备构建和训练我们的GPT模型!为了将所有内容整合在一起,我们需要将输入文本先通过标记和位置嵌入层,再通过Transformer块。网络的最终输出是一个简单的带softmax激活的`Dense`层,其输出维度等于词汇表中的单词数量。
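A rough Keras sketch of this assembly, assuming the `TokenAndPositionEmbedding` and `TransformerBlock` layers defined earlier in the chapter; the hyperparameter values and constructor signature are illustrative only:

```python
from tensorflow.keras import layers, models

# Illustrative hyperparameters (not the paper's values).
VOCAB_SIZE, MAX_LEN, EMBED_DIM, N_HEADS, KEY_DIM, FF_DIM = 10000, 80, 256, 2, 256, 256

inputs = layers.Input(shape=(None,), dtype="int32")
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBED_DIM)(inputs)
x = TransformerBlock(N_HEADS, KEY_DIM, EMBED_DIM, FF_DIM)(x)  # assumed signature
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

gpt = models.Model(inputs, outputs)
gpt.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```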
- en: Tip
id: totrans-133
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 提示
- en: For simplicity, we will use just one Transformer block, rather than the 12 in
the paper.
id: totrans-134
prefs: []
type: TYPE_NORMAL
+ zh: 为简单起见,我们将只使用一个Transformer块,而不是论文中的12个。
- en: The overall architecture is shown in [Figure 9-9](#transformer) and the equivalent
code is provided in [Example 9-6](#transformer_code).
id: totrans-135
prefs: []
type: TYPE_NORMAL
+ zh: 整体架构显示在[图9-9](#transformer)中,相应的代码在[示例9-6](#transformer_code)中提供。
- en: '![](Images/gdl2_0909.png)'
id: totrans-136
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_0909.png)'
- en: Figure 9-9\. The simplified GPT model architecture
id: totrans-137
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图9-9\. 简化的GPT模型架构
- en: Example 9-6\. A GPT model in Keras
id: totrans-138
prefs:
- PREF_H5
type: TYPE_NORMAL
+ zh: 示例9-6\. 在Keras中的GPT模型
- en: '[PRE6]'
id: totrans-139
prefs: []
@@ -948,30 +986,37 @@
id: totrans-140
prefs: []
type: TYPE_NORMAL
+ zh: '[![1](Images/1.png)](#co_transformers_CO4-1)'
- en: The input is padded (with zeros).
id: totrans-141
prefs: []
type: TYPE_NORMAL
+ zh: 输入被填充(用零填充)。
- en: '[![2](Images/2.png)](#co_transformers_CO4-2)'
id: totrans-142
prefs: []
type: TYPE_NORMAL
+ zh: '[![2](Images/2.png)](#co_transformers_CO4-2)'
- en: The text is encoded using a `TokenAndPositionEmbedding` layer.
id: totrans-143
prefs: []
type: TYPE_NORMAL
+ zh: 文本使用`TokenAndPositionEmbedding`层进行编码。
- en: '[![3](Images/3.png)](#co_transformers_CO4-3)'
id: totrans-144
prefs: []
type: TYPE_NORMAL
+ zh: '[![3](Images/3.png)](#co_transformers_CO4-3)'
- en: The encoding is passed through a `TransformerBlock`.
id: totrans-145
prefs: []
type: TYPE_NORMAL
+ zh: 编码通过`TransformerBlock`传递。
- en: '[![4](Images/4.png)](#co_transformers_CO4-4)'
id: totrans-146
prefs: []
type: TYPE_NORMAL
+ zh: '[![4](Images/4.png)](#co_transformers_CO4-4)'
- en: The transformed output is passed through a `Dense` layer with softmax activation
to predict a distribution over the subsequent word.
id: totrans-147
diff --git a/totrans/gen-dl_14.yaml b/totrans/gen-dl_14.yaml
index 9b577d3..7bc60e3 100644
--- a/totrans/gen-dl_14.yaml
+++ b/totrans/gen-dl_14.yaml
@@ -1,92 +1,126 @@
- en: Chapter 10\. Advanced GANs
+ id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 第10章. 高级GANs
- en: '[Chapter 4](ch04.xhtml#chapter_gan) introduced generative adversarial networks
(GANs), a class of generative model that has produced state-of-the-art results
across a wide variety of image generation tasks. The flexibility in the model
architecture and training process has led academics and deep learning practitioners
to find new ways to design and train GANs, leading to many different advanced
*flavors* of the architecture that we shall explore in this chapter.'
+ id: totrans-1
prefs: []
type: TYPE_NORMAL
+ zh: '[第4章](ch04.xhtml#chapter_gan)介绍了生成对抗网络(GANs),这是一类生成模型,在各种图像生成任务中取得了最先进的结果。模型架构和训练过程的灵活性导致学术界和深度学习从业者找到了设计和训练GAN的新方法,从而产生了许多不同的高级架构,我们将在本章中探讨。'
- en: Introduction
+ id: totrans-2
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 介绍
- en: Explaining all GAN developments and their repercussions in detail could easily
fill another book. The [GAN Zoo repository](https://oreil.ly/Oy6bR) on GitHub
contains over 500 distinct examples of GANs with linked papers, ranging from ABC-GAN
to ZipNet-GAN!
+ id: totrans-3
prefs: []
type: TYPE_NORMAL
+ zh: 详细解释所有GAN发展及其影响可能需要另一本书。GitHub上的[GAN Zoo代码库](https://oreil.ly/Oy6bR)包含了500多个不同的GAN示例,涵盖了从ABC-GAN到ZipNet-GAN的各种GAN,并附有相关论文链接!
- en: In this chapter we will cover the main GANs that have been influential in the
field, including a detailed explanation of the model architecture and training
process for each.
+ id: totrans-4
prefs: []
type: TYPE_NORMAL
+ zh: 在本章中,我们将介绍对该领域产生影响的主要GANs,包括对每个模型的模型架构和训练过程的详细解释。
- en: 'We will first explore three important models from NVIDIA that have pushed the
boundaries of image generation: ProGAN, StyleGAN, and StyleGAN2\. We will analyze
each of these models in enough detail to understand the fundamental concepts that
underpin the architectures and see how they have each built on ideas from earlier
papers.'
+ id: totrans-5
prefs: []
type: TYPE_NORMAL
+ zh: 我们将首先探讨NVIDIA推动图像生成边界的三个重要模型:ProGAN、StyleGAN和StyleGAN2。我们将对每个模型进行足够详细的分析,以理解支撑架构的基本概念,并看看它们如何各自建立在早期论文的想法基础上。
- en: 'We will also explore two other important GAN architectures that incorporate
attention: the Self-Attention GAN (SAGAN) and BigGAN, which built on many of the
ideas in the SAGAN paper. We have already seen the power of the attention mechanism
in the context of Transformers in [Chapter 9](ch09.xhtml#chapter_transformer).'
+ id: totrans-6
prefs: []
type: TYPE_NORMAL
+ zh: 我们还将探讨另外两种引入注意力机制的重要GAN架构:Self-Attention GAN(SAGAN),以及在SAGAN论文诸多想法基础上构建的BigGAN。我们已经在[第9章](ch09.xhtml#chapter_transformer)中看到了注意力机制在Transformer中的威力。
- en: Lastly, we will cover VQ-GAN and ViT VQ-GAN, which incorporate a blend of ideas
from variational autoencoders, Transformers, and GANs. VQ-GAN is a key component
of Google’s state-of-the-art text-to-image generation model Muse.^([1](ch10.xhtml#idm45387005226448))
We will explore so-called multimodal models in more detail in [Chapter 13](ch13.xhtml#chapter_multimodal).
+ id: totrans-7
prefs: []
type: TYPE_NORMAL
+ zh: 最后,我们将介绍VQ-GAN和ViT VQ-GAN,它们融合了变分自动编码器、Transformer和GAN的思想。VQ-GAN是谷歌最先进的文本到图像生成模型Muse的关键组成部分。^([1](ch10.xhtml#idm45387005226448)) 我们将在[第13章](ch13.xhtml#chapter_multimodal)中更详细地探讨所谓的多模态模型。
- en: Training Your Own Models
+ id: totrans-8
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 训练您自己的模型
- en: For conciseness I have chosen not to include code to directly build these models
in the code repository for this book, but instead will point to publicly available
implementations where possible, so that you can train your own versions if you
wish.
+ id: totrans-9
prefs: []
type: TYPE_NORMAL
+ zh: 为了简洁起见,我选择不在本书的代码库中包含直接构建这些模型的代码,而是尽可能指向公开可用的实现,以便您可以根据需要训练自己的版本。
- en: ProGAN
+ id: totrans-10
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: ProGAN
- en: ProGAN is a technique developed by NVIDIA Labs in 2017^([2](ch10.xhtml#idm45387005216528))
to improve both the speed and stability of GAN training. Instead of immediately
training a GAN on full-resolution images, the ProGAN paper suggests first training
the generator and discriminator on low-resolution images of, say, 4 × 4 pixels
and then incrementally adding layers throughout the training process to increase
the resolution.
+ id: totrans-11
prefs: []
type: TYPE_NORMAL
+ zh: ProGAN是NVIDIA实验室在2017年开发的一种技术,旨在提高GAN训练的速度和稳定性。ProGAN论文建议,不要立即在全分辨率图像上训练GAN,而是首先在低分辨率图像(例如4×4像素)上训练生成器和鉴别器,然后在训练过程中逐步添加层以增加分辨率。
- en: Let’s take a look at the concept of *progressive training* in more detail.
+ id: totrans-12
prefs: []
type: TYPE_NORMAL
+ zh: 让我们更详细地了解*渐进式训练*的概念。
- en: Training Your Own ProGAN
+ id: totrans-13
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 训练您自己的ProGAN
- en: There is an excellent tutorial by Bharath K on training your own ProGAN using
Keras available on the [Paperspace blog](https://oreil.ly/b2CJm). Bear in mind
that training a ProGAN to achieve the results from the paper requires a significant
amount of computing power.
+ id: totrans-14
prefs: []
type: TYPE_NORMAL
+ zh: Bharath K在[Paperspace博客](https://oreil.ly/b2CJm)上提供了一个关于使用Keras训练自己的ProGAN的优秀教程。请记住,训练ProGAN以达到论文中的结果需要大量的计算能力。
- en: Progressive Training
+ id: totrans-15
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 渐进式训练
- en: As always with GANs, we build two independent networks, the generator and discriminator,
with a fight for dominance taking place during the training process.
+ id: totrans-16
prefs: []
type: TYPE_NORMAL
+ zh: 与所有GAN一样,我们构建两个独立的网络——生成器和鉴别器,二者在训练过程中相互对抗、争夺主导权。
- en: In a normal GAN, the generator always outputs full-resolution images, even in
the early stages of training. It is reasonable to think that this strategy might
not be optimal—the generator might be slow to learn high-level structures in the
@@ -94,60 +128,84 @@
images. Wouldn’t it be better to first train a lightweight GAN to output accurate
low-resolution images and then see if we can build on this to gradually increase
the resolution?
+ id: totrans-17
prefs: []
type: TYPE_NORMAL
+ zh: 在普通的GAN中,生成器总是输出全分辨率图像,即使在训练的早期阶段也是如此。可以合理地认为,这种策略可能不是最佳的——生成器可能在训练的早期阶段学习高级结构较慢,因为它立即在复杂的高分辨率图像上操作。首先训练一个轻量级的GAN以输出准确的低分辨率图像,然后逐渐增加分辨率,这样做会更好吗?
- en: This simple idea leads us to *progressive training*, one of the key contributions
of the ProGAN paper. The ProGAN is trained in stages, starting with a training
set that has been condensed down to 4 × 4–pixel images using interpolation, as
shown in [Figure 10-1](Images/#condensed_images).
+ id: totrans-18
prefs: []
type: TYPE_NORMAL
+ zh: 这个简单的想法引导我们进入*渐进式训练*,这是ProGAN论文的一个关键贡献。ProGAN分阶段训练,从一个已经通过插值压缩到4×4像素图像的训练集开始,如[图10-1](Images/#condensed_images)所示。
- en: '![](Images/gdl2_1001.png)'
+ id: totrans-19
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1001.png)'
- en: Figure 10-1\. Images in the dataset can be compressed to lower resolution using
interpolation
+ id: totrans-20
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-1。数据集中的图像可以使用插值压缩到较低分辨率
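A small sketch of this condensing step, assuming TensorFlow (the resizing method chosen here is an assumption for illustration):

```python
import tensorflow as tf

def condense(images, resolution=4):
    # Downsample full-resolution images to `resolution` x `resolution` by interpolation.
    return tf.image.resize(images, (resolution, resolution), method="bilinear")

low_res = condense(tf.random.uniform((8, 1024, 1024, 3)))  # -> shape (8, 4, 4, 3)
```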
- en: We can then initially train the generator to transform a latent input noise
vector (say, of length 512) into an image
of shape 4 × 4 × 3\. The matching discriminator will need to transform an input
image of size 4 × 4 × 3 into a single scalar prediction. The network architectures
for this first step are shown in [Figure 10-2](#progan_4).
+ id: totrans-21
prefs: []
type: TYPE_NORMAL
+ zh: 然后,我们可以先训练生成器,将潜在输入噪声向量(比如长度为512)转换为形状为4×4×3的图像。与之匹配的鉴别器则需要将大小为4×4×3的输入图像转换为单个标量预测。第一步的网络架构如[图10-2](#progan_4)所示。
- en: The blue box in the generator represents the convolutional layer that converts
the set of feature maps into an RGB image (`toRGB`), and the blue box in the discriminator
represents the convolutional layer that converts the RGB images into a set of
feature maps (`fromRGB`).
+ id: totrans-22
prefs: []
type: TYPE_NORMAL
+ zh: 生成器中的蓝色框表示将特征图转换为RGB图像的卷积层(`toRGB`),鉴别器中的蓝色框表示将RGB图像转换为一组特征图的卷积层(`fromRGB`)。
- en: '![](Images/gdl2_1002.png)'
+ id: totrans-23
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1002.png)'
- en: Figure 10-2\. The generator and discriminator architectures for the first stage
of the ProGAN training process
+ id: totrans-24
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-2。ProGAN训练过程的第一阶段的生成器和鉴别器架构
- en: In the paper, the authors train this pair of networks until the discriminator
has seen 800,000 real images. We now need to understand how the generator and
discriminator are expanded to work with 8 × 8–pixel images.
+ id: totrans-25
prefs: []
type: TYPE_NORMAL
+ zh: 在论文中,作者训练这对网络,直到鉴别器看到了800,000张真实图像。现在我们需要了解如何扩展生成器和鉴别器以处理8×8像素图像。
- en: To expand the generator and discriminator, we need to blend in additional layers.
This is managed in two phases, transition and stabilization, as shown in [Figure 10-3](#progan_training_gen).
+ id: totrans-26
prefs: []
type: TYPE_NORMAL
+ zh: 为了扩展生成器和鉴别器,我们需要融入额外的层。这在两个阶段中进行,过渡和稳定,如[图10-3](#progan_training_gen)所示。
- en: '![](Images/gdl2_1003.png)'
+ id: totrans-27
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1003.png)'
- en: Figure 10-3\. The ProGAN generator training process, expanding the network from
4 × 4 images to 8 × 8 (dotted lines represent the rest of the network, not shown)
+ id: totrans-28
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-3。ProGAN生成器训练过程,将网络从4×4图像扩展到8×8(虚线代表网络的其余部分,未显示)
- en: Let’s first look at the generator. During the *transition phase*, new upsampling
and convolutional layers are appended to the existing network, with a residual
connection set up to maintain the output from the existing trained `toRGB` layer.
@@ -155,70 +213,96 @@
that is gradually increased from 0 to 1 throughout the transition phase to allow
more of the new `toRGB` output through and less of the existing `toRGB` layer.
This is to avoid a *shock* to the network as the new layers take over.
+ id: totrans-29
prefs: []
type: TYPE_NORMAL
+ zh: 让我们首先看一下生成器。在*过渡阶段*,新的上采样和卷积层被附加到现有网络中,并建立一个残差连接以保留现有已训练`toRGB`层的输出。关键的是,新层最初由一个参数进行加权,该参数在整个过渡阶段从0逐渐增加到1,从而让更多新`toRGB`层的输出通过,同时减少来自现有`toRGB`层的输出。这是为了避免新层接管时对网络造成*冲击*。
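A schematic sketch of the generator-side blending during the transition phase (the layer callables and their wiring are assumptions for illustration; the official implementation differs in detail):

```python
def grow_generator_output(features, to_rgb_old, upsample, conv_block_new, to_rgb_new, alpha):
    # Old path: previous-resolution RGB output, simply upsampled.
    old_image = upsample(to_rgb_old(features))
    # New path: upsampled features refined by the newly added block, then its own toRGB.
    new_image = to_rgb_new(conv_block_new(upsample(features)))
    # alpha ramps from 0 to 1 over the transition phase, phasing the old path out.
    return (1.0 - alpha) * old_image + alpha * new_image
```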
- en: Eventually, there is no flow through the old `toRGB` layer and the network enters
the *stabilization phase*—a further period of training where the network can fine-tune
the output, without any flow through the old `toRGB` layer.
+ id: totrans-30
prefs: []
type: TYPE_NORMAL
+ zh: 最终,旧的`toRGB`层不再有输出流,网络进入*稳定阶段*——进一步的训练期间,网络可以微调输出,而不经过旧的`toRGB`层。
- en: The discriminator uses a similar process, as shown in [Figure 10-4](#progan_training_dis).
+ id: totrans-31
prefs: []
type: TYPE_NORMAL
+ zh: 鉴别器使用类似的过程,如[图10-4](#progan_training_dis)所示。
- en: '![](Images/gdl2_1004.png)'
+ id: totrans-32
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1004.png)'
- en: Figure 10-4\. The ProGAN discriminator training process, expanding the network
from 4 × 4 images to 8 × 8 (dotted lines represent the rest of the network, not
shown)
+ id: totrans-33
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-4。ProGAN鉴别器训练过程,将网络从4×4图像扩展到8×8(虚线代表网络的其余部分,未显示)
- en: Here, we need to blend in additional downscaling and convolutional layers. Again,
the layers are injected into the network—this time at the start of the network,
just after the input image. The existing `fromRGB` layer is connected via a residual
connection and gradually phased out as the new layers take over during the transition
phase. The stabilization phase allows the discriminator to fine-tune using the
new layers.
+ id: totrans-34
prefs: []
type: TYPE_NORMAL
+ zh: 在这里,我们需要融入额外的降采样和卷积层。同样,这些层被注入到网络中——这次是在网络的起始处,紧跟在输入图像之后。现有的`fromRGB`层通过残差连接相连,并在过渡阶段随着新层逐渐接管而被逐步淡出。稳定阶段则允许鉴别器使用新层进行微调。
- en: All transition and stabilization phases last until the discriminator has been
shown 800,000 real images. Note that even through the network is trained progressively,
no layers are *frozen*. Throughout the training process, all layers remain fully
trainable.
+ id: totrans-35
prefs: []
type: TYPE_NORMAL
+ zh: 所有过渡和稳定阶段持续到鉴别器已经看到了800,000张真实图像。请注意,即使网络是渐进训练的,也没有层被*冻结*。在整个训练过程中,所有层都保持完全可训练。
- en: This process continues, growing the GAN from 4 × 4 images to 8 × 8, then 16
× 16, 32 × 32, and so on, until it reaches full resolution (1,024 × 1,024), as
shown in [Figure 10-5](#progan).
+ id: totrans-36
prefs: []
type: TYPE_NORMAL
+ zh: 这个过程继续进行,将GAN从4×4图像扩展到8×8,然后16×16,32×32,依此类推,直到达到完整分辨率(1,024×1,024),如[图10-5](#progan)所示。
- en: '![](Images/gdl2_1005.png)'
+ id: totrans-37
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1005.png)'
- en: 'Figure 10-5\. The ProGAN training mechanism, and some example generated faces
(source: [Karras et al., 2017](https://arxiv.org/abs/1710.10196))'
+ id: totrans-38
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-5。ProGAN训练机制,以及一些示例生成的人脸(来源:[Karras等人,2017](https://arxiv.org/abs/1710.10196))
- en: The overall structure of the generator and discriminator after the full progressive
training process is complete is shown in [Figure 10-6](#progan_network_diagram).
+ id: totrans-39
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1006.png)'
+ id: totrans-40
prefs: []
type: TYPE_IMG
- en: 'Figure 10-6\. The ProGAN generator and discriminator used to generate 1,024
× 1,024–pixel CelebA faces (source: [Karras et al., 2018](https://arxiv.org/abs/1812.04948))'
+ id: totrans-41
prefs:
- PREF_H6
type: TYPE_NORMAL
- en: The paper also makes several other important contributions, namely minibatch
standard deviation, equalized learning rates, and pixelwise normalization, which
are described briefly in the following sections.
+ id: totrans-42
prefs: []
type: TYPE_NORMAL
- en: Minibatch standard deviation
+ id: totrans-43
prefs:
- PREF_H3
type: TYPE_NORMAL
@@ -230,9 +314,11 @@
can use this feature to distinguish the fake batches from the real batches! Therefore,
the generator is incentivized to ensure it generates a similar amount of variety
as is present in the real training data.
+ id: totrans-44
prefs: []
type: TYPE_NORMAL
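A minimal TensorFlow sketch of the minibatch standard deviation idea described here; this simplified version appends one averaged statistic as a single extra channel (the official implementation adds refinements such as splitting the batch into groups):

```python
import tensorflow as tf

def minibatch_stddev(x):
    # x: discriminator feature maps of shape (batch, height, width, channels).
    std = tf.math.reduce_std(x, axis=0)          # std of each feature over the batch
    mean_std = tf.reduce_mean(std)               # single scalar statistic
    shape = tf.concat([tf.shape(x)[:3], [1]], axis=0)
    extra = tf.fill(shape, mean_std)             # broadcast as one extra feature map
    return tf.concat([x, extra], axis=-1)
```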
- en: Equalized learning rates
+ id: totrans-45
prefs:
- PREF_H3
type: TYPE_NORMAL
@@ -243,6 +329,7 @@
layer. This way, layers with a greater number of inputs will be initialized with
weights that have a smaller deviation from zero, which generally improves the
stability of the training process.
+ id: totrans-46
prefs: []
type: TYPE_NORMAL
- en: The authors of the ProGAN paper found that this was causing problems when used
@@ -254,6 +341,7 @@
more inputs). It was found that this causes an imbalance between the speed of
training of the different layers of the generator and discriminator in ProGAN,
so they used *equalized learning rates* to solve this problem.
+ id: totrans-47
prefs: []
type: TYPE_NORMAL
- en: In ProGAN, weights are initialized using a simple standard Gaussian, regardless
@@ -262,9 +350,11 @@
the optimizer sees each weight as having approximately the same dynamic range,
so it applies the same learning rate. It is only when the layer is called that
the weight is scaled by the factor from the He initializer.
+ id: totrans-48
prefs: []
type: TYPE_NORMAL
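A sketch of this idea for a dense layer, assuming Keras (the actual ProGAN implementation applies the same trick to its convolutional layers as well):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class EqualizedDense(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        fan_in = int(input_shape[-1])
        # Weights start as a plain standard Gaussian, regardless of fan-in...
        self.w = self.add_weight(
            shape=(fan_in, self.units),
            initializer=tf.keras.initializers.RandomNormal(stddev=1.0),
            trainable=True,
        )
        self.b = self.add_weight(shape=(self.units,), initializer="zeros", trainable=True)
        # ...and the He scaling factor is only applied when the layer is called.
        self.he_scale = np.sqrt(2.0 / fan_in)

    def call(self, x):
        return tf.matmul(x, self.w * self.he_scale) + self.b
```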
- en: Pixelwise normalization
+ id: totrans-49
prefs:
- PREF_H3
type: TYPE_NORMAL
@@ -273,9 +363,11 @@
a unit length and helps to prevent the signal from spiraling out of control as
it propagates through the network. The pixelwise normalization layer has no trainable
weights.
+ id: totrans-50
prefs: []
type: TYPE_NORMAL
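A minimal Keras sketch of a pixelwise normalization layer (`eps` is a small constant assumed for numerical stability):

```python
import tensorflow as tf
from tensorflow.keras import layers

class PixelwiseNormalization(layers.Layer):
    # No trainable weights: divide each pixel's channel vector by its root-mean-square.
    def call(self, x, eps=1e-8):
        return x / tf.sqrt(tf.reduce_mean(tf.square(x), axis=-1, keepdims=True) + eps)
```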
- en: Outputs
+ id: totrans-51
prefs:
- PREF_H2
type: TYPE_NORMAL
@@ -284,25 +376,31 @@
in [Figure 10-7](#progan_examples). This demonstrated the power of ProGAN over
earlier GAN architectures and paved the way for future iterations such as StyleGAN
and StyleGAN2, which we shall explore in the next sections.
+ id: totrans-52
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1007.png)'
+ id: totrans-53
prefs: []
type: TYPE_IMG
- en: 'Figure 10-7\. Generated examples from a ProGAN trained progressively on the
LSUN dataset at 256 × 256 resolution (source: [Karras et al., 2017](https://arxiv.org/abs/1710.10196))'
+ id: totrans-54
prefs:
- PREF_H6
type: TYPE_NORMAL
- en: StyleGAN
+ id: totrans-55
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: StyleGAN^([3](ch10.xhtml#idm45387005140128)) is a GAN architecture from 2018
that builds on the earlier ideas in the ProGAN paper. In fact, the discriminator
is identical; only the generator is changed.
+ id: totrans-56
prefs: []
type: TYPE_NORMAL
+ zh: StyleGAN^([3](ch10.xhtml#idm45387005140128))是2018年的一个GAN架构,建立在ProGAN论文中的早期思想基础上。实际上,鉴别器是相同的;只有生成器被改变。
- en: Often when training GANs it is difficult to separate out vectors in the latent
space corresponding to high-level attributes—they are frequently *entangled*,
meaning that adjusting an image in the latent space to give a face more freckles,
@@ -310,40 +408,56 @@
generates fantastically realistic images, it is no exception to this general rule.
We would ideally like to have full control of the style of the image, and this
requires a disentangled separation of features in the latent space.
+ id: totrans-57
prefs: []
type: TYPE_NORMAL
+ zh: 通常在训练GAN时,很难将潜在空间中对应高级属性的向量分离出来——它们经常是*纠缠在一起*的,这意味着,例如,在潜在空间中调整图像以使脸部有更多雀斑,可能也会无意中改变背景颜色。虽然ProGAN生成了极其逼真的图像,但它也不例外。我们理想情况下希望完全控制图像的风格,而这需要在潜在空间中对特征进行解耦分离。
- en: 'StyleGAN achieves this by explicitly injecting style vectors into the network
at different points: some that control high-level features (e.g., face orientation)
and some that control low-level details (e.g., the way the hair falls across the
forehead).'
+ id: totrans-58
prefs: []
type: TYPE_NORMAL
+ zh: StyleGAN通过在网络的不同点显式注入风格向量来实现这一点:一些控制高级特征(例如,面部方向)的向量,一些控制低级细节(例如,头发如何落在额头上)的向量。
- en: The overall architecture of the StyleGAN generator is shown in [Figure 10-8](#stylegan_arch).
Let’s walk through this architecture step by step, starting with the mapping network.
+ id: totrans-59
prefs: []
type: TYPE_NORMAL
+ zh: StyleGAN生成器的整体架构如[图10-8](#stylegan_arch)所示。让我们逐步走过这个架构,从映射网络开始。
- en: '![](Images/gdl2_1008.png)'
+ id: totrans-60
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1008.png)'
- en: 'Figure 10-8\. The StyleGAN generator architecture (source: [Karras et al.,
2018](https://arxiv.org/abs/1812.04948))'
+ id: totrans-61
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-8。StyleGAN生成器架构(来源:[Karras et al., 2018](https://arxiv.org/abs/1812.04948))
- en: Training Your Own StyleGAN
+ id: totrans-62
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 训练您自己的StyleGAN
- en: There is an excellent tutorial by Soon-Yau Cheong on training your own StyleGAN
using Keras available on the [Keras website](https://oreil.ly/MooSe). Bear in
mind that training a StyleGAN to achieve the results from the paper requires a
significant amount of computing power.
+ id: totrans-63
prefs: []
type: TYPE_NORMAL
+ zh: Soon-Yau Cheong在[Keras网站](https://oreil.ly/MooSe)上提供了一个关于使用Keras训练自己的StyleGAN的优秀教程。请记住,要实现论文中的结果,训练StyleGAN需要大量的计算资源。
- en: The Mapping Network
+ id: totrans-64
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 映射网络
- en: The *mapping network* is a simple feed-forward
network that converts the input noise into a different
@@ -351,17 +465,26 @@
∈ 𝒲 . This gives the generator the opportunity
to disentangle the noisy input vector into distinct factors of variation, which
can be easily picked up by the downstream style-generating layers.
+ id: totrans-65
prefs: []
type: TYPE_NORMAL
+ zh: '*映射网络*是一个简单的前馈网络,将输入噪声 𝐳 ∈ 𝒵 转换为属于另一个潜在空间的向量 𝐰 ∈ 𝒲。这使得生成器有机会将嘈杂的输入向量解耦为不同的变化因素,以便下游的风格生成层轻松捕捉。'
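A minimal Keras sketch of such a mapping network (the width, depth, and activation are illustrative; the StyleGAN paper uses a stack of eight dense layers):

```python
from tensorflow.keras import layers, models

def build_mapping_network(latent_dim=512, n_layers=8):
    z = layers.Input(shape=(latent_dim,), name="z")
    w = z
    for _ in range(n_layers):
        w = layers.Dense(latent_dim)(w)
        w = layers.LeakyReLU(0.2)(w)
    return models.Model(z, w, name="mapping_network")
```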
- en: The point of doing this is to separate out the process of choosing a style for
the image (the mapping network) from the generation of an image with a given style
(the synthesis network).
+ id: totrans-66
prefs: []
type: TYPE_NORMAL
+ zh: 这样做的目的是将图像的风格选择过程(映射网络)与生成具有给定风格的图像的过程(合成网络)分开。
- en: The Synthesis Network
+ id: totrans-67
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 合成网络
- en: 'The synthesis network is the generator of the actual image with a given style,
as provided by the mapping network. As can be seen from [Figure 10-8](#stylegan_arch),
the style vector is injected into the
@@ -374,16 +497,26 @@
the specific style that should be injected at this point in the network—that is,
they tell the synthesis network how to adjust the feature maps to move the generated
image in the direction of the specified style.'
+ id: totrans-68
prefs: []
type: TYPE_NORMAL
+ zh: 合成网络负责生成具有给定风格(由映射网络提供)的实际图像。如[图10-8](#stylegan_arch)所示,风格向量 𝐰 在合成网络的不同位置被注入,每次都经过一个不同的密集连接层,生成两个向量:一个偏置向量和一个缩放向量。这两个向量定义了应在网络该处注入的特定风格——也就是告诉合成网络如何调整特征图,使生成的图像朝指定风格的方向移动。
- en: This adjustment is achieved through *adaptive instance normalization* (AdaIN)
layers.
+ id: totrans-69
prefs: []
type: TYPE_NORMAL
+ zh: 通过*自适应实例归一化*(AdaIN)层实现这种调整。
- en: Adaptive instance normalization
+ id: totrans-70
prefs:
- PREF_H3
type: TYPE_NORMAL
+ zh: 自适应实例归一化
- en: 'An AdaIN layer is a type of neural network layer that adjusts the mean and
variance of each feature map with a reference style bias
+ id: totrans-72
prefs: []
type: TYPE_NORMAL
+ zh: AdaIN层是一种神经网络层,它分别用参考风格偏置和缩放来调整每个特征图的均值和方差。
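A functional sketch of adaptive instance normalization, assuming TensorFlow; `y_scale` and `y_bias` stand for the per-channel style scale and bias produced from 𝐰 (names and the epsilon value are illustrative):

```python
import tensorflow as tf

def adain(x, y_scale, y_bias, eps=1e-5):
    # x: feature maps (batch, H, W, C); y_scale / y_bias: styles of shape (batch, 1, 1, C).
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)   # per-instance, per-channel stats
    normalized = (x - mean) / tf.sqrt(var + eps)
    return y_scale * normalized + y_bias
```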
- en: The adaptive instance normalization layers ensure that the style vectors that
are injected into each layer only affect features at that layer, by preventing
any style information from leaking through between layers. The authors show that
this results in the latent vectors being
significantly more disentangled than the original
vectors.
+ id: totrans-73
prefs: []
type: TYPE_NORMAL
- en: Since the synthesis network is based on the ProGAN architecture, it is trained
@@ -424,9 +571,11 @@
the latent vector , but we can also switch
the vector at different points in the
synthesis network to change the style at a variety of levels of detail.
+ id: totrans-74
prefs: []
type: TYPE_NORMAL
- en: Style mixing
+ id: totrans-75
prefs:
- PREF_H3
type: TYPE_NORMAL
@@ -445,9 +594,11 @@
w bold 2 right-parenthesis">𝐰 2
) is chosen at random, to break any possible
correlation between the vectors.
+ id: totrans-76
prefs: []
type: TYPE_NORMAL
- en: Stochastic variation
+ id: totrans-77
prefs:
- PREF_H3
type: TYPE_NORMAL
@@ -456,26 +607,32 @@
for stochastic details such as the placement of individual hairs, or the background
behind the face. Again, the depth at which the noise is injected affects the coarseness
of the impact on the image.
+ id: totrans-78
prefs: []
type: TYPE_NORMAL
- en: This also means that the initial input to the synthesis network can simply be
a learned constant, rather than additional noise. There is enough stochasticity
already present in the style inputs and the noise inputs to generate sufficient
variation in the images.
+ id: totrans-79
prefs: []
type: TYPE_NORMAL
- en: Outputs from StyleGAN
+ id: totrans-80
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: '[Figure 10-9](#stylegan_w) shows StyleGAN in action.'
+ id: totrans-81
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1009.png)'
+ id: totrans-82
prefs: []
type: TYPE_IMG
- en: 'Figure 10-9\. Merging styles between two generated images at different levels
of detail (source: [Karras et al., 2018](https://arxiv.org/abs/1812.04948))'
+ id: totrans-83
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -488,9 +645,11 @@
A. However, if the switch happens later, only fine-grained detail is carried across
from source B, such as colors and microstructure of the face, while the coarse
features from source A are preserved.
+ id: totrans-84
prefs: []
type: TYPE_NORMAL
- en: StyleGAN2
+ id: totrans-85
prefs:
- PREF_H1
type: TYPE_NORMAL
@@ -500,22 +659,27 @@
do not suffer as greatly from *artifacts*—water droplet–like areas of the image
that were found to be caused by the adaptive instance normalization layers in
StyleGAN, as shown in [Figure 10-10](#artifacts_stylegan).
+ id: totrans-86
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1010.png)'
+ id: totrans-87
prefs: []
type: TYPE_IMG
- en: 'Figure 10-10\. An artifact in a StyleGAN-generated image of a face (source:
[Karras et al., 2019](https://arxiv.org/abs/1912.04958))'
+ id: totrans-88
prefs:
- PREF_H6
type: TYPE_NORMAL
- en: Both the generator and the discriminator in StyleGAN2 are different from the
StyleGAN. In the next sections we will explore the key differences between the
architectures.
+ id: totrans-89
prefs: []
type: TYPE_NORMAL
- en: Training Your Own StyleGAN2
+ id: totrans-90
prefs:
- PREF_H1
type: TYPE_NORMAL
@@ -523,9 +687,11 @@
on [GitHub](https://oreil.ly/alB6w). Bear in mind that training a StyleGAN2 to
achieve the results from the paper requires a significant amount of computing
power.
+ id: totrans-91
prefs: []
type: TYPE_NORMAL
- en: Weight Modulation and Demodulation
+ id: totrans-92
prefs:
- PREF_H2
type: TYPE_NORMAL
@@ -536,6 +702,7 @@
by the modulation and demodulation steps in StyleGAN2 at runtime. In comparison,
the AdaIN layers of StyleGAN operate on the image tensor as it flows through the
network.
+ id: totrans-93
prefs: []
type: TYPE_NORMAL
- en: The AdaIN layer in StyleGAN is simply an instance normalization followed by
@@ -544,12 +711,15 @@
layers at runtime, rather than the output from the convolutional layers, as shown
in [Figure 10-11](#stylegan2_styleblock). The authors show how this removes the
artifact issue while retaining control of the image style.
+ id: totrans-94
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1011.png)'
+ id: totrans-95
prefs: []
type: TYPE_IMG
- en: Figure 10-11\. A comparison between the StyleGAN and StyleGAN2 style blocks
+ id: totrans-96
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -558,22 +728,30 @@
, where indexes the number of input channels
in the corresponding convolutional layer. This style vector is then applied to
the weights of the convolutional layer as follows:'
+ id: totrans-97
prefs: []
type: TYPE_NORMAL
- en:
+ id: totrans-98
prefs: []
type: TYPE_NORMAL
+ zh: $w'_{i,j,k} = s_i \cdot w_{i,j,k}$
- en: Here, indexes the output channels of the
layer and indexes the spatial dimensions.
This is the *modulation* step of the process.
+ id: totrans-99
prefs: []
type: TYPE_NORMAL
- en: 'Then, we need to normalize the weights so that they again have a unit standard
deviation, to ensure stability in the training process. This is the *demodulation*
step:'
+ id: totrans-100
prefs: []
type: TYPE_NORMAL
- en:
- en: where is a small constant value that
prevents division by zero.
+ id: totrans-102
prefs: []
type: TYPE_NORMAL
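A NumPy sketch of the modulation and demodulation steps acting directly on a convolutional kernel (the kernel layout and epsilon value are assumptions for illustration):

```python
import numpy as np

def modulate_demodulate(w, s, eps=1e-8):
    # w: kernel of shape (kh, kw, in_channels, out_channels); s: style of shape (in_channels,)
    w_mod = w * s[None, None, :, None]                                # modulation: w' = s_i * w
    norm = np.sqrt((w_mod ** 2).sum(axis=(0, 1, 2), keepdims=True) + eps)
    return w_mod / norm                                               # demodulation: per output channel
```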
- en: In the paper, the authors show how this simple change is enough to prevent water-droplet
artifacts, while retaining control over the generated images via the style vectors
and ensuring the quality of the output remains high.
+ id: totrans-103
prefs: []
type: TYPE_NORMAL
- en: Path Length Regularization
+ id: totrans-104
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 路径长度正则化
- en: Another change made to the StyleGAN architecture is the inclusion of an additional
penalty term in the loss function—*this is known as path length regularization*.
+ id: totrans-105
prefs: []
type: TYPE_NORMAL
+ zh: StyleGAN架构的另一个变化是在损失函数中包含了额外的惩罚项——*这被称为路径长度正则化*。
- en: We would like the latent space to be as smooth and uniform as possible, so that
a fixed-size step in the latent space in any direction results in a fixed-magnitude
change in the image.
+ id: totrans-106
prefs: []
type: TYPE_NORMAL
+ zh: 我们希望潜在空间尽可能平滑和均匀,这样在任何方向上潜在空间中的固定大小步长会导致图像的固定幅度变化。
- en: 'To encourage this property, StyleGAN2 aims to minimize the following term,
alongside the usual Wasserstein loss with gradient penalty:'
+ id: totrans-107
prefs: []
type: TYPE_NORMAL
+ zh: 为了鼓励这一属性,StyleGAN2的目标是在通常的带梯度惩罚的Wasserstein损失之外,同时最小化以下这一项:
- en:
+ id: totrans-108
prefs: []
type: TYPE_NORMAL
+ zh: $\mathbb{E}_{\mathbf{w},\mathbf{y}}\left(\lVert \mathbf{J}_{\mathbf{w}}^{\top}\mathbf{y}\rVert_{2} - a\right)^{2}$
- en: Here, is a set of style vectors created
by the mapping network, is a set of noisy
images drawn from is the Jacobian of the
generator network with respect to the style vectors.
+ id: totrans-109
prefs: []
type: TYPE_NORMAL
+ zh: 在这里,𝐰 是由映射网络生成的一组风格向量,𝐲 是一组从正态分布中采样的噪声图像,𝐉_𝐰 是生成器网络关于风格向量的雅可比矩阵。
- en: The term
@@ -644,36 +858,56 @@
w Superscript down-tack Baseline y parallel-to Subscript 2">𝐉 𝑤 ⊤ 𝑦
2 as the training progresses.
+ id: totrans-110
prefs: []
type: TYPE_NORMAL
+ zh: 该项衡量的是图像经雅可比矩阵所给梯度变换后的幅度。我们希望它接近一个常数,该常数在训练过程中作为该幅度的指数移动平均值动态计算。
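A rough TensorFlow sketch of the penalty, assuming a `generator` callable that maps style vectors `w` to images and a running constant `a` maintained elsewhere as an exponential moving average (the official implementation adds further normalization details):

```python
import tensorflow as tf

def path_length_penalty(generator, w, a):
    with tf.GradientTape() as tape:
        tape.watch(w)
        images = generator(w)
        y = tf.random.normal(tf.shape(images))        # random "noisy images"
        proj = tf.reduce_sum(images * y)              # gradient of this w.r.t. w equals J_w^T y
    jvp = tape.gradient(proj, w)
    lengths = tf.sqrt(tf.reduce_sum(tf.square(jvp), axis=-1))
    return tf.reduce_mean((lengths - a) ** 2)
```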
- en: The authors find that this additional term makes exploring the latent space
more reliable and consistent. Moreover, the regularization terms in the loss function
are only applied once every 16 minibatches, for efficiency. This technique, called
*lazy regularization*, does not cause a measurable drop in performance.
+ id: totrans-111
prefs: []
type: TYPE_NORMAL
+ zh: 作者发现,这个额外的项使探索潜在空间更加可靠和一致。此外,为了提高效率,损失函数中的正则化项每16个小批次才应用一次。这种称为*惰性正则化*的技术不会导致可测量的性能下降。
- en: No Progressive Growing
+ id: totrans-112
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: 无渐进式增长
- en: Another major update is in how StyleGAN2 is trained. Rather than adopting the
usual progressive training mechanism, StyleGAN2 utilizes skip connections in the
generator and residual connections in the discriminator to train the entire network
as one. It no longer requires different resolutions to be trained independently
and blended as part of the training process.
+ id: totrans-113
prefs: []
type: TYPE_NORMAL
+ zh: 另一个重大更新在于StyleGAN2的训练方式。StyleGAN2不再采用通常的渐进式训练机制,而是在生成器中使用跳跃连接、在鉴别器中使用残差连接,将整个网络作为一个整体进行训练。它不再需要先独立训练不同的分辨率,再在训练过程中将它们混合起来。
- en: '[Figure 10-12](#stylegan2_gen_dis) shows the generator and discriminator blocks
in StyleGAN2.'
+ id: totrans-114
prefs: []
type: TYPE_NORMAL
+ zh: '[图10-12](#stylegan2_gen_dis)展示了StyleGAN2中的生成器和鉴别器块。'
- en: '![](Images/gdl2_1012.png)'
+ id: totrans-115
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1012.png)'
- en: Figure 10-12\. The generator and discriminator blocks in StyleGAN2
+ id: totrans-116
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-12。StyleGAN2中的生成器和鉴别器块
- en: The crucial property that we would like to be able to preserve is that the StyleGAN2
starts by learning low-resolution features and gradually refines the output as
training progresses. The authors show that this property is indeed preserved using
@@ -684,43 +918,57 @@
begin to dominate, as the generator discovers more intricate ways to improve the
realism of the images in order to fool the discriminator. This process is demonstrated
in [Figure 10-13](#stylegan2_contrib).
+ id: totrans-117
prefs: []
type: TYPE_NORMAL
+ zh: 我们希望能够保留的关键属性是,StyleGAN2从学习低分辨率特征开始,并随着训练的进行逐渐完善输出。作者表明,使用这种架构确实保留了这一属性。在训练的早期阶段,每个网络都受益于在较低分辨率层中细化卷积权重,而通过跳过和残差连接将输出传递到较高分辨率层的方式基本上不受影响。随着训练的进行,较高分辨率层开始占主导地位,因为生成器发现了更复杂的方法来改善图像的逼真度,以欺骗鉴别器。这个过程在[图10-13](#stylegan2_contrib)中展示。
- en: '![](Images/gdl2_1013.png)'
+ id: totrans-118
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1013.png)'
- en: Figure 10-13\. The contribution of each resolution layer to the output of the
generator, by training time (adapted from [Karras et al., 2019](https://arxiv.org/pdf/1912.04958.pdf))
+ id: totrans-119
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图10-13。每个分辨率层对生成器输出的贡献,按训练时间(改编自[Karras等人,2019](https://arxiv.org/pdf/1912.04958.pdf))
- en: Outputs from StyleGAN2
+ id: totrans-120
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: StyleGAN2的输出
- en: Some examples of StyleGAN2 output are shown in [Figure 10-14](#stylegan2_output).
To date, the StyleGAN2 architecture (and scaled variations such as StyleGAN-XL^([6](ch10.xhtml#idm45387004898624)))
remain state of the art for image generation on datasets such as Flickr-Faces-HQ
(FFHQ) and CIFAR-10, according to the benchmarking website [Papers with Code](https://oreil.ly/VwH2r).
+ id: totrans-121
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1014.png)'
+ id: totrans-122
prefs: []
type: TYPE_IMG
- en: 'Figure 10-14\. Uncurated StyleGAN2 output for the FFHQ face dataset and LSUN
car dataset (source: [Karras et al., 2019](https://arxiv.org/pdf/1912.04958.pdf))'
+ id: totrans-123
prefs:
- PREF_H6
type: TYPE_NORMAL
- en: Other Important GANs
+ id: totrans-124
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: In this section, we will explore two more architectures that have also contributed
significantly to the development of GANs—SAGAN and BigGAN.
+ id: totrans-125
prefs: []
type: TYPE_NORMAL
- en: Self-Attention GAN (SAGAN)
+ id: totrans-126
prefs:
- PREF_H2
type: TYPE_NORMAL
@@ -729,13 +977,16 @@
models such as the Transformer can also be incorporated into GAN-based models
for image generation. [Figure 10-15](#sagan_attention) shows the self-attention
mechanism from the paper introducing this architecture.
+ id: totrans-127
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1015.png)'
+ id: totrans-128
prefs: []
type: TYPE_IMG
- en: 'Figure 10-15\. The self-attention mechanism within the SAGAN model (source:
[Zhang et al., 2018](https://arxiv.org/abs/1805.08318))'
+ id: totrans-129
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -749,14 +1000,17 @@
solves this problem by incorporating the attention mechanism that we explored
earlier in this chapter into the GAN. The effect of this inclusion is shown in
[Figure 10-16](Images/#sagan_images).
+ id: totrans-130
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1016.png)'
+ id: totrans-131
prefs: []
type: TYPE_IMG
- en: 'Figure 10-16\. A SAGAN-generated image of a bird (leftmost cell) and the attention
maps of the final attention-based generator layer for the pixels covered by the
three colored dots (rightmost cells) (source: [Zhang et al., 2018](https://arxiv.org/abs/1805.08318))'
+ id: totrans-132
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -767,31 +1021,38 @@
falls on other tail pixels, some of which are distant from the blue dot. It would
be difficult to maintain this long-range dependency for pixels without attention,
especially for long, thin structures in the image (such as the tail in this case).
+ id: totrans-133
prefs: []
type: TYPE_NORMAL
- en: Training Your Own SAGAN
+ id: totrans-134
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: The official code for training your own SAGAN using TensorFlow is available
on [GitHub](https://oreil.ly/rvej0). Bear in mind that training a SAGAN to achieve
the results from the paper requires a significant amount of computing power.
+ id: totrans-135
prefs: []
type: TYPE_NORMAL
- en: BigGAN
+ id: totrans-136
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: BigGAN,^([8](ch10.xhtml#idm45387004870736)) developed at DeepMind, extends the
ideas from the SAGAN paper. [Figure 10-17](#biggan_examples) shows some of the
images generated by BigGAN, trained on the ImageNet dataset at 128 × 128 resolution.
+ id: totrans-137
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1017.png)'
+ id: totrans-138
prefs: []
type: TYPE_IMG
- en: 'Figure 10-17\. Examples of images generated by BigGAN (source: [Brock et al.,
2018](https://arxiv.org/abs/1809.11096))'
+ id: totrans-139
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -806,16 +1067,26 @@
that have magnitude greater than a certain threshold). The smaller the truncation
threshold, the greater the believability of generated samples, at the expense
of reduced variability. This concept is shown in [Figure 10-18](#truncation).
+ id: totrans-140
prefs: []
type: TYPE_NORMAL
+ zh: 除了对基本 SAGAN 模型进行一些增量更改外,论文中还概述了将模型提升到更高层次的几项创新。其中一项创新是所谓的*截断技巧*。这是指采样时使用的潜在分布与训练期间使用的分布不同。具体来说,采样时使用的是*截断正态分布*(对幅度大于某个阈值的值进行重新采样)。截断阈值越小,生成样本的可信度越高,但代价是变异性降低。这个概念在[图 10-18](#truncation)中展示。
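A NumPy sketch of sampling from such a truncated distribution by resampling out-of-range values (the threshold value here is arbitrary):

```python
import numpy as np

def truncated_normal(shape, threshold=0.5):
    z = np.random.randn(*shape)
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = np.random.randn(mask.sum())   # resample values beyond the threshold
        mask = np.abs(z) > threshold
    return z

z = truncated_normal((4, 128), threshold=1.0)   # latent batch used for sampling
```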
- en: '![](Images/gdl2_1018.png)'
+ id: totrans-141
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1018.png)'
- en: 'Figure 10-18\. The truncation trick: from left to right, the threshold is set
to 2, 1, 0.5, and 0.04 (source: [Brock et al., 2018](https://arxiv.org/abs/1809.11096))'
+ id: totrans-142
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图 10-18\. 截断技巧:从左到右,阈值设置为 2、1、0.5 和 0.04(来源:[Brock 等人,2018](https://arxiv.org/abs/1809.11096))
- en: Also, as the name suggests, BigGAN is an improvement over SAGAN in part simply
by being *bigger*. BigGAN uses a batch size of 2,048—8 times larger than the batch
size of 256 used in SAGAN—and a channel size that is increased by 50% in each
@@ -823,24 +1094,36 @@
by the inclusion of a shared embedding, by orthogonal regularization, and by incorporating
the latent vector into each layer of the generator,
rather than just the initial layer.
+ id: totrans-143
prefs: []
type: TYPE_NORMAL
+ zh: 正如其名称所示,BigGAN 对 SAGAN 的改进部分仅仅在于它更*大*。BigGAN 使用的批量大小为 2,048,是 SAGAN 所用批量大小 256 的 8 倍,并且每一层的通道数增加了 50%。不过,BigGAN 还表明,通过引入共享嵌入、正交正则化,以及将潜在向量注入生成器的每一层(而不仅仅是初始层),可以在结构上对 SAGAN 加以改进。
- en: For a full description of the innovations introduced by BigGAN, I recommend
reading the original paper and [accompanying presentation material](https://oreil.ly/vPn8T).
+ id: totrans-144
prefs: []
type: TYPE_NORMAL
+ zh: 要全面了解 BigGAN 引入的创新,我建议阅读原始论文和[相关演示材料](https://oreil.ly/vPn8T)。
- en: Using BigGAN
+ id: totrans-145
prefs:
- PREF_H1
type: TYPE_NORMAL
+ zh: 使用 BigGAN
- en: A tutorial for generating images using a pre-trained BigGAN is available on
[the TensorFlow website](https://oreil.ly/YLbLb).
+ id: totrans-146
prefs: []
type: TYPE_NORMAL
+ zh: 在[ TensorFlow 网站](https://oreil.ly/YLbLb)上提供了一个使用预训练的 BigGAN 生成图像的教程。
- en: VQ-GAN
+ id: totrans-147
prefs:
- PREF_H2
type: TYPE_NORMAL
+ zh: VQ-GAN
- en: Another important type of GAN is the Vector Quantized GAN (VQ-GAN), introduced
in 2020.^([9](ch10.xhtml#idm45387004838864)) This model architecture builds upon
an idea introduced in the 2017 paper “Neural Discrete Representation Learning”^([10](ch10.xhtml#idm45387004834704))—namely,
@@ -849,17 +1132,26 @@
high-quality images while avoiding some of the issues often seen with traditional
continuous latent space VAEs, such as *posterior collapse* (where the learned
latent space becomes uninformative due to an overly powerful decoder).
+ id: totrans-148
prefs: []
type: TYPE_NORMAL
+ zh: 另一种重要的 GAN 类型是 2020 年推出的 Vector Quantized GAN(VQ-GAN)。这种模型架构建立在 2017 年的论文“神经离散表示学习”中提出的一个想法之上,即
+ VAE 学习到的表示可以是离散的,而不是连续的。这种新型模型,即 Vector Quantized VAE(VQ-VAE),被证明可以生成高质量的图像,同时避免了传统连续潜在空间
+ VAE 经常出现的一些问题,比如“后验坍缩”(学习到的潜在空间由于过于强大的解码器而变得无信息)。
- en: Tip
+ id: totrans-149
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 提示
- en: The first version of DALL.E, a text-to-image model released by OpenAI in 2021
(see [Chapter 13](ch13.xhtml#chapter_multimodal)), utilized a VAE with a discrete
latent space, similar to VQ-VAE.
+ id: totrans-150
prefs: []
type: TYPE_NORMAL
+ zh: OpenAI 在 2021 年发布的文本到图像模型 DALL.E 的第一个版本(参见[第 13 章](ch13.xhtml#chapter_multimodal))使用了具有离散潜在空间的
+ VAE,类似于 VQ-VAE。
- en: By a *discrete latent space*, we mean a learned list of vectors (the *codebook*),
each associated with a corresponding index. The job of the encoder in a VQ-VAE
is to collapse the input image to a smaller grid of vectors that can then be compared
@@ -869,15 +1161,23 @@
(the embedding size) that matches the number of channels in the output of the
encoder and input to the decoder. For example, is a vector that can be interpreted as *background*.
+ id: totrans-151
prefs: []
type: TYPE_NORMAL
+ zh: 所谓*离散潜在空间*,指的是一个学习得到的向量列表(即*码书*),其中每个向量都关联一个相应的索引。VQ-VAE 中编码器的工作是将输入图像压缩为一个较小的向量网格,然后将其与码书进行比较。接着,与每个网格向量(按欧氏距离)最接近的码书向量被传递给解码器进行解码,如[图 10-19](#vqvae)所示。码书是一个学习向量的列表,每个向量的长度等于嵌入维度,与编码器输出和解码器输入的通道数相匹配。例如,其中某个码书向量可以被解释为*背景*。
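A NumPy sketch of the nearest-codebook lookup described here (shapes and names are illustrative):

```python
import numpy as np

def quantize(z_e, codebook):
    # z_e: encoder output grid of shape (H, W, d); codebook: (K, d) learned vectors.
    flat = z_e.reshape(-1, z_e.shape[-1])                            # (H*W, d)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)    # squared Euclidean distances
    indices = d2.argmin(axis=1)                                      # index of the closest codebook vector
    z_q = codebook[indices].reshape(z_e.shape)                       # vectors passed on to the decoder
    return z_q, indices.reshape(z_e.shape[:2])
```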
- en: '![](Images/gdl2_1019.png)'
+ id: totrans-152
prefs: []
type: TYPE_IMG
+ zh: '![](Images/gdl2_1019.png)'
- en: Figure 10-19\. A diagram of a VQ-VAE
+ id: totrans-153
prefs:
- PREF_H6
type: TYPE_NORMAL
+ zh: 图 10-19\. VQ-VAE 的示意图
- en: The codebook can be thought of as a set of learned discrete concepts that are
shared by the encoder and decoder in order to describe the contents of a given
image. The VQ-VAE must find a way to make this set of discrete concepts as informative
@@ -888,6 +1188,7 @@
as possible to vectors in the codebook. These terms replace the the KL divergence
term between the encoded distribution and the standard Gaussian prior in a typical
VAE.
+ id: totrans-154
prefs: []
type: TYPE_NORMAL
- en: However, this architecture poses a question—how do we sample novel code grids
@@ -900,26 +1201,32 @@
to predict the next code vector in the grid, given previous code vectors. In other
words, the prior is learned by the model, rather than static as in the case of
the vanilla VAE.
+ id: totrans-155
prefs: []
type: TYPE_NORMAL
- en: Training Your Own VQ-VAE
+ id: totrans-156
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: There is an excellent tutorial by Sayak Paul on training your own VQ-VAE using
Keras available on the [Keras website](https://oreil.ly/dmcb4).
+ id: totrans-157
prefs: []
type: TYPE_NORMAL
- en: The VQ-GAN paper details several key changes to the VQ-VAE architecture, as
shown in [Figure 10-20](#vqgan).
+ id: totrans-158
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1020.png)'
+ id: totrans-159
prefs: []
type: TYPE_IMG
- en: 'Figure 10-20\. A diagram of a VQ-GAN: the GAN discriminator helps to encourage
the VAE to generate less blurry images through an additional adversarial loss
term'
+ id: totrans-160
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -931,6 +1238,7 @@
GAN discriminator is an additional component rather than a replacement of the
VAE. The idea of combining a VAE with a GAN discriminator (VAE-GAN) was first
introduced by Larsen et al. in their 2015 paper.^([11](ch10.xhtml#idm45387004808112))
+ id: totrans-161
prefs: []
type: TYPE_NORMAL
- en: Secondly, the GAN discriminator predicts if small patches of the images are
@@ -948,6 +1256,7 @@
that VAEs produce images that are stylistically more blurry than real images,
so the PatchGAN discriminator can encourage the VAE decoder to generate sharper
images than it would naturally produce.
+ id: totrans-162
prefs: []
type: TYPE_NORMAL
- en: Thirdly, rather than use a single MSE reconstruction loss that compares the
@@ -957,6 +1266,7 @@
idea is from the 2016 paper by Hou et al.,^([14](ch10.xhtml#idm45387004793216))
where the authors show that this change to the loss function results in more realistic
image generations.
+ id: totrans-163
prefs: []
type: TYPE_NORMAL
- en: Lastly, instead of PixelCNN, a Transformer is used as the autoregressive part
@@ -966,9 +1276,11 @@
use tokens that fall within a sliding window around the token to be predicted.
This ensures that the model scales to larger images, which require a larger latent
grid size and therefore more tokens to be generated by the Transformer.
+ id: totrans-164
prefs: []
type: TYPE_NORMAL
- en: ViT VQ-GAN
+ id: totrans-165
prefs:
- PREF_H2
type: TYPE_NORMAL
@@ -976,6 +1288,7 @@
entitled “Vector-Quantized Image Modeling with Improved VQGAN.”^([15](ch10.xhtml#idm45387004783968))
Here, the authors show how the convolutional encoder and decoder of the VQ-GAN
can be replaced with Transformers as shown in [Figure 10-21](#vit_vqgan).
+ id: totrans-166
prefs: []
type: TYPE_NORMAL
- en: For the encoder, the authors use a *Vision Transformer* (ViT).^([16](ch10.xhtml#idm45387004780000))
@@ -983,6 +1296,7 @@
designed for natural language processing, to image data. Instead of using convolutional
layers to extract features from an image, a ViT divides the image into a sequence
of patches, which are tokenized and then fed as input to an encoder Transformer.
+ id: totrans-167
prefs: []
type: TYPE_NORMAL
- en: Specifically, in the ViT VQ-GAN, the nonoverlapping input patches (each of size
@@ -993,14 +1307,17 @@
model, with the overall output being a sequence of patches that can be stitched
back together to form the original image. The overall encoder-decoder model is
trained end-to-end as an autoencoder.
+ id: totrans-168
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1021.png)'
+ id: totrans-169
prefs: []
type: TYPE_IMG
- en: 'Figure 10-21\. A diagram of a ViT VQ-GAN: the GAN discriminator helps to encourage
the VAE to generate less blurry images through an additional adversarial loss
term (source: [Yu and Koh, 2022](https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html))^([17](ch10.xhtml#idm45387004774560))'
+ id: totrans-170
prefs:
- PREF_H6
type: TYPE_NORMAL
@@ -1009,23 +1326,28 @@
in total, there are three Transformers in a ViT VQ-GAN, in addition to the GAN
discriminator and learned codebook. Examples of images generated by the ViT VQ-GAN
from the paper are shown in [Figure 10-22](#vit_vqgan_ex).
+ id: totrans-171
prefs: []
type: TYPE_NORMAL
- en: '![](Images/gdl2_1022.png)'
+ id: totrans-172
prefs: []
type: TYPE_IMG
- en: 'Figure 10-22\. Example images generated by a ViT VQ-GAN trained on ImageNet
(source: [Yu et al., 2021](https://arxiv.org/pdf/2110.04627.pdf))'
+ id: totrans-173
prefs:
- PREF_H6
type: TYPE_NORMAL
- en: Summary
+ id: totrans-174
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: In this chapter, we have taken a tour of some of the most important and influential
GAN papers since 2017\. In particular, we have explored ProGAN, StyleGAN, StyleGAN2,
SAGAN, BigGAN, VQ-GAN, and ViT VQ-GAN.
+ id: totrans-175
prefs: []
type: TYPE_NORMAL
- en: We started by exploring the concept of progressive training that was pioneered
@@ -1037,6 +1359,7 @@
alongside additional enhancements such as path regularization. The paper also
showed how the desirable property of gradual resolution refinement could be retained
without having to the train the network progressively.
+ id: totrans-176
prefs: []
type: TYPE_NORMAL
- en: We also saw how the concept of attention could be built into a GAN, with the
@@ -1046,6 +1369,7 @@
spatial dimensions of the image. BigGAN was an extension of this idea that made
several key changes and trained a larger network to improve the image quality
further.
+ id: totrans-177
prefs: []
type: TYPE_NORMAL
- en: In the VQ-GAN paper, the authors show how several different types of generative
@@ -1056,78 +1380,96 @@
used to construct a novel sequence of code tokens that can be decoded by the VAE
decoder to produce novel images. The ViT VQ-GAN paper extends this idea even further,
by replacing the convolutional encoder and decoder of VQ-GAN with Transformers.
+ id: totrans-178
prefs: []
type: TYPE_NORMAL
- en: '^([1](ch10.xhtml#idm45387005226448-marker)) Huiwen Chang et al., “Muse: Text-to-Image
Generation via Masked Generative Transformers,” January 2, 2023, [*https://arxiv.org/abs/2301.00704*](https://arxiv.org/abs/2301.00704).'
+ id: totrans-179
prefs: []
type: TYPE_NORMAL
- en: ^([2](ch10.xhtml#idm45387005216528-marker)) Tero Karras et al., “Progressive
Growing of GANs for Improved Quality, Stability, and Variation,” October 27, 2017,
[*https://arxiv.org/abs/1710.10196*](https://arxiv.org/abs/1710.10196).
+ id: totrans-180
prefs: []
type: TYPE_NORMAL
- en: ^([3](ch10.xhtml#idm45387005140128-marker)) Tero Karras et al., “A Style-Based
Generator Architecture for Generative Adversarial Networks,” December 12, 2018,
[*https://arxiv.org/abs/1812.04948*](https://arxiv.org/abs/1812.04948).
+ id: totrans-181
prefs: []
type: TYPE_NORMAL
- en: ^([4](ch10.xhtml#idm45387005090240-marker)) Xun Huang and Serge Belongie, “Arbitrary
Style Transfer in Real-Time with Adaptive Instance Normalization,” March 20, 2017,
[*https://arxiv.org/abs/1703.06868*](https://arxiv.org/abs/1703.06868).
+ id: totrans-182
prefs: []
type: TYPE_NORMAL
- en: ^([5](ch10.xhtml#idm45387005019232-marker)) Tero Karras et al., “Analyzing and
Improving the Image Quality of StyleGAN,” December 3, 2019, [*https://arxiv.org/abs/1912.04958*](https://arxiv.org/abs/1912.04958).
+ id: totrans-183
prefs: []
type: TYPE_NORMAL
- en: '^([6](ch10.xhtml#idm45387004898624-marker)) Axel Sauer et al., “StyleGAN-XL:
Scaling StyleGAN to Large Diverse Datasets,” February 1, 2022, [*https://arxiv.org/abs/2202.00273v2*](https://arxiv.org/abs/2202.00273v2).'
+ id: totrans-184
prefs: []
type: TYPE_NORMAL
- en: ^([7](ch10.xhtml#idm45387004886752-marker)) Han Zhang et al., “Self-Attention
Generative Adversarial Networks,” May 21, 2018, [*https://arxiv.org/abs/1805.08318*](https://arxiv.org/abs/1805.08318).
+ id: totrans-185
prefs: []
type: TYPE_NORMAL
- en: ^([8](ch10.xhtml#idm45387004870736-marker)) Andrew Brock et al., “Large Scale
GAN Training for High Fidelity Natural Image Synthesis,” September 28, 2018, [*https://arxiv.org/abs/1809.11096*](https://arxiv.org/abs/1809.11096).
+ id: totrans-186
prefs: []
type: TYPE_NORMAL
- en: ^([9](ch10.xhtml#idm45387004838864-marker)) Patrick Esser et al., “Taming Transformers
for High-Resolution Image Synthesis,” December 17, 2020, [*https://arxiv.org/abs/2012.09841*](https://arxiv.org/abs/2012.09841).
+ id: totrans-187
prefs: []
type: TYPE_NORMAL
- en: ^([10](ch10.xhtml#idm45387004834704-marker)) Aaron van den Oord et al., “Neural
Discrete Representation Learning,” November 2, 2017, [*https://arxiv.org/abs/1711.00937v2*](https://arxiv.org/abs/1711.00937v2).
+ id: totrans-188
prefs: []
type: TYPE_NORMAL
- en: ^([11](ch10.xhtml#idm45387004808112-marker)) Anders Boesen Lindbo Larsen et
al., “Autoencoding Beyond Pixels Using a Learned Similarity Metric,” December
31, 2015, [*https://arxiv.org/abs/1512.09300*](https://arxiv.org/abs/1512.09300).
+ id: totrans-189
prefs: []
type: TYPE_NORMAL
- en: ^([12](ch10.xhtml#idm45387004801680-marker)) Phillip Isola et al., “Image-to-Image
Translation with Conditional Adversarial Networks,” November 21, 2016, [*https://arxiv.org/abs/1611.07004v3*](https://arxiv.org/abs/1611.07004v3).
+ id: totrans-190
prefs: []
type: TYPE_NORMAL
- en: ^([13](ch10.xhtml#idm45387004798080-marker)) Jun-Yan Zhu et al., “Unpaired Image-to-Image
Translation using Cycle-Consistent Adversarial Networks,” March 30, 2017, [*https://arxiv.org/abs/1703.10593*](https://arxiv.org/abs/1703.10593).
+ id: totrans-191
prefs: []
type: TYPE_NORMAL
- en: ^([14](ch10.xhtml#idm45387004793216-marker)) Xianxu Hou et al., “Deep Feature
Consistent Variational Autoencoder,” October 2, 2016, [*https://arxiv.org/abs/1610.00291*](https://arxiv.org/abs/1610.00291).
+ id: totrans-192
prefs: []
type: TYPE_NORMAL
- en: ^([15](ch10.xhtml#idm45387004783968-marker)) Jiahui Yu et al., “Vector-Quantized
Image Modeling with Improved VQGAN,” October 9, 2021, [*https://arxiv.org/abs/2110.04627*](https://arxiv.org/abs/2110.04627).
+ id: totrans-193
prefs: []
type: TYPE_NORMAL
- en: '^([16](ch10.xhtml#idm45387004780000-marker)) Alexey Dosovitskiy et al., “An
Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale,” October
22, 2020, [*https://arxiv.org/abs/2010.11929v2*](https://arxiv.org/abs/2010.11929v2).'
+ id: totrans-194
prefs: []
type: TYPE_NORMAL
- en: ^([17](ch10.xhtml#idm45387004774560-marker)) Jiahui Yu and Jing Yu Koh, “Vector-Quantized
Image Modeling with Improved VQGAN,” May 18, 2022, [*https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html*](https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html).
+ id: totrans-195
prefs: []
type: TYPE_NORMAL