From 1a2e5476cd0cb6eeec0643cce3fe9b7d4ea4a1b8 Mon Sep 17 00:00:00 2001
From: wizardforcel <562826179@qq.com>
Date: Thu, 8 Feb 2024 19:20:21 +0800
Subject: [PATCH] 2024-02-08 19:20:19

---
 totrans/gen-dl_17.yaml | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/totrans/gen-dl_17.yaml b/totrans/gen-dl_17.yaml
index ff3cc53..e235de1 100644
--- a/totrans/gen-dl_17.yaml
+++ b/totrans/gen-dl_17.yaml
@@ -3,6 +3,7 @@
   prefs:
   - PREF_H1
   type: TYPE_NORMAL
+  zh: 第13章。多模态模型
 - en: 'So far, we have analyzed generative learning problems that focus solely on
     one modality of data: either text, images, or music. We have seen how GANs and
     diffusion models can generate state-of-the-art images and how Transformers are
@@ -14,11 +15,13 @@
   id: totrans-1
   prefs: []
   type: TYPE_NORMAL
+  zh: 到目前为止，我们已经分析了专注于单一数据模态的生成学习问题：文本、图像或音乐。我们已经看到了GAN和扩散模型如何生成最先进的图像，以及Transformer如何引领文本和图像生成的方式。然而，作为人类，我们没有跨模态的困难——例如，描述给定照片中正在发生的事情，创作数字艺术来描绘书中虚构的幻想世界，或将电影配乐与给定场景的情感相匹配。我们能训练机器做同样的事吗？
 - en: Introduction
   id: totrans-2
   prefs:
   - PREF_H1
   type: TYPE_NORMAL
+  zh: 介绍
 - en: '*Multimodal learning* involves training generative models to convert between
     two or more different kinds of data. Some of the most impressive generative models
     introduced in the last two years have been multimodal in nature. In this chapter
@@ -27,17 +30,21 @@
   id: totrans-3
   prefs: []
   type: TYPE_NORMAL
+  zh: '*多模态学习*涉及训练生成模型以在两种或更多种不同类型的数据之间进行转换。在过去两年中引入的一些最令人印象深刻的生成模型具有多模态性质。在本章中，我们将详细探讨它们的工作原理，并考虑未来的生成建模将如何受到大型多模态模型的影响。'
 - en: 'We’ll explore four different vision-language models: DALL.E 2 from OpenAI;
     Imagen from Google Brain; Stable Diffusion from Stability AI, CompVis, and Runway;
     and Flamingo from DeepMind.'
   id: totrans-4
   prefs: []
   type: TYPE_NORMAL
+  zh: 我们将探讨四种不同的视觉语言模型：来自OpenAI的DALL.E 2；来自Google Brain的Imagen；来自Stability AI、CompVis和Runway的Stable
+    Diffusion；以及来自DeepMind的Flamingo。
 - en: Tip
   id: totrans-5
   prefs:
   - PREF_H6
   type: TYPE_NORMAL
+  zh: 提示
 - en: The aim of this chapter is to concisely explain how each model works, without
     going into the fine detail of every design decision. For more information, refer
     to the individual papers for each model, which explain all of the design choices
@@ -45,6 +52,7 @@
   id: totrans-6
   prefs: []
   type: TYPE_NORMAL
+  zh: 本章的目的是简明扼要地解释每个模型的工作原理，而不深入探讨每个设计决策的细节。有关更多信息，请参考每个模型的各自论文，其中详细解释了所有设计选择和架构决策。
 - en: Text-to-image generation focuses on producing state-of-the-art images from
     a given text prompt. For example, given the input “A head of broccoli made out
     of modeling clay, smiling in the sun,” we would like the model to be able to output
@@ -52,6 +60,7 @@
   id: totrans-7
   prefs: []
   type: TYPE_NORMAL
+  zh: 文本到图像生成侧重于从给定的文本提示生成最先进的图像。例如，给定输入“用造型粘土制成的一颗西兰花头，在阳光下微笑”，我们希望模型能够输出一个与文本提示精确匹配的图像，如[图13-1](#dalle_example)所示。
 - en: This is clearly a highly challenging problem. Text understanding and image generation
     are difficult to solve in their own right, as we have seen in previous chapters
     of this book. Multimodal modeling such as this presents an additional challenge,
@@ -61,15 +70,18 @@
   id: totrans-8
   prefs: []
   type: TYPE_NORMAL
+  zh: 这显然是一个极具挑战性的问题。文本理解和图像生成本身就很难解决，正如我们在本书的前几章中所看到的。这样的多模态建模提出了额外的挑战，因为模型还必须学习如何跨越两个领域之间的鸿沟，并学习一个共享表示，使其能够准确地将一段文本转换为高保真图像而不丢失信息。
 - en: '![](Images/gdl2_1301.png)'
   id: totrans-9
   prefs: []
   type: TYPE_IMG
+  zh: '![](Images/gdl2_1301.png)'
 - en: Figure 13-1\. An example of text-to-image generation by DALL.E 2
   id: totrans-10
   prefs:
   - PREF_H6
   type: TYPE_NORMAL
+  zh: 图13-1。DALL.E 2进行文本到图像生成的示例
 - en: Moreover, in order to be successful the model must be able to combine concepts
     and styles that it may never have seen before. For example, there are no Michelangelo
     frescos containing people wearing virtual reality headsets, but we would like
@@ -83,11 +95,13 @@
   id: totrans-11
   prefs: []
   type: TYPE_NORMAL
+  zh: 此外，要想取得成功，模型必须能够组合它可能从未见过的概念和风格。例如，米开朗基罗的壁画中并没有戴着虚拟现实头盔的人，但我们希望模型在我们提出要求时能够创作出这样的图像。同样，模型还必须能够根据文本提示，准确推断生成图像中各个对象之间的关系。例如，“宇航员骑着甜甜圈穿越太空”的图片应该与“宇航员在拥挤的空间里吃甜甜圈”的图片看起来截然不同。模型必须学习词语如何通过上下文获得意义，以及如何将实体之间明确的文本关系转换为暗含相同含义的图像。
 - en: DALL.E 2
   id: totrans-12
   prefs:
   - PREF_H1
   type: TYPE_NORMAL
+  zh: DALL.E 2
 - en: The first model we shall explore is *DALL.E 2*, a model designed by OpenAI
     for text-to-image generation. The first version of this model, DALL.E,^([1](ch13.xhtml#idm45387001413744))
     was released in February 2021 and sparked a new wave of interest in generative
@@ -97,6 +111,8 @@
   id: totrans-13
   prefs: []
   type: TYPE_NORMAL
+  zh: 我们将要探索的第一个模型是*DALL.E 2*，这是由OpenAI设计用于文本到图像生成的模型。该模型的第一个版本DALL.E于2021年2月发布，引发了对生成式多模态模型的新一波兴趣。在本节中，我们将研究该模型的第二次迭代DALL.E
+    2，它于2022年4月发布，距离第一个版本发布仅一年多一点。
 - en: DALL.E 2 is an extremely impressive model that has furthered our understanding
     of AI’s ability to solve these types of multimodal problems. It not only has ramifications
     academically, but also forces us to ask big questions relating to the role of