Recent advances in 3D generation have leveraged synthetic datasets with ground-truth 3D assets and predefined cameras. However, the potential of real-world datasets, which can yield significantly more realistic 3D scenes, remains largely unexplored. In this work, we tackle a key challenge of real-world captures: their complex and scene-specific camera trajectories. We introduce Director3D, a robust open-world text-to-3D generation framework designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first employ a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories conditioned on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the distribution of image sequences given the camera trajectories and texts. Fine-tuned from a 2D diffusion model, it directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Finally, the 3D Gaussians are refined by a novel SDS++ loss, acting as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods in real-world 3D generation.
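To make the three-stage design concrete, the following is a minimal, illustrative sketch of the Cinematographer → Decorator → Detailer flow. All function names, tensor shapes, and the placeholder loss are hypothetical stand-ins (each real stage is a large diffusion network), not the released Director3D API; only the control flow mirrors the description above.

```python
# Hypothetical sketch of the three-stage pipeline; stages are stubs so the
# control flow runs end to end. Names and shapes are illustrative assumptions.
import torch

def cinematographer(prompt: str, n_views: int = 8) -> torch.Tensor:
    # Stand-in for the Trajectory Diffusion Transformer: samples a
    # text-conditioned camera trajectory, here [n_views, 12]
    # (e.g., flattened extrinsics plus intrinsics per view).
    return torch.randn(n_views, 12)

def decorator(prompt: str, cameras: torch.Tensor) -> torch.Tensor:
    # Stand-in for the Gaussian-driven Multi-view Latent Diffusion Model:
    # returns pixel-aligned 3D Gaussians, here [n_points, 14]
    # (position, scale, rotation quaternion, opacity, color).
    n_points = cameras.shape[0] * 4096
    return torch.randn(n_points, 14, requires_grad=True)

def detailer(gaussians: torch.Tensor, prompt: str,
             steps: int = 100, lr: float = 1e-2) -> torch.Tensor:
    # Stand-in for SDS++ refinement: optimizes the Gaussian parameters
    # against a diffusion-prior objective. A real implementation would
    # rasterize the Gaussians from sampled cameras and score the rendering
    # with the frozen 2D diffusion model; the loss below is a placeholder.
    opt = torch.optim.Adam([gaussians], lr=lr)
    for _ in range(steps):
        loss = gaussians.pow(2).mean()  # placeholder for the SDS++ objective
        opt.zero_grad(); loss.backward(); opt.step()
    return gaussians.detach()

prompt = "a cozy living room with a fireplace"
cams = cinematographer(prompt)      # (1) adaptive camera trajectory
splats = decorator(prompt, cams)    # (2) initial pixel-aligned 3D Gaussians
scene = detailer(splats, prompt)    # (3) refined 3D scene
```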
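The exact form of SDS++ is defined in the paper; for orientation, the standard Score Distillation Sampling (SDS) gradient from DreamFusion, which losses of this family build on, is shown below. Here $\theta$ denotes the 3D scene parameters (the 3D Gaussians), $g$ the differentiable renderer, $\hat{\epsilon}_{\phi}$ the frozen 2D diffusion model's noise prediction, $y$ the text prompt, and $w(t)$ a timestep weighting.

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\,
      \big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,
      \tfrac{\partial x}{\partial \theta} \,\Big],
\qquad x = g(\theta),\quad x_t = \alpha_t\, x + \sigma_t\, \epsilon
```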