Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. For initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical problems: 1) the final shapes remain similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., "a dog", but not from lexically richer texts, e.g., "a dog is sitting on the top of the airplane". To address these problems, this paper proposes a novel general framework to boost 3D GS initialization for text-to-3D generation across levels of lexical richness. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling both spatial interaction among the 3D Gaussians and semantic interaction between the Gaussians and texts. Specifically, we first construct a voxelized representation, where each voxel holds a 3D Gaussian with its position, scale, and rotation fixed, while opacity is the sole factor determining a position's occupancy. We then design an initialization network consisting mainly of two novel components: 1) a Global Information Perception (GIP) block and 2) a Gaussians-Text Fusion (GTF) block. This design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from texts. Extensive experiments show the superiority of our framework in high-quality 3D GS initialization over existing methods, e.g., Shap-E, on lexically simple, medium, and hard texts. Moreover, our framework can be seamlessly plugged into SoTA training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
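The voxelized representation described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function names, grid resolution, default values, and occupancy threshold are all assumptions. It builds a uniform grid where each voxel holds one Gaussian with fixed position, scale, and rotation (identity quaternion), leaving opacity as the only free parameter that determines occupancy.

```python
# Hypothetical sketch of a voxelized 3D Gaussian grid: position, scale, and
# rotation are fixed per voxel; opacity alone decides occupancy.
# All names and defaults here are illustrative assumptions.
import numpy as np

def build_voxel_gaussians(resolution=32, extent=1.0):
    """Create a uniform grid of Gaussians over [-extent, extent]^3."""
    coords = np.linspace(-extent, extent, resolution)
    xx, yy, zz = np.meshgrid(coords, coords, coords, indexing="ij")
    positions = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)   # fixed

    n = positions.shape[0]
    voxel_size = 2 * extent / resolution
    scales = np.full((n, 3), voxel_size / 2)                     # fixed, isotropic
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))            # fixed identity quaternion
    opacities = np.zeros(n)                                      # the only free parameter

    return positions, scales, rotations, opacities

def occupied_mask(opacities, threshold=0.5):
    # A position counts as occupied when its Gaussian's opacity
    # exceeds the (assumed) threshold.
    return opacities > threshold
```

After an initialization network predicts per-voxel opacities, thresholding them with `occupied_mask` yields the initial shape; the occupied Gaussians can then be handed to a standard GS training pipeline for rendering optimization.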