Generating high-quality 3D digital assets often requires expert knowledge of complex design tools. We introduce Specialized Generative Primitives, a generative framework that allows non-expert users to author high-quality 3D scenes in a seamless, lightweight, and controllable manner. Each primitive is an efficient generative model that captures the distribution of a single exemplar from the real world. With our framework, users capture a video of an environment, which we turn into a high-quality, explicit appearance model using 3D Gaussian Splatting. Users then select regions of interest guided by semantically-aware features. To create a generative primitive, we adapt Generative Cellular Automata to single-exemplar training and controllable generation. We decouple the generative task from the appearance model by operating on sparse voxels, and we recover a high-quality output with a subsequent sparse patch consistency step. Each primitive can be trained within 10 minutes and used to author new scenes interactively in a fully compositional manner. We showcase interactive sessions in which various primitives are extracted from real-world scenes and controlled to create 3D assets and scenes within minutes. We also demonstrate additional capabilities of our primitives: handling various 3D representations to control generation, transferring appearances, and editing geometries.
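To make the core generative step concrete, the sketch below illustrates a cellular-automaton-style transition model on a voxel occupancy grid, trained against a single exemplar with infusion-style training in the spirit of Generative Cellular Automata. It is a minimal approximation, not the authors' implementation: dense PyTorch tensors stand in for the sparse voxel convolutions used in practice, and all names (`VoxelCA`, `gca_step`, `infusion_rate`) are our own illustrative assumptions.

```python
# Minimal, illustrative sketch of a GCA-style transition on a voxel grid.
# Dense tensors stand in for the sparse voxel structures used in practice;
# VoxelCA, gca_step, and infusion_rate are hypothetical names, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelCA(nn.Module):
    """Local transition model: predicts occupancy logits from the current state."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv3d(hidden, 1, 3, padding=1),
        )

    def forward(self, state):            # state: (B, 1, D, H, W) in {0, 1}
        return self.net(state)           # occupancy logits, same shape

def neighborhood_mask(state, radius=1):
    """Voxels within `radius` of an occupied voxel; sampling is restricted
    here, mimicking the locality of sparse cellular automata."""
    k = 2 * radius + 1
    return (F.max_pool3d(state, k, stride=1, padding=radius) > 0).float()

def gca_step(model, state, temperature=1.0):
    """One stochastic transition: sample Bernoulli occupancies near the surface."""
    logits = model(state)
    probs = torch.sigmoid(logits / temperature) * neighborhood_mask(state)
    return torch.bernoulli(probs)

def train_step(model, opt, exemplar, steps=8, infusion_rate=0.2):
    """Infusion-style training against a single exemplar occupancy grid."""
    state = torch.zeros_like(exemplar)
    state[..., exemplar.shape[-3] // 2, exemplar.shape[-2] // 2,
          exemplar.shape[-1] // 2] = 1.0          # seed voxel at the center
    loss = 0.0
    for _ in range(steps):
        logits = model(state)
        mask = neighborhood_mask(state)
        # Maximize the likelihood of the exemplar inside the active neighborhood.
        loss = loss + (F.binary_cross_entropy_with_logits(
            logits, exemplar, reduction='none') * mask).sum() / mask.sum()
        with torch.no_grad():
            sample = torch.bernoulli(torch.sigmoid(logits) * mask)
            # Infuse some ground-truth voxels so the chain drifts toward the exemplar.
            infuse = (torch.rand_like(exemplar) < infusion_rate).float()
            state = torch.clamp(sample * (1 - infuse) + exemplar * infuse, 0, 1)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    D = 16
    exemplar = torch.zeros(1, 1, D, D, D)
    exemplar[..., 4:12, 4:12, 4:12] = 1.0         # toy exemplar: a solid cube
    model = VoxelCA()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(50):
        train_step(model, opt, exemplar)
    # Generation: run the learned chain from a single seed voxel.
    state = torch.zeros_like(exemplar)
    state[..., D // 2, D // 2, D // 2] = 1.0
    for _ in range(16):
        state = gca_step(model, state)
    print("occupied voxels:", int(state.sum()))
```

In the system described by the abstract, this transition would run on sparse voxels rather than dense grids, and only geometry is generated this way: appearance is recovered separately by the sparse patch consistency step, keeping generation decoupled from the 3D Gaussian Splatting appearance model.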