InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
We present InfiniCube, a scalable method for generating unbounded, dynamic 3D driving scenes with high fidelity and controllability. Previous scene-generation methods either suffer from limited scale or lack geometric and appearance consistency along the generated sequences. In contrast, we leverage recent advances in scalable 3D representations and video models to generate large dynamic scenes with flexible control through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned, sparse-voxel-based 3D generative model for unbounded voxel-world generation. Then, we re-purpose a video model and ground it in the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method generates controllable and realistic 3D driving scenes, and extensive experiments validate its effectiveness and superiority.
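To give a concrete feel for the pixel-aligned guidance-buffer idea, the sketch below projects sparse voxel centers into a camera and z-buffers them into per-pixel depth and semantic maps that could condition a video model. This is a minimal, hypothetical simplification: the function name, buffer contents, and resolution are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def render_guidance_buffers(voxel_centers, voxel_labels, K, cam_T_world, hw=(8, 8)):
    """Project sparse voxel centers into the camera and z-buffer them into
    pixel-aligned depth and semantic guidance buffers (illustrative only)."""
    H, W = hw
    depth = np.full((H, W), np.inf)              # nearest hit per pixel
    semantic = np.full((H, W), -1, dtype=np.int64)  # -1 = empty pixel
    # Transform voxel centers from world to camera coordinates.
    pts_h = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    pts_cam = (cam_T_world @ pts_h.T).T[:, :3]
    for p, lbl in zip(pts_cam, voxel_labels):
        if p[2] <= 0:                            # behind the camera
            continue
        uvw = K @ p                              # pinhole projection
        u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
        if 0 <= u < W and 0 <= v < H and p[2] < depth[v, u]:
            depth[v, u] = p[2]                   # keep the nearest voxel
            semantic[v, u] = lbl
    return depth, semantic

# Two voxels along the optical axis; the nearer one should win the z-test.
K = np.array([[4.0, 0.0, 4.0], [0.0, 4.0, 4.0], [0.0, 0.0, 1.0]])
centers = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 5.0]])
depth, semantic = render_guidance_buffers(centers, [1, 2], K, np.eye(4))
```

A real pipeline would rasterize full voxel faces (with colors, normals, and object IDs) rather than point-splatting centers, but the principle of grounding each video pixel in the generated 3D world is the same.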