Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging due to scene complexity and temporal dynamics. Despite advances in neural implicit models, limitations persist: (i) Inadequate Scene Structure: existing methods struggle to reveal the spatial and temporal structure of dynamic scenes by directly learning the complex 6D plenoptic function. (ii) Unscalable Deformation Modeling: explicitly modeling the deformation of scene elements becomes impractical for complex dynamics. To address these issues, we treat spacetime as a whole and propose to approximate the underlying spatio-temporal 4D volume of a dynamic scene by optimizing a collection of 4D primitives, with explicit geometry and appearance modeling. Learning to optimize the 4D primitives enables us to synthesize novel views at any desired time with our tailored rendering routine. Our model is conceptually simple, consisting of 4D Gaussians whose geometry is parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, and whose view-dependent, time-evolving appearance is represented by the coefficients of 4D spherindrical harmonics. This approach offers simplicity, flexibility for variable-length video and end-to-end training, and efficient real-time rendering, making it suitable for capturing complex dynamic scene motions. Experiments across various benchmarks, including monocular and multi-view scenarios, demonstrate our 4DGS model's superior visual quality and efficiency.
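To make the geometry side concrete, the sketch below illustrates the core operation implied by the abstract: conditioning a 4D Gaussian on a query time t yields a 3D spatial Gaussian (whose mean drifts linearly in t), while the 1D temporal marginal fades the primitive's contribution away from its temporal center. This follows the standard conditional-Gaussian identities and is not the authors' implementation; the function and variable names (`slice_at_time`, `mu4`, `cov4`) are illustrative assumptions.

```python
import numpy as np

def slice_at_time(mu4: np.ndarray, cov4: np.ndarray, t: float):
    """Condition a 4D Gaussian N(mu4, cov4) on time = t.

    mu4:  (4,) mean, ordered as (x, y, z, t).
    cov4: (4, 4) covariance, assumed symmetric positive definite.

    Returns the conditional 3D spatial mean and covariance, plus the
    temporal marginal density at t (usable to modulate opacity as t
    moves away from the Gaussian's temporal center).
    """
    mu_xyz, mu_t = mu4[:3], mu4[3]
    S_xx = cov4[:3, :3]    # spatial block
    S_xt = cov4[:3, 3:4]   # space-time cross-covariance, shape (3, 1)
    S_tt = cov4[3, 3]      # temporal variance (scalar)

    # Conditional 3D Gaussian: mean shifts linearly in t, covariance shrinks.
    mu_cond = mu_xyz + (S_xt[:, 0] / S_tt) * (t - mu_t)
    cov_cond = S_xx - (S_xt @ S_xt.T) / S_tt

    # Marginal density of the 1D temporal Gaussian at t.
    p_t = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt) / np.sqrt(2 * np.pi * S_tt)
    return mu_cond, cov_cond, p_t

# Usage with a random SPD covariance (illustrative only):
mu4 = np.array([0.0, 0.0, 0.0, 0.5])
A = np.random.randn(4, 4)
cov4 = A @ A.T + 1e-3 * np.eye(4)
mu3, cov3, w = slice_at_time(mu4, cov4, t=0.7)
```

The resulting (mu3, cov3) can then be splatted with an ordinary 3D Gaussian rasterizer, with w scaling the primitive's opacity at that frame.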
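The appearance side can be sketched similarly. The abstract describes 4D spherindrical harmonics for view-dependent, time-evolving color; one natural realization, assumed here rather than taken from the source, is a tensor product of a 1D Fourier series in time with real spherical harmonics over the viewing direction. The sketch below evaluates such a basis up to SH degree 1; the names (`eval_4dsh`, `coeffs`, `period`) are hypothetical.

```python
import numpy as np

def eval_4dsh(coeffs: np.ndarray, t: float, d: np.ndarray, period: float) -> float:
    """Evaluate a time-evolving, view-dependent scalar (e.g., one color channel).

    coeffs: (N_f, 4) coefficients over N_f Fourier terms x 4 real SH bases
            (degree <= 1).
    d:      unit viewing direction (x, y, z).
    period: assumed temporal period of the Fourier basis.
    """
    x, y, z = d
    # Real spherical harmonics up to degree 1 (standard constants).
    sh = np.array([
        0.28209479177387814,     # Y_0^0
        0.4886025119029199 * y,  # Y_1^{-1}
        0.4886025119029199 * z,  # Y_1^{0}
        0.4886025119029199 * x,  # Y_1^{1}
    ])
    # 1D Fourier basis in time: cos(2*pi*n*t / period), n = 0..N_f-1.
    n = np.arange(coeffs.shape[0])
    fourier = np.cos(2.0 * np.pi * n * t / period)
    # Tensor-product basis contracted against the coefficients.
    return float(fourier @ coeffs @ sh)
```

With n = 0 the Fourier term is constant, so this basis reduces to ordinary spherical harmonics for static appearance; the higher-order terms let each primitive's color evolve smoothly over time.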