Although neural rendering has made significant advancements in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, our method learns a conditional variational autoencoder that takes both the body motion and facial expression as driving signals to generate Gaussian maps in the UV layout. To drive the facial expressions, instead of the commonly used 3D Morphable Models (3DMMs) in 3D head avatars, we propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars. Leveraging the rendering capability of 3DGS and the rich expressiveness of the expression latent space, the learned avatars can be reenacted to reproduce photorealistic rendering images with subtle and accurate facial expressions. Experiments on an existing dataset and our newly proposed dataset of full-body talking avatars demonstrate the efficacy of our method. We also propose an audio-driven extension of our method with the help of 2D talking faces, opening new possibilities to interactive AI agents.
尽管神经渲染在创建逼真的、可动画的全身和头部化身方面取得了显著进展,但将详细的面部表情融入全身化身的研究仍然相对较少。我们提出了DEGAS,这是首个基于3D高斯投影(3DGS)的全身化身建模方法,可以表现丰富的面部表情。我们的方法在给定主体的多视角视频上进行训练,学习一个条件变分自编码器,它同时以身体动作和面部表情作为驱动信号来生成UV布局中的高斯图。为了驱动面部表情,我们提出使用仅在2D肖像图像上训练的表情潜在空间,而不是常用的3D头部化身中的3D可变形模型(3DMMs),从而弥合2D说话脸和3D化身之间的差距。利用3DGS的渲染能力和表情潜在空间的丰富表现力,所学习的化身可以重新演绎,以再现具有细腻且准确面部表情的逼真渲染图像。在现有数据集和我们新提出的全身说话化身数据集上的实验表明,我们的方法是有效的。我们还提出了一个基于音频驱动的方法扩展,通过2D说话脸的帮助,为交互式AI代理开辟了新的可能性。