Recent work on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) has achieved impressive results. However, due to the limited pose and expression control afforded by NeRF's implicit representation, these methods still suffer from limitations such as unsynchronized or unnatural lip movements, visual jitter, and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. Exploiting the explicit representation of 3D Gaussians, we achieve intuitive control of facial motion by binding the Gaussians to 3D facial models. GaussianTalker consists of two modules: a Speaker-specific Motion Translator and a Dynamic Gaussian Renderer. The Speaker-specific Motion Translator produces accurate lip movements tailored to the target speaker through universalized audio feature extraction and customized lip motion generation. The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results show that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on an NVIDIA RTX 4090 GPU, far exceeding the threshold for real-time rendering, and can potentially be deployed on other hardware platforms.
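To make the Gaussian-to-mesh binding idea concrete, below is a minimal NumPy sketch of one plausible realization: each Gaussian center is attached to a vertex of a blendshape-deformed face mesh, so per-frame expression weights (e.g., predicted from audio) move the Gaussians with the face. All dimensions, names (neutral_verts, blendshape_basis, parent_vertex), and the linear blendshape formulation are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical toy dimensions; a real system would use a fitted 3DMM (e.g., FLAME).
N_VERTS = 5023        # number of mesh vertices (assumed)
N_BLENDSHAPES = 52    # number of expression blendshapes (assumed)

rng = np.random.default_rng(0)

# Stand-ins for fitted/learned data: neutral mesh and per-blendshape vertex offsets.
neutral_verts = rng.standard_normal((N_VERTS, 3)).astype(np.float32)
blendshape_basis = 0.01 * rng.standard_normal(
    (N_BLENDSHAPES, N_VERTS, 3)).astype(np.float32)

# Bind each Gaussian to a parent vertex plus a small local offset, so
# deforming the mesh drags the Gaussians along with the facial surface.
n_gaussians = 20000
parent_vertex = rng.integers(0, N_VERTS, size=n_gaussians)
local_offset = 0.005 * rng.standard_normal((n_gaussians, 3)).astype(np.float32)

def deform_mesh(weights: np.ndarray) -> np.ndarray:
    """Linear blendshape model: neutral + sum_i w_i * B_i."""
    return neutral_verts + np.tensordot(weights, blendshape_basis, axes=1)

def gaussian_centers(weights: np.ndarray) -> np.ndarray:
    """Place each Gaussian at its parent vertex plus its local offset."""
    verts = deform_mesh(weights)
    return verts[parent_vertex] + local_offset

# Example: per-frame expression weights (as would come from an audio-to-motion
# module) drive the Gaussian positions directly.
frame_weights = np.zeros(N_BLENDSHAPES, dtype=np.float32)
frame_weights[3] = 0.8  # activate one expression component
centers = gaussian_centers(frame_weights)
print(centers.shape)  # (20000, 3)
```

Because the control signal is a low-dimensional blendshape weight vector acting on an explicit geometry, this kind of binding gives direct, interpretable control over facial motion, in contrast to steering an implicit NeRF field.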