DPE: Disentanglement of Pose and Expression for General Video Portrait Editing


1 MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China       2 School of Artificial Intelligence, University of Chinese Academy of Sciences       3 Tencent AI Lab, Shenzhen, China
CVPR 2023

Abstract

One-shot video-driven talking face generation aims at producing a synthetic talking video by transferring the facial motion from a video to an arbitrary portrait image. Head pose and facial expression are always entangled in facial motion and transferred simultaneously. However, this entanglement sets up a barrier for these methods to be used directly in video portrait editing, where it may be required to modify the expression only while keeping the pose unchanged. One challenge of decoupling pose and expression is the lack of paired data, such as the same pose but different expressions. Only a few methods attempt to tackle this challenge with the aid of 3D Morphable Models (3DMMs) for explicit disentanglement. However, 3DMMs are not accurate enough to capture facial details due to the limited number of blendshapes, which has side effects on motion transfer. In this paper, we introduce a novel self-supervised disentanglement framework to decouple pose and expression without 3DMMs and paired data, which consists of a motion editing module, a pose generator, and an expression generator. The editing module projects faces into a latent space where pose motion and expression motion can be disentangled, and pose or expression transfer can be performed in the latent space conveniently via addition. The two generators then render the modified latent codes to images, respectively. Moreover, to guarantee the disentanglement, we propose a bidirectional cyclic training strategy with well-designed constraints. Evaluations demonstrate that our method can control pose or expression independently and be used for general video editing.
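The latent editing described above can be pictured with a short PyTorch sketch. Everything below (the module names, the toy encoder/decoder architectures, the 256-dimensional latent size, and the simple split of the code into pose and expression halves) is an illustrative assumption rather than the authors' actual networks; it only demonstrates the idea of transferring expression by adding motion codes in a latent space and rendering the result with a generator. Note that the paper uses two separate generators (pose and expression), whereas the toy decoder below merges them for brevity.

# A minimal sketch of the latent-space editing idea, under the assumptions above.
import torch
import torch.nn as nn

LATENT_DIM = 256  # assumed size of the motion latent code


class MotionEditingModule(nn.Module):
    """Projects a face image into a latent space where pose motion and
    expression motion can be separated (here, naively, by splitting the code)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, LATENT_DIM),
        )

    def forward(self, img):
        code = self.encoder(img)
        # Assumed disentanglement: first half = pose motion, second half = expression motion.
        pose_code, exp_code = code.chunk(2, dim=1)
        return pose_code, exp_code


class Generator(nn.Module):
    """Renders an edited latent code back to an image (toy decoder)."""

    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, pose_code, exp_code):
        return self.decoder(torch.cat([pose_code, exp_code], dim=1))


def transfer_expression(editor, exp_generator, source_img, driving_img):
    """Edit only the expression of the source: keep its pose code and add the
    driving expression motion in the latent space (transfer via addition)."""
    src_pose, src_exp = editor(source_img)
    _, drv_exp = editor(driving_img)
    edited_exp = src_exp + drv_exp  # expression transfer by latent addition
    return exp_generator(src_pose, edited_exp)


if __name__ == "__main__":
    editor, exp_gen = MotionEditingModule(), Generator()
    source = torch.rand(1, 3, 256, 256)
    driving = torch.rand(1, 3, 256, 256)
    out = transfer_expression(editor, exp_gen, source, driving)
    print(out.shape)  # edited frame keeping the source pose, taking the driving expression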

Video1: Audio-driven Video Editing

Given a source video S and a piece of audio A, we transfer expression from A to S.

From left to right: source video, audio, video editing result. (We implement the audio-driving part with the help of SadTalker.)

Video2: Visual-driven Video Editing

Given a source video S and a driving video D, we transfer only the expression from D to S.

Video3: Video-driven Image Animation

Given a source image S, a driving video D1, and another driving video D2, we transfer expression from D1 to S and transfer pose from D2 to S.
The input is a single image, and the pose and expression information come from different sources, as sketched below. Some of the driving videos are selected from external sources.
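Reusing the hypothetical MotionEditingModule and Generator from the earlier sketch (their definitions are assumed to be in scope), a combined edit of this kind might look like the following; again, this is an assumption-laden illustration, not the released code.

def edit_pose_and_expression(editor, generator, source_img, exp_driver_img, pose_driver_img):
    """Expression motion is taken from D1, pose motion from D2, and both
    edits are applied to the single source image S by latent addition."""
    src_pose, src_exp = editor(source_img)
    _, exp_motion = editor(exp_driver_img)    # expression motion from D1
    pose_motion, _ = editor(pose_driver_img)  # pose motion from D2
    # Both transfers are performed by addition in the latent space.
    return generator(src_pose + pose_motion, src_exp + exp_motion)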

Video4: Audio-driven Image Animation

Given a source image S, a driving video D, and a piece of audio A, we transfer expression from A to S and transfer pose from D to S.

Video5: Pose Driving

Pipeline

BibTeX

@article{pang2023dpe,
  title={DPE: Disentanglement of Pose and Expression for General Video Portrait Editing},
  author={Pang, Youxin and Zhang, Yong and Quan, Weize and Fan, Yanbo and Cun, Xiaodong and Shan, Ying and Yan, Dong-ming},
  journal={arXiv preprint arXiv:2301.06281},
  year={2023}
}