Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, Zhou Zhao | Zhejiang University, ByteDance
This repository is the official PyTorch implementation of our ICLR-2023 paper, in which we propose GeneFace for generalized and high-fidelity audio-driven talking face generation. The inference pipeline is as follows:
Our GeneFace achieves better lip synchronization and expressiveness on out-of-domain audio. Watch this video for a clear lip-sync comparison against previous NeRF-based methods. You can also visit our project page for more details.
2023.2.22: We release a 1-minute demo video in which GeneFace is driven by a Chinese song generated by DiffSinger.
2023.2.20: We release a stable 3D landmark post-processing strategy in inference/nerfs/lm3d_nerf_infer.py, which improves the stability and quality of the final results by a large margin.
We provide pre-trained models and processed datasets of GeneFace in this release to enable a quick start. In the following, we show how to run inference with the pre-trained models in 4 steps. If you want to train GeneFace on your own target-person video, please refer to the following sections (Prepare Environments, Prepare Datasets, and Train Models).
- Step 1. Create a new Python env named geneface following the guide in docs/prepare_env/install_guide_nerf.md. Download BFM_model_front.mat at this link and place it into the ./deep_3drecon/BFM and ./deep_util/BFM_models directories.
- Step 2. Download lrs3.zip and May.zip in the release and unzip them into the checkpoints directory.
- Step 3. Download the binarized dataset of May.mp4 at this link (about 3.5 GB) and place it at data/binary/videos/May/trainval_dataset.npy.
After the above steps, the structure of your checkpoints and data directories should look like this:
> checkpoints
> lrs3
> lm3d_vae
> syncnet
> May
> postnet
> lm3d_nerf
> lm3d_nerf_torso
> data
> binary
> videos
> May
trainval_dataset.npy
- Step 4. Run the scripts below:
bash scripts/infer_postnet.sh
bash scripts/infer_lm3d_nerf.sh
You can find the output video at infer_out/May/pred_video/zozo.mp4.
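Before running the two inference scripts, it can help to verify that the files from Steps 1-3 ended up in the expected locations. Below is a minimal, hypothetical Python check (not part of the repository); it only looks for the paths documented in the directory layout above:

```python
from pathlib import Path

# Hypothetical sanity check for the quick-start setup (not part of GeneFace itself).
# It only verifies that the files described in Steps 1-3 exist at the documented paths.
REQUIRED_PATHS = [
    "deep_3drecon/BFM/BFM_model_front.mat",         # Step 1
    "checkpoints/lrs3/lm3d_vae",                    # Step 2 (lrs3.zip)
    "checkpoints/lrs3/syncnet",
    "checkpoints/May/postnet",                      # Step 2 (May.zip)
    "checkpoints/May/lm3d_nerf",
    "checkpoints/May/lm3d_nerf_torso",
    "data/binary/videos/May/trainval_dataset.npy",  # Step 3
]

missing = [p for p in REQUIRED_PATHS if not Path(p).exists()]
if missing:
    print("Missing files/directories:")
    for p in missing:
        print("  -", p)
else:
    print("All quick-start files are in place; you can run the inference scripts.")
```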
To prepare the environment, please follow the steps in docs/prepare_env.
To process the dataset, please follow the steps in docs/process_data.
To train the models, please follow the steps in docs/train_models.
Apart from the May.mp4 provided in this repo, we also provide the 8 target-person videos used in our experiments. You can download them at this link. To train on a new video named <video_id>.mp4, place it in the data/raw/videos/ directory, then create a new folder at egs/datasets/videos/<video_id> and edit its config files, following the provided example folder egs/datasets/videos/May.
You can also record your own video and train a unique GeneFace model for yourself!
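As a rough illustration of that setup step, the sketch below copies a new video into place and clones the May example folder as a config template. The helper function and the example video id are hypothetical, and the copied config files still need to be edited by hand as described above:

```python
import shutil
from pathlib import Path

def setup_new_video(video_path: str, video_id: str) -> None:
    """Hypothetical helper: stage a new target-person video for GeneFace training.

    Mirrors the manual steps described above: copy the video into
    data/raw/videos/ and clone egs/datasets/videos/May as a config template.
    """
    raw_dir = Path("data/raw/videos")
    raw_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(video_path, raw_dir / f"{video_id}.mp4")

    template = Path("egs/datasets/videos/May")
    target = Path(f"egs/datasets/videos/{video_id}")
    if not target.exists():
        shutil.copytree(template, target)
    print(f"Now edit the config files under {target} to point to {video_id}.")

# Example usage (hypothetical video id):
# setup_new_video("/path/to/my_recording.mp4", "Alice")
```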
- The inference process of the NeRF-based renderer is relatively slow (it takes about 2 hours on one RTX 2080 Ti to render 250 frames at 512x512 resolution with n_samples_per_ray_fine=128). Currently, we can partially alleviate this problem by using multiple GPUs or by setting --n_samples_per_ray and --n_samples_per_ray_fine to lower values. In the future we will add acceleration techniques to the NeRF-based renderer.
- GeneFace uses 3D landmarks as the intermediate representation between the audio-to-motion and motion-to-image mappings. However, the 3D landmark sequence generated by the postnet sometimes contains bad cases (such as a shaking head or an overly large mouth) that degrade the quality of the rendered video. Currently, we partially alleviate this problem by post-processing the predicted 3D landmark sequence. We call for better post-processing methods.
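For readers who want to experiment with such post-processing, below is a minimal sketch of one simple option: temporal smoothing of the predicted landmark sequence with a moving average. This is only an illustration, not the strategy implemented in inference/nerfs/lm3d_nerf_infer.py; the array shape and window size are assumptions:

```python
import numpy as np

def smooth_landmark_sequence(lm3d_seq: np.ndarray, window: int = 5) -> np.ndarray:
    """Temporally smooth a predicted 3D landmark sequence with a moving average.

    lm3d_seq is assumed to have shape [T, 68, 3] (T frames, 68 landmarks, xyz),
    and window is assumed to be odd. This is an illustrative post-processing
    baseline, not the repository's own strategy.
    """
    T = lm3d_seq.shape[0]
    flat = lm3d_seq.reshape(T, -1)
    # Pad at both ends so the output keeps the same number of frames.
    pad = window // 2
    padded = np.pad(flat, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(flat.shape[1])],
        axis=1,
    )
    return smoothed.reshape(lm3d_seq.shape)
```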
@article{ye2023geneface,
title={GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis},
author={Ye, Zhenhui and Jiang, Ziyue and Ren, Yi and Liu, Jinglin and He, Jinzheng and Zhao, Zhou},
journal={arXiv preprint arXiv:2301.13430},
year={2023}
}
Our code is based on the following repos:
- NATSpeech (For the code template)
- AD-NeRF (For NeRF-related implementation)
- style_avatar (For 3DMM parameters extraction)