Follow-Your-Pose v2:
Multiple-Condition Guided Character Image Animation for Stable Pose Control

Jingyun Xue1,2*, Hongfa Wang2*, Qi Tian2*, Yue Ma2,4, Andong Wang2, Zhiyuan Zhao2, Shaobo Min2, Wenzhe Zhao2, Kaihao Zhang3, Heung-Yeung Shum4, Wei Liu2, Mengyang Liu2†, Wenhan Luo1,4†
1Sun Yat-sen University,    2Tencent Hunyuan,
3Harbin Institute of Technology (Shenzhen),    4HKUST

*Indicates Equal Contribution

†Indicates Corresponding Author

Our method enables the animation of single or multiple characters.

Messi & Ronaldo

Curry & Durant

Abstract

Pose-controllable character video generation is in high demand, with extensive applications in fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images show promising performance, they tend to produce incoherent animation in complex scenarios, such as multi-character animation and body occlusion. Additionally, current methods require large-scale, high-quality videos with stable backgrounds and temporal consistency as training data; otherwise, their performance deteriorates severely. These two issues hinder the practical use of character image animation tools. In this paper, we propose Follow-Your-Pose v2, a practical and robust framework that can be trained on noisy open-source videos readily available on the internet. Multi-condition guiders are designed to address the challenges of background stability, body occlusion in multi-character generation, and consistency of character appearance. Moreover, to enable fair evaluation of multi-character pose animation, we propose a new benchmark comprising approximately 4,000 frames. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and 7 metrics. Qualitative assessments further show a significant improvement in the quality of the generated videos, particularly in scenarios involving complex backgrounds and multi-character body occlusion, confirming the superiority of our approach.

Method


Overview of our method. The left half illustrates the data flow of the multiple condition guiders: green and black arrows denote the training data flow, while red and black arrows indicate the inference data flow. After the pose sequence (or training video) and the reference image are processed through the corresponding data flow, three condition maps are obtained. The pose sequence and the three condition maps are each encoded by the corresponding guider ("Ref_Pose Guider", "Depth Guider", "Pose_Seq Guider", and "Optical Flow Guider") and then incorporated into the initial multi-frame noise. The right half shows the denoising U-Net and ReferenceNet. The initial noise is fed into the U-Net, which denoises it to generate the video. The reference image is encoded by the VAE encoder and passed through ReferenceNet to extract character features, which are injected into the Spatial-Attention layers of the U-Net. In addition, the reference image is encoded by the CLIP image encoder and fed into the Cross-Attention layers of both the U-Net and ReferenceNet. Note that the weights of the U-Net and ReferenceNet are trainable, while the weights of both encoders are frozen.
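To make the conditioning step concrete, below is a minimal PyTorch sketch of how condition guiders of this kind could be realised. It is our own simplification rather than the released implementation: the guider architecture, channel sizes, zero-initialised output layer, additive fusion into the noise latents, and the `fuse_conditions` helper are all assumptions.

```python
import torch
import torch.nn as nn


class ConditionGuider(nn.Module):
    """Lightweight convolutional encoder mapping one per-frame condition map
    (pose sequence, reference pose, depth, or optical flow) down to the latent
    resolution of the denoising U-Net (assumed 8x spatial downsampling)."""

    def __init__(self, in_channels: int, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        # Zero-init the last layer so each guider starts as a no-op (assumption).
        nn.init.zeros_(self.encoder[-1].weight)
        nn.init.zeros_(self.encoder[-1].bias)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B*F, C, H, W) -> (B*F, latent_channels, H/8, W/8)
        return self.encoder(cond)


def fuse_conditions(noise, conditions, guiders):
    """Encode each condition with its guider and add the result to the
    initial multi-frame noise latents of shape (B, F, 4, h, w)."""
    b, f = noise.shape[:2]
    fused = noise.flatten(0, 1)  # (B*F, 4, h, w)
    for name, cond in conditions.items():
        fused = fused + guiders[name](cond.flatten(0, 1))
    return fused.unflatten(0, (b, f))  # back to (B, F, 4, h, w)


# Example: four guiders, 8 frames, 512x512 condition maps, 64x64 latents.
guiders = nn.ModuleDict({
    "pose_seq": ConditionGuider(3),
    "ref_pose": ConditionGuider(3),
    "depth": ConditionGuider(1),
    "flow": ConditionGuider(2),
})
noise = torch.randn(1, 8, 4, 64, 64)
conditions = {
    "pose_seq": torch.randn(1, 8, 3, 512, 512),
    "ref_pose": torch.randn(1, 8, 3, 512, 512),
    "depth": torch.randn(1, 8, 1, 512, 512),
    "flow": torch.randn(1, 8, 2, 512, 512),
}
latents = fuse_conditions(noise, conditions, guiders)  # (1, 8, 4, 64, 64)
```

A fusion of this additive form keeps each guider optional, so an individual condition (e.g. optical flow) could be dropped at inference without changing the input shape expected by the U-Net.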

Video Presentation

Gallery

Here we present more results to demonstrate the ability of our method to handle a variety of scenarios.

Single Character Animation.

Multiple Character Animation.

Various Clothing.

Various Ages.

Various Ethnicities.

Complex Backgrounds.

BibTeX

@article{xue2024follow,
  title={Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control},
  author={Xue, Jingyun and Wang, Hongfa and Tian, Qi and Ma, Yue and Wang, Andong and Zhao, Zhiyuan and Min, Shaobo and Zhao, Wenzhe and Zhang, Kaihao and Shum, Heung-Yeung and others},
  journal={arXiv preprint arXiv:2406.03035},
  year={2024}
}