Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion

KAUST, Saudi Arabia

• Zero-Shot
• Purely Text-Based
• Diverse Motions
• Temporally Consistent
• No videos needed for training/tuning or guidance

Abstract

We propose a zero-shot approach for consistent Text-to-Animated-Characters synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos. We strive to bridge this gap by introducing a zero-shot approach that produces temporally consistent videos of animated characters and requires no training or fine-tuning. We leverage existing text-based motion diffusion models to generate diverse motions, which we use to guide a T2I model. To achieve temporal consistency, we introduce the Spatial Latent Alignment module, which exploits cross-frame dense correspondences that we compute in order to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies. Our approach generates temporally consistent videos with diverse motions and styles, outperforming existing zero-shot T2V approaches in terms of pixel-wise consistency and user preference.

Summary

  • Text-to-Image (T2I) diffusion models can generate diverse images of human characters.
  • However, the generated images vary under any change to the latent code or the guidance signal (a minimal demonstration follows this list).
  • This makes it difficult to use T2I models for directly generating videos of human characters without video tuning or guidance.
  • We propose a new approach that employs motion diffusion models to generate motion guidance, while the video frames are generated using a pre-trained T2I diffusion model.
  • Our approach can produce consistent videos of animated characters given only a textual prompt, without requiring video guidance, training, or fine-tuning.
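As a quick illustration of the sensitivity mentioned above, the minimal sketch below generates the same prompt twice from slightly perturbed initial latents, which typically already yields a visibly different character. It assumes the Hugging Face diffusers library; the model id, prompt, and perturbation scale are illustrative choices, not the settings used in this work.

```python
# Illustrative only: model id, prompt, and perturbation scale are assumptions,
# not the paper's settings. Requires the Hugging Face diffusers package.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a knight in silver armor, full body, plain background"
shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent grid for 512x512 output

base = torch.randn(shape, generator=torch.Generator("cpu").manual_seed(0))
perturbed = base + 0.05 * torch.randn(shape, generator=torch.Generator("cpu").manual_seed(1))

# Same prompt, same sampler settings; only the initial latent differs slightly.
img_a = pipe(prompt, latents=base.to("cuda", torch.float16)).images[0]
img_b = pipe(prompt, latents=perturbed.to("cuda", torch.float16)).images[0]
img_a.save("latent_base.png")
img_b.save("latent_perturbed.png")  # typically a noticeably different character
```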

Method

[Method overview figure]
  • Given a text prompt, we generate an animated skeleton using a motion diffusion model.
  • We fit an SMPL model to the skeleton, and we render a depth map and a DensePose map for each frame.
  • We compute dense cross-frame correspondences based on DensePose.
  • These correspondences are used to align the latents through the Spatial Latent Alignment module (sketched below).
  • Consistency is further improved through a Pixel-Wise Guidance strategy (also sketched below).
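The sketch below shows one way cross-frame DensePose correspondences could be used to align latents between two frames, in the spirit of the Spatial Latent Alignment module. It is an assumption-laden illustration, not the authors' released implementation: it assumes DensePose IUV maps of shape (3, H, W) with a part-index channel and two UV channels, diffusion latents of shape (C, h, w) on the same device, nearest-UV matching within each body part, and a simple blending weight alpha.

```python
# A sketch under the assumptions stated above; the function names, the
# nearest-UV matching, and the blending rule are illustrative only.
import torch
import torch.nn.functional as F

def densepose_correspondences(iuv_prev, iuv_cur):
    """For each body pixel in the current frame, find the previous-frame pixel
    with the same DensePose part label and the closest (U, V) coordinate."""
    part_p, uv_p = iuv_prev[0].flatten(), iuv_prev[1:].flatten(1)   # (N,), (2, N)
    part_c, uv_c = iuv_cur[0].flatten(),  iuv_cur[1:].flatten(1)
    match = torch.full((part_c.numel(),), -1, dtype=torch.long, device=part_c.device)
    for part in part_c.unique():
        if part == 0:                                   # 0 = background
            continue
        idx_p = (part_p == part).nonzero(as_tuple=True)[0]
        idx_c = (part_c == part).nonzero(as_tuple=True)[0]
        if idx_p.numel() == 0:                          # part not visible before
            continue
        d = torch.cdist(uv_c[:, idx_c].T.float(), uv_p[:, idx_p].T.float())
        match[idx_c] = idx_p[d.argmin(dim=1)]
    return match                                        # (H*W,), -1 = no match

def spatial_latent_alignment(z_prev, z_cur, iuv_prev, iuv_cur, alpha=0.5):
    """Warp the previous frame's latent onto the current frame through the
    correspondences and blend it with the current latent."""
    c, h, w = z_cur.shape
    # match at latent resolution to keep the nearest-neighbor search cheap
    iuv_p = F.interpolate(iuv_prev[None].float(), size=(h, w), mode="nearest")[0]
    iuv_c = F.interpolate(iuv_cur[None].float(),  size=(h, w), mode="nearest")[0]
    match = densepose_correspondences(iuv_p, iuv_c)
    z_p = z_prev.reshape(c, -1)
    z_c = z_cur.reshape(c, -1).clone()
    valid = match >= 0                                   # only pixels with a match
    z_c[:, valid] = alpha * z_c[:, valid] + (1 - alpha) * z_p[:, match[valid]]
    return z_c.view(c, h, w)
```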
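Similarly, here is a minimal sketch of a pixel-wise guidance step. It assumes a differentiable decode callable (e.g. a VAE decoder) mapping latents to images, a reference image already warped into the current frame, an optional validity mask, and a fixed step size; the actual loss, masking, and schedule used in the paper may differ.

```python
# Sketch of one guidance step; `decode`, `mask`, and `step_size` are assumptions.
import torch

def pixel_wise_guidance(latent, reference, decode, mask=None, step_size=0.1):
    """Nudge the latent so the decoded frame moves toward the reference,
    reducing pixel-wise discrepancies between corresponding pixels."""
    latent = latent.detach().requires_grad_(True)
    image = decode(latent)                      # (B, 3, H, W), same range as reference
    diff = (image - reference) ** 2
    if mask is not None:
        diff = diff * mask                      # penalize only matched/visible pixels
    loss = diff.mean()
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach() # step toward lower visual discrepancy
```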

Results


Prompt: <s> doing a jumpy dance


Prompt: <s> dances vals

Prompt: <s> Warms up before a battle

Prompt: <s> jumps off a cliff

Temporal Consistency Comparison


We compare against two zero-shot video synthesis approaches: MasaCtrl and Text2Video-Zero.
All approaches are fed the depth guidance that we generate using a motion diffusion model.

Note that we apply video frame interpolation to all produced videos to increase the frame rate from 10 to 30 FPS; this does not modify the generated content in any way.
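The interpolation method itself is not specified here; purely for illustration, the sketch below inserts two linearly blended in-between frames per consecutive pair, tripling the frame rate from 10 to roughly 30 FPS while leaving the original generated frames unchanged. A learned interpolator would typically be used in practice.

```python
# Naive linear-blend interpolation for illustration only; the actual
# interpolator used for the videos is not specified in the text above.
import numpy as np

def triple_frame_rate(frames):
    """frames: list of HxWx3 uint8 arrays at 10 FPS -> ~30 FPS sequence.
    Original frames are kept unchanged; two blended frames are inserted
    between every consecutive pair."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        a_f, b_f = a.astype(np.float32), b.astype(np.float32)
        out.append(a)
        out.append(((2 * a_f + b_f) / 3).round().astype(np.uint8))  # t = 1/3
        out.append(((a_f + 2 * b_f) / 3).round().astype(np.uint8))  # t = 2/3
    out.append(frames[-1])
    return out
```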