Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion

KAUST, Saudi Arabia

• Zero-Shot
• Purely Text-Based
• Diverse Motions
• Temporally Consistent
• No videos needed for training/tuning or guidance

Abstract

We propose a zero-shot approach for consistent Text-to-Animated-Characters synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos. We strive to bridge this gap by introducing a zero-shot approach that produces temporally consistent videos of animated characters and requires no training or fine-tuning. We leverage existing text-based motion diffusion models to generate diverse motions, which we use to guide a T2I model. To achieve temporal consistency, we introduce the Spatial Latent Alignment module, which exploits cross-frame dense correspondences that we compute in order to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies. Our approach generates temporally consistent videos with diverse motions and styles, outperforming existing zero-shot T2V approaches in terms of pixel-wise consistency and user preference.

Summary

  • Text-to-Image (T2I) diffusion models can generate diverse images of human characters.
  • However, the generated images vary under any change to the latent code or the guidance signal (a minimal demonstration follows this list).
  • This makes it difficult to use T2I models for directly generating videos of human characters without video tuning or guidance.
  • We propose a new approach that employs motion diffusion models to generate motion guidance, while the video frames are generated using a pre-trained T2I diffusion model.
  • Our approach can produce consistent videos of animated characters given only a textual prompt, without requiring video guidance, training, or fine-tuning.
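As a quick illustration of the sensitivity mentioned above, the minimal sketch below generates the same prompt twice from slightly perturbed initial latents, which typically already yields a visibly different character. It assumes the Hugging Face diffusers library; the model id, prompt, and perturbation scale are illustrative choices, not the settings used in this work.

```python
# Illustrative only: model id, prompt, and perturbation scale are assumptions,
# not the paper's settings. Requires the Hugging Face diffusers package.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a knight in silver armor, full body, plain background"
shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent grid for 512x512 output

base = torch.randn(shape, generator=torch.Generator("cpu").manual_seed(0))
perturbed = base + 0.05 * torch.randn(shape, generator=torch.Generator("cpu").manual_seed(1))

# Same prompt, same sampler settings; only the initial latent differs slightly.
img_a = pipe(prompt, latents=base.to("cuda", torch.float16)).images[0]
img_b = pipe(prompt, latents=perturbed.to("cuda", torch.float16)).images[0]
img_a.save("latent_base.png")
img_b.save("latent_perturbed.png")  # typically a noticeably different character
```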

Method

[Method overview figure]
  • Given a text prompt, we generate an animated skeleton using a motion diffusion model.
  • We fit an SMPL model to the skeleton, and we render a depth map and a DensePose map for each frame.
  • We compute dense cross-frame correspondences based on DensePose.
  • These correspondences are used to align the latents through the Spatial Latent Alignment module (sketched below).
  • Consistency is further improved through a Pixel-Wise Guidance strategy (also sketched below).
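The sketch below shows one way cross-frame DensePose correspondences could be used to align latents between two frames, in the spirit of the Spatial Latent Alignment module. It is an assumption-laden illustration, not the authors' released implementation: it assumes DensePose IUV maps of shape (3, H, W) with a part-index channel and two UV channels, diffusion latents of shape (C, h, w) on the same device, nearest-UV matching within each body part, and a simple blending weight alpha.

```python
# A sketch under the assumptions stated above; the function names, the
# nearest-UV matching, and the blending rule are illustrative only.
import torch
import torch.nn.functional as F

def densepose_correspondences(iuv_prev, iuv_cur):
    """For each body pixel in the current frame, find the previous-frame pixel
    with the same DensePose part label and the closest (U, V) coordinate."""
    part_p, uv_p = iuv_prev[0].flatten(), iuv_prev[1:].flatten(1)   # (N,), (2, N)
    part_c, uv_c = iuv_cur[0].flatten(),  iuv_cur[1:].flatten(1)
    match = torch.full((part_c.numel(),), -1, dtype=torch.long, device=part_c.device)
    for part in part_c.unique():
        if part == 0:                                   # 0 = background
            continue
        idx_p = (part_p == part).nonzero(as_tuple=True)[0]
        idx_c = (part_c == part).nonzero(as_tuple=True)[0]
        if idx_p.numel() == 0:                          # part not visible before
            continue
        d = torch.cdist(uv_c[:, idx_c].T.float(), uv_p[:, idx_p].T.float())
        match[idx_c] = idx_p[d.argmin(dim=1)]
    return match                                        # (H*W,), -1 = no match

def spatial_latent_alignment(z_prev, z_cur, iuv_prev, iuv_cur, alpha=0.5):
    """Warp the previous frame's latent onto the current frame through the
    correspondences and blend it with the current latent."""
    c, h, w = z_cur.shape
    # match at latent resolution to keep the nearest-neighbor search cheap
    iuv_p = F.interpolate(iuv_prev[None].float(), size=(h, w), mode="nearest")[0]
    iuv_c = F.interpolate(iuv_cur[None].float(),  size=(h, w), mode="nearest")[0]
    match = densepose_correspondences(iuv_p, iuv_c)
    z_p = z_prev.reshape(c, -1)
    z_c = z_cur.reshape(c, -1).clone()
    valid = match >= 0                                   # only pixels with a match
    z_c[:, valid] = alpha * z_c[:, valid] + (1 - alpha) * z_p[:, match[valid]]
    return z_c.view(c, h, w)
```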
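Similarly, here is a minimal sketch of a pixel-wise guidance step. It assumes a differentiable decode callable (e.g. a VAE decoder) mapping latents to images, a reference image already warped into the current frame, an optional validity mask, and a fixed step size; the actual loss, masking, and schedule used in the paper may differ.

```python
# Sketch of one guidance step; `decode`, `mask`, and `step_size` are assumptions.
import torch

def pixel_wise_guidance(latent, reference, decode, mask=None, step_size=0.1):
    """Nudge the latent so the decoded frame moves toward the reference,
    reducing pixel-wise discrepancies between corresponding pixels."""
    latent = latent.detach().requires_grad_(True)
    image = decode(latent)                      # (B, 3, H, W), same range as reference
    diff = (image - reference) ** 2
    if mask is not None:
        diff = diff * mask                      # penalize only matched/visible pixels
    loss = diff.mean()
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach() # step toward lower visual discrepancy
```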

Results


Prompt: <s> doing a jumpy dance


Prompt: <s> dances vals

Prompt: <s> Warms up before a battle

Prompt: <s> jumps off a cliff

Temporal Consistency Comparison


We compare against two zero-shot video synthesis approaches: MasaCtrl and Text2Video-Zero.
All approaches are fed the depth guidance that we generate using a motion diffusion model.

Note that we apply video frame interpolation to all produced videos to increase the frame rate from 10 to 30 FPS; this does not modify the generated content in any way.
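The interpolation method itself is not specified here; purely for illustration, the sketch below inserts two linearly blended in-between frames per consecutive pair, tripling the frame rate from 10 to roughly 30 FPS while leaving the original generated frames unchanged. A learned interpolator would typically be used in practice.

```python
# Naive linear-blend interpolation for illustration only; the actual
# interpolator used for the videos is not specified in the text above.
import numpy as np

def triple_frame_rate(frames):
    """frames: list of HxWx3 uint8 arrays at 10 FPS -> ~30 FPS sequence.
    Original frames are kept unchanged; two blended frames are inserted
    between every consecutive pair."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        a_f, b_f = a.astype(np.float32), b.astype(np.float32)
        out.append(a)
        out.append(((2 * a_f + b_f) / 3).round().astype(np.uint8))  # t = 1/3
        out.append(((a_f + 2 * b_f) / 3).round().astype(np.uint8))  # t = 2/3
    out.append(frames[-1])
    return out
```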