LatentMan: Generating Consistent Animated Characters using Image Diffusion Models

KAUST, Saudi Arabia

Abstract

We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap, and we introduce LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions to guide the T2I model. To boost the temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.

Summary

Text-to-Image (T2I) diffusion models can generate diverse images of human characters.
However, the generated images change with any modification to the latent code or the guidance signal.
This makes it difficult to use T2I models for directly generating videos of human characters without video tuning or guidance.
We propose a new approach that employs motion diffusion models to generate a sequence of SMPL models while the video frames are generated using a pre-trained T2I diffusion model.
Our approach can produce consistent videos of animated characters given only a textual prompt without requiring video guidance, training, or finetuning.

Method

Given a text prompt, we generate an animated skeleton using a motion diffusion model.
We fit an SMPL model to the skeleton, and we render a depth map and a DensePose map from it.
We compute dense cross-frame correspondences based on DensePose.
These correspondences are used to align the latents through the Spatial Latent Alignment module.
The consistency is improved further through a Pixel-Wise Guidance strategy.
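
The correspondence and alignment steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes DensePose output is available as an (H, W, 3) array holding a body-part index (0 for background) and surface coordinates u, v in [0, 1], matches pixels whose quantized (part, u, v) keys agree, and copies latent values from a reference frame to the matched locations in a target frame. The function names `dense_correspondences` and `align_latents`, the `bins` parameter, and the copy-based alignment rule are all hypothetical choices made for illustration.

```python
import numpy as np


def dense_correspondences(iuv_ref, iuv_tgt, bins=16):
    """Match pixels of two frames that share a quantized (part, u, v) key.

    iuv_ref, iuv_tgt: arrays of shape (H, W, 3) with a part index
    (0 = background) and surface coordinates u, v in [0, 1].
    Returns two (N, 2) integer arrays of (row, col) positions:
    reference pixels and the target pixels matched to them.
    """
    def keys(iuv):
        part = iuv[..., 0].astype(np.int64)
        u = np.clip((iuv[..., 1] * bins).astype(np.int64), 0, bins - 1)
        v = np.clip((iuv[..., 2] * bins).astype(np.int64), 0, bins - 1)
        key = (part * bins + u) * bins + v
        return np.where(part > 0, key, -1)  # -1 marks background

    key_ref, key_tgt = keys(iuv_ref), keys(iuv_tgt)

    # Remember the first reference pixel seen for each surface key.
    lookup = {}
    for r, c in zip(*np.nonzero(key_ref >= 0)):
        lookup.setdefault(int(key_ref[r, c]), (int(r), int(c)))

    ref_px, tgt_px = [], []
    for r, c in zip(*np.nonzero(key_tgt >= 0)):
        hit = lookup.get(int(key_tgt[r, c]))
        if hit is not None:
            ref_px.append(hit)
            tgt_px.append((int(r), int(c)))

    ref_px = np.array(ref_px, dtype=np.int64).reshape(-1, 2)
    tgt_px = np.array(tgt_px, dtype=np.int64).reshape(-1, 2)
    return ref_px, tgt_px


def align_latents(lat_ref, lat_tgt, ref_px, tgt_px):
    """Copy reference latents into the target frame at matched pixels.

    lat_ref, lat_tgt: arrays of shape (C, H, W) at the same resolution
    as the IUV maps. Unmatched target pixels are left unchanged.
    """
    out = lat_tgt.copy()
    out[:, tgt_px[:, 0], tgt_px[:, 1]] = lat_ref[:, ref_px[:, 0], ref_px[:, 1]]
    return out
```

In practice the latents of a diffusion model are at a lower resolution than the rendered maps, so the IUV maps would have to be downsampled first; the sketch assumes that has already been done.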

Temporal Consistency Comparison

We compare against two zero-shot video synthesis approaches: MasaCtrl and Text2Video-Zero.
All approaches are fed with the depth guidance that we generate based on a motion diffusion model.

Note that we apply video frame interpolation to all produced videos to increase the frame rate from 10 to 30 FPS, but it does not modify the generated contents in any way.
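
To make the frame-rate arithmetic concrete: going from 10 to 30 FPS means adding two intermediate frames between every pair of generated frames. The page does not say which interpolation method is used, so the sketch below substitutes simple linear blending as a stand-in; the function name `upsample_frames` and its `factor` parameter are hypothetical. It shows that the generated frames pass through unchanged at every third output index.

```python
import numpy as np


def upsample_frames(frames, factor=3):
    """Insert (factor - 1) linearly blended frames between consecutive frames.

    frames: array of shape (T, H, W, C).
    Returns an array of shape ((T - 1) * factor + 1, H, W, C) in which
    the original frames sit unchanged at indices 0, factor, 2 * factor, ...
    """
    frames = np.asarray(frames, dtype=np.float32)
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            t = k / factor  # t = 0 reproduces the original frame a
            out.append((1.0 - t) * a + t * b)
    out.append(frames[-1])
    return np.stack(out)
```

A learned interpolation model would produce sharper in-between frames than this blend, but the bookkeeping is the same.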

CVPRW 2024

Zero-Shot
Diverse Motions
Purely Text-Based
Temporally Consistent
No videos needed for training/tuning or guidance

Results

A robot jumps on a trampoline.

A skier running on a snowy road.

A cyberpunk robot jumps off a grass pitch.

A ballerina performing a jumpy dance in a studio.

A mechanical cyborg moves to the left on a sandy beach.
