LatentMan: Generating Consistent Animated Characters using Image Diffusion Models

KAUST, Saudi Arabia

• Zero-Shot
• Purely Text-Based
• Diverse Motions
• Temporally Consistent
• No videos needed for training/tuning or guidance

Abstract

We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap, and we introduce LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions to guide the T2I model. To boost the temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.

Summary

  • Text-to-Image (T2I) diffusion models can generate diverse images of human characters.
  • However, the generated images vary under any change to the latent code or the guidance signal.
  • This makes it difficult to use T2I models for directly generating videos of human characters without video tuning or guidance.
  • We propose a new approach that employs a motion diffusion model to generate a sequence of SMPL poses, while the video frames are generated using a pre-trained T2I diffusion model, as sketched after this list.
  • Our approach can produce consistent videos of animated characters given only a textual prompt without requiring video guidance, training, or finetuning.
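
The overall pipeline can be sketched as follows. This is a minimal illustration, not the LatentMan implementation: it assumes a depth-conditioned Stable Diffusion ControlNet (via the diffusers library) as the T2I backbone, and `generate_smpl_sequence` / `render_depth` are hypothetical placeholders standing in for the text-to-motion diffusion model and the SMPL depth renderer.

```python
# Minimal pipeline sketch (illustrative only, not the authors' code).
# `generate_smpl_sequence` and `render_depth` are hypothetical placeholders for
# the text-to-motion diffusion model and the SMPL-to-depth renderer.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def animate_character(subject_prompt: str, motion_prompt: str, num_frames: int = 10):
    # 1) Text -> motion: a sequence of SMPL poses driven by the motion prompt.
    smpl_sequence = generate_smpl_sequence(motion_prompt, num_frames)   # hypothetical
    # 2) Render a per-frame depth map to guide the image model.
    depth_maps = [render_depth(smpl) for smpl in smpl_sequence]         # hypothetical

    # 3) A depth-conditioned T2I model generates every frame from the same prompt.
    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet
    )

    frames = []
    for depth in depth_maps:
        # Same seed (same initial latent) for every frame; even so, frames drift,
        # which is what Spatial Latent Alignment and Pixel-Wise Guidance address.
        generator = torch.Generator().manual_seed(0)
        result = pipe(subject_prompt, image=depth, generator=generator,
                      num_inference_steps=30)
        frames.append(result.images[0])
    return frames
```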

Method

  • Given a text prompt, we generate an animated skeleton using a motion diffusion model.
  • We fit an SMPL model to the skeleton and render a depth map and a DensePose map.
  • We compute dense cross-frame correspondences based on DensePose, as sketched below.
  • These correspondences are used to align the latents through the Spatial Latent Alignment module, sketched below.
  • The consistency is further improved through a Pixel-Wise Guidance strategy, also sketched below.
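
A toy sketch of how dense cross-frame correspondences could be extracted from the DensePose renders. This is illustrative code rather than the paper's implementation; it assumes per-pixel (I, U, V) maps where I is the body-part index and (U, V) are continuous surface coordinates, and matches each body pixel of frame t to the nearest surface coordinate of the same part in frame t-1.

```python
# Toy sketch of DensePose-based cross-frame correspondences (illustrative only).
import numpy as np

def densepose_correspondences(iuv_prev: np.ndarray, iuv_curr: np.ndarray):
    """iuv_*: (H, W, 3) arrays of (part index I, U, V); I == 0 means background.
    Returns an (H, W, 2) array of source coordinates in frame t-1 and a validity mask."""
    H, W, _ = iuv_curr.shape
    src = np.zeros((H, W, 2), dtype=np.int64)
    valid = np.zeros((H, W), dtype=bool)
    ys, xs = np.nonzero(iuv_prev[..., 0])        # body pixels in frame t-1
    prev_parts = iuv_prev[ys, xs, 0]
    prev_uv = iuv_prev[ys, xs, 1:]
    for y in range(H):
        for x in range(W):
            part = iuv_curr[y, x, 0]
            if part == 0:
                continue                          # background: no correspondence
            cand = prev_parts == part             # same body part in frame t-1
            if not cand.any():
                continue
            d = np.sum((prev_uv[cand] - iuv_curr[y, x, 1:]) ** 2, axis=1)
            idx = np.nonzero(cand)[0][np.argmin(d)]  # nearest surface coordinate
            src[y, x] = (ys[idx], xs[idx])
            valid[y, x] = True
    return src, valid
```

A toy sketch of the Spatial Latent Alignment idea, assuming the correspondences above have already been converted into a backward sampling grid at the latent resolution; the blending weight and scheduling used in the paper may differ.

```python
# Toy sketch of Spatial Latent Alignment (not the authors' implementation).
import torch
import torch.nn.functional as F

def align_latent(z_prev: torch.Tensor,     # (1, C, h, w) latent of frame t-1
                 grid: torch.Tensor,       # (1, h, w, 2) sampling grid in [-1, 1]
                 valid: torch.Tensor,      # (1, 1, h, w) 1 where a correspondence exists
                 z_curr: torch.Tensor,     # (1, C, h, w) latent of frame t
                 alpha: float = 0.8) -> torch.Tensor:
    # Warp the previous latent onto the current frame via the correspondences.
    z_warped = F.grid_sample(z_prev, grid, mode="nearest", align_corners=False)
    # Blend aligned values into the current latent only where body parts correspond.
    return valid * (alpha * z_warped + (1 - alpha) * z_curr) + (1 - valid) * z_curr
```

A toy sketch of the Pixel-Wise Guidance idea, in the spirit of classifier guidance: at each denoising step the latent is nudged along the negative gradient of a pixel-wise discrepancy between the current frame's prediction and the previous frame warped into the current view. `pred_x0_fn`, the warped target, and the guidance scale are assumptions here, not the paper's exact formulation.

```python
# Toy sketch of Pixel-Wise Guidance (not the authors' implementation).
import torch

def pixel_wise_guidance(z_t: torch.Tensor,          # noisy latent at step t
                        pred_x0_fn,                  # latent -> predicted clean image (hypothetical)
                        target_prev: torch.Tensor,   # previous frame warped to the current view
                        mask: torch.Tensor,          # valid-correspondence mask
                        scale: float = 1.0) -> torch.Tensor:
    z_t = z_t.detach().requires_grad_(True)
    x0_pred = pred_x0_fn(z_t)                        # differentiable one-step prediction
    loss = ((mask * (x0_pred - target_prev)) ** 2).mean()
    grad = torch.autograd.grad(loss, z_t)[0]
    # Step against the gradient so the diffusion trajectory favors frames that
    # agree with the previous frame on corresponding pixels.
    return (z_t - scale * grad).detach()
```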

Results


Prompt: <s> doing a jumpy dance


Prompt: <s> dances vals

Prompt: <s> Warms up before a battle

Prompt: <s> jumps off a cliff

Temporal Consistency Comparison


We compare against two zero-shot video synthesis approaches: MasaCtrl and Text2Video-Zero.
All approaches are fed the depth guidance that we generate based on a motion diffusion model.

Note that we apply video frame interpolation to all produced videos to increase the frame rate from 10 to 30 FPS; this does not modify the generated content in any way.