Image, TextPrompt -> Video
Generate temporally coherent future video frames by conditioning a diffusion transformer model on an initial static image and a text prompt.
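The conditioning loop can be sketched in a few lines. This is a toy illustration only, not the method of any particular project: the names `Conditioning`, `toy_denoiser`, and `generate_video` are hypothetical, the denoiser is a placeholder standing in for a real diffusion transformer, and pinning the first frame to the input image at every step is just one assumed way of conditioning on it (the source does not say how the image is injected).

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # one frame, flattened to a list of latent values


@dataclass
class Conditioning:
    first_frame: Frame        # latent of the initial static image (assumed pre-encoded)
    text_tokens: List[float]  # embedding of the text prompt (assumed pre-encoded)


def toy_denoiser(noisy: List[Frame], cond: Conditioning, t: float) -> List[Frame]:
    """Placeholder for the diffusion transformer: predicts clean frames.

    A real model would attend jointly over frame tokens and text tokens.
    This stand-in returns the conditioning image shifted by a text-dependent
    offset that grows with the frame index, and ignores `noisy` and `t`.
    """
    offset = sum(cond.text_tokens) / len(cond.text_tokens)
    return [[v + offset * k for v in cond.first_frame] for k in range(len(noisy))]


def generate_video(
    cond: Conditioning,
    num_frames: int,
    steps: int,
    denoiser: Callable[[List[Frame], Conditioning, float], List[Frame]],
    seed: int = 0,
) -> List[Frame]:
    rng = random.Random(seed)
    dim = len(cond.first_frame)
    # Start every frame from Gaussian noise.
    video = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]
    for i in range(steps):
        t = 1.0 - i / steps              # current noise level, always > 0
        t_next = 1.0 - (i + 1) / steps   # next noise level, reaches 0 on the last step
        video[0] = list(cond.first_frame)  # pin the known first frame
        pred = denoiser(video, cond, t)
        # Simple linear interpolation toward the prediction (not a specific sampler).
        scale = t_next / t
        video = [
            [p + scale * (x - p) for x, p in zip(frame_x, frame_p)]
            for frame_x, frame_p in zip(video, pred)
        ]
    video[0] = list(cond.first_frame)
    return video


cond = Conditioning(first_frame=[0.0, 1.0], text_tokens=[0.5, 0.5])
video = generate_video(cond, num_frames=3, steps=4, denoiser=toy_denoiser)
print(len(video), video[0])
```

The point of the sketch is the data flow: the image fixes where the clip starts, the text embedding steers how later frames evolve, and every frame is denoised jointly so the sequence stays coherent with the pinned first frame.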
Problem it solves
Pure text-to-video models cannot accurately simulate continuations of specific, pre-existing physical scenes or robot states.
Consumes: Image, TextPrompt
Emits: Video