(ImageSequence, LanguageInstruction) -> VLEmbeddings
Project raw image sequences and language instructions into a unified token sequence using a pretrained Vision-Language Model (VLM) to condition downstream action predictors.
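As a rough illustration of the signature above, the sketch below shows the shape of the data flow only. Everything in it is an assumption made for illustration: `ToyVLM` is a hypothetical stand-in with random, untrained weights rather than a real pretrained VLM, and the field names on `ImageSequence`, `LanguageInstruction`, and `VLEmbeddings` are invented here. The point is that image patches and instruction words are each embedded to the same width and then concatenated into one token sequence that a downstream action predictor could attend over.

```python
import random
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ImageSequence:
    # frames[t][p] holds the flattened pixel values of patch p in frame t (assumed layout).
    frames: List[List[Vector]]


@dataclass
class LanguageInstruction:
    text: str


@dataclass
class VLEmbeddings:
    tokens: List[Vector]      # unified sequence: image tokens first, then text tokens
    num_image_tokens: int
    num_text_tokens: int


class ToyVLM:
    """Hypothetical stand-in for a pretrained VLM. Weights are random, not learned."""

    def __init__(self, patch_dim: int, dim: int = 8, vocab_size: int = 64, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        self.vocab_size = vocab_size
        # Linear projection from patch pixels to the shared embedding width.
        self.patch_proj = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(dim)]
        # Lookup table of word embeddings with the same width.
        self.text_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

    def embed_patch(self, patch: Vector) -> Vector:
        return [sum(w * x for w, x in zip(row, patch)) for row in self.patch_proj]

    def embed_word(self, word: str) -> Vector:
        # Deterministic toy "tokenizer": bucket each word by the sum of its bytes.
        idx = sum(word.encode("utf-8")) % self.vocab_size
        return list(self.text_table[idx])


def encode(vlm: ToyVLM, images: ImageSequence, instruction: LanguageInstruction) -> VLEmbeddings:
    """(ImageSequence, LanguageInstruction) -> VLEmbeddings"""
    image_tokens = [vlm.embed_patch(p) for frame in images.frames for p in frame]
    text_tokens = [vlm.embed_word(w) for w in instruction.text.lower().split()]
    return VLEmbeddings(image_tokens + text_tokens, len(image_tokens), len(text_tokens))


# Usage: 3 frames of 2 patches each, plus a 5-word instruction.
images = ImageSequence(frames=[[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]] for _ in range(3)])
instruction = LanguageInstruction("pick up the red block")
emb = encode(ToyVLM(patch_dim=4), images, instruction)
print(len(emb.tokens), len(emb.tokens[0]))  # 11 8
```

In a real system the two embedding steps would be replaced by the pretrained model's own vision encoder and tokenizer, and the token counts kept on `VLEmbeddings` would let the action predictor mask or segment the two modalities if it needs to.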
Problem it solves
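Visual observations and natural-language task descriptions arrive in different modalities and vary widely in structure. Both must be mapped into a single, coherent feature space before an action predictor can use them to guide precise physical manipulation.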
Consumes
Emits