Video -> Sequence<VisionToken>
Sample video frames at uniform intervals and encode them into a single contiguous sequence of vision tokens for a vision-language model.
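A minimal sketch of the sampling step, assuming the frame count is known up front. The function name `uniform_frame_indices` and the choice of taking the midpoint of each equal-length segment are illustrative assumptions, not something the pattern prescribes; the encoder that turns the selected frames into tokens is left out.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across the video.

    Hypothetical helper: splits the video into num_samples equal segments
    and takes the frame at the midpoint of each segment.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # A video shorter than the requested sample count keeps every frame.
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 4))
```

Midpoint sampling avoids always picking frame 0, which is often a title card or a black frame. Sampling segment starts or endpoints is an equally valid reading of "uniform".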
Problem it solves
Encoding every frame of a full-length video produces more tokens than a Transformer's context window can hold.
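The scale of the mismatch can be shown with a short calculation. Every number here is an assumption chosen for illustration (256 tokens per frame, a 32,768-token context window, 2,048 tokens reserved for text), and both helper names are hypothetical.

```python
def video_token_count(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Tokens needed if every frame of the video were encoded."""
    return int(duration_s * fps) * tokens_per_frame


def frame_budget(context_window: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """Frames that fit in the context window after reserving room for text."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = context_window - reserved_tokens
    return max(available // tokens_per_frame, 0)


if __name__ == "__main__":
    # Assumed values: a 10-minute clip at 30 fps, 256 tokens per frame.
    print(video_token_count(600, 30, 256))
    # Assumed values: 32,768-token window with 2,048 tokens kept for the prompt.
    print(frame_budget(32768, 256, 2048))
```

Under these assumed figures, the full clip needs several million tokens while the window holds on the order of a hundred frames, which is the gap that uniform sampling closes.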
Consumes
Video
Emits
Sequence<VisionToken>