(ImageFeatures, TextFeatures) -> AlignedMultimodalFeatures
Interleave visual token representations into a text transformer trunk via gated cross-attention layers.
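A minimal sketch of one gated cross-attention step, to make the mechanism concrete. Text tokens act as queries and image tokens as keys and values, and the attended visual signal is added back to the text stream through a scalar gate. Several details here are assumptions rather than anything the description specifies: a single head, identity projections in place of learned query/key/value weights, a tanh gate, and the function name `gated_cross_attention`. It is written in dependency-free Python for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(text_tokens, image_tokens, gate):
    """Add gated visual context to each text token.

    text_tokens:  list of d-dimensional vectors (used as queries)
    image_tokens: list of d-dimensional vectors (used as keys and values)
    gate:         scalar; tanh(gate) scales the visual update (assumed form)

    Assumption: identity projections stand in for learned W_q, W_k, W_v.
    """
    d = len(text_tokens[0])
    scale = 1.0 / math.sqrt(d)
    g = math.tanh(gate)
    fused = []
    for query in text_tokens:
        # Attention weights of this text token over all image tokens.
        weights = softmax([dot(query, key) * scale for key in image_tokens])
        # Weighted sum of image tokens (values).
        attended = [
            sum(w * value[i] for w, value in zip(weights, image_tokens))
            for i in range(d)
        ]
        # Gated residual: with gate == 0 the text token passes through unchanged.
        fused.append([x + g * a for x, a in zip(query, attended)])
    return fused


text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5], [1.0, -1.0]]
closed = gated_cross_attention(text, image, gate=0.0)
opened = gated_cross_attention(text, image, gate=1.0)
```

With the gate at zero the layer is an identity on the text stream, so a pretrained text trunk behaves as before; moving the gate away from zero lets visual information in gradually. Starting the gate at zero is one common choice and is assumed here, not stated by the pattern.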
Problem it solves
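Standard early fusion concatenates vision and language tokens into one sequence. This scales poorly, because self-attention cost grows quadratically with the combined sequence length, and it offers little control over alignment during supervised fine-tuning (SFT).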
Consumes
ImageFeatures, TextFeatures
Emits
AlignedMultimodalFeatures
The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.