gated cross-attention multimodal fusion

transform

(ImageFeatures, TextFeatures) -> AlignedMultimodalFeatures

Interleave visual token representations into a text transformer trunk via gated cross-attention layers.
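
As a rough illustration of the mechanism described above, here is a minimal, dependency-free sketch of one gated cross-attention layer. It assumes a single attention head and a scalar tanh gate initialized at zero (a common choice in Flamingo-style designs, though the source does not specify the gate form). The class name GatedCrossAttention, the helper functions, and all dimensions are hypothetical and chosen only for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialized matrix (illustrative initialization only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class GatedCrossAttention:
    """Single-head cross-attention with a scalar tanh gate (hypothetical layer)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.w_q = random_matrix(dim, dim, rng)  # projects text tokens to queries
        self.w_k = random_matrix(dim, dim, rng)  # projects image tokens to keys
        self.w_v = random_matrix(dim, dim, rng)  # projects image tokens to values
        self.alpha = 0.0  # assumed gate parameter; tanh(0) = 0 keeps the layer closed

    def __call__(self, text, image):
        q = matmul(text, self.w_q)             # (T, d)
        k = matmul(image, self.w_k)            # (I, d)
        v = matmul(image, self.w_v)            # (I, d)
        k_t = [list(col) for col in zip(*k)]   # (d, I)
        scale = 1.0 / math.sqrt(self.dim)
        scores = matmul(q, k_t)                # (T, I): each text token scores each image token
        weights = [softmax([s * scale for s in row]) for row in scores]
        attended = matmul(weights, v)          # (T, d): visual context per text token
        gate = math.tanh(self.alpha)
        # Gated residual: the text stream passes through unchanged when the gate is zero.
        return [
            [t + gate * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text, attended)
        ]


rng = random.Random(1)
text = random_matrix(3, 4, rng, scale=1.0)    # 3 text tokens, width 4
image = random_matrix(5, 4, rng, scale=1.0)   # 5 image tokens, width 4

layer = GatedCrossAttention(dim=4)
closed = layer(text, image)   # gate closed: output equals the text input
layer.alpha = 1.0
opened = layer(text, image)   # gate open: visual context is mixed into the text stream
```

With the gate at zero the layer is an identity on the text stream, so a pretrained text trunk behaves as before until training moves alpha away from zero. A full implementation would typically add multiple heads, an output projection, and a feed-forward block.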

Problem it solves

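Standard early fusion concatenates vision and language tokens into one sequence, so self-attention cost grows quadratically with the combined length, and it offers no explicit control over vision-language alignment during supervised fine-tuning (SFT).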
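
To make the scaling concern concrete, the sketch below counts attention score entries per layer under two simplified setups. It assumes early fusion runs self-attention over the concatenated sequence, while the gated variant runs text self-attention plus one text-to-image cross-attention. Projection and feed-forward costs are ignored, and the function names and token counts are hypothetical.

```python
def early_fusion_scores(text_len: int, image_len: int) -> int:
    """Score entries for self-attention over the concatenated sequence."""
    total = text_len + image_len
    return total * total


def gated_cross_attention_scores(text_len: int, image_len: int) -> int:
    """Score entries for text self-attention plus text-to-image cross-attention."""
    return text_len * text_len + text_len * image_len


# Illustrative sizes only: 512 text tokens and 1024 image tokens.
early = early_fusion_scores(512, 1024)            # 2359296
gated = gated_cross_attention_scores(512, 1024)   # 786432
ratio = early / gated                             # 3.0
```

Under these assumptions the gap widens as the number of image tokens grows, since the early-fusion count includes an image-to-image term that the cross-attention count does not.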

Consumes

ImageFeatures, TextFeatures

Emits

AlignedMultimodalFeatures

Distilled from 1 source
