neighborhood-attention video-dit

AI / ML · transform

Tensor<HiddenStates> -> Tensor<HiddenStates>

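Replace dense self-attention in video diffusion transformers with neighborhood attention, in which each token attends only to a fixed-size local window of spatiotemporal neighbors. Cost then scales sub-quadratically (linearly, for a fixed window size) with the number of spatiotemporal tokens.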
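A minimal pure-Python sketch of the idea, for illustration only. It assumes tokens laid out on a (T, H, W) grid in row-major order and a window that is shifted at the borders so that every query still sees a full window. The names `window_start` and `neighborhood_attention` are hypothetical and are not taken from cosmos-predict2, which would use fused GPU kernels rather than Python loops.

```python
import math
from itertools import product


def window_start(i, length, kernel):
    """Start of a kernel-sized window centered on i, shifted to stay in bounds."""
    return min(max(i - kernel // 2, 0), length - kernel)


def neighborhood_attention(q, k, v, grid, kernel):
    """Single-head neighborhood attention over a (T, H, W) token grid.

    q, k, v: lists of per-token vectors, flattened in (T, H, W) row-major order.
    grid:    (T, H, W) token counts per axis.
    kernel:  (kt, kh, kw) window sizes; each must not exceed the matching axis.
    """
    T, H, W = grid
    kt, kh, kw = kernel
    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for t, h, w in product(range(T), range(H), range(W)):
        qi = q[(t * H + h) * W + w]
        t0 = window_start(t, T, kt)
        h0 = window_start(h, H, kh)
        w0 = window_start(w, W, kw)
        # Flat indices of the kt * kh * kw neighbors this query may attend to.
        idx = [(tt * H + hh) * W + ww
               for tt in range(t0, t0 + kt)
               for hh in range(h0, h0 + kh)
               for ww in range(w0, w0 + kw)]
        # Scaled dot-product scores, then a numerically stable softmax.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in idx]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[j][d] for wt, j in zip(weights, idx)) / total
                    for d in range(dim)])
    return out
```

Setting `kernel` equal to `grid` recovers dense attention, so the window size is the knob that trades receptive field against cost.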

Problem it solves

Standard self-attention scales quadratically with the number of spatiotemporal tokens, so compute and memory become the bottleneck when generating long or high-resolution video.
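To make the scaling concrete, the sketch below counts query-key pairs for dense versus windowed attention. The grid and window sizes are hypothetical, chosen only for illustration and not taken from any cosmos-predict2 configuration. The helper name `attention_pairs` is likewise an assumption.

```python
def attention_pairs(grid, kernel=None):
    """Number of query-key pairs scored by one attention head.

    grid:   (T, H, W) token counts per axis.
    kernel: (kt, kh, kw) neighborhood window, or None for dense attention.
    """
    T, H, W = grid
    n = T * H * W
    if kernel is None:
        return n * n  # every token attends to every token
    kt, kh, kw = kernel
    return n * kt * kh * kw  # every token attends to one fixed-size window


# Hypothetical latent grid and window, for illustration only.
grid = (16, 44, 80)
dense = attention_pairs(grid)
local = attention_pairs(grid, kernel=(5, 9, 9))
print(f"dense: {dense:,} pairs, neighborhood: {local:,} pairs, "
      f"ratio: {dense / local:.1f}x")
```

Doubling every axis of the grid multiplies the dense count by 64 but the windowed count by only 8, since the window size stays fixed. The count reflects compute; actual memory use also depends on whether the attention kernel materializes the score matrix.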

Consumes

Tensor<HiddenStates>

Emits

Tensor<HiddenStates>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

nvidia-cosmos/cosmos-predict2 (GitHub)