interleaved-mamba-attention-block

transform

Sequence<Representation> -> Sequence<Representation>

Alternately process sequences using Mamba state-space layers and self-attention layers to achieve linear scaling with global context retrieval.

Problem it solves

Pure attention scales quadratically, while pure state-space models struggle with high-recall long-context retrieval.

Consumes

Sequence<Representation>

Emits

Sequence<Representation>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.