sequence-parallel-all-to-all-attention

transform

SequenceTensor -> PartitionedAttentionTensor

Partition sequence tensors along the time/sequence dimension across multiple GPUs, using an All-to-All collective to gather query, key, and value vectors for attention calculations.

Problem it solves

Extremely long context sequences exceed the memory capacity of a single GPU's attention head allocation.

Consumes

SequenceTensor

Emits

PartitionedAttentionTensor

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

deepspeedai/DeepSpeedgithub