hybrid-sequence-partitioning

AI / MLtransform

SequenceTensor -> PartitionedSequenceTensors

Shard input token sequences across arbitrary GPU cluster sizes by combining Ulysses head-wise partitioning with Ring-Attention token-wise ring communication.

Problem it solves

Long-sequence training is memory-bound and constrained by attention head counts when using Ulysses alone.

Consumes

SequenceTensorParallelismConfiguration

Emits

PartitionedSequenceTensors

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

modelscope/ms-swiftgithub