zero-overhead-gather-scatter-dispatching

transform

Tokens, GateIndices -> DispatchedTensors

Combine token sorting, gathering, and scattering into a single fused CUDA kernel to bypass host-device synchronizations during MoE dispatch.

Problem it solves

Separate gather/scatter steps introduce high scheduling overhead and communication bottlenecks between routing layers and expert layers.

Consumes

TokensGateIndices

Emits

DispatchedTensors

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.