Tokens, GateIndices -> DispatchedTensors
Fuse token sorting, gathering, and scattering into a single CUDA kernel so that MoE dispatch avoids host-device synchronization between those steps.
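As a rough illustration of what such a fused dispatch computes, here is a plain-Python CPU reference. It is a sketch under stated assumptions, not the kernel itself: it assumes top-1 routing (one expert index per token), and the names `dispatch_reference`, `offsets`, and `source_row` are hypothetical. On a GPU the same three steps (histogram, prefix sum, placement) would presumably run on-device inside one kernel, for example with atomic counters, which this reference does not model.

```python
from typing import List, Sequence, Tuple


def dispatch_reference(
    tokens: Sequence[Sequence[float]],
    gate_indices: Sequence[int],
    num_experts: int,
) -> Tuple[List[List[float]], List[int], List[int]]:
    """CPU reference for top-1 MoE dispatch (hypothetical helper, not a library API)."""
    if len(tokens) != len(gate_indices):
        raise ValueError("tokens and gate_indices must have the same length")

    # Histogram: how many tokens each expert receives.
    counts = [0] * num_experts
    for expert in gate_indices:
        counts[expert] += 1

    # Exclusive prefix sum: offsets[e] is the first output row owned by expert e.
    offsets = [0] * (num_experts + 1)
    for e in range(num_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    # Placement pass: compute each token's destination row and copy it there.
    # Sorting, gathering, and scattering collapse into this one loop.
    cursor = offsets[:num_experts]  # next free row per expert (slice makes a copy)
    dispatched: List[List[float]] = [[] for _ in tokens]
    source_row = [0] * len(tokens)  # source_row[r] = original index of the token in row r
    for t, expert in enumerate(gate_indices):
        row = cursor[expert]
        cursor[expert] += 1
        dispatched[row] = list(tokens[t])
        source_row[row] = t
    return dispatched, offsets, source_row


# Four one-dimensional tokens routed to two experts.
dispatched, offsets, source_row = dispatch_reference(
    [[0.0], [1.0], [2.0], [3.0]], [1, 0, 1, 0], num_experts=2
)
print(dispatched)  # [[1.0], [3.0], [0.0], [2.0]]
print(offsets)     # [0, 2, 4]
print(source_row)  # [1, 3, 0, 2]
```

Rows `offsets[e]` to `offsets[e + 1]` of the output belong to expert `e`, and `source_row` records the permutation, so a later combine step could scatter expert outputs back to their original token order.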
Problem it solves
Running gather and scatter as separate steps adds scheduling overhead and creates a communication bottleneck between the routing layers and the expert layers.
Consumes
Tokens, GateIndices
Emits
DispatchedTensors