low-precision-sparse-gemm

transform

QuantizedWeights, Activations -> Activations

Run highly parallel matrix multiplications directly on MoE expert weights stored in ultra-low-precision formats (NVFP4, MXFP4, blockwise FP8), reducing the amount of weight data that must be read from GPU memory.
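As a rough illustration of the idea, the sketch below quantizes weights in an MXFP4-style layout (4-bit E2M1 magnitudes, one power-of-two scale per 32-element block), multiplies against activations while dequantizing block by block, and runs only the experts a token is routed to. It is plain Python for readability, not a GPU kernel, and the names quantize_row, quantized_matvec and moe_forward are hypothetical; none of this is taken from Tutel's API.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # elements sharing one scale (MXFP4-style block size)


def quantize_row(row, block=BLOCK):
    """Return (codes, scales): signed grid indices and one power-of-two scale per block."""
    codes, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Smallest power-of-two scale that maps amax onto the grid's range [0, 6].
        scale = 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
        scales.append(scale)
        for v in chunk:
            mag = abs(v) / scale
            # Simplification: round to the nearest grid value by brute-force search.
            idx = min(range(len(FP4_GRID)), key=lambda i: abs(FP4_GRID[i] - mag))
            codes.append(-idx if v < 0 else idx)
    return codes, scales


def quantized_matvec(codes_rows, scales_rows, x, block=BLOCK):
    """Compute y = W @ x, decoding each weight block on the fly."""
    y = []
    for codes, scales in zip(codes_rows, scales_rows):
        acc = 0.0
        for b, scale in enumerate(scales):
            partial = 0.0
            for j in range(b * block, min((b + 1) * block, len(codes))):
                c = codes[j]
                w = FP4_GRID[c] if c >= 0 else -FP4_GRID[-c]
                partial += w * x[j]
            acc += scale * partial  # the shared scale is applied once per block
        y.append(acc)
    return y


def moe_forward(experts, x, expert_ids, gates, block=BLOCK):
    """Run only the routed experts and mix their outputs by gate weight."""
    out = None
    for eid, g in zip(expert_ids, gates):
        codes_rows, scales_rows = experts[eid]
        y = quantized_matvec(codes_rows, scales_rows, x, block)
        out = [g * v for v in y] if out is None else [o + g * v for o, v in zip(out, y)]
    return out
```

A production kernel would keep the packed 4-bit codes and scales in GPU memory and decode them inside the GEMM tiles, so full-precision weights are never materialized; the loop structure above only mirrors that data flow.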

Problem it solves

Large MoE model inference is memory-bound at low batch sizes: each step is dominated by loading large sets of expert parameters from GPU memory rather than by arithmetic.
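A back-of-the-envelope calculation makes the bottleneck concrete. The sketch below assumes a hypothetical 4096 x 4096 expert matrix, counts 2 FLOPs per weight per token, and ignores activation traffic; the helper names are illustrative only. It compares BF16 storage against an MXFP4-style layout (4-bit elements plus one 8-bit scale per 32 elements).

```python
def bits_per_weight(element_bits, scale_bits=0, block_size=1):
    """Average storage per weight, including the amortized per-block scale."""
    return element_bits + scale_bits / block_size


def expert_bytes(rows, cols, bpw):
    """Bytes that must be read to stream one expert matrix once."""
    return rows * cols * bpw / 8


def flops_per_weight_byte(batch, bpw):
    """Arithmetic intensity: 2 FLOPs per weight per token, over bytes per weight."""
    return 2 * batch / (bpw / 8)


bf16 = bits_per_weight(16)
mxfp4 = bits_per_weight(4, scale_bits=8, block_size=32)  # 4.25 bits per weight

for name, bpw in [("bf16", bf16), ("mxfp4", mxfp4)]:
    size = expert_bytes(4096, 4096, bpw)
    intensity = flops_per_weight_byte(1, bpw)
    print(f"{name}: {size / 2**20:.1f} MiB per expert, {intensity:.2f} FLOPs/byte at batch 1")
```

Under these assumptions a single token performs only about one FLOP per byte of BF16 weights read, so the GPU waits on memory; storing the same weights at roughly 4 bits cuts the bytes streamed per expert by nearly 4x.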

Consumes

QuantizedWeights, Activations

Emits

Activations

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

microsoft/Tutel (GitHub)