QuantizedWeights, Activations -> Activations
Run the highly parallel expert matrix multiplications of an MoE layer directly on ultra-low-precision weights (NVFP4, MXFP4, blockwise FP8), so that far fewer weight bytes have to cross the GPU memory bus per token.
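A minimal sketch of the idea, under stated assumptions: weights are stored as small codes plus one scale per block, and the matmul accumulates over the codes and applies each block scale once, without first materializing a full-precision weight matrix. Signed integer codes in [-7, 7] stand in for the FP4 value grid here, the block size of 4 is chosen only for readability, and the names `quantize_blockwise`, `quantized_dot`, and `expert_matmul` are hypothetical. Real kernels fuse these steps on GPU tensor cores; this pure-Python version only illustrates the data flow.

```python
def quantize_blockwise(values, block=4, qmax=7):
    """Split values into blocks; store integer codes plus one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        # Clamp to the representable code range after rounding.
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in chunk)
    return codes, scales


def quantized_dot(x, codes, scales, block=4):
    """Dot product on codes: accumulate per block, then apply the block scale once."""
    total = 0.0
    for b, scale in enumerate(scales):
        start = b * block
        partial = sum(xi * c for xi, c in zip(x[start:start + block],
                                              codes[start:start + block]))
        total += scale * partial
    return total


def expert_matmul(activations, q_columns, block=4):
    """activations: tokens x K. q_columns: one (codes, scales) pair per output feature."""
    return [[quantized_dot(x, codes, scales, block) for codes, scales in q_columns]
            for x in activations]


# Hypothetical expert weights, one list of K = 8 values per output feature.
weight_columns = [
    [7.0, -3.0, 2.0, 0.0, 3.5, -1.5, 0.5, 1.0],
    [0.3, -0.9, 0.1, 0.6, -2.0, 1.1, 0.4, -0.7],
]
q_columns = [quantize_blockwise(col) for col in weight_columns]
tokens = [[1.0, 2.0, 3.0, 4.0, 1.0, 0.0, 2.0, 1.0]]
outputs = expert_matmul(tokens, q_columns)
```

The first column was chosen so that every value is exactly representable, so its output matches the full-precision dot product; the second column shows ordinary rounding error, bounded per element by half a block scale.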
Problem it solves
At low batch sizes, inference on large MoE models is heavily memory-bound: the time spent loading the massive expert parameter sets from GPU memory dominates the time spent computing on them.
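A rough back-of-the-envelope sketch of why precision matters here. It counts only weight bytes (activation traffic and any per-tensor scale are ignored) and compares FLOPs per weight byte loaded for one expert projection. The 4096 x 4096 shape is hypothetical, and the block layouts are assumptions based on the commonly documented conventions: 32-element blocks with an 8-bit scale for MXFP4, 16-element blocks with an 8-bit scale for NVFP4, and 128 x 128 blocks with a 32-bit scale for blockwise FP8.

```python
def weight_bytes(n_elems, elem_bits, block_size, scale_bits):
    """Bytes to store n_elems weights with one scale per block of block_size elements."""
    n_blocks = -(-n_elems // block_size)  # ceiling division
    return (n_elems * elem_bits + n_blocks * scale_bits) / 8


def flops_per_weight_byte(tokens, k, n, w_bytes):
    """Arithmetic intensity of a (tokens x k) @ (k x n) matmul, counting weight bytes only."""
    return 2 * tokens * k * n / w_bytes


K, N = 4096, 4096  # hypothetical expert projection shape

# name: (bits per element, elements per scale block, bits per scale)
FORMATS = {
    "bf16": (16, K * N, 0),  # no block scales
    "fp8_block_128x128": (8, 128 * 128, 32),
    "mxfp4": (4, 32, 8),
    "nvfp4": (4, 16, 8),
}

for name, (elem_bits, block_size, scale_bits) in FORMATS.items():
    size = weight_bytes(K * N, elem_bits, block_size, scale_bits)
    intensity = flops_per_weight_byte(1, K, N, size)
    print(f"{name}: {size / 2**20:.2f} MiB, {intensity:.2f} FLOP per weight byte at 1 token")
```

With a single token routed to the expert, the intensity is about 16 divided by the effective bits per weight, so it stays far below what a GPU needs to become compute-bound in every format. Shrinking the weights therefore translates almost directly into less time spent waiting on memory.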
Consumes
QuantizedWeights, Activations
Emits
Activations
The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.