Tuple<Tensor<FP8>, Tensor<FP8>> -> Tensor<FP16>
Accumulate matrix products from low-precision inputs using a split FP16 and FP32 register strategy to avoid overflow while preserving GPU performance.
Problem it solves
Pure low-precision (FP8/INT8) accumulation causes severe underflow and overflow in the attention softmax, while full FP32 accumulation bottlenecks the hardware execution units.
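The precision loss described here can be reproduced without a GPU. The sketch below is only an illustration, not the mechanism's actual kernel: it assumes that "split" means short partial sums are kept in FP16 and periodically promoted into an FP32 accumulator, which is one plausible reading of the description. The helper names (to_fp16, to_fp32, dot_fp16, dot_split) and the chunk size of 64 are hypothetical, and plain Python floats stand in for FP8 inputs.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16(a, b):
    """Dot product with every product and every partial sum rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_split(a, b, chunk=64):
    """Assumed split strategy: FP16 within a chunk, FP32 across chunks.

    The chunk size is a hypothetical tuning knob, not a value from the source.
    """
    total = 0.0  # stands in for the FP32 accumulator
    for start in range(0, len(a), chunk):
        partial = 0.0  # stands in for the short-lived FP16 accumulator
        for x, y in zip(a[start:start + chunk], b[start:start + chunk]):
            partial = to_fp16(partial + to_fp16(x * y))
        total = to_fp32(total + partial)  # promote the partial sum to FP32
    return to_fp16(total)  # emit FP16, matching the signature's output type


if __name__ == "__main__":
    ones = [1.0] * 4096
    print(dot_fp16(ones, ones))   # 2048.0: the FP16 sum stalls once spacing exceeds 1
    print(dot_split(ones, ones))  # 4096.0: exact, and still representable in FP16
```

With 4096 unit products, the all-FP16 accumulator stops growing at 2048 because adjacent FP16 values are 2 apart from that point on, so adding 1 rounds back to the same value. The split version keeps each FP16 partial sum small enough to stay exact and leaves the long-running total in FP32.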
Consumes
Emits
The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.