two-level mixed-precision accumulation

AI / MLtransform

Tuple<Tensor<FP8>, Tensor<FP8>> -> Tensor<FP16>

Accumulate matrix products from low-precision inputs using a split FP16 and FP32 register strategy to avoid overflow while preserving GPU performance.

Problem it solves

Pure low-precision (FP8/INT8) accumulation leads to severe underflow/overflow in attention softmax, while full FP32 accumulation bottlenecks hardware execution units.

Consumes

Tensor<FP8>

Emits

Tensor<FP16>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

thu-ml/SageAttentiongithub