outlier-smoothing quantization

AI / MLtransform

Tuple<Tensor<FP16>, Tensor<FP16>> -> Tuple<Tensor<INT8>, Tensor<INT8>>

Apply scaling factors to smooth out activation outliers across query and key tensors prior to low-precision integer quantization.

Problem it solves

Outliers in attention matrices cause massive quantization errors when converting to low-bit integers (e.g., INT8/INT4).

Consumes

Tensor<FP16>

Emits

Tensor<INT8>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.