Collected sources and patterns will appear here. Add from search, explore, or the patterns library.
Tuple<Tensor<FP16>, Tensor<FP16>> -> Tuple<Tensor<INT8>, Tensor<INT8>>
Apply scaling factors to smooth out activation outliers across query and key tensors prior to low-precision integer quantization.
Problem it solves
Outliers in attention matrices cause massive quantization errors when converting to low-bit integers (e.g., INT8/INT4).
Consumes
Emits
The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.