per-thread quantization

AI / MLtransform

Tensor<FP16> -> Tuple<Tensor<INT4>, Tensor<FP16>>

Quantize activation tensors at a per-thread granularity to maximize dynamic range representation without introducing global memory access overhead.

Problem it solves

Coarse-grained quantization (e.g., per-tensor or per-channel) degrades precision, while extremely fine-grained quantization typically incurs high scale-factor memory overhead.

Consumes

Tensor<FP16>

Emits

Tensor<INT4>Tensor<FP16>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

thu-ml/SageAttentiongithub