nvfp4-quantized-training

transform

Tensor<BF16> -> Tensor<FP4>

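Cast weights and activations from BF16 to a 4-bit floating-point format during both the forward and backward passes of training, cutting the memory footprint and bandwidth each tensor needs and raising arithmetic throughput on hardware with FP4 support.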
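A minimal sketch of what such a cast involves, simulated in NumPy rather than on FP4 hardware. It assumes the 4-bit format is E2M1 (1 sign, 2 exponent, 1 mantissa bit) and that values are scaled per block of 16 elements before rounding. The function name fake_quantize_fp4, the float32 scales, and the nearest-value rounding are illustrative choices made here, not details taken from the source.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (assumed format for this sketch).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def fake_quantize_fp4(x, block_size=16):
    """Simulate block-scaled FP4 quantization (hypothetical helper).

    Returns (dequantized, grid_values, scales), where grid_values are the
    signed E2M1 values each element was rounded to, before rescaling.
    """
    x = np.asarray(x, dtype=np.float32)
    if x.size % block_size != 0:
        raise ValueError("x.size must be a multiple of block_size")
    blocks = x.reshape(-1, block_size)

    # One scale per block: map the block's largest magnitude onto 6.0,
    # the largest E2M1 value. All-zero blocks get a scale of 1.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(amax > 0, amax / FP4_E2M1_GRID[-1], 1.0).astype(np.float32)

    # Round each scaled magnitude to the nearest grid point.
    # Ties go to the smaller magnitude here; real hardware rounding may differ.
    scaled = blocks / scales
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    grid_values = np.sign(scaled) * FP4_E2M1_GRID[idx]

    dequantized = (grid_values * scales).reshape(x.shape)
    return dequantized, grid_values.reshape(x.shape), scales


x = np.linspace(-3.0, 3.0, 32)
deq, grid_values, scales = fake_quantize_fp4(x)
print(np.abs(deq - x).max())  # worst-case rounding error for this input
```

Because the widest gap between E2M1 values is 2 (between 4 and 6), the rounding error of any element is at most one block scale. Keeping blocks small means a single outlier inflates the scale, and therefore the error, for only a few neighbors.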

Problem it solves

Higher-precision formats such as BF16 take more memory and bandwidth per value, which limits hardware utilization and achievable batch size during large pre-training runs.
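As a rough illustration of the storage gap, the arithmetic below compares bits per value for BF16 against a block-scaled 4-bit layout. The block size of 16 and the 8-bit per-block scale are assumptions made for this sketch. It also ignores any per-tensor scale and any tensors a real training setup keeps in higher precision, so treat the ratio as an upper bound on savings for the quantized tensors only.

```python
def bits_per_value_block_fp4(value_bits=4, scale_bits=8, block_size=16):
    """Average storage per element when each block shares one scale (assumed layout)."""
    return value_bits + scale_bits / block_size


bf16_bits = 16
fp4_bits = bits_per_value_block_fp4()
ratio = bf16_bits / fp4_bits

print(fp4_bits)         # 4.5
print(round(ratio, 2))  # 3.56
```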

Consumes

Tensor<BF16>

Emits

Tensor<FP4>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.