Mixed Precision Training

1 source - 4 claims

OFU is precision-agnostic because TPA counts Tensor Core instruction cycles regardless of numeric format. As a result, in a large mixed-precision GB200 pretraining run, OFU tracked precision-dependent utilisation changes without knowing which numeric format was in use. In GB200 precision scaling tests, the OFU-derived speedups over TF32 for BF16, FP8, and NVFP4 were all lower than the theoretical speedups. At the lower precisions, the deviation from the theoretical speedup is attributed to scaling-factor overhead in block-scaled formats.
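The comparison between measured and theoretical speedups can be illustrated with a minimal Python sketch. It is not the source's method: the function name `speedup_report`, the nominal 2x/4x/8x ratios over TF32 (which assume peak Tensor Core throughput doubles each time operand width halves), and the example throughput numbers are all assumptions made for illustration. The inputs stand in for whatever per-format throughput a utilisation metric such as OFU yields.

```python
# Nominal peak-throughput ratios relative to TF32.
# Assumption: each halving of operand width doubles peak Tensor Core throughput.
THEORETICAL_SPEEDUP = {"TF32": 1.0, "BF16": 2.0, "FP8": 4.0, "NVFP4": 8.0}


def speedup_report(throughput):
    """Compare measured speedups over TF32 with the nominal ratios.

    `throughput` maps a format name to achieved throughput in any
    consistent unit; it must contain a "TF32" baseline entry.
    """
    baseline = throughput["TF32"]
    report = {}
    for fmt, value in throughput.items():
        observed = value / baseline
        theoretical = THEORETICAL_SPEEDUP[fmt]
        report[fmt] = {
            "observed": observed,
            "theoretical": theoretical,
            # Fraction of the nominal speedup that was actually realised.
            "attained": observed / theoretical,
        }
    return report


# Hypothetical placeholder values, NOT measurements from the source.
example = {"TF32": 100.0, "BF16": 180.0, "FP8": 320.0, "NVFP4": 560.0}

for fmt, row in speedup_report(example).items():
    print(f"{fmt}: observed {row['observed']:.2f}x, "
          f"theoretical {row['theoretical']:.0f}x, "
          f"attained {row['attained']:.0%}")
```

An "attained" fraction below 1.0 corresponds to the shortfall described above. The sketch only quantifies the gap; it does not separate scaling-factor overhead from other causes.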