Tile Quantization

TF32 is an outlier: for it, cuBLAS selects XMMA and CUTLASS kernels whose tile-quantization overhead is higher at small sizes and converges more slowly as matrix size grows. For well-aligned matrices of size at least 4,096, the observed tile-quantization overhead peaked at about 9%, with means of 2–3%. Small matrices can incur overhead above 50% because of severe tile padding. Correcting for tile quantization requires an NCU profiling pass, which adds overhead of its own and therefore cannot run continuously.
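
To illustrate why small matrices are hit hardest by tile padding, here is a minimal sketch of a padding-only model. It assumes the output matrix is covered by fixed-size tiles and that every partial tile costs as much as a full one. The function name and the 128x128 default tile shape are hypothetical choices for illustration; they are not taken from cuBLAS, XMMA, or CUTLASS, and this model does not reproduce the measured figures above, which come from profiled kernels.

```python
import math


def tile_quantization_overhead(m, n, tile_m=128, tile_n=128):
    """Fraction of extra work caused by padding an m x n output to whole tiles.

    tile_m and tile_n are assumed tile dimensions (hypothetical defaults);
    real kernels choose their own tile shapes.
    """
    if m <= 0 or n <= 0:
        raise ValueError("matrix dimensions must be positive")
    # Number of tiles needed along each dimension, rounding up.
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    # Work actually performed covers the padded area, not just m x n.
    padded_elements = tiles_m * tile_m * tiles_n * tile_n
    return padded_elements / (m * n) - 1.0


if __name__ == "__main__":
    for size in (100, 1000, 4096, 4097):
        overhead = tile_quantization_overhead(size, size)
        print(f"{size} x {size}: {overhead:.1%} padding overhead")
```

Under these assumptions, a 100 x 100 output padded to a single 128 x 128 tile does roughly 64% extra work, while a 4,096 x 4,096 output divides evenly and incurs no padding in this model. Any overhead observed on well-aligned large matrices must therefore come from effects this sketch does not capture.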