Mixed-Precision Quantization

2 sources - 10 claims

For Qwen3-8B quantization, RCO optimized the layer-bitwidth assignment after the candidate bitwidths were pre-quantized with GPTQ. At 2.25 average bits, RCO achieved lower perplexity than EvoPress, IMPQ, and HIGGS on the reported datasets. At 2.5 bits, RCO matched EvoPress in quality but needed much less wall-clock time. Surrogate methods came closer to RCO at higher bitwidths, where compression is easier. RCO is especially relevant for high-compression quantization, where proxy objectives fail.

Fed-FSTQ assigns each adapter coordinate to one of four precision levels (0, 2, 4, or 16 bits) based on coordinate importance. Coordinate importance is computed as the diagonal Fisher estimate multiplied by the squared adapter update coordinate. Higher-percentile coordinates receive FP16, mid-percentile coordinates receive INT4, lower retained coordinates receive INT2, and the remaining coordinates are pruned (0 bits). Quantization is applied only to transmitted adapter updates. Payload accounting includes quantized values, indices or masks, precision tags, and active group scales.
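
The Fed-FSTQ precision assignment described above (importance as diagonal Fisher estimate times squared adapter update, then percentile tiers of FP16, INT4, INT2, and pruned) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `assign_precision` and the tier fractions `top_frac`, `mid_frac`, and `low_frac` are hypothetical, since the summary does not state the actual percentile cutoffs, and the sketch only chooses bit widths without performing the quantization itself.

```python
import numpy as np


def assign_precision(fisher_diag, delta, top_frac=0.05, mid_frac=0.15, low_frac=0.30):
    """Return a per-coordinate bit width: 16, 4, 2, or 0 (pruned).

    The three fractions are illustrative assumptions, not values
    taken from the source.
    """
    fisher_diag = np.asarray(fisher_diag, dtype=np.float64)
    delta = np.asarray(delta, dtype=np.float64)

    # Importance: diagonal Fisher estimate times squared update coordinate.
    importance = fisher_diag * delta ** 2

    n = importance.size
    # Rank coordinates from most to least important.
    order = np.argsort(-importance.ravel(), kind="stable")

    n_top = int(round(top_frac * n))
    n_mid = int(round(mid_frac * n))
    n_low = int(round(low_frac * n))

    bits = np.zeros(n, dtype=np.int64)  # default 0 bits: pruned
    bits[order[:n_top]] = 16  # highest tier stays FP16
    bits[order[n_top:n_top + n_mid]] = 4  # middle tier becomes INT4
    bits[order[n_top + n_mid:n_top + n_mid + n_low]] = 2  # lower retained tier becomes INT2
    return bits.reshape(importance.shape)


if __name__ == "__main__":
    fisher = np.ones(10)
    update = np.arange(10, 0, -1)  # 10, 9, ..., 1
    bits = assign_precision(fisher, update, top_frac=0.1, mid_frac=0.2, low_frac=0.3)
    print(bits.tolist())  # [16, 4, 4, 2, 2, 2, 0, 0, 0, 0]
    print(bits.mean())    # average value bits per coordinate: 3.0
```

The mean printed at the end counts value bits only. A full payload figure in the sense used above would also have to add the indices or masks, precision tags, and active group scales.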