In TPU v1, quantify the effect of FP32→int8 with per-channel vs per-tensor quantization and histogram vs MSE calibration on performance/watt, E2E latency, and accuracy, considering HBM, MXU tiling, and rescaling overhead.
Thursday, 04 December 2025
by JOSE ALFONSIN PENA
The effect of quantization approaches, specifically FP32 to int8 with per-channel versus per-tensor schemes and histogram versus mean squared error (MSE) calibration, on Google TPU v1 performance and accuracy is multifaceted. Quantization granularity, calibration technique, MXU tiling, memory bandwidth, and overheads such as per-channel rescaling all interact, and they must be analyzed together to understand their combined influence on performance per watt, end-to-end latency, and model accuracy.
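To make the granularity and calibration trade-off concrete, below is a minimal NumPy sketch (illustrative only, not TPU-specific; the toy weight tensor, grid-search range, and helper names are assumptions for this example). It compares per-tensor max-based scaling against per-channel scales chosen by an MSE grid search over clipping thresholds, and reports the reconstruction error each scheme leaves behind.

```python
# Minimal sketch (NumPy only, hypothetical example): per-tensor vs per-channel
# symmetric int8 quantization, with per-channel scales picked by an MSE grid
# search over clipping thresholds.
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, scale):
    """Symmetric int8 quantize/dequantize with a given scale (scalar or per-channel)."""
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantized reconstruction

def mse_scale(w, steps=80):
    """Grid-search the clipping threshold that minimizes reconstruction MSE."""
    amax = np.abs(w).max()
    best_scale, best_err = amax / 127.0, np.inf
    for frac in np.linspace(0.3, 1.0, steps):
        scale = frac * amax / 127.0
        err = np.mean((w - quantize(w, scale)) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

# Toy weight tensor: (out_channels, in_features). Channel 0 has a much wider
# dynamic range, which a single per-tensor scale must accommodate globally.
w = rng.normal(0, 0.02, size=(64, 512))
w[0] *= 20.0

# Per-tensor: one max-based scale for the whole tensor.
s_tensor = np.abs(w).max() / 127.0
err_tensor = np.mean((w - quantize(w, s_tensor)) ** 2)

# Per-channel: one MSE-calibrated scale per output channel (axis 0).
s_channel = np.array([mse_scale(row) for row in w])[:, None]
err_channel = np.mean((w - quantize(w, s_channel)) ** 2)

print(f"per-tensor  MSE: {err_tensor:.3e}")
print(f"per-channel MSE: {err_channel:.3e}")  # typically much lower on this tensor
```

On a tensor whose channels have very different dynamic ranges, the per-channel MSE-calibrated scales typically leave far lower reconstruction error, which is the accuracy side of the trade-off; the cost is storing and applying one rescale factor per output channel after accumulation, which is the rescaling overhead referred to in the question.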

