Google shipped TPU 8t and TPU 8i at Cloud Next on April 22 -- the first time Google has split a TPU generation into a dedicated training chip and a dedicated inference chip as separate products. The training chip tops out at 121 exaFLOPS per superpod with 2 petabytes of shared HBM. The inference chip carries 384 MB of on-chip SRAM (3x the previous generation) and a Boardfly interconnect topology that cuts maximum network diameter by 50%.
The design space for training and inference has diverged enough that a single chip is a compromise on both. Training needs maximum HBM capacity and interchip bandwidth for model-parallel runs at scale -- TPU 8t doubles interchip bandwidth and delivers near-linear scaling to one million chips. Inference needs SRAM headroom and low-latency interconnects for MoE routing under tight SLA constraints -- TPU 8i's Boardfly topology and a new Collectives Acceleration Engine cut on-chip latency by up to 5x. Those are not the same optimization target. Squeezing both into one die means losing on at least one axis.
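The divergence is easy to see in a back-of-envelope form. The sketch below contrasts the two binding constraints: training is gated by aggregate HBM capacity for weights, gradients, and optimizer state, while inference is gated by whether the hot working set stays in on-chip SRAM. Every number here is an illustrative assumption (the 16-bytes-per-parameter rule of thumb, the HBM-per-chip figure, the KV-cache size) -- none are published TPU 8t/8i specs except the 384 MB SRAM figure from the announcement.

```python
# Toy model of the two optimization targets. All numbers below are
# illustrative assumptions, not published TPU 8t/8i specifications
# (except the 384 MB SRAM figure).

def training_fit(params_b: float, hbm_per_chip_gb: float, chips: int) -> bool:
    """Does a model of `params_b` billion parameters fit for training?

    Rough rule of thumb for mixed-precision training: weights +
    gradients + optimizer state cost ~16 bytes per parameter.
    """
    needed_gb = params_b * 16  # 16 B/param * 1e9 params = GB
    return needed_gb <= hbm_per_chip_gb * chips

def inference_sram_resident(kv_cache_mb: float, sram_mb: float = 384) -> bool:
    """Does the hot KV-cache working set fit entirely in on-chip SRAM?"""
    return kv_cache_mb <= sram_mb

# A 70B-parameter model needs ~1,120 GB of training state: it fits on
# 16 chips with a hypothetical 96 GB of HBM each, but not on 8.
print(training_fit(70, 96, 16))  # True
print(training_fit(70, 96, 8))   # False
# A 256 MB KV cache can be served from SRAM with no HBM round-trip.
print(inference_sram_resident(256))  # True
```

Note that the two functions share no variables: capacity per pod on one side, residency per chip on the other. That is the sense in which a single die cannot sit at the optimum of both.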
The Boardfly topology is worth dwelling on. It is a routing architecture decision, not a frequency bump. Most inference bottlenecks in large MoE models are interconnect-bound, not compute-bound -- the model weights are partitioned, and every token requires a gather across partitions. Halving the network diameter halves the worst-case gather latency. That is the kind of structural improvement that compounds at scale in a way that a raw FLOP increase does not.
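The diameter-to-latency argument can be made concrete with a minimal model. This sketch assumes the worst-case all-gather is bounded by the farthest partition, so latency scales with hop count times per-hop latency; the 16-hop baseline diameter and 0.5 microsecond per-hop figure are hypothetical, chosen only to show the shape of the effect.

```python
# Sketch of why halving network diameter halves worst-case gather
# latency. The hop counts and per-hop latency are assumed values,
# not measured Boardfly figures.

def worst_case_gather_us(diameter_hops: int, per_hop_us: float = 0.5) -> float:
    """Worst-case all-gather latency: the farthest weight partition
    bounds the gather, so latency is diameter (hops) * per-hop cost."""
    return diameter_hops * per_hop_us

baseline = worst_case_gather_us(16)  # e.g. a 16-hop-diameter topology
halved = worst_case_gather_us(8)     # diameter cut by 50%
print(baseline, halved)  # 8.0 4.0

# The saving compounds per token: one gather per MoE layer means an
# 80-layer model recovers 80 * 4 us = 320 us of tail latency per token.
print(80 * (baseline - halved))  # 320.0
```

This is why the improvement is structural rather than a frequency bump: the 2x shows up on every gather of every layer of every token, independent of how fast the compute units run.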
The downstream consequence is that ML teams now have a forcing function to instrument and profile their workload mix before specifying infrastructure. AWS has Trainium and Inferentia. Groq targets inference exclusively. Google is now in the same camp at the hyperscaler level. The era of the "general-purpose AI chip" is over for anyone operating at meaningful scale. The teams that win will be the ones that characterize their training-to-inference ratio, design their silicon mix around it, and redeploy when that ratio shifts -- treating it as a first-class infrastructure decision, not a purchasing afterthought.
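Characterizing the training-to-inference ratio and sizing a fleet from it can be sketched in a few lines. The fleet size, demand shares, and efficiency factors below are hypothetical placeholders; the point is the shape of the calculation, not the numbers.

```python
# Back-of-envelope chip-mix sizing from a measured workload split.
# All inputs are hypothetical placeholders.

def chip_mix(total_chips: int, train_share: float,
             train_chip_eff: float = 1.0,
             infer_chip_eff: float = 1.0) -> tuple[int, int]:
    """Split a fleet between training and inference chips in proportion
    to each workload's share of demand, adjusted for how efficiently
    the specialized chip serves its target workload."""
    train_demand = train_share / train_chip_eff
    infer_demand = (1 - train_share) / infer_chip_eff
    train_frac = train_demand / (train_demand + infer_demand)
    n_train = round(total_chips * train_frac)
    return n_train, total_chips - n_train

# A 1,000-chip fleet that measures 30% of its demand in training:
print(chip_mix(1000, 0.30))  # (300, 700)
# The same fleet after a product launch shifts demand toward serving:
print(chip_mix(1000, 0.10))  # (100, 900)
```

The second call is the "redeploy when the ratio shifts" step: the input is re-measured, the mix is recomputed, and the delta becomes a procurement or reallocation decision rather than a guess.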