
Google Splits Its 8th-Gen TPU Into Two: One for Training, One for Inference

Google's 8th-gen TPU splits into two purpose-built chips: TPU 8t for training at 121 ExaFlops per superpod, and TPU 8i for low-latency inference with 3x the on-chip SRAM of its predecessor.

#ai-hardware #semiconductor

Google announced its eighth-generation TPUs at Cloud Next '26: TPU 8t for training and TPU 8i for inference, abandoning the single-chip-does-everything approach that defined earlier generations.

The bifurcation is the real signal here. Google is betting that training and inference workloads have diverged enough in their hardware requirements that one chip can no longer serve both without painful tradeoffs. TPU 8t goes all-in on scale -- a 64-board superpod reaches 9,600 chips and two petabytes of shared HBM, delivering 121 ExaFlops with double the inter-chip interconnect bandwidth of the previous generation. The Virgo network fabric ties it together. TPU 8i goes the other direction: 384 MB of on-chip SRAM (3x the previous generation's) so a model's active working set stays entirely on chip, cutting the latency that HBM round-trips would otherwise impose.
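To put the superpod figures in perspective, here is the per-chip arithmetic implied by Google's own numbers. The even split across chips is our assumption, not a published spec:

```python
# Derived per-chip figures from the announced TPU 8t superpod totals.
# Assumes compute and memory divide evenly across chips; Google has
# not published per-chip specs.
CHIPS = 9_600
BOARDS = 64
TOTAL_FLOPS = 121e18      # 121 ExaFlops (Google's marketing figure)
TOTAL_HBM = 2e15          # 2 PB of shared HBM

print(f"{CHIPS // BOARDS} chips per board")                 # 150
print(f"{TOTAL_FLOPS / CHIPS / 1e15:.1f} PFLOPS per chip")  # ~12.6
print(f"{TOTAL_HBM / CHIPS / 1e9:.0f} GB HBM per chip")     # ~208
```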

The detail worth noting: near-linear scaling to a million chips in a single logical cluster via JAX and Pathways. That is not a benchmark footnote -- it changes how you write training jobs. If it holds, it means researchers stop thinking about cluster boundaries as a fundamental constraint. Google has not published the microarchitectural details yet, so treat the ExaFlops number as a marketing claim until independent benchmarks appear.
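For a sense of what "cluster boundaries stop being a constraint" means in practice, here is a minimal JAX sharding sketch: the device mesh is the only notion of cluster the model code ever sees, so scaling up is (ideally) just handing the same program a bigger mesh. The mesh shape and axis names are illustrative, not anything Google has published:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# The mesh abstracts the cluster: this same program runs whether
# jax.devices() returns 8 chips or, per Google's claim, a million.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch across the "data" axis; replicate the weights.
x = jax.device_put(jnp.ones((1024, 512)),
                   NamedSharding(mesh, P("data", None)))
w = jnp.ones((512, 512))

@jax.jit
def layer(x, w):
    return jnp.tanh(x @ w)

y = layer(x, w)  # jit propagates shardings; no per-chip code needed
```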

The hardware split also reinforces what the inference-chip wave (Groq, Cerebras, various startups) has been arguing for years: memory bandwidth and latency, not raw FLOPS, are the binding constraint for serving large models. Google's own infrastructure team apparently agrees now.
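The bandwidth argument reduces to one line of arithmetic: autoregressive decoding streams every active weight past the compute units once per token, so throughput is capped by memory bandwidth over model size, not by FLOPS. The figures below are hypothetical examples, not TPU 8i specs:

```python
# Roofline-style upper bound on single-stream decode throughput:
# every active weight byte must be read once per generated token, so
#   tokens/sec <= memory_bandwidth / bytes_of_active_weights.
# Model size and bandwidth numbers are illustrative, not measured.
def decode_ceiling(active_bytes: float, bw_bytes_per_s: float) -> float:
    return bw_bytes_per_s / active_bytes

model = 8e9        # 8B-param model at int8: 8 GB of weights
hbm_bw = 3.0e12    # ~3 TB/s of HBM bandwidth, illustrative

print(f"HBM-bound ceiling: {decode_ceiling(model, hbm_bw):.0f} tok/s")
# -> ~375 tok/s no matter how many FLOPS the chip has, which is why
# 8i spends transistors on 384 MB of SRAM rather than more ALUs.
```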