Google Cloud announced the eighth generation of its Tensor Processing Units at Google Cloud Next on April 22, splitting the line into two chips: TPU 8t for model training and TPU 8i for inference. Google claims 3x faster AI model training, 80% better performance per dollar versus the prior generation, and the ability to interconnect over one million TPUs in a coordinated cluster.
The training/inference split is the technically notable decision. Previous TPU generations were general-purpose accelerators that balanced the tradeoffs between the two workloads. Training requires massive parallelism and high memory bandwidth; production inference requires low latency and high throughput at lower precision, often with smaller model slices and frequent context switches. Designing a chip for each means Google is no longer asking a single die to optimize for fundamentally different utilization patterns. The open question is whether customers can separate their workloads as cleanly -- mixed pipelines with fine-tuning loops feeding live inference do not split neatly.
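A minimal sketch of the divergence, in JAX: the training path pushes a large batch through forward and backward passes in bf16, while the inference path serves single requests from quantized weights. The shapes, dtypes, and toy quantization here are illustrative assumptions, not TPU 8t/8i specifications.

```python
import jax
import jax.numpy as jnp

params = jnp.ones((4096, 4096), dtype=jnp.bfloat16)

def loss(w, x):
    return jnp.mean((x @ w) ** 2)

# Training step: large batch, bf16 activations, forward + backward pass.
# Throughput is bound by memory bandwidth and cross-chip gradient traffic.
train_batch = jnp.ones((8192, 4096), dtype=jnp.bfloat16)
grads = jax.jit(jax.grad(loss))(params, train_batch)

# Inference step: batch of one, weights quantized to int8, forward only.
# Per-request latency matters more than aggregate FLOPs.
w_int8 = (params * 127).astype(jnp.int8)  # toy quantization for illustration
infer_batch = jnp.ones((1, 4096), dtype=jnp.bfloat16)
logits = jax.jit(lambda x: x @ w_int8.astype(jnp.bfloat16))(infer_batch)
```

Neither path is exotic on its own; the point is that a chip tuned for the first (wide matrix units, fat HBM) leaves silicon idle on the second, and vice versa.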
The million-chip cluster claim invites scrutiny. TSMC CoWoS packaging capacity and HBM3E supply are still the binding constraints on how many accelerators can physically ship. Running a million TPUs in one coordinated training job is also a systems engineering problem far beyond chip design -- interconnect topology, job scheduling at that scale, and failure handling in a cluster where the expected time between failures drops from days to hours are unsolved at production quality for most operators.
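The failure arithmetic is easy to sketch: with N independent chips, the expected time between failures somewhere in the fleet is roughly the per-chip MTBF divided by N. The per-chip figure below is an assumed value for illustration, not a published TPU spec.

```python
# Illustrative MTBF scaling for a large accelerator fleet.
CHIP_MTBF_HOURS = 500_000  # assumed per-accelerator MTBF (~57 years), not a spec

for n_chips in (1_000, 10_000, 100_000, 1_000_000):
    fleet_mtbf = CHIP_MTBF_HOURS / n_chips
    print(f"{n_chips:>9,} chips -> a failure roughly every {fleet_mtbf:6.1f} hours")
```

Under that assumption, a 10,000-chip job sees a failure every couple of days; a million-chip job sees one every half hour. Checkpointing and restart machinery that was an occasional cost becomes a continuous one.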
For teams designing AI infrastructure, the 8t/8i split is a procurement model shift: you are now buying specialized silicon per pipeline phase rather than a general-purpose accelerator. That requires understanding your AI workload at a fidelity that a single-chip purchase did not demand. If you can cleanly separate training and inference, the per-dollar efficiency gains are real. If you cannot, you are managing two SKUs instead of one.
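One way to reason about that tradeoff is a toy blended-efficiency model, assuming the claimed 80% per-dollar gain applies only to work that lands cleanly on the right SKU. Every number here is an assumption for illustration, not Google pricing.

```python
# Toy procurement model: how much of the claimed per-dollar gain survives
# if your workload does not split cleanly. All figures are assumptions.

GENERAL_PERF_PER_DOLLAR = 1.0  # prior-generation baseline
SPECIALIZED_GAIN = 1.8         # the claimed 80% perf-per-dollar improvement

def blended_perf_per_dollar(clean_split_fraction: float) -> float:
    """Work on the right SKU gets the full gain; work that straddles the
    split runs at baseline (stranded capacity, data movement between
    pools, idle time on the wrong chip)."""
    clean = clean_split_fraction * SPECIALIZED_GAIN
    messy = (1 - clean_split_fraction) * GENERAL_PERF_PER_DOLLAR
    return clean + messy

for frac in (1.0, 0.8, 0.5):
    print(f"{frac:.0%} cleanly split -> {blended_perf_per_dollar(frac):.2f}x baseline")
```

At a perfect split the model returns the full 1.8x; at a 50% split it falls to 1.4x, before counting the operational cost of running two pools. Knowing your own split fraction is the homework the 8t/8i generation assigns.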