Tenstorrent Galaxy Blackhole: 23 PFLOPS from 32 Chips, No Disaggregation Needed

Tenstorrent is launching the Galaxy Blackhole server: a 6U unit that packs 32 Blackhole chips and delivers 23 PFLOPS in Block FP8 with a unified prefill and decode architecture. EE Times tested it ahead of launch and measured 255 tokens per second per user in chatbot-style workloads, and up to 350 tokens per second in Blitz Mode for agentic inference workloads. Time to first token in Blitz Mode is under 4 seconds for DeepSeek-671B.
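To put the headline figure in per-chip terms, here is a minimal sketch of the arithmetic. The three constants are the numbers quoted above; the function names are illustrative, and the calculation assumes the 23 PFLOPS is spread evenly across the 32 chips, which the source does not state explicitly.

```python
# Figures quoted in the article for one Galaxy Blackhole server.
TOTAL_PFLOPS_BLOCK_FP8 = 23.0
CHIPS_PER_GALAXY = 32
RACK_UNITS = 6


def per_chip_pflops(total_pflops: float, chips: int) -> float:
    """Average Block FP8 PFLOPS per chip, assuming an even split."""
    return total_pflops / chips


def pflops_per_rack_unit(total_pflops: float, rack_units: int) -> float:
    """Compute density: PFLOPS per rack unit of height."""
    return total_pflops / rack_units


if __name__ == "__main__":
    chip = per_chip_pflops(TOTAL_PFLOPS_BLOCK_FP8, CHIPS_PER_GALAXY)
    density = pflops_per_rack_unit(TOTAL_PFLOPS_BLOCK_FP8, RACK_UNITS)
    print(f"{chip:.3f} PFLOPS per chip")
    print(f"{density:.2f} PFLOPS per rack unit")
```

That works out to roughly 0.72 PFLOPS per chip, which is consistent with the "medium-performance chips" framing later in the piece.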

The no-disaggregation stance is a deliberate counter-positioning move. The dominant infrastructure trend for large-scale LLM inference is to split prefill and decode across separate hardware pools (often different accelerator types) because their compute profiles are fundamentally different: prefill processes the whole prompt in parallel and tends to be compute-bound, while decode generates one token at a time and tends to be memory-bandwidth-bound. Tenstorrent is betting that scale-out with uniform Blackhole chips in Galaxy clusters can match or beat that efficiency without the operational complexity of heterogeneous disaggregated systems.
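The gap between the two phases can be illustrated with a rough arithmetic-intensity estimate. This is a simplified sketch, not Tenstorrent's model: it assumes each weight is read from memory once per forward pass and contributes one multiply-add (2 FLOPs) per token processed in that pass, and it ignores attention and KV-cache traffic. The function name and the example prompt length are hypothetical.

```python
def arithmetic_intensity(tokens_per_weight_pass: int, bytes_per_param: float) -> float:
    """Approximate FLOPs performed per byte of weights read.

    Simplifying assumption: every weight is read once per forward pass and
    does one multiply-add (2 FLOPs) for each token in that pass.
    """
    flops_per_param = 2.0 * tokens_per_weight_pass
    return flops_per_param / bytes_per_param


if __name__ == "__main__":
    BYTES_PER_PARAM = 1.0  # assumption: roughly 1 byte per weight in an 8-bit format

    # Prefill: all prompt tokens share one pass over the weights.
    prefill = arithmetic_intensity(tokens_per_weight_pass=4096, bytes_per_param=BYTES_PER_PARAM)

    # Decode for a single user: one new token per pass over the weights.
    decode = arithmetic_intensity(tokens_per_weight_pass=1, bytes_per_param=BYTES_PER_PARAM)

    print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Under these assumptions prefill does thousands of times more work per byte fetched than single-user decode, which is the asymmetry that motivates disaggregation and that a unified design has to absorb on one type of chip.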

Jim Keller's framing ("we do something no one else does, where we can hook up a lot of medium-performance chips together") is honest about what the architecture actually is: a scaling play, not a single-chip performance play. That is a different risk profile from Nvidia's (dominant single-chip performance) or custom silicon's (domain-locked). Whether it holds up under hyperscaler workloads is unproven, but the token-throughput numbers come from EE Times' independent testing, not from marketing slides.

The real test will be whether typical clusters of 4 to 32 Galaxies can meet the memory bandwidth requirements of very large models. Blackhole's on-chip memory and interconnect are constraints that do not show up in PFLOPS metrics. Watch for enterprise adopters and the pricing story.
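One way to see why bandwidth matters more than PFLOPS for decode is a back-of-the-envelope ceiling on single-user token rate. The sketch below assumes decode is purely memory-bandwidth-bound, that the active weights are read once per generated token, and that bandwidth aggregates perfectly across chips with no interconnect overhead. Every input value is a placeholder assumption, not a verified Blackhole or DeepSeek figure; substitute real specifications before drawing conclusions.

```python
def decode_tokens_per_second_bound(
    chips: int,
    bandwidth_bytes_per_s_per_chip: float,
    active_params: float,
    bytes_per_param: float,
) -> float:
    """Upper bound on single-user decode rate when weight reads dominate.

    Assumes all active weights are streamed from memory once per token and
    that per-chip bandwidth adds up linearly across chips (optimistic).
    """
    aggregate_bandwidth = chips * bandwidth_bytes_per_s_per_chip
    bytes_read_per_token = active_params * bytes_per_param
    return aggregate_bandwidth / bytes_read_per_token


if __name__ == "__main__":
    # All values below are illustrative assumptions.
    bound = decode_tokens_per_second_bound(
        chips=32,                              # one Galaxy, per the article
        bandwidth_bytes_per_s_per_chip=500e9,  # placeholder, not a confirmed spec
        active_params=40e9,                    # placeholder for a sparse model's active weights
        bytes_per_param=1.0,                   # assumes roughly 1 byte per weight
    )
    print(f"bandwidth-bound ceiling: {bound:.0f} tokens/s per user")
```

The bound ignores KV-cache reads, batching, and chip-to-chip communication, all of which lower the real number, so it is best read as a sanity check on published throughput rather than a prediction.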