Tenstorrent is launching the Galaxy Blackhole server this week -- a 6U rack unit with 32 Blackhole chips delivering 23 PFLOPS (Block FP8) for both prefill and decode in a single integrated box. No disaggregated prefill/decode cluster. No separate memory nodes. Just one system that handles the full inference pipeline.
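The headline figure implies a per-chip number worth sanity-checking. A quick back-of-envelope sketch, using only the announced totals (the division and rounding are mine, not Tenstorrent's):

```python
# Per-chip Block FP8 throughput implied by the announced Galaxy figures:
# 23 PFLOPS aggregate across 32 Blackhole chips.
GALAXY_PFLOPS = 23.0     # aggregate Block FP8, per the announcement
CHIPS_PER_GALAXY = 32

per_chip_pflops = GALAXY_PFLOPS / CHIPS_PER_GALAXY
print(f"{per_chip_pflops:.2f} PFLOPS per chip")  # ~0.72 PFLOPS/chip
```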
The disaggregation trend has been framed as the only path to cost-effective large-model serving at scale. Tenstorrent is betting that enough SRAM bandwidth per chip, combined with a scalable Galaxy-to-Galaxy fabric, can deliver the same tokenomics without the operational complexity of splitting prefill and decode across separate hardware pools. In pre-launch testing, EE Times measured Blitz Mode running DeepSeek-671B at 255 tokens per second per user; the company claims 350 in production. At that throughput, disaggregation becomes harder to justify for anyone serving batch sizes from eight to a few dozen concurrent users.
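To make the tokenomics argument concrete, here is the rough aggregate-throughput arithmetic. The per-user rates are the article's figures and the batch sizes span the range it mentions; the linear scaling with batch size is a simplifying assumption that real systems will not hit exactly:

```python
# Rough aggregate decode throughput implied by the per-user numbers above.
# Assumes (optimistically) that per-user rate holds constant as batch grows.
MEASURED_TOK_S = 255   # EE Times pre-launch measurement, DeepSeek-671B
CLAIMED_TOK_S = 350    # Tenstorrent's production claim

for batch in (8, 16, 32):
    measured = batch * MEASURED_TOK_S
    claimed = batch * CLAIMED_TOK_S
    print(f"batch {batch:>2}: {measured:>6,} tok/s measured, "
          f"{claimed:>6,} tok/s claimed")
```

Even at the low end of that batch range, a single box is pushing thousands of tokens per second, which is the scale at which an operator starts comparing it against a disaggregated pool.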
The detail worth noting: Tenstorrent's Blackhole chips include 16 big RISC-V cores dedicated to managing data movement. That keeps the host CPU out of the critical path for orchestrating chip-to-chip communication across a Galaxy cluster -- a documented bottleneck on the older Wormhole generation for small-batch workloads. The fix is silicon-level, not software-level.
The counter: real-time video generation as a headline demo is a thin proof point for the enterprises actually spending on inference infrastructure. The honest test is sustained large-batch LLM throughput under production traffic patterns, and Tenstorrent has not published those numbers yet. Still, the integrated approach is a credible alternative architecture -- and Jim Keller does not typically announce things he cannot ship.