Signal | The Next Platform

AWS Graviton5 Breaks the Reticle Limit With Four 3nm Chiplets and Ships in M9g

AWS abandoned monolithic die design for Graviton5: four 48-core chiplets on TSMC 3nm stitched with 420 GB/sec die-to-die links, shipping in M9g instances and signaling where hyperscaler custom silicon is headed.

#ai-hardware #chiplets #semiconductor

AWS shipped Graviton5 in M9g and M9gd EC2 instances, and the architecture is worth reading carefully. The design is not a bigger Graviton4. It is four separate 48-core chiplets, each carrying the Arm Neoverse V3 (Poseidon) compute subsystem, stitched together with die-to-die interconnects running at 420 GB/sec per link. Total core count: 192. Total memory controllers: 12 DDR5. Total PCIe: 8 controllers with 96 lanes and CXL 3.0 support. Process node: TSMC 3nm, up from 4nm on Graviton4.
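As a quick sanity check on those figures, the socket totals and a couple of derived ratios can be worked out in a few lines. This is only a sketch: the `SocketSpec` class and its field names are illustrative, the inputs are the numbers quoted above, and the per-controller ratios assume resources are shared evenly, which the article does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketSpec:
    """Socket-level figures as quoted in the article (class and field names are illustrative)."""
    chiplets: int
    cores_per_chiplet: int
    ddr5_controllers: int
    pcie_controllers: int
    pcie_lanes: int

    @property
    def total_cores(self) -> int:
        # 4 chiplets x 48 cores each
        return self.chiplets * self.cores_per_chiplet

    @property
    def cores_per_memory_controller(self) -> float:
        # Assumes cores share the DDR5 controllers evenly (an assumption, not a stated fact)
        return self.total_cores / self.ddr5_controllers

    @property
    def lanes_per_pcie_controller(self) -> float:
        # Assumes lanes are split evenly across controllers (also an assumption)
        return self.pcie_lanes / self.pcie_controllers


GRAVITON5 = SocketSpec(
    chiplets=4,
    cores_per_chiplet=48,
    ddr5_controllers=12,
    pcie_controllers=8,
    pcie_lanes=96,
)

if __name__ == "__main__":
    print("total cores:", GRAVITON5.total_cores)
    print("cores per DDR5 controller:", GRAVITON5.cores_per_memory_controller)
    print("lanes per PCIe controller:", GRAVITON5.lanes_per_pcie_controller)
```

Under the even-sharing assumption, that works out to 16 cores per DDR5 controller and 12 lanes per PCIe controller.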

The constraint being removed is the reticle limit, the roughly 26 mm by 33 mm field (about 858 mm²) that a lithography scanner can expose in a single shot. A 192-core monolithic die on 3nm at Graviton density would push past what TSMC can expose in one EUV shot, and yield falls steeply with die area well before that point, which collapses the economics. By building four 48-core chiplets, AWS keeps each die small, keeps per-chiplet yield high, and uses the D2D fabric to stitch them into a single logical processor. This is the standard chiplet trade: pay for die-to-die interconnect overhead, save on yield-adjusted cost per core. The Annapurna Labs block diagram shown at re:Invent in December did not reflect the shipping design. The monolithic preview was a stand-in; the shipping part is chiplets all the way down.
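The yield side of that trade can be illustrated with the textbook Poisson yield model, Y = exp(-A * D0), where A is die area and D0 is defect density. The sketch below is not AWS's or TSMC's cost model. The die areas and the defect density are hypothetical placeholders, and the calculation ignores D2D interface area, packaging cost, assembly yield, and wafer-edge losses.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a simple Poisson model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_package(area_mm2: float, n_dies: int, defects_per_cm2: float) -> float:
    """Wafer area (mm^2) consumed per working package.

    Assumes dies are tested before assembly (known-good die), so a defective
    chiplet is discarded on its own rather than scrapping the whole package.
    """
    return n_dies * area_mm2 / poisson_yield(area_mm2, defects_per_cm2)


if __name__ == "__main__":
    D0 = 0.1  # defects per cm^2, hypothetical
    mono = silicon_per_good_package(800.0, 1, D0)     # one hypothetical 800 mm^2 die
    chiplet = silicon_per_good_package(200.0, 4, D0)  # four hypothetical 200 mm^2 dies
    print(f"monolithic: {mono:.0f} mm^2 of silicon per good part")
    print(f"chiplets:   {chiplet:.0f} mm^2 of silicon per good part")
    print(f"ratio:      {mono / chiplet:.2f}x")
```

With equal total area, the ratio depends only on the difference in per-die area and on D0, so the advantage of small dies grows as the process gets less mature or the dies get larger. Whether it outweighs the interconnect and packaging overhead is exactly the bet a chiplet design makes.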

CXL 3.0 support across 96 PCIe 6.0 lanes is the second signal. For hardware teams running memory-intensive workloads (DRAM simulation, large post-layout sign-off runs, ML training jobs that saturate local DRAM), CXL 3.0 means memory capacity can be expanded and pooled without adding a second socket, although CXL-attached memory generally carries latency on the order of a remote NUMA hop, so data placement still matters. AWS can tier memory across the CXL fabric at a cost that scales roughly linearly with capacity, whereas price per gigabyte rises steeply for the highest-capacity DIMMs.
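Tiering memory across a CXL fabric trades capacity against access latency, and a weighted average is a rough way to reason about that trade. The sketch below is a back-of-the-envelope model only: the function name is illustrative, and the latency values are hypothetical placeholders, not measurements of Graviton5 or any CXL device.

```python
def effective_latency_ns(local_fraction: float, local_ns: float, cxl_ns: float) -> float:
    """Average memory latency when a fraction of accesses hit local DRAM
    and the remainder go to CXL-attached memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * cxl_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0  # hypothetical local DDR5 latency
    CXL_NS = 250.0    # hypothetical CXL-attached latency
    for frac in (1.0, 0.9, 0.7, 0.5):
        avg = effective_latency_ns(frac, LOCAL_NS, CXL_NS)
        print(f"{frac:.0%} of accesses local -> {avg:.0f} ns average")
```

The point of the model is that tiering works well when the hot working set fits in local DRAM and only colder data spills to the fabric. Real workloads should be profiled rather than estimated this way.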

The traditional argument for monolithic server silicon has always been determinism: one die, one scheduler, no D2D variance. Graviton5 at scale makes that argument harder to sustain. If a hyperscaler can ship 192 cores per socket in chiplet form, qualify it at datacenter volume, and undercut merchant silicon on cost-per-inference, the determinism premium for monolithic dies shrinks. Teams evaluating their 2026 cloud compute platform for EDA, simulation, or AI workloads should benchmark M9g against Intel Xeon 6+ and AMD EPYC Genoa on their own workloads today; the gap is real and it is widening.