Signal · Semiconductor Engineering

HBM Is Now Half the Cost of AI Chips, and the Test Flow Has Not Caught Up

HBM accounts for nearly half the cost of an AI chip and is the number-one cause of GPU failures in data centers, yet the test flow for HBM stacks is still optimized for test cost rather than for catching defects before they become $100K+ assembly scrap.

#testing #verification #ai-hardware #semiconductor

HBM now represents nearly half the cost of AI chips, and according to hyperscaler data cited by Synopsys, defective HBM stacks are the number-one cause of GPU failures in data centers. The test flow for those stacks is still designed around the cost of individual DRAM dies, not the cost of the fully assembled package they are about to become. That mismatch is the problem, and HBM4 with its 2,048-bit interface and 16-die stacks will make it significantly worse.

The current test flow runs multiple insertions at wafer level and stacked-die level, but the dominant production approach defers testing of singulated stacks because probing exotic structures is expensive and the planarity challenge is real. That deferral was rational when HBM had 4 dies per stack and a defective unit cost hundreds of dollars. It is less rational when HBM4 has 12 to 16 dies per stack, accounts for half the BOM cost of a GPU module, and a defective stack found at final assembly scraps an entire package carrying eight stacks. A single late-found failure at GPU-module scale can be a $100K+ write-off.
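The arithmetic behind that argument can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not data from the article: the 1% escape rate and the independence of escapes are hypothetical, while the eight stacks per package and the $100K order of magnitude follow the figures quoted above.

```python
def package_scrap_probability(stack_escape_rate: float, stacks_per_package: int) -> float:
    """Chance that at least one defective stack reaches final assembly.

    Assumes every stack escapes earlier test independently and with the
    same probability, which is a simplification.
    """
    return 1.0 - (1.0 - stack_escape_rate) ** stacks_per_package


def expected_scrap_cost(stack_escape_rate: float, stacks_per_package: int,
                        package_cost: float) -> float:
    """Expected write-off per assembled package if any bad stack scraps it."""
    return package_scrap_probability(stack_escape_rate, stacks_per_package) * package_cost


# Hypothetical inputs, for illustration only.
ESCAPE_RATE = 0.01         # assumed: 1% of stacks pass earlier test but are defective
STACKS_PER_PACKAGE = 8     # from the text: eight stacks per package
PACKAGE_COST = 100_000.0   # order of magnitude from the text, in dollars

p_scrap = package_scrap_probability(ESCAPE_RATE, STACKS_PER_PACKAGE)
cost = expected_scrap_cost(ESCAPE_RATE, STACKS_PER_PACKAGE, PACKAGE_COST)
print(f"P(package scrapped) = {p_scrap:.2%}")
print(f"Expected scrap cost per package = ${cost:,.0f}")
```

With these assumed inputs, a 1% per-stack escape rate compounds to roughly a 7.7% chance of scrapping the whole package, or about $7,700 of expected scrap per package. The point of the sketch is the compounding: the more stacks a package carries, the more a small per-stack escape rate is amplified at assembly.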

What is changing is the force of the economics. Synopsys, Teradyne, and Aehr Test Systems are all pushing test insertions earlier in the flow (shift-left), including wafer-level burn-in to catch infant mortality before stacking and proposed singulated stacked-die probing for integrators demanding stronger known-good-stack guarantees. The DFT structure in HBM4 also changes: the 2,048-bit interface, double the 1,024-bit width of HBM3, requires more TSVs, which means more opportunities for TSV defects and more test points to cover them.
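Why more TSVs mean more defect opportunities can be illustrated with a simple yield model. The sketch below assumes independent TSV defects and ignores any redundancy or repair, and every number in it (the per-TSV defect probability and both TSV counts) is a hypothetical placeholder rather than a real HBM figure.

```python
def tsv_limited_yield(tsv_count: int, per_tsv_defect_prob: float) -> float:
    """Fraction of stacks with zero TSV defects.

    Assumes defects are independent and that no redundancy or repair is
    available, so a single bad TSV fails the stack.
    """
    return (1.0 - per_tsv_defect_prob) ** tsv_count


# Hypothetical placeholders, not real HBM3 or HBM4 figures.
PER_TSV_DEFECT_PROB = 1e-5
BASELINE_TSVS = 5_000
DOUBLED_TSVS = 2 * BASELINE_TSVS  # assumed: TSV count scales with interface width

baseline_yield = tsv_limited_yield(BASELINE_TSVS, PER_TSV_DEFECT_PROB)
doubled_yield = tsv_limited_yield(DOUBLED_TSVS, PER_TSV_DEFECT_PROB)
print(f"Baseline TSV-limited yield: {baseline_yield:.4f}")
print(f"Yield with doubled TSVs:    {doubled_yield:.4f}")
```

Under this model, doubling the TSV count squares the TSV-limited yield, so the loss grows faster than the count itself. That is one way to read the case for additional test points aimed specifically at TSVs.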

For hardware teams shipping AI compute infrastructure, the near-term action is to revisit supply agreements with HBM vendors on known-good-stack (KGS) guarantees before HBM4 enters volume production. The GPU makers that negotiate KGS test coverage now will see fewer yield surprises in 2027 assembly runs than those that do not. The test equipment suppliers (Teradyne, Aehr) benefit regardless; the DRAM vendors that move to stronger shift-left guarantees first will have a quality differentiation argument that carries real weight at the volumes hyperscalers are buying.