The Physical AI Inference Gap: H100 Reaches Only 27 Percent of Peak HBM Bandwidth in Batch-1 Decode

For physical AI workloads (robots, autonomous vehicles, embodied agents), inference almost always runs as single-stream, batch-1 autoregressive decode. The standard description is that this workload is memory-bandwidth-bound, so faster memory means faster inference. A new benchmarking paper across H100 SXM5, A100-80GB, L40S, and L4 shows that this description is true but dangerously incomplete. In batch-1 decode at ctx=2048, an L4 reaches roughly 81 percent of its analytic memory floor (the latency bound set by bytes moved over peak memory bandwidth). An H100, with roughly 11x the memory bandwidth (3.35 TB/s of HBM3 against the L4's 300 GB/s of GDDR6), reaches only 27 percent. Buying more bandwidth does not buy a proportional latency improvement.
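
To make the "analytic memory floor" concrete, here is a minimal back-of-envelope sketch. It assumes the floor is simply bytes moved divided by peak bandwidth, which may differ from the paper's exact definition. The function names, the attention shape (32 KV heads of dimension 128), and the spec-sheet bandwidths are illustrative assumptions, not values taken from the paper.

```python
def memory_floor_s(bytes_moved: float, peak_bw_bytes_per_s: float) -> float:
    """Lower bound on step time if memory traffic were the only cost."""
    return bytes_moved / peak_bw_bytes_per_s


def fraction_of_floor(floor_s: float, measured_s: float) -> float:
    """Share of the measured step time that the memory floor accounts for."""
    return floor_s / measured_s


def sdpa_decode_kv_bytes(ctx: int, n_kv_heads: int, head_dim: int,
                         bytes_per_elem: int = 2) -> int:
    """Bytes of K and V read by one batch-1 SDPA decode call (bf16 = 2 bytes)."""
    return 2 * ctx * n_kv_heads * head_dim * bytes_per_elem


# Spec-sheet peak bandwidths in bytes/s (assumed here, not from the paper).
PEAK_BW = {"H100 SXM5": 3.35e12, "L4": 300e9}

# Hypothetical attention shape at the article's context length.
kv_bytes = sdpa_decode_kv_bytes(ctx=2048, n_kv_heads=32, head_dim=128)

for gpu, bw in PEAK_BW.items():
    floor = memory_floor_s(kv_bytes, bw)
    print(f"{gpu}: floor = {floor * 1e6:.1f} us for {kv_bytes / 2**20:.0f} MiB")
```

Under these assumptions, a measured time can be compared against the floor with `fraction_of_floor(floor, measured)`. A hypothetical 37 us measurement against a 10 us floor would give about 0.27.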

The missing term is kernel launch overhead. CUDA Graphs, which capture a sequence of kernel launches once and replay it as a single submission, improve H100 decode latency by 1.259x (95% CI: 1.253-1.267) across fresh sessions at ctx=2048. On the L4, the same intervention gives only 1.028x. Launch overhead is a largely fixed per-launch cost, so it is hidden on slow, bandwidth-constrained GPUs, where each kernel runs long enough to mask it, and becomes dominant on fast ones. This is a structural property of the GPU execution model, not a tuning artifact.
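
The sketch below illustrates why removing the same overhead helps a fast GPU more than a slow one. It is a toy additive model, assuming step time is the memory floor plus a fixed overhead. Real kernels can overlap launches with execution, and this is not the paper's model. Every number and function name in it is hypothetical, chosen only to show the trend.

```python
def step_latency(floor_s: float, overhead_s: float) -> float:
    """Toy model: step time = memory floor + fixed launch overhead."""
    return floor_s + overhead_s


def speedup_from_removing(floor_s: float, overhead_s: float,
                          removed_s: float) -> float:
    """Speedup when `removed_s` of the overhead is eliminated."""
    if not 0 <= removed_s <= overhead_s:
        raise ValueError("removed_s must lie between 0 and overhead_s")
    before = step_latency(floor_s, overhead_s)
    after = step_latency(floor_s, overhead_s - removed_s)
    return before / after


# Hypothetical values: the same overhead on both GPUs, different floors.
OVERHEAD_S = 27e-6  # fixed overhead per decode step
REMOVED_S = 7e-6    # portion that graph replay is assumed to remove

fast = speedup_from_removing(floor_s=10e-6, overhead_s=OVERHEAD_S,
                             removed_s=REMOVED_S)
slow = speedup_from_removing(floor_s=112e-6, overhead_s=OVERHEAD_S,
                             removed_s=REMOVED_S)
print(f"fast GPU: {fast:.3f}x, slow GPU: {slow:.3f}x")
```

The absolute saving is identical in both cases. Only the denominator changes, which is why the relative gain shrinks as the memory floor grows.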

For teams evaluating hardware for physical AI inference at the edge, the hardware selection calculus shifts. The question is not peak memory bandwidth alone. It is the ratio of kernel launch overhead to kernel execution time at your batch size and context length. A slower GPU, on which launch overhead is a smaller share of each step, may come closer to its theoretical peak at batch-1 than an H100 will. The benchmarking methodology in the paper (44 validated cells across 4 GPUs, 4 context lengths, controlled bf16 SDPA) is directly reproducible. Teams speccing physical AI silicon in 2026 should run this benchmark against their target workload before committing to a GPU tier.
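
When rerunning a benchmark like this, a speedup figure is only useful with an interval around it. One common approach is a percentile bootstrap over per-session latencies, sketched below. The paper's own statistical procedure is not described here, so the function name, the resampling scheme, and the synthetic latencies are all assumptions for illustration.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    """Return (point, lo, hi) for median(baseline) / median(treated).

    Percentile bootstrap: resample each list with replacement, recompute
    the ratio of medians, and take the alpha/2 and 1 - alpha/2 quantiles
    of the resampled ratios.
    """
    rng = random.Random(seed)
    point = statistics.median(baseline) / statistics.median(treated)
    ratios = []
    for _ in range(n_boot):
        b = rng.choices(baseline, k=len(baseline))
        t = rng.choices(treated, k=len(treated))
        ratios.append(statistics.median(b) / statistics.median(t))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


# Synthetic per-session latencies in milliseconds (made up for illustration).
eager_ms = [1.26 + 0.01 * ((i % 5) - 2) for i in range(20)]
graph_ms = [1.00 + 0.01 * ((i % 5) - 2) for i in range(20)]

point, lo, hi = bootstrap_speedup_ci(eager_ms, graph_ms)
print(f"speedup {point:.3f}x (95% CI: {lo:.3f}-{hi:.3f})")
```

On real hardware the latencies fed into such a function need to be measured after the GPU has finished its work, for example by synchronizing the device or using CUDA events, since kernel launches return to the host asynchronously.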