hw.dev
hw.dev/signal/hsco-bench-llm-agent-soc-hw-sw-codesign-2026
Signal / arXiv

HSCO-Bench: LLM Agents Design Working SoCs But Waste Most of the Silicon

The first benchmark for end-to-end SoC co-design by LLM agents finds that today's frontier models can generate working accelerators but use less than a quarter of the available hardware capacity, a gap of 4x or more.

#fpga #eda #ai-hardware #tools

Researchers from Columbia University have released the first benchmark that asks LLM agents to do end-to-end hardware-software co-design: pick what to accelerate in a software workload, design and integrate custom accelerators into a System-on-Chip (SoC), and map the kernels onto the generated hardware. The resulting designs were deployed on a real AMD Virtex-7 FPGA rather than evaluated only in simulation. Only 2 of the 5 frontier models tested could produce a valid SoC prototype at all.
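The first of those three steps, choosing what to accelerate, is commonly reasoned about with Amdahl's law. The sketch below is an illustration only, not the benchmark's method: the workload profile, the kernel names, and the assumed 10x per-kernel speedup are all hypothetical values chosen for the example.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of total runtime is sped up by `kernel_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Hypothetical runtime profile: share of total execution time per kernel.
# These names and numbers are assumptions, not data from HSCO-Bench.
profile = {"conv2d": 0.70, "softmax": 0.10, "io": 0.20}

# Assumed speedup an accelerator would deliver on whichever kernel is offloaded.
KERNEL_SPEEDUP = 10.0

# Rank kernels by the end-to-end speedup that offloading each one alone would yield.
ranked = sorted(
    profile,
    key=lambda name: amdahl_speedup(profile[name], KERNEL_SPEEDUP),
    reverse=True,
)
best = ranked[0]

for name in ranked:
    print(f"{name}: {amdahl_speedup(profile[name], KERNEL_SPEEDUP):.2f}x end-to-end")
```

Under these assumed numbers the dominant kernel is the obvious candidate; the harder part of co-design, and what the benchmark stresses, is that the achievable per-kernel speedup itself depends on the hardware the agent goes on to design.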

The gap between "generates valid hardware" and "generates efficient hardware" is where the benchmark cuts deep. The best result hit a 16.22x peak speedup, which sounds good until you see the other number: those designs topped out at 23.67% hardware utilization. The agents could synthesize working accelerators, but they consistently left more than three-quarters of the available compute on the table. The likely root cause is the co-design problem itself: optimizing across the software-hardware abstraction boundary requires reasoning about both sides simultaneously, and current models treat them as loosely coupled passes.
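The arithmetic linking the 23.67% utilization figure to the "4x or more" gap can be checked in a few lines. This is a minimal sketch: the utilization value comes from the article, while the function names are illustrative and the "headroom" definition (the reciprocal of utilization) is an assumption about how the gap is being counted.

```python
def utilization_headroom(utilization: float) -> float:
    """Factor by which used capacity would have to grow to reach 100% utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def idle_fraction(utilization: float) -> float:
    """Share of available hardware capacity left unused."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return 1.0 - utilization


# Peak hardware utilization reported for the best designs (from the article).
PEAK_UTILIZATION = 0.2367

headroom = utilization_headroom(PEAK_UTILIZATION)
idle = idle_fraction(PEAK_UTILIZATION)

print(f"headroom: {headroom:.2f}x, idle capacity: {idle:.2%}")
```

The headroom works out to roughly 4.2x, with about 76% of capacity idle. Note that this says nothing about how much extra speedup full utilization would buy; that depends on whether the workload scales with the added compute.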

This matters for tooling. Existing agentic EDA workflows (Synopsys AgentEngineer, Cadence's AI flows, Siemens Fuse EDA) each focus on specific stages: RTL generation, verification, physical implementation, or signoff. None of them benchmarks end-to-end co-design, where the software target and the hardware architecture co-evolve. HSCO-Bench is the first benchmark to quantify that gap, which also makes it the first yardstick for measuring improvement. The 16.22x peak speedup at 23.67% utilization is not a failure; it is a baseline.

The 4x-plus hardware utilization gap is the near-term research target. Teams that close it first would have an agent capable of generating better-than-human co-design starting points, in a domain where human co-design cycles currently take months. Because the results come from an AMD Virtex-7 FPGA and not a simulator, the target is concrete. Watch this benchmark series the same way the ML community watched MLPerf: the number that matters is how fast the utilization gap closes over the next 18 months.