Researchers from Ghent University and ETH Zurich measured column-level sparsity across seven diffusion model workloads and found that the 52-85% element-level sparsity numbers widely cited in AI accelerator marketing overstate hardware-exploitable sparsity by up to 78 percentage points. On systolic-array hardware, a column with a single non-zero activation forces the entire column to compute. Element-level sparsity does not translate to column-level savings when activations are scattered rather than clustered.
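The gap between the two metrics is easy to reproduce on a toy example. The sketch below is illustrative only and is not the paper's measurement code: the function names and the two 4x4 tiles are made up, and it assumes a column can be skipped only when every entry in it is zero.

```python
def element_sparsity(acts):
    """Fraction of individual entries that are zero."""
    total = sum(len(row) for row in acts)
    zeros = sum(1 for row in acts for v in row if v == 0)
    return zeros / total


def column_sparsity(acts):
    """Fraction of columns that are entirely zero (assumed skippable as a unit)."""
    n_cols = len(acts[0])
    zero_cols = sum(
        1 for c in range(n_cols) if all(row[c] == 0 for row in acts)
    )
    return zero_cols / n_cols


# Hypothetical activation tiles: both are 75% zeros at the element level.
scattered = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
clustered = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

for name, tile in (("scattered", scattered), ("clustered", clustered)):
    print(name, element_sparsity(tile), column_sparsity(tile))
```

Both tiles report 75% element-level sparsity, but the scattered tile has no all-zero column to skip, while the clustered tile has three out of four.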
The mechanism breaks down by workload class. UNet+transformer architectures show activation concentration that yields real cycle reductions of up to 30.6%. Pure-transformer DiT models show activation dispersion, yielding a cycle reduction of only 12.4%. Motion and dance transformer models range from modest reductions up to 50.8%, with behavior driven by extreme token dimensions. The key finding is that workload group and model dimensions jointly determine whether sparsity-based layout optimization is beneficial at all. Across all workloads, memory stalls rather than compute account for 84-89% of total cycles on a GDDR6-based accelerator.
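To see why the memory-stall share matters, here is a back-of-the-envelope, Amdahl-style bound. It rests on an assumption the summary above does not state: that the quoted cycle reductions apply only to the non-stalled (compute) cycles and leave memory stalls unchanged. If the paper's figures are already end-to-end, this calculation does not apply. The function name is hypothetical.

```python
def total_cycle_reduction(memory_stall_fraction, compute_reduction):
    """Upper bound on end-to-end cycle savings.

    Assumption (not from the paper): sparsity only shortens the
    compute portion of the run; stalled cycles stay the same.
    """
    compute_fraction = 1.0 - memory_stall_fraction
    return compute_fraction * compute_reduction


# Stall fractions and the 30.6% figure come from the article;
# combining them this way is the assumption described above.
for stall in (0.84, 0.89):
    saved = total_cycle_reduction(stall, 0.306)
    print(f"stall={stall:.0%} -> total cycles saved ~ {saved:.1%}")
```

Under that assumption, a 30.6% compute reduction works out to roughly 3.4% to 4.9% of total cycles, which is why the stall figure deserves as much attention as the sparsity figure.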
The consequence for hardware teams evaluating AI accelerators: any vendor quoting element-level sparsity percentages without workload-specific column-level analysis is presenting numbers that do not correspond to hardware wins. The gap can be as large as 78 percentage points, so an accelerator marketed on 80% element-level sparsity for diffusion workloads could, depending on model architecture, have as little as 2% that the hardware can actually exploit. The paper also includes a threshold accuracy sweep confirming that UNet+transformer workloads degrade gracefully under sparsity thresholding, while motion models exhibit a sharp accuracy cliff at the next threshold above the primary operating point.
For anyone buying or building diffusion inference silicon in 2026: run workload-specific profiling at column granularity before signing off on a sparsity-based architecture. The paper's taxonomy is a practical framework. The vendors who do not provide this breakdown are either not measuring it or not publishing it, and either answer tells you something useful about their benchmark methodology.