Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Abstract

Spatz is a compact 32-bit vector processing unit designed as an energy-efficient Processing Element (PE) for large-scale clusters with shared L1 memory. It uses vector processing to reduce instruction fetch bandwidth and thereby mitigate the Von Neumann Bottleneck. Spatz implements the integer embedded subset of the RISC-V Vector Extension, and a Spatz-based cluster requires 40% less energy per 32-bit integer multiply-accumulate operation than an equivalent cluster of scalar cores. Integrated into the MemPool architecture, a Spatz-based system reaches up to 285 GOPS on 32-bit integer matrix multiplication, 70% more than the scalar baseline, and up to 266 GOPS/W, more than double the baseline's energy efficiency.

Report

Key Highlights

  • VPU Innovation: Spatz is a compact, modular vector processing unit (VPU) designed specifically as a lean Processing Element (PE) for large-scale clusters utilizing shared L1 memory.
  • Energy per Operation: A Spatz-based cluster requires 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores.
  • Performance Gain: A Spatz-based MemPool system achieves up to 285 GOPS on a 256x256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system.
  • Energy Efficiency: On the same kernel, the Spatz-based system reaches up to 266 GOPS/W, more than double the efficiency of the Snitch-based system (128 GOPS/W).
  • VNB Mitigation: The core goal of Spatz is to leverage vector processing to reduce instruction fetch bandwidth, thereby mitigating the Von Neumann Bottleneck (VNB) prevalent in highly parallel architectures.
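The headline figures above imply a few baseline numbers that the summary does not state directly. The following is a back-of-the-envelope sketch in Python that derives them from the quoted values only. The function name and dictionary keys are hypothetical, and the power estimate assumes the peak throughput and peak efficiency were measured at the same operating point, which the summary does not confirm.

```python
# Figures quoted in the summary.
SPATZ_GOPS = 285.0          # peak throughput, 256x256 32-bit integer matmul
SPATZ_GOPS_PER_W = 266.0    # peak energy efficiency, same kernel
SNITCH_GOPS_PER_W = 128.0   # Snitch-based MemPool baseline efficiency
SPATZ_PJ_PER_MAC = 7.9      # cluster-level energy per 32-bit integer MAC
THROUGHPUT_GAIN = 0.70      # "70% more" throughput than the baseline
ENERGY_SAVING = 0.40        # "40% less" energy per MAC than the baseline


def implied_baselines() -> dict:
    """Derive values the summary implies but does not report (estimates only)."""
    return {
        # Baseline throughput if 285 GOPS is exactly 1.7x the baseline.
        "snitch_gops": SPATZ_GOPS / (1.0 + THROUGHPUT_GAIN),
        # Baseline energy per MAC if 7.9 pJ is exactly 60% of it.
        "snitch_pj_per_mac": SPATZ_PJ_PER_MAC / (1.0 - ENERGY_SAVING),
        # Ratio behind the "more than double" efficiency claim.
        "efficiency_ratio": SPATZ_GOPS_PER_W / SNITCH_GOPS_PER_W,
        # ASSUMPTION: both peaks occur at the same operating point.
        "spatz_power_w": SPATZ_GOPS / SPATZ_GOPS_PER_W,
    }


if __name__ == "__main__":
    for name, value in implied_baselines().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the Snitch-based baseline would sit near 168 GOPS and 13.2 pJ per multiply-accumulate, and the efficiency ratio works out to roughly 2.08x. These are rounded estimates, not measurements from the paper.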

Technical Details

  • Architecture: Spatz is a compact, modular 32-bit vector processing unit.
  • ISA Compliance: It implements the integer embedded subset of the RISC-V Vector Extension (version 1.0).
  • Core Configuration: The evaluated cluster contains two Spatz PEs, each with four Multiply-Accumulate Units (MACUs).
  • Integration Platform: Spatz was benchmarked after integration into MemPool, a large-scale many-core shared-L1 cluster architecture.
  • Benchmark Operation Cost: The energy cost for a 32-bit integer multiply-accumulate operation is 7.9 pJ.
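To put the benchmark figures in context, the sketch below estimates the work in a 256x256 matrix multiplication and what the quoted throughput and energy figures would imply for it. It assumes the common convention that one multiply-accumulate counts as two operations, which the summary does not state. It also applies the cluster-level 7.9 pJ figure to the whole kernel and ignores all non-MAC energy, so the result is illustrative only. All function names are hypothetical.

```python
def matmul_macs(n: int) -> int:
    """Multiply-accumulates in a dense n x n by n x n matrix multiplication."""
    return n ** 3


def matmul_ops(n: int) -> int:
    """Operation count, ASSUMING one MAC is counted as two operations."""
    return 2 * matmul_macs(n)


def runtime_us(n: int, gops: float) -> float:
    """Ideal kernel runtime in microseconds at a sustained throughput in GOPS."""
    return matmul_ops(n) / (gops * 1e9) * 1e6


def mac_energy_uj(n: int, pj_per_mac: float) -> float:
    """Energy in microjoules if every MAC costs pj_per_mac and nothing else does."""
    return matmul_macs(n) * pj_per_mac * 1e-12 * 1e6


if __name__ == "__main__":
    n = 256
    print(f"operations: {matmul_ops(n)}")
    print(f"runtime at 285 GOPS: {runtime_us(n, 285.0):.2f} us")
    print(f"energy at 7.9 pJ/MAC: {mac_energy_uj(n, 7.9):.2f} uJ")
```

With these assumptions, the kernel comprises about 33.6 million operations, which at a sustained 285 GOPS would take roughly 118 microseconds.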

Implications

  • Validation of Lean Vector PEs: The results show that lean, simplified vector processors are viable as high-performance, energy-efficient PEs, offering an alternative to the scalar cores typically used in tightly-coupled shared-L1 clusters.
  • RISC-V Vector Extension Utility: Spatz demonstrates that even the integer embedded subset of the RISC-V Vector Extension (V extension) can deliver measurable gains in a many-core implementation: 70% higher throughput and more than double the energy efficiency on the benchmarked kernel.
  • Shifting Cluster Design Paradigm: This work suggests that future large-scale, tightly-coupled parallel architectures may benefit from vector PEs rather than scalar PEs, raising throughput while reducing instruction fetch and decode energy overhead. The Spatz-based MemPool system reaches up to 266 GOPS/W.
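The instruction fetch argument can be illustrated with a toy model that counts the instructions fetched by a scalar loop and by a strip-mined vector loop over the same data. This is a minimal sketch, not a model of Spatz or Snitch. The per-element and per-strip instruction counts and the vector length are hypothetical placeholders chosen only to show the scaling.

```python
import math


def scalar_fetches(n_elements: int, instrs_per_element: int = 5) -> int:
    """Instructions fetched by a scalar loop.

    ASSUMPTION: each element costs a fixed number of instructions
    (for example two loads, one MAC, a pointer update, and a branch).
    """
    return n_elements * instrs_per_element


def vector_fetches(n_elements: int, vl: int, instrs_per_strip: int = 8) -> int:
    """Instructions fetched by a strip-mined vector loop.

    ASSUMPTION: each strip of up to `vl` elements costs a fixed number of
    instructions (vector length setup, vector loads, a vector MAC,
    pointer updates, and a branch), independent of `vl`.
    """
    strips = math.ceil(n_elements / vl)
    return strips * instrs_per_strip


if __name__ == "__main__":
    n, vl = 1024, 32  # hypothetical problem size and vector length
    s, v = scalar_fetches(n), vector_fetches(n, vl)
    print(f"scalar: {s} fetches, vector: {v} fetches, ratio: {s / v:.1f}x")
```

In this model the number of fetches falls roughly in proportion to the vector length, since each vector instruction covers many elements. Actual savings depend on the kernel, the vector length, and the memory system.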