Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters

Abstract

This work introduces a hybrid architecture that enables efficient, reconfigurable systolic computation on shared L1-memory manycore clusters built from small RISC-V cores acting as processing elements (PEs). Two low-overhead RISC-V ISA extensions, Xqueue and queue-linked registers (QLRs), automate queue management in shared memory for seamless dataflow. Demonstrated in the 256-PE MemPool cluster, the design achieves up to 65% higher energy efficiency and doubles compute utilization (up to 73%) at only a 6% area overhead.

Report

Key Highlights

  • Hybrid Architecture: Successfully merges the flexibility of shared-L1-memory manycore clusters with the high efficiency of systolic computation.
  • RISC-V PEs: Small, energy-efficient RISC-V cores are utilized as reconfigurable processing elements (PEs) in the systolic array.
  • Performance Gain: The hybrid architecture doubles the compute unit utilization, achieving up to 73% utilization on diverse DSP kernels.
  • Efficiency: Delivers significant energy gains, operating up to 65% more energy efficiently than the shared-memory baseline and reaching 208 GOPS/W.
  • Low Overhead: The necessary architectural modifications result in only a 6% increase in area.

Technical Details

  • Architectural Paradigm: PEs form diverse, reconfigurable systolic topologies, with communication handled via queues mapped directly into the cluster's shared L1 memory.
  • Core Technology: The system is built around energy-efficient RISC-V cores.
  • ISA Extension 1: Xqueue: A low-overhead RISC-V ISA extension that enables single-instruction access to shared-memory-mapped queues, streamlining explicit data movement.
  • ISA Extension 2: Queue-linked registers (QLRs): Enable implicit, autonomous queue access, relieving the cores of explicit communication instructions and thereby raising PE utilization.
  • Implementation Vehicle: The architecture is demonstrated in MemPool, an open-source shared-memory cluster featuring 256 PEs.
  • Fabrication Specs: Implemented in 22 nm FDX technology, the design runs at 600 MHz with no frequency degradation relative to the baseline cluster, and 63% of its power is consumed within the PEs.

Implications

  • RISC-V Versatility: This work significantly enhances the capabilities of RISC-V manycore clusters, demonstrating they can efficiently handle highly structured, data-intensive workloads (like AI/DSP) without requiring complex, fixed-function accelerators.
  • Stream Processing: Xqueue and QLRs provide crucial low-overhead mechanisms for efficient stream processing and inter-core communication, addressing a traditional bottleneck in large-scale manycore designs.
  • Energy Efficiency Leadership: Achieving 208 GOPS/W positions this hybrid RISC-V approach as a highly competitive solution for energy-constrained applications, particularly in the embedded and edge AI domains.
  • Ecosystem Development: By building upon the open-source MemPool cluster and proposing specific, yet lightweight, RISC-V ISA extensions, the authors provide a clear pathway for the broader RISC-V ecosystem to adopt efficient, reconfigurable hardware acceleration techniques.
