Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters
Abstract
This work introduces a hybrid architecture that enables efficient, reconfigurable systolic computation on shared-L1-memory manycore clusters built from small RISC-V cores acting as processing elements (PEs). Two low-overhead RISC-V ISA extensions, Xqueue and queue-linked registers (QLRs), automate the management of queues mapped into shared memory, enabling seamless dataflow between PEs. Demonstrated on the 256-PE MemPool cluster, the design is up to 65% more energy efficient than the shared-memory baseline and doubles compute utilization (up to 73%), at only 6% area overhead.
Report
Key Highlights
- Hybrid Architecture: Successfully merges the flexibility of shared-L1-memory manycore clusters with the high efficiency of systolic computation.
- RISC-V PEs: Small, energy-efficient RISC-V cores serve as reconfigurable processing elements (PEs) in the systolic array.
- Performance Gain: The hybrid architecture doubles compute-unit utilization, reaching up to 73% on diverse DSP kernels.
- Efficiency: Delivers significant energy gains, operating up to 65% more energy efficiently than the shared-memory baseline and reaching 208 GOPS/W.
- Low Overhead: The necessary architectural modifications result in only a 6% increase in area.
Technical Details
- Architectural Paradigm: PEs form diverse, reconfigurable systolic topologies, with communication handled via queues mapped directly into the cluster's shared L1 memory.
- Core Technology: The system is built around energy-efficient RISC-V cores.
- ISA Extension 1 (Xqueue): A low-overhead RISC-V ISA extension that provides single-instruction access to the shared-memory-mapped queues, streamlining explicit data movement between PEs.
- ISA Extension 2 (Queue-Linked Registers, QLRs): Links general-purpose registers to queues for implicit, autonomous access, relieving the cores of explicit communication instructions and thereby raising PE utilization; a sketch contrasting explicit and implicit queue access follows this list.
- Implementation Vehicle: The architecture is demonstrated in MemPool, an open-source shared-memory cluster featuring 256 PEs.
- Physical Implementation: Implemented in 22 nm FDX technology, running at 600 MHz with no frequency degradation relative to the baseline cluster; 63% of the power is consumed within the PEs.
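To make the communication model concrete, the sketch below models two systolic stages exchanging data through a queue mapped into shared L1 memory, written in plain C. It is an illustration under stated assumptions, not the paper's code: the ring-buffer helpers q_push()/q_pop(), the queue depth, and the MAC kernel are hypothetical, standing in for accesses that the Xqueue extension performs as single, hardware-stalled instructions; the QLR variant is described only in comments, since its queue accesses happen implicitly on register reads and writes and have no C-level call.

```c
#include <stdint.h>

/* Illustrative single-producer/single-consumer queue placed in shared L1
 * memory (depth is an arbitrary choice; memory-ordering details are ignored
 * in this simplified model). With Xqueue, each q_push()/q_pop() becomes a
 * single instruction, and the hardware, not a software spin loop, stalls the
 * PE while the queue is full or empty. */
#define Q_DEPTH 8

typedef struct {
  volatile uint32_t buf[Q_DEPTH];
  volatile uint32_t head; /* advanced by the consumer */
  volatile uint32_t tail; /* advanced by the producer */
} queue_t;

static inline void q_push(queue_t *q, uint32_t v) {
  while (((q->tail + 1) % Q_DEPTH) == q->head)
    ; /* queue full: wait for the downstream PE */
  q->buf[q->tail] = v;
  q->tail = (q->tail + 1) % Q_DEPTH;
}

static inline uint32_t q_pop(queue_t *q) {
  while (q->head == q->tail)
    ; /* queue empty: wait for the upstream PE */
  uint32_t v = q->buf[q->head];
  q->head = (q->head + 1) % Q_DEPTH;
  return v;
}

/* Explicit (Xqueue-style) systolic stage: pop two operand streams from the
 * upstream neighbors, multiply-accumulate, and push the partial result
 * downstream. Even with single-instruction queue accesses, most of the
 * instructions in the loop body are still communication. */
void mac_stage_explicit(queue_t *in_a, queue_t *in_b, queue_t *out, int n) {
  uint32_t acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += q_pop(in_a) * q_pop(in_b); /* two pops + one MAC */
    q_push(out, acc);                 /* one push */
  }
}

/* Implicit (QLR-style) version, conceptually: before the loop, the PE links
 * registers to the same queues (the configuration step is not modeled here),
 * e.g. one register per incoming stream and one for the outgoing stream. The
 * steady-state loop body then contains only the MAC: each read of a linked
 * register implicitly pops, each write implicitly pushes, so no explicit
 * communication instructions remain and the compute unit stays busy. */
```

On the real hardware, back-pressure, stalling, and queue bookkeeping are handled by the Xqueue hardware rather than the spin loops above, and the QLR configuration determines which registers are linked to which queues; the sketch only mirrors the resulting programming pattern.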
Implications
- RISC-V Versatility: This work significantly enhances the capabilities of RISC-V manycore clusters, demonstrating they can efficiently handle highly structured, data-intensive workloads (like AI/DSP) without requiring complex, fixed-function accelerators.
- Stream Processing: Xqueue and QLRs provide crucial low-overhead mechanisms for efficient stream processing and inter-core communication, addressing a traditional bottleneck in large-scale manycore designs.
- Energy Efficiency Leadership: Achieving 208 GOPS/W positions this hybrid RISC-V approach as a highly competitive solution for energy-constrained applications, particularly in the embedded and edge AI domains.
- Ecosystem Development: By building upon the open-source MemPool cluster and proposing specific, yet lightweight, RISC-V ISA extensions, the authors provide a clear pathway for the broader RISC-V ecosystem to adopt efficient, reconfigurable hardware acceleration techniques.