Using System Hyper Pipelining (SHP) to Improve the Performance of a Coarse-Grained Reconfigurable Architecture (CGRA) Mapped on an FPGA

Admin

Abstract

This paper applies System Hyper Pipelining (SHP), an extension of C-Slow Retiming (CSR), to the Programming Elements (PEs) of a Coarse-Grained Reconfigurable Architecture (CGRA). Where CSR yields a fixed set of independent threads, SHP lets the number of execution threads vary at run time and lets individual threads be stalled, bypassed, and reordered. This raises the performance achieved per PE and allows Fork-Join operations to be implemented on a single PE. The PEs are SHP-ed RISC-V cores implemented on an FPGA. Because multiple threads on a PE can share data locally, the traffic load on the CGRA's main routing structure is reduced.

Report

Key Highlights

Core Innovation: Application of System Hyper Pipelining (SHP) to the Programming Elements (PEs) within a Coarse-Grained Reconfigurable Architecture (CGRA).
Performance Gain: Applying SHP raises the performance achieved per PE.
Flexibility: SHP extends C-Slow Retiming (CSR) by allowing a dynamic number of execution threads, which can be dynamically stalled, bypassed, or reordered.
Traffic Reduction: Local data sharing among multiple threads within the SHP-ed PE greatly reduces the overall data traffic load on the CGRA's global routing infrastructure.
Implementation Base: The PEs used in the CGRA implementation are SHP-ed RISC-V cores mapped onto an FPGA.
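
The difference between a fixed CSR-style barrel schedule and a dynamic one can be illustrated with a toy issue-slot model. This is a minimal sketch under stated assumptions, not the paper's microarchitecture: the `Thread` fields, the two scheduler functions, the pipeline depth of 4, and the stall window are all hypothetical and chosen only to make the effect visible.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int
    stalled_until: int = 0  # first cycle at which the thread may issue (assumed stall model)
    next_free: int = 0      # cycle at which its previous instruction has left the pipeline
    issued: int = 0


def make_threads(count, stalled_tid=None, stalled_until=0):
    """Build `count` threads; optionally stall one of them for a while."""
    return [Thread(t, stalled_until if t == stalled_tid else 0) for t in range(count)]


def barrel_issue(threads, depth, cycles):
    """Fixed rotation: issue slot `cycle % depth` always belongs to the same thread."""
    if len(threads) != depth:
        raise ValueError("a barrel pipeline needs exactly `depth` threads")
    used = 0
    for cycle in range(cycles):
        thread = threads[cycle % depth]
        if cycle >= thread.stalled_until:
            thread.issued += 1
            used += 1
        # otherwise the slot is a bubble: no other thread may take it
    return used


def dynamic_issue(threads, depth, cycles):
    """Dynamic pick: each cycle, issue the first thread that is neither
    stalled nor still in flight, so stalled threads are bypassed."""
    used = 0
    for cycle in range(cycles):
        for thread in threads:
            if cycle >= thread.stalled_until and cycle >= thread.next_free:
                thread.issued += 1
                thread.next_free = cycle + depth  # its state is in the pipeline for `depth` cycles
                used += 1
                break
    return used


DEPTH, CYCLES = 4, 40
# Thread 0 is stalled for the first 20 cycles in both runs.
barrel_used = barrel_issue(make_threads(DEPTH, stalled_tid=0, stalled_until=20), DEPTH, CYCLES)
# The dynamic run holds one spare thread that can fill the stalled thread's slots.
dynamic_used = dynamic_issue(make_threads(DEPTH + 1, stalled_tid=0, stalled_until=20), DEPTH, CYCLES)
print(barrel_used, dynamic_used)
```

In this toy run the barrel schedule fills 35 of 40 issue slots, while the dynamic schedule with one spare thread fills all 40. The first-ready pick is deliberately naive and does not model fairness or reordering priorities.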

Technical Details

Feature	Description / Method
Base Architecture	Coarse-Grained Reconfigurable Architecture (CGRA)
Target Hardware	Field-Programmable Gate Array (FPGA)
Pipelining Method	System Hyper Pipelining (SHP), derived from C-Slow Retiming (CSR)
PE Composition	SHP-ed RISC-V Cores
Dynamic Threading	SHP supports variable thread counts and allows threads to be dynamically manipulated (stalled, bypassed, reordered).
Functionality Enabled	Implementation of Fork-Join operations directly on the PE using SHP's thread flexibility.
Optimization	Local data sharing among the threads on a PE reduces reliance on the CGRA's interconnect and therefore the traffic load on its routing structure.
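
The Fork-Join and local data sharing rows can be illustrated with a small counting model. This is a sketch under assumptions, not the paper's design: the `PE` class, its `routed_words` counter, and the matrix-vector workload are hypothetical. Threads are run one after another, and only input words delivered to a PE are counted; results sent back are ignored.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


class PE:
    """Hypothetical PE: a local store plus a counter of words received
    over the (modelled) CGRA routing structure."""

    def __init__(self):
        self.local = {}
        self.routed_words = 0

    def receive(self, name, values):
        # Every word delivered from outside the PE crosses the routing structure.
        self.local[name] = list(values)
        self.routed_words += len(self.local[name])

    def fork_join_matvec(self, n_rows, n_threads):
        x = self.local["x"]  # one local copy, read by every thread
        y = [0] * n_rows
        for tid in range(n_threads):  # fork: thread `tid` takes rows tid, tid + n_threads, ...
            for r in range(tid, n_rows, n_threads):
                y[r] = dot(self.local[f"row{r}"], x)
        return y  # join: all threads have finished their rows


def one_shared_pe(A, x, n_threads):
    """All rows handled by threads of a single PE that share one copy of x."""
    pe = PE()
    pe.receive("x", x)
    for r, row in enumerate(A):
        pe.receive(f"row{r}", row)
    return pe.fork_join_matvec(len(A), n_threads), pe.routed_words


def one_pe_per_row(A, x):
    """Each row handled by its own single-threaded PE, so x is routed once per PE."""
    y, routed = [], 0
    for row in A:
        pe = PE()
        pe.receive("x", x)
        pe.receive("row0", row)
        y.extend(pe.fork_join_matvec(1, 1))
        routed += pe.routed_words
    return y, routed


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
x = [1, 1, 1]
shared_y, shared_words = one_shared_pe(A, x, n_threads=4)
spread_y, spread_words = one_pe_per_row(A, x)
print(shared_y == spread_y, shared_words, spread_words)
```

Both arrangements compute the same result, but in this model the shared PE receives 15 input words while the one-PE-per-row arrangement receives 24, because the vector `x` is delivered once instead of four times.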

Implications

RISC-V Ecosystem: The work uses RISC-V cores as customizable PEs inside a CGRA, an example of the instruction set serving as a building block for accelerators rather than only as a standalone CPU.
Reconfigurable Computing: SHP changes how multithreading is handled in a CGRA, moving from rigid barrel processing, where a fixed number of threads rotate through the pipeline, toward PEs whose threads are scheduled dynamically. This helps with irregular data dependencies, because a stalled thread can be bypassed.
Efficiency and Scalability: Keeping shared data local to a PE and minimizing movement across the global routing network targets interconnect overhead, a primary bottleneck in large-scale CGRAs. This suggests potential gains in power efficiency and scalability for future domain-specific accelerators.
Compiler/Runtime Potential: The ability to stall and reorder threads dynamically gives a compiler or runtime scheduler more freedom, so scheduling policies that maximize PE utilization for a given application could be built on top of SHP.
