Using System Hyper Pipelining (SHP) to Improve the Performance of a Coarse-Grained Reconfigurable Architecture (CGRA) Mapped on an FPGA
Abstract
This paper applies System Hyper Pipelining (SHP), an extension of C-Slow Retiming (CSR), to the Programming Elements (PEs) of a Coarse-Grained Reconfigurable Architecture (CGRA). SHP lets the number of execution threads vary at run time and allows threads to be stalled, bypassed, and reordered, which increases performance per PE and makes it possible to implement Fork-Join operations on a PE. Threads on the same PE can share data locally, which reduces traffic on the CGRA's main routing structure. The architecture uses SHP-ed RISC-V cores as PEs and is implemented on an FPGA.
Report
Key Highlights
- Core Innovation: Application of System Hyper Pipelining (SHP) to the Programming Elements (PEs) within a Coarse-Grained Reconfigurable Architecture (CGRA).
- Performance Gain: SHP increases the performance achieved per PE.
- Flexibility: SHP extends C-Slow Retiming (CSR) by letting the number of execution threads vary at run time and by allowing threads to be stalled, bypassed, or reordered.
- Traffic Reduction: Local data sharing among multiple threads within the SHP-ed PE greatly reduces the overall data traffic load on the CGRA's global routing infrastructure.
- Implementation Base: The PEs used in the CGRA implementation are SHP-ed RISC-V cores mapped onto an FPGA.
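The difference between a fixed thread rotation and SHP-style dynamic thread handling can be illustrated with a toy issue-slot model. This is a minimal sketch under stated assumptions, not the paper's microarchitecture: the function `cycles_to_finish`, the one-issue-slot-per-cycle model, and the `ready_at` stall description are all hypothetical and chosen only to show why bypassing a stalled thread saves cycles.

```python
def cycles_to_finish(work, ready_at, dynamic):
    """Count cycles until every thread has issued all of its instructions.

    work[t]     -- number of instructions thread t must issue (assumed model)
    ready_at[t] -- first cycle at which thread t may issue (stalled before it)
    dynamic     -- False: slot c always belongs to thread c % n (fixed rotation)
                   True:  a stalled or finished thread is skipped and the next
                          ready thread takes the slot (SHP-like behaviour)
    """
    remaining = list(work)
    n = len(remaining)
    cycle = 0
    last = -1  # thread that issued most recently
    while any(remaining):
        if dynamic:
            # Rotate through all threads, starting after the last issuer.
            order = [(last + 1 + k) % n for k in range(n)]
        else:
            # Only the slot owner may issue; otherwise the slot is wasted.
            order = [cycle % n]
        for tid in order:
            if remaining[tid] > 0 and cycle >= ready_at[tid]:
                remaining[tid] -= 1
                last = tid
                break
        cycle += 1
    return cycle


work = [4, 4, 4, 4]      # four threads, four instructions each
ready_at = [8, 0, 0, 0]  # thread 0 is stalled for the first eight cycles
print(cycles_to_finish(work, ready_at, dynamic=False))  # 21
print(cycles_to_finish(work, ready_at, dynamic=True))   # 16
```

In this toy case the fixed rotation wastes every slot owned by the stalled thread, while the dynamic variant fills all 16 slots with useful work. The numbers say nothing about the gains reported in the paper.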
Technical Details
| Feature | Description / Method |
|---|---|
| Base Architecture | Coarse-Grained Reconfigurable Architecture (CGRA) |
| Target Hardware | Field-Programmable Gate Array (FPGA) |
| Pipelining Method | System Hyper Pipelining (SHP), derived from C-Slow Retiming (CSR) |
| PE Composition | SHP-ed RISC-V Cores |
| Dynamic Threading | SHP supports variable thread counts and allows threads to be dynamically manipulated (stalled, bypassed, reordered). |
| Functionality Enabled | Implementation of Fork-Join operations directly on the PE using SHP's thread flexibility. |
| Optimization | Exploiting local data sharing among threads to minimize reliance on the CGRA's interconnect, thus reducing routing congestion and latency. |
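The Fork-Join and local data sharing rows can be sketched together in software. The example below is a hedged illustration only: `fork_join_traffic` is a hypothetical helper, and its traffic model (one routed word per data element sent to a remote worker, plus one word per returned partial result) is an assumption made for the sake of the example, not a figure from the paper.

```python
def fork_join_traffic(data, n_workers, threads_share_pe):
    """Sum `data` with a fork-join pattern and count routed words.

    threads_share_pe -- True:  all workers are threads on the PE that already
                               holds `data`, so nothing crosses the interconnect
                        False: worker 0 stays on the home PE and every other
                               worker runs on a separate PE (assumed model)
    Returns (total, routed_words).
    """
    chunk = -(-len(data) // n_workers)  # ceiling division
    slices = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]

    partials = [sum(s) for s in slices]  # fork: each worker reduces its slice
    total = sum(partials)                # join: combine the partial sums

    if threads_share_pe:
        routed_words = 0
    else:
        # Remote slices travel out; one partial sum per remote worker returns.
        routed_words = sum(len(s) for s in slices[1:]) + (n_workers - 1)
    return total, routed_words


data = list(range(16))
print(fork_join_traffic(data, 4, threads_share_pe=True))   # (120, 0)
print(fork_join_traffic(data, 4, threads_share_pe=False))  # (120, 15)
```

Both variants compute the same result. Only the assumed traffic count differs, which is the effect the local data sharing optimization targets.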
Implications
- RISC-V Ecosystem: This work shows that RISC-V cores can be customized to serve as PEs within a CGRA, extending the instruction set architecture's use beyond conventional standalone CPUs and into acceleration.
- Reconfigurable Computing: SHP changes how multithreading is handled in CGRA environments, moving from rigid barrel processing, where threads rotate in a fixed order, toward dynamically scheduled PEs. This matters for handling irregular data dependencies efficiently.
- Efficiency and Scalability: By utilizing SHP to keep data locally shared and minimize movement across the global routing network, the design addresses a primary bottleneck in large-scale CGRAs (interconnect overhead). This implies improved power efficiency and better scalability for future domain-specific accelerators.
- Compiler/Runtime Potential: The flexibility introduced by SHP (dynamic stalling and reordering) suggests that runtime schedulers could be developed to exploit it and raise PE utilization across applications.