Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization

Abstract

This study analyzes and reduces the significant synchronization and communication overheads incurred when offloading fine-grained tasks in heterogeneous, massively parallel RISC-V MPSoCs, using the open-source Occamy platform as its case study. Through a hardware-software co-design approach, the authors show that integrating multicast capabilities into the Network-on-Chip (NoC) serving the 200+ core accelerator fabric sharply reduces offload latency, improving application runtime by up to 2.3x and recovering over 70% of the ideal speedup. The work also contributes a quantitative model that predicts application runtime, including offload overheads, with an error below 15%.

Report

Key Highlights

  • Focuses on mitigating communication and synchronization overheads during computation offloading in massively parallel heterogeneous MPSoCs.
  • The analysis is performed on Occamy, an open-source RISC-V based MPSoC featuring over 200 accelerator cores.
  • The primary optimization involves co-designing hardware and offload routines, specifically integrating multicast capabilities into the Network-on-Chip (NoC).
  • The optimization yields application runtime improvements of up to 2.3x, restoring more than 70% of the maximum theoretical speedup (a simple bound illustrating why is sketched after this list).
  • The work introduces a quantitative model that estimates application runtime, including offload overheads, with a prediction error consistently below 15%.
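
The 70% figure can be understood with a simple speedup bound. The paper's exact analytical model is not reproduced in this summary; the expression below is only an illustrative sketch, and the qualitative dependence of the offload cost on the cluster count N is an assumption.

```latex
% Illustrative speedup bound (an assumption, not the paper's model):
%   T_1      -- single-cluster runtime of the offloaded kernel
%   T_off(N) -- dispatch + synchronization overhead when using N clusters
\[
  S(N) = \frac{T_1}{T_{\mathrm{off}}(N) + T_1/N},
  \qquad
  T_{\mathrm{off}}^{\mathrm{unicast}}(N) \approx N\,t_{\mathrm{store}} + t_{\mathrm{barrier}},
  \qquad
  T_{\mathrm{off}}^{\mathrm{multicast}}(N) \approx t_{\mathrm{mcast}} + t_{\mathrm{barrier}}.
\]
```

For small, fine-grained kernels T_1 is small, so a dispatch term that grows with the number of clusters quickly dominates and caps the achievable speedup well below N. A multicast dispatch whose cost is roughly independent of the cluster count removes that cap, which is consistent with the reported recovery of more than 70% of the ideal speedup.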

Technical Details

  • Architecture Type: Heterogeneous Multi-Processor System-on-Chip (MPSoC), combining large host cores (optimized for single-thread performance) with many clusters of small, specialized accelerator cores (for data-parallel processing).
  • Platform: Occamy, an open-source, massively parallel RISC-V architecture.
  • Analysis: Detailed, cycle-accurate quantitative analysis used to precisely measure offload overheads, particularly how they scale with the number of accelerator cores.
  • Hardware Modification: Implementation of multicast capabilities within the Network-on-Chip (NoC) supporting the large accelerator fabric (200+ cores); a simplified host-side dispatch sketch follows this list.
  • Objective: To reduce overheads that hamper efficiency for small and fine-grained parallel tasks.
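
To make the role of NoC multicast in the offload path concrete, the C sketch below contrasts a per-cluster (unicast) dispatch loop with a single multicast store. It is not the Occamy runtime API: all addresses, register names, cluster counts, and the multicast addressing scheme are hypothetical placeholders chosen only to show the structural difference between the two dispatch styles.

```c
/* Illustrative host-side offload sketch, NOT the Occamy runtime API.
 * All addresses, names, and the multicast addressing scheme are hypothetical. */

#include <stdint.h>

#define NUM_CLUSTERS      24          /* hypothetical cluster count            */
#define CLUSTER_STRIDE    0x00040000  /* hypothetical per-cluster address gap  */
#define CLUSTER0_MAILBOX  0x10000000  /* hypothetical mailbox base address     */
#define MCAST_MAILBOX     0x1F000000  /* hypothetical multicast alias region   */

typedef struct {
    uint64_t fn_ptr;   /* accelerator kernel entry point                 */
    uint64_t args_ptr; /* pointer to the argument block in shared memory */
} job_t;

static inline void mmio_write64(uintptr_t addr, uint64_t val) {
    *(volatile uint64_t *)addr = val;  /* single uncached store to the NoC */
}

/* Baseline: one store per cluster mailbox; host-side dispatch latency
 * grows linearly with the number of clusters. */
void offload_unicast(const job_t *job) {
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        uintptr_t mbox = CLUSTER0_MAILBOX + (uintptr_t)c * CLUSTER_STRIDE;
        mmio_write64(mbox, (uint64_t)(uintptr_t)job);
    }
}

/* With NoC multicast: a single store to a multicast alias reaches every
 * cluster, so dispatch latency is roughly independent of the cluster count. */
void offload_multicast(const job_t *job) {
    mmio_write64(MCAST_MAILBOX, (uint64_t)(uintptr_t)job);
}
```

The point of the contrast is that the unicast loop's host-side cost scales with the number of clusters, whereas the multicast store keeps dispatch latency roughly constant, which is what makes offloading small, fine-grained tasks economical on a 200+ core fabric.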

Implications

  • Scalability of RISC-V MPSoCs: This research provides an architectural template for future highly parallel RISC-V designs, demonstrating that fundamental bottlenecks (like offload communication) can be resolved efficiently through specialized hardware features such as NoC multicast.
  • Enabling Fine-Grained Parallelism: By significantly reducing communication latency, the methodology makes offloading fine-grained tasks economical, thus broadening the applicability and efficiency of many-core accelerators for a wider range of workloads.
  • Hardware/Software Co-Design Validation: The results underscore the necessity of a holistic hardware-software co-design strategy; simply increasing core count is insufficient without corresponding optimization in communication infrastructure and runtime routines.
  • Performance Predictability: The proposed quantitative model offers a valuable tool for system architects, allowing them to accurately estimate the real-world performance impact of offload overheads before full deployment, facilitating better design decisions in the RISC-V domain.
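
The summary does not give the model's exact form, only that it accounts for offload overheads and stays within 15% prediction error. The C sketch below shows one plausible first-order shape for such a model, purely as an assumption-laden illustration of how an architect might explore cluster counts before deployment; all parameter values are arbitrary examples.

```c
/* First-order runtime estimate: NOT the paper's model, just a plausible shape
 * consistent with the summary (compute time shrinks with cluster count, while
 * offload and synchronization overheads do not). All numbers are made up. */

#include <stdio.h>

typedef struct {
    double t_compute_1;  /* single-cluster compute time of the kernel [cycles]  */
    double t_dispatch;   /* per-offload dispatch cost on the host [cycles]      */
    double t_sync;       /* completion-synchronization cost per offload [cycles]*/
    int    n_offloads;   /* number of offloaded task instances                  */
} kernel_params_t;

/* Predicted runtime when the kernel is spread over n_clusters clusters.
 * t_dispatch may itself grow with n_clusters under unicast dispatch; it is
 * held constant here, i.e. a multicast-style dispatch is assumed. */
double predict_runtime(const kernel_params_t *k, int n_clusters) {
    double per_offload = k->t_compute_1 / n_clusters + k->t_dispatch + k->t_sync;
    return k->n_offloads * per_offload;
}

int main(void) {
    kernel_params_t k = { .t_compute_1 = 200000.0, .t_dispatch = 3000.0,
                          .t_sync = 2000.0, .n_offloads = 64 };
    for (int n = 1; n <= 32; n *= 2)
        printf("clusters=%2d  predicted cycles=%.0f\n", n, predict_runtime(&k, n));
    return 0;
}
```

Feeding measured per-offload dispatch and synchronization costs into a model of this kind makes the diminishing returns of adding clusters to fine-grained kernels visible early in the design process.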
