Evaluating IOMMU-Based Shared Virtual Addressing for RISC-V Embedded Heterogeneous SoCs
Abstract
This work quantitatively evaluates Input-Output Memory Management Unit (IOMMU)-based Shared Virtual Addressing (SVA) for RISC-V embedded heterogeneous SoCs, enabling zero-copy data offloading between host and accelerator. While IO address translation introduces significant overhead in high-latency memory systems (up to 17.6% of accelerator runtime), the authors demonstrate that integrating a Last-Level Cache drastically reduces this cost to below 1%. This finding validates SVA as a suitable and efficient mechanism for high-performance zero-copy operation in modern RISC-V heterogeneous architectures.
Report
Key Highlights
- Innovation Focus: The paper evaluates the integration of IOMMU technology to enable Shared Virtual Addressing (SVA) in open-source RISC-V embedded heterogeneous systems, thereby facilitating zero-copy data transfer.
- Problem Solved: SVA overcomes the traditional performance bottleneck and poor memory utilization caused by mandatory data copying between the host's virtual address space and the accelerator's physical address space.
- Performance Bottleneck Identified: On systems with high-latency memory, IO virtual address translation accounts for a significant performance hit, ranging from 4.2% up to 17.6% of the accelerator's runtime for compute kernels such as gemm.
- Effective Mitigation: The research demonstrates that adding a Last-Level Cache (LLC) virtually eliminates this overhead, reducing the address translation cost to between 0.4% and 0.7% under the same high-latency conditions.
Technical Details
- Architecture Tested: An open-source heterogeneous RISC-V SoC design.
- Component Configuration: The SoC consists of a 64-bit host processor coupled with a 32-bit accelerator cluster.
- Core Mechanism: Input-Output Memory Management Units (IOMMUs) are used to map IO virtual addresses (IOVAs) used by accelerators to physical memory.
- Evaluation Method: The system design was emulated on an FPGA platform to measure real-world performance characteristics.
- Software Stack: Compute kernels were derived from the RAJAPerf benchmark suite and implemented using the heterogeneous OpenMP programming model.
- Latency Challenge: Resolving an IO virtual address can require up to three sequential memory accesses on an IOTLB (Input-Output Translation Lookaside Buffer) miss, making the overhead highly sensitive to DRAM access latency.
Implications
- Enabling Zero-Copy RISC-V: This work establishes a quantitative foundation for implementing efficient zero-copy offloading in RISC-V embedded SoCs, which is essential for maximizing accelerator throughput and reducing host CPU load.
- System Design Guidance: It provides critical architectural guidance, proving that IOMMU integration is only fully efficient when memory access latency is aggressively mitigated, strongly favoring the inclusion of Last-Level Caches in these heterogeneous designs.
- Competitive Advantage: The successful evaluation of SVA integration enhances the competitive viability of RISC-V in high-performance embedded markets (e.g., automotive, edge computing) where energy efficiency and high throughput are paramount.