AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads

Abstract

AXI-Pack is an extension to the ARM AXI4 protocol that addresses the bandwidth inefficiency of irregular memory streams (strided and indirect accesses) in on-chip interconnects. It adds burst semantics that pack multiple narrow data elements onto a wide bus, achieving end-to-end bandwidth efficiency while retaining AXI4 compatibility. Implemented on a RISC-V vector processor with a 256-bit bus, AXI-Pack achieved speedups of up to 5.4x and energy efficiency gains of up to 5.3x over an AXI4 baseline on strided benchmarks.

Report

Key Highlights

  • Target Problem: Inefficiency of handling irregular memory streams (strided and indirect accesses) using modern interconnects optimized for contiguous data.
  • Solution: AXI-Pack, an extension to the widely used ARM AXI4 protocol.
  • Mechanism: Adds irregular stream semantics to memory requests, enabling the packing of multiple narrow data elements onto a wide bus (bus packing).
  • Compatibility: Retains full compatibility with standard AXI4; interconnect IPs that do not reshape bursts need no modification.
  • Performance Results: Achieved near-ideal peak on-chip bus utilization (87% for strided, 39% for indirect) on a 256-bit interconnect.
  • End-to-End Gains: Demonstrated speedups of 5.4x (strided) and 2.4x (indirect), with corresponding energy efficiency improvements of 5.3x and 2.1x over the AXI4 baseline.
  • Implementation Base: Validated using an extension of an open-source RISC-V vector processor and a custom banked memory controller.
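
The bus-packing idea in the Mechanism bullet can be illustrated with a small, idealized model. This is only a sketch: the helper names (`strided_addresses`, `indirect_addresses`, `pack_into_beats`, `utilization`) are hypothetical, the baseline is assumed to move exactly one narrow element per beat, and none of this reflects AXI-Pack's actual signal encoding, which this summary does not describe.

```python
# Idealized model of bus packing; all names here are illustrative assumptions.
BUS_BYTES = 32   # 256-bit bus, as in the evaluated system
ELEM_BYTES = 4   # one FP32 element


def strided_addresses(base, stride_bytes, count):
    """Byte addresses touched by a strided stream."""
    return [base + i * stride_bytes for i in range(count)]


def indirect_addresses(base, indices, elem_bytes=ELEM_BYTES):
    """Byte addresses touched by an indexed (gather) stream."""
    return [base + idx * elem_bytes for idx in indices]


def pack_into_beats(addresses, elem_bytes=ELEM_BYTES, bus_bytes=BUS_BYTES):
    """Group element addresses into densely packed bus-wide beats."""
    per_beat = bus_bytes // elem_bytes
    return [addresses[i:i + per_beat] for i in range(0, len(addresses), per_beat)]


def utilization(num_elems, num_beats, elem_bytes=ELEM_BYTES, bus_bytes=BUS_BYTES):
    """Fraction of transferred bus bytes that carry useful payload."""
    return (num_elems * elem_bytes) / (num_beats * bus_bytes)


addrs = strided_addresses(base=0x1000, stride_bytes=64, count=16)
packed_beats = pack_into_beats(addrs)

# Assumed baseline: one narrow element per beat.
print(utilization(len(addrs), len(addrs)))         # 0.125
# Packed: eight FP32 elements share each 256-bit beat.
print(utilization(len(addrs), len(packed_beats)))  # 1.0
```

In this model, sixteen strided FP32 elements need sixteen beats unpacked but only two beats packed. The measured peak utilizations reported above (87% and 39%) are lower than the ideal 100% because a real system has overheads that the model ignores.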

Technical Details

  • Protocol: AXI-Pack extends the AXI4 protocol with the additions needed to express strided and indirect burst requests.
  • Architecture: The end-to-end demonstration involved modifying an open-source RISC-V vector processor to generate AXI-Pack requests from its memory interface.
  • Memory Controller Design: A specialized banked memory controller was required on the memory side to efficiently parse and handle the new AXI-Pack requests, maximizing parallelism across banks.
  • Evaluation Setup: Performance testing was conducted on FP32 workloads utilizing a system with a 256-bit wide on-chip interconnect.
  • Key Optimization: The innovation lies in using the full 256-bit bus width to aggregate small, non-contiguous data transfers that would otherwise require many inefficient narrow AXI4 transactions.
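
To see why the memory side needs a banked controller, consider how the elements of one packed beat map to banks. The sketch below assumes a simple word-interleaved mapping and one word per bank per cycle; the bank counts, the mapping, and the function names `bank_of` and `cycles_for_beat` are illustrative assumptions, not the controller design used in the work.

```python
from collections import Counter


def bank_of(addr, num_banks, word_bytes=4):
    """Assumed word-interleaved mapping: consecutive words hit consecutive banks."""
    return (addr // word_bytes) % num_banks


def cycles_for_beat(addresses, num_banks, word_bytes=4):
    """Cycles to gather one packed beat if each bank serves one word per cycle."""
    per_bank = Counter(bank_of(a, num_banks, word_bytes) for a in addresses)
    return max(per_bank.values())


# Eight FP32 elements with a stride of eight words (32 bytes).
beat = [i * 32 for i in range(8)]

print(cycles_for_beat(beat, num_banks=8))   # 8: every element lands in bank 0
print(cycles_for_beat(beat, num_banks=7))   # 2: only two elements share a bank
print(cycles_for_beat(beat, num_banks=17))  # 1: all elements hit distinct banks
```

Under these assumptions, a stride that is a multiple of the bank count serializes every access, while a bank count that shares no factor with the stride spreads the elements across banks. This is one way to picture what "maximizing parallelism across banks" requires of the controller.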

Implications

  • RISC-V Ecosystem Enhancement: By providing an efficient mechanism for irregular accesses, AXI-Pack directly benefits RISC-V vector extensions and specialized accelerators that handle sparse matrices, graph processing, or database operations, all of which are workloads characterized by strided or indexed memory patterns.
  • Energy Efficiency in HPC/AI: The energy efficiency improvement of up to 5.3x (strided) and 2.1x (indirect) matters for data-intensive applications, because it lets high-performance systems handle irregular streams without spending power on wasted bus cycles.
  • Interconnect Standardization: AXI-Pack demonstrates a pathway to incrementally improve existing, widely adopted bus standards (like AXI) to meet modern bandwidth demands, avoiding the need for entirely new interconnect IP for irregular data movement.
  • Near-Memory Computing: This approach facilitates the efficient deployment of near-memory architectures, where maximizing the utilization of the wide, high-bandwidth path between the compute tile and the memory controller is paramount.