AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads
Abstract
AXI-Pack is an extension to the Arm AXI4 protocol that addresses the bandwidth inefficiency of irregular memory streams (strided and indirect accesses) in on-chip interconnects. It introduces burst semantics that pack multiple narrow data elements onto a wide bus, achieving end-to-end bandwidth efficiency while retaining AXI4 compatibility. Implemented on a RISC-V vector processor with a 256-bit bus, AXI-Pack achieved speedups of up to 5.4x and energy efficiency gains of up to 5.3x on strided benchmarks.
Report
Key Highlights
- Target Problem: Inefficiency of handling irregular memory streams (strided and indirect accesses) using modern interconnects optimized for contiguous data.
- Solution: AXI-Pack, an extension to the widely used ARM AXI4 protocol.
- Mechanism: Adds irregular stream semantics to memory requests, enabling the packing of multiple narrow data elements onto a wide bus (bus packing).
- Compatibility: Retains full compatibility with standard AXI4 and does not require modifications to non-burst-reshaping interconnect IPs.
- Performance Results: Achieved near-ideal peak on-chip bus utilization (87% for strided, 39% for indirect) on a 256-bit interconnect.
- End-to-End Gains: Demonstrated speedups of 5.4x (strided) and 2.4x (indirect), with corresponding energy efficiency improvements of 5.3x and 2.1x over the AXI4 baseline.
- Implementation Base: Validated using an extension of an open-source RISC-V vector processor and a custom banked memory controller.
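To make the packing benefit concrete, the following is a minimal sketch of the beat-count arithmetic, not the authors' evaluation model. It assumes a simplified baseline in which every non-contiguous element occupies its own bus beat, and it ignores request and index traffic. All function names are illustrative.

```python
import math

def beats_unpacked(num_elems: int) -> int:
    """Assumed baseline: each non-contiguous element travels in its own beat."""
    return num_elems

def beats_packed(num_elems: int, elem_bits: int, bus_bits: int) -> int:
    """Packed transfer: elements densely fill every beat of the wide bus."""
    elems_per_beat = bus_bits // elem_bits
    return math.ceil(num_elems / elems_per_beat)

def utilization(num_elems: int, elem_bits: int, bus_bits: int, beats: int) -> float:
    """Fraction of transferred bus bits that carry useful data."""
    return (num_elems * elem_bits) / (beats * bus_bits)

# FP32 elements on a 256-bit bus, as in the evaluation setup.
NUM_ELEMS, ELEM_BITS, BUS_BITS = 64, 32, 256

base_beats = beats_unpacked(NUM_ELEMS)
pack_beats = beats_packed(NUM_ELEMS, ELEM_BITS, BUS_BITS)
base_util = utilization(NUM_ELEMS, ELEM_BITS, BUS_BITS, base_beats)
pack_util = utilization(NUM_ELEMS, ELEM_BITS, BUS_BITS, pack_beats)

print(f"baseline: {base_beats} beats, utilization {base_util:.3f}")
print(f"packed:   {pack_beats} beats, utilization {pack_util:.3f}")
```

Under these assumptions, FP32 elements use 12.5% of each baseline beat on a 256-bit bus, so packing has an idealized ceiling of 8x fewer data beats. The reported 5.4x strided speedup is an end-to-end figure and sits below that ceiling.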
Technical Details
- Protocol: AXI-Pack extends AXI4 with the request semantics needed to express strided and indirect bursts.
- Architecture: The end-to-end demonstration involved modifying an open-source RISC-V vector processor to generate AXI-Pack requests from its memory interface.
- Memory Controller Design: A specialized banked memory controller was required on the memory side to efficiently parse and handle the new AXI-Pack requests, maximizing parallelism across banks.
- Evaluation Setup: Performance testing was conducted on FP32 workloads utilizing a system with a 256-bit wide on-chip interconnect.
- Key Optimization: Packing exploits the 256-bit bus width by aggregating small, non-contiguous elements into full-width beats, rather than sending them as many narrow AXI4 transfers that each use only a fraction of the bus.
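The memory-side behavior described above can be sketched as a small software model. This is an illustration only: the summary does not give the actual AXI-Pack request encoding, so the `StridedBurst` descriptor, its fields, the helper names, and the word-interleaved bank mapping are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence

BUS_BYTES = 32  # 256-bit data bus

@dataclass
class StridedBurst:
    """Hypothetical request descriptor; not the real AXI-Pack signal layout."""
    base: int        # byte address of the first element
    stride: int      # distance between elements, in bytes
    elem_bytes: int  # element size, e.g. 4 for FP32
    num_elems: int

def strided_addresses(req: StridedBurst) -> List[int]:
    """Expand a strided burst into per-element byte addresses."""
    return [req.base + i * req.stride for i in range(req.num_elems)]

def indirect_addresses(base: int, indices: Sequence[int], elem_bytes: int) -> List[int]:
    """Expand an indirect (indexed) burst: address = base + index * element size."""
    return [base + idx * elem_bytes for idx in indices]

def pack_beats(memory: bytes, addrs: Sequence[int], elem_bytes: int) -> List[bytes]:
    """Gather elements and pack them densely into bus-wide beats."""
    per_beat = BUS_BYTES // elem_bytes
    beats = []
    for i in range(0, len(addrs), per_beat):
        chunk = b"".join(memory[a:a + elem_bytes] for a in addrs[i:i + per_beat])
        beats.append(chunk.ljust(BUS_BYTES, b"\x00"))  # zero-pad a partial last beat
    return beats

def bank_of(addr: int, num_banks: int, word_bytes: int = 4) -> int:
    """Assumed word-interleaved mapping of byte addresses to memory banks."""
    return (addr // word_bytes) % num_banks

# Example: eight FP32 elements, 16 bytes apart, from a 256-byte memory image.
memory = bytes(range(256))
req = StridedBurst(base=0, stride=16, elem_bytes=4, num_elems=8)
addrs = strided_addresses(req)
beats = pack_beats(memory, addrs, req.elem_bytes)
banks_hit = {bank_of(a, num_banks=8) for a in addrs}

print(f"{len(addrs)} elements packed into {len(beats)} beat(s)")
print(f"distinct banks touched: {len(banks_hit)}")
```

In this model the eight strided elements fit into a single 256-bit beat, yet with eight word-interleaved banks the 16-byte stride touches only two distinct banks. That kind of conflict is one reason a bank-aware controller matters for sustaining packed throughput.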
Implications
- RISC-V Ecosystem Enhancement: By providing an efficient mechanism for irregular accesses, AXI-Pack directly benefits RISC-V vector extensions and specialized accelerators. These often handle sparse matrices, graph processing, or database operations, all of which are characterized by strided or indexed memory patterns.
- Energy Efficiency in HPC/AI: The 5.3x energy efficiency improvement is crucial for data-intensive applications, allowing high-performance systems to handle irregular streams without excessive power consumption associated with wasted bus cycles.
- Interconnect Standardization: AXI-Pack demonstrates a pathway to incrementally improve existing, widely adopted bus standards (like AXI) to meet modern bandwidth demands, avoiding the need for entirely new interconnect IP for irregular data movement.
- Near-Memory Computing: This approach facilitates the efficient deployment of near-memory architectures, where maximizing the utilization of the wide, high-bandwidth path between the compute tile and the memory controller is paramount.