Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra
Abstract
This work introduces the Indirection Stream Semantic Register Architecture, an enhancement to an existing memory-streaming RISC-V ISA extension, designed to efficiently handle sparse-dense linear algebra, which traditionally suffers from indirect-memory-lookup bottlenecks. By accelerating the streaming indirection required by sparse formats such as CSR, the architecture achieves single-core speedups of up to 7.2x over optimized baselines and FPU utilization of up to 80%. The resulting system also demonstrates 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than an NVIDIA GTX 1080 Ti running the cuSPARSE library.
Report
Key Highlights
- Core Innovation: Introduction of an Indirection Stream Semantic Register Architecture integrated within a RISC-V ISA extension to handle memory indirection efficiently.
- Performance Gain: Achieved speedups of up to 7.2x on single-core FPU kernels (dot, matrix-vector, matrix-matrix products) over optimized baselines without the extension.
- Efficiency: The architecture enables single-core FPU utilization up to 80%.
- Multi-core Advantage: A multi-core implementation yielded up to 5.8x speedup and 2.7x improved energy efficiency.
- GPU Comparison: The proposed approach achieved 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a state-of-the-art cuSPARSE kernel running on an NVIDIA GTX 1080 Ti.
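To make the bottleneck in these highlights concrete, here is a minimal sketch (not the paper's implementation) of a CSR matrix-vector product: the inner loop's `x[col_idx[j]]` is the indirect, gathered load that the Indirection Stream Semantic Registers turn into a hardware-managed stream instead of explicit per-element index arithmetic.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros; col_idx holds
    their column positions; vals holds their values. The access
    x[col_idx[j]] is the indirect memory lookup that dominates runtime
    on conventional cores and that the proposed architecture streams.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[j] * x[col_idx[j]]  # indirect (gathered) load
        y[i] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0]
print(csr_spmv(row_ptr, col_idx, vals, x))  # → [3.0, 3.0]
```

On a plain in-order core, every iteration of the inner loop spends instructions on loading `col_idx[j]` and computing the address of `x[col_idx[j]]` before the FPU can do useful work; streaming that indirection in hardware is what lets the FPU approach the 80% utilization quoted above.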
Technical Details
- Architectural Component: Indirection Stream Semantic Register Architecture.
- Base ISA: The innovation enhances an existing memory-streaming RISC-V Instruction Set Architecture (ISA) extension.
- Target Operations: Acceleration is focused on linear algebra products involving sparse formats (e.g., CSR and CSF), specifically targeting the overhead of indirect memory lookups.
- Supported Operations: Efficient implementations are shown for dot, matrix-vector, and matrix-matrix product kernels.
- Proposed Future Uses: The indirection hardware is suitable for general scatter-gather operations and codebook decoding.
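The scatter-gather and codebook-decoding uses named above share the same access pattern as the CSR kernels: an index stream selecting elements from a value array. A minimal sketch (illustrative only; function names are hypothetical, not from the paper):

```python
def gather(table, indices):
    """Codebook decoding: each code indexes into a small value table.

    table[i] is exactly the kind of indirect load the indirection
    stream hardware is designed to feed to the FPU at full rate.
    """
    return [table[i] for i in indices]

def scatter(dest, indices, values):
    """Write values into dest at the positions given by indices."""
    for i, v in zip(indices, values):
        dest[i] = v
    return dest

codebook = [0.0, 0.5, 1.0, 1.5]
codes = [2, 0, 3, 1]
print(gather(codebook, codes))        # → [1.0, 0.0, 1.5, 0.5]
print(scatter([0] * 4, [3, 0], [9, 7]))  # → [7, 0, 0, 9]
```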
Implications
- RISC-V Specialization: This work demonstrates how the extensible nature of the RISC-V ISA can be leveraged to solve domain-specific compute challenges, specifically the notorious memory-indirection bottleneck in sparse computing.
- HPC and ML Relevance: Sparse-dense linear algebra is foundational for many machine learning models and high-performance computing tasks; solving the efficiency problem provides a significant boost to RISC-V's viability in these areas.
- Competitive Hardware: Achieving superior FP64 utilization compared to specialized, highly-optimized commercial GPU libraries (like cuSPARSE) positions this RISC-V extension as a powerful, energy-efficient alternative for sparse workloads, challenging incumbent architectures in accelerator design.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.