Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Abstract

This work introduces the Indirection Stream Semantic Register Architecture, an enhancement to a memory-streaming RISC-V ISA extension, designed to efficiently handle sparse-dense linear algebra, a workload class that traditionally suffers from indirect memory lookup bottlenecks. By accelerating the streaming indirection required by sparse formats such as CSR, the architecture achieves single-core speedups of up to 7.2x over optimized baselines and FPU utilizations of up to 80%. The resulting system also demonstrates 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti running the cuSPARSE library.
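To make the bottleneck concrete, the sketch below shows a plain C CSR matrix-vector product; the names (csr_spmv, row_ptr, col_idx, vals) are illustrative rather than taken from the paper. The indirect load x[col_idx[j]] in the inner loop is the streaming indirection the architecture is designed to serve directly as a register operand, rather than through explicit address computation and loads.

```c
#include <stddef.h>

/* Minimal CSR sparse-matrix times dense-vector product (y = A * x).
 * Names (row_ptr, col_idx, vals) are illustrative, not from the paper.
 * The indirect load x[col_idx[j]] is the lookup that dominates the
 * inner loop on a conventional core and that the indirection streams
 * are meant to hide. */
void csr_spmv(size_t n_rows,
              const size_t *row_ptr,   /* n_rows + 1 entries    */
              const size_t *col_idx,   /* one entry per nonzero */
              const double *vals,      /* one entry per nonzero */
              const double *x,         /* dense input vector    */
              double *y)               /* dense output vector   */
{
    for (size_t i = 0; i < n_rows; ++i) {
        double acc = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            /* Indirect lookup: the index stream col_idx drives a
             * gather from the dense vector x. */
            acc += vals[j] * x[col_idx[j]];
        }
        y[i] = acc;
    }
}
```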

Report

Key Highlights

  • Core Innovation: Introduction of an Indirection Stream Semantic Register Architecture integrated within a RISC-V ISA extension to handle memory indirection efficiently.
  • Performance Gain: Speedups of up to 7.2x on single-core dot, matrix-vector, and matrix-matrix product kernels over optimized baselines without the extension.
  • Efficiency: The architecture enables single-core FPU utilizations of up to 80%.
  • Multi-core Advantage: A multi-core implementation yields up to a 5.8x speedup and 2.7x higher energy efficiency.
  • GPU Comparison: The proposed approach measures 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a state-of-the-art cuSPARSE kernel running on an NVIDIA GTX 1080 Ti.

Technical Details

  • Architectural Component: Indirection Stream Semantic Register Architecture.
  • Base ISA: The innovation enhances an existing memory-streaming RISC-V Instruction Set Architecture (ISA) extension.
  • Target Operations: Acceleration focuses on linear algebra products involving sparse formats (e.g., CSR and CSF), specifically the overhead of their indirect memory lookups.
  • Supported Operations: Efficient implementations are shown for dot, matrix-vector, and matrix-matrix product kernels.
  • Proposed Future Uses: The indirection hardware is also suited to general scatter-gather operations and codebook decoding, as sketched below.
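As a rough illustration of these broader uses, the sketch below shows gather, scatter, and codebook-decoding loops in plain C; the function names are hypothetical and not from the paper. In each case the loop body reduces to an index-driven load or store, which is exactly the access pattern an indirection stream could supply as a register operand.

```c
#include <stddef.h>
#include <stdint.h>

/* Gather: out[i] = src[idx[i]]. With an indirection stream, the
 * index-driven load could appear to software as a plain register read. */
void gather_f64(size_t n, const uint32_t *idx,
                const double *src, double *out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = src[idx[i]];
}

/* Scatter: dst[idx[i]] = src[i], the store-side counterpart. */
void scatter_f64(size_t n, const uint32_t *idx,
                 const double *src, double *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] = src[i];
}

/* Codebook decoding: expand a stream of small codes into full-width
 * values by indexing a lookup table, e.g. for quantized weights. */
void decode_codebook(size_t n, const uint8_t *codes,
                     const double *codebook, double *out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = codebook[codes[i]];
}
```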

Implications

  • RISC-V Specialization: This work demonstrates how the extensibility of the RISC-V ISA can be used to solve domain-specific compute challenges, in this case the notorious memory indirection problem in sparse computing.
  • HPC and ML Relevance: Sparse-dense linear algebra is foundational for many machine learning models and high-performance computing tasks; solving the efficiency problem provides a significant boost to RISC-V's viability in these areas.
  • Competitive Hardware: Achieving superior FP64 utilization compared to specialized, highly-optimized commercial GPU libraries (like cuSPARSE) positions this RISC-V extension as a powerful, energy-efficient alternative for sparse workloads, challenging incumbent architectures in accelerator design.
