Short reasons for long vectors in HPC CPUs: a study based on RISC-V
Abstract
This study analyzes the performance benefits of exceptionally long vector units in High-Performance Computing (HPC) CPUs using a customizable RISC-V core setup. Testing vector lengths up to 256 double-precision elements against non-dense workloads (SpMV, BFS, PageRank, FFT), the research measures performance while varying memory latency and bandwidth. The results confirm that larger vector lengths significantly improve the tolerance to limitations within the memory subsystem, expanding vectorization opportunities beyond traditional dense linear algebra.
Report
Key Highlights
- The study investigates the impact of ultra-long vector units on CPU throughput in High-Performance Computing (HPC).
- A customizable RISC-V core was used to test vector lengths far beyond those of commercial SIMD units, which typically operate on only 8 double-precision elements (512 bits) per instruction.
- The research focuses specifically on non-dense workloads (SpMV, BFS, PageRank, FFT) to examine performance outside of traditional dense linear algebra.
- Key finding: Larger vector lengths provide better tolerance against limitations in the memory subsystem, particularly memory latency and bandwidth bottlenecks.
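The intuition behind the key finding can be sketched with Little's law: to sustain a given bandwidth at a given latency, roughly latency × bandwidth bytes must be in flight. The snippet below is a back-of-the-envelope model, not the study's methodology; the function names and the 100 ns / 50 GB/s figures are illustrative assumptions, and it ignores caches, prefetchers, and memory-level parallelism limits.

```python
import math

def bytes_in_flight(latency_ns: float, bandwidth_gb_s: float) -> float:
    """Little's law: bytes that must be outstanding to sustain the bandwidth.

    ns * GB/s = 1e-9 s * 1e9 B/s, so the product is already in bytes.
    """
    return latency_ns * bandwidth_gb_s

def vector_loads_needed(latency_ns: float, bandwidth_gb_s: float,
                        vl_elems: int, elem_bytes: int = 8) -> int:
    """Outstanding vector loads required if each load moves vl_elems elements."""
    per_load = vl_elems * elem_bytes
    return math.ceil(bytes_in_flight(latency_ns, bandwidth_gb_s) / per_load)

if __name__ == "__main__":
    # Hypothetical memory system: 100 ns latency, 50 GB/s bandwidth.
    for vl in (8, 64, 256):
        print(vl, vector_loads_needed(100, 50, vl))
```

Under these assumed numbers, a core with 8-element vectors must keep dozens of loads in flight to cover the latency, while a 256-element vector needs only a handful, because each instruction requests far more data.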
Technical Details
- Architecture: Based on a RISC-V CPU core connected to a highly customizable vector unit.
- Vector Length Tested: The experimental setup could operate on up to 256 double-precision (DP) elements (16,384 bits) per instruction, 32 times the 8 elements of typical commercial SIMD units.
- Workloads: Four distinct, non-dense computational kernels were used: Sparse Matrix-Vector multiplication (SpMV), Breadth-First Search (BFS), PageRank, and Fast Fourier Transform (FFT).
- Methodology: Performance was measured by systematically varying three parameters: vector length, memory latency, and memory bandwidth, to assess the system's robustness.
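As an illustration of how one of these kernels maps onto a vector unit, the following is a minimal sketch of SpMV over a CSR matrix, strip-mined into chunks of at most `vl` elements. It emulates vector operations with Python lists and counts how many chunks (stand-ins for vector instruction groups) are issued. The row-wise CSR layout, the function name, and the counting scheme are assumptions made for illustration; the study's actual kernels and sparse formats may differ.

```python
def spmv_csr_stripmined(row_ptr, col_idx, vals, x, vl):
    """Compute y = A @ x for a CSR matrix, processing each row in chunks of <= vl.

    Returns (y, strips), where strips counts the chunks issued. Each chunk
    stands in for one indexed load (gather), one multiply, and one reduction.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    strips = 0
    for row in range(n_rows):
        start, end = row_ptr[row], row_ptr[row + 1]
        acc = 0.0
        for base in range(start, end, vl):
            n = min(vl, end - base)  # active elements in this chunk
            gathered = [x[col_idx[base + k]] for k in range(n)]   # gather from x
            prods = [vals[base + k] * gathered[k] for k in range(n)]
            acc += sum(prods)                                     # reduction
            strips += 1
        y[row] = acc
    return y, strips

if __name__ == "__main__":
    # 3x3 example: [[1,0,2],[0,3,0],[4,5,6]] times [1,2,3]
    row_ptr = [0, 2, 3, 6]
    col_idx = [0, 2, 1, 0, 1, 2]
    vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    x = [1.0, 2.0, 3.0]
    print(spmv_csr_stripmined(row_ptr, col_idx, vals, x, vl=2))
    print(spmv_csr_stripmined(row_ptr, col_idx, vals, x, vl=256))
```

Note that in this row-wise form a long vector only pays off when rows have many nonzeros; with short rows each chunk is mostly empty, which is one reason sparse formats that pack several rows into one vector are often considered for long-vector machines.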
Implications
- Support for RISC-V Vector (RVV): The findings support the architectural choice of scalable, configurable vector lengths, a core feature of the RISC-V Vector extension (RVV).
- Addressing Memory Bottlenecks: Long vectors improve tolerance to high memory latency and limited memory bandwidth, giving HPC architects another way to cope with memory-subsystem limitations.
- Expanding Vectorization Scope: The results offer a promising path for code developers working on non-dense scientific workloads, enabling vectorization benefits in domains previously considered poorly suited for traditional SIMD parallelism.
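To make the idea of vectorizing an irregular workload concrete, here is a hedged sketch of a level-synchronous BFS over a CSR adjacency structure in which each vertex's neighbor scan is strip-mined into chunks of at most `vl` elements. Each chunk emulates a unit-stride load of neighbor IDs, a gather of their levels, a compare that forms a mask, and a masked update. The function name and structure are hypothetical and are not taken from the study's implementation.

```python
def bfs_levels_chunked(row_ptr, col_idx, source, vl):
    """Return the BFS level of every vertex (-1 if unreachable)."""
    n_vertices = len(row_ptr) - 1
    level = [-1] * n_vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            start, end = row_ptr[u], row_ptr[u + 1]
            for base in range(start, end, vl):
                n = min(vl, end - base)            # active elements in this chunk
                neigh = col_idx[base:base + n]     # unit-stride load of neighbor IDs
                seen = [level[v] for v in neigh]   # gather current levels
                mask = [s == -1 for s in seen]     # compare: unvisited lanes
                for v, m in zip(neigh, mask):      # masked update
                    # Re-check level[v] so a duplicate ID in one chunk is added once.
                    if m and level[v] == -1:
                        level[v] = depth
                        next_frontier.append(v)
        frontier = next_frontier
    return level

if __name__ == "__main__":
    # Undirected graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
    row_ptr = [0, 2, 4, 6, 9, 10]
    col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
    print(bfs_levels_chunked(row_ptr, col_idx, source=0, vl=2))
```

The result does not depend on `vl`; only the number of chunks changes. On real hardware, the duplicate check inside the masked update is where conflict handling for scatters would be needed.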