Exploiting long vectors with a CFD code: a co-design show case


Abstract

This paper presents a co-design methodology that combines iterative analysis with compiler autovectorization to exploit long vector architectures in HPC applications, using a production CFD code as the case study. The optimization process, designed to maximize efficiency while preserving code portability, was evaluated on a configurable RISC-V platform with a wide vector unit. The optimized code reached a single-core speedup of $7.6\times$ over the scalar implementation, and the same optimizations were checked for portability on Intel x86 and NEC SX-Aurora.

Report

Key Highlights

  • Co-Design Focus: The research centers on a co-design methodology to exploit long vector architectures (SIMD/Vector extensions) for data parallelism in HPC.
  • Optimization Strategy: The primary method is leveraging compiler autovectorization, focusing on iterative code improvement guided by detailed analysis tools to maximize efficiency and minimize code specialization (maintaining portability).
  • Application Success: The methodology was applied to a production Computational Fluid Dynamics (CFD) code.
  • Performance Result: Achieved a significant single-core speedup of $7.6\times$ over the scalar implementation on the target platform.
  • Portability Demonstrated: The same optimizations had a positive or neutral effect on performance when tested on other HPC architectures, namely Intel x86 and NEC SX-Aurora.
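
As a rough sanity check on the $7.6\times$ figure, Amdahl's law gives a lower bound on how much of the scalar runtime must sit in vectorized code. This is a generic model, not an analysis taken from the paper; $f$ (the fraction of scalar runtime that gets vectorized) and $v$ (the speedup of that fraction) are symbols introduced here purely for illustration.

$$
S = \frac{1}{(1 - f) + f/v} < \frac{1}{1 - f}
\quad\Longrightarrow\quad
f > 1 - \frac{1}{S} = 1 - \frac{1}{7.6} \approx 0.868
$$

Under this model, more than about 87% of the scalar runtime would have to be vectorized to reach $7.6\times$, however long the vectors are.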

Technical Details

  • Vectorization Method: Compiler autovectorization is preferred over methods that require extensive code modification (e.g., intrinsics or guided vectorization via pragmas).
  • Target Architecture: An innovative configurable platform powered by a RISC-V core.
  • Vector Unit Specification: The platform includes a wide vector unit that operates on up to 256 double-precision elements per vector (256 × 64 bits = 16,384 bits).
  • HPC Context: The study addresses a current trend in HPC systems to utilize SIMD or vector extensions for exploiting data parallelism.
  • Validation Platforms: Performance comparison utilized the RISC-V core, Intel x86, and NEC SX-Aurora architectures.
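
To make the 256-element figure concrete, the following C sketch shows how a long loop maps onto vectors of that length. It is an illustration only, not code from the paper: `VLMAX`, `chunks_needed`, and `daxpy_strip_mined` are hypothetical names, and a vectorizing compiler would normally generate this chunking itself from a plain loop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed maximum vector length in double-precision elements,
   taken from the 256-element figure quoted above. */
#define VLMAX 256

/* Number of vector-length chunks needed to cover n elements. */
size_t chunks_needed(size_t n, size_t vlmax)
{
    assert(vlmax > 0);
    return (n + vlmax - 1) / vlmax;   /* ceiling division */
}

/* y[i] += a * x[i], written as an explicit strip-mined loop.
   The outer loop advances one "vector" at a time; the inner loop is
   the part a vector unit would execute as a single vector operation.
   Returns the number of chunks processed. */
size_t daxpy_strip_mined(size_t n, double a,
                         const double *restrict x, double *restrict y)
{
    size_t chunks = 0;
    for (size_t start = 0; start < n; start += VLMAX) {
        /* The last chunk may be shorter than VLMAX. */
        size_t vl = (n - start < VLMAX) ? (n - start) : VLMAX;
        for (size_t i = 0; i < vl; ++i)
            y[start + i] += a * x[start + i];
        ++chunks;
    }
    return chunks;
}
```

For example, a loop over 600 elements is covered by three chunks of 256, 256, and 88 elements. Loops with trip counts well below 256 leave most of such a vector unused, which is one reason loop structure matters on long vector machines.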

Implications

  • RISC-V Vector Validation: This work provides evidence that configurable RISC-V cores coupled with wide vector units can deliver the performance needed by demanding scientific workloads such as CFD. The $7.6\times$ single-core speedup supports the potential of the RISC-V Vector (RVV) ecosystem in high-performance computing.
  • Software Ecosystem Maturity: The successful reliance on compiler autovectorization indicates growing maturity in RISC-V toolchains and compilers, enabling high performance without requiring developers to write vendor-specific intrinsic code.
  • Co-Design Utility: The demonstrated iterative co-design approach offers a blueprint for hardware and software developers to jointly optimize scientific applications, accelerating the deployment and adoption of new RISC-V HPC hardware.
  • HPC Portability Solution: By prioritizing source-level improvements that help the compiler autovectorize, the resulting code stays portable across competing vector architectures (RISC-V, x86, NEC) with a positive or neutral performance effect on each, addressing a critical portability challenge in the heterogeneous HPC landscape.
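
The kind of source-level change that helps an autovectorizer without tying the code to one architecture can be sketched in C as below. This is a minimal, hypothetical example and not the restructuring used in the paper: the kernel, the `NODES` constant, and both function names are assumptions. It moves the long dimension into the innermost, unit-stride loop and uses `restrict` to rule out aliasing, with no intrinsics or vendor pragmas.

```c
#include <assert.h>
#include <stddef.h>

#define NODES 8   /* hypothetical: small per-element node count */

/* Before: data stored as [element][node]. The innermost loop has only
   NODES iterations, far too few to fill a long vector. */
void scale_short_inner(size_t nelem, const double *restrict w,
                       const double *restrict in, double *restrict out)
{
    assert(w && in && out);
    for (size_t e = 0; e < nelem; ++e)
        for (size_t k = 0; k < NODES; ++k)
            out[e * NODES + k] = w[k] * in[e * NODES + k];
}

/* After: data stored as [node][element]. The innermost loop now runs
   over all elements with unit stride, so its trip count can fill a
   long vector, and w[k] is a loop-invariant scalar. */
void scale_long_inner(size_t nelem, const double *restrict w,
                      const double *restrict in, double *restrict out)
{
    assert(w && in && out);
    for (size_t k = 0; k < NODES; ++k) {
        const double wk = w[k];
        for (size_t e = 0; e < nelem; ++e)
            out[k * nelem + e] = wk * in[k * nelem + e];
    }
}
```

Both versions compute the same values; only the data layout and loop order differ. Because the second form is plain C, the same source can be handed to a vectorizing compiler on any of the targets, which is the portability argument made above.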