CuPBoP: CUDA for Parallelized and Broad-range Processors

Abstract

CuPBoP (CUDA for Parallelized and Broad-range Processors) is a framework for executing CUDA code on non-NVIDIA hardware, which reduces vendor lock-in and enables data-parallel computing in heterogeneous systems. Unlike earlier approaches that rely on source-to-source translation, CuPBoP requires neither manual code modifications nor an intermediate portable programming language. It supports several CPU backends, including X86, AArch64, and RISC-V, achieves higher benchmark coverage (69.6%) than the existing frameworks it is compared against, and delivers competitive performance on standard benchmarks.

Report

Key Highlights

  • CUDA Portability: CuPBoP enables the execution of CUDA programs on non-NVIDIA devices.
  • No Manual Modification: The framework eliminates the manual code adjustments that made earlier source-to-source translation approaches labor-intensive.
  • High Coverage: Achieved 69.6% coverage on the Rodinia benchmark suite, substantially higher than the 56.6% coverage provided by existing comparable frameworks.
  • Broad ISA Support: Supports execution across multiple CPU Instruction Set Architectures (ISAs), specifically X86, RISC-V, and AArch64.
  • Performance: On CPU backends, performance is reported to be close to, or in some cases higher than, that of competing open-source projects.

Technical Details

  • Innovation: CuPBoP executes CUDA programs on the target hardware without first translating the source code into a portable programming language.
  • Target Architectures: The framework specifically targets and benchmarks performance on diverse architectures, including X86, RISC-V, and AArch64 CPU backends.
  • Metrics: Coverage is the share of programs in the Rodinia benchmark suite that the framework supports, which serves as a proxy for how much existing CUDA code it can handle.
  • Comparison Basis: Performance analysis includes comparisons against existing frameworks that execute CUDA on non-NVIDIA devices, manually optimized OpenMP/MPI programs, and native execution on the latest NVIDIA Ampere architecture GPUs.
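This summary does not describe how CuPBoP maps CUDA kernels onto CPUs. As a rough illustration of the general idea behind running GPU-style kernels on a CPU, the sketch below treats the kernel body as the work of one logical thread and replaces the grid launch with nested loops over blocks and threads. The names `Dim3`, `vec_add_thread`, `launch_vec_add`, and `run_example` are hypothetical and are not part of CuPBoP or the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index types (one dimension only).
struct Dim3 { unsigned x; };

// Kernel body written as an ordinary function: each call performs the work
// of one logical GPU thread, identified by its block and thread indices.
void vec_add_thread(Dim3 blockIdx, Dim3 blockDim, Dim3 threadIdx,
                    const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {  // same bounds guard a CUDA kernel would typically use
        c[i] = a[i] + b[i];
    }
}

// Hypothetical CPU "launch": the grid becomes an outer loop over blocks and
// the threads of each block become an inner loop. This is an assumption made
// for illustration, not CuPBoP's documented implementation.
void launch_vec_add(unsigned num_blocks, unsigned threads_per_block,
                    const float* a, const float* b, float* c, std::size_t n) {
    for (unsigned block = 0; block < num_blocks; ++block) {
        for (unsigned thread = 0; thread < threads_per_block; ++thread) {
            vec_add_thread(Dim3{block}, Dim3{threads_per_block}, Dim3{thread},
                           a, b, c, n);
        }
    }
}

// Small driver: 10 elements covered by 3 blocks of 4 logical threads each,
// so the last two logical threads are masked off by the bounds guard.
std::vector<float> run_example() {
    const std::size_t n = 10;
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }
    launch_vec_add(3, 4, a.data(), b.data(), c.data(), n);
    return c;
}
```

Calling `run_example()` returns a vector in which element `i` holds `a[i] + b[i]`. A plain loop like this only works for kernels without intra-block synchronization; kernels that call `__syncthreads()` need the thread loop split at each barrier, and a practical runtime would also spread the block loop across CPU cores.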

Implications

  • Breaking Vendor Lock-in: CuPBoP opens the existing CUDA software ecosystem to other hardware, reducing reliance on NVIDIA GPUs for parallel computing tasks.
  • Advancing Heterogeneous Systems: By allowing existing CUDA programs to run on non-NVIDIA devices, including CPUs, CuPBoP makes it easier to build heterogeneous systems in which the same code runs across different device types.
  • Impact on RISC-V: Support for the RISC-V ISA matters for the RISC-V ecosystem. It gives RISC-V hardware developers and users access to much of the existing body of CUDA code, which could help RISC-V adoption in domains that depend on data-parallel computation (HPC, AI/ML inference). Limited software availability has been a major hurdle for the open ISA, and running CUDA on RISC-V offers one bridge across it.
