Decoupled Control Flow and Data Access in RISC-V GPGPUs

Abstract

This paper addresses the performance limitations of Vortex, an open-source RISC-V GPGPU, by tackling the high micro-code overheads associated with control-flow (CF) management and memory access. The core innovation is decoupling CF and data access through a hardware CF manager and dedicated memory streaming lanes. These micro-architecture modifications yield substantial gains: up to 8x faster execution, a 10x reduction in dynamic instruction count, and an increase in performance density from 0.35 to 1.63 GFLOP/s/mm².

Report

Key Highlights

  • Target Platform: The research focuses on improving Vortex, a newly proposed open-source GPGPU platform built on the RISC-V Instruction Set Architecture (ISA).
  • Problem Statement: The platform currently suffers from poor performance due to significant micro-code overheads linked to complex control flow management and memory orchestration, common in memory-intensive kernels.
  • Core Innovation: The introduction of decoupled Control Flow (CF) and Data Access, implemented through simple yet powerful micro-architecture modifications.
  • Performance Metrics: Evaluation results show an 8x faster execution time and a 10x reduction in the dynamic instruction count for various kernels.
  • Throughput Improvement: Overall performance density increased significantly, moving from 0.35 to 1.63 GFLOP/s/mm².
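As a quick sanity check on the density figures above, the reported numbers imply roughly a 4.7x improvement in throughput per unit of silicon area. The snippet below is just an illustrative back-of-the-envelope calculation from the summary's own figures, not an additional result from the paper:

```python
# Performance density before and after the proposed modifications,
# as reported in the summary (GFLOP/s per mm^2 of silicon area).
baseline_density = 0.35
improved_density = 1.63

# How much more throughput each mm^2 of area delivers after the changes.
density_gain = improved_density / baseline_density
print(f"Performance-density gain: {density_gain:.2f}x")  # ~4.66x
```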

Technical Details

The architectural improvements rely on two primary components designed to decouple instruction fetching and data handling:

  1. Hardware Control Flow (CF) Manager: This dedicated hardware module is introduced to accelerate control flow operations, specifically branching and predication, during regular loop execution. This handles the CF management overheads that typically dominate the dynamic instruction count in kernels.
  2. Decoupled Memory Streaming Lanes: These lanes are micro-architectural additions designed to orchestrate data access independently. Their purpose is to further hide memory latency by enabling useful computation to proceed concurrently with memory fetching and streaming.
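A rough way to see why decoupling helps is a first-order timing model: in a coupled pipeline each loop iteration pays memory latency and compute time back to back, whereas streaming lanes that run ahead of the compute pipeline let the two overlap, so steady-state throughput is set by the slower of the two sides. The sketch below is an illustrative model only; the function names and cycle counts are assumptions for illustration, not Vortex parameters:

```python
def coupled_cycles(n_iters, mem_latency, compute_cycles):
    """Baseline: each iteration serializes its memory access and compute."""
    return n_iters * (mem_latency + compute_cycles)

def decoupled_cycles(n_iters, mem_latency, mem_throughput, compute_cycles):
    """Decoupled: streaming lanes fetch ahead of the compute pipeline, so
    after an initial fill of `mem_latency` cycles, iterations complete at
    the rate of the slower side (memory throughput vs. compute)."""
    return mem_latency + n_iters * max(mem_throughput, compute_cycles)

# Illustrative parameters (not Vortex-specific): 1000 iterations,
# 100-cycle memory latency, one access sustainable every 2 cycles,
# 4 cycles of arithmetic per iteration.
print(coupled_cycles(1000, 100, 4))       # 104000 cycles
print(decoupled_cycles(1000, 100, 2, 4))  # 4100 cycles
```

In this toy model the decoupled version hides almost the entire memory latency behind computation, which is exactly the effect the streaming lanes aim for.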

These modifications specifically target and accelerate execution of memory-intensive kernels, such as linear algebra routines, which are fundamental building blocks for Machine Learning applications.
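To see why CF and address management can dominate the dynamic instruction count in such kernels, consider a scalar SAXPY-style loop (`y[i] = a*x[i] + y[i]`). The tally below is an illustrative model of a generic RISC-style loop body; the instruction mix and the resulting 8x figure are assumptions for illustration, not measurements from the paper (which reports up to 10x):

```python
N = 1024  # loop trip count (illustrative)

# Per-iteration instruction mix in a plain software-managed scalar loop
# (counts are illustrative assumptions for a generic RISC-style core).
arithmetic   = 1  # fused multiply-add
memory_ops   = 3  # load x[i], load y[i], store y[i]
addressing   = 2  # pointer increments for x and y
control_flow = 2  # loop-counter update and backward branch

baseline = N * (arithmetic + memory_ops + addressing + control_flow)

# With the CF manager handling branches/counters and the streaming lanes
# handling memory accesses and address generation, the core's dynamic
# instruction stream shrinks to roughly the arithmetic alone.
decoupled = N * arithmetic

print(f"Dynamic-instruction reduction: {baseline / decoupled:.0f}x")  # 8x
```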

Implications

  • RISC-V Ecosystem Boost: By significantly enhancing performance, the paper provides a crucial boost to the RISC-V GPGPU movement, making the open-source Vortex platform a viable, competitive alternative to proprietary GPU platforms and commercial GPU modeling tools.
  • Enabling GPGPU Research: The performance gains solidify Vortex as an “ideal playground” for researchers. This open architecture allows for fresh research directions into GPGPU design that are often restricted or unavailable in proprietary commercial hardware.
  • Machine Learning Acceleration: The focus on efficiently executing memory-intensive kernels makes the enhanced Vortex platform highly relevant for advancing the next generation of Machine Learning applications, which heavily rely on these routines.