A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation

Abstract

This paper introduces a scalable mixed-precision fused dot product unit designed to overcome the suboptimal throughput of discrete arithmetic units in GPGPUs for Deep Learning workloads. Implemented within the open-source RISC-V Vortex GPGPU's Tensor Core Unit extension, the architecture fuses the floating-point and integer arithmetic pipelines into a single unit for efficiency. The unit supports low-precision formats (e.g., FP8/BF8/INT8) for multiplication and higher-precision formats (FP32/INT32) for accumulation, achieving 9.812 GFLOPS of ideal filled-pipeline throughput at 306.6 MHz on an AMD Xilinx Alveo U55C FPGA.

Report

Key Highlights

  • Core Innovation: A scalable, mixed-precision fused dot product unit (FDPU) architecture for accelerated GPGPU tensor computation.
  • Problem Addressed: Suboptimal throughput and poor resource utilization stemming from existing open-source RTL implementations that use discrete arithmetic units for inner dot products.
  • Implementation Context: The unit is implemented as part of the open-source RISC-V based Vortex GPGPU's Tensor Core Unit extension.
  • Performance Metric: Achieved an ideal filled-pipeline throughput of 9.812 GFLOPS in a 4-thread-per-warp configuration (see the arithmetic sketch after this list).
  • Hardware Validation: Demonstrated a 4-cycle operation latency at a 306.6 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA.
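
As a quick sanity check on these figures, the sketch below shows that the reported peak corresponds to 32 FLOPs per cycle at the reported clock. The breakdown into a 4-element fused dot product (4 multiplies + 4 adds) per thread per cycle is an assumption made purely for illustration; the summary does not state the dot-product length.

```python
# Back-of-the-envelope check of the reported peak throughput.
# Assumption (not stated in the summary): each thread retires one 4-element
# fused dot product per cycle once the pipeline is full, i.e.
# 4 multiplies + 4 adds = 8 FLOPs per thread per cycle.

clock_hz         = 306.6e6            # reported FPGA clock frequency
threads_per_warp = 4                  # reported configuration
dot_length       = 4                  # assumed elements per fused dot product
flops_per_thread = 2 * dot_length     # one multiply + one add per element

peak_flops = clock_hz * threads_per_warp * flops_per_thread
print(f"{peak_flops / 1e9:.3f} GFLOPS")  # -> 9.811 GFLOPS, matching ~9.812
```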

Technical Details

  • Architecture: A single fused architecture that integrates both floating-point and integer arithmetic pipelines to maximize throughput and efficiency for mixed-precision Matrix Multiply-Accumulate (MMA) operations (a behavioral sketch follows this list).
  • Low-Precision Input Support (Multiplication): Configurable to handle FP16, BF16, FP8, BF8, INT8, and UINT4 formats.
  • High-Precision Output Support (Accumulation): Supports accumulation in FP32 and INT32 formats.
  • Extensibility: The design framework allows for future extensions to evaluate and integrate other custom low-precision representations.
  • Performance Latency: 4-cycle operation latency.
  • Operating Frequency: 306.6 MHz on the target FPGA.
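
To make the multiply-then-accumulate behavior concrete, the following is a minimal software reference model of one dot-product lane: element-wise multiplication in a low-precision format with accumulation in a wider format. It models only the INT8/INT32 path (the floating-point paths, e.g. FP8 to FP32, follow the same structure with format conversion in place of integer casts), and the function name and 4-element dot-product length are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fused_dot_product_int8(a, b, c):
    """Compute c + dot(a, b) with INT8 inputs and an INT32 accumulator."""
    a = np.asarray(a, dtype=np.int8)
    b = np.asarray(b, dtype=np.int8)
    # Products are widened before accumulation, so no intermediate result
    # is rounded or saturated at the low input precision.
    products = a.astype(np.int32) * b.astype(np.int32)
    return np.int32(c) + products.sum(dtype=np.int32)

# Example: a 4-element fused dot product accumulated into a running INT32 value.
acc = fused_dot_product_int8([127, -8, 3, 64], [2, 5, -9, 1], c=1000)
print(acc)  # 1000 + (254 - 40 - 27 + 64) = 1251
```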

Implications

  • Advancing RISC-V GPGPU Capabilities: This work directly enhances the capability of the open-source RISC-V ecosystem by providing highly optimized, state-of-the-art tensor computation hardware, moving beyond basic vector units.
  • Deep Learning Acceleration: By efficiently supporting a wide range of current and emerging low-precision formats (especially FP8 and BF8), the unit enables high-throughput processing of critical Deep Learning workloads, addressing bottlenecks in computation-intensive phases.
  • Open-Source Maturity: The implementation within the Vortex GPGPU project elevates the sophistication of available open-source RTL, offering a high-performance alternative to proprietary tensor cores.
  • Flexibility for Research: The configurable and extensible nature of the unit facilitates academic and industrial research into optimal data formats and custom arithmetic representations for future AI hardware generations.
