A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation
Abstract
This paper introduces a scalable mixed-precision fused dot product unit designed to overcome the suboptimal throughput of discrete arithmetic units in GPGPUs for Deep Learning workloads. Implemented within the open-source RISC-V Vortex GPGPU's Tensor Core Unit extension, the architecture fuses the floating-point and integer arithmetic pipelines into a single datapath for efficiency. The unit supports low-precision formats (e.g., FP8/BF8/INT8) for multiplication and higher-precision formats (FP32/INT32) for accumulation, achieving an ideal filled-pipeline throughput of 9.812 GFLOPS at 306.6 MHz on an AMD Xilinx Alveo U55C FPGA.
Report
Key Highlights
- Core Innovation: A scalable, mixed-precision fused dot product unit (FDPU) architecture for accelerated GPGPU tensor computation.
- Problem Addressed: Suboptimal throughput and poor resource utilization in existing open-source RTL implementations, which compute the inner dot products with discrete arithmetic units.
- Implementation Context: The unit is implemented as part of the open-source RISC-V based Vortex GPGPU's Tensor Core Unit extension.
- Performance Metric: Achieved an ideal filled-pipeline throughput of 9.812 GFLOPS in a 4-thread-per-warp configuration (see the arithmetic note after this list).
- Hardware Validation: Demonstrated a 4-cycle operation latency at a 306.6 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA.
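As a rough cross-check of these two figures (our arithmetic, not a claim from the paper): 9.812 GFLOPS at 306.6 MHz works out to 9.812 × 10⁹ / 306.6 × 10⁶ ≈ 32 floating-point operations per cycle, which would be consistent with each of the 4 threads in a warp retiring a 4-element fused dot product (4 multiplies plus 4 additions, i.e., 8 FLOPs) every cycle once the pipeline is filled.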
Technical Details
- Architecture: A single fused architecture that integrates both floating-point and integer arithmetic pipelines to maximize throughput and efficiency for mixed-precision Matrix Multiply-Accumulate (MMA); a software reference model of the fused step is sketched after this list.
- Low-Precision Input Support (Multiplication): Configurable to handle FP16, BF16, FP8, BF8, INT8, and UINT4 formats.
- High-Precision Output Support (Accumulation): Supports accumulation in FP32 and INT32 formats.
- Extensibility: The design framework allows for future extensions to evaluate and integrate other custom low-precision representations.
- Performance Latency: 4-cycle operation latency.
- Operating Frequency: 306.6 MHz on the target FPGA.
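To make the fused dataflow concrete, below is a minimal software reference model of one fused dot-product-accumulate step. It assumes FP8 inputs in the common E4M3 layout (BF8 commonly denotes the E5M2 layout), a dot-product length of 4, and FP32 accumulation; the format choice, the length, and the function names (fp8_e4m3_to_float, fdp4) are illustrative assumptions rather than details taken from the paper. The property it mirrors is that partial products are summed in a wide intermediate and rounded into the accumulator once, rather than after every discrete multiply-add.

```c
/* Minimal reference model of a 4-element fused dot product with
 * FP8 (E4M3) inputs and FP32 accumulation. All names and the
 * E4M3 choice are illustrative assumptions, not from the paper. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Decode FP8 E4M3 (1 sign / 4 exponent / 3 mantissa bits, bias 7)
 * to float. NaN handling is omitted for brevity. */
static float fp8_e4m3_to_float(uint8_t x) {
    int sign = (x >> 7) & 1;
    int exp  = (x >> 3) & 0xF;
    int man  =  x       & 0x7;
    float v = (exp == 0)
        ? ldexpf((float)man, 1 - 7 - 3)        /* subnormal */
        : ldexpf(1.0f + man / 8.0f, exp - 7);  /* normal    */
    return sign ? -v : v;
}

/* One fused step: d = c + sum(a[i] * b[i]). The double-precision
 * intermediate stands in for the unit's wide unrounded adder tree;
 * the final cast models the single rounding into the FP32
 * accumulator. */
static float fdp4(const uint8_t a[4], const uint8_t b[4], float c) {
    double acc = c;
    for (int i = 0; i < 4; i++)
        acc += (double)fp8_e4m3_to_float(a[i]) * fp8_e4m3_to_float(b[i]);
    return (float)acc;
}

int main(void) {
    /* E4M3 encodings: 0x38 = 1.0, 0x40 = 2.0, 0x44 = 3.0, 0x48 = 4.0 */
    const uint8_t a[4] = {0x38, 0x40, 0x44, 0x48};
    const uint8_t b[4] = {0x38, 0x38, 0x38, 0x38};  /* all 1.0 */
    printf("%f\n", fdp4(a, b, 0.5f));  /* 1+2+3+4 + 0.5 = 10.5 */
    return 0;
}
```

Swapping the decode step for sign extension and the double accumulator for an int32_t would give the analogous INT8/INT32 path, which the fused design shares with the floating-point pipeline rather than duplicating hardware.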
Implications
- Advancing RISC-V GPGPU Capabilities: This work directly enhances the capability of the open-source RISC-V ecosystem by providing highly optimized, state-of-the-art tensor computation hardware, moving beyond basic vector units.
- Deep Learning Acceleration: By efficiently supporting a wide range of current and emerging low-precision formats (especially FP8 and BF8), the unit enables high-throughput processing of critical Deep Learning workloads, addressing bottlenecks in computation-intensive phases.
- Open-Source Maturity: The implementation within the Vortex GPGPU project elevates the sophistication of available open-source RTL, offering a high-performance alternative to proprietary tensor cores.
- Flexibility for Research: The configurable and extensible nature of the unit facilitates academic and industrial research into optimal data formats and custom arithmetic representations for future AI hardware generations.