MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V Cores

Abstract

This paper introduces MiniFloat-NN, a RISC-V Instruction Set Architecture (ISA) extension, and ExSdotp, a modular open hardware unit, designed to accelerate low-precision neural network training with 8-bit and 16-bit floating-point (FP) formats. The ExSdotp unit implements fused sum-of-dot-product operations and three-term additions, saving approximately 30% in area and critical path compared with a cascade of two expanding fused multiply-add (FMA) units, while avoiding the precision loss incurred by chaining separately rounded FP operations. A cluster of eight extended cores, implemented in 12 nm FinFET technology, reaches up to 575 GFLOPS/W when computing FP8-to-FP16 General Matrix Multiplications (GEMMs).

Report

Key Highlights

  • RISC-V Extension for Low-Precision Training: Introduction of MiniFloat-NN, a RISC-V ISA extension specifically targeting low-precision neural network (NN) training.
  • Low-Precision Format Support: The extension provides native hardware support for two 8-bit and two 16-bit floating-point (FP) formats.
  • ExSdotp Hardware Unit: Development of ExSdotp, a modular open hardware unit that efficiently supports new complex instructions.
  • Fused Operations: The ExSdotp module handles fused operations, specifically the sum-of-dot-product (with accumulation in a format twice as wide) and expanding/non-expanding three-term additions; a reference-model sketch follows this list.
  • High Efficiency: A test cluster demonstrates a peak energy efficiency of 575 GFLOPS/W when computing FP8-to-FP16 GEMMs.
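
To make the expanding sum-of-dot-product concrete, here is a minimal reference model in Python. It uses NumPy's float16 and float32 as stand-ins for a narrow/wide format pair (the actual unit targets pairs such as FP8/FP16); the helper name exsdotp_ref and the float64-intermediate modeling trick are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def exsdotp_ref(a0, b0, a1, b1, c):
    """Model of the expanding sum-of-dot-product: w = c + a0*b0 + a1*b1,
    where a*/b* use a narrow FP format and c and the result use a format
    twice as wide. float16/float32 stand in for the narrow/wide pair.

    The fused unit rounds only once at the end; we mimic that here by
    carrying exact float64 intermediates and rounding once to the wide format.
    """
    p0 = np.float64(a0) * np.float64(b0)        # narrow x narrow, exact in double
    p1 = np.float64(a1) * np.float64(b1)
    return np.float32(np.float64(c) + p0 + p1)  # single rounding to the wide format

# Narrow inputs, wide accumulator:
a0, b0 = np.float16(0.125), np.float16(3.0)
a1, b1 = np.float16(-0.5),  np.float16(2.0)
acc = np.float32(10.0)
print(exsdotp_ref(a0, b0, a1, b1, acc))         # 10.0 + 0.375 - 1.0 = 9.375
```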

Technical Details

  • ISA Extension (MiniFloat-NN): Extends the RISC-V ISA with instructions optimized for low-precision training, supporting two 8-bit and two 16-bit FP representations.
  • Instruction Types: Includes sum-of-dot-product instructions that automatically accumulate results in a wider format to maintain precision, and three-term additions (A+B+C) in both expanding and non-expanding versions.
  • ExSdotp Architecture: A fused hardware unit designed to execute the complex, multi-term instructions efficiently. Its fused datapath rounds only once, preventing the precision degradation caused by the non-associativity of two consecutive, separately rounded FP additions (demonstrated numerically after this list).
  • Efficiency Gains: The ExSdotp module achieves a saving of around 30% in both area and critical path compared to implementing the same functionality using a cascade of two expanding fused multiply-add (FMA) units.
  • Implementation and Testing: The ExSdotp module was replicated in a SIMD wrapper and integrated into an open-source Floating-Point Unit (FPU) coupled with an open-source RISC-V core.
  • Testbed Performance: An 8-core cluster was implemented in 12 nm FinFET technology; running at 0.8 V and 1.26 GHz, it reaches the 575 GFLOPS/W figure on FP8-to-FP16 GEMMs, a core deep learning kernel.
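
The value of fusing is easy to see numerically. In the chained version below, every intermediate result is rounded back to the narrow format (float16 again standing in), so a small dot-product contribution is lost entirely; the fused/expanding version, which accumulates in a wider format, preserves it. The operand values are arbitrary, chosen only to trigger the rounding.

```python
import numpy as np

a0, b0 = np.float16(0.01), np.float16(1.0)
a1, b1 = np.float16(0.01), np.float16(1.0)
acc = np.float16(2048.0)          # large accumulator in the narrow format

# Chained FP ops: each addition rounds to float16. The spacing between
# consecutive float16 values at 2048 is 2.0, so adding 0.01 is a no-op
# and both product contributions vanish.
chained = (acc + a0 * b0) + a1 * b1
print(chained)                    # 2048.0

# Fused/expanding: products and accumulation carried in a wider format,
# so the small contributions survive.
fused = np.float32(acc) + np.float32(a0) * np.float32(b0) \
                        + np.float32(a1) * np.float32(b1)
print(fused)                      # ~2048.02
```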

Implications

  • Advancing RISC-V AI Capability: This work provides a standardized, open-source ISA extension that allows RISC-V cores to effectively compete in the specialized high-efficiency NN training market, previously dominated by proprietary architectures.
  • Enabling Mixed-Precision Training: By natively supporting 8-bit FP formats and expanding operations that accumulate into 16-bit, the design lays the architectural foundation for practical, numerically robust mixed-precision training (a minimal GEMM kernel built on this idea is sketched after this list).
  • Hardware Design Optimization: The ExSdotp unit demonstrates that fusing complex arithmetic operations is key to overcoming architectural limitations in low-precision arithmetic, providing significant power and area benefits over traditional FMA chaining.
  • Foundation for Scalability: Integrating this unit into an open-source core provides a reusable, modular blueprint for designing future scalable cluster architectures dedicated to energy-efficient AI computation.
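
As a rough sketch of how an expanding sum-of-dot-product anchors a mixed-precision GEMM kernel: the loop below keeps the inputs narrow (float16 as the stand-in), accumulates into a wide float32 result, and consumes two products per step, matching the instruction's shape. The names (exsdotp, gemm_mixed) and the scalar loop structure are illustrative assumptions; a real kernel would map the inner step onto the SIMD lanes of the hardware unit.

```python
import numpy as np

def exsdotp(a0, b0, a1, b1, c):
    # Expanding sum-of-dot-product model: single rounding to the wide format.
    return np.float32(np.float64(a0) * np.float64(b0)
                      + np.float64(a1) * np.float64(b1) + np.float64(c))

def gemm_mixed(A, B):
    """C = A @ B with narrow (float16) inputs and wide (float32)
    accumulation. A: (M, K), B: (K, N), K even."""
    M, K = A.shape
    _, N = B.shape
    C = np.empty((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            acc = np.float32(0.0)
            for k in range(0, K, 2):   # two narrow products per fused step
                acc = exsdotp(A[i, k], B[k, j], A[i, k + 1], B[k + 1, j], acc)
            C[i, j] = acc
    return C

A = np.random.randn(4, 8).astype(np.float16)
B = np.random.randn(8, 4).astype(np.float16)
ref = A.astype(np.float32) @ B.astype(np.float32)
print(np.max(np.abs(gemm_mixed(A, B) - ref)))   # small; wide accumulation limits error
```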