MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V Cores
Abstract
This paper introduces MiniFloat-NN, a RISC-V Instruction Set Architecture (ISA) extension, and ExSdotp, a modular open hardware unit, designed to accelerate low-precision neural network training with 8-bit and 16-bit floating-point formats. The ExSdotp unit implements fused sum-of-dot-product and three-term addition operations, saving approximately 30% in area and critical path compared with a cascade of two expanding fused multiply-add units, while avoiding the precision loss of chained FP operations. A cluster of extended cores implemented in 12 nm FinFET technology reaches an energy efficiency of up to 575 GFLOPS/W for FP8-to-FP16 General Matrix Multiplications (GEMMs).
Report
Key Highlights
- RISC-V Extension for Low-Precision Training: Introduction of MiniFloat-NN, a RISC-V ISA extension specifically targeting low-precision neural network (NN) training.
- Low-Precision Format Support: The extension provides native hardware support for two 8-bit and two 16-bit floating-point (FP) formats (see the format sketch after this list).
- ExSdotp Hardware Unit: Development of ExSdotp, a modular open hardware unit that efficiently supports new complex instructions.
- Fused Operations: The ExSdotp module handles fused operations, specifically sum-of-dot-product (with accumulation in a larger format) and expanding/non-expanding three-term additions.
- High Efficiency: A test cluster demonstrates a peak energy efficiency of 575 GFLOPS/W when computing FP8-to-FP16 GEMMs.
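The two 8-bit and two 16-bit formats are not spelled out in this summary. The short Python sketch below records the format set commonly associated with MiniFloat-NN (IEEE FP16, bfloat16 as FP16alt, an FP8 with a 5-bit exponent, and an FP8alt with a 4-bit exponent) purely as an assumption, together with the exponent bias and precision each bit split implies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FpFormat:
    """IEEE-754-style format described by its exponent/mantissa widths."""
    name: str
    exp_bits: int
    man_bits: int

    @property
    def bias(self) -> int:
        # Standard exponent bias for an IEEE-754-style format.
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def ulp_of_one(self) -> float:
        # Spacing of representable values around 1.0 (a proxy for precision).
        return 2.0 ** (-self.man_bits)

# Assumed MiniFloat-NN format set (the bit splits are this sketch's
# assumption, not quoted from the summary): 1 sign bit + exp_bits + man_bits.
MINIFLOAT_NN_FORMATS = [
    FpFormat("FP16 (IEEE binary16)", exp_bits=5, man_bits=10),
    FpFormat("FP16alt (bfloat16)",   exp_bits=8, man_bits=7),
    FpFormat("FP8",                  exp_bits=5, man_bits=2),
    FpFormat("FP8alt",               exp_bits=4, man_bits=3),
]

if __name__ == "__main__":
    for fmt in MINIFLOAT_NN_FORMATS:
        print(f"{fmt.name:22s} bias={fmt.bias:3d}  ulp(1.0)={fmt.ulp_of_one:.3e}")
```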
Technical Details
- ISA Extension (MiniFloat-NN): Extends the RISC-V architecture with instructions optimized for low-precision training, supporting the two 8-bit and two 16-bit FP formats listed above.
- Instruction Types: Includes sum-of-dot-product instructions that automatically accumulate results in a wider format to maintain precision, and three-term additions (A+B+C) in both expanding and non-expanding versions.
- ExSdotp Architecture: A fused hardware unit designed to execute the complex, multi-term instructions efficiently. Its fused datapath rounds the whole sum once, avoiding the precision degradation caused by the non-associativity of two consecutive standard FP additions (illustrated in the first sketch after this list).
- Efficiency Gains: The ExSdotp module achieves a saving of around 30% in both area and critical path compared to implementing the same functionality using a cascade of two expanding fused multiply-add (FMA) units.
- Implementation and Testing: The ExSdotp module was replicated in a SIMD wrapper and integrated into an open-source Floating-Point Unit (FPU) coupled with an open-source RISC-V core.
- Testbed Performance: An 8-core cluster was implemented in 12 nm FinFET technology, running at 0.8 V and 1.26 GHz, and reaches up to 575 GFLOPS/W on FP8-to-FP16 GEMMs, a key deep-learning kernel (see the second sketch after this list).
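To make the rounding argument concrete, below is a minimal NumPy reference model of the expanding sum-of-dot-product. It uses FP16 inputs with an FP32 accumulator as a stand-in for the FP8-to-FP16 case, since NumPy has no 8-bit float type; the chosen values and the float64 round-trip are this sketch's assumptions, not details from the paper. The point is only that a single rounding into the wide format can retain contributions that two chained roundings discard.

```python
import numpy as np

# Stand-in for the ExSdotp semantics: products of narrow (FP16) inputs are
# accumulated into a wider (FP32) destination.  The actual unit targets FP8
# inputs with FP16 accumulation; NumPy has no FP8 type, so this substitutes.
a = np.array([1.0, 1.0], dtype=np.float16)
b = np.array([1.5, 1.5], dtype=np.float16)
acc = np.float32(2.0 ** 25)          # wide-format accumulator

# Chained expanding FMAs: each step rounds to FP32, so each small product
# (1.5) is rounded away against the large accumulator.
t = np.float32(acc + np.float32(a[0]) * np.float32(b[0]))
chained = np.float32(t + np.float32(a[1]) * np.float32(b[1]))

# Fused sum-of-dot-product: compute acc + a0*b0 + a1*b1 exactly (float64 is
# exact at these magnitudes) and round once into the wide format.
exact = np.float64(acc) + np.float64(a[0]) * np.float64(b[0]) \
                        + np.float64(a[1]) * np.float64(b[1])
fused = np.float32(exact)

print(chained)  # 33554432.0 -> both contributions are rounded away
print(fused)    # 33554436.0 -> the combined +3.0 survives the single rounding
```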
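As a second, rougher sketch, the reference GEMM below shows how a low-precision matrix multiplication can be decomposed into two-term expanding dot-products with a wide accumulator, which is the shape of work the SIMD-replicated ExSdotp lanes are built for. The FP16-input/FP32-accumulator stand-in, the helper names, and the scalar loop order are assumptions of this sketch rather than the cluster's actual kernel.

```python
import numpy as np

def exsdotp_ref(acc, a_pair, b_pair):
    """Assumed reference semantics for one expanding sum-of-dot-product:
    acc (wide) + a0*b0 + a1*b1, with a single rounding into the wide format.
    float64 is exact enough for these FP16/FP32 stand-in operands."""
    exact = (np.float64(acc)
             + np.float64(a_pair[0]) * np.float64(b_pair[0])
             + np.float64(a_pair[1]) * np.float64(b_pair[1]))
    return np.float32(exact)

def gemm_lowprec(A, B):
    """C = A @ B with narrow (FP16 stand-in) inputs and wide (FP32)
    accumulation, walking the K dimension two elements at a time, as the
    fused unit would."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % 2 == 0, "K must be even for the two-term dot-product"
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            acc = np.float32(0.0)
            for k in range(0, K, 2):
                acc = exsdotp_ref(acc, A[i, k:k + 2], B[k:k + 2, j])
            C[i, j] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float16)
B = rng.standard_normal((8, 4)).astype(np.float16)
C = gemm_lowprec(A, B)
# Compare against a float64 reference to see the (small) accumulation error.
print(np.max(np.abs(C - A.astype(np.float64) @ B.astype(np.float64))))
```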
Implications
- Advancing RISC-V AI Capability: This work provides a standardized, open-source ISA extension that allows RISC-V cores to effectively compete in the specialized high-efficiency NN training market, previously dominated by proprietary architectures.
- Enabling Mixed-Precision Training: By natively supporting 8-bit FP formats and expanding operations that accumulate into 16-bit, the design establishes the architectural foundation necessary for practical and precise mixed-precision training.
- Hardware Design Optimization: The ExSdotp unit demonstrates that fusing multi-term arithmetic operations both preserves precision at low bit widths and yields significant area and critical-path savings over traditional FMA chaining.
- Foundation for Scalability: Integrating this unit into an open-source core provides a reusable, modular blueprint for designing future scalable cluster architectures dedicated to energy-efficient AI computation.