MXDOTP: A RISC-V ISA Extension for Enabling Microscaling (MX) Floating-Point Dot Products
Abstract
MXDOTP is a RISC-V ISA extension that accelerates dot products in the energy-efficient and highly accurate 8-bit Microscaling Floating-Point (MXFP8) format. It integrates a dedicated MXFP8 dot product-accumulate unit into the open-source Snitch RISC-V core and keeps that unit highly utilized by feeding it through Stream Semantic Registers (SSRs). Compared with a conventional software baseline, the extension delivers a 25x speedup and 12.5x better energy efficiency.
Report
Key Highlights
- MXDOTP is the first proposed RISC-V Instruction Set Architecture (ISA) extension specifically targeting Microscaling (MX) floating-point dot products.
- The primary focus is on accelerating the 8-bit format, MXFP8, which uses a block-wise shared exponent scale for improved accuracy in AI applications.
- The extension was implemented by augmenting the existing open-source Snitch RISC-V core with a dedicated MXFP8 dot product-accumulate unit.
- The specialized hardware achieves substantial benefits, demonstrating a 25x speedup and 12.5x improved energy efficiency relative to a software baseline implementation (which relies on casting FP8 inputs to FP32 for accumulation).
- An 8-core cluster implemented in 12 nm FinFET technology achieves high energy efficiency, peaking at 356 GFLOPS/W during MXFP8 matrix multiplication.
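
The software baseline described in the highlights (casting FP8 inputs to FP32 for accumulation) can be sketched roughly as follows. This is a minimal Python reference, not the authors' code. It assumes the E4M3 element encoding (MXFP8 also has an E5M2 variant) and an E8M0 power-of-two block scale with bias 127; all function names are hypothetical.

```python
import struct

E4M3_BIAS = 7     # exponent bias of the FP8 E4M3 element format (assumed variant)
E8M0_BIAS = 127   # exponent bias of the E8M0 shared block scale


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")  # E4M3 encodes NaN but no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - E4M3_BIAS)  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - E4M3_BIAS)


def decode_e8m0(byte):
    """Decode an E8M0 block scale: a pure power of two, 0xFF is NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - E8M0_BIAS)


def mx_dot_software(a_bytes, b_bytes, scale_a, scale_b, acc=0.0):
    """Hypothetical baseline: cast each FP8 element to FP32, accumulate in
    FP32, then apply the two block scales explicitly."""
    block_sum = 0.0
    for a, b in zip(a_bytes, b_bytes):
        prod = to_fp32(decode_e4m3(a) * decode_e4m3(b))
        block_sum = to_fp32(block_sum + prod)  # FP32 rounding at every step
    scale = to_fp32(decode_e8m0(scale_a) * decode_e8m0(scale_b))
    return to_fp32(to_fp32(acc) + to_fp32(scale * block_sum))


ones = [0x38] * 8   # 0x38 encodes 1.0 in E4M3
twos = [0x40] * 8   # 0x40 encodes 2.0 in E4M3
result = mx_dot_software(ones, twos, scale_a=127, scale_b=128, acc=1.0)
```

Every element costs a decode, a multiply, and a rounded add in this version, and the scaling is a separate step. That per-element overhead is what a dedicated unit removes.
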
Technical Details
- Target Format: The extension is optimized for Microscaling (MX) FP8, which utilizes a block-wise shared exponent to maintain precision at low bitwidths.
- Integration Method: MXDOTP extends the Snitch core architecture by adding a dedicated hardware unit without requiring modifications to the core's register file.
- Data Consumption: The unit is designed to fully consume blocks of eight 8-bit operands packed into 64-bit inputs.
- Data Streaming/Utilization: To keep the unit busy, the architecture uses Snitch's existing Stream Semantic Registers (SSRs), which supply four operands per cycle (including block scales) and sustain up to 80% unit utilization.
- Area Overhead: The implementation requires only a 5.1% area increase on the Snitch core.
- Operating Conditions: The 356 GFLOPS/W figure was obtained at 0.8 V and 1 GHz.
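
To make the data layout above concrete, here is a small functional model of one dot product-accumulate step on two packed 64-bit operands. It is a sketch under stated assumptions, not a description of the actual datapath: the lane order (lane 0 in the low byte), the E4M3 element variant, passing the two scales as separate arguments, and rounding to FP32 only once at the end are all assumptions, and every name is hypothetical.

```python
import struct
from fractions import Fraction

LANES = 8  # eight 8-bit elements per 64-bit operand, as stated above


def pack_lanes(elems):
    """Pack eight FP8 bytes into one 64-bit word (lane 0 in the low byte)."""
    assert len(elems) == LANES
    word = 0
    for i, e in enumerate(elems):
        word |= (e & 0xFF) << (8 * i)
    return word


def unpack_lanes(word):
    """Split a 64-bit word back into its eight FP8 bytes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def e4m3_exact(byte):
    """Exact value of an FP8 E4M3 byte as a Fraction (NaN is rejected)."""
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 0xF and man == 0x7:
        raise ValueError("NaN is not modeled in this sketch")
    sign = -1 if byte & 0x80 else 1
    if exp == 0:
        return sign * Fraction(man, 8) * Fraction(2) ** (1 - 7)  # subnormal
    return sign * (1 + Fraction(man, 8)) * Fraction(2) ** (exp - 7)


def e8m0_exact(byte):
    """Exact value of an E8M0 block scale (NaN code 0xFF is rejected)."""
    if byte == 0xFF:
        raise ValueError("NaN is not modeled in this sketch")
    return Fraction(2) ** (byte - 127)


def mxdotp_model(word_a, word_b, scale_a, scale_b, acc):
    """Scaled 8-lane dot product added to an FP32 accumulator."""
    dot = sum(e4m3_exact(a) * e4m3_exact(b)
              for a, b in zip(unpack_lanes(word_a), unpack_lanes(word_b)))
    exact = Fraction(acc) + e8m0_exact(scale_a) * e8m0_exact(scale_b) * dot
    # Convert to FP32 via FP64 (a simplification; rare double-rounding
    # corner cases are ignored).
    return struct.unpack("<f", struct.pack("<f", float(exact)))[0]


word_a = pack_lanes([0x38] * 8)  # eight lanes of 1.0
word_b = pack_lanes([0x40] * 8)  # eight lanes of 2.0
out = mxdotp_model(word_a, word_b, scale_a=127, scale_b=128, acc=1.0)
```

As a rough cross-check of the utilization figure, if each lane is counted as one multiply and one add (16 FLOP per instruction, an assumed counting convention), eight cores issuing one such instruction per cycle at 1 GHz would peak at 128 GFLOPS, and 80% utilization would correspond to roughly 102 GFLOPS.
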
Implications
- Enabling MX Formats: MXDOTP provides essential hardware support for Microscaling formats, facilitating their adoption as a promising standard for low-bitwidth, high-accuracy AI computation.
- RISC-V Specialization: The work shows that a targeted RISC-V ISA extension can deliver large efficiency gains for emerging data types and workloads such as neural network inference: here, a 25x speedup and 12.5x better energy efficiency for a 5.1% area increase.
- Competitive Edge in AI Hardware: The reported 12.5x energy efficiency improvement makes MXDOTP-extended RISC-V cores attractive candidates for energy-constrained edge AI and high-throughput data center inference, since it lowers the power spent on core linear algebra operations.
- Efficient Low-Bitwidth Computing: Because the unit performs the block scaling inherent to MX formats directly in hardware, software no longer has to emulate MX dot products by casting FP8 inputs to FP32, which streamlines the computational pipeline.
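
The scaling handled in hardware can be written out explicitly. The following is a sketch that assumes the usual MX block dot-product definition, with 8-bit biased scale exponents $X_A$ and $X_B$ (E8M0, bias 127) shared across the block and FP8 elements $a_i$, $b_i$. The symbol names are ours, and $k$ is the number of element pairs covered by one scale pair.

```latex
\[
\mathrm{acc}' \;=\; \mathrm{acc} \;+\; 2^{X_A - 127}\cdot 2^{X_B - 127}\sum_{i=0}^{k-1} a_i\, b_i
\;=\; \mathrm{acc} \;+\; 2^{X_A + X_B - 254}\sum_{i=0}^{k-1} a_i\, b_i
\]
```

Because both scales are powers of two, applying them amounts to one exponent addition per block instead of a multiplication per element, which is why fusing the scaling with the accumulation is cheap in hardware.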