RedMulE: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration
Abstract
RedMulE is a specialized, mixed-precision matrix multiplication engine designed to enable energy-efficient TinyML training, a workload that relies on costly floating-point operations. Integrated into a RISC-V PULP cluster, the engine supports FP16 and hybrid FP8 formats, bringing floating-point training within the power budget of near-sensor applications. It achieves up to 1.67 TFLOPS/W energy efficiency and 117 GFLOPS peak performance (FP8) while the cluster consumes less than 60 mW on average.
Report
Key Highlights
- Enables TinyML Training: RedMulE addresses the challenge of running floating-point-intensive training algorithms (backpropagation) within the strict power budgets (a few tens of mW) of near-sensor TinyML devices.
- Exceptional Energy Efficiency: Reaches up to 1.67 TFLOPS/W for GEMM-Ops using FP8, and 1.19 TFLOPS/W using FP16, at its best-efficiency point (470 MHz, 0.65 V).
- Low Power Consumption: The RedMulE-augmented PULP cluster consumes less than 60 mW on average, even at peak performance.
- High Performance: Delivers up to 117 GFLOPS (FP8) and 58.5 GFLOPS (FP16) at its best performance point (613 MHz, 0.8 V).
- High Utilization: Demonstrates 99.4% utilization of the array of Computing Elements (CEs).
Technical Details
- Architecture: RedMulE (Reduced-Precision Matrix Multiplication Engine) is a specialized accelerator for General Matrix-Matrix Operations (GEMM-Ops).
- Integration: Incorporated into a Parallel Ultra-Low-Power (PULP) cluster featuring eight energy-efficient RISC-V cores and a shared tightly-coupled data memory (TCDM).
- Process Technology: Implemented using 22 nm fabrication technology.
- Supported Precisions: Supports standard FP16 as well as two custom hybrid FP8 formats (see the decoding sketch after this list):
  - FP8-A: {sign, exponent, mantissa} = {1, 4, 3}
  - FP8-B: {sign, exponent, mantissa} = {1, 5, 2}
- Operation Modes: Handles both standard GEMM and the broader GEMM-Ops class, in which the inner multiplication and the outer accumulation are replaced by other binary operators, at comparable efficiency; a minimal reference sketch follows this list.
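
To make the two FP8 layouts concrete, the sketch below decodes an 8-bit pattern under each {sign, exponent, mantissa} split into a single-precision value. This is an illustrative host-side reference, not the RedMulE datapath; the IEEE-style exponent biases (7 for FP8-A, 15 for FP8-B) and the omission of Inf/NaN handling are assumptions of this sketch.

```c
/* Reference decoder for the two hybrid FP8 layouts listed above.
 * Assumes an IEEE-style bias of 2^(exp_bits-1) - 1 and does not
 * handle Inf/NaN encodings. Compile with -lm for ldexpf. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

static float fp8_decode(uint8_t bits, int exp_bits, int man_bits) {
    int bias     = (1 << (exp_bits - 1)) - 1;
    int sign     = (bits >> (exp_bits + man_bits)) & 0x1;
    int exponent = (bits >> man_bits) & ((1 << exp_bits) - 1);
    int mantissa = bits & ((1 << man_bits) - 1);
    float value;

    if (exponent == 0) {
        /* Subnormal: no implicit leading 1. */
        value = ldexpf((float)mantissa, 1 - bias - man_bits);
    } else {
        /* Normal: implicit leading 1 ahead of the mantissa bits. */
        value = ldexpf((float)((1 << man_bits) | mantissa),
                       exponent - bias - man_bits);
    }
    return sign ? -value : value;
}

int main(void) {
    uint8_t x = 0x44;  /* example bit pattern */
    printf("FP8-A {1,4,3}: %f\n", fp8_decode(x, 4, 3));  /* 3.0 */
    printf("FP8-B {1,5,2}: %f\n", fp8_decode(x, 5, 2));  /* 4.0 */
    return 0;
}
```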
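The GEMM-Ops generalization mentioned above can be summarized as Z = (X op1 W) op2 Y, which reduces to a standard GEMM when op1 is multiply and op2 is add. The minimal C sketch below shows that semantics on the host; the function names, the float32 datatype, and the row-major layout are illustrative assumptions and do not reflect the accelerator's programming interface.

```c
/* Minimal reference for a GEMM-Op: Z = (X op1 W) op2 Y, with Y seeding
 * the reduction. Illustrative sketch only, not the RedMulE interface. */
#include <stddef.h>

typedef float (*binop_t)(float, float);

static float op_mul(float a, float b) { return a * b; }
static float op_add(float a, float b) { return a + b; }
static float op_max(float a, float b) { return a > b ? a : b; }

/* X is M x K, W is K x N, Y and Z are M x N, all row-major. */
static void gemm_op(const float *X, const float *W, const float *Y, float *Z,
                    size_t M, size_t N, size_t K, binop_t op1, binop_t op2) {
    for (size_t m = 0; m < M; m++) {
        for (size_t n = 0; n < N; n++) {
            float acc = Y[m * N + n];              /* seed with Y */
            for (size_t k = 0; k < K; k++) {
                acc = op2(acc, op1(X[m * K + k], W[k * N + n]));
            }
            Z[m * N + n] = acc;
        }
    }
}
```

Passing op_mul/op_add computes the usual Z = X·W + Y, while other operator pairs (e.g., op_add with op_max) exercise the generalized, non-multiply-accumulate path.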
Implications
- Breakthrough for Edge AI: By providing high-throughput, energy-efficient floating-point computation, RedMulE makes the critical transition from TinyML inference to full on-device training viable, significantly enhancing the autonomy and adaptability of near-sensor devices.
- Advancing the RISC-V Ecosystem: This work validates the strength of the RISC-V PULP architecture as a foundation for highly specialized, domain-specific accelerators, providing a high-performance linear algebra engine that complements the general-purpose RISC-V cores.
- Future of Low-Power Compute: The extreme efficiency achieved (TFLOPS/W level in the ULP domain) sets a new benchmark for energy-efficient computing, encouraging wider adoption of reduced and mixed-precision floating-point formats in hardware design for AI acceleration.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.