RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs
Abstract
RedMulE is a compact, parametric FP16 matrix-multiplication accelerator designed to enable online fine-tuning and adaptation of Deep Learning models on ultra-low-power RISC-V-based SoCs. Tightly integrated within a PULP cluster, the engine makes parallel floating-point operations affordable on sub-100 mW extreme-edge devices. Implemented in 22 nm technology, RedMulE achieves a cluster-level energy efficiency of 688 16-bit GFLOPS/W and delivers up to a 22x speedup over software execution on 8 RISC-V cores.
Report
Key Highlights
- Innovation: RedMulE (Reduced-precision matrix Multiplication Engine), a dedicated hardware accelerator for FP16 matrix multiplication (a minimal software reference for this kernel is sketched after this list).
- Target Application: Enabling online fine-tuning and adaptation of general Deep Learning models on extreme-edge, ultra-low-power SoCs (sub-100 mW).
- Integration: Conceived for tight integration within RISC-V clusters based on the PULP (Parallel Ultra-Low-Power) architecture; an illustrative offload sequence also follows this list.
- Energy Efficiency: Achieves a high cluster-level energy efficiency of 688 16-bit GFLOPS/W.
- Performance Gain: Provides up to 4.65x higher energy efficiency and a 22x speedup compared to software execution across 8 RISC-V cores.
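To make the accelerated kernel concrete, here is a minimal software reference for an FP16 matrix multiplication with accumulation (a standard GEMM, Z = X·W + Y), the class of operation a RedMulE-style engine executes in hardware. This is a sketch only: the row-major layout, the function name, and the float-widened accumulation are illustrative assumptions, not a description of RedMulE's internal datapath.

```c
#include <stddef.h>

/* Reference FP16 GEMM: Z = X * W + Y.
 * _Float16 is the ISO TS 18661-3 half-precision type; it requires a
 * compiler/target with FP16 support (recent GCC/Clang). Widening the
 * accumulator to float is a common software precaution -- the
 * accelerator's internal precision may differ. */
void gemm_fp16(size_t M, size_t N, size_t K,
               const _Float16 *X,  /* M x K input matrix,  row-major */
               const _Float16 *W,  /* K x N weight matrix, row-major */
               const _Float16 *Y,  /* M x N accumulator input        */
               _Float16 *Z)        /* M x N output                   */
{
    for (size_t m = 0; m < M; m++) {
        for (size_t n = 0; n < N; n++) {
            float acc = (float)Y[m * N + n];
            for (size_t k = 0; k < K; k++) {
                acc += (float)X[m * K + k] * (float)W[k * N + n];
            }
            Z[m * N + n] = (_Float16)acc;
        }
    }
}
```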
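And to give a feel for how a cluster core might drive such a tightly-coupled engine, below is a purely hypothetical memory-mapped offload sequence in the style of PULP's hardware processing engines (HWPEs). Every macro, register name, and offset here is invented for illustration and does not reflect RedMulE's actual programming interface.

```c
#include <stdint.h>

/* Hypothetical register map -- all names and offsets invented for
 * illustration; not RedMulE's real interface. */
#define ACC_BASE      0x1B201000u  /* assumed peripheral base address      */
#define ACC_REG(off)  (*(volatile uint32_t *)(ACC_BASE + (off)))
#define ACC_X_ADDR    0x00u        /* pointer to X operand in cluster TCDM */
#define ACC_W_ADDR    0x04u        /* pointer to W operand                 */
#define ACC_YZ_ADDR   0x08u        /* pointer to Y accumulator / Z output  */
#define ACC_SHAPE     0x0Cu        /* packed M/N/K tile dimensions         */
#define ACC_TRIGGER   0x10u        /* writing 1 starts the job             */
#define ACC_STATUS    0x14u        /* nonzero while a job is running       */

/* Offload one FP16 GEMM tile and spin-wait for completion.
 * Assumes each dimension fits in 10 bits. */
static void gemm_offload(uint32_t x, uint32_t w, uint32_t yz,
                         uint32_t m, uint32_t n, uint32_t k)
{
    ACC_REG(ACC_X_ADDR)  = x;
    ACC_REG(ACC_W_ADDR)  = w;
    ACC_REG(ACC_YZ_ADDR) = yz;
    ACC_REG(ACC_SHAPE)   = (m << 20) | (n << 10) | k;
    ACC_REG(ACC_TRIGGER) = 1u;
    while (ACC_REG(ACC_STATUS) != 0u) {
        /* A production driver would sleep on a cluster event instead. */
    }
}
```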
Technical Details
- Processing Technology: Implemented using 22 nm technology.
- Precision: Focuses on FP16 (half-precision floating-point) arithmetic, crucial for efficient DL training and inference kernels.
- Configuration Metrics (32-FMA Instance):
  - Maximum Operating Frequency: Up to 666 MHz.
  - Area: Occupies 0.07 mm², only 14% of the total area of an 8-core RISC-V cluster.
  - Utilization/Throughput: Achieves 98.8% utilization, delivering 31.6 MAC/cycle (see the back-of-envelope check after this list).
- Power Consumption: The full cluster, including RedMulE, consumes 43.5 mW.
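As a back-of-envelope check on the metrics above (counting 2 FLOPs per MAC, the usual convention; note that the 688 GFLOPS/W efficiency figure need not have been measured at this maximum-frequency operating point):

31.6 MAC/cycle × 666 MHz × 2 FLOP/MAC ≈ 42.1 16-bit GFLOPS peak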
Implications
- Democratization of Training: By removing the cost barrier that made parallel floating-point operations unaffordable on resource-constrained devices, RedMulE brings practical online learning and adaptive AI to the extreme edge.
- Strengthening RISC-V Ecosystem: Provides a specialized, high-performance compute primitive that significantly extends the capability of RISC-V-based ultra-low-power SoCs (especially PULP-based designs) to compete in the demanding AI hardware landscape.
- Efficiency Benchmark: The demonstrated 688 16-bit GFLOPS/W sets a high bar for energy efficiency in sub-100 mW embedded systems, pushing the boundary of what ultra-low-power deep learning hardware can deliver.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.