RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

Abstract

RedMulE is a compact, parametric FP16 matrix-multiplication accelerator designed to enable online fine-tuning and adaptation of Deep Learning models on ultra-low-power RISC-V-based SoCs. Tightly integrated within a PULP cluster, the engine addresses the challenge of making parallel floating-point computation affordable on sub-100 mW extreme-edge devices. Implemented in 22 nm technology, RedMulE achieves a cluster-level energy efficiency of 688 16-bit GFLOPS/W and delivers up to a 22x speedup over software execution on 8 RISC-V cores.

Report

Key Highlights

  • Innovation: RedMulE (Reduced-precision matrix Multiplication Engine), a dedicated hardware accelerator for FP16 matrix multiplication.
  • Target Application: Enabling online fine-tuning and adaptation of general Deep Learning models on extreme-edge, ultra-low-power (sub-100 mW) SoCs.
  • Integration: Conceived for tight integration within RISC-V clusters utilizing the PULP (Parallel Ultra-Low-Power) architecture.
  • Energy Efficiency: Achieves a high cluster-level energy efficiency of 688 16-bit GFLOPS/W.
  • Performance Gain: Provides up to 4.65x higher energy efficiency and a 22x speedup compared to software execution across 8 RISC-V cores.
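For reference, the kernel class RedMulE accelerates can be expressed as a plain generalized matrix multiplication (GEMM) in software. The sketch below is illustrative only: the function name is hypothetical, and the accelerator's internal accumulation precision is an assumption (here modeled as float32 before rounding back to FP16).

```python
import numpy as np

def gemm_fp16(w, x, y):
    """Reference FP16 GEMM, Z = W @ X + Y: the operation class that
    RedMulE offloads from the RISC-V cores. Accumulation is done in
    float32 and rounded back to float16; the hardware's internal
    precision may differ from this software model."""
    assert w.dtype == x.dtype == y.dtype == np.float16
    z = (w.astype(np.float32) @ x.astype(np.float32)
         + y.astype(np.float32))
    return z.astype(np.float16)
```

In software on the cluster's RISC-V cores, this loop nest is what costs up to 22x more cycles than dispatching the same operation to the accelerator.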

Technical Details

  • Processing Technology: Implemented using 22 nm technology.
  • Precision: Focuses on FP16 (Half-precision floating-point) arithmetic, crucial for efficient DL training and inference kernels.
  • Configuration Metrics (32-FMA Instance):
    • Maximum Operating Frequency: Up to 666 MHz.
    • Area Overhead: Occupies 0.07 mm².
    • Area Ratio: Represents only 14% of the total area of an 8-core RISC-V cluster.
    • Utilization/Throughput: Achieves 98.8% utilization, delivering 31.6 MAC/cycle.
  • Power Consumption: The full cluster, including RedMulE, exhibits a power consumption of 43.5 mW.

Implications

  • Democratization of Training: RedMulE removes the barrier that has made parallel floating-point computation unaffordable on resource-constrained devices, making online learning and adaptive AI practical at the extreme edge.
  • Strengthening RISC-V Ecosystem: Provides a specialized, high-performance computing primitive that greatly enhances the capability of RISC-V-based ultra-low-power SoCs (especially those based on PULP) to compete in the demanding AI hardware landscape.
  • Efficiency Benchmark: The demonstrated efficiency (688 GFLOPS/W) sets a high standard for energy performance in sub-100 mW embedded systems, pushing the boundaries of what is possible in ultra-low-power deep learning hardware.