RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI

Abstract

This paper introduces a novel RISC-V Custom Function Unit (CFU) accelerator for TinyML applications, targeting the memory-wall cost of Depthwise Separable Convolutions (DSCs). The architecture uses a fused pixel-wise dataflow that eliminates intermediate buffers by streaming data through all DSC stages to complete a single output pixel. This reduces data movement by up to 87% and yields a 59.3x speedup over a RISC-V software baseline, demonstrating the feasibility of zero-buffer TinyML acceleration.

Report

Key Highlights

  • Target Application: Efficient execution of Depthwise Separable Convolutions (DSCs), crucial components in lightweight CNN architectures like MobileNetV2, within Edge AI and TinyML environments.
  • Core Innovation: A novel hardware accelerator implementing a "fused pixel-wise dataflow" architecture, which effectively creates a zero-buffer pipeline.
  • Implementation Strategy: Integrated as a Custom Function Unit (CFU) to extend the standard RISC-V processor capabilities.
  • Performance: Achieves a measured speedup of up to 59.3x compared to baseline software execution on the RISC-V core.
  • Efficiency Gain: Reduces data movement related to intermediate feature maps by up to 87% by eliminating memory write/read operations between DSC stages (expansion, depthwise, projection).
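The fused pixel-wise dataflow described above can be sketched in a few lines. The snippet below is an illustrative NumPy model, not the paper's hardware or its actual loop ordering: for each output pixel, the expansion values needed by the 3x3 depthwise window are computed on the fly, so the expanded and depthwise feature maps are never materialized. All shapes and names are assumptions made for illustration.

```python
import numpy as np

def fused_dsc_pixel(x, w_exp, w_dw, w_proj, y, xi):
    """Compute one output pixel of an expansion -> depthwise(3x3) ->
    projection block without storing intermediate feature maps.

    x:      input feature map, shape (H, W, Cin), zero padding at borders
    w_exp:  (Cin, Cexp) 1x1 expansion weights
    w_dw:   (3, 3, Cexp) depthwise weights
    w_proj: (Cexp, Cout) 1x1 projection weights
    """
    H, W, _ = x.shape
    acc = np.zeros(w_exp.shape[1])
    # Walk the 3x3 depthwise window: expand each neighbour pixel on the
    # fly, scale by the depthwise tap, and accumulate per channel.
    for ky in range(3):
        for kx in range(3):
            yy, xx = y + ky - 1, xi + kx - 1
            if 0 <= yy < H and 0 <= xx < W:
                expanded = x[yy, xx] @ w_exp    # 1x1 expansion
                acc += expanded * w_dw[ky, kx]  # depthwise tap
    return acc @ w_proj                         # 1x1 projection
```

In the layer-by-layer schedule, the `expanded` values would be written to memory after the expansion stage and read back for the depthwise stage; here they live only in the accumulator, which is the essence of the zero-buffer pipeline.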

Technical Details

  • Dataflow Method: Fused pixel-wise dataflow (zero-buffer); streams data across all DSC stages without writing intermediate feature maps to on-chip buffers or DRAM.
  • Implementation Vehicle: RISC-V Custom Function Unit (CFU).
  • Evaluated Platform: Xilinx Artix-7 FPGA.
  • ASIC Projection (High Performance): 28 nm process, 0.284 mm² footprint, 910 mW at 2 GHz.
  • ASIC Projection (Energy Efficient): 40 nm process, 1.20 mm² footprint, 233 mW at 300 MHz.
  • Problem Solved: The "memory wall" bottleneck caused by the latency and energy cost of data transfer during conventional layer-by-layer execution of DSCs.
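As a back-of-the-envelope illustration of why eliminating intermediate feature maps matters, the snippet below counts int8 memory traffic for a hypothetical MobileNetV2-style inverted-residual block (56x56x24 input, expansion factor 6; these dimensions are assumptions, not taken from the paper). The exact saving, like the paper's 87% figure, depends on layer shapes:

```python
# Hypothetical MobileNetV2-style block, int8 activations, stride 1.
H = W = 56
c_in, t, c_out = 24, 6, 24           # expansion factor t = 6
c_exp = c_in * t                     # 144 expanded channels
pix = H * W

# Layer-by-layer schedule: each intermediate map is written once after
# its stage and read once by the next stage.
intermediate = 2 * (pix * c_exp)     # expanded map (write + read)
intermediate += 2 * (pix * c_exp)    # depthwise output (write + read)
io = pix * c_in + pix * c_out        # input read + output write (both schedules)

layerwise_traffic = io + intermediate
fused_traffic = io                   # zero-buffer: intermediates stay in the pipeline
reduction = 1 - fused_traffic / layerwise_traffic
print(f"{reduction:.0%}")            # prints "92%" for this shape
```

The fused schedule pays only the unavoidable input/output traffic; everything between the expansion and projection stages stays in the datapath, which is where the bulk of the layer-by-layer traffic comes from.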

Implications

  1. Validating RISC-V Extensibility: This work demonstrates the flexibility of the RISC-V Custom Function Unit interface, showing that it can host highly specialized, high-performance accelerators for complex neural-network processing flows.
  2. Advancing TinyML Architecture: By proving the viability of a zero-buffer, fused-pipeline dataflow for complex operations like DSCs, the paper offers a critical strategy for overcoming the memory-access bottleneck that plagues energy-constrained edge devices. This approach enables faster execution and significantly lower power use for state-of-the-art lightweight CNNs.
  3. Path to Commercial TinyML Silicon: The compact projected ASIC footprint (0.284 mm²) and low power estimates suggest the design is viable for integration into resource-constrained IoT and edge-device chips, strengthening RISC-V's position as a platform for future TinyML deployment.
