RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI

Abstract

This paper introduces a novel RISC-V Custom Function Unit (CFU) accelerator for TinyML applications, targeting the memory-wall cost of Depthwise Separable Convolutions (DSCs). The architecture uses a fused pixel-wise dataflow that eliminates intermediate buffers by streaming data through all DSC stages to complete a single output pixel. This reduces data movement by up to 87% and yields a 59.3x speedup over a RISC-V software baseline, demonstrating the feasibility of zero-buffer TinyML acceleration.

Report

Key Highlights

  • Target Application: Efficient execution of Depthwise Separable Convolutions (DSCs), crucial components in lightweight CNN architectures like MobileNetV2, within Edge AI and TinyML environments.
  • Core Innovation: A novel hardware accelerator implementing a "fused pixel-wise dataflow" architecture, which effectively creates a zero-buffer pipeline.
  • Implementation Strategy: Integrated as a Custom Function Unit (CFU) to extend the standard RISC-V processor capabilities.
  • Performance: Achieves a measured speedup of up to 59.3x compared to baseline software execution on the RISC-V core.
  • Efficiency Gain: Reduces data movement related to intermediate feature maps by up to 87% by eliminating memory write/read operations between DSC stages (expansion, depthwise, projection).
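The fused pixel-wise dataflow described above can be sketched in a few lines. The snippet below is an illustrative NumPy model, not the paper's hardware or its actual loop ordering: for each output pixel, the expansion values needed by the 3x3 depthwise window are computed on the fly, so the expanded and depthwise feature maps are never materialized. All shapes and names are assumptions made for illustration.

```python
import numpy as np

def fused_dsc_pixel(x, w_exp, w_dw, w_proj, y, xi):
    """Compute one output pixel of an expansion -> depthwise(3x3) ->
    projection block without storing intermediate feature maps.

    x:      input feature map, shape (H, W, Cin), zero padding at borders
    w_exp:  (Cin, Cexp) 1x1 expansion weights
    w_dw:   (3, 3, Cexp) depthwise weights
    w_proj: (Cexp, Cout) 1x1 projection weights
    """
    H, W, _ = x.shape
    acc = np.zeros(w_exp.shape[1])
    # Walk the 3x3 depthwise window: expand each neighbour pixel on the
    # fly, scale by the depthwise tap, and accumulate per channel.
    for ky in range(3):
        for kx in range(3):
            yy, xx = y + ky - 1, xi + kx - 1
            if 0 <= yy < H and 0 <= xx < W:
                expanded = x[yy, xx] @ w_exp    # 1x1 expansion
                acc += expanded * w_dw[ky, kx]  # depthwise tap
    return acc @ w_proj                         # 1x1 projection
```

In the layer-by-layer schedule, the `expanded` values would be written to memory after the expansion stage and read back for the depthwise stage; here they live only in the accumulator, which is the essence of the zero-buffer pipeline.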

Technical Details

  • Dataflow Method: Fused pixel-wise dataflow (zero-buffer); streams data across all DSC stages without writing intermediate feature maps to on-chip buffers or DRAM.
  • Implementation Vehicle: RISC-V Custom Function Unit (CFU).
  • Evaluated Platform: Xilinx Artix-7 FPGA.
  • ASIC Projection (High Performance): 28 nm process, 0.284 mm² footprint, 910 mW at 2 GHz.
  • ASIC Projection (Energy Efficient): 40 nm process, 1.20 mm² footprint, 233 mW at 300 MHz.
  • Problem Solved: The "memory wall" bottleneck caused by the latency and energy cost of data transfer during conventional layer-by-layer execution of DSCs.
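As a back-of-the-envelope illustration of why eliminating intermediate feature maps matters, the snippet below counts int8 memory traffic for a hypothetical MobileNetV2-style inverted-residual block (56x56x24 input, expansion factor 6; these dimensions are assumptions, not taken from the paper). The exact saving, like the paper's 87% figure, depends on layer shapes:

```python
# Hypothetical MobileNetV2-style block, int8 activations, stride 1.
H = W = 56
c_in, t, c_out = 24, 6, 24           # expansion factor t = 6
c_exp = c_in * t                     # 144 expanded channels
pix = H * W

# Layer-by-layer schedule: each intermediate map is written once after
# its stage and read once by the next stage.
intermediate = 2 * (pix * c_exp)     # expanded map (write + read)
intermediate += 2 * (pix * c_exp)    # depthwise output (write + read)
io = pix * c_in + pix * c_out        # input read + output write (both schedules)

layerwise_traffic = io + intermediate
fused_traffic = io                   # zero-buffer: intermediates stay in the pipeline
reduction = 1 - fused_traffic / layerwise_traffic
print(f"{reduction:.0%}")            # prints "92%" for this shape
```

The fused schedule pays only the unavoidable input/output traffic; everything between the expansion and projection stages stays in the datapath, which is where the bulk of the layer-by-layer traffic comes from.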

Implications

  1. Validating RISC-V Extensibility: This work demonstrates the flexibility of the RISC-V Custom Function Unit interface, showing that it can host highly specialized, high-performance accelerators for complex neural-network processing flows.
  2. Advancing TinyML Architecture: By proving the viability of a zero-buffer, fused-pipeline dataflow for complex operations like DSCs, the paper offers a critical strategy for overcoming the memory-access bottleneck that plagues energy-constrained edge devices. This approach enables faster execution and significantly lower power use for state-of-the-art lightweight CNNs.
  3. Path to Commercial TinyML Silicon: The compact projected ASIC footprint (0.284 mm²) and low power estimates suggest the design is viable for integration into resource-constrained IoT and edge-device chips, strengthening RISC-V's position as a platform for future TinyML deployment.
