RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI
Abstract
This paper introduces a novel RISC-V Custom Function Unit (CFU) accelerator designed for TinyML applications, targeting the memory-wall cost of Depthwise Separable Convolutions (DSCs). The architecture uses a fused pixel-wise dataflow that eliminates intermediate buffers by streaming data across all DSC stages to complete a single output pixel. This method reduces intermediate data movement by up to 87% and achieves a 59.3x speedup over the RISC-V software baseline, confirming the feasibility of zero-buffer TinyML acceleration.
Report
Key Highlights
- Target Application: Efficient execution of Depthwise Separable Convolutions (DSCs), crucial components in lightweight CNN architectures like MobileNetV2, within Edge AI and TinyML environments.
- Core Innovation: A novel hardware accelerator implementing a "fused pixel-wise dataflow" architecture, which effectively creates a zero-buffer pipeline.
- Implementation Strategy: Integrated as a Custom Function Unit (CFU) to extend the standard RISC-V processor capabilities.
- Performance: Achieves a measured speedup of up to 59.3x compared to baseline software execution on the RISC-V core.
- Efficiency Gain: Reduces data movement related to intermediate feature maps by up to 87% by eliminating memory write/read operations between DSC stages (expansion, depthwise, projection).
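The fusion idea in the highlights above can be sketched in software. The following is an illustrative toy, not the paper's implementation: dimensions and weights are invented, and where the hardware pipeline streams expanded activations between stages, this sketch simply recomputes the expansion per neighbourhood pixel to keep the code short. The point it demonstrates is the dataflow itself: one output pixel is carried through expansion, depthwise, and projection end to end, so no intermediate feature map is ever written to memory.

```c
#include <stdint.h>

/* Toy dimensions: 4x4 map, 2 input channels, 4 expanded channels,
   2 output channels, 3x3 depthwise with zero padding. */
#define H 4
#define W 4
#define C_IN 2
#define C_EXP 4
#define C_OUT 2

/* Hypothetical weights, chosen only to make the arithmetic checkable. */
static const int8_t w_expand[C_EXP][C_IN] = {{1, 2}, {-1, 1}, {2, 0}, {0, 3}};
static const int8_t w_dw[C_EXP][3][3] = {
    {{1, 0, 1}, {0, 1, 0}, {1, 0, 1}},
    {{0, 1, 0}, {1, 1, 1}, {0, 1, 0}},
    {{1, 1, 1}, {0, 0, 0}, {1, 1, 1}},
    {{2, 0, 0}, {0, 2, 0}, {0, 0, 2}}};
static const int8_t w_proj[C_OUT][C_EXP] = {{1, 1, 0, 0}, {0, 1, 1, 1}};

/* Stage 1, expansion (1x1 conv): a pure channel mix at one pixel. */
static void expand_pixel(const int8_t in[H][W][C_IN], int y, int x,
                         int32_t out[C_EXP]) {
    for (int e = 0; e < C_EXP; e++) {
        int32_t acc = 0;
        for (int c = 0; c < C_IN; c++)
            acc += (int32_t)w_expand[e][c] * in[y][x][c];
        out[e] = acc;
    }
}

/* Fused pixel-wise DSC: expanded activations for the 3x3 neighbourhood
   are produced and consumed immediately, so no intermediate feature
   map is written to memory. */
static void dsc_fused_pixel(const int8_t in[H][W][C_IN], int y, int x,
                            int32_t out[C_OUT]) {
    int32_t dw_acc[C_EXP] = {0};
    for (int ky = -1; ky <= 1; ky++)
        for (int kx = -1; kx <= 1; kx++) {
            int yy = y + ky, xx = x + kx;
            if (yy < 0 || yy >= H || xx < 0 || xx >= W)
                continue;                        /* zero padding */
            int32_t e_px[C_EXP];
            expand_pixel(in, yy, xx, e_px);      /* stage 1 */
            for (int e = 0; e < C_EXP; e++)      /* stage 2: depthwise */
                dw_acc[e] += (int32_t)w_dw[e][ky + 1][kx + 1] * e_px[e];
        }
    for (int o = 0; o < C_OUT; o++) {            /* stage 3: projection */
        int32_t acc = 0;
        for (int e = 0; e < C_EXP; e++)
            acc += (int32_t)w_proj[o][e] * dw_acc[e];
        out[o] = acc;
    }
}

/* Layer-by-layer reference: materialises both intermediate maps --
   exactly the traffic the fused dataflow eliminates. */
static void dsc_reference(const int8_t in[H][W][C_IN],
                          int32_t out[H][W][C_OUT]) {
    static int32_t e_map[H][W][C_EXP], d_map[H][W][C_EXP];
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            expand_pixel(in, y, x, e_map[y][x]);         /* write map 1 */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            for (int e = 0; e < C_EXP; e++) {
                int32_t acc = 0;
                for (int ky = -1; ky <= 1; ky++)
                    for (int kx = -1; kx <= 1; kx++) {
                        int yy = y + ky, xx = x + kx;
                        if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                            acc += (int32_t)w_dw[e][ky + 1][kx + 1] *
                                   e_map[yy][xx][e];
                    }
                d_map[y][x][e] = acc;                    /* write map 2 */
            }
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            for (int o = 0; o < C_OUT; o++) {
                int32_t acc = 0;
                for (int e = 0; e < C_EXP; e++)
                    acc += (int32_t)w_proj[o][e] * d_map[y][x][e];
                out[y][x][o] = acc;
            }
}
```

Both paths compute identical results; the only difference is that the reference writes and re-reads two full `H x W x C_EXP` maps, which is the traffic the fused dataflow removes.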
Technical Details
| Specification | Detail |
|---|---|
| Dataflow Method | Fused pixel-wise dataflow (zero-buffer); streams data across all DSC stages without writing intermediate feature maps to buffers or DRAM. |
| Implementation Vehicle | RISC-V Custom Function Unit (CFU). |
| Evaluated Platform | Xilinx Artix-7 FPGA. |
| ASIC Projection (High Performance) | 28 nm process, 0.284 mm² footprint, 910 mW power consumption at 2 GHz. |
| ASIC Projection (Energy Efficient) | 40 nm process, 1.20 mm² footprint, 233 mW power consumption at 300 MHz. |
| Problem Solved | The "memory wall" bottleneck caused by high latency and energy costs of data transfer during conventional layer-by-layer execution of DSCs. |
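A back-of-envelope model makes the memory-wall row concrete. The dimensions below are hypothetical (a MobileNetV2-like inverted-residual block at int8 precision) and the model is deliberately crude: it counts only feature-map traffic, ignoring weights and caching, and is not the paper's measurement methodology, so it does not reproduce the reported 87% figure, only the same order of effect.

```c
/* Layer-by-layer execution writes the expanded map after the 1x1
   expansion, reads it back for the depthwise stage, writes the
   depthwise map, and reads it again for the projection: two
   intermediate maps, each crossing the memory boundary twice.
   The fused dataflow moves only the block's input and output. */
static double traffic_saved_pct(long h, long w,
                                long c_in, long c_exp, long c_out) {
    long intermediate = 2L * 2L * h * w * c_exp;   /* 2 maps x 2 touches, bytes at int8 */
    long io = h * w * (c_in + c_out);              /* input read + output write */
    return 100.0 * (double)intermediate / (double)(io + intermediate);
}
```

For a hypothetical 56x56 block with 24 input channels, expansion factor 6 (144 expanded channels), and 24 output channels, `traffic_saved_pct(56, 56, 24, 144, 24)` lands a little above 92%, illustrating why intermediate maps dominate a DSC block's traffic.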
Implications
- Validating RISC-V Extensibility: This work demonstrates the flexibility of the RISC-V Custom Function Unit interface, showing it can host a highly specialized, high-performance accelerator for a complex neural-network processing flow.
- Advancing TinyML Architecture: By proving the viability of a zero-buffer, fused-pipeline dataflow for complex operations like DSCs, the paper offers a critical strategy for overcoming the memory-access bottleneck that plagues energy-constrained edge devices. This approach enables faster execution and significantly lower power use for state-of-the-art lightweight CNNs.
- Path to Commercial TinyML Silicon: The compact ASIC footprint (0.284 mm²) and low power consumption estimates validate that this accelerator design is commercially viable for integration into resource-constrained IoT and edge device chips, positioning RISC-V as a dominant platform for future TinyML deployment.
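The CFU extensibility point rests on a simple software contract: each custom instruction receives two register operands and returns one result, with a function code selecting the operation. The sketch below is a host-side mock of one such hypothetical instruction, a packed int8 multiply-accumulate; the opcode names, the `cfu_op` function, and the MAC4 behavior are all invented for illustration and do not describe the paper's actual instruction set.

```c
#include <stdint.h>

/* Hypothetical function codes for a mocked CFU. */
enum { CFU_RESET = 0, CFU_MAC4 = 1, CFU_READ = 2 };

static int32_t cfu_acc;  /* models an accumulator register inside the CFU */

/* Mock of the CFU call contract: funct selects the operation,
   two 32-bit operands go in, one 32-bit result comes back.
   On real hardware this would compile to a single custom instruction. */
static uint32_t cfu_op(uint32_t funct, uint32_t rs1, uint32_t rs2) {
    switch (funct) {
    case CFU_RESET:
        cfu_acc = 0;
        return 0;
    case CFU_MAC4:  /* four packed int8 activations x four int8 weights */
        for (int i = 0; i < 4; i++) {
            int8_t a = (int8_t)(rs1 >> (8 * i));
            int8_t w = (int8_t)(rs2 >> (8 * i));
            cfu_acc += (int32_t)a * (int32_t)w;
        }
        return (uint32_t)cfu_acc;
    case CFU_READ:
        return (uint32_t)cfu_acc;
    default:
        return 0;
    }
}
```

Packing four int8 values per operand is a common way to feed a MAC-style CFU through the narrow two-operand interface; software packs activations and weights into words, and the accelerator keeps wide accumulator state internally across calls.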
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.