DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training
Abstract
DARKSIDE is a System-on-Chip featuring a heterogeneous cluster of eight RISC-V cores designed for extreme-edge (TinyML) DNN inference and training, integrating 2-bit to 32-bit mixed-precision integer capabilities. To boost performance, the cluster includes dedicated accelerators, notably a depthwise convolution engine and a 16-bit floating-point Tensor Product Engine (TPE). Implemented in 65nm CMOS, DARKSIDE achieves high efficiency, reaching 835 GOPS/W for 2-bit integer kernels and 300 GFLOPS/W for floating-point tensor operations, enabling competitive on-chip training speeds.
Report
DARKSIDE Technical Report
Key Highlights
- Target Application: Extreme-Edge (TinyML) on-chip DNN inference and training, meeting stringent latency, throughput, and energy requirements.
- Architecture: A heterogeneous compute cluster comprising 8 DSP-enhanced RISC-V cores and specialized digital accelerators.
- Peak Integer Efficiency: Achieves 835 GOPS/W when utilizing 2-bit integer DNN kernels, demonstrating ultra-low power operation.
- Floating-Point Capability: Integrates a specialized 16-bit Floating Point Tensor Product Engine (TPE) enabling high-efficiency performance (300 GFLOPS/W) necessary for on-chip training.
- Mixed Precision Support: Supports a wide range of integer precisions, from 2-bit up to 32-bit, within the enhanced RISC-V cores (see the packed-arithmetic sketch after this list).
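To make the mixed-precision idea concrete, the sketch below shows, in plain C, the kind of packed 2-bit dot product that the DSP-enhanced cores accelerate with SIMD MAC instructions. The packing scheme (sixteen signed 2-bit values per 32-bit word) and the function names are illustrative assumptions for this summary, not the DARKSIDE ISA.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative packing: sixteen signed 2-bit values per 32-bit word.
 * On a DSP-enhanced core this loop body would collapse into a few
 * SIMD dot-product/MAC instructions; here it is spelled out in plain C. */
static inline int32_t extract_s2(uint32_t word, int lane)
{
    /* Move the 2-bit lane to the top bits, then arithmetic-shift back
     * down so the value is sign-extended into a full int32_t. */
    return (int32_t)(word << (30 - 2 * lane)) >> 30;
}

/* Dot product of packed 2-bit weights and 2-bit activations, accumulated
 * at 32-bit precision: the "2-b to 32-b" mixed-precision pattern. */
int32_t dot_s2_s2(const uint32_t *w, const uint32_t *x, int n_words)
{
    int32_t acc = 0;
    for (int i = 0; i < n_words; ++i)
        for (int lane = 0; lane < 16; ++lane)
            acc += extract_s2(w[i], lane) * extract_s2(x[i], lane);
    return acc;
}

int main(void)
{
    /* Two packed words = 32 two-bit elements per operand (example data). */
    uint32_t w[2] = {0x1B1B1B1Bu, 0x00000000u};
    uint32_t x[2] = {0xFFFFFFFFu, 0x55555555u};
    printf("acc = %d\n", dot_s2_s2(w, x, 2));
    return 0;
}
```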
Technical Details
- Core Configuration: 8 RISC-V cores integrated into a cluster, enhanced for mixed-precision integer arithmetic (2-b to 32-b).
- Fabrication Technology: Implemented in 65nm CMOS technology.
- Integer Performance: Achieves a peak integer performance of 65 GOPS.
- Accelerators Included: The cluster is enriched with three dedicated digital accelerators:
  - A specialized engine for low-data-reuse depthwise convolution kernels, delivering up to 30 MAC/cycle (see the reference kernel sketched after this list).
  - A minimal-overhead datamover for on-the-fly marshaling of data between 1-b and 32-b formats.
  - A 16-bit Floating Point Tensor Product Engine (TPE) optimized for tiled matrix-multiplication acceleration.
- Floating-Point Performance: The TPE delivers up to 18.2 GFLOPS at an efficiency of 300 GFLOPS/W (a tiled matrix-multiplication sketch follows this list).
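As a point of reference for what the depthwise engine offloads, below is a minimal, unoptimized 3x3 depthwise convolution in C: each input channel is filtered by its own kernel, so weights are never shared across channels, which is exactly the low-data-reuse pattern that maps poorly onto the cores' SIMD datapath. The layout, sizes, and function name are illustrative assumptions, not DARKSIDE's driver API.

```c
#include <stdint.h>

/* Reference 3x3 depthwise convolution, CHW layout, stride 1, no padding.
 * Each of the `channels` input channels has its own 3x3 kernel
 * (weights[c][ky][kx]), so no weight is reused across channels:
 * the low-data-reuse pattern the dedicated engine targets. */
void depthwise_conv3x3_s8(const int8_t *in, const int8_t *weights,
                          int32_t *out, int channels, int height, int width)
{
    const int out_h = height - 2;
    const int out_w = width - 2;

    for (int c = 0; c < channels; ++c) {
        const int8_t *in_c = in + c * height * width;
        const int8_t *w_c  = weights + c * 9;
        int32_t *out_c     = out + c * out_h * out_w;

        for (int oy = 0; oy < out_h; ++oy) {
            for (int ox = 0; ox < out_w; ++ox) {
                int32_t acc = 0;
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx)
                        acc += in_c[(oy + ky) * width + (ox + kx)]
                             * w_c[ky * 3 + kx];
                out_c[oy * out_w + ox] = acc;
            }
        }
    }
}
```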
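The TPE targets tiled matrix multiplication, the dominant kernel in both the forward and backward passes of DNN training. The sketch below shows the tiling structure in plain C, with `float` used as a stand-in for the TPE's 16-bit floating-point datapath; the tile size and function name are assumptions for illustration, not the engine's actual geometry or interface.

```c
#include <stddef.h>

#define TILE 8  /* illustrative tile size, not the TPE's actual geometry */

/* Tiled C = A * B for an MxK by KxN product, row-major storage.
 * `float` stands in for the TPE's FP16 operands; the three-level tiling
 * mirrors how a tensor engine is fed one output tile at a time from
 * on-cluster L1 memory. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K)
{
    /* Clear the output once, then accumulate tile by tile. */
    for (size_t i = 0; i < M * N; ++i)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < M; i0 += TILE)
        for (size_t j0 = 0; j0 < N; j0 += TILE)
            for (size_t k0 = 0; k0 < K; k0 += TILE)
                /* Inner tile: the unit of work one accelerator call covers. */
                for (size_t i = i0; i < i0 + TILE && i < M; ++i)
                    for (size_t j = j0; j < j0 + TILE && j < N; ++j) {
                        float acc = C[i * N + j];
                        for (size_t k = k0; k < k0 + TILE && k < K; ++k)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}
```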
Implications
- Validation of Heterogeneous RISC-V: DARKSIDE reinforces the trend of pairing flexible RISC-V clusters with dedicated hardware accelerators to balance performance and energy efficiency against programmability in TinyML.
- Enabling On-Chip Training: The 16-bit Floating Point TPE is a significant advancement: it moves DARKSIDE beyond typical ultra-low-power devices, which focus solely on inference, and makes competitive, energy-efficient training feasible directly on the edge device.
- Addressing Extreme-Edge Needs: By supporting highly quantized (2-b) inference while maintaining high performance density, DARKSIDE offers a potent solution for the most restrictive power and area budgets at the extreme edge.
- Technology Viability: Implementing the SoC in 65nm CMOS demonstrates that high GOPS/W efficiency is attainable even on a mature, cost-effective fabrication process.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.