DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training
Abstract
DARKSIDE is a System-on-Chip featuring a heterogeneous cluster of eight RISC-V cores designed for extreme-edge (TinyML) DNN inference and training, integrating 2-bit to 32-bit mixed-precision integer capabilities. To boost performance, the cluster includes dedicated accelerators, notably a depthwise convolution engine and a 16-bit floating-point Tensor Product Engine (TPE). Implemented in 65nm CMOS, DARKSIDE achieves high efficiency, reaching 835 GOPS/W for 2-bit integer kernels and 300 GFLOPS/W for floating-point tensor operations, enabling competitive on-chip training speeds.
Report
DARKSIDE Technical Report
Key Highlights
- Target Application: Extreme-Edge (TinyML) on-chip DNN inference and training, meeting stringent latency, throughput, and energy requirements.
- Architecture: A heterogeneous compute cluster comprising 8 DSP-enhanced RISC-V cores and specialized digital accelerators.
- Peak Integer Efficiency: Achieves 835 GOPS/W when utilizing 2-bit integer DNN kernels, demonstrating ultra-low power operation.
- Floating-Point Capability: Integrates a specialized 16-bit Floating Point Tensor Product Engine (TPE) enabling high-efficiency performance (300 GFLOPS/W) necessary for on-chip training.
- Mixed Precision Support: Supports a wide range of integer precisions, from 2-bit up to 32-bit, within the enhanced RISC-V cores (see the packed-arithmetic sketch after this list).
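To make the mixed-precision idea concrete, the sketch below shows, in plain C, the kind of packed 2-bit dot product that the DSP-enhanced cores accelerate with SIMD MAC instructions. The packing scheme (sixteen signed 2-bit values per 32-bit word) and the function names are illustrative assumptions for this summary, not the DARKSIDE ISA.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative packing: sixteen signed 2-bit values per 32-bit word.
 * On a DSP-enhanced core this loop body would collapse into a few
 * SIMD dot-product/MAC instructions; here it is spelled out in plain C. */
static inline int32_t extract_s2(uint32_t word, int lane)
{
    /* Move the 2-bit lane to the top bits, then arithmetic-shift back
     * down so the value is sign-extended into a full int32_t. */
    return (int32_t)(word << (30 - 2 * lane)) >> 30;
}

/* Dot product of packed 2-bit weights and 2-bit activations, accumulated
 * at 32-bit precision: the "2-b to 32-b" mixed-precision pattern. */
int32_t dot_s2_s2(const uint32_t *w, const uint32_t *x, int n_words)
{
    int32_t acc = 0;
    for (int i = 0; i < n_words; ++i)
        for (int lane = 0; lane < 16; ++lane)
            acc += extract_s2(w[i], lane) * extract_s2(x[i], lane);
    return acc;
}

int main(void)
{
    /* Two packed words = 32 two-bit elements per operand (example data). */
    uint32_t w[2] = {0x1B1B1B1Bu, 0x00000000u};
    uint32_t x[2] = {0xFFFFFFFFu, 0x55555555u};
    printf("acc = %d\n", dot_s2_s2(w, x, 2));
    return 0;
}
```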
Technical Details
- Core Configuration: 8 RISC-V cores integrated into a cluster, enhanced for mixed-precision integer arithmetic (2-b to 32-b).
- Fabrication Technology: Implemented in 65nm CMOS technology.
- Integer Performance: Achieves a peak integer performance of 65 GOPS.
- Accelerators Included: The cluster is enriched with three dedicated digital accelerators:
  - A specialized engine for low-data-reuse depthwise convolution kernels, delivering up to 30 MAC/cycle (see the reference kernel sketched after this list).
  - A minimal-overhead datamover for on-the-fly marshaling of data between 1-b and 32-b formats.
  - A 16-bit Floating Point Tensor Product Engine (TPE) optimized for tiled matrix-multiplication acceleration.
- Floating-Point Performance: The TPE delivers up to 18.2 GFLOPS at an efficiency of 300 GFLOPS/W (a tiled matrix-multiplication sketch follows this list).
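As a point of reference for what the depthwise engine offloads, below is a minimal, unoptimized 3x3 depthwise convolution in C: each input channel is filtered by its own kernel, so weights are never shared across channels, which is exactly the low-data-reuse pattern that maps poorly onto the cores' SIMD datapath. The layout, sizes, and function name are illustrative assumptions, not DARKSIDE's driver API.

```c
#include <stdint.h>

/* Reference 3x3 depthwise convolution, CHW layout, stride 1, no padding.
 * Each of the `channels` input channels has its own 3x3 kernel
 * (weights[c][ky][kx]), so no weight is reused across channels:
 * the low-data-reuse pattern the dedicated engine targets. */
void depthwise_conv3x3_s8(const int8_t *in, const int8_t *weights,
                          int32_t *out, int channels, int height, int width)
{
    const int out_h = height - 2;
    const int out_w = width - 2;

    for (int c = 0; c < channels; ++c) {
        const int8_t *in_c = in + c * height * width;
        const int8_t *w_c  = weights + c * 9;
        int32_t *out_c     = out + c * out_h * out_w;

        for (int oy = 0; oy < out_h; ++oy) {
            for (int ox = 0; ox < out_w; ++ox) {
                int32_t acc = 0;
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx)
                        acc += in_c[(oy + ky) * width + (ox + kx)]
                             * w_c[ky * 3 + kx];
                out_c[oy * out_w + ox] = acc;
            }
        }
    }
}
```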
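The TPE targets tiled matrix multiplication, the dominant kernel in both the forward and backward passes of DNN training. The sketch below shows the tiling structure in plain C, with `float` used as a stand-in for the TPE's 16-bit floating-point datapath; the tile size and function name are assumptions for illustration, not the engine's actual geometry or interface.

```c
#include <stddef.h>

#define TILE 8  /* illustrative tile size, not the TPE's actual geometry */

/* Tiled C = A * B for an MxK by KxN product, row-major storage.
 * `float` stands in for the TPE's FP16 operands; the three-level tiling
 * mirrors how a tensor engine is fed one output tile at a time from
 * on-cluster L1 memory. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K)
{
    /* Clear the output once, then accumulate tile by tile. */
    for (size_t i = 0; i < M * N; ++i)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < M; i0 += TILE)
        for (size_t j0 = 0; j0 < N; j0 += TILE)
            for (size_t k0 = 0; k0 < K; k0 += TILE)
                /* Inner tile: the unit of work one accelerator call covers. */
                for (size_t i = i0; i < i0 + TILE && i < M; ++i)
                    for (size_t j = j0; j < j0 + TILE && j < N; ++j) {
                        float acc = C[i * N + j];
                        for (size_t k = k0; k < k0 + TILE && k < K; ++k)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}
```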
Implications
- Validation of Heterogeneous RISC-V: DARKSIDE reinforces the trend of pairing flexible RISC-V clusters with dedicated hardware accelerators to balance performance and energy efficiency against programmability in TinyML.
- Enabling On-Chip Training: The 16-bit Floating Point TPE is a significant advancement: it moves DARKSIDE beyond typical ultra-low-power devices, which focus solely on inference, and makes competitive, energy-efficient training feasible directly on the edge device.
- Addressing Extreme-Edge Needs: By supporting highly quantized (2-b) inference while maintaining high performance density, DARKSIDE offers a potent solution for the most restrictive power and area budgets at the extreme edge.
- Technology Viability: Implementing the SoC in 65nm CMOS demonstrates that high GOPS/W efficiency is attainable even on a mature, cost-effective fabrication process.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.