DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Abstract

DARKSIDE is a System-on-Chip featuring a heterogeneous cluster of eight RISC-V cores designed for extreme-edge (TinyML) DNN inference and training, integrating 2-bit to 32-bit mixed-precision integer capabilities. To boost performance, the cluster includes dedicated accelerators, notably a depthwise convolution engine and a 16-bit floating-point Tensor Product Engine (TPE). Implemented in 65nm CMOS, DARKSIDE achieves high efficiency, reaching 835 GOPS/W for 2-bit integer kernels and 300 GFLOPS/W for floating-point tensor operations, enabling competitive on-chip training speeds.

DARKSIDE Technical Report

Key Highlights

  • Target Application: Extreme-Edge (TinyML) on-chip DNN inference and training, meeting stringent latency, throughput, and energy requirements.
  • Architecture: A heterogeneous compute cluster comprising 8 DSP-enhanced RISC-V cores and specialized digital accelerators.
  • Peak Integer Efficiency: Achieves 835 GOPS/W when utilizing 2-bit integer DNN kernels, demonstrating ultra-low power operation.
  • Floating-Point Capability: Integrates a specialized 16-bit Floating Point Tensor Product Engine (TPE) delivering up to 300 GFLOPS/W, the efficiency needed for practical on-chip training.
  • Mixed Precision Support: Supports integer precisions from 2-bit up to 32-bit within the DSP-enhanced RISC-V cores.

Technical Details

  • Core Configuration: 8 RISC-V cores integrated into a cluster, enhanced for mixed-precision integer arithmetic (2-bit to 32-bit).
  • Fabrication Technology: Implemented in 65nm CMOS technology.
  • Integer Performance: Achieves a peak integer performance of 65 GOPS.
  • Accelerators Included: The cluster is enriched with three specific digital accelerators:
    1. A specialized engine for low-data-reuse depthwise convolution kernels (delivering up to 30 MAC/cycle).
    2. A minimal-overhead datamover for on-the-fly marshaling of data between 1-bit and 32-bit formats.
    3. A 16-bit Floating Point Tensor Product Engine (TPE) optimized for tiled matrix-multiplication acceleration.
  • Floating-Point Performance: The TPE delivers up to 18.2 GFLOPS, with energy efficiency of up to 300 GFLOPS/W.

Implications

  • Validation of Heterogeneous RISC-V: DARKSIDE reinforces the trend of using flexible RISC-V clusters combined with dedicated hardware acceleration to solve the critical trade-off between performance/efficiency and programmability in TinyML.
  • Enabling On-Chip Training: The inclusion of the 16-bit Floating Point TPE is a significant advancement, moving beyond typical ultra-low-power devices that focus solely on inference, making competitive, energy-efficient training feasible directly on the edge device.
  • Addressing Extreme-Edge Needs: By supporting highly quantized (2-b) inference while maintaining high performance density, DARKSIDE offers a potent solution for the most restrictive power and area budgets at the extreme edge.
  • Technology Viability: Implementing the SoC in 65nm CMOS demonstrates that high GOPS/W efficiency can be achieved even on a mature, cost-effective fabrication process.
