A 3 TOPS/W RISC-V Parallel Cluster for Inference of Fine-Grain Mixed-Precision Quantized Neural Networks

Abstract

This work introduces Flex-V, a novel RISC-V parallel cluster designed for highly energy-efficient inference of fine-grain mixed-precision Quantized Neural Networks (QNNs) in demanding IoT environments. The architecture integrates custom fused Mac&Load instructions and encodes mixed-precision formats in Control and Status Registers (CSRs), achieving up to 91.5 MAC/cycle. Implemented in a commercial 22nm FDX technology, the cluster delivers a peak energy efficiency of 3.26 TOPS/W and up to an 8.5x speed-up over the baseline.

Structured Report

Key Highlights

  • Record Energy Efficiency: Achieves a peak energy efficiency of up to 3.26 TOPS/W (tera-operations per second per watt) on the implemented cluster.
  • Performance Gains: Demonstrates an 8.5x speed-up compared to the baseline architecture.
  • High Performance with Full Programmability: Improves end-to-end QNN performance by 2x to 2.5x compared to existing fully programmable processor solutions.
  • Minimal Overhead: The enhancements result in an area overhead of only 5.6% relative to the baseline RISC-V core.
  • Target Application: Designed specifically to meet the strict memory and energy requirements of Deep Neural Network (DNN) deployment on IoT end-nodes.

Technical Details

  • Processor Architecture: Introduces Flex-V, a processor based on the RISC-V Instruction Set Architecture (ISA).
  • Cluster Configuration: The system uses a tightly-coupled cluster composed of eight Flex-V processors.
  • Custom Instructions: Features novel, fused Mac&Load instructions optimized for mixed-precision dot products.
  • Mixed-Precision Management: Quantization formats are encoded in the Control and Status Registers (CSRs) rather than in the instruction word, avoiding the blow-up of the encoding space that would result from dedicating opcodes to every mixed-precision variant (see the C sketch after this list).
  • Throughput: Achieves up to 91.5 MAC/cycle (multiply-accumulate operations per cycle) across the eight-core cluster, roughly 11.4 MAC/cycle per core.
  • Deployment Stack: A complete hardware and software stack is provided, including a dedicated compiler, optimized libraries, and a memory-aware deployment flow for end-to-end DNN execution.
  • Implementation: Realized in a commercial 22nm FDX technology node.
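
A minimal sketch, in C, of how CSR-based format selection could look from software. The CSR address (0x7C0, in the RISC-V custom read/write range) and the field layout are assumptions for illustration, not the documented Flex-V interface; the point it demonstrates is that the format is written once per layer, so a single mixed-precision MAC opcode serves every activation/weight width pair.

    #include <stdint.h>

    /* Hypothetical custom CSR holding the (activation, weight) precision
     * pair; the address and field layout are illustrative assumptions,
     * not the real Flex-V encoding. */
    #define CSR_MPC_FMT "0x7C0"

    static inline void mpc_set_format(uint32_t act_bits, uint32_t wgt_bits)
    {
        uint32_t fmt = (act_bits << 8) | wgt_bits;  /* e.g. 8-bit act, 2-bit wgt */
    #if defined(__riscv)
        /* One CSR write per layer; the same MAC opcode then serves
         * every format combination. */
        __asm__ volatile("csrw " CSR_MPC_FMT ", %0" :: "r"(fmt));
    #else
        (void)fmt;  /* non-RISC-V host build: stub out the CSR write */
    #endif
    }

With, say, four candidate widths per operand, opcode-encoded formats would need 4 x 4 = 16 variants of each MAC flavor; keeping the format in a CSR needs only one.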

Implications

  • Validation of RISC-V Customization: This work validates the extensibility of the RISC-V ISA, showing that fine-grain custom instructions (such as Mac&Load) can unlock state-of-the-art energy efficiency in specialized domains like quantized AI inference.
  • Setting Edge AI Benchmarks: The 3 TOPS/W-class efficiency sets a high benchmark for programmable AI accelerators targeting the power-constrained IoT and edge computing market, positioning this design competitively against dedicated NPUs (Neural Processing Units).
  • Addressing Data Movement Bottlenecks: The fused Mac&Load instructions directly target the memory and data-movement bottlenecks that typically limit the efficiency of DNN inference on conventional architectures (see the reference kernel after this list).
  • Enabling Complex Edge Algorithms: By demonstrating efficient execution of mixed-precision QNNs, the architecture enables the deployment of complex algorithms requiring high accuracy with minimal memory footprint on tiny, battery-powered devices.
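
To make concrete what the fusion removes, below is a plain-C reference of a mixed-precision dot product with 8-bit activations and 2-bit signed weights packed four per byte. The function name, packing scheme, and precision pair are illustrative assumptions rather than the Flex-V libraries' actual kernels; every extract, sign-extend, and multiply-accumulate in the inner loop is work a fused Mac&Load instruction folds into the load's issue slot, and each byte load feeds four MACs.

    #include <stdint.h>

    /* Sign-extend a 2-bit field (0..3) to a signed value. */
    static inline int8_t sext2(uint8_t v)
    {
        return (int8_t)(uint8_t)(v << 6) >> 6;
    }

    /* 8-bit activations against 2-bit signed weights packed 4 per byte;
     * n must be a multiple of 4. Everything the inner loop does here
     * (load, extract, sign-extend, MAC) is what a fused Mac&Load
     * instruction performs in a single issue slot. */
    int32_t dot_a8w2(const int8_t *act, const uint8_t *w_packed, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i += 4) {
            uint8_t packed = w_packed[i / 4];  /* one load covers 4 weights */
            for (int j = 0; j < 4; j++) {
                int8_t w = sext2((packed >> (2 * j)) & 0x3u);
                acc += (int32_t)act[i + j] * w;
            }
        }
        return acc;
    }

In hardware, fusing this inner loop and replicating it across the eight cores is what lets the cluster approach the reported 91.5 MAC/cycle.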