A 3 TOPS/W RISC-V Parallel Cluster for Inference of Fine-Grain Mixed-Precision Quantized Neural Networks
Abstract
This work introduces Flex-V, a novel RISC-V parallel cluster designed for highly energy-efficient inference of fine-grain mixed-precision Quantized Neural Networks (QNNs) in demanding IoT environments. The architecture integrates custom fused Mac&Load instructions and uses Control and Status Registers (CSRs) to manage mixed-precision formats, achieving 91.5 MAC/cycle. Implemented in commercial 22nm FDX technology, the cluster delivers a peak energy efficiency of 3.26 TOPS/W and offers up to 8.5x speed-up over the baseline.
Structured Report
Key Highlights
- Record Energy Efficiency: Achieves a peak energy efficiency of up to 3.26 TOPS/W (tera-operations per second per watt) on the implemented cluster.
- Performance Gains: Demonstrates an 8.5x speed-up compared to the baseline architecture.
- High Programmability Performance: Improves end-to-end QNN performance by 2x-2.5x compared to existing fully programmable processor solutions.
- Minimal Overhead: The enhancements result in an area overhead of only 5.6% relative to the baseline RISC-V core.
- Target Application: Designed specifically to meet the strict memory and energy requirements of Deep Neural Network (DNN) deployment on IoT end-nodes.
Technical Details
- Processor Architecture: Introduces Flex-V, a processor based on the RISC-V Instruction Set Architecture (ISA).
- Cluster Configuration: The system uses a tightly-coupled cluster composed of eight Flex-V processors.
- Custom Instructions: Features novel fused Mac&Load instructions optimized for mixed-precision dot products.
- Mixed-Precision Management: To avoid the exponential growth of the instruction encoding space that mixed-precision variants would otherwise cause, quantization formats are encoded in the Control and Status Registers (CSRs).
- Throughput: Achieves a measured throughput of up to 91.5 MAC/cycle (multiply-accumulate operations per cycle).
- Deployment Stack: A complete hardware and software stack is provided, including a dedicated compiler, optimized libraries, and a memory-aware deployment flow for end-to-end DNN execution.
- Implementation: Implemented in a commercial 22nm FDX technology process.
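To make the fine-grain mixed-precision idea concrete, the sketch below models in software what the hardware does in registers: sub-byte operands (e.g. 2-bit activations against 4-bit weights) packed into 32-bit words and consumed by a dot product. This is a minimal illustrative model, not the actual Flex-V datapath or CSR encoding; the function names and bit layouts are assumptions for exposition.

```python
# Illustrative software model of a fine-grain mixed-precision dot product.
# Flex-V's Mac&Load instructions operate on sub-byte operands packed into
# 32-bit registers; the little-endian-in-word packing below is an assumption,
# not the documented hardware format.

def pack(values, bits):
    """Pack unsigned sub-byte values into 32-bit words (SIMD-within-a-register)."""
    per_word = 32 // bits
    words = []
    for i in range(0, len(values), per_word):
        word = 0
        for j, v in enumerate(values[i:i + per_word]):
            assert 0 <= v < (1 << bits), "value exceeds chosen bit-width"
            word |= v << (j * bits)
        words.append(word)
    return words

def dot_mixed(a_words, a_bits, b_words, b_bits, n):
    """Dot product of two packed vectors with independent precisions (e.g. 2b x 4b)."""
    mask_a, mask_b = (1 << a_bits) - 1, (1 << b_bits) - 1
    lanes_a, lanes_b = 32 // a_bits, 32 // b_bits
    acc = 0
    for k in range(n):
        wa = a_words[k // lanes_a] >> ((k % lanes_a) * a_bits)
        wb = b_words[k // lanes_b] >> ((k % lanes_b) * b_bits)
        acc += (wa & mask_a) * (wb & mask_b)  # one MAC; hardware fuses it with the next load
    return acc

a = [1, 2, 3, 0, 1, 2, 3, 1]    # 2-bit activations
w = [5, 7, 1, 3, 0, 15, 2, 4]   # 4-bit weights
acc = dot_mixed(pack(a, 2), 2, pack(w, 4), 4, len(a))  # equals the plain dot product
```

Because the two operand formats are independent, supporting every (activation, weight) bit-width pair as distinct opcodes would blow up the encoding space; keeping the pair in a CSR, as the report describes, lets one instruction serve all combinations.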
Implications
- Validation of RISC-V Customization: This work validates the RISC-V ISA's extensibility by showing that fine-grain custom instructions (such as Mac&Load) are key to reaching state-of-the-art energy efficiency in specialized domains like quantized AI inference.
- Setting Edge AI Benchmarks: The 3.26 TOPS/W peak efficiency sets a high benchmark for programmable AI accelerators targeting the power-constrained IoT and edge computing market, positioning this design competitively against dedicated NPUs (Neural Processing Units).
- Addressing Data Movement Bottlenecks: The fused Mac&Load instructions directly target the memory and data-movement bottlenecks that typically limit the efficiency of DNN inference on conventional architectures.
- Enabling Complex Edge Algorithms: By executing mixed-precision QNNs efficiently, the architecture enables the deployment of complex algorithms requiring high accuracy with a minimal memory footprint on tiny, battery-powered devices.
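The data-movement argument can be made concrete with a toy issue-count model: in a plain load/MAC loop each accumulation costs two loads plus one MAC instruction, whereas a fused Mac&Load overlaps the next operand fetch with the current multiply-accumulate. The per-MAC instruction counts below are assumptions for illustration, not figures from the report.

```python
# Toy instruction-issue model (assumption: one issued instruction per cycle,
# counts are illustrative, not measured on Flex-V).

def cycles_naive(n_mac):
    # load operand A, load operand B, then MAC: 3 issued instructions per MAC
    return 3 * n_mac

def cycles_fused(n_mac):
    # one Mac&Load per MAC: the operand load for the next iteration is
    # folded into the current multiply-accumulate
    return n_mac

speedup = cycles_naive(1024) / cycles_fused(1024)  # 3.0 under this model
```

Under these assumptions the fused form issues one third of the instructions, which is the kind of headroom that, combined with eight cores and sub-byte SIMD lanes, underlies throughput figures like 91.5 MAC/cycle.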