Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC with 2-to-8b DNN Acceleration and 30%-Boost Adaptive Body Biasing
Abstract
Marsellus is a heterogeneous RISC-V System-on-Chip designed for low-power AI-IoT end-nodes and fabricated in GlobalFoundries 22nm FDX. The SoC combines a cluster of 16 RISC-V DSP cores supporting specialized 2-to-4-bit arithmetic (XpulpNN) with a 2-to-8-bit Reconfigurable Binary Engine (RBE) for high-efficiency DNN acceleration. It achieves a peak efficiency of 12.4 Top/s/W on hardware-accelerated layers and leverages Adaptive Body Biasing (ABB) for dynamic optimization across diverse operating conditions.
Report
Key Highlights
- Target Application: Heterogeneous RISC-V SoC tailored for power-constrained AI-IoT end-nodes (e.g., augmented reality, personalized healthcare).
- Fabrication Technology: Implemented in GlobalFoundries 22nm FDX (FD-SOI).
- Peak Efficiency: Achieves up to 637 Gop/s or 12.4 Top/s/W on hardware-accelerated DNN layers (2-bit precision).
- Core Architecture: Features a cluster of 16 RISC-V Digital Signal Processing (DSP) cores supporting highly quantized arithmetic extensions (XpulpNN).
- Power Management Innovation: Integrates On-Chip Monitoring (OCM) with an Adaptive Body Biasing (ABB) generator for on-the-fly voltage adaptation, enabling a boost of up to 30% in performance or efficiency.
Technical Details
- Processing Cores: The main compute cluster consists of 16 RISC-V DSP cores, designed to handle both complex signal processing (with floating-point support) and diverse AI workloads.
- Software Acceleration: The cores use specialized instruction-set extensions, known as XpulpNN, which enable efficient execution of 4-bit and 2-bit arithmetic combined with fused MAC&LOAD operations (a plain-C reference of this packed arithmetic follows the list).
- Hardware Accelerator (RBE): A dedicated 2-to-8-bit Reconfigurable Binary Engine accelerates specific, highly parallel DNN operations, focusing on 3x3 and 1x1 (pointwise) convolutions (a minimal convolution reference is sketched below).
- Adaptivity Mechanism: The Adaptive Body Biasing (ABB) system uses a hardware control loop fed by OCM blocks to dynamically adjust transistor threshold voltages ($V_{th}$), minimizing leakage power or boosting speed depending on the immediate operational need (a schematic view of this loop is sketched below).
- Performance Breakdown: Maximum software performance on the 16-core cluster reaches 180 Gop/s or 3.32 Top/s/W at 2-bit precision (a rough power-envelope estimate derived from these figures follows the list).
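The following plain-C sketch shows the packed 4-bit dot-product arithmetic that XpulpNN-style SIMD instructions collapse into a single operation. It is an illustrative reference only, not the actual intrinsics or instruction encoding of the XpulpNN extension; the function name and nibble-packing convention are assumptions.

```c
#include <stdint.h>

/* Illustrative reference of a packed 4-bit dot product: two 32-bit words
 * each carry eight signed nibbles, which are multiplied pairwise and
 * accumulated. On XpulpNN-capable cores this maps to a single SIMD MAC
 * instruction; the fused MAC&LOAD variant additionally fetches the next
 * packed operands while the accumulation is in flight. */
static int32_t dotp_4b_ref(uint32_t a_packed, uint32_t w_packed, int32_t acc)
{
    for (int i = 0; i < 8; i++) {
        /* Extract nibble i and sign-extend it from 4 to 32 bits. */
        int32_t a = (int32_t)((a_packed >> (4 * i)) & 0xFu);
        int32_t w = (int32_t)((w_packed >> (4 * i)) & 0xFu);
        a = (a ^ 0x8) - 0x8;
        w = (w ^ 0x8) - 0x8;
        acc += a * w;   /* multiply-accumulate */
    }
    return acc;
}
```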
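Below is a minimal C reference of the 1x1 (pointwise) convolution kernel of the kind the RBE offloads. The HWC tensor layout, the int8 operand type, and the function signature are illustrative assumptions and do not describe the accelerator's actual programming interface.

```c
#include <stdint.h>

/* Minimal reference of a 1x1 (pointwise) convolution with quantized
 * operands. The RBE operates on 2-to-8-bit data; int8 and the HWC
 * layout are used here purely for readability. */
void conv1x1_ref(const int8_t *in,   /* activations,  [H][W][C_in]  */
                 const int8_t *w,    /* weights,      [C_out][C_in] */
                 int32_t *out,       /* accumulators, [H][W][C_out] */
                 int H, int W, int C_in, int C_out)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            for (int co = 0; co < C_out; co++) {
                int32_t acc = 0;
                for (int ci = 0; ci < C_in; ci++)
                    acc += in[(y * W + x) * C_in + ci]
                         * (int32_t)w[co * C_in + ci];
                out[(y * W + x) * C_out + co] = acc;
            }
}
```

A 3x3 convolution adds two spatial loops over the filter window around the same inner multiply-accumulate, which is the other layer shape the engine targets.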
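The sketch below illustrates, in C, the kind of closed loop the ABB generator implements in hardware: an On-Chip Monitor measures silicon speed and the body-bias setting is stepped until the measurement sits within a target band. The register names, addresses, and unit step are hypothetical placeholders; the real loop runs in dedicated hardware, not firmware.

```c
#include <stdint.h>

/* Hypothetical memory-mapped registers, for illustration only. */
#define OCM_SPEED_REG  (*(volatile uint32_t *)0x1A104000u)  /* monitor readout */
#define ABB_BIAS_REG   (*(volatile uint32_t *)0x1A104004u)  /* bias DAC code   */

/* One iteration of an ABB-style tracking loop. */
void abb_track(uint32_t target_speed, uint32_t tolerance)
{
    uint32_t speed = OCM_SPEED_REG;

    if (speed + tolerance < target_speed) {
        /* Logic is too slow: more forward body bias lowers Vth and boosts speed. */
        ABB_BIAS_REG += 1;
    } else if (speed > target_speed + tolerance) {
        /* Comfortable margin: back the bias off to cut leakage power. */
        ABB_BIAS_REG -= 1;
    }
    /* Otherwise the die is already within the target band; leave the bias alone. */
}
```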
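As a quick check on the figures quoted above, the implied power envelope can be back-calculated by dividing throughput by efficiency, under the simplifying assumption (not stated in this summary) that each throughput/efficiency pair refers to the same operating point:

$P_{\text{cluster}} \approx \frac{180\ \text{Gop/s}}{3.32\ \text{Top/s/W}} \approx 54\ \text{mW}, \qquad P_{\text{RBE}} \approx \frac{637\ \text{Gop/s}}{12.4\ \text{Top/s/W}} \approx 51\ \text{mW}$

Both estimates land in the tens-of-milliwatts range, in line with the power-constrained end-node focus described under Key Highlights.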
Implications
- Validation of RISC-V for Edge AI: Marsellus demonstrates that RISC-V architectures can effectively serve as the core technology for highly sophisticated, heterogeneous, power-sensitive AI edge devices, matching or exceeding typical performance requirements for AI-IoT.
- Pioneering Deep Quantization: By providing native support (via XpulpNN and the RBE) for extremely low precision (down to 2-bit), the SoC advances the feasibility of deploying large, complex DNNs within stringent memory and power budgets.
- Leveraging FD-SOI Technology: The successful integration of Adaptive Body Biasing in the 22nm FDX process highlights the critical role FD-SOI plays in providing the fine-grained, dynamic power management essential for next-generation always-on edge devices.
- Addressing Workload Diversity: By combining a powerful, general-purpose DSP cluster with a dedicated, highly efficient accelerator, the SoC handles both sides of AI-IoT workloads: high-throughput, low-precision inference and demanding high-precision control and signal-processing tasks.