Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers
Abstract
Deeploy is a novel Deep Neural Network (DNN) compiler that enables energy-efficient, end-to-end deployment of Small Language Models (SLMs) directly on heterogeneous microcontroller (MCU)-class chips, without relying on external high-bandwidth memory. The compiler automates optimization across multicore RISC-V (RV32) processors augmented with ML instruction extensions and a hardware Neural Processing Unit (NPU). On an SLM trained on the TinyStories dataset, this approach achieves leading-edge performance of 340 tokens per second at an energy efficiency of 490 $\mu$J per token.
Report
Key Highlights
- First End-to-End SLM Deployment: Successfully executes a Small Language Model (SLM) on an MCU-class device without requiring high-bandwidth external memory access.
- Introducing Deeploy: A novel DNN compiler designed specifically for aggressively constrained, heterogeneous edge devices.
- Leading Performance Metrics: Achieved 340 tokens/second throughput at an energy efficiency of 490 $\mu$J/token.
- Target Architecture: Deployment focuses on multicore RISC-V (RV32) microcontrollers integrated with specialized ML extensions and a hardware Neural Processing Unit (NPU).
Technical Details
- Compiler Function: Deeploy is a specialized DNN compiler responsible for automatic exploration and optimization of the complex memory vs. computation tradeoffs inherent in aggressive SLM deployment.
- Code Generation: The compiler generates highly optimized C code, minimizing the runtime support required to execute the SLM.
- Hardware Utilization: Deeploy ensures full exploitation of the heterogeneous platform, partitioning workloads efficiently across the RV32 cores' ML instruction extensions and the dedicated NPU.
- Model Specifics: The validation benchmark utilized an SLM trained on the TinyStories dataset.
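The memory-vs-computation tradeoff that Deeploy explores can be illustrated with a tiled matrix multiply, the dominant kernel in Transformer inference. The sketch below is not Deeploy's actual generated code; the tile size and data types are assumptions chosen so that one tile of each operand would fit a small on-chip L1 scratchpad:

```c
#include <assert.h>
#include <string.h>

/* TILE is an illustrative assumption: it is chosen so an A-tile, a B-tile,
 * and a C-tile together fit a small L1 scratchpad. Larger tiles reduce the
 * number of off-scratchpad transfers but consume more on-chip memory; a
 * compiler like Deeploy automates this choice per layer. */
#define TILE 16

/* Tiled matmul: C[M][N] = A[M][K] * B[K][N], int8 inputs, int32 accumulators
 * (the usual arrangement for quantized edge inference). A deployment compiler
 * would emit a statically scheduled variant of this loop nest, overlapping
 * DMA transfers of the next tile with computation on the current one. */
static void matmul_tiled(int M, int N, int K,
                         const signed char *A, const signed char *B, int *C)
{
    memset(C, 0, (size_t)M * N * sizeof *C);
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < K; k0 += TILE)
                /* Compute one output tile using only tile-local data */
                for (int i = i0; i < i0 + TILE && i < M; i++)
                    for (int j = j0; j < j0 + TILE && j < N; j++) {
                        int acc = C[i * N + j];
                        for (int k = k0; k < k0 + TILE && k < K; k++)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}
```

The static loop nest and fixed buffer sizes reflect the "minimal runtime support" property described above: all scheduling and memory-placement decisions are resolved at compile time rather than by an on-device runtime.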
Implications
- Advancing TinyML Capabilities: This work significantly pushes the frontier of Tiny Machine Learning (TinyML) by making complex Transformer-based models (SLMs) viable on resource-constrained, ultra-low-power microcontrollers, moving beyond traditional vision or simple sensing tasks.
- Validating RISC-V for AI: The success validates the multicore RISC-V RV32 architecture, especially when combined with specialized ML instruction extensions and NPUs, as a highly energy-efficient platform for emerging AI workloads at the deep edge.
- Compiler Ecosystem Maturity: The introduction of Deeploy addresses a critical gap in the software toolchain, demonstrating how specialized compilers are essential to abstract complexity and fully leverage the performance and power benefits of heterogeneous RISC-V hardware designs.