Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers

Abstract

Deeploy is a novel Deep Neural Network (DNN) compiler that enables the energy-efficient, end-to-end deployment of Small Language Models (SLMs) directly onto heterogeneous microcontroller (MCU)-class chips without relying on external high-bandwidth memory. The compiler automates optimization across multicore RISC-V (RV32) processors augmented with ML instruction extensions and a hardware Neural Processing Unit (NPU). This solution achieves leading-edge performance of 340 tokens per second at an energy efficiency of 490 µJ per token for an SLM trained on the TinyStories dataset.

Report

Key Highlights

  • First End-to-End SLM Deployment: Successfully executes a Small Language Model (SLM) on an MCU-class device without requiring high-bandwidth external memory access.
  • Introducing Deeploy: A novel DNN compiler designed specifically for aggressively constrained, heterogeneous edge devices.
  • Leading Performance Metrics: Achieved 340 tokens/second throughput with an energy efficiency of 490 µJ/token.
  • Target Architecture: Deployment focuses on multicore RISC-V (RV32) microcontrollers integrated with specialized ML extensions and a hardware Neural Processing Unit (NPU).

Technical Details

  • Compiler Function: Deeploy is a specialized DNN compiler responsible for automatic exploration and optimization of the complex memory vs. computation tradeoffs inherent in aggressive SLM deployment.
  • Code Generation: The compiler generates highly optimized C code, minimizing the runtime support required to execute the SLM.
  • Hardware Utilization: Deeploy ensures full exploitation of the heterogeneous platform, partitioning workloads efficiently across the RV32 cores' ML instruction extensions and the dedicated NPU.
  • Model Specifics: The validation benchmark utilized an SLM trained on the TinyStories dataset.

Implications

  • Advancing TinyML Capabilities: This work significantly pushes the frontier of Tiny Machine Learning (TinyML) by making complex Transformer-based models (SLMs) viable on resource-constrained, ultra-low-power microcontrollers, moving beyond traditional vision or simple sensing tasks.
  • Validating RISC-V for AI: The success validates the multicore RISC-V RV32 architecture, especially when combined with specialized ML instruction extensions and NPUs, as a highly energy-efficient platform for emerging AI workloads at the deep edge.
  • Compiler Ecosystem Maturity: The introduction of Deeploy addresses a critical gap in the software toolchain, demonstrating how specialized compilers are essential to abstract complexity and fully leverage the performance and power benefits of heterogeneous RISC-V hardware designs.