Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers
Abstract
Deeploy is a novel Deep Neural Network (DNN) compiler that enables energy-efficient, end-to-end deployment of Small Language Models (SLMs) directly on heterogeneous microcontroller (MCU)-class chips, without relying on external high-bandwidth memory. The compiler automates optimization across multicore RISC-V (RV32) processors augmented with ML instruction extensions and a hardware Neural Processing Unit (NPU). On an SLM trained on the TinyStories dataset, this approach achieves leading-edge performance of 340 tokens per second at an energy efficiency of 490 $\mu$J per token.
Report
Key Highlights
- First End-to-End SLM Deployment: Successfully executes a Small Language Model (SLM) on an MCU-class device without requiring high-bandwidth external memory access.
- Introducing Deeploy: A novel DNN compiler designed specifically for aggressively constrained, heterogeneous edge devices.
- Leading Performance Metrics: Achieved 340 tokens/second throughput at an energy efficiency of 490 $\mu$J/token.
- Target Architecture: Deployment focuses on multicore RISC-V (RV32) microcontrollers integrated with specialized ML extensions and a hardware Neural Processing Unit (NPU).
Technical Details
- Compiler Function: Deeploy is a specialized DNN compiler responsible for automatic exploration and optimization of the complex memory vs. computation tradeoffs inherent in aggressive SLM deployment.
- Code Generation: The compiler generates highly optimized C code, minimizing the runtime support required to execute the SLM.
- Hardware Utilization: Deeploy ensures full exploitation of the heterogeneous platform, partitioning workloads efficiently across the RV32 cores' ML instruction extensions and the dedicated NPU.
- Model Specifics: The validation benchmark utilized an SLM trained on the TinyStories dataset.
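The memory-vs-computation tradeoff that Deeploy explores can be illustrated with a tiled matrix multiply, the dominant kernel in Transformer inference. The sketch below is not Deeploy's actual generated code; the tile size and data types are assumptions chosen so that one tile of each operand would fit a small on-chip L1 scratchpad:

```c
#include <assert.h>
#include <string.h>

/* TILE is an illustrative assumption: it is chosen so an A-tile, a B-tile,
 * and a C-tile together fit a small L1 scratchpad. Larger tiles reduce the
 * number of off-scratchpad transfers but consume more on-chip memory; a
 * compiler like Deeploy automates this choice per layer. */
#define TILE 16

/* Tiled matmul: C[M][N] = A[M][K] * B[K][N], int8 inputs, int32 accumulators
 * (the usual arrangement for quantized edge inference). A deployment compiler
 * would emit a statically scheduled variant of this loop nest, overlapping
 * DMA transfers of the next tile with computation on the current one. */
static void matmul_tiled(int M, int N, int K,
                         const signed char *A, const signed char *B, int *C)
{
    memset(C, 0, (size_t)M * N * sizeof *C);
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < K; k0 += TILE)
                /* Compute one output tile using only tile-local data */
                for (int i = i0; i < i0 + TILE && i < M; i++)
                    for (int j = j0; j < j0 + TILE && j < N; j++) {
                        int acc = C[i * N + j];
                        for (int k = k0; k < k0 + TILE && k < K; k++)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}
```

The static loop nest and fixed buffer sizes reflect the "minimal runtime support" property described above: all scheduling and memory-placement decisions are resolved at compile time rather than by an on-device runtime.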
Implications
- Advancing TinyML Capabilities: This work significantly pushes the frontier of Tiny Machine Learning (TinyML) by making complex Transformer-based models (SLMs) viable on resource-constrained, ultra-low-power microcontrollers, moving beyond traditional vision or simple sensing tasks.
- Validating RISC-V for AI: The success validates the multicore RISC-V RV32 architecture, especially when combined with specialized ML instruction extensions and NPUs, as a highly energy-efficient platform for emerging AI workloads at the deep edge.
- Compiler Ecosystem Maturity: The introduction of Deeploy addresses a critical gap in the software toolchain, demonstrating how specialized compilers are essential to abstract complexity and fully leverage the performance and power benefits of heterogeneous RISC-V hardware designs.