Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow

Abstract

This work addresses the difficulty of running modern Attention-based Transformer models in resource-constrained Tiny Machine Learning (tinyML) environments. The authors introduce a heterogeneous RISC-V architecture, featuring an octa-core cluster coupled with a hardwired accelerator specialized for quantized Attention operations. Supported by an automated deployment flow, this system achieves industry-leading energy efficiency of 2960 GOp/J and 154 GOp/s throughput for end-to-end 8-bit Transformer inference.

Report

Key Highlights

  • Attention-based TinyML: The architecture is specifically designed to enable the deployment of computationally demanding Attention and Transformer models within the strict power envelope of tinyML systems.
  • Heterogeneous Architecture: A specialized architectural template couples RISC-V processors with hardwired acceleration tailored to the compute-intensive operations of the Attention mechanism.
  • Automated Deployment: An automated flow streamlines the path from trained model to hardware execution, enabling end-to-end 8-bit (INT8) Transformer inference.
  • Leading Energy Efficiency: The design reports 2960 GOp/J energy efficiency and 154 GOp/s throughput for end-to-end INT8 inference.
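  • Implied Power: Dividing throughput by efficiency, these figures imply an average power draw of roughly 154 GOp/s ÷ 2960 GOp/J ≈ 52 mW during end-to-end INT8 Transformer inference.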

Technical Details

  • Core Architecture: The system uses an octa-core RISC-V cluster.
  • Acceleration: Includes a dedicated hardwired accelerator optimized for executing quantized Attention operations (an illustrative sketch of such an operation follows this list).
  • Quantization: The system supports full end-to-end 8-bit Transformer inference.
  • Technology Node: Implemented using 22 nm FD-SOI technology.
  • Operating Point: The reported results were measured at a low operating voltage of 0.65 V.
  • Performance Metrics: The system delivers 2960 GOp/J energy efficiency and 154 GOp/s throughput.
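
As a rough illustration of what "quantized Attention operations" involve, the NumPy sketch below computes a single INT8 self-attention head with INT32 accumulation and explicit requantization. The shapes, scale factors, and the use of a floating-point softmax are illustrative assumptions chosen for readability; they do not describe the accelerator's actual datapath, which would typically rely on an integer softmax approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D = 64, 32  # sequence length and head dimension (assumed values)

# INT8 activations with per-tensor scales (symmetric quantization assumed).
q = rng.integers(-128, 128, size=(S, D), dtype=np.int8)
k = rng.integers(-128, 128, size=(S, D), dtype=np.int8)
v = rng.integers(-128, 128, size=(S, D), dtype=np.int8)
s_q = s_k = s_v = 0.02  # example quantization scales

# Q @ K^T with INT32 accumulation, as an integer MatMul engine would compute it.
scores_i32 = q.astype(np.int32) @ k.astype(np.int32).T

# Dequantize, apply the 1/sqrt(D) scaling, and take a softmax.
# (A fully integer pipeline would replace this with an integer softmax approximation.)
scores = scores_i32 * (s_q * s_k) / np.sqrt(D)
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Requantize the attention probabilities back to INT8.
s_p = 1.0 / 127
probs_i8 = np.clip(np.round(probs / s_p), -128, 127).astype(np.int8)

# Second matmul (probs @ V) with INT32 accumulation, then requantize the output.
s_out = 0.05  # example output scale
out_i32 = probs_i8.astype(np.int32) @ v.astype(np.int32)
out_i8 = np.clip(np.round(out_i32 * (s_p * s_v) / s_out), -128, 127).astype(np.int8)

print(out_i8.shape)  # -> (64, 32)
```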

Implications

  • Advancing TinyML Capabilities: This research expands the scope of tinyML beyond traditional Convolutional Neural Networks (CNNs), enabling state-of-the-art Transformer architectures on edge devices.
  • RISC-V Ecosystem Validation: The successful implementation of complex, high-efficiency acceleration demonstrates the viability and competitive advantage of using customizable RISC-V processors as the primary computing base for next-generation ML hardware.
  • Benchmark for Efficiency: The reported 2960 GOp/J sets a new high-water mark for processing modern ML workloads in the energy-constrained domain and provides a reference point for future hardware designs.
  • Deployment Simplification: The automated deployment flow is crucial for commercialization, reducing the complexity of mapping intricate Transformer models onto highly specialized heterogeneous hardware (a hypothetical sketch of one such mapping step follows this list).
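
As a purely hypothetical illustration of the kind of decision such a flow automates, the sketch below routes the operators of a Transformer layer either to the Attention accelerator or to the RISC-V cluster. All operator names and the routing policy are assumptions made for illustration, not the paper's actual toolchain.

```python
# Hypothetical sketch of an operator-partitioning step in an automated deployment
# flow; the op names and routing policy below are illustrative assumptions only.
ATTENTION_OPS = {"qk_matmul", "softmax", "av_matmul"}

def map_operator(op_name: str) -> str:
    """Route quantized Attention kernels to the hardwired accelerator and all
    other layers (projections, FFNs, normalization) to the RISC-V cluster."""
    return "attention_accelerator" if op_name in ATTENTION_OPS else "riscv_cluster"

transformer_layer = ["q_proj", "k_proj", "v_proj", "qk_matmul", "softmax",
                     "av_matmul", "out_proj", "ffn", "layernorm"]
print({op: map_operator(op) for op in transformer_layer})
```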