Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow
Abstract
This work addresses the difficulty of running modern Attention-based Transformer models in resource-constrained Tiny Machine Learning (tinyML) environments. The authors introduce a heterogeneous RISC-V architecture that couples an octa-core cluster with a hardwired accelerator specialized for quantized Attention operations. Supported by an automated deployment flow, the system achieves a leading energy efficiency of 2960 GOp/J and a throughput of 154 GOp/s for end-to-end 8-bit Transformer inference.
Report
Key Highlights
- Attention-based TinyML: The architecture is designed specifically to bring computationally demanding Attention and Transformer models within the strict power envelope of tinyML systems (see the INT8 Attention sketch after this list).
- Heterogeneous Architecture: A specialized architectural template couples general-purpose RISC-V cores with hardwired acceleration tailored to the matrix multiplications and nonlinear functions, such as softmax, at the heart of the Attention mechanism.
- Automated Deployment: An automated flow streamlines the path from trained model to hardware execution, enabling end-to-end 8-bit (INT8) Transformer inference.
- Leading Energy Efficiency: The design reports an energy efficiency of 2960 GOp/J and a throughput of 154 GOp/s.
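To make the target workload concrete, the following is a minimal NumPy sketch of single-head INT8 Attention, the kind of quantized computation the accelerator executes. It is an illustrative reference, not the paper's kernel: the function names, scale parameters, and the floating-point softmax (which dedicated hardware would replace with an integer approximation) are assumptions made for clarity.

```python
import numpy as np

def quantize(x, scale):
    """Symmetric INT8 quantization: real value ~= scale * int8 code."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_attention(q, k, v, s_q, s_k, s_v, s_out):
    """Single-head Attention on INT8 tensors with INT32 accumulation."""
    # Q @ K^T as an 8-bit GEMM with 32-bit accumulators, then rescale.
    logits = q.astype(np.int32) @ k.astype(np.int32).T
    scores = logits * (s_q * s_k) / np.sqrt(q.shape[-1])

    # Numerically stable softmax (float reference; hardware would
    # approximate this step in integer arithmetic).
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)

    # Re-quantize the probabilities so P @ V is again an 8-bit GEMM.
    s_p = 1.0 / 127.0
    p_q = quantize(p, s_p)

    out = p_q.astype(np.int32) @ v.astype(np.int32)
    return quantize(out * (s_p * s_v), s_out)

# Toy usage: sequence length 4, head dimension 8.
rng = np.random.default_rng(0)
s = 0.02
q = quantize(rng.standard_normal((4, 8)), s)
k = quantize(rng.standard_normal((4, 8)), s)
v = quantize(rng.standard_normal((4, 8)), s)
print(int8_attention(q, k, v, s, s, s, s_out=0.05))
```

Keeping both GEMMs in 8-bit with 32-bit accumulation is what lets a hardwired datapath avoid floating-point units almost entirely, a key source of the energy savings.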
Technical Details
- Core Architecture: The system uses an octa-core RISC-V cluster.
- Acceleration: A dedicated hardwired accelerator is optimized for executing quantized Attention operations.
- Quantization: The system supports full end-to-end 8-bit Transformer inference.
- Technology Node: Implemented using 22 nm FD-SOI technology.
- Operating Point: The reported results were measured at a low operating voltage of 0.65 V.
- Performance Metrics: The system delivers an energy efficiency of 2960 GOp/J and a throughput of 154 GOp/s; taken together, these two figures imply the average power draw derived below.
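The implied power is simple arithmetic on the two published figures rather than an additionally reported number:

\[
P = \frac{154\ \text{GOp/s}}{2960\ \text{GOp/J}} \approx 0.052\ \text{W} = 52\ \text{mW}
\]

This roughly 52 mW figure is consistent with the strict power envelopes typically associated with tinyML deployments.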
Implications
- Advancing TinyML Capabilities: This research expands the scope of tinyML beyond traditional Convolutional Neural Networks (CNNs), enabling state-of-the-art Transformer architectures on edge devices.
- RISC-V Ecosystem Validation: The successful implementation of complex, high-efficiency acceleration demonstrates the viability of customizable RISC-V processors as the primary compute platform for next-generation ML hardware.
- Benchmark for Efficiency: The reported efficiency of 2960 GOp/J sets a new high-water mark for processing modern ML workloads in the energy-constrained domain, providing a reference point for future hardware designs.
- Deployment Simplification: The automated deployment flow is crucial for commercialization, reducing the complexity of mapping intricate Transformer models onto highly specialized heterogeneous hardware (a simplified sketch of this mapping step follows).
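To illustrate one decision such a flow automates, here is a deliberately simplified operator-mapping pass. Every name in it (the engine labels, the supported-operator set, the Node type) is hypothetical and stands in for the much richer analysis, tiling, and code generation a real deployment flow performs:

```python
from dataclasses import dataclass

# Hypothetical set of ops offloadable to the Attention accelerator.
ACCEL_OPS = {"MatMul", "Softmax", "Attention"}

@dataclass
class Node:
    name: str
    op: str

def map_to_engines(graph):
    """Assign each operator to the hardwired accelerator when supported,
    falling back to the octa-core RISC-V cluster otherwise."""
    return {n.name: ("accelerator" if n.op in ACCEL_OPS else "riscv_cluster")
            for n in graph}

toy_graph = [Node("embed", "Gather"), Node("qkv_proj", "MatMul"),
             Node("attn", "Attention"), Node("norm", "LayerNorm")]
print(map_to_engines(toy_graph))
# {'embed': 'riscv_cluster', 'qkv_proj': 'accelerator',
#  'attn': 'accelerator', 'norm': 'riscv_cluster'}
```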