A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU
Abstract
This paper introduces a BFloat16 RISC-V acceleration template for edge Generative AI, targeting the performance bottleneck that the softmax and GELU non-linearities create in Transformer models. The core contribution is SoftEx, a novel hardware accelerator built around an approximate exponentiation algorithm that achieves a 121x speedup over the glibc software baseline while maintaining high accuracy (0.14% mean relative error). Integrated into a heterogeneous cluster alongside a systolic array, SoftEx enables a 1.58x throughput increase (310 GOPS) and a 1.42x improvement in energy efficiency (1.34 TOPS/W) on end-to-end ViT inference workloads.
Report
Key Highlights
- Target Application: Acceleration template for Transformer-based Generative AI (GenAI) models, optimized for edge computing using BFloat16 precision.
- Bottleneck Solved: Addresses the performance bottleneck caused by the complex non-linearities Softmax and GELU, which become dominant once matrix multiplication (MatMul) is heavily accelerated (see the sketch after this list).
- Core Innovation: Introduction of SoftEx, a novel hardware accelerator dedicated to high-accuracy, high-speed computation of softmax and GELU functions.
- Performance Gains: SoftEx achieves a 121x speedup in exponentiation over glibc's implementation and boosts end-to-end ViT inference throughput by 1.58x.
- Efficiency: The solution offers up to 10.8x lower energy consumption for softmax and improves overall energy efficiency by 1.42x (achieving 1.34 TOPS/W at 0.55V).
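To see why the non-linearities turn into the choke point, a back-of-the-envelope Amdahl's-law estimate helps. The runtime fractions and speedup factors in the sketch below are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope Amdahl's-law estimate. The 90% MatMul share and
# the 50x / 10x speedup factors are illustrative assumptions, not
# measurements from the paper.

def end_to_end_speedup(matmul_frac: float, matmul_speedup: float,
                       nonlin_speedup: float = 1.0) -> float:
    """Overall speedup when MatMul time and non-linearity time are
    accelerated by different factors."""
    nonlin_frac = 1.0 - matmul_frac
    new_time = matmul_frac / matmul_speedup + nonlin_frac / nonlin_speedup
    return 1.0 / new_time

# MatMul is 90% of baseline runtime and gets a 50x systolic-array
# speedup, while softmax/GELU stay in software:
print(end_to_end_speedup(0.90, 50.0))        # ~8.5x: non-linearities now dominate
# Accelerating the non-linearities ~10x as well recovers most of the gain:
print(end_to_end_speedup(0.90, 50.0, 10.0))  # ~35.7x
```

Under these assumed numbers, software non-linearities cap the whole pipeline below 9x even with a 50x MatMul engine, which is exactly the regime SoftEx targets.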
Technical Details
- Architecture Template: A heterogeneous, tightly-coupled cluster design.
- Cluster Components:
  - 8 general-purpose RISC-V cores.
  - 256 KiB of shared SRAM.
  - A 24x8 systolic array dedicated to MatMul operations.
  - The SoftEx accelerator for non-linearities.
- SoftEx Method: Implements an approximate exponentiation algorithm that balances computational efficiency with accuracy (0.14% mean relative error); a hedged sketch of this style of approximation follows this list.
- Technology & Area: Fabricated in 12nm technology. SoftEx occupies only 0.039 mm², representing 3.22% of the total cluster area.
- Operational Metrics: The cluster achieves an operating frequency of 1.12 GHz.
- Acceleration Factor (SoftEx vs. RISC-V software):
  - Softmax computation accelerated by up to 10.8x (with a matching 10.8x energy reduction); see the softmax sketch after this list.
  - GELU computation accelerated by up to 5.11x (with a 5.29x energy reduction); see the GELU sketch after this list.
- End-to-End Metrics: Achieves 310 GOPS throughput at 0.8V and 1.34 TOPS/W energy efficiency at 0.55V.
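The summary does not detail SoftEx's exponentiation algorithm, so the following Python sketch illustrates a generic hardware-friendly approach in the same spirit: rewrite e^x as 2^(x·log2 e), place the integer part directly into the IEEE-754 exponent field, and correct the fractional part with a small polynomial (Schraudolph-style bit manipulation). The polynomial and its coefficients are assumptions for illustration; this is not the accelerator's actual datapath.

```python
import math
import struct

LOG2E = math.log2(math.e)

def approx_exp(x: float) -> float:
    """Approximate exp(x) = 2**(x * log2(e)): the integer part of the
    exponent is written directly into the IEEE-754 exponent field and
    the fractional part is corrected with a small polynomial.
    Illustrative only; not SoftEx's documented datapath."""
    t = x * LOG2E
    i = math.floor(t)                  # integer part -> exponent bits
    f = t - i                          # fractional part in [0, 1)
    # Degree-2 polynomial for 2**f on [0, 1); coefficients chosen so the
    # endpoints are exact (assumed for illustration).
    p = 1.0 + f * (0.6565 + 0.3435 * f)
    # Build the float 2**i by placing (i + bias) in the exponent field.
    # Valid while -126 <= i <= 127 (no denormal/overflow handling).
    two_i = struct.unpack('<f', struct.pack('<I', (i + 127) << 23))[0]
    return two_i * p

def softmax(row):
    """Numerically stable softmax on a list: subtracting the maximum
    keeps every exponent argument <= 0."""
    m = max(row)
    exps = [approx_exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

# Quick accuracy check of the approximation against math.exp:
xs = [-8 + 16 * k / 999 for k in range(1000)]
mre = sum(abs(approx_exp(x) - math.exp(x)) / math.exp(x) for x in xs) / len(xs)
print(f"mean relative error on [-8, 8]: {mre:.4%}")
print(softmax([1.0, 2.0, 3.0]))        # ~[0.090, 0.245, 0.665]
```

Subtracting the row maximum before exponentiation is the standard stability trick for softmax and maps naturally onto a streaming hardware pipeline.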
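GELU is widely computed with the tanh-form approximation of Hendrycks & Gimpel, and tanh itself reduces to exponentials; this is the usual reason a softmax-oriented exponentiation unit can also serve GELU. The sketch below reuses approx_exp from the previous block and illustrates that general technique, not SoftEx's documented method:

```python
def approx_tanh(x: float) -> float:
    """tanh(x) = (e**(2x) - 1) / (e**(2x) + 1): one call to the same
    exponential primitive that softmax uses."""
    e2x = approx_exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

def gelu(x: float) -> float:
    """Tanh-form GELU approximation (Hendrycks & Gimpel), built on
    approx_tanh to show that one exp unit can cover both
    non-linearities."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + approx_tanh(c * (x + 0.044715 * x ** 3)))

print(gelu(1.0))   # ~0.841 (exact GELU(1.0) is ~0.8413)
```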
Implications
- Validation of Heterogeneous RISC-V for AI: This work validates the RISC-V ecosystem's ability to produce highly specialized, energy-efficient accelerators (SoftEx) that complement general-purpose cores and standard AI units (systolic arrays), an approach central to custom silicon design.
- Enabling Edge GenAI: By efficiently accelerating BFloat16 Transformer non-linearities, the template removes a critical bottleneck for running large Generative AI models (such as ViT) directly on energy-constrained edge devices, reducing dependence on the cloud.
- Design Paradigm Shift: It shows that high-performance AI hardware must not only optimize MatMul but also dedicate resources to the non-linear functions (Softmax/GELU) that become the latency choke points once MatMul is fast. SoftEx demonstrates that large speedups in these functions need not come at the cost of accuracy.
- Competitive Edge: The high energy efficiency (1.34 TOPS/W) positions RISC-V-based designs as viable alternatives to proprietary architectures for demanding GenAI workloads.