A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU

Abstract

This paper introduces a BFloat16 RISC-V acceleration template for edge Generative AI, specifically addressing the performance bottleneck caused by the softmax and GELU non-linearities in Transformer models. The core innovation is SoftEx, a novel hardware accelerator built around an approximate exponentiation algorithm that achieves a 121x speedup over glibc's software implementation while maintaining high accuracy (0.14% mean relative error). Integrated into a heterogeneous cluster alongside a systolic array, SoftEx enables a 1.58x throughput increase (310 GOPS) and a 1.42x improvement in energy efficiency (1.34 TOPS/W) on end-to-end ViT inference workloads.

Report

Key Highlights

  • Target Application: Acceleration template for Transformer-based Generative AI (GenAI) models, optimized for edge computing using BFloat16 precision.
  • Bottleneck Solved: Specifically addresses the performance bottleneck caused by the complex non-linearities, softmax and GELU, which become dominant once matrix multiplication (MatMul) is heavily accelerated (reference formulations of both functions are sketched after this list).
  • Core Innovation: Introduction of SoftEx, a novel hardware accelerator dedicated to high-accuracy, high-speed computation of softmax and GELU functions.
  • Performance Gains: SoftEx computes exponentiation 121x faster than glibc's implementation and boosts end-to-end ViT inference throughput by 1.58x.
  • Efficiency: The solution offers up to 10.8x lower energy consumption for softmax and improves overall energy efficiency by 1.42x (achieving 1.34 TOPS/W at 0.55V).
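
For context on what SoftEx offloads, the following is a minimal reference sketch (not taken from the paper) of the two non-linearities in plain single-precision C; the accelerator itself operates on BFloat16 and replaces the per-element exponential with its approximate unit. Function names such as softmax_row and gelu are illustrative only.

```c
/*
 * Reference-only sketch: plain single-precision softmax and GELU, the two
 * non-linearities SoftEx accelerates. This shows the math the hardware must
 * reproduce and where the expensive exponentials occur; it is not the
 * accelerator's implementation.
 */
#include <math.h>
#include <stdio.h>

/* Numerically stable softmax over one attention-score row. */
static void softmax_row(const float *x, float *y, int n) {
    float max = x[0];
    for (int i = 1; i < n; i++)
        if (x[i] > max) max = x[i];

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        y[i] = expf(x[i] - max);   /* one exponential per element: the costly step */
        sum += y[i];
    }
    for (int i = 0; i < n; i++)
        y[i] /= sum;
}

/* GELU in its exact erf form; many Transformer stacks use a tanh approximation. */
static float gelu(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}

int main(void) {
    float scores[4] = {1.0f, 2.0f, 0.5f, -1.0f}, probs[4];
    softmax_row(scores, probs, 4);
    printf("softmax: %.3f %.3f %.3f %.3f\n", probs[0], probs[1], probs[2], probs[3]);
    printf("gelu(0.5) = %.4f\n", gelu(0.5f));
    return 0;
}
```

The exponential inside the softmax loop is the operation left on the critical path once the systolic array has absorbed the MatMul work, which is exactly the cost SoftEx targets.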

Technical Details

  • Architecture Template: A heterogeneous, tightly-coupled cluster design.
  • Cluster Components:
    • 8 general-purpose RISC-V cores.
    • 256KiB of shared SRAM.
    • A 24x8 systolic array dedicated to MatMul operations.
    • The SoftEx accelerator for non-linearities.
  • SoftEx Method: Implements an approximate exponentiation algorithm designed to balance high computational efficiency with accuracy (Mean Relative Error of 0.14%); an illustrative approximation in the same spirit is sketched after this list.
  • Technology & Area: Fabricated in 12nm technology. SoftEx occupies only 0.039 mm², representing 3.22% of the total cluster area.
  • Operational Metrics: The cluster achieves an operating frequency of 1.12 GHz.
  • Acceleration Factor (SoftEx vs. RISC-V software):
    • Softmax computation accelerated up to 10.8x (and 10.8x energy reduction).
    • GELU computation accelerated up to 5.11x (and 5.29x energy reduction).
  • End-to-End Metrics: Achieves 310 GOPS throughput at 0.8V and 1.34 TOPS/W energy efficiency at 0.55V.
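
This summary does not reproduce SoftEx's exponentiation algorithm itself, so the sketch below uses a generic range-reduction-plus-polynomial approximation, purely to illustrate the class of technique and how a mean relative error (MRE) figure such as 0.14% is measured against a libm reference. The approx_exp name, the constants, and the sampling range are assumptions for illustration, not the accelerator's actual design.

```c
/*
 * Illustrative only: approximate exp(x) by splitting x into an integer power of
 * two plus a small remainder, then evaluating a low-degree polynomial on the
 * remainder. The main() loop shows how a mean relative error is measured
 * against the glibc expf() reference.
 */
#include <math.h>
#include <stdio.h>

static float approx_exp(float x) {
    const float log2e = 1.442695041f;      /* 1 / ln(2) */
    const float ln2   = 0.6931471806f;
    float k = floorf(x * log2e + 0.5f);    /* exp(x) = 2^k * exp(r) */
    float r = x - k * ln2;                 /* r roughly in [-ln2/2, ln2/2] */
    /* Degree-3 polynomial for exp(r) on the reduced range. */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f / 6.0f)));
    return ldexpf(p, (int)k);              /* scale by 2^k */
}

int main(void) {
    /* Mean relative error over a softmax-style input range (x - max <= 0). */
    const int N = 100000;
    double mre = 0.0;
    for (int i = 0; i < N; i++) {
        float x = -20.0f * (float)i / (float)(N - 1);   /* sample [-20, 0] */
        float ref = expf(x);
        mre += fabs((double)approx_exp(x) - (double)ref) / (double)ref;
    }
    printf("mean relative error: %.4f%%\n", 100.0 * mre / N);
    return 0;
}
```

Compiled with -lm, the program prints the measured MRE over the sampled range; the 0.14% figure reported for SoftEx is the analogous metric for its own algorithm.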

Implications

  • Validation of Heterogeneous RISC-V for AI: This work strongly validates the RISC-V ecosystem's ability to create highly specialized, energy-efficient accelerators (SoftEx) that complement general-purpose cores and standard AI units (systolic arrays). This approach is critical for the future of custom silicon.
  • Enabling Edge GenAI: By efficiently accelerating BFloat16 Transformer non-linearities, the template addresses a critical bottleneck for running large, complex Generative AI models (like ViT) directly on energy-constrained edge devices, moving intelligence away from the cloud.
  • Design Paradigm Shift: It highlights that future high-performance AI hardware must not only focus on optimizing MatMul but also dedicate resources to accelerating non-linear functions (Softmax/GELU), which become the latency choke points in optimized systems. SoftEx sets a benchmark for achieving accuracy alongside massive speedup in these specialized domains.
  • Competitive Edge: The high efficiency (1.34 TOPS/W) provides a competitive solution for specialized AI hardware, positioning RISC-V-based designs as viable alternatives to proprietary architectures for demanding GenAI workloads.