VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

Abstract

The VEXP project introduces a low-cost RISC-V Instruction Set Architecture (ISA) extension designed to accelerate the Softmax computation bottleneck in modern Transformer models. This is achieved by integrating a custom Bfloat16 exponentiation arithmetic block, based on Schraudolph's approximation, into the Floating-Point Unit (FPU) at a negligible 1% area overhead. The result is an 8.2x performance improvement for the FlashAttention-2 kernel and up to a 5.8x reduction in end-to-end inference latency for models such as GPT-2 and ViT.

Report

Key Highlights

  • Target Bottleneck: Softmax computation, specifically the exponentiation step, which becomes the primary performance constraint once matrix multiplications are aggressively accelerated (see the softmax sketch after this list).
  • Solution: VEXP, a custom RISC-V ISA extension that adds specialized Bfloat16 exponentiation capabilities.
  • Efficiency: Achieves these gains with only a 1% area overhead on the Floating-Point Unit (FPU) of the RISC-V cores.
  • Softmax Performance: Executes Softmax with 162.7x lower latency and 74.3x lower energy than the baseline cluster.
  • Kernel Performance: Delivers an 8.2x performance improvement and 4.1x higher energy efficiency for the FlashAttention-2 kernel in a GPT-2 configuration.
  • End-to-End Results: Enables multi-cluster systems to execute GPT-2, GPT-3, and ViT inference with up to a 5.8x reduction in latency and a 3.6x reduction in energy consumption, with no re-training required.
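
For orientation, here is a minimal C sketch of the standard numerically stable softmax (a generic illustration, not the paper's kernel). The per-element call to expf() in the middle loop is the step VEXP moves into dedicated FPU hardware.

```c
#include <math.h>
#include <stddef.h>

/* Numerically stable softmax: subtract the row maximum, exponentiate,
 * then normalize.  The n expf() calls dominate once the surrounding
 * matrix multiplications are accelerated. */
void softmax(const float *x, float *y, size_t n) {
    float m = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > m) m = x[i];

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        y[i] = expf(x[i] - m);   /* one exponentiation per element */
        sum += y[i];
    }
    for (size_t i = 0; i < n; i++)
        y[i] /= sum;
}
```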

Technical Details

  • Implementation Location: The custom arithmetic block is integrated directly into the Floating-Point Unit (FPU) of the RISC-V cores within a compute cluster.
  • Customization Method: Custom Instruction Set Architecture (ISA) extensions expose the acceleration unit to software kernels (a hypothetical intrinsic sketch appears after this list).
  • Mathematical Method: The custom exponentiation block utilizes a novel approximation algorithm based on Schraudolph's method (illustrated after this list).
  • Data Format: The acceleration targets Bfloat16 precision for exponentiation.
  • Software Optimization: Performance gains depend on optimizing software kernels to effectively leverage the newly introduced VEXP ISA extension.
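
To make the arithmetic concrete, the following is a minimal C sketch of the classic single-precision Schraudolph trick. Bfloat16 shares float32's 8-bit exponent field (with a 7-bit mantissa), so the same bit-pattern construction carries over with rescaled constants; the paper describes a novel variant, which is not reproduced here.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Schraudolph-style exp(x): build the IEEE-754 bit pattern of the result
 * directly.  For float32 the bits of 2^y are approximately
 * (y + 127) * 2^23, so e^x = 2^(x / ln 2) reduces to one integer
 * multiply-add.  The correction 486411 is a standard choice that spreads
 * the mantissa error; accuracy is on the order of a few percent. */
static float exp_schraudolph(float x) {
    int32_t i = (int32_t)(12102203.0f * x) + (127 << 23) - 486411;
    float f;
    memcpy(&f, &i, sizeof f);   /* reinterpret the integer bits as a float */
    return f;
}

int main(void) {
    for (float x = -4.0f; x <= 4.0f; x += 2.0f)
        printf("x = %4.1f  expf = %10.6f  approx = %10.6f\n",
               x, expf(x), exp_schraudolph(x));
    return 0;
}
```

On the software side, a custom instruction of this kind is typically surfaced to kernels as an inline-assembly intrinsic. The sketch below encodes a placeholder instruction in the RISC-V custom-0 opcode space via the GNU assembler's .insn directive; the opcode and funct values are hypothetical, since this summary does not give VEXP's actual encoding or mnemonic.

```c
/* Hypothetical VEXP intrinsic: an R-type instruction in the custom-0
 * opcode space (0x0b).  func3 = func7 = 0 are placeholders; the real
 * VEXP encoding and operand format are not specified in this summary. */
static inline float vexp_bf16(float x) {
    float r;
    __asm__ volatile(".insn r 0x0b, 0, 0, %0, %1, x0"
                     : "=f"(r)
                     : "f"(x));
    return r;
}
```

A kernel optimized for VEXP would then replace each expf() call in the softmax loop above with this intrinsic.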

Implications

  • Validating RISC-V Extensibility: This work strongly validates the RISC-V architecture's strength in custom hardware design, allowing critical ML bottlenecks to be solved with minimal area cost through ISA extensions.
  • Addressing AI Bottlenecks: By shifting the focus from general matrix multiplication acceleration to specific non-linear functions like Softmax, VEXP offers a path toward more balanced and efficient Transformer hardware.
  • High Efficiency for Edge/Embedded AI: The combination of low area overhead (1%) and substantial energy efficiency gains (up to 74.3x reduction in Softmax energy) makes this approach highly relevant for energy-constrained or embedded AI applications.
  • Immediate ML Impact: Achieving significant speedups (up to 5.8x) for major pre-trained models (GPT-2, GPT-3, ViT) without requiring model adjustments or re-training means the innovation is immediately applicable to existing deep learning workloads.
