Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Abstract

This work presents the first end-to-end inference results of transformer Foundation Models on a general-purpose, open-source many-tiny-core RISC-V platform, leveraging specialized ISA extensions and distributed primitives. The optimized implementation achieves significant speedups over the baselines, reaching up to 12.8x for encoder-only models and up to 35.6x for autoregressive decoder-only models. Critically, the platform demonstrates superior energy efficiency, delivering 294 GFLOPS/W and outperforming state-of-the-art dedicated accelerators by more than 2x.

Report

Key Highlights

  • First End-to-End Inference on RISC-V: Demonstrated the execution of transformer-based Foundation Models (FMs) entirely on an open-source, many-tiny-core RISC-V general-purpose platform.
  • High Encoder Speedup: Achieved up to 12.8x speedup for encoder-only models compared to the baseline version.
  • High Decoder Speedup: Demonstrated up to 35.6x speedup in the Autoregressive (AR) mode for decoder-only models, and 16.1x speedup in the Non-Autoregressive (NAR) mode.
  • Superior Energy Efficiency: Reached 294 GFLOPS/W, which is more than 2x better than existing State-of-the-Art (SoA) dedicated accelerators.
  • High Utilization: Achieved over 79% FPU utilization for encoder-only models and, for decoder-only models, 2.04x higher FPU utilization than the best SoA dedicated accelerator.

Technical Details

  • Platform: An open-source, many-tiny-core RISC-V general-purpose computing platform.
  • Target Models: Focused on two fundamental transformer topologies: encoder-only models (e.g., for computer vision) and decoder-only models (e.g., for natural language processing).
  • Optimization Methods: Included the implementation of distributed Softmax primitives.
  • ISA Extensions Utilized: Leveraged specific Instruction Set Architecture (ISA) extensions designed for SIMD floating-point operand streaming and instruction repetition.
  • Memory Management: Employed specialized DMA (Direct Memory Access) engines to minimize costly main memory accesses and tolerate their inherent latency.
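
The distributed Softmax primitive mentioned above can be understood as a two-pass computation: each core reduces its own slice of the row, and only two scalar reductions (a max and a sum) cross core boundaries. The following is a minimal NumPy sketch of that idea, not the platform's actual kernel; `distributed_softmax` and its chunking scheme are illustrative assumptions.

```python
import numpy as np

def distributed_softmax(row, num_cores):
    # Split one attention row across hypothetical compute cores.
    chunks = np.array_split(row, num_cores)

    # Pass 1: each core finds its local maximum; a cheap all-reduce
    # yields the global maximum (needed for numerical stability).
    global_max = max(c.max() for c in chunks)

    # Pass 2: each core exponentiates its shifted chunk and sums it;
    # a second all-reduce yields the global normalizer.
    local_exp = [np.exp(c - global_max) for c in chunks]
    global_sum = sum(e.sum() for e in local_exp)

    # Each core normalizes its own chunk in place; concatenating the
    # chunks reproduces the full softmax of the row.
    return np.concatenate([e / global_sum for e in local_exp])
```

The key property is that the per-core work scales with the chunk size, while the communication cost is only two scalars per core, which is what makes the primitive attractive on a many-tiny-core cluster.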
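
The DMA latency-tolerance technique described above is typically realized as double buffering: while the FPU computes on one tile, the DMA engine fetches the next tile into a second buffer. This is a schematic ping-pong sketch under simplified assumptions; the list indexing stands in for DMA transfers, and `process_tiles` is a hypothetical name.

```python
def process_tiles(tiles, compute):
    """Overlap simulated 'DMA' tile fetches with compute via two buffers."""
    results = []
    buffers = [None, None]
    buffers[0] = tiles[0]                    # prefetch the first tile
    for i in range(len(tiles)):
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[nxt] = tiles[i + 1]      # "DMA": stage the next tile
        results.append(compute(buffers[i % 2]))  # "FPU": work on current tile
    return results
```

On real hardware the fetch on the ping buffer proceeds in parallel with the compute on the pong buffer, so main-memory latency is hidden whenever a tile's compute time exceeds its transfer time.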

Implications

  • Validating General-Purpose RISC-V for AI: This research successfully validates the capability of general-purpose RISC-V hardware, when appropriately optimized, to handle demanding FM inference workloads, moving beyond reliance solely on high-performance GPUs or custom hardwired proprietary accelerators.
  • Advancing Open-Source Hardware: By using an open-source platform, the work provides transparent and replicable methodologies for AI acceleration, fostering innovation and reducing barriers to entry in the hardware AI ecosystem.
  • Edge and Energy-Efficient AI: The exceptional energy efficiency (294 GFLOPS/W) makes this RISC-V architecture highly suitable for deploying sophisticated AI models in power-constrained environments, such as edge devices or embedded systems, where specialized accelerators are often too expensive or inflexible.
  • Ecosystem Growth: Proving that general-purpose RISC-V can match, and even exceed, SoA dedicated hardware in efficiency will drive greater investment and development interest in the RISC-V instruction set for parallel computing and machine learning.
