Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Abstract

This work presents the first end-to-end inference results of transformer Foundation Models on a general-purpose, open-source many-tiny-core RISC-V platform, leveraging specialized ISA extensions and distributed primitives. The optimized implementation achieves significant speedups over the baselines, reaching up to 12.8x for encoder-only models and up to 35.6x for autoregressive decoder-only models. Critically, the platform demonstrates superior energy efficiency, delivering 294 GFLOPS/W and outperforming state-of-the-art dedicated accelerators by more than 2x.

Report

Key Highlights

  • First End-to-End Inference on RISC-V: Demonstrated the execution of transformer-based Foundation Models (FMs) entirely on an open-source, many-tiny-core RISC-V general-purpose platform.
  • High Encoder Speedup: Achieved up to 12.8x speedup for encoder-only models compared to the baseline version.
  • High Decoder Speedup: Demonstrated up to 35.6x speedup in the Autoregressive (AR) mode for decoder-only models, and 16.1x speedup in the Non-Autoregressive (NAR) mode.
  • Superior Energy Efficiency: Reached 294 GFLOPS/W, which is more than 2x better than existing State-of-the-Art (SoA) dedicated accelerators.
  • High Utilization: Achieved over 79% FPU utilization for encoder-only models and, for decoder-only models, 2.04x higher FPU utilization than the best SoA dedicated accelerator.

Technical Details

  • Platform: An open-source, many-tiny-core RISC-V general-purpose computing platform.
  • Target Models: Focused on two fundamental transformer topologies: encoder-only models (e.g., for computer vision) and decoder-only models (e.g., for natural language processing).
  • Optimization Methods: Included the implementation of distributed Softmax primitives.
  • ISA Extensions Utilized: Leveraged specific Instruction Set Architecture (ISA) extensions designed for SIMD floating-point operand streaming and instruction repetition.
  • Memory Management: Employed specialized DMA (Direct Memory Access) engines to minimize costly main memory accesses and tolerate their inherent latency.
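
The distributed Softmax primitive mentioned above can be understood as a two-pass computation: each core reduces its own slice of the row, and only two scalar reductions (a max and a sum) cross core boundaries. The following is a minimal NumPy sketch of that idea, not the platform's actual kernel; `distributed_softmax` and its chunking scheme are illustrative assumptions.

```python
import numpy as np

def distributed_softmax(row, num_cores):
    # Split one attention row across hypothetical compute cores.
    chunks = np.array_split(row, num_cores)

    # Pass 1: each core finds its local maximum; a cheap all-reduce
    # yields the global maximum (needed for numerical stability).
    global_max = max(c.max() for c in chunks)

    # Pass 2: each core exponentiates its shifted chunk and sums it;
    # a second all-reduce yields the global normalizer.
    local_exp = [np.exp(c - global_max) for c in chunks]
    global_sum = sum(e.sum() for e in local_exp)

    # Each core normalizes its own chunk in place; concatenating the
    # chunks reproduces the full softmax of the row.
    return np.concatenate([e / global_sum for e in local_exp])
```

The key property is that the per-core work scales with the chunk size, while the communication cost is only two scalars per core, which is what makes the primitive attractive on a many-tiny-core cluster.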
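
The DMA latency-tolerance technique described above is typically realized as double buffering: while the FPU computes on one tile, the DMA engine fetches the next tile into a second buffer. This is a schematic ping-pong sketch under simplified assumptions; the list indexing stands in for DMA transfers, and `process_tiles` is a hypothetical name.

```python
def process_tiles(tiles, compute):
    """Overlap simulated 'DMA' tile fetches with compute via two buffers."""
    results = []
    buffers = [None, None]
    buffers[0] = tiles[0]                    # prefetch the first tile
    for i in range(len(tiles)):
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[nxt] = tiles[i + 1]      # "DMA": stage the next tile
        results.append(compute(buffers[i % 2]))  # "FPU": work on current tile
    return results
```

On real hardware the fetch on the ping buffer proceeds in parallel with the compute on the pong buffer, so main-memory latency is hidden whenever a tile's compute time exceeds its transfer time.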

Implications

  • Validating General-Purpose RISC-V for AI: This research successfully validates the capability of general-purpose RISC-V hardware, when appropriately optimized, to handle demanding FM inference workloads, moving beyond reliance solely on high-performance GPUs or custom hardwired proprietary accelerators.
  • Advancing Open-Source Hardware: By using an open-source platform, the work provides transparent and replicable methodologies for AI acceleration, fostering innovation and reducing barriers to entry in the hardware AI ecosystem.
  • Edge and Energy-Efficient AI: The exceptional energy efficiency (294 GFLOPS/W) makes this RISC-V architecture highly suitable for deploying sophisticated AI models in power-constrained environments, such as edge devices or embedded systems, where specialized accelerators are often too expensive or inflexible.
  • Ecosystem Growth: Proving that general-purpose RISC-V can match, and even exceed, SoA dedicated hardware in efficiency will drive greater investment and development interest in the RISC-V instruction set for parallel computing and machine learning.
