Empowering Vector Architectures for ML: The CAMP Architecture for Matrix Multiplication

Abstract

This study introduces the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel design that uses a hybrid multiplier to significantly accelerate matrix multiplication in Vector Architectures (VAs) and SIMD units, with a focus on Quantized Neural Networks (QNNs). CAMP improves both throughput and energy efficiency, specifically targeting ARMv8 SVE and edge RISC-V SIMD platforms. Evaluation demonstrates substantial performance gains, achieving up to a 23x speedup on a RISC-V edge SoC and 17x on the ARM A64FX core, while incurring minimal area overhead (1% to 4%).

Report

Key Highlights

  • Novel Architecture: The paper proposes the Cartesian Accumulative Matrix Pipeline (CAMP) architecture to optimize matrix multiplication, a cornerstone of machine learning (ML).
  • Targeted Improvement: CAMP is specifically designed to overcome the inefficiency of existing Vector Architectures (VAs) and SIMD units when processing the quantized data formats common in Quantized Neural Networks (QNNs).
  • Exceptional Performance Gains: The proposed micro-architecture achieved performance improvements of up to 17x compared to the ARM A64FX core baseline and up to 23x compared to a baseline RISC-V edge System-on-Chip (SoC).
  • Low Overhead: Synthesis results show the CAMP design adds negligible area overhead: 1% on the ARM TSMC 7nm target and 4% on the GlobalFoundries 22nm RISC-V target.
  • Energy Efficiency: Beyond high throughput, CAMP's design also improves energy efficiency, making it well suited to low-power applications.
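The workload CAMP targets is integer matrix multiplication over narrow quantized operands with wide accumulation. The sketch below, in pure Python, illustrates that kernel only; the function name, shapes, and values are illustrative and not taken from the paper.

```python
# Quantized (int8-style) matrix multiply with a wide accumulator:
# int8 x int8 products summed over K quickly exceed the int8/int16
# range, so real hardware accumulates into int32 registers.
def qmatmul(A, B):
    """Multiply two integer matrices given as lists of rows."""
    M, K, N = len(A), len(A[0]), len(B[0])
    assert len(B) == K, "inner dimensions must match"
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0                        # wide accumulator (int32 in hardware)
            for k in range(K):
                acc += A[i][k] * B[k][j]   # int8 x int8 partial product
            C[i][j] = acc
    return C

# int8-range inputs; note the accumulated values already leave int8 range
A = [[127, -128], [5, 7]]
B = [[1, 2], [3, 4]]
C = qmatmul(A, B)  # [[-257, -258], [26, 38]]
```

The widening step is the crux: on conventional SIMD units, juggling narrow multiplies alongside wide accumulators is exactly where the efficiency loss that CAMP addresses comes from.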

Technical Details

  • Architecture Name: Cartesian Accumulative Matrix Pipeline (CAMP).
  • Design Focus: Enhancing matrix multiplication, critical for executing Large Language Models (LLMs) and Convolutional Neural Networks (CNNs).
  • Mechanism: CAMP utilizes a simple yet effective architecture centered around a hybrid multiplier to process quantized data more efficiently.
  • Target Platforms: The architecture is evaluated against modern high-performance (ARMv8 Scalable Vector Extension - SVE) and low-power edge (RISC-V SIMD-based) platforms.
  • Synthesis Targets: Evaluation involved physical synthesis and place-and-route (PnR) using Synopsys tools, targeting:
    • ARM A64FX comparison: TSMC 7nm process technology.
    • RISC-V SoC comparison: GlobalFoundries 22nm process technology.
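This summary does not detail CAMP's internal dataflow, but the name "Cartesian Accumulative" suggests an outer-product-style accumulation, in which each step pairs a column of A with a row of B (a Cartesian product of two vectors) and accumulates the resulting rank-1 update into the output tile. The sketch below shows that generic pattern, as an assumption about the style of pipeline rather than a description of the actual design.

```python
def matmul_outer(A, B):
    """Compute C = A @ B as a sum of K rank-1 (outer-product) updates."""
    M, K, N = len(A), len(A[0]), len(B[0])
    assert len(B) == K, "inner dimensions must match"
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        # One accumulation step: every element of column k of A is
        # paired with every element of row k of B, and the M x N grid
        # of products is added into the output accumulators.
        for i in range(M):
            for j in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Outer-product dataflows are popular in matrix engines because each step reuses one column and one row to update the entire output tile, keeping accumulators resident and operand bandwidth low.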

Implications

  • RISC-V Competitiveness in AI: CAMP directly addresses a major bottleneck (quantized matrix multiplication) in VAs/SIMD units, significantly boosting the ML capability of edge RISC-V systems. The 23x improvement makes RISC-V highly competitive for edge AI processing.
  • Democratization of Efficient ML: By enabling superior performance with minimal area and energy cost, CAMP facilitates the deployment of increasingly popular QNNs (used in LLMs and CNNs) on constrained, low-power hardware, furthering the adoption of RISC-V in edge computing.
  • Architectural Blueprint: The simple, effective hybrid multiplier approach provides a proven blueprint for future extensions and optimizations within the RISC-V Vector (RVV) and custom SIMD ecosystems, maximizing performance without demanding massive silicon area.
