Efficient transformer adaptation for analog in-memory computing via low-rank adapters
Abstract
This paper proposes Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA) to efficiently adapt large transformer models for Analog In-Memory Computing (AIMC) hardware, circumventing the need for costly full-model retraining or analog device reprogramming. AHWA-LoRA freezes the core analog weights as "meta-weights" and introduces lightweight digital LoRA modules that handle both task and hardware adaptation. Deploying this approach on a hybrid architecture with RISC-V multi-core accelerators yields efficient inference with only a 4% per-layer overhead compared to a fully AIMC implementation.
Report
Key Highlights
- Problem Solved: Addresses the limitations of deploying flexible transformer models on static, weight-stationary Analog In-Memory Computing (AIMC) hardware, where adaptation usually requires retraining the entire model or time-consuming device reprogramming.
- Core Innovation: Introduction of Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA) training.
- Mechanism: The AHWA-LoRA method keeps the large analog weights fixed (acting as "meta-weights") and uses external, lightweight LoRA modules to handle both hardware calibration and downstream task adaptation.
- Performance: The resulting hybrid AIMC/digital architecture achieves efficient transformer inference with a minimal 4% per-layer overhead compared to an ideal, fully AIMC implementation.
- Validation: Demonstrated effectiveness and scalability across complex tasks, including the SQuAD v1.1 and GLUE benchmarks, instruction tuning, and reinforcement learning.
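The mechanism above can be sketched in a few lines: the base weight matrix is frozen (it stands in for the analog meta-weights, optionally perturbed by noise during training to model device non-idealities), while only the low-rank factors are trainable in the digital domain. This is a minimal illustrative sketch, not the paper's exact formulation; the class name, dimensions, and the Gaussian noise model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class AHWALoRALinear:
    """Linear layer with frozen analog "meta-weights" plus a digital
    LoRA correction (illustrative sketch, not the paper's exact code)."""

    def __init__(self, d_in, d_out, rank=4, noise_std=0.02):
        self.W = rng.standard_normal((d_out, d_in))        # frozen: mapped to the AIMC tile
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, digital
        self.B = np.zeros((d_out, rank))                   # trainable, digital (zero init)
        self.noise_std = noise_std

    def forward(self, x, training=False):
        W_eff = self.W
        if training:
            # Hardware-aware step: inject assumed Gaussian device noise
            # into the analog path so the LoRA factors learn to compensate.
            W_eff = W_eff + rng.normal(0.0, self.noise_std, self.W.shape)
        analog_out = W_eff @ x               # executed on the static AIMC tile
        digital_out = self.B @ (self.A @ x)  # executed on the digital cores
        return analog_out + digital_out

layer = AHWALoRALinear(d_in=8, d_out=8)
x = rng.standard_normal(8)
y = layer.forward(x)  # inference: analog matvec plus lightweight LoRA correction
```

Because only `A` and `B` are updated, switching tasks means swapping a few kilobytes of digital parameters rather than reprogramming the non-volatile analog array.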
Technical Details
- Computing Paradigm: Analog In-Memory Computing (AIMC).
- Training Method: AHWA-LoRA (Analog Hardware-Aware Low-Rank Adaptation).
- Adaptation Strategy: Parameter-Efficient Fine-Tuning (PEFT) using LoRA modules, which are decoupled from the static analog weight matrix.
- Hardware Implementation: The paper evaluates a practical deployment scenario using a hybrid pipeline strategy.
- Processing Components: The pipeline balances the high throughput of static AIMC tiles, which perform the primary matrix operations, against low-latency, flexible digital processing units that compute the LoRA corrections.
- Digital Accelerator: The digital LoRA processing is implemented using optimized pipeline strategies on RISC-V-based programmable multi-core accelerators.
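A rough operation count gives intuition for why the digital LoRA path adds so little work relative to the analog tile. For a linear layer of shape `d_in × d_out` with rank `r`, the LoRA branch needs `r·(d_in + d_out)` multiply-accumulates versus `d_in·d_out` for the base matrix. The dimensions below are assumed BERT-base-like values for illustration; the paper's 4% figure is a measured per-layer latency overhead on the hybrid pipeline, not this back-of-envelope arithmetic.

```python
def lora_mac_ratio(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of digital LoRA MACs to the analog tile's MACs for one
    linear layer: rank*(d_in + d_out) / (d_in * d_out)."""
    return rank * (d_in + d_out) / (d_in * d_out)

# Assumed BERT-base-like projection (768 x 768) with an assumed rank of 8:
print(f"{lora_mac_ratio(768, 768, 8):.3%}")  # prints 2.083%
```

For small ranks the digital workload stays in the low single-digit percent range, which is why the RISC-V cores can keep pace with the AIMC tiles in a pipelined schedule.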
Implications
- Enhanced AIMC Utility: AHWA-LoRA significantly increases the practical utility of AIMC accelerators for real-world applications like Large Language Models (LLMs), which demand rapid, frequent task adaptation (e.g., in edge computing or personalized AI services).
- RISC-V in AI Acceleration: This work validates the role of RISC-V multi-core architectures not just as control planes, but as essential, low-overhead computational partners in advanced hybrid AI systems. By efficiently handling the digital LoRA calculations, RISC-V enables the full speed potential of the specialized analog compute tiles.
- Hybrid Heterogeneous Computing: The successful integration of static analog compute (AIMC) with flexible digital compute (RISC-V) establishes a template for future heterogeneous hardware accelerators, where performance bottlenecks are solved via intelligent architectural partitioning rather than brute-force scaling.
- Energy Efficiency: By keeping the large analog weights fixed, the solution eliminates the high time and energy cost associated with reprogramming non-volatile analog memory devices for every new task, leading to significant system-level energy savings during adaptation.