NeCTAr: A Heterogeneous RISC-V SoC for Language Model Inference in Intel 16
Abstract
NeCTAr (Near-Cache Transformer Accelerator) is a 16nm heterogeneous multicore RISC-V System-on-Chip (SoC) that integrates near-core and near-memory accelerators for efficient sparse and dense machine learning inference. The prototype chip, designed for language model acceleration, operates at 400MHz and 0.85V, and achieves 109 GOPs/W on matrix-vector multiplications while running inference on a sparse language model (ReLU-Llama).
Report
Key Highlights
- Name & Function: NeCTAr (Near-Cache Transformer Accelerator), a heterogeneous multicore RISC-V SoC focused on efficient Language Model (LM) inference.
- Manufacturing Process: Fabricated in Intel 16, Intel's 16nm technology node.
- Acceleration Strategy: Features both near-core and near-memory accelerators to optimize data movement for ML kernels.
- Target Workloads: Optimized for both sparse and dense machine learning kernels.
- Energy Efficiency: Achieves 109 GOPs/W for matrix-vector multiplications (MVMs).
- Validation: Effectiveness is demonstrated by running inference on ReLU-Llama, a sparse language model (a minimal sparse-kernel sketch follows this list).
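The report does not describe how ReLU-Llama's sparsity is exploited in hardware, but the general idea behind activation-sparse kernels is straightforward: ReLU leaves many activations exactly zero, so a sparsity-aware matrix-vector multiply can skip the corresponding weight columns entirely. A minimal NumPy sketch under that assumption, with all dimensions hypothetical:

```python
import numpy as np

def sparse_mvm(W, x):
    """Matrix-vector product that skips zero activations.

    With ReLU activations many entries of x are exactly zero, so the
    matching columns of W never need to be fetched or multiplied --
    the data movement a sparsity-aware accelerator avoids.
    """
    nz = np.nonzero(x)[0]      # indices of non-zero activations
    return W[:, nz] @ x[nz]    # touch only the columns that matter

# Toy check: a ReLU-ed random vector is roughly half zeros.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
x = np.maximum(rng.standard_normal(16), 0.0)  # ReLU-style activations
assert np.allclose(sparse_mvm(W, x), W @ x)
print(f"{(x == 0).mean():.0%} of columns skipped")
```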
Technical Details
- Architecture: Heterogeneous multicore RISC-V SoC.
- Process Node: Intel 16 (16nm).
- Operational Specs: The prototype chip runs at a core frequency of 400MHz with a supply voltage of 0.85V.
- Accelerator Placement: Combines two acceleration strategies: specialized processing units placed physically near the cores (near-core) and near the memory hierarchy (near-memory), cutting data movement for ML kernels.
- Performance Measurement: The quoted 109 GOPs/W applies specifically to matrix-vector multiplications (MVMs), the kernel that dominates autoregressive transformer inference; a back-of-envelope version of the calculation follows this list.
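For orientation, GOPs/W figures are conventionally computed by counting a multiply-accumulate as two operations and dividing sustained throughput by power. The operating point below is a hypothetical placeholder, not a figure from the report, chosen only so the result lands near the chip's quoted efficiency:

```python
def mvm_gops_per_watt(n_rows, n_cols, runtime_s, power_w):
    """Energy efficiency of one n_rows x n_cols matrix-vector multiply."""
    ops = 2 * n_rows * n_cols       # one MAC per weight, counted as 2 ops
    gops = ops / runtime_s / 1e9    # sustained throughput in GOPs
    return gops / power_w

# Hypothetical operating point: a 4096x4096 MVM in 1 ms at 0.308 W
# works out to ~109 GOPs/W, matching the headline number's scale.
print(f"{mvm_gops_per_watt(4096, 4096, 1.0e-3, 0.308):.0f} GOPs/W")
```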
Implications
- RISC-V Maturity: The design shows that RISC-V can serve as a competitive, energy-efficient architecture for high-performance AI workloads, including the demanding domain of LLM inference.
- Addressing the Memory Bottleneck: Near-core and near-memory accelerators directly target the memory wall, a major limiter of performance and efficiency in modern data-intensive AI hardware; the arithmetic-intensity sketch after this list shows why batch-1 inference runs into it.
- Foundry Adoption: Taping out and operating in Intel 16 demonstrates that sophisticated RISC-V designs now run on modern industrial process nodes, extending the reach of the open instruction set.
- Sparse Model Support: Explicit support for sparse ML kernels, validated with the sparse ReLU-Llama model, positions the SoC for sparsity-aware LLM architectures that cut both memory footprint and computation.
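To make the memory-wall point concrete, a hedged back-of-envelope calculation (not from the report): in batch-1 decoding, each weight is fetched once and used for a single multiply-accumulate, so an MVM performs about one operation per byte of 16-bit weight traffic, far too little to keep compute units busy from memory alone, and exactly the regime near-memory acceleration targets.

```python
def mvm_arithmetic_intensity(n_rows, n_cols, bytes_per_weight=2):
    """Ops per byte of weight traffic for a batch-1 MVM (16-bit weights)."""
    ops = 2 * n_rows * n_cols                          # MAC = 2 ops
    weight_bytes = n_rows * n_cols * bytes_per_weight  # each weight read once
    return ops / weight_bytes

# ~1 op/byte regardless of matrix size: decode-time MVMs are memory-bound,
# which is the regime near-memory accelerators are built for.
print(mvm_arithmetic_intensity(4096, 4096))  # -> 1.0
```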