A Dynamic Allocation Scheme for Adaptive Shared-Memory Mapping on Kilo-core RV Clusters for Attention-Based Model Deployment
Abstract
This paper presents the Dynamic Allocation Scheme (DAS) to resolve severe shared-L1-memory contention and underutilization in kilo-core RISC-V clusters designed for attention-based models. DAS couples runtime-programmable address-remapping hardware with a unified memory allocator, significantly improving data locality across the NUMA PE-to-L1 interconnect. On a 1024-PE cluster, DAS achieved a 1.94x speedup over fixed interleaving for a ViT-L/16 encoder layer while maintaining 81% PE utilization.
Report
Key Highlights
- Core Innovation: Dynamic Allocation Scheme (DAS), a runtime programmable address remapping unit, is introduced to optimize memory mapping and minimize contention in large shared-memory clusters.
- Scalability Target: The scheme addresses performance bottlenecks (reduced throughput and utilization) inherent in aggressively scaled-up kilo-core RISC-V clusters (specifically 1024 Processing Elements).
- Performance Gain: DAS achieved a 1.94x speedup over the fixed word-level interleaved baseline for attention-based workloads.
- Efficiency: A Vision Transformer (ViT-L/16) encoder layer executed in just 5.67 ms at a high PE utilization of 81%.
- Implementation Overhead: Implemented in 12 nm FinFET technology, DAS incurs a negligible area overhead of less than 0.1%.
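The contrast between the fixed word-level interleaved baseline and DAS's programmable remapping can be illustrated with a small model. This is a hedged sketch, not the paper's hardware: the bank count, word size, and window parameters are illustrative assumptions, and the "window" policy stands in for whatever remapping functions the real unit supports.

```python
# Illustrative model of the two mapping policies. All parameters
# (WORD_BYTES, NUM_BANKS, window sizes) are assumptions for the sketch.

WORD_BYTES = 4
NUM_BANKS = 1024  # one L1 bank per PE in the illustrative 1024-PE cluster


def fixed_interleave(addr: int) -> int:
    """Baseline: consecutive words land in consecutive banks across the
    whole cluster, so every buffer's traffic crosses the full NUMA
    interconnect regardless of which PEs actually touch it."""
    return (addr // WORD_BYTES) % NUM_BANKS


def programmable_remap(addr: int, base_bank: int, span: int) -> int:
    """Remapped: consecutive words of this buffer are folded into a
    window of `span` banks starting at `base_bank`, keeping accesses
    local to one PE group (hypothetical remapping function)."""
    word = addr // WORD_BYTES
    return base_bank + (word % span)


# A 64-byte buffer remapped into an 8-bank window at banks 32..39:
banks = [programmable_remap(a, base_bank=32, span=8) for a in range(0, 64, 4)]
```

Under fixed interleaving the same 16 words would be spread over banks 0..15 cluster-wide; the remapped version confines them to eight nearby banks, which is the locality effect DAS exploits.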
Technical Details
- Architecture Focus: Addressing the low throughput caused by hierarchical PE-to-L1 intra-cluster interconnects when handling kernels with diverse arithmetic intensities and memory access patterns.
- Mechanism: DAS utilizes a unified memory allocator and dedicated hardware for dynamic address remapping to adaptively map data onto the multi-banked shared L1 memory.
- Testbed Specification: Evaluation was performed on a 1024-PE RISC-V cluster featuring a Non-Uniform Memory Access (NUMA) PE-to-L1 interconnect structure.
- Target Workload: Attention-based models, exemplified by the Vision Transformer (ViT-L/16), which are critical for modern machine learning deployment.
- Baseline Comparison: Performance metrics are measured against a standard fixed word-level interleaved memory mapping scheme.
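The allocator half of the mechanism can be sketched as follows: each allocation is tagged with a mapping policy that the remapping hardware would then apply. This is a minimal sketch under stated assumptions; the class name, fields, and the shared-vs-local heuristic are hypothetical, and the bump-pointer allocation is a simplification, not the paper's allocator design.

```python
# Hypothetical sketch of a unified allocator that pairs each buffer
# with a mapping descriptor, in the spirit of DAS. Names and policy
# choices are assumptions, not the paper's API.

from dataclasses import dataclass


@dataclass
class Mapping:
    policy: str      # "interleaved" (cluster-wide) or "local" (bank window)
    base_bank: int   # first bank of the window (local policy only)
    span: int        # number of banks the buffer is spread over


class UnifiedAllocator:
    def __init__(self, num_banks: int):
        self.num_banks = num_banks
        self.next_word = 0  # bump-pointer allocation, for brevity

    def alloc(self, words: int, shared: bool, group: int = 0,
              group_banks: int = 8):
        """Return (base_word, Mapping). Tensors read by all PEs (e.g.
        attention weights) stay fully interleaved for bandwidth; a PE
        group's private tiles are pinned to a nearby bank window."""
        base = self.next_word
        self.next_word += words
        if shared:
            mapping = Mapping("interleaved", 0, self.num_banks)
        else:
            mapping = Mapping("local", group * group_banks, group_banks)
        return base, mapping
```

The point of the unified interface is that one allocation call decides both placement and mapping, so software never has to hand-compute bank addresses for the two policies separately.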
Implications
- Enabling Scalable AI: This work provides a crucial architectural improvement enabling highly scalable RISC-V accelerators to efficiently handle memory-intensive, complex AI models like Transformers.
- Optimizing RISC-V Clusters: DAS successfully mitigates the severe contention issues traditionally observed when scaling shared-memory RISC-V clusters into the kilo-core range, making massive parallelism feasible and efficient.
- Low-Cost Performance Boost: The sub-0.1% area overhead for DAS demonstrates that significant performance benefits (1.94x speedup) can be achieved through clever memory management hardware with minimal physical cost, making it highly attractive for commercial chip designs.