TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs
Abstract
This paper presents the TCDM Burst Access architecture, a software-transparent solution for overcoming bandwidth limitations in high-core-count RISC-V Vector (RVV) clusters that share a multi-banked L1 memory (TCDM). A Burst Manager dispatches burst requests and retires the response words in parallel, which improves utilization of the hierarchical interconnect and reduces the contention caused by bursty vector accesses. The design reaches up to 80% of the cores-memory peak bandwidth and delivers up to 2.76x performance and 1.9x energy-efficiency improvements over the serialized-access baseline, with less than 8% logic area overhead.
Report
Key Highlights
- Performance Scaling: The architecture enables efficient scaling of RVV clusters, specifically validated up to 1024 FPUs, by breaking the bandwidth barrier associated with shared L1 memory access.
- Significant Gains: Achieves up to 2.76x performance and 1.9x energy efficiency in real-world kernel benchmarks compared to the serialized access baseline.
- Bandwidth Improvement: Raises cluster bandwidth over the baseline by 118% on a 16-FPU cluster and by 226% on a 256-FPU cluster, reaching up to approximately 80% of the cores-memory peak bandwidth.
- Low Overhead: The implementation requires minimal logic area overhead, reported at less than 8%.
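The percentage gains above are relative improvements, so a 118% gain means a little more than double the baseline, not 1.18x. A minimal sketch of that conversion, using only the figures quoted in the highlights (the helper name `improvement_to_factor` is hypothetical and is not taken from the paper):

```python
def improvement_to_factor(percent_gain: float) -> float:
    """Convert a relative gain in percent into a multiplicative factor.

    A gain of 100% means the new value is 2x the baseline.
    """
    return 1.0 + percent_gain / 100.0


# Bandwidth gains quoted in the summary, keyed by cluster size.
reported_gains = {"16-FPU cluster": 118.0, "256-FPU cluster": 226.0}

factors = {name: improvement_to_factor(gain) for name, gain in reported_gains.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x the baseline bandwidth")
```

Read this way, the bandwidth gains are expressed on the same multiplicative scale as the 2.76x performance and 1.9x energy-efficiency figures.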
Technical Details
- Innovation: TCDM Burst Access architecture, providing software-transparent support for burst memory transactions.
- Problem Context: Designed to mitigate internal contention and performance degradation in hierarchical intra-cluster networks resulting from the bursty memory access patterns typical of SIMD/vector cores.
- Architectural Components: Utilizes a dedicated Burst Manager responsible for dispatching burst requests efficiently to the multi-banked L1 TCDM.
- Data Retirement: Improves interconnect utilization by retiring multiple 32-bit words of a burst response in parallel, over response channels with a parametric data width.
- Technology Node: The design was implemented and validated in a 12-nm FinFET technology node.
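The effect of a wider response channel can be illustrated with a toy cycle count. The sketch below is not the paper's model. It assumes word-interleaved banking (consecutive 32-bit words map to consecutive banks), one word served per bank per cycle, and no bank conflicts or arbitration delays. The names `banks_for_burst`, `response_cycles`, `num_banks`, and `channel_width_words` are hypothetical.

```python
WORD_BYTES = 4  # the summary refers to 32-bit words


def banks_for_burst(base_addr: int, burst_len_words: int, num_banks: int) -> list[int]:
    """Return the bank index hit by each word of a burst.

    Assumption: word-level interleaving, i.e. bank = word_index mod num_banks.
    """
    first_word = base_addr // WORD_BYTES
    return [(first_word + i) % num_banks for i in range(burst_len_words)]


def response_cycles(burst_len_words: int, channel_width_words: int) -> int:
    """Cycles needed to retire a burst response over the response channel.

    Assumption: the channel carries `channel_width_words` words per cycle and
    nothing else stalls it. A width of 1 models serialized retirement.
    """
    if burst_len_words < 0 or channel_width_words < 1:
        raise ValueError("burst length must be >= 0 and channel width >= 1")
    # Ceiling division without floating point.
    return (burst_len_words + channel_width_words - 1) // channel_width_words


if __name__ == "__main__":
    burst = 8
    print("banks touched:", banks_for_burst(base_addr=0, burst_len_words=burst, num_banks=16))
    for width in (1, 2, 4):
        print(f"channel width {width} word(s): {response_cycles(burst, width)} cycle(s)")
```

In this idealized setting, widening the channel from one word to four cuts the retirement time of an 8-word burst from 8 cycles to 2. Real gains depend on bank conflicts and interconnect arbitration, which the sketch ignores.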
Implications
- Enabling Massive Parallelism: The results indicate that RISC-V Vector clusters can scale past 1000 FPUs (validated at up to 1024 FPUs) while maintaining high performance and energy efficiency.
- RISC-V Ecosystem Maturity: Provides a hardware building block for competitive RISC-V solutions in domains that demand high throughput, such as deep learning and high-performance computing (HPC).
- Addressing Bottlenecks: Flat interconnects are an implementation bottleneck in dense clusters. This work shows that hierarchical networks, when paired with an optimized burst access mechanism, can reach up to about 80% of peak bandwidth.
- Energy Efficiency: The energy-efficiency improvement of up to 1.9x makes these large-scale RVV clusters well suited to power-constrained environments.