TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs

Abstract

This paper presents the TCDM Burst Access architecture, a software-transparent mechanism for overcoming bandwidth limitations in high-core-count RISC-V Vector (RVV) clusters that share a multi-banked L1 Tightly Coupled Data Memory (TCDM). A Burst Manager dispatches burst requests to the memory banks, and burst responses are retired in parallel, which improves utilization of the hierarchical interconnect and reduces the contention caused by bursty vector accesses. The design reaches up to 80% of peak bandwidth and achieves up to 2.76x performance and 1.9x energy efficiency over the serialized-access baseline, with a logic area overhead below 8%.

Report

Key Highlights

  • Performance Scaling: The architecture enables efficient scaling of RVV clusters, specifically validated up to 1024 FPUs, by breaking the bandwidth barrier associated with shared L1 memory access.
  • Significant Gains: Achieves up to 2.76x performance and 1.9x energy efficiency in real-world kernel benchmarks compared to the serialized access baseline.
  • Bandwidth Improvement: Increases bandwidth over the baseline cluster by 118% for a 16-FPU cluster and by 226% for a 256-FPU cluster, reaching up to approximately 80% of the cores-memory peak bandwidth.
  • Low Overhead: The implementation requires minimal logic area overhead, reported at less than 8%.
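The bandwidth gains above are stated as percentage increases over the baseline, which are easy to misread as multipliers. As a small arithmetic sketch, the conversion can be written as follows; the helper name `gain_to_factor` is hypothetical and only the two percentages come from this summary.

```python
def gain_to_factor(percent_gain: float) -> float:
    """Convert a relative gain over a baseline (in percent) into a multiplier.

    A gain of +100% means the new value is 2x the baseline, so the
    multiplier is 1 + gain / 100.
    """
    return 1.0 + percent_gain / 100.0


if __name__ == "__main__":
    # (FPU count, reported bandwidth gain in percent) taken from the highlights above.
    reported = [(16, 118.0), (256, 226.0)]
    for fpus, gain in reported:
        print(f"{fpus}-FPU cluster: +{gain:.0f}% = {gain_to_factor(gain):.2f}x baseline bandwidth")
```

Read this way, the reported gains correspond to roughly 2.18x and 3.26x the baseline bandwidth for the 16-FPU and 256-FPU clusters, respectively.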

Technical Details

  • Innovation: The TCDM Burst Access architecture adds software-transparent support for burst memory transactions between vector cores and the shared L1 memory.
  • Problem Context: Designed to mitigate internal contention and performance degradation in hierarchical intra-cluster networks resulting from the bursty memory access patterns typical of SIMD/vector cores.
  • Architectural Components: Uses a dedicated Burst Manager that dispatches burst requests to the banks of the multi-banked L1 TCDM.
  • Data Retirement: Improves bandwidth utilization by retiring multiple 32-bit words of a burst response in parallel, over response channels whose data width is parametric.
  • Technology Node: The design was implemented and evaluated in a 12-nm FinFET technology node.
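To make the effect of parallel retirement concrete, the following is a minimal toy model, not the paper's implementation. It assumes word-interleaved banks (consecutive 32-bit words map to consecutive banks) and a response channel that returns a whole number of words per cycle; the function names, the bank count, and the channel widths are illustrative assumptions that do not appear in the summary.

```python
from math import ceil

WORD_BITS = 32  # the summary states that burst responses carry 32-bit words


def banks_for_burst(start_word: int, burst_len: int, num_banks: int) -> list[int]:
    """Bank index touched by each word of a unit-stride burst.

    Assumption: banks are word-interleaved, so word address w lives in
    bank (w mod num_banks).
    """
    return [(start_word + i) % num_banks for i in range(burst_len)]


def response_cycles(burst_len: int, channel_bits: int) -> int:
    """Cycles needed to return burst_len words over a response channel.

    A 32-bit channel returns one word per cycle (serialized retirement);
    a wider channel returns several words per cycle (parallel retirement).
    Contention and latency are ignored in this toy model.
    """
    words_per_cycle = channel_bits // WORD_BITS
    if words_per_cycle < 1:
        raise ValueError("channel must be at least one word wide")
    return ceil(burst_len / words_per_cycle)


if __name__ == "__main__":
    burst_len = 8  # hypothetical burst of eight words
    print("banks touched:", banks_for_burst(6, burst_len, num_banks=16))
    for width in (32, 64, 128):  # hypothetical channel widths in bits
        print(f"{width}-bit channel: {response_cycles(burst_len, width)} cycles")
```

Under these assumptions, widening the response channel from one word to four words cuts the response phase of an eight-word burst from eight cycles to two. The real design also has to handle bank conflicts and interconnect arbitration, which this sketch leaves out.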

Implications

  • Enabling Massive Parallelism: The results support the feasibility of scaling shared-L1 RISC-V Vector clusters to more than 1000 FPUs (validated at 1024) while maintaining high performance and energy efficiency.
  • RISC-V Ecosystem Maturity: Provides a hardware building block for competitive RISC-V solutions in areas that demand high throughput, such as deep learning and high-performance computing (HPC).
  • Addressing Bottlenecks: Flat all-to-all interconnects become a physical implementation bottleneck in dense clusters. The work shows that hierarchical networks, when coupled with burst access support, can reach up to about 80% of peak bandwidth.
  • Energy Efficiency: The energy efficiency improvement of up to 1.9x makes large-scale RVV clusters attractive for power-constrained environments.