
TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs

Admin


Abstract

This paper presents the TCDM Burst Access architecture, a software-transparent mechanism for overcoming bandwidth limitations in high-core-count RISC-V Vector (RVV) clusters that share a multi-banked L1 memory, the tightly coupled data memory (TCDM). A Burst Manager dispatches burst requests to the L1 banks, and multiple words from each burst response are retired in parallel, which reduces the contention that bursty vector accesses cause in hierarchical interconnects. The design reaches up to 80% of peak bandwidth and delivers up to 2.76x performance and 1.9x energy efficiency over the serialized access baseline, with less than 8% logic area overhead.

Report

Key Highlights

Performance Scaling: The architecture lets RVV clusters scale efficiently, with validation on configurations of up to 1024 FPUs, by relieving the bandwidth bottleneck of shared L1 memory access.
Significant Gains: Achieves up to 2.76x performance and 1.9x energy efficiency in real-world kernel benchmarks compared to the serialized access baseline.
Bandwidth Improvement: Raises cluster bandwidth over the baseline by 118% on a 16-FPU cluster and by 226% on a 256-FPU cluster, reaching up to 80% of the cores-memory peak bandwidth.
Low Overhead: The logic area overhead is reported at less than 8%.
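
To put the percentage gains above on the same footing as the 2.76x and 1.9x factors, note that an improvement of X% over a baseline corresponds to a multiplicative factor of 1 + X/100. The short Python sketch below applies that conversion to the two bandwidth figures quoted in the highlights. The function and variable names are illustrative and do not come from the paper.

```python
def improvement_to_factor(percent_gain: float) -> float:
    """Convert a '+X%' improvement over a baseline into a multiplicative factor."""
    return 1.0 + percent_gain / 100.0


# Bandwidth gains quoted in the highlights (percent over the baseline cluster).
reported_gains = {"16-FPU cluster": 118.0, "256-FPU cluster": 226.0}

for cluster, gain in reported_gains.items():
    factor = improvement_to_factor(gain)
    print(f"{cluster}: +{gain:.0f}% bandwidth = {factor:.2f}x the baseline")
# prints:
# 16-FPU cluster: +118% bandwidth = 2.18x the baseline
# 256-FPU cluster: +226% bandwidth = 3.26x the baseline
```

Read this way, the 256-FPU configuration moves a little more than three times as much data per unit time as its baseline.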

Technical Details

Innovation: TCDM Burst Access architecture, providing software-transparent support for burst memory transactions.
Problem Context: Designed to mitigate internal contention and performance degradation in hierarchical intra-cluster networks resulting from the bursty memory access patterns typical of SIMD/vector cores.
Architectural Components: Utilizes a dedicated Burst Manager responsible for dispatching burst requests efficiently to the multi-banked L1 TCDM.
Data Retirement: Improves utilization by retiring multiple 32-bit words from burst responses in parallel using channels with parametric data-width.
Technology Node: The design was implemented and validated in a 12-nm FinFET technology node.
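
The two mechanisms described above, burst dispatch and parallel retirement, can be illustrated with a small behavioral model. This is a minimal sketch and not the paper's hardware: it assumes a word-interleaved address mapping across banks (consecutive 32-bit words land in consecutive banks), and the names `BankRequest`, `dispatch_burst`, and `response_beats`, as well as the example parameters, are hypothetical. Only the 32-bit word size is taken from the summary.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import ceil

WORD_BYTES = 4  # the summary states that burst responses carry 32-bit words


@dataclass(frozen=True)
class BankRequest:
    bank: int  # index of the L1 bank that holds the word
    row: int   # word offset inside that bank


def dispatch_burst(base_addr: int, num_words: int, num_banks: int) -> list[BankRequest]:
    """Split one burst into per-bank word requests.

    Assumption: word-interleaved mapping, so word w lives in bank w % num_banks.
    """
    first_word = base_addr // WORD_BYTES
    return [
        BankRequest(bank=w % num_banks, row=w // num_banks)
        for w in range(first_word, first_word + num_words)
    ]


def response_beats(num_words: int, words_per_beat: int) -> int:
    """Number of transfers needed to return a burst on a channel that
    carries `words_per_beat` 32-bit words at a time."""
    return ceil(num_words / words_per_beat)


# One 8-word burst starting at byte address 0x40, in a hypothetical 16-bank memory.
requests = dispatch_burst(base_addr=0x40, num_words=8, num_banks=16)
print([r.bank for r in requests])  # [0, 1, 2, 3, 4, 5, 6, 7]

# Serialized return (1 word per beat) versus a 4-word-wide response channel.
print(response_beats(8, words_per_beat=1))  # 8
print(response_beats(8, words_per_beat=4))  # 2
```

Under these assumptions the eight words of the burst fall into eight different banks, so they can be read concurrently, and widening the response channel from one word to four cuts the number of return transfers from eight to two. The channel width is the parametric quantity the summary refers to.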

Implications

Enabling Massive Parallelism: This architecture validates the feasibility of scaling RISC-V Vector clusters far beyond current limitations (e.g., beyond 1000 FPUs) while maintaining high performance and energy efficiency.
RISC-V Ecosystem Maturity: Provides a hardware building block that RISC-V solutions need in order to compete in high-throughput domains such as deep learning and high-performance computing (HPC).
Addressing Bottlenecks: Flat interconnects become an implementation bottleneck in dense clusters, which forces a move to hierarchical networks. The results show that hierarchical networks can reach up to 80% of peak bandwidth when coupled with an optimized burst access mechanism.
Energy Efficiency: The 1.9x improvement in energy efficiency makes large-scale RVV clusters better suited to power-constrained environments.
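
A rough count illustrates why flat interconnects stop scaling in dense clusters. The sketch below compares the number of crosspoints in a flat all-to-all crossbar with that of a two-level network. Every number and name in it is a hypothetical assumption chosen for illustration, not a configuration from the paper, and crosspoint count is only a crude proxy for wiring and area cost.

```python
def flat_crosspoints(cores: int, banks: int) -> int:
    """Every core can reach every bank directly: cores x banks crosspoints."""
    return cores * banks


def two_level_crosspoints(groups: int, cores_per_group: int, banks_per_group: int) -> int:
    """Toy two-level network (an assumption, not the paper's topology):
    a local crossbar inside each group plus one global crossbar that
    links the groups through a single shared port per group."""
    local = groups * cores_per_group * banks_per_group
    global_level = groups * groups
    return local + global_level


# Hypothetical cluster: 256 cores and 1024 banks, split into 16 groups.
cores, banks, groups = 256, 1024, 16
flat = flat_crosspoints(cores, banks)
hier = two_level_crosspoints(groups, cores // groups, banks // groups)

print(flat)  # 262144
print(hier)  # 16640
print(f"{flat / hier:.1f}x fewer crosspoints in the two-level network")
```

The saving comes at a price: in this toy topology all remote traffic from a group funnels through one shared port, which is the kind of internal contention that burst transactions and wider response channels are meant to relieve.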
