TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link

Abstract

The TeraPool project introduces a highly scaled-up cluster architecture that integrates 1024 RISC-V cores optimized for parallel workloads. Its key innovation is a physical-design-aware implementation built around a shared-L1-memory structure, which enables ultra-low-latency data exchange among the many cores. Complemented by a high-bandwidth main-memory link, the design aims to overcome the memory-wall bottlenecks inherent to massive-core-count systems and to establish a viable high-performance blueprint for the RISC-V ecosystem.

Report

TeraPool: Analysis Report

Key Highlights

  • Massive Scale: The design successfully integrates 1024 individual RISC-V cores into a single scaled-up cluster configuration, named TeraPool.
  • Novel Memory Architecture: It uses a shared-L1-memory structure, departing from traditional private L1 caches, to enable extremely fast, low-latency inter-core communication and data sharing (see the sketch after this list).
  • Physical Design Optimization: The architecture is explicitly "Physical Design Aware," meaning the layout, routing, and power delivery for the 1024 cores were co-optimized during the design phase to ensure real-world implementation efficiency and performance targets.
  • Memory Bottleneck Mitigation: The cluster incorporates a dedicated High Bandwidth Main Memory Link, crucial for feeding data to and from the large pool of cores efficiently.
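
The payoff of a shared L1 is easiest to see in code. Below is a minimal sketch, assuming a bare-metal runtime, of the intended usage pattern: each core deposits its partial results directly into the single shared L1 address space and a neighbor reads them back with plain loads, so sharing needs no copies, no message passing, and no coherence traffic. The get_core_id() call and the .shared_l1 linker section are hypothetical stand-ins for whatever the real runtime and linker script provide.

```c
/*
 * Sketch: producer/consumer data sharing through a shared L1 address space.
 * get_core_id() and the .shared_l1 section are assumed conveniences of a
 * bare-metal runtime and linker script, not names from the TeraPool sources.
 */
#include <stdint.h>

#define NUM_CORES 1024
#define CHUNK       16   /* words produced by each core */

/* One buffer region per core, all living in the single shared L1 space. */
static uint32_t partial[NUM_CORES][CHUNK] __attribute__((section(".shared_l1")));
static uint32_t done_flag[NUM_CORES]      __attribute__((section(".shared_l1")));

extern uint32_t get_core_id(void);   /* provided by the runtime (assumed) */

void produce_and_share(void) {
    uint32_t id = get_core_id();

    /* Each core writes its partial results straight into shared L1: no copy
     * into a private cache and no coherence traffic, because there is only
     * one physical copy of the data. */
    for (uint32_t i = 0; i < CHUNK; i++) {
        partial[id][i] = id * CHUNK + i;
    }
    /* Publish with release semantics so the data is visible before the flag. */
    __atomic_store_n(&done_flag[id], 1, __ATOMIC_RELEASE);
}

uint32_t consume_neighbor(void) {
    uint32_t id   = get_core_id();
    uint32_t peer = (id + 1) % NUM_CORES;

    /* Spin until the neighbor has published; each poll is a shared-L1 load. */
    while (!__atomic_load_n(&done_flag[peer], __ATOMIC_ACQUIRE)) { }

    /* Reading the neighbor's data is just a sequence of loads from shared L1. */
    uint32_t sum = 0;
    for (uint32_t i = 0; i < CHUNK; i++) {
        sum += partial[peer][i];
    }
    return sum;
}
```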

Technical Details

  • Core Configuration: Features 1024 RISC-V processing elements, likely optimized for simple, in-order execution to maximize density and throughput.
  • Shared L1 Structure: The core innovation requires a specialized, high-density interconnect to arbitrate access to the shared L1 among a large number of neighboring cores. Because every core addresses the same physical L1 memory rather than caching private copies, the main scaling challenges are banking, arbitration, and routing rather than a cache-coherence protocol.
  • Design Methodology: Emphasis is placed on managing the complexity of routing and clock distribution across a die hosting 1024 processors, using advanced floorplanning techniques to minimize the latency paths critical to shared-L1 access.
  • Interconnect Focus: The interconnect fabric is optimized for spatial locality, enabling adjacent cores to access shared data with latencies approaching those of local L1 hits (see the address-interleaving sketch after this list).
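
As a concrete, deliberately simplified illustration of that locality argument, the sketch below interleaves the shared L1 address space across SRAM banks at word granularity and checks whether a given bank sits in the requesting core's own tile. The bank count, banks-per-tile grouping, and cores-per-tile value are assumptions picked for the example, not TeraPool's published parameters.

```c
/*
 * Sketch: word-level interleaving of a shared L1 across SRAM banks, plus a
 * "local bank" check. All sizes below are illustrative assumptions.
 */
#include <stdint.h>
#include <stdio.h>

#define WORD_BYTES        4u
#define NUM_BANKS      4096u   /* assumed: 4 banks per core at 1024 cores */
#define BANKS_PER_TILE   32u   /* assumed grouping of banks into a tile   */

/* Consecutive words map to consecutive banks, so a unit-stride stream from
 * one core is spread over many banks and rarely conflicts with neighbors. */
static inline uint32_t bank_of(uint32_t addr) {
    return (addr / WORD_BYTES) % NUM_BANKS;
}

/* A bank is "local" if it sits in the same tile as the requesting core; such
 * requests take the short path through the interconnect hierarchy. */
static inline int is_local(uint32_t addr, uint32_t core_id,
                           uint32_t cores_per_tile) {
    uint32_t tile_of_bank = bank_of(addr) / BANKS_PER_TILE;
    uint32_t tile_of_core = core_id / cores_per_tile;
    return tile_of_bank == tile_of_core;
}

int main(void) {
    /* Core 0's unit-stride stream: every word lands in a different bank. */
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t addr = i * WORD_BYTES;
        printf("addr 0x%04x -> bank %u (local to core 0: %d)\n",
               addr, bank_of(addr), is_local(addr, 0, 8));
    }
    return 0;
}
```

Unit-stride streams thus spread naturally across banks, while data that a tile touches most often can be placed in that tile's own banks, giving latencies close to a private-L1 hit.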

Implications

  • RISC-V High-Performance Viability: TeraPool provides strong evidence that RISC-V is a viable Instruction Set Architecture for massive-scale, high-performance computing (HPC) and data center applications, moving beyond embedded and small-scale accelerators.
  • Architectural Exploration: The shared-L1 approach challenges established norms in multi-core design. If successful, it could pave the way for new programming models and synchronization primitives that exploit extremely fast, near-uniform data access across a large core cluster (a barrier sketch after this list illustrates the idea).
  • Scalability Blueprint: This physically designed and validated 1024-core cluster acts as a critical reference architecture for future RISC-V scale-up projects, offering solutions for power, thermal, and bandwidth management at extreme densities.
  • Addressing the Memory Wall: By directly increasing external memory bandwidth and optimizing internal data access through the shared L1, the TeraPool architecture offers a practical answer to the persistent memory-bandwidth limitation encountered when scaling core counts far past 100 (a double-buffering sketch after this list shows the typical usage pattern).
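
To make the programming-model point concrete, here is a minimal sketch of a synchronization primitive that only stays cheap when every core enjoys fast, near-uniform access to the same L1: a single sense-reversing barrier counter kept in shared memory. get_core_id() and the .shared_l1 section are again hypothetical runtime and linker conveniences, and a real 1024-core deployment would likely layer hierarchical wake-up on top; the point is only that a flat shared L1 makes even this naive version plausible.

```c
/*
 * Sketch: a centralized sense-reversing barrier held entirely in shared L1.
 * get_core_id() and the .shared_l1 section are assumed, not TeraPool's API.
 */
#include <stdint.h>

#define NUM_CORES 1024

typedef struct {
    uint32_t count;   /* cores that have arrived at this barrier instance */
    uint32_t sense;   /* flips each time the barrier completes            */
} barrier_t;

static barrier_t bar __attribute__((section(".shared_l1"))) = {0, 0};

extern uint32_t get_core_id(void);   /* assumed runtime call */

void barrier_wait(barrier_t *b) {
    /* Each core waits for the global sense to flip away from its entry value. */
    uint32_t my_sense = !__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE);

    /* The increment maps to a single RISC-V AMOADD on a shared-L1 word. */
    uint32_t arrived = __atomic_add_fetch(&b->count, 1, __ATOMIC_ACQ_REL);

    if (arrived == NUM_CORES) {
        b->count = 0;   /* last core resets the counter ...               */
        __atomic_store_n(&b->sense, my_sense, __ATOMIC_RELEASE);  /* ... and releases everyone */
    } else {
        while (__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) != my_sense) {
            /* spin: each poll is a cheap, near-uniform shared-L1 load */
        }
    }
}
```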
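
On the memory-wall side, the standard way to turn a high-bandwidth link into sustained throughput is double buffering: while the cores work on one tile resident in shared L1, a DMA engine streams the next tile in over the main-memory link. The sketch below shows the pattern; dma_copy_async(), dma_wait(), and process_tile() are hypothetical calls standing in for whatever the real runtime exposes, and in practice only one control core per cluster would program the DMA.

```c
/*
 * Sketch: double buffering over a high-bandwidth main-memory link.
 * dma_copy_async()/dma_wait() are assumed runtime calls, not TeraPool's API.
 */
#include <stdint.h>
#include <stddef.h>

#define TILE_WORDS 4096

extern void dma_copy_async(void *dst, const void *src, size_t bytes); /* assumed */
extern void dma_wait(void);                                           /* assumed */
extern void process_tile(const uint32_t *data, size_t words);         /* the kernel */

/* Two tile-sized buffers in shared L1: one being processed, one being filled. */
static uint32_t tile[2][TILE_WORDS] __attribute__((section(".shared_l1")));

void stream_kernel(const uint32_t *src, size_t num_tiles) {
    /* Prime the pipeline: fetch tile 0 before any processing starts. */
    dma_copy_async(tile[0], src, TILE_WORDS * sizeof(uint32_t));
    dma_wait();

    for (size_t t = 0; t < num_tiles; t++) {
        size_t cur = t & 1;
        size_t nxt = cur ^ 1;

        /* Kick off the next transfer before touching the current tile,
         * so the DMA and the compute overlap. */
        if (t + 1 < num_tiles) {
            dma_copy_async(tile[nxt],
                           src + (t + 1) * TILE_WORDS,
                           TILE_WORDS * sizeof(uint32_t));
        }

        process_tile(tile[cur], TILE_WORDS);   /* compute on the resident tile */

        if (t + 1 < num_tiles) {
            dma_wait();   /* make sure the next tile has landed in shared L1 */
        }
    }
}
```

When the per-tile compute time exceeds the per-tile transfer time, the main-memory link's latency is fully hidden and only its bandwidth limits throughput.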