FlooNoC: A 645 Gbps/link 0.15 pJ/B/hop Open-Source NoC with Wide Physical Links and End-to-End AXI4 Parallel Multi-Stream Support

Abstract

FlooNoC is an open-source Network-on-Chip (NoC) designed specifically for the massive bulk data transfer demands of modern AI accelerators, featuring very wide, AXI4-compliant physical links. It achieves a leading energy efficiency of 0.15 pJ/B/hop and delivers 645 Gbps/link, significantly outpacing state-of-the-art solutions. The architecture introduces a novel multi-stream DMA engine for end-to-end AXI4 ordering, optimizing performance while imposing an area overhead of only 3.5% per compute tile.

Report

Key Highlights

  • Leading Performance Metrics: Achieves 645 Gbps/link bandwidth, resulting in 103 Tbps total aggregate bandwidth across an 8x4 mesh of tiles.
  • Energy Efficiency Benchmark: Delivers a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V.
  • Open-Source and Specialized: FlooNoC is an open-source NoC tailored for domain-specific AI accelerators requiring massive bulk data transfers.
  • Comparative Advantage: Offers three times the energy efficiency and more than double the link bandwidth compared to state-of-the-art NoCs.
  • Area and Density Gain: Compared to traditional AXI4 multi-layer interconnects, FlooNoC achieves a 30% area reduction, which translates into a 47% increase in double-precision compute density (GFLOPS DP) within the same floorplan.

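The headline numbers above can be sanity-checked with quick arithmetic. Note that the report states only the per-link and aggregate totals; the link count used below (32 routers x 5 ports each, i.e. N/S/E/W plus a local port) is an assumption chosen because it reproduces the ~103 Tbps figure, not a number taken from the report.

```python
# Back-of-the-envelope check of the aggregate-bandwidth and energy claims.
# ASSUMPTION: the 8x4 mesh is counted as 32 routers x 5 ports each
# (160 unidirectional links); the report only states the totals.

LINK_BW_GBPS = 645            # per-link bandwidth (Gbps)
ROUTERS = 8 * 4               # 8x4 mesh of tiles
PORTS_PER_ROUTER = 5          # hypothetical port count per router

links = ROUTERS * PORTS_PER_ROUTER            # 160
aggregate_tbps = links * LINK_BW_GBPS / 1e3   # 103.2 Tbps (~103 Tbps claimed)

# Energy claim: 0.15 pJ/B/hop. Example: moving 1 GiB across 4 hops.
E_PJ_PER_B_PER_HOP = 0.15
energy_uj = (2**30) * 4 * E_PJ_PER_B_PER_HOP / 1e6  # pJ -> uJ

print(f"{links} links -> {aggregate_tbps:.1f} Tbps aggregate")
print(f"1 GiB over 4 hops ~ {energy_uj:.0f} uJ")
```

Under these assumptions the totals line up: 160 links at 645 Gbps yield 103.2 Tbps, and a 1 GiB transfer over 4 hops costs roughly 644 uJ of NoC energy.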
Technical Details

  • Link Architecture: Utilizes very wide physical links instantiated on high levels of metal to maximize throughput.
  • Protocol Compliance: The architecture is fully Advanced eXtensible Interface (AXI4) compliant.
  • Multi-Stream Ordering: Introduces a novel end-to-end AXI4 ordering approach enabled by a multi-stream capable Direct Memory Access (DMA) engine, simplifying network interfaces by eliminating inter-stream dependencies.
  • Latency Mitigation: Supports non-blocking transactions at the transport level to enhance latency tolerance.
  • Dedicated Paths: Features dedicated physical links specifically for short, latency-critical messages.
  • Implementation Technology: Physical feasibility is demonstrated via a complete end-to-end reference implementation fabricated in 12nm FinFET technology.
  • Scale: The test configuration utilized an 8x4 mesh of processor cluster tiles incorporating a total of 288 RISC-V cores.
  • Area Footprint: Imposes a minimal area overhead of only 3.5% per compute tile.
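The multi-stream ordering idea in the bullets above can be illustrated with a small software model. This is a conceptual sketch in Python, not the actual RTL or the paper's DMA design: it only shows the property that transactions complete in issue order *within* a stream, while a stalled transaction in one stream never blocks another stream.

```python
from collections import defaultdict, deque

class MultiStreamTracker:
    """Toy model of per-stream end-to-end ordering: responses retire in
    issue order within a stream, with no inter-stream dependencies."""

    def __init__(self):
        self.pending = defaultdict(deque)  # stream id -> txn ids, issue order
        self.done = set()                  # txns whose responses have arrived
        self.retired = []                  # txns delivered back, in order

    def issue(self, stream, txn):
        self.pending[stream].append(txn)

    def respond(self, txn):
        """A response arrived (possibly out of order); retire everything
        now sitting at the head of its own stream's queue."""
        self.done.add(txn)
        for queue in self.pending.values():
            while queue and queue[0] in self.done:
                self.retired.append(queue.popleft())

t = MultiStreamTracker()
t.issue("A", "A0"); t.issue("A", "A1")
t.issue("B", "B0")
t.respond("A1")   # out of order within stream A -> held back
t.respond("B0")   # stream B is independent -> retires immediately
t.respond("A0")   # unblocks A0, then the already-complete A1
print(t.retired)  # ['B0', 'A0', 'A1']
```

The key point mirrored from the architecture: because ordering is enforced per stream at the endpoints, the network interface needs no cross-stream reorder logic, which is what simplifies it.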

Implications

  • Advancing AI Hardware: FlooNoC provides a critical solution for the data movement challenges faced by next-generation domain-specific AI accelerators, which rely heavily on efficient bulk data transfers rather than latency-critical cache line transfers.
  • RISC-V Many-Core Scaling: As an open-source, highly scalable, and AXI4-compliant NoC, FlooNoC is an ideal candidate for integration into large-scale, high-performance RISC-V processor clusters, enabling dense computational deployments (like the 288-core test configuration).
  • Efficiency and Cost Reduction: The demonstrated 30% area reduction and significant energy efficiency gains translate directly into lower manufacturing costs (Power, Performance, Area or PPA benefits) and reduced operational power consumption for future chip designs.
  • Ecosystem Contribution: Being open-source, FlooNoC can accelerate innovation within the wider hardware and RISC-V ecosystems by providing a high-performance interconnect baseline that minimizes design overhead.
