Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET

Abstract

Occamy is a 432-core, 768-DP-GFLOP/s, dual-chiplet RISC-V system designed for high compute efficiency on both dense and sparse FP8-to-FP64 ML and HPC workloads. It combines dual HBM2E memory, a latency-tolerant hierarchical interconnect, and in-core streaming units (SUs) to keep its floating-point units (FPUs) busy, reaching 89% FPU utilization on dense linear algebra. On sparse-dense linear algebra, its technology-node-normalized compute density is up to 11x that of state-of-the-art processors. The RTL is released as open source under a permissive license.

Report

Occamy: A 432-Core RISC-V System Analysis

Key Highlights

  • High-Core Count RISC-V: Features 432 individual cores integrated into a dual-chiplet architecture.
  • Peak Performance: Achieves a maximum performance of 768 Double Precision (DP) GFLOP/s.
  • Workload Versatility: Optimized to handle the entire spectrum of data precision, from 8-bit to 64-bit (FP8-to-FP64), supporting both dense and sparse computation modes crucial for modern ML and HPC.
  • Leading Compute Density: On sparse-dense linear algebra, Occamy delivers a technology-node-normalized compute density up to 11x that of state-of-the-art (SoA) processors.
  • Open-Source RTL: The complete Register Transfer Level (RTL) code for Occamy is freely available under a permissive open-source license, encouraging community adoption and customization.
  • High Utilization: Demonstrates exceptional efficiency, achieving 89% FPU utilization on dense linear algebra and up to 83% on stencil codes.
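
To put the utilization figures in context, the short sketch below converts them into sustained throughput. It assumes that FPU utilization is measured against the 768 DP-GFLOP/s peak and that the kernels run in double precision at the same clock; the summary does not state either point, so the results are rough estimates rather than reported measurements. The function name `sustained_gflops` is purely illustrative.

```python
# Peak double-precision throughput quoted in the summary (DP-GFLOP/s).
PEAK_DP_GFLOPS = 768.0


def sustained_gflops(utilization: float, peak: float = PEAK_DP_GFLOPS) -> float:
    """Estimate sustained throughput as utilization * peak.

    Assumption (not stated in the summary): utilization is defined
    relative to the double-precision peak at the same clock frequency.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    return utilization * peak


# FPU utilization figures quoted in the summary.
reported_utilization = {
    "dense linear algebra": 0.89,
    "stencil codes": 0.83,
}

for workload, utilization in reported_utilization.items():
    estimate = sustained_gflops(utilization)
    print(f"{workload}: ~{estimate:.1f} DP-GFLOP/s (estimate)")
```

Under these assumptions, 89% and 83% utilization correspond to roughly 684 and 637 DP-GFLOP/s, respectively.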

Technical Details

  • Architecture & Implementation: The compute chiplets are fabricated using a mature 12 nm FinFET process node.
  • Packaging: Employs a dual-chiplet configuration mounted on a passive interposer named Hedwig, which is implemented in a 65 nm node.
  • Memory System: Integrated with dual High Bandwidth Memory 2E (HBM2E) stacks, ensuring massive memory throughput critical for data-intensive workloads.
  • Data Handling Optimization: The architecture includes specialized in-core streaming units (SUs) designed to accelerate memory access patterns common in dense and sparse computations.
  • Interconnect: Features a latency-tolerant hierarchical interconnect structure to manage communication efficiently among the numerous cores and memory interfaces.
  • Sparse Performance Metrics: On sparse-sparse linear algebra, Occamy reaches throughputs up to 187 GCOMP/s at an efficiency of 17.4 GCOMP/s/W.
  • ML Benchmarks: Achieves 75% FPU utilization on dense Large Language Model (LLM) inference and 54% FPU utilization on graph-sparse (GCN) ML inference workloads.
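
The summary does not describe how the streaming units work internally. As a rough illustration of the access patterns that sparse-dense linear algebra involves, the sketch below multiplies a sparse matrix in compressed sparse row (CSR) form by a dense vector. It is plain Python, not Occamy code: the function name `csr_spmv` and the CSR layout are assumptions chosen for illustration. The point is that `values` and `col_idx` are read sequentially, while the dense vector `x` is read indirectly through `col_idx`, which is the kind of pattern that dedicated streaming hardware is typically meant to take off the core's critical path.

```python
from typing import List, Sequence


def csr_spmv(
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    values: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Compute y = A @ x for a CSR matrix A (illustrative sketch only)."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Sequential streams: values[k] and col_idx[k].
        # Indirect stream: x[col_idx[k]] (a gather from the dense vector).
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix:
#   [[2, 0, 0],
#    [0, 0, 3],
#    [1, 4, 0]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 1]
values = [2.0, 3.0, 1.0, 4.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)
```

In a software-only loop like this one, each multiply-accumulate is accompanied by index loads and address arithmetic, which is one reason sparse kernels tend to show lower FPU utilization than dense ones.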

Implications

  • Validation of RISC-V in HPC/AI: Occamy provides a powerful demonstration that RISC-V can be scaled into a highly competitive platform for high-performance computing and complex AI acceleration, rivaling proprietary architectures in specialized domains.
  • Chiplet Ecosystem Advancement: The successful deployment of a dual-chiplet system on a passive interposer (Hedwig) validates modular, heterogeneous integration techniques, aligning RISC-V development with modern semiconductor manufacturing trends.
  • Addressing Sparse Workloads: By specifically designing hardware (SUs and interconnect) for efficient sparse processing, Occamy addresses a critical bottleneck in contemporary computing, especially relevant for emerging graph neural networks and efficient LLM serving.
  • Acceleration of Open Hardware: The permissive open-source release of the RTL significantly lowers the barrier to entry for researchers and companies looking to build upon a high-performance, proven multi-core accelerator design, fostering innovation within the broader open hardware ecosystem.