Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET
Abstract
Occamy is a 432-core, 768-DP-GFLOP/s, dual-chiplet RISC-V system designed to maximize compute efficiency across dense and sparse FP8-to-FP64 ML and HPC workloads. Leveraging dual HBM2E memory stacks, a latency-tolerant interconnect, and specialized in-core streaming units, the system sustains high floating-point unit (FPU) utilization, reaching 89% on dense linear algebra. Occamy also surpasses state-of-the-art processors in technology-node-normalized compute density, by up to 11x on sparse-dense linear algebra, and is released as open-source RTL.
Report
Occamy: A 432-Core RISC-V System Analysis
Key Highlights
- High-Core Count RISC-V: Features 432 individual cores integrated into a dual-chiplet architecture.
- Peak Performance: Achieves a maximum performance of 768 Double Precision (DP) GFLOP/s.
- Workload Versatility: Optimized for the full range of floating-point precisions, from 8-bit to 64-bit (FP8 to FP64), in both dense and sparse computation modes crucial for modern ML and HPC.
- Leading Compute Density: On sparse-dense linear algebra, Occamy delivers up to 11x the technology-node-normalized compute density of state-of-the-art (SoA) processors.
- Open-Source RTL: The complete Register Transfer Level (RTL) code for Occamy is freely available under a permissive open-source license, encouraging community adoption and customization.
- High Utilization: Demonstrates exceptional efficiency, achieving 89% FPU utilization on dense linear algebra and up to 83% on stencil codes.
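As a back-of-the-envelope check (derived arithmetic, not figures reported in the summary), sustained throughput is simply peak throughput times FPU utilization. The sketch below applies the quoted utilizations to the 768 DP-GFLOP/s peak; note that the ML workloads run at lower precisions, where peak throughput exceeds the DP figure, so these numbers are indicative for the DP case only.

```python
# Sanity check: sustained rate = peak * FPU utilization.
# Peak (768 DP-GFLOP/s) and utilizations are taken from the summary;
# the derived sustained rates are illustrative, not reported results.
PEAK_DP_GFLOPS = 768.0

utilization = {
    "dense linear algebra": 0.89,
    "stencil codes": 0.83,
    "dense LLM inference": 0.75,
    "graph-sparse (GCN) inference": 0.54,
}

for workload, u in utilization.items():
    sustained = PEAK_DP_GFLOPS * u
    print(f"{workload}: {sustained:.0f} GFLOP/s sustained (at DP peak)")
```

For dense linear algebra this works out to roughly 683 DP-GFLOP/s sustained out of the 768 DP-GFLOP/s peak.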
Technical Details
- Architecture & Implementation: The compute chiplets are fabricated using a mature 12 nm FinFET process node.
- Packaging: Employs a dual-chiplet configuration mounted on a passive interposer named Hedwig, which is implemented in a 65 nm node.
- Memory System: Integrated with dual High Bandwidth Memory 2E (HBM2E) stacks, ensuring massive memory throughput critical for data-intensive workloads.
- Data Handling Optimization: The architecture includes specialized in-core streaming units (SUs) designed to accelerate memory access patterns common in dense and sparse computations.
- Interconnect: Features a latency-tolerant hierarchical interconnect structure to manage communication efficiently among the numerous cores and memory interfaces.
- Sparse Performance Metrics: On sparse-sparse linear algebra, Occamy reaches throughputs up to 187 GCOMP/s at an efficiency of 17.4 GCOMP/s/W.
- ML Benchmarks: Achieves 75% FPU utilization on dense Large Language Model (LLM) inference and 54% on graph-sparse Graph Convolutional Network (GCN) inference workloads.
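The sparse-sparse figures above also imply the power drawn during that workload, since power is throughput divided by energy efficiency. A minimal sanity-check sketch (derived from the quoted numbers, not a reported measurement):

```python
# Implied power during sparse-sparse linear algebra:
# P = throughput / efficiency = 187 GCOMP/s / 17.4 GCOMP/s/W.
# Both inputs are from the summary; the power figure is derived.
throughput_gcomps = 187.0       # GCOMP/s
efficiency_gcomps_per_w = 17.4  # GCOMP/s/W

power_w = throughput_gcomps / efficiency_gcomps_per_w
print(f"Implied power: {power_w:.1f} W")  # roughly 10.7 W
```

That order of magnitude (around 10 W for the compute workload) is consistent with a research chiplet system rather than a server-class accelerator.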
Implications
- Validation of RISC-V in HPC/AI: Occamy provides a powerful demonstration that RISC-V can be scaled into a highly competitive platform for high-performance computing and complex AI acceleration, rivaling proprietary architectures in specialized domains.
- Chiplet Ecosystem Advancement: The successful deployment of a dual-chiplet system utilizing a standard interposer (Hedwig) validates modular, heterogeneous integration techniques, aligning RISC-V development with modern semiconductor manufacturing trends.
- Addressing Sparse Workloads: By specifically designing hardware (SUs and interconnect) for efficient sparse processing, Occamy addresses a critical bottleneck in contemporary computing, especially relevant for emerging graph neural networks and efficient LLM serving.
- Acceleration of Open Hardware: The permissive open-source release of the RTL significantly lowers the barrier to entry for researchers and companies looking to build upon a high-performance, proven multi-core accelerator design, fostering innovation within the broader open hardware ecosystem.