HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads
Abstract
HERMES is a proposed high-performance memory hierarchy for RISC-V architectures executing demanding ML workloads such as CNNs and Transformers. The core innovation is the integration of shared L3 caches with fine-grained coherence protocols and specialized pathways linking directly to deep-learning accelerators such as Gemmini. Evaluated with the gem5 and DRAMSim2 simulators, the design aims to deliver the low-latency, scalable memory operations that modern AI applications demand.
Report
Key Highlights
- Target Focus: HERMES is designed to optimize the memory subsystem of RISC-V architectures specifically for challenging Machine Learning (ML) workloads.
- Goal: Address the persistent memory bandwidth, latency, and scalability challenges posed by ML models (e.g., CNNs, RNNs, Transformers).
- Core Innovation: Integration of shared L3 caches utilizing fine-grained coherence protocols.
- Accelerator Integration: The hierarchy includes specialized, high-bandwidth pathways connecting directly to deep-learning accelerators, with Gemmini cited as an example.
- Methodology: Performance and scalability were evaluated in simulation using gem5 and DRAMSim2.
Technical Details
- Architecture Base: RISC-V.
- Key Techniques Employed (the first sketch after this list illustrates the tensor-aware idea):
- Advanced prefetching.
- Tensor-aware caching mechanisms.
- Hybrid memory models.
- Cache Configuration: The design emphasizes shared L3 caches.
- Coherence Protocol: Utilizes fine-grained coherence protocols to manage data consistency across the shared L3 cache structure (a stand-in state-machine sketch follows this list).
- Evaluation Tools: gem5 (a full-system simulator) and DRAMSim2 (a cycle-accurate DRAM simulator) were used for performance baselining and scalability analysis (a minimal configuration fragment follows below).
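The summary does not spell out how these techniques operate, so the following is a minimal sketch of the idea behind tensor-aware prefetching, in Python for readability. All names and sizes here (TensorDescriptor, tile_prefetch_addresses, the 32x32 tile, 64-byte lines) are illustrative assumptions, not HERMES interfaces: because a tensor's traversal order is known ahead of time, a prefetcher can compute the cache lines of the next tile and fetch them while the current tile is still being consumed, rather than reacting to misses.

```python
# Hypothetical sketch of tensor-aware prefetching; names and sizes
# are illustrative assumptions, not taken from HERMES itself.
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    base_addr: int      # start of the tensor in memory
    rows: int           # logical matrix dimensions
    cols: int
    elem_size: int = 4  # bytes per element (e.g., fp32)

def tile_prefetch_addresses(t: TensorDescriptor, tile: int, line: int = 64):
    """Yield the cache-line addresses of each row-major tile, in the
    order a blocked GEMM kernel would touch them. Because the schedule
    is known statically, a prefetcher can stay one tile ahead of the
    compute unit instead of waiting for demand misses."""
    for row0 in range(0, t.rows, tile):
        for col0 in range(0, t.cols, tile):
            lines = set()
            for r in range(row0, min(row0 + tile, t.rows)):
                start = t.base_addr + (r * t.cols + col0) * t.elem_size
                end = start + min(tile, t.cols - col0) * t.elem_size
                for addr in range(start - start % line, end, line):
                    lines.add(addr)
            yield sorted(lines)

# Example: walk a 128x128 fp32 matrix in 32x32 tiles.
desc = TensorDescriptor(base_addr=0x1000_0000, rows=128, cols=128)
for tile_lines in tile_prefetch_addresses(desc, tile=32):
    pass  # issue prefetches for tile_lines while the prior tile computes
```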
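The summary does not name the fine-grained protocol HERMES uses, so the sketch below substitutes a textbook MESI state machine tracked per cache line. The states, events, and transition table are a generic stand-in for illustration, not the HERMES protocol.

```python
# Stand-in coherence model: textbook MESI at cache-line granularity,
# used here only because the summary does not name HERMES's protocol.
from enum import Enum, auto

class State(Enum):
    MODIFIED = auto()
    EXCLUSIVE = auto()
    SHARED = auto()
    INVALID = auto()

# (current state, observed event) -> next state for one cache line.
# "local_*" are requests from the attached core; "remote_*" are
# snooped requests from other cores sharing the L3.
TRANSITIONS = {
    (State.INVALID,   "local_read"):   State.EXCLUSIVE,  # assumes no sharers
    (State.INVALID,   "local_write"):  State.MODIFIED,
    (State.EXCLUSIVE, "local_write"):  State.MODIFIED,   # silent upgrade
    (State.EXCLUSIVE, "remote_read"):  State.SHARED,
    (State.EXCLUSIVE, "remote_write"): State.INVALID,
    (State.SHARED,    "local_write"):  State.MODIFIED,   # others invalidated
    (State.SHARED,    "remote_write"): State.INVALID,
    (State.MODIFIED,  "remote_read"):  State.SHARED,     # after write-back
    (State.MODIFIED,  "remote_write"): State.INVALID,    # after write-back
}

def next_state(state: State, event: str) -> State:
    """Apply one protocol event; pairs not listed leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Example: core 0 reads a line, then core 1 writes the same line.
s = next_state(State.INVALID, "local_read")   # -> EXCLUSIVE
s = next_state(s, "remote_write")             # -> INVALID
```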
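For a sense of the evaluation setup, here is a minimal gem5 configuration fragment in the style of gem5's public Python API, wiring a shared L3 to a DRAMSim2-backed memory controller. The cache parameters, the use of gem5's classic (non-Ruby) memory system, and the DRAMSim2 wrapper usage are assumptions for illustration, not HERMES's actual setup; port and wrapper names vary across gem5 versions, and the fragment omits CPU workload setup.

```python
# Illustrative gem5 configuration fragment (classic memory system).
# Sizes and latencies are assumptions, not HERMES's numbers; the
# DRAMSim2 wrapper and port names vary across gem5 versions.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("2GB")]

system.cpu = TimingSimpleCPU()

# Shared L3 modeled as a single classic Cache behind a crossbar.
class L3Cache(Cache):
    size = "8MB"
    assoc = 16
    tag_latency = 20
    data_latency = 20
    response_latency = 20
    mshrs = 32
    tgts_per_mshr = 12

system.l3bus = L2XBar()
system.l3cache = L3Cache()
system.cpu.icache_port = system.l3bus.cpu_side_ports
system.cpu.dcache_port = system.l3bus.cpu_side_ports
system.l3cache.cpu_side = system.l3bus.mem_side_ports

system.membus = SystemXBar()
system.l3cache.mem_side = system.membus.cpu_side_ports

# Back the memory bus with DRAM timing from gem5's DRAMSim2 wrapper
# (its device/system .ini files must be supplied separately).
system.mem_ctrl = DRAMSim2()
system.mem_ctrl.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports
```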
Implications
- Enhancing RISC-V for AI: This work significantly enhances the suitability of the open-source RISC-V instruction set architecture for competitive, high-performance ML and AI hardware development.
- Addressing the Memory Wall: By implementing techniques like tensor-aware caching and hybrid memory, HERMES directly combats the 'memory wall' bottleneck often observed in data-intensive ML processing.
- Optimizing Heterogeneous Systems: The creation of specialized pathways to accelerators (like Gemmini) improves overall system throughput and efficiency by drastically reducing data transfer latency between the CPU/cache hierarchy and the ML processing unit (a back-of-envelope illustration follows this list).
- Future Development Roadmap: The findings and design choices laid out by HERMES serve as a crucial foundation for developing future scalable and low-latency specialized hardware platforms for AI.
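The summary reports no concrete figures, so the following back-of-envelope model only illustrates why a direct pathway helps: every number in it (tile size, link bandwidth, hop latencies) is an assumption invented for the example, not a HERMES measurement.

```python
# Back-of-envelope model of accelerator feed latency. Every number
# here is an assumption for illustration; the summary gives no figures.
TILE_BYTES = 32 * 32 * 4    # one 32x32 fp32 tile
LINK_BYTES_PER_NS = 32      # assumed pathway bandwidth (~32 GB/s)

def tile_transfer_ns(hop_latencies_ns):
    """Serialization time plus the fixed latency of each hop crossed."""
    serialize = TILE_BYTES / LINK_BYTES_PER_NS
    return serialize + sum(hop_latencies_ns)

# Routing through the full hierarchy vs. a direct L3-to-accelerator path.
via_memory  = tile_transfer_ns([40, 25, 60])  # L3 + bus + DRAM round trip
direct_path = tile_transfer_ns([40])          # L3 to accelerator only
print(f"via memory: {via_memory:.0f} ns, direct: {direct_path:.0f} ns")
```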