HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads

Abstract

HERMES proposes a high-performance memory hierarchy for RISC-V architectures executing demanding ML workloads such as CNNs and Transformers. The core innovation is the integration of shared L3 caches with fine-grained coherence protocols, plus specialized pathways linking directly to deep-learning accelerators such as Gemmini. Evaluated with the gem5 and DRAMSim2 simulators, the design targets the low-latency, scalable memory access that modern AI applications demand.

Report

Key Highlights

  • Target Focus: HERMES is designed to optimize the memory subsystem of RISC-V architectures specifically for challenging Machine Learning (ML) workloads.
  • Goal: Address the persistent memory bandwidth, latency, and scalability challenges posed by ML models (e.g., CNNs, RNNs, Transformers).
  • Core Innovation: Integration of shared L3 caches utilizing fine-grained coherence protocols.
  • Accelerator Integration: The hierarchy includes specialized, high-bandwidth pathways connecting directly to deep-learning accelerators, citing Gemmini as an example.
  • Methodology: Performance and scalability were evaluated in simulation using gem5 and DRAMSim2.
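The payoff of a shared L3 can be sketched with a standard average-memory-access-time (AMAT) model. The latencies and hit rates below are illustrative placeholders, not figures from the HERMES evaluation:

```python
# Illustrative AMAT model for a three-level hierarchy with a shared L3.
# All latencies (cycles) and hit rates are assumed placeholder values.

def amat(levels, dram_latency):
    """levels: list of (hit_rate, latency) pairs from L1 down to L3."""
    total, reach_prob = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach_prob * latency        # every access that reaches
        reach_prob *= (1.0 - hit_rate)       # this level pays its latency
    total += reach_prob * dram_latency       # remaining misses go to DRAM
    return total

# Hypothetical parameters: L1 (95% hit, 2 cyc), L2 (80%, 12 cyc),
# shared L3 (70%, 35 cyc), DRAM at 180 cyc.
baseline = amat([(0.95, 2), (0.80, 12), (0.70, 35)], 180)
# Raising the shared L3 hit rate shrinks the costly DRAM component:
improved = amat([(0.95, 2), (0.80, 12), (0.90, 35)], 180)
print(f"baseline AMAT: {baseline:.2f} cyc, improved: {improved:.2f} cyc")
```

Even a modest L3 hit-rate gain cuts average latency, because each DRAM trip is an order of magnitude costlier than an L3 hit.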

Technical Details

  • Architecture Base: RISC-V.
  • Key Techniques Employed:
    • Advanced prefetching.
    • Tensor-aware caching mechanisms.
    • Hybrid memory models.
  • Cache Configuration: The design emphasizes shared L3 caches.
  • Coherence Protocol: Utilizes fine-grained coherence protocols to manage data consistency across the shared L3 cache structure.
  • Evaluation Tools: gem5 (a full-system architectural simulator) and DRAMSim2 (a cycle-accurate DRAM simulator) were used for performance baselining and scalability analysis.
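The "tensor-aware" idea above can be sketched as a prefetcher that uses the tensor's logical shape, rather than a flat address stride, to fetch the next tile of a row-major matrix. The tile walk order and all parameters here are illustrative assumptions, not the HERMES design:

```python
# Sketch of a tensor-aware prefetcher: given the current tile position in
# a row-major matrix, compute the byte addresses of the next tile's rows
# so they can be fetched ahead of the compute kernel.

def tile_prefetch_addrs(base, rows, cols, elem_size, tile, cur_row, cur_col):
    """Return byte addresses of each row-segment of the next tile."""
    next_col = cur_col + tile
    next_row = cur_row
    if next_col >= cols:          # wrap to the first tile of the next row band
        next_col = 0
        next_row = cur_row + tile
    if next_row >= rows:
        return []                 # past the end of the tensor: nothing to fetch
    return [base + (r * cols + next_col) * elem_size
            for r in range(next_row, min(next_row + tile, rows))]

# Prefetch the tile after (0, 0) in an 8x8 fp32 matrix with 4x4 tiles:
addrs = tile_prefetch_addrs(base=0x1000, rows=8, cols=8, elem_size=4,
                            tile=4, cur_row=0, cur_col=0)
print([hex(a) for a in addrs])  # four row-segments of the (0, 4) tile
```

A conventional stride prefetcher would keep streaming along one row; shape awareness lets the hardware jump to the next tile boundary, matching the blocked access pattern of GEMM-style kernels.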

Implications

  • Enhancing RISC-V for AI: This work strengthens the suitability of the open-source RISC-V instruction set architecture for competitive, high-performance ML and AI hardware development.
  • Addressing the Memory Wall: By implementing techniques like tensor-aware caching and hybrid memory, HERMES directly combats the 'memory wall' bottleneck often observed in data-intensive ML processing.
  • Optimizing Heterogeneous Systems: The creation of specialized pathways to accelerators (like Gemmini) improves overall system throughput and efficiency by drastically reducing data transfer latency between the CPU/cache and the ML processing unit.
  • Future Development Roadmap: The findings and design choices laid out by HERMES serve as a crucial foundation for developing future scalable and low-latency specialized hardware platforms for AI.
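The memory-wall point above can be made concrete with a back-of-envelope roofline check: a kernel is bandwidth-bound whenever its arithmetic intensity times the available bandwidth falls below the compute peak. The peak and bandwidth numbers below are assumed for illustration only, not HERMES results:

```python
# Classic roofline model: attainable throughput is capped by the lower of
# the compute roof and the memory roof (bandwidth x arithmetic intensity).

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical accelerator: 512 GFLOP/s peak, 64 GB/s DRAM bandwidth.
# A low-reuse kernel (2 FLOPs/byte) is bandwidth-bound; tensor-aware
# caching that raises effective reuse to 16 FLOPs/byte restores the
# compute roof.
low  = attainable_gflops(512, 64, 2)    # bandwidth-limited
high = attainable_gflops(512, 64, 16)   # compute-limited
print(low, high)
```

This is why techniques that raise on-chip data reuse (tensor-aware caching, shared L3 capacity) attack the memory wall directly rather than requiring more raw DRAM bandwidth.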