HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads
Abstract
HERMES is a proposed high-performance memory hierarchy for RISC-V architectures executing demanding ML workloads such as CNNs and Transformers. The core innovation is the integration of shared L3 caches with fine-grained coherence protocols and specialized pathways linking directly to deep-learning accelerators such as Gemmini. Evaluated with the gem5 and DRAMSim2 simulators, the design aims to deliver the low-latency, scalable memory operations that modern AI applications demand.
Report
Key Highlights
- Target Focus: HERMES is designed to optimize the memory subsystem of RISC-V architectures specifically for challenging Machine Learning (ML) workloads.
- Goal: Address the persistent memory bandwidth, latency, and scalability challenges posed by ML models (e.g., CNNs, RNNs, Transformers).
- Core Innovation: Integration of shared L3 caches utilizing fine-grained coherence protocols.
- Accelerator Integration: The hierarchy includes specialized, high-bandwidth pathways connecting directly to deep-learning accelerators, with Gemmini cited as an example.
- Methodology: Performance and scalability were evaluated in simulation using gem5 and DRAMSim2.
Technical Details
- Architecture Base: RISC-V.
- Key Techniques Employed (the first sketch after this list illustrates the tensor-aware idea):
- Advanced prefetching.
- Tensor-aware caching mechanisms.
- Hybrid memory models.
- Cache Configuration: The design emphasizes shared L3 caches.
- Coherence Protocol: Utilizes fine-grained coherence protocols to manage data consistency across the shared L3 cache structure (a stand-in state-machine sketch follows this list).
- Evaluation Tools: gem5 (a full-system simulator) and DRAMSim2 (a cycle-accurate DRAM simulator) were used for performance baselining and scalability analysis (a minimal configuration fragment follows below).
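The summary does not spell out how these techniques operate, so the following is a minimal sketch of the idea behind tensor-aware prefetching, in Python for readability. All names and sizes here (TensorDescriptor, tile_prefetch_addresses, the 32x32 tile, 64-byte lines) are illustrative assumptions, not HERMES interfaces: because a tensor's traversal order is known ahead of time, a prefetcher can compute the cache lines of the next tile and fetch them while the current tile is still being consumed, rather than reacting to misses.

```python
# Hypothetical sketch of tensor-aware prefetching; names and sizes
# are illustrative assumptions, not taken from HERMES itself.
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    base_addr: int      # start of the tensor in memory
    rows: int           # logical matrix dimensions
    cols: int
    elem_size: int = 4  # bytes per element (e.g., fp32)

def tile_prefetch_addresses(t: TensorDescriptor, tile: int, line: int = 64):
    """Yield the cache-line addresses of each row-major tile, in the
    order a blocked GEMM kernel would touch them. Because the schedule
    is known statically, a prefetcher can stay one tile ahead of the
    compute unit instead of waiting for demand misses."""
    for row0 in range(0, t.rows, tile):
        for col0 in range(0, t.cols, tile):
            lines = set()
            for r in range(row0, min(row0 + tile, t.rows)):
                start = t.base_addr + (r * t.cols + col0) * t.elem_size
                end = start + min(tile, t.cols - col0) * t.elem_size
                for addr in range(start - start % line, end, line):
                    lines.add(addr)
            yield sorted(lines)

# Example: walk a 128x128 fp32 matrix in 32x32 tiles.
desc = TensorDescriptor(base_addr=0x1000_0000, rows=128, cols=128)
for tile_lines in tile_prefetch_addresses(desc, tile=32):
    pass  # issue prefetches for tile_lines while the prior tile computes
```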
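The summary does not name the fine-grained protocol HERMES uses, so the sketch below substitutes a textbook MESI state machine tracked per cache line. The states, events, and transition table are a generic stand-in for illustration, not the HERMES protocol.

```python
# Stand-in coherence model: textbook MESI at cache-line granularity,
# used here only because the summary does not name HERMES's protocol.
from enum import Enum, auto

class State(Enum):
    MODIFIED = auto()
    EXCLUSIVE = auto()
    SHARED = auto()
    INVALID = auto()

# (current state, observed event) -> next state for one cache line.
# "local_*" are requests from the attached core; "remote_*" are
# snooped requests from other cores sharing the L3.
TRANSITIONS = {
    (State.INVALID,   "local_read"):   State.EXCLUSIVE,  # assumes no sharers
    (State.INVALID,   "local_write"):  State.MODIFIED,
    (State.EXCLUSIVE, "local_write"):  State.MODIFIED,   # silent upgrade
    (State.EXCLUSIVE, "remote_read"):  State.SHARED,
    (State.EXCLUSIVE, "remote_write"): State.INVALID,
    (State.SHARED,    "local_write"):  State.MODIFIED,   # others invalidated
    (State.SHARED,    "remote_write"): State.INVALID,
    (State.MODIFIED,  "remote_read"):  State.SHARED,     # after write-back
    (State.MODIFIED,  "remote_write"): State.INVALID,    # after write-back
}

def next_state(state: State, event: str) -> State:
    """Apply one protocol event; pairs not listed leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Example: core 0 reads a line, then core 1 writes the same line.
s = next_state(State.INVALID, "local_read")   # -> EXCLUSIVE
s = next_state(s, "remote_write")             # -> INVALID
```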
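For a sense of the evaluation setup, here is a minimal gem5 configuration fragment in the style of gem5's public Python API, wiring a shared L3 to a DRAMSim2-backed memory controller. The cache parameters, the use of gem5's classic (non-Ruby) memory system, and the DRAMSim2 wrapper usage are assumptions for illustration, not HERMES's actual setup; port and wrapper names vary across gem5 versions, and the fragment omits CPU workload setup.

```python
# Illustrative gem5 configuration fragment (classic memory system).
# Sizes and latencies are assumptions, not HERMES's numbers; the
# DRAMSim2 wrapper and port names vary across gem5 versions.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("2GB")]

system.cpu = TimingSimpleCPU()

# Shared L3 modeled as a single classic Cache behind a crossbar.
class L3Cache(Cache):
    size = "8MB"
    assoc = 16
    tag_latency = 20
    data_latency = 20
    response_latency = 20
    mshrs = 32
    tgts_per_mshr = 12

system.l3bus = L2XBar()
system.l3cache = L3Cache()
system.cpu.icache_port = system.l3bus.cpu_side_ports
system.cpu.dcache_port = system.l3bus.cpu_side_ports
system.l3cache.cpu_side = system.l3bus.mem_side_ports

system.membus = SystemXBar()
system.l3cache.mem_side = system.membus.cpu_side_ports

# Back the memory bus with DRAM timing from gem5's DRAMSim2 wrapper
# (its device/system .ini files must be supplied separately).
system.mem_ctrl = DRAMSim2()
system.mem_ctrl.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports
```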
Implications
- Enhancing RISC-V for AI: This work significantly enhances the suitability of the open-source RISC-V instruction set architecture for competitive, high-performance ML and AI hardware development.
- Addressing the Memory Wall: By implementing techniques like tensor-aware caching and hybrid memory, HERMES directly combats the 'memory wall' bottleneck often observed in data-intensive ML processing.
- Optimizing Heterogeneous Systems: The creation of specialized pathways to accelerators (like Gemmini) improves overall system throughput and efficiency by drastically reducing data transfer latency between the CPU/cache hierarchy and the ML processing unit (a back-of-envelope illustration follows this list).
- Future Development Roadmap: The findings and design choices laid out by HERMES serve as a crucial foundation for developing future scalable and low-latency specialized hardware platforms for AI.
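The summary reports no concrete figures, so the following back-of-envelope model only illustrates why a direct pathway helps: every number in it (tile size, link bandwidth, hop latencies) is an assumption invented for the example, not a HERMES measurement.

```python
# Back-of-envelope model of accelerator feed latency. Every number
# here is an assumption for illustration; the summary gives no figures.
TILE_BYTES = 32 * 32 * 4    # one 32x32 fp32 tile
LINK_BYTES_PER_NS = 32      # assumed pathway bandwidth (~32 GB/s)

def tile_transfer_ns(hop_latencies_ns):
    """Serialization time plus the fixed latency of each hop crossed."""
    serialize = TILE_BYTES / LINK_BYTES_PER_NS
    return serialize + sum(hop_latencies_ns)

# Routing through the full hierarchy vs. a direct L3-to-accelerator path.
via_memory  = tile_transfer_ns([40, 25, 60])  # L3 + bus + DRAM round trip
direct_path = tile_transfer_ns([40])          # L3 to accelerator only
print(f"via memory: {via_memory:.0f} ns, direct: {direct_path:.0f} ns")
```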