CoroAMU: Unleashing Memory-Driven Coroutines through Latency-Aware Decoupled Operations

Abstract

CoroAMU is a hardware-software co-designed system that addresses severe memory latency in data-intensive applications running on disaggregated memory systems. It pairs a compiler that optimizes coroutine management, by minimizing context save/restore overhead and coalescing memory requests, with an enhanced Asynchronous Memory Unit (AMU) that supports decoupled memory operations and memory-guided branch prediction. Implemented on the XiangShan RISC-V processor, CoroAMU achieves average speedups of 3.39x and 4.87x over a baseline processor under 200ns and 800ns of emulated memory latency, respectively.

Report

Key Highlights

  • Core Innovation: CoroAMU is a hardware-software co-designed system specifically tailored for memory-centric coroutines to efficiently hide memory latency.
  • Target Problem: Mitigating the challenges of high memory latency, particularly in modern disaggregated memory systems.
  • Performance Results: The combined hardware-software approach delivers average speedups of 3.39x and 4.87x over the baseline processor under 200ns and 800ns of emulated disaggregated-memory latency, respectively.
  • Compiler Efficiency: The CoroAMU compiler alone achieves a 1.51x speedup over existing state-of-the-art coroutine methods on standard Intel server processors.
  • Implementation Platform: The system is built using LLVM and implemented on the open-source XiangShan RISC-V processor over an FPGA platform.
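
To make the idea of memory-driven coroutines concrete, the following is a minimal software-only sketch, not CoroAMU's compiler output or AMU interface. It assumes a toy pointer-chasing workload in which each coroutine suspends before every dependent load, and a round-robin scheduler resumes the others in the meantime. The names `lookup_chain`, `run_interleaved`, and `table` are hypothetical, and the `yield` merely stands in for "issue an asynchronous load and switch away".

```python
from collections import deque


def lookup_chain(table, start, hops):
    """Coroutine: follow `hops` pointers, suspending before each dependent load."""
    idx = start
    for _ in range(hops):
        yield idx          # stand-in for "issue async load, then suspend"
        idx = table[idx]   # the loaded value is consumed only after resumption
    return idx


def run_interleaved(coros):
    """Round-robin scheduler: resume each coroutine in turn until all finish."""
    ready = deque(enumerate(coros))
    results = {}
    switches = 0
    while ready:
        i, co = ready.popleft()
        try:
            next(co)                 # run until the next suspension point
            switches += 1
            ready.append((i, co))    # requeue behind the other coroutines
        except StopIteration as done:
            results[i] = done.value  # coroutine finished; record its result
    return [results[i] for i in range(len(coros))], switches


table = [3, 0, 4, 1, 2]  # small permutation acting as a pointer chain
results, switches = run_interleaved(
    [lookup_chain(table, s, 3) for s in (0, 1, 2)]
)
print(results, switches)
```

In this sketch every suspension is a context switch, which is why reducing the state saved and restored at each switch matters: the switch cost is paid once per outstanding memory access.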

Technical Details

  • Hardware Component: The architecture enhances the Asynchronous Memory Unit (AMU) to support latency-aware decoupled memory operations.
  • Software Component: The LLVM-based compiler optimizes coroutine code generation, minimizes context save/restore overhead, and coalesces multiple memory requests into fewer, larger operations.
  • Architectural Features: The hardware incorporates coroutine-specific memory operations to interface effectively with the dynamic coroutine schedulers.
  • Prediction Mechanism: A novel memory-guided branch prediction mechanism is introduced to further optimize instruction flow based on expected memory access latency.
  • Evaluation Environment: The evaluation used an FPGA platform that emulates disaggregated-memory latencies of 200ns and 800ns.
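
The request coalescing mentioned under Software Component can be illustrated with a small sketch. This is an assumption-laden illustration rather than the CoroAMU compiler's actual algorithm: it merges `(address, size)` requests that touch or overlap into fewer ranges, subject to a hypothetical cap `max_bytes` on the size of a merged request. The function name `coalesce` and the 64-byte default are illustrative choices, not values taken from the source.

```python
def coalesce(requests, max_bytes=64):
    """Merge touching or overlapping (addr, size) requests into larger ones.

    `max_bytes` is a hypothetical upper bound on a single merged request.
    """
    merged = []  # list of (start, end) half-open byte ranges
    for addr, size in sorted(requests):
        end = addr + size
        if merged:
            start, last_end = merged[-1]
            new_end = max(end, last_end)
            # Extend the previous range only if this request touches or
            # overlaps it and the merged range stays within the cap.
            if addr <= last_end and new_end - start <= max_bytes:
                merged[-1] = (start, new_end)
                continue
        merged.append((addr, end))
    return [(start, end - start) for start, end in merged]


requests = [(0, 8), (8, 8), (16, 8), (128, 8), (132, 8)]
coalesced = coalesce(requests)
print(coalesced)
```

Here five small requests collapse into two larger ones, so a coroutine would suspend twice rather than five times for the same data.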

Implications

  • Future Memory Architectures: CoroAMU offers a blueprint for limiting performance degradation in future high-latency memory environments (such as disaggregated or CXL-based pooling), and its results suggest that memory-centric parallelization benefits substantially from specialized hardware support.
  • RISC-V Ecosystem Leadership: By utilizing and optimizing the open-source XiangShan RISC-V processor, CoroAMU demonstrates RISC-V's viability as a platform for cutting-edge hardware architecture research and high-performance computing solutions.
  • Co-design Validation: The results underscore the value of hardware-software co-design for exploiting fine-grained concurrency (coroutines) while reducing the runtime overhead that has traditionally limited their adoption.
  • Data Center Performance: The significant speedups observed in high-latency scenarios are crucial for data centers and cloud environments where memory latency across network boundaries (for disaggregated memory) is a primary bottleneck.
