Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-Core Computing Clusters for Reliable Processing in Space

Abstract

This paper presents Hybrid Modular Redundancy (HMR), a novel fault-tolerance scheme designed for RISC-V multi-core computing clusters intended for space applications. HMR enables flexible, on-demand dual-core or triple-core lockstep grouping with runtime split-lock capabilities to balance performance and reliability. The architecture introduces a hardware-based fault recovery method that completes recovery in just 24 clock cycles at a modest area overhead (~9.4%). The work is presented as the first open-source RISC-V system to integrate these functionalities.

Report

Key Highlights

  • Novel Redundancy Scheme: Introduction of Hybrid Modular Redundancy (HMR) to address the reliability needs of Space Cyber-Physical Systems (S-CPS).
  • Flexible Operation: HMR supports on-demand dual-core and triple-core lockstep grouping, allowing reliability to be tuned at runtime.
  • Runtime Adaptability: Includes split-lock capabilities, enabling the system to switch between non-redundant (performance) and redundant (reliability) modes with low overhead (<400 clock cycles).
  • High-Speed Recovery: The proposed hardware-based recovery approach completes fault recovery in only 24 clock cycles, compared with 363 clock cycles for the software-based approach.
  • Minimal Area Overhead: With software-based recovery, the 12-core HMR cluster occupies 0.612 mm², an area penalty of only 1.3% over the baseline 12-core cluster.
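To make the lockstep idea above concrete, here is a minimal software sketch of the two checks that dual-core and triple-core grouping rely on: a mismatch comparison for DMR and a bitwise majority vote for TMR. This is not the paper's hardware design. The function names `dmr_mismatch` and `tmr_vote` are hypothetical, and the core outputs are modeled as plain integers for illustration only.

```python
from typing import Tuple


def dmr_mismatch(a: int, b: int) -> bool:
    """Dual-core lockstep (illustrative): two copies can only detect a fault.

    Returns True when the outputs disagree. There is no majority, so the
    system must then run a recovery routine rather than mask the error.
    """
    return a != b


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Triple-core lockstep (illustrative): bitwise 2-of-3 majority vote.

    Returns (voted_value, mismatch). A single faulty output is masked by
    the vote; `mismatch` reports that at least one core disagreed.
    """
    voted = (a & b) | (a & c) | (b & c)
    mismatch = not (a == b == c)
    return voted, mismatch


if __name__ == "__main__":
    # One core (third argument) has a flipped bit; the vote masks it.
    value, fault = tmr_vote(0b1010, 0b1010, 0b1110)
    print(bin(value), fault)
    print(dmr_mismatch(0b1010, 0b1110))
```

The sketch shows why the two modes trade off differently: TMR can keep producing a correct value while flagging the fault, whereas DMR can only signal that the two copies diverged.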

Technical Details

  • Architecture: Fault-tolerant multi-core cluster based on RISC-V processors.
  • Operating Frequency: The cluster operates at 430 MHz.
  • Performance Metrics (Matrix Multiplication):
    • Non-redundant mode: 1160 MOPS.
    • Dual-core mode (DMR): 617 MOPS.
    • Triple-core mode (TMR): 414 MOPS.
  • Recovery Methods and Overhead:
    • Software-based: Requires 363 clock cycles for recovery; 0.612 mm² area occupancy (1.3% area overhead).
    • Hardware-based: Requires 24 clock cycles for rapid recovery; 0.660 mm² area occupancy (9.4% area overhead).
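The figures above can be put in perspective with a short calculation. It expresses each redundant mode's throughput as a fraction of the non-redundant mode, and converts the recovery cycle counts into wall-clock time at 430 MHz. The numbers are taken from the list above. The dictionary and function names are illustrative only.

```python
FREQ_HZ = 430e6  # cluster operating frequency from the summary

# Matrix-multiplication throughput per mode, in MOPS (from the summary).
MOPS = {"non-redundant": 1160, "DMR": 617, "TMR": 414}

# Recovery latency per method, in clock cycles (from the summary).
RECOVERY_CYCLES = {"software": 363, "hardware": 24}


def relative_throughput(mode: str) -> float:
    """Throughput of `mode` as a fraction of the non-redundant mode."""
    return MOPS[mode] / MOPS["non-redundant"]


def recovery_time_ns(method: str) -> float:
    """Recovery latency in nanoseconds at FREQ_HZ."""
    return RECOVERY_CYCLES[method] / FREQ_HZ * 1e9


if __name__ == "__main__":
    for mode in ("DMR", "TMR"):
        print(f"{mode}: {relative_throughput(mode):.1%} of non-redundant throughput")
    for method in ("software", "hardware"):
        print(f"{method} recovery: {recovery_time_ns(method):.1f} ns")
```

DMR retains about 53% and TMR about 36% of non-redundant throughput, close to the one-half and one-third that grouping cores in pairs or triples would suggest. At 430 MHz, hardware recovery takes roughly 56 ns and software recovery roughly 844 ns.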

Implications

  • Cost-Effective Reliability for Space: HMR provides a viable alternative to prohibitively expensive radiation-hardened components, making high reliability more accessible for satellite and spacecraft onboard computers.
  • Advancing RISC-V in Critical Systems: This work significantly enhances the suitability of open-source RISC-V architectures for mission-critical applications by demonstrating integrated, flexible, and high-performance fault tolerance.
  • Dynamic System Management: The runtime split-lock capability allows engineers to finely tune the reliability vs. performance trade-offs during a mission, optimizing resource use depending on the task's criticality.
  • Open-Source Foundation: Being the first system to integrate these advanced fault tolerance features on an open-source RISC-V device establishes a strong foundation for future research and deployment of resilient compute clusters.