Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-Core Computing Clusters for Reliable Processing in Space
Abstract
This paper presents Hybrid Modular Redundancy (HMR), a fault-tolerance scheme for RISC-V multi-core computing clusters intended for space applications. HMR enables flexible, on-demand dual-core or triple-core lockstep grouping with runtime split-lock capabilities to balance performance and reliability. A hardware-based fault recovery method completes recovery in 24 clock cycles at an area overhead of about 9.4%. The paper presents this as the first open-source RISC-V system to integrate these functionalities.
Report
Key Highlights
- Novel Redundancy Scheme: Introduction of Hybrid Modular Redundancy (HMR) to address the reliability needs of Space Cyber-Physical Systems (S-CPS).
- Flexible Operation: HMR supports on-demand dual-core and triple-core lockstep grouping, allowing reliability to be tuned at runtime.
- Runtime Adaptability: Includes split-lock capabilities, enabling the system to switch between non-redundant (performance) and redundant (reliability) modes with low overhead (<400 clock cycles).
- High-Speed Recovery: A hardware-based recovery approach cuts fault recovery time from 363 clock cycles (software-based recovery) to 24 clock cycles.
- Minimal Area Overhead: With software-based recovery, the fault-tolerant cluster occupies 0.612 mm², an area overhead of 1.3% over the baseline 12-core cluster.
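The difference between dual-core and triple-core lockstep can be illustrated with a minimal sketch. It is not the HMR hardware: the function names and the choice to compare a single integer output per core are assumptions made for illustration. It shows only the general principle that two cores can detect a mismatch, while three cores can also mask it by majority vote.

```python
from typing import Optional, Tuple


def dmr_check(a: int, b: int) -> Tuple[Optional[int], bool]:
    """Dual-core lockstep (hypothetical model): compare two core outputs.

    A mismatch reveals that a fault occurred, but not which core is wrong,
    so no trustworthy value can be returned and recovery must follow.
    """
    if a == b:
        return a, False
    return None, True


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Triple-core lockstep (hypothetical model): bitwise majority vote.

    Each output bit takes the value that at least two of the three cores
    agree on, so a fault confined to one core is masked. The flag reports
    that a disagreement was seen and the faulty core needs recovery.
    """
    voted = (a & b) | (a & c) | (b & c)
    fault = not (a == b == c)
    return voted, fault
```

For example, `tmr_vote(0b1010, 0b1010, 0b1110)` yields the correct value `0b1010` with the fault flag set, whereas `dmr_check(5, 7)` can only report the mismatch.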
Technical Details
- Architecture: Fault-tolerant multi-core cluster based on RISC-V processors.
- Operating Frequency: The cluster operates at 430 MHz.
- Performance Metrics (Matrix Multiplication):
  - Non-redundant mode: 1160 MOPS.
  - Dual-core mode (DMR): 617 MOPS.
  - Triple-core mode (TMR): 414 MOPS.
- Recovery Methods and Overhead:
  - Software-based: Requires 363 clock cycles for recovery; 0.612 mm² area occupancy (1.3% area overhead).
  - Hardware-based: Requires 24 clock cycles for rapid recovery; 0.660 mm² area occupancy (9.4% area overhead).
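The figures above can be cross-checked with some simple arithmetic. The sketch below uses only the numbers reported in this summary; the function and constant names are made up for illustration. It derives the throughput of each redundant mode relative to non-redundant operation, the recovery time in nanoseconds at 430 MHz, and the baseline area implied by each reported overhead.

```python
FREQ_MHZ = 430  # reported cluster operating frequency

# Reported matrix-multiplication throughput per mode, in MOPS.
MOPS = {"non-redundant": 1160, "DMR": 617, "TMR": 414}

# Reported recovery latency per method, in clock cycles.
RECOVERY_CYCLES = {"software": 363, "hardware": 24}

# Reported (area in mm^2, overhead in percent) per recovery method.
AREA = {"software": (0.612, 1.3), "hardware": (0.660, 9.4)}


def relative_throughput(mode: str) -> float:
    """Throughput of a mode as a fraction of non-redundant throughput."""
    return MOPS[mode] / MOPS["non-redundant"]


def recovery_time_ns(method: str) -> float:
    """Recovery latency in nanoseconds: cycles / (MHz) * 1000."""
    return RECOVERY_CYCLES[method] / FREQ_MHZ * 1000


def implied_baseline_mm2(method: str) -> float:
    """Baseline cluster area implied by a reported area and overhead."""
    area, overhead_pct = AREA[method]
    return area / (1 + overhead_pct / 100)


if __name__ == "__main__":
    for mode in ("DMR", "TMR"):
        print(f"{mode}: {relative_throughput(mode):.1%} of non-redundant")
    for method in ("software", "hardware"):
        print(f"{method}: {recovery_time_ns(method):.1f} ns recovery, "
              f"implied baseline {implied_baseline_mm2(method):.3f} mm^2")
```

Under these figures, DMR retains roughly 53% and TMR roughly 36% of non-redundant throughput. A naive expectation from grouping 12 cores into 6 pairs or 4 triples would be 50% and 33%. Recovery takes about 56 ns in hardware versus about 0.84 µs in software, and both area figures imply a baseline of about 0.60 mm², so the two reported overheads are mutually consistent.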
Implications
- Cost-Effective Reliability for Space: HMR provides a viable alternative to prohibitively expensive radiation-hardened components, making high reliability more accessible for satellite and spacecraft onboard computers.
- Advancing RISC-V in Critical Systems: This work significantly enhances the suitability of open-source RISC-V architectures for mission-critical applications by demonstrating integrated, flexible, and high-performance fault tolerance.
- Dynamic System Management: The runtime split-lock capability allows engineers to finely tune the reliability vs. performance trade-offs during a mission, optimizing resource use depending on the task's criticality.
- Open-Source Foundation: As the first open-source RISC-V system to integrate these fault-tolerance features, HMR provides a foundation for future research and deployment of resilient compute clusters.
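The reliability versus performance trade-off described above can be sketched as a simple planning function. Everything about the policy is an assumption: the three criticality levels, the mapping from level to mode, and the function name are hypothetical. The throughput figures are the matrix-multiplication results reported above and will not hold for other workloads. The mode-switch cost uses the reported bound of fewer than 400 clock cycles as a worst case.

```python
from enum import Enum
from typing import Tuple


class Criticality(Enum):
    """Hypothetical task criticality levels (not from the paper)."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


FREQ_HZ = 430e6            # reported operating frequency
SWITCH_CYCLES_BOUND = 400  # reported upper bound for a mode switch

# Reported matrix-multiplication throughput per mode, in MOPS.
MODE_MOPS = {"non-redundant": 1160.0, "DMR": 617.0, "TMR": 414.0}

# Assumed policy: more critical tasks get more redundancy.
POLICY = {
    Criticality.LOW: "non-redundant",
    Criticality.MEDIUM: "DMR",
    Criticality.HIGH: "TMR",
}


def plan_task(mega_ops: float, criticality: Criticality,
              current_mode: str) -> Tuple[str, float]:
    """Pick a mode for a task and estimate its duration in seconds.

    The estimate is compute time at the mode's reported throughput plus,
    if the mode changes, the worst-case switch time.
    """
    mode = POLICY[criticality]
    switch_s = SWITCH_CYCLES_BOUND / FREQ_HZ if mode != current_mode else 0.0
    compute_s = mega_ops / MODE_MOPS[mode]
    return mode, switch_s + compute_s
```

At 430 MHz, a 400-cycle switch costs under one microsecond, so in this model the throughput difference between modes dominates for any task longer than a few thousand cycles.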