On-Demand Redundancy Grouping: Selectable Soft-Error Tolerance for a Multicore Cluster
Abstract
This paper introduces On-Demand Redundancy Grouping (ODRG), an architectural scheme that provides selectable soft-error tolerance for multicore clusters operating in critical or hostile environments. Implemented on a six-core open-source RISC-V cluster, ODRG allows run-time switching between fault-tolerant operation and high-performance independent computation. The ODRG unit adds about 1% to the cluster area, and releasing the redundant cores for independent work yields up to a 2.96x performance increase for selected applications.
Report
Key Highlights
- Novelty: Introduction of On-Demand Redundancy Grouping (ODRG) for soft-error tolerance in multicore clusters.
- Selectable Modes: Allows run-time switching between fault-tolerant modes (using core redundancy) and high-performance modes (using all cores independently).
- Target Platform: Augmentation of a six-core open-source RISC-V cluster.
- Efficiency: The ODRG unit adds minimal area overhead, totaling only 1% of the cluster area.
- Performance Gain: When redundancy is not required, the redundant cores can be utilized for independent tasks, yielding up to a 2.96x performance increase for selected applications.
- Fault Recovery Speed: The solution is 2.5x faster in fault recovery re-synchronization compared to commercial state-of-the-art implementations.
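To put the 2.96x figure in context, the short sketch below computes the ideal gain from releasing redundancy groups into independent cores. It assumes, as the configuration described in this summary suggests, that the comparison is between two fault-tolerant cores (two three-core groups) and six independent cores. The function name `ideal_speedup` is hypothetical and used only for illustration.

```python
def ideal_speedup(total_cores: int, group_size: int) -> float:
    """Ideal gain when every redundancy group is split into independent cores.

    Assumption: in redundant mode each group of `group_size` cores behaves
    as one logical core, and the workload scales perfectly across cores.
    """
    if group_size <= 0 or total_cores % group_size != 0:
        raise ValueError("total_cores must be a positive multiple of group_size")
    fault_tolerant_cores = total_cores // group_size
    return total_cores / fault_tolerant_cores


# Six cores grouped three at a time give two fault-tolerant cores.
ideal = ideal_speedup(total_cores=6, group_size=3)

# Reported peak gain from the summary above.
reported = 2.96
fraction_of_ideal = reported / ideal

print(f"ideal speedup: {ideal:.2f}x")
print(f"reported gain as a fraction of ideal: {fraction_of_ideal:.1%}")
```

Under these assumptions the ideal gain is 3x, so the reported 2.96x corresponds to roughly 98.7% of the ideal, which would indicate little overhead in independent mode for the selected applications.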
Technical Details
- Method: On-Demand Redundancy Grouping (ODRG) is an architectural approach providing run-time configurable soft-error tolerance at the core level.
- Configuration Flexibility: The six-core cluster can operate either as two fault-tolerant cores, each formed from a three-core redundancy group (consistent with triple modular redundancy, TMR), or as six individual high-performance cores.
- Overhead Metrics: The ODRG unit adds less than 11% of a single core's area when used to form a three-core redundancy group. Timing increase is reported as negligible.
- Implementation Base: ODRG is built on an open-source RISC-V cluster, demonstrating its feasibility in open architectures.
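The grouping idea above can be illustrated with a minimal software model of majority voting. This is a sketch under stated assumptions, not the ODRG hardware design: the names `Mode`, `majority_vote`, and `cluster_outputs` are hypothetical, and grouping adjacent cores (0-2 and 3-5) is an assumption, since this summary does not say which cores form a group.

```python
from collections import Counter
from enum import Enum
from typing import List, Optional, Sequence, Tuple


class Mode(Enum):
    INDEPENDENT = "independent"  # every core produces its own result
    REDUNDANT = "redundant"      # cores are grouped and their results voted


def majority_vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Return (voted value, indices of cores that disagree with it).

    If no value has a strict majority, the voted value is None and every
    core in the group is flagged.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


def cluster_outputs(core_outputs: Sequence[int], mode: Mode,
                    group_size: int = 3) -> List[Optional[int]]:
    """Model the visible results of the cluster in either mode.

    Assumption: groups are formed from adjacent cores.
    """
    if mode is Mode.INDEPENDENT:
        return list(core_outputs)
    groups = [core_outputs[i:i + group_size]
              for i in range(0, len(core_outputs), group_size)]
    return [majority_vote(group)[0] for group in groups]


# Core 2 suffers a simulated soft error (5 instead of 7).
outputs = [7, 7, 5, 4, 4, 4]
print(cluster_outputs(outputs, Mode.REDUNDANT))    # two voted results
print(cluster_outputs(outputs, Mode.INDEPENDENT))  # six independent results
print(majority_vote(outputs[:3]))                  # voted value and faulty core index
```

In redundant mode the six cores yield two voted results and the corrupted value is masked. In independent mode all six results are exposed without protection, which is the trade-off the selectable modes make available.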
Implications
- Aerospace/Critical Applications: ODRG directly addresses the serious concern of run-time faults caused by radiation in hostile environments (like space), making RISC-V clusters viable for high-reliability and safety-critical missions.
- Mixed-Criticality Systems: The ability to trade reliability for performance at run time provides flexibility for systems that alternate between periods requiring high reliability (redundant operation) and periods requiring peak computational throughput.
- RISC-V Ecosystem Maturity: This work adds a configurable fault-tolerance building block to the open RISC-V hardware ecosystem, strengthening RISC-V as an alternative to the proprietary architectures that have traditionally dominated high-reliability and fault-tolerant computing.
- Competitive Advantage: Achieving 2.5x faster fault recovery re-synchronization than commercial state-of-the-art implementations, at about 1% cluster-area overhead, suggests that open-source implementations can match or exceed commercial designs on these metrics.