On-Demand Redundancy Grouping: Selectable Soft-Error Tolerance for a Multicore Cluster

Abstract

This paper introduces On-Demand Redundancy Grouping (ODRG), a novel architectural scheme that provides selectable soft-error tolerance for multicore clusters operating in critical or hostile environments. Implemented on a six-core open-source RISC-V cluster, ODRG allows run-time switching between fault-tolerant operation and high-performance independent computation. The ODRG unit adds only 1% to the cluster area, and when redundancy is deactivated the otherwise redundant cores enable up to a 2.96x performance increase for selected applications.

Report

Key Highlights

  • Novelty: Introduction of On-Demand Redundancy Grouping (ODRG) for soft-error tolerance in multicore clusters.
  • Selectable Modes: Allows run-time switching between fault-tolerant modes (using core redundancy) and high-performance modes (using all cores independently).
  • Target Platform: Augmentation of a six-core open-source RISC-V cluster.
  • Efficiency: The ODRG unit adds minimal area overhead, totaling only 1% of the cluster area.
  • Performance Gain: When redundancy is not required, the redundant cores can be utilized for independent tasks, yielding up to a 2.96x performance increase for selected applications.
  • Fault Recovery Speed: The solution is 2.5x faster in fault recovery re-synchronization compared to commercial state-of-the-art implementations.
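
As a rough sanity check on the performance figure above, the reported speedup can be compared with the ideal gain from freeing the redundant cores. This is only a back-of-the-envelope sketch: it assumes the baseline is the fault-tolerant configuration (six cores acting as two logical cores) and perfectly parallel work, neither of which this summary states explicitly. The variable names are illustrative.

```python
# Illustrative arithmetic only; the baseline and group size are assumptions.
TOTAL_CORES = 6
GROUP_SIZE = 3            # assumed three-core redundancy groups
REPORTED_SPEEDUP = 2.96   # figure quoted in the summary

# In redundant mode each group of three behaves as one logical core.
logical_cores_redundant = TOTAL_CORES // GROUP_SIZE      # 2

# Ideal gain if all six cores do independent, perfectly parallel work.
ideal_speedup = TOTAL_CORES / logical_cores_redundant    # 3.0

fraction_of_ideal = REPORTED_SPEEDUP / ideal_speedup

print(f"ideal speedup: {ideal_speedup:.2f}x")
print(f"reported speedup reaches {fraction_of_ideal:.1%} of ideal")
# prints:
# ideal speedup: 3.00x
# reported speedup reaches 98.7% of ideal
```

Under these assumptions, the 2.96x figure sits close to the 3x ceiling, which would suggest little overhead beyond the cores themselves for the selected applications.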

Technical Details

  • Method: On-Demand Redundancy Grouping (ODRG) is an architectural approach providing run-time configurable soft-error tolerance at the core level.
  • Configuration Flexibility: The six-core cluster can operate either as two fault-tolerant cores (which implies two three-core redundancy groups, in the style of triple modular redundancy, TMR) or as six individual high-performance cores.
  • Overhead Metrics: The ODRG unit adds less than 11% of a single core's area when used to form a three-core redundancy group. Timing increase is reported as negligible.
  • Implementation Base: Utilized an open-source RISC-V cluster, demonstrating the feasibility of ODRG in open architectures.
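
The grouping and voting idea described above can be illustrated with a small software model. This is a minimal sketch, not the paper's hardware design: the names `Mode`, `group_cores`, and `majority_vote` are hypothetical, and the model assumes that each redundancy group compares its cores' outputs and masks a single disagreeing core by majority vote.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    """Hypothetical run-time modes for the cluster model."""
    INDEPENDENT = "independent"  # every core runs its own work, no checking
    REDUNDANT = "redundant"      # cores are grouped and their outputs voted


def group_cores(num_cores, mode, group_size=3):
    """Return lists of core ids; each list acts as one logical core."""
    if mode is Mode.INDEPENDENT:
        return [[core] for core in range(num_cores)]
    if num_cores % group_size != 0:
        raise ValueError("core count must be a multiple of the group size")
    return [list(range(start, start + group_size))
            for start in range(0, num_cores, group_size)]


def majority_vote(outputs):
    """Vote over one group's outputs.

    Returns (voted_value, faulty_indices). Raises RuntimeError when no
    strict majority exists, since the fault then cannot be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Six cores in redundant mode form two logical cores.
groups = group_cores(6, Mode.REDUNDANT)

# Simulate a single bit flip in the third core of a group.
value, faulty = majority_vote([0x2A, 0x2A, 0x6A])

print(groups, hex(value), faulty)
# prints: [[0, 1, 2], [3, 4, 5]] 0x2a [2]
```

In this model, the indices returned in `faulty` identify which core would need to be re-synchronized with the majority before the group continues. How ODRG performs that recovery in hardware is not described in this summary.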

Implications

  • Aerospace/Critical Applications: ODRG directly addresses the serious concern of run-time faults caused by radiation in hostile environments (like space), making RISC-V clusters viable for high-reliability and safety-critical missions.
  • Mixed-Criticality Systems: The ability to trade reliability for performance at run time gives flexibility to systems that alternate between phases requiring redundant, fault-tolerant execution and phases requiring peak computational throughput.
  • RISC-V Ecosystem Maturity: This work adds a fault-tolerance scheme built on an open-source RISC-V cluster, strengthening RISC-V as an alternative to the proprietary architectures that have traditionally dominated high-reliability and fault-tolerant computing.
  • Competitive Advantage: Fault recovery re-synchronization that is 2.5x faster than a commercial state-of-the-art implementation, together with lower overhead, suggests that open-source implementations can match or exceed commercial fault-tolerance solutions.