Experimental evaluation of neutron-induced errors on a multicore RISC-V platform
Abstract
This study experimentally evaluates neutron-induced soft errors on the GAP8 multicore RISC-V ASIC platform, addressing the need for reliability data in safety-critical domains. The research found that computing-intensive applications, specifically Convolutional Neural Networks (CNN), exhibited an error rate 3.2x higher than the platform average under neutron exposure. Crucially, the evaluation also revealed significant inherent application resilience, as 96.12% of errors observed during CNN execution did not result in misclassification.
Report
Key Highlights
- Targeted Evaluation: This is an experimental evaluation of neutron-induced soft errors specifically targeting a commercial multicore RISC-V ASIC platform.
- High Error Rate in Computing: Computing-intensive applications, such as classification Convolutional Neural Networks (CNN), demonstrated an error rate 3.2x higher than the average error rate recorded on the platform.
- Tolerated Errors: The vast majority of errors (96.12%) induced during CNN execution did not lead to functional failures (misclassifications), indicating high inherent fault masking within the application.
- Major Failure Mode: The primary source of application interruption failures on the GAP8 platform was determined to be application hangs (e.g., due to an infinite loop or a race condition).
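The outcome categories behind these highlights (errors that leave the classification intact, misclassifications, and hangs) can be made concrete with a small sketch. The summary does not describe the test harness used in the study, so everything below is an illustrative assumption: the `Outcome` names, the `classify_run` helper, the convention that a timed-out run is reported as `None`, and the sample scores are all hypothetical.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"                  # output matches the golden run exactly
    TOLERATED = "tolerated_error"        # output differs, predicted class unchanged
    MISCLASSIFIED = "misclassification"  # predicted class differs from the golden run
    HANG = "hang"                        # no result before the timeout


def argmax(scores):
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=scores.__getitem__)


def classify_run(golden_scores, observed_scores):
    """Classify one CNN run; observed_scores is None if the run timed out."""
    if observed_scores is None:
        return Outcome.HANG
    if list(observed_scores) == list(golden_scores):
        return Outcome.CORRECT
    if argmax(observed_scores) == argmax(golden_scores):
        return Outcome.TOLERATED
    return Outcome.MISCLASSIFIED


def tolerated_fraction(outcomes):
    """Share of output errors that left the predicted class unchanged."""
    counts = Counter(outcomes)
    errors = counts[Outcome.TOLERATED] + counts[Outcome.MISCLASSIFIED]
    return counts[Outcome.TOLERATED] / errors if errors else None


# Made-up scores for illustration only; these are not data from the study.
golden = [0.1, 0.7, 0.2]
runs = [
    [0.1, 0.7, 0.2],  # identical to the golden output
    [0.1, 0.6, 0.2],  # corrupted value, same predicted class
    [0.9, 0.7, 0.2],  # corrupted value, different predicted class
    None,             # run never returned (hang)
]
outcomes = [classify_run(golden, r) for r in runs]
print(Counter(o.value for o in outcomes))
print(tolerated_fraction(outcomes))
```

Under this scheme, the 96.12% figure would correspond to `tolerated_fraction` computed over all runs whose output differed from the golden one, while hangs are counted separately as interruption failures.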
Technical Details
- Platform: A commercial multicore RISC-V ASIC platform known as GAP8 was used for the experiments.
- Error Source: The platform was exposed to a neutron beam to induce soft errors representative of those caused by atmospheric radiation.
- Workload: The tests focused on computing-intensive workloads, specifically classification Convolutional Neural Networks (CNN).
- Scope: The research aims to fill the gap in application error-rate evaluation for RISC-V processors; comparable data is typically well documented for standard x86 architectures.
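Beam experiments of this kind are commonly reported as a cross-section (observed errors divided by neutron fluence), which can then be scaled to a failures-in-time (FIT) estimate for a natural environment. The sketch below shows that conversion under stated assumptions: the function names are hypothetical, the error count and fluence are invented, and the default flux of 13 n/cm²/h is the commonly used sea-level reference value, not a figure taken from this study.

```python
def cross_section_cm2(error_count, fluence_n_per_cm2):
    """Error cross-section: observed errors divided by neutron fluence."""
    if fluence_n_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return error_count / fluence_n_per_cm2


def fit_rate(sigma_cm2, flux_n_per_cm2_per_h=13.0):
    """Expected errors per 10^9 device-hours at the given neutron flux.

    The default flux is an assumed sea-level reference value.
    """
    return sigma_cm2 * flux_n_per_cm2_per_h * 1e9


# Hypothetical inputs for illustration only; NOT measurements from the study.
sigma = cross_section_cm2(error_count=50, fluence_n_per_cm2=1e10)
fit = fit_rate(sigma)
print(f"cross-section: {sigma:.2e} cm^2, FIT: {fit:.1f}")
```

Comparing cross-sections per application is what allows a statement such as "the CNN error rate is 3.2x the platform average" to be made independently of how long each workload was exposed.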
Implications
- Validating RISC-V for Critical Systems: By providing concrete experimental data on error rates under neutron flux, this work supports the qualification of RISC-V architectures for safety-critical and mission-critical domains (such as aerospace or automotive).
- Guiding Redundancy Strategies: The finding that computing-intensive tasks have significantly higher error rates suggests that specific hardware or software redundancy measures must be prioritized for core computational units running AI/ML workloads on RISC-V.
- Resilience of ML Applications: The finding that 96.12% of errors during CNN execution did not cause misclassification implies that these workloads possess high inherent tolerance to soft errors, potentially reducing the need for expensive high-level error correction in specific domains.
- Focus on Multicore Stability: The identification of application hangs (e.g., from infinite loops or race conditions) as the major interruption source directs future RISC-V platform developers to focus mitigation efforts on improving multicore synchronization mechanisms and operating system resilience rather than solely targeting data path errors.
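The trade-off between a higher raw error rate and high application-level tolerance can be put in rough numbers using only the two figures reported above. This is a back-of-the-envelope sketch, not a result from the study: it assumes "3.2x higher" means 3.2 times the platform average, and it compares CNN misclassifications against the platform-average rate of all observed errors, which is only a loose comparison. The variable names are hypothetical.

```python
# Figures quoted in this summary.
cnn_rate_ratio = 3.2      # CNN error rate relative to the platform average (assumed multiplier)
tolerated_share = 0.9612  # share of CNN errors that did not cause misclassification

# Share of CNN errors that did change the predicted class.
misclassifying_share = 1.0 - tolerated_share

# Misclassification rate of the CNN, expressed relative to the platform-average error rate.
relative_misclassification_rate = cnn_rate_ratio * misclassifying_share

print(f"misclassifying share: {misclassifying_share:.2%}")
print(f"relative misclassification rate: {relative_misclassification_rate:.3f}")
```

Under these assumptions, about 3.88% of CNN errors cause a misclassification, so the CNN misclassification rate works out to roughly 0.12 times the platform-average error rate. Hangs are not covered by this estimate and would need to be accounted for separately.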