Introducing Instruction-Accurate Simulators for Performance Estimation of Autotuning Workloads

Abstract

This paper introduces a novel interface that lets autotuning workloads, which are critical for optimizing machine-learning models, execute efficiently on instruction-accurate simulators rather than on scarce target hardware. By training machine-learning predictors on simulation statistics, the authors achieve highly scalable performance estimation across diverse architectures. On the tested x86, ARM, and RISC-V systems, the true best workload implementation consistently falls within the top 3% of predictions, and the simulation-based approach is faster than native execution, especially for embedded targets.

Report

Key Highlights

  • Scalability Innovation: The study presents an interface allowing autotuning workloads to run on simulators, drastically increasing scalability and parallel execution capability, especially when target hardware (HW) availability is limited.
  • Performance Prediction: The core technique involves training various predictors (likely ML models) to forecast the actual runtime performance on the target HW based on statistical data collected from instruction-accurate simulations.
  • High Accuracy: The methodology proved highly effective, consistently placing the true best workload implementation within the top 3% of all predictions across all tested architectures.
  • Architectural Scope: The validation included diverse instruction sets: x86, ARM, and RISC-V-based architectures.
  • Efficiency Gain: For embedded architectures, the simulation approach was shown to outperform native execution on the target HW when running as few as three samples in parallel on three simulators.
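The "top 3%" criterion above can be sketched as a simple ranking check: rank all candidate implementations by predicted runtime and test whether the true fastest one lands in the top fraction. This is a hypothetical illustration, not the paper's evaluation code; the function name and data are invented for the example.

```python
# Hypothetical sketch: check whether the true fastest implementation
# falls within the top 3% of predictor-ranked candidates.
def best_in_top_fraction(predicted, measured, fraction=0.03):
    """predicted/measured: runtime lists indexed by candidate id."""
    k = max(1, int(len(predicted) * fraction))
    # Candidate indices sorted by predicted runtime, fastest first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i])
    true_best = min(range(len(measured)), key=lambda i: measured[i])
    return true_best in ranked[:k]

# Example with 100 synthetic candidates: the predictor slightly
# mis-orders the two fastest, but the true best is still in the top 3.
predicted = [float(i) for i in range(100)]
measured = [float(i) for i in range(100)]
measured[0], measured[1] = measured[1], measured[0]  # true best is index 1
print(best_in_top_fraction(predicted, measured))  # -> True
```

A predictor only needs to rank candidates well enough for this check to pass; its absolute runtime estimates can be off as long as the ordering near the optimum is preserved.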

Technical Details

  • Optimization Target: Machine Learning (ML) workloads, known for their large optimization spaces, were used as the basis for autotuning.
  • Simulation Requirement: The method relies on fast instruction-accurate simulators to generate the statistics needed to train the predictors.
  • Prediction Inputs: Predictors are trained using features derived from simulation statistics (e.g., instruction counts, cycle counts, memory accesses).
  • Core Method: The approach replaces the traditionally mandatory step of executing performance evaluations directly on the physical target hardware with a parallel simulation phase combined with an ML-based performance mapping.
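The mapping from simulation statistics to target-hardware runtime can be illustrated with a minimal linear predictor. This is an assumption-laden sketch, not the paper's actual model: the feature set (instruction count, cycle count, memory accesses), the synthetic weights, and the least-squares fit are all invented for illustration; the paper trains "various predictors" whose exact form is not specified here.

```python
import numpy as np

# Hypothetical sketch: fit a linear predictor mapping per-candidate
# simulation statistics to measured runtime on the target hardware.
rng = np.random.default_rng(0)

# Synthetic training set: one row per autotuning candidate, with
# illustrative features [instruction count, cycle count, memory accesses].
features = rng.uniform(1e3, 1e6, size=(50, 3))
true_weights = np.array([2e-9, 5e-9, 1e-8])   # invented ground truth
runtime = features @ true_weights              # "measured" runtimes (s)

# Least-squares fit: learn weights from simulation features alone.
weights, *_ = np.linalg.lstsq(features, runtime, rcond=None)

# Rank unseen candidates by predicted runtime, with no HW execution.
candidates = rng.uniform(1e3, 1e6, size=(10, 3))
predicted = candidates @ weights
best = int(np.argmin(predicted))
```

Because only the simulation phase touches each candidate, the expensive per-candidate work parallelizes across as many simulator instances as compute allows, while a single short measurement campaign on real hardware suffices to fit the predictor.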

Implications

  • Benefit for RISC-V Ecosystem: As RISC-V is often deployed in custom and embedded contexts where physical hardware prototypes may be rare or expensive, this simulation-based autotuning method enables rapid and comprehensive optimization of compilers and specialized hardware designs without needing constant access to the final silicon.
  • Accelerated Development Cycle: Developers can parallelize optimization (autotuning) across massive compute clusters running simulators, significantly reducing the time required to find optimal software implementations for new architectures.
  • Customization and Specialization: The high accuracy across multiple ISA families (including RISC-V) confirms that this method is robust for hardware architects looking to evaluate the performance impact of specialized instruction set extensions or microarchitectural tweaks early in the design phase.
  • Cost Reduction: By shifting extensive testing from expensive, specialized target hardware to generalized, easily accessible parallel computing resources, the total cost and time associated with ML workload optimization are substantially reduced, especially for resource-constrained embedded systems.