CHERI-SIMT: Implementing Capability Memory Protection in GPUs

Abstract

CHERI-SIMT introduces the first unified architecture to integrate the CHERI (Capability Hardware Enhanced RISC Instructions) protection model directly into the Single Instruction, Multiple Threads (SIMT) execution pipeline of GPUs. It addresses the critical lack of robust hardware memory safety in high-throughput accelerators, protecting parallel workloads from common memory vulnerabilities such as buffer overflows. The proposed design uses vectorized capability-checking units and dedicated capability memory structures to enforce fine-grained permissions with minimal performance and area overhead.

Report

Key Highlights

  • Capability Integration in SIMT: Successfully adapts the rigorous CHERI memory protection model, typically designed for general-purpose CPUs, into the highly parallel, vectorized SIMT architecture characteristic of modern GPUs.
  • Enhanced Memory Safety for Accelerators: Provides robust hardware-enforced memory safety for thousands of concurrent threads (warps/wavefronts), mitigating severe vulnerabilities (e.g., illegal pointer dereferences, heap corruption) that are common in GPU kernel programming.
  • Minimal Performance Overhead: Achieves capability enforcement with reported overheads typically below 10% relative to an unprotected baseline GPU architecture, maintaining the high throughput crucial for AI and HPC workloads.
  • Hardware Efficiency: Introduces specialized microarchitectural components designed to manage capability metadata efficiently at scale, avoiding bottlenecks associated with traditional metadata lookups.

Technical Details

  • Vectorized Capability Checking: The core innovation involves a Capability Vector Unit (CVU) integrated into the Streaming Multiprocessor (SM) pipeline. This unit performs parallel bounds and permission checks for all threads within a single warp simultaneously, capitalizing on SIMT coherence.
  • Capability Memory Hierarchy: To handle the increased metadata requirements, the design incorporates Capability TLBs (CTLB) and specialized capability caches. These structures utilize techniques like metadata compression and parallel lookups to minimize latency penalties associated with fetching and validating capability pointers.
  • RISC-V ISA Extensions: The architecture relies on specific extensions of the RISC-V ISA, derived from the CHERI standard, customized for vectorized operations. These include instructions for warp-level capability derivation, sealing, and movement optimized for shared memory and register file access patterns.
  • Thread Context Management: Each thread within a warp retains its own set of Capability Status Registers (CSRs) and thread-local capabilities, ensuring fine-grained, thread-specific protection boundaries are maintained during execution.
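The warp-level checking described above can be sketched in a few lines. The following Python model is illustrative only, not the paper's hardware: it shows how a Capability Vector Unit might validate every lane of a warp in one step, with each thread holding its own thread-local capability. All names here (`Capability`, `cvu_check`, `PERM_LOAD`) are invented for this sketch.

```python
from dataclasses import dataclass

WARP_SIZE = 32
PERM_LOAD, PERM_STORE = 0b01, 0b10

@dataclass(frozen=True)
class Capability:
    base: int        # lowest address the capability authorizes
    length: int      # size of the authorized region in bytes
    perms: int       # permission bits (load/store)
    tag: bool = True # validity tag; an untagged capability authorizes nothing

def cvu_check(caps, addrs, widths, perm_needed):
    """Model of a warp-wide capability check.

    In hardware all lanes are checked in parallel; here we loop.
    Returns a per-lane fault mask: True means a capability violation."""
    faults = []
    for cap, addr, width in zip(caps, addrs, widths):
        in_bounds = cap.base <= addr and addr + width <= cap.base + cap.length
        permitted = (cap.perms & perm_needed) == perm_needed
        faults.append(not (cap.tag and in_bounds and permitted))
    return faults

# Per-thread capabilities: each lane is bounded to its own 64-byte allocation,
# modeling the thread-local protection boundaries described above.
caps = [Capability(base=0x1000 + 64 * i, length=64, perms=PERM_LOAD)
        for i in range(WARP_SIZE)]
addrs = [0x1000 + 64 * i for i in range(WARP_SIZE)]
addrs[5] += 64  # lane 5 strays one allocation past its bounds
faults = cvu_check(caps, addrs, [4] * WARP_SIZE, PERM_LOAD)
print(faults[5], faults[0])  # lane 5 faults; lane 0 does not
```

Because all lanes of a warp issue their checks together, the fault mask falls out of a single vectorized comparison rather than 32 serial lookups, which is what lets the CVU exploit SIMT coherence.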
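The metadata-compression point above can also be made concrete. The sketch below shows why compressed bounds shrink CTLB and capability-cache entries: for large regions, base and top are stored at a coarser granularity relative to an exponent, rather than as full addresses. This is a simplified illustration in the spirit of CHERI's compressed-bounds encodings, not the actual bit format; `MANTISSA_BITS` and both function names are assumptions of this example.

```python
MANTISSA_BITS = 8  # assumed width of each stored bounds field

def compress_bounds(base, top):
    """Pick the smallest exponent at which the region spans fewer than
    2**MANTISSA_BITS granules, then store granule-aligned bounds."""
    e = 0
    while (top >> e) - (base >> e) >= (1 << MANTISSA_BITS):
        e += 1
    # Round base down and top up so the encoded region never shrinks.
    return e, base >> e, -(-top >> e)  # -(-x >> e) is ceil(x / 2**e)

def decompress_bounds(e, b, t):
    return b << e, t << e

e, b, t = compress_bounds(0x10000, 0x10000 + 4096)
lo, hi = decompress_bounds(e, b, t)
assert lo <= 0x10000 and hi >= 0x10000 + 4096  # region is still covered
```

The trade-off is the classic one for compressed capabilities: large regions get slightly over-approximated bounds (rounded out to the granule), in exchange for metadata small enough to cache and look up in parallel without stalling the memory pipeline.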

Implications

  • Revolutionizing HPC and AI Security: CHERI-SIMT provides a fundamental security guarantee for mission-critical GPU workloads, such as large language model training, financial modeling, and scientific simulations, where data integrity and protection against memory corruption are paramount.
  • Expanding the RISC-V/CHERI Ecosystem: This work significantly extends the reach of the RISC-V ecosystem by demonstrating that advanced memory safety features like CHERI are viable not just for CPUs and embedded systems, but also for highly specialized parallel accelerators (GPUs).
  • Industry Standard for Accelerators: Sets a strong precedent for future accelerator design, signaling a potential shift towards mandatory, hardware-enforced memory protection as a baseline security requirement across the entire computing stack, especially as GPUs handle increasingly sensitive data and operating system tasks.
  • Simplifying Secure Parallel Programming: By offloading complex memory validation to the hardware, CHERI-SIMT reduces the burden on programmers to manually manage memory bounds, potentially leading to fewer security bugs in highly complex parallel kernels.
