Compressed Real Numbers for AI: a case-study using a RISC-V CPU
Abstract
This paper investigates optimizing Deep Neural Network (DNN) inference on CPUs by using compressed number formats such as bfloat and posit to reduce memory bandwidth demands. The core contribution is a method that loads 16-bit or 8-bit compressed tensors directly into the vector registers of a vector-capable CPU and decompresses them just before computation. This strategy improves cache efficiency and bandwidth utilization, making lower-precision formats viable even when the final calculation runs on a standard 32-bit Floating-Point Unit (FPU).
Report
Key Highlights
- Target Problem: Improving memory bandwidth and cache efficiency during DNN inference on CPUs, particularly in systems without dedicated GPU accelerators.
- Solution: A post-load decompression strategy in which compressed real numbers (bfloat, posit) are loaded into vector registers and decompressed in situ before processing (see the sketch after this list).
- Formats Studied: Focuses on 16-bit and 8-bit versions of both bfloat and posit number formats, which are known to maintain DNN accuracy while reducing storage size.
- Performance Gain: Compared to storing and transferring uncompressed binary32 data, the 16-bit and 8-bit formats halve or quarter memory traffic for weights, saving bandwidth and increasing cache effectiveness.
- Context: The paper performs a case study using a RISC-V CPU and evaluates the architectural parameters necessary for this compressed approach to be performance-advantageous.
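To make the post-load decompression idea concrete, the following C sketch stores weights as 16-bit bfloat values, expands each one to binary32 only after it has been loaded, and performs the multiply-accumulate on a standard 32-bit FPU. The scalar loop stands in for the paper's vectorized implementation, and the names bf16_to_f32 and dot_bf16 are illustrative rather than taken from the paper.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* bfloat16 keeps the sign and 8-bit exponent layout of binary32, so
 * expansion is just a shift into the upper 16 bits of a 32-bit word. */
static inline float bf16_to_f32(uint16_t b)
{
    uint32_t u = (uint32_t)b << 16;
    float f;
    memcpy(&f, &u, sizeof f);   /* bit-level reinterpretation */
    return f;
}

/* Dot product with compressed weights: weights travel through the memory
 * hierarchy at half the binary32 size, are decompressed after the load,
 * and the multiply-add itself runs in 32-bit floating point. */
static float dot_bf16(const uint16_t *w_bf16, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += bf16_to_f32(w_bf16[i]) * x[i];
    return acc;
}
```

Because the bfloat expansion is a single shift, the decompression cost stays small relative to the bandwidth saved by moving half as many bytes through the caches.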
Technical Details
- Baseline Format: Single-precision IEEE 754 floating point (binary32) is the format commonly used for training and is the default for CPU inference.
- Low-Precision Formats: bfloat (16-bit and 8-bit) and posit (16-bit and 8-bit) are used to compress the storage of weights/biases.
- Processing Environment: The proposed method targets vector-capable CPUs, leveraging their vector registers for the decompression phase.
- Implementation Flow: Compressed operands are loaded from memory into vector registers, decompressed to a 32-bit representation (when the FPU requires it), and the computation is then executed on the 32-bit FPU; a posit decoding sketch follows this list.
- Architectural Analysis: The study identifies the architectural requirements and bottlenecks (e.g., decompression latency, vector unit size) under which the proposed compression method outperforms standard 32-bit processing.
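For posit operands, the decompression step is more involved than for bfloat because the regime, exponent, and fraction fields have variable widths. The following minimal C sketch decodes an n-bit posit into binary32; it is a scalar illustration of the decompression stage, and the choice of es (e.g., two exponent bits, as in the 2022 Posit Standard) is an assumption rather than a detail taken from the paper.

```c
#include <stdint.h>
#include <math.h>

/* Minimal sketch: decode an n-bit posit (right-aligned in a uint32_t) with
 * "es" exponent bits into binary32. Scalar illustration only; the paper's
 * vectorized decoder is not reproduced here. */
static float posit_to_f32(uint32_t bits, int n, int es)
{
    const uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    bits &= mask;

    if (bits == 0u)               return 0.0f;   /* zero             */
    if (bits == (1u << (n - 1)))  return NAN;    /* NaR (not a real) */

    int sign = (int)((bits >> (n - 1)) & 1u);
    if (sign) bits = (~bits + 1u) & mask;        /* two's complement */

    /* Regime: run of identical bits after the sign bit. */
    uint32_t regime_bit = (bits >> (n - 2)) & 1u;
    int run = 0, i = n - 2;
    while (i >= 0 && ((bits >> i) & 1u) == regime_bit) { run++; i--; }
    i--;                                         /* skip the terminator bit */
    int k = regime_bit ? (run - 1) : -run;

    /* Exponent: next es bits, zero-padded if the posit is too short. */
    int exp = 0, e_left = es;
    while (e_left > 0 && i >= 0) { exp = (exp << 1) | (int)((bits >> i) & 1u); i--; e_left--; }
    exp <<= e_left;

    /* Fraction: remaining bits with an implicit leading 1. */
    double frac = 1.0, weight = 0.5;
    while (i >= 0) { frac += (double)((bits >> i) & 1u) * weight; weight *= 0.5; i--; }

    double value = ldexp(frac, k * (1 << es) + exp);
    return (float)(sign ? -value : value);
}
```

As a quick sanity check, posit_to_f32(0x4000u, 16, 2) returns 1.0f and its two's-complement encoding 0xC000u returns -1.0f, independent of es.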
Implications
- RISC-V Ecosystem Advancement: This research directly supports the use of RISC-V processors for Machine Learning inference by demonstrating a powerful architectural optimization for handling low-precision data without requiring specialized hardware accelerators or custom 16-bit FPUs.
- Memory Bottleneck Alleviation: By halving or quartering the memory footprint of DNN weights, the technique directly addresses the memory-bandwidth bottleneck common in data-intensive AI workloads.
- Enhanced Edge AI Capability: The solution allows embedded and edge devices based on RISC-V to perform complex DNN inference with better performance-per-watt and memory utilization, making AI models more deployable in constrained environments.
- Promotion of posit and bfloat: The study provides practical evidence and a methodology for integrating non-traditional number formats (such as posit) efficiently into standard CPU pipelines, encouraging their broader adoption across the computing industry.