A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets
Abstract
This paper proposes NTX, a scalable near-memory acceleration engine designed for high-precision training of deep neural networks, moving beyond the traditional focus on inference acceleration. NTX is implemented on the Logic Base die of a Hybrid Memory Cube (HMC) and is loosely coupled to general-purpose RISC-V cores, reducing offloading overhead by 7x compared to previously published results. The architecture achieves 2.7x higher energy efficiency than contemporary GPUs while delivering 1.2 Tflop/s of full floating-point training performance.
Report
Key Highlights
- Novel Focus: The research specifically targets the acceleration of Deep Neural Network training using near-memory computing (NMC), a domain historically overlooked in favor of inference acceleration.
- NTX Accelerator: Introduction of the NTX near-memory acceleration engine, optimized for gradient-based training methods.
- Performance Metrics: NTX achieves a compute performance of 1.2 Tflop/s at full IEEE 754 floating-point precision.
- Energy Efficiency: Demonstrates a 2.7x energy efficiency improvement compared to contemporary GPUs, while requiring 4.4x less silicon area.
- Scalability: When scaled to meshes of HMCs in a data center scenario, NTX maintains above 95% parallel and energy efficiency, offering 3.1x performance or 2.1x energy savings over GPU-based systems.
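The scalability claim above can be made concrete with a back-of-the-envelope calculation. Parallel efficiency is speedup divided by device count, so at 95% efficiency an HMC mesh sustains nearly the ideal linear-scaling aggregate. This is an illustrative sketch: the per-device rate and the 95% figure come from the summary, but the mesh size of 16 HMCs is a hypothetical choice, not a number from the paper.

```python
# Back-of-the-envelope reading of the scalability claim.
per_device_tflops = 1.2      # per NTX-equipped HMC, from the summary
parallel_efficiency = 0.95   # speedup / device_count, lower bound reported
devices = 16                 # hypothetical mesh size, for illustration only

# Parallel efficiency discounts the ideal linear-scaling aggregate.
aggregate_tflops = per_device_tflops * devices * parallel_efficiency
print(f"{devices}-cube mesh sustains about {aggregate_tflops:.2f} Tflop/s")
```

At 95% efficiency the mesh loses only a twentieth of its ideal throughput, which is why the paper can claim near-linear data-center scaling.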
Technical Details
- Architecture Location: The NTX engine is embedded into the residual area on the Logic Base die of a Hybrid Memory Cube (HMC), leveraging the high bandwidth and low latency of near-memory access.
- Coupling Mechanism: NTX co-processors are loosely coupled to standard RISC-V cores. This coupling is key to reducing offloading overhead, yielding a 7x improvement over previously published results.
- Data Path: An optimized, high-precision, IEEE 754-compliant data path handles fast convolutions and accurate gradient propagation, providing the numerical quality required for state-of-the-art training.
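The emphasis on a full-precision data path can be motivated with a small numerical experiment: when many tiny gradient contributions are summed, a narrow accumulator silently rounds them away. This is an illustrative sketch, not NTX code, and the value range (around 1e-4) and element count are assumptions chosen to make the absorption effect visible.

```python
# Illustrative sketch (not NTX code): why a full-precision IEEE 754
# accumulator matters when summing many small gradient contributions.
import numpy as np

rng = np.random.default_rng(0)
# 100k small, positive gradient-like terms, ~1e-4 in magnitude (assumed).
grads = np.abs(rng.standard_normal(100_000)).astype(np.float32) * np.float32(1e-4)

ref = grads.sum(dtype=np.float64)      # float64 reference result
acc32 = grads.sum(dtype=np.float32)    # full 32-bit accumulation

# Half-precision accumulation: once the running sum grows past ~0.25,
# each ~1e-4-sized term falls below half an ulp and is rounded away.
acc16 = np.float16(0.0)
for g in grads.astype(np.float16):
    acc16 = np.float16(acc16 + g)

err32 = abs(float(acc32) - float(ref)) / float(ref)
err16 = abs(float(acc16) - float(ref)) / float(ref)
print(f"ref={float(ref):.4f}  fp32 rel err={err32:.2e}  fp16 rel err={err16:.2%}")
```

The float32 accumulator stays within a tiny relative error of the reference, while the float16 accumulator loses most of the sum, which is the kind of gradient degradation a full-precision training data path avoids.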
Implications
- Validation of RISC-V in HPC/AI: The successful integration of RISC-V cores as the host processors, loosely coupled with the specialized NTX accelerator, validates the RISC-V instruction set architecture as a robust foundation for building high-performance, heterogeneous computing platforms tailored for AI workloads.
- Advancing Near-Memory Computing (NMC): This work provides a powerful proof-of-concept for deploying complex, high-precision training algorithms within the memory stack itself (PIM/NMC), effectively tackling the 'memory wall' bottleneck inherent in large-scale dataset training.
- Disruption in Data Center AI: By demonstrating superior energy efficiency and competitive performance in a scalable mesh configuration, the NTX architecture offers a viable alternative to traditional GPU-centric training systems, potentially driving the adoption of customized, energy-optimized hardware in data centers.