NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI
Abstract
NTX is an energy-efficient streaming accelerator designed for 32-bit floating-point generalized reduction workloads, including training large Deep Neural Networks. Implemented in 22nm FD-SOI technology and orchestrated by a RISC-V core, the accelerator delivers up to 20 Gflop/s at 168 mW. Projections show that scaling NTX to 14nm can achieve 1.4 Tflop/s while offering a 3x energy efficiency improvement over contemporary GPUs using 10.4x less silicon area.
Report
Key Highlights
- Energy Efficiency: NTX is a streaming accelerator built for energy-efficient execution of floating-point generalized reduction workloads, such as multiply-accumulate (MAC).
- Implementation: The design was realized in a 22 nm FD-SOI technology process.
- Generalization: Although it originated from Deep Learning needs, the architecture achieves up to 87% of its peak performance across general reduction workloads.
- Scalability: The architecture is modular, enabling deployment from low-power embedded scenarios up to high-performance GPU-class systems.
- Competitive Edge: When scaled to 14nm, NTX is projected to offer a 3x improvement in energy efficiency compared to contemporary GPUs, using 10.4x less silicon area.
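To make the term "generalized reduction" concrete, here is a minimal software sketch. It is not NTX's programming interface or instruction set; the function names (`generalized_reduction`, `mac`, `max_abs`) are hypothetical and chosen only for illustration. The idea is that a stream of operands is transformed element by element and then folded into a single accumulator, with multiply-accumulate as the most common instance. Python floats are 64-bit, so the sketch does not model the accelerator's 32-bit arithmetic.

```python
from typing import Callable, Iterable, Sequence, TypeVar

A = TypeVar("A")
T = TypeVar("T")


def generalized_reduction(
    combine: Callable[[A, T], A], init: A, stream: Iterable[T]
) -> A:
    """Fold a stream of values into one accumulator (hypothetical helper)."""
    acc = init
    for value in stream:
        acc = combine(acc, value)
    return acc


def mac(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Multiply-accumulate: sum of element-wise products (a dot product)."""
    products = (x * y for x, y in zip(xs, ys))
    return generalized_reduction(lambda acc, p: acc + p, 0.0, products)


def max_abs(xs: Sequence[float]) -> float:
    """Same reduction skeleton with a different map (abs) and combine (max)."""
    return generalized_reduction(max, 0.0, (abs(x) for x in xs))


if __name__ == "__main__":
    print(mac([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print(max_abs([-3.0, 2.0, 1.5]))
```

Both kernels share one loop structure and differ only in the per-element operation and the combining operator, which is the property that lets a single streaming engine cover more than Deep Learning kernels.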
Technical Details
- Target Workloads: Floating-point Generalized Reduction (MAC-intensive kernels).
- Processing Unit Type: Set of 32-bit floating-point streaming co-processors.
- Control Plane: The co-processors are loosely coupled to a RISC-V core, which handles orchestration, data movement, and computation coordination.
- Measured Performance (22nm FD-SOI): Delivers up to 20 Gflop/s at a clock frequency of 1.25 GHz.
- Power Consumption (22nm FD-SOI): 168 mW at the 20 Gflop/s operating point.
- Projected Performance (14nm scale): Capable of 1.4 Tflop/s for training state-of-the-art networks with full floating-point precision.
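The figures above imply a few derived quantities that the summary does not state directly. The following sketch computes them from the quoted numbers only; the helper names are hypothetical, and the results are peak-figure arithmetic rather than values reported by the authors. In particular, the ratio between the 14nm projection and the 22nm result says nothing about how that scaling is achieved (unit count, frequency, or both).

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency in Gflop/s per watt."""
    return gflops / watts


def picojoules_per_flop(gflops: float, watts: float) -> float:
    """Energy per floating-point operation in pJ (W / (flop/s) = J/flop)."""
    return watts / (gflops * 1e9) * 1e12


def flops_per_cycle(gflops: float, clock_ghz: float) -> float:
    """Peak floating-point operations completed per clock cycle."""
    return gflops / clock_ghz


# Numbers quoted in this summary for the 22nm FD-SOI implementation.
PEAK_GFLOPS = 20.0
POWER_W = 0.168
CLOCK_GHZ = 1.25
PROJECTED_14NM_GFLOPS = 1400.0  # 1.4 Tflop/s

if __name__ == "__main__":
    print(f"{gflops_per_watt(PEAK_GFLOPS, POWER_W):.1f} Gflop/s/W")
    print(f"{picojoules_per_flop(PEAK_GFLOPS, POWER_W):.1f} pJ/flop")
    print(f"{flops_per_cycle(PEAK_GFLOPS, CLOCK_GHZ):.0f} flop/cycle")
    print(f"{PROJECTED_14NM_GFLOPS / PEAK_GFLOPS:.0f}x projected throughput ratio")
```

Under these numbers the 22nm design works out to roughly 119 Gflop/s per watt, about 8.4 pJ per operation, and 16 floating-point operations per cycle at peak.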
Implications
- Validation of RISC-V: NTX reinforces the role of RISC-V as a highly suitable, low-overhead control processor for orchestrating complex, highly parallelized streaming accelerators in heterogeneous SoC platforms.
- Accelerator Design Paradigm: The projected 3x energy-efficiency gain and 10.4x area reduction over contemporary GPUs make a case for specialized single-precision (32-bit floating-point) streaming engines for reduction operations, as opposed to general-purpose architectures.
- Adoption of FD-SOI: Utilizing 22nm FD-SOI demonstrates the viability of specialized silicon technologies for creating power-optimized hardware, potentially accelerating the adoption of such accelerators in embedded and mobile devices where power budget is critical.