NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI

Abstract

NTX is an energy-efficient streaming accelerator designed for 32-bit floating-point generalized reduction workloads, including training large Deep Neural Networks. Implemented in 22nm FD-SOI technology and orchestrated by a RISC-V core, the accelerator delivers up to 20 Gflop/s at 168 mW. Projections show that scaling NTX to 14nm can achieve 1.4 Tflop/s while offering a 3x energy efficiency improvement over contemporary GPUs using 10.4x less silicon area.

Report

Key Highlights

  • Energy Efficiency: NTX is an energy-efficient streaming accelerator for floating-point generalized reduction workloads, such as multiply-accumulate (MAC).
  • Implementation: The design was realized in a 22 nm FD-SOI technology process.
  • Generalization: Although motivated by deep learning, the architecture achieves up to 87% of its peak performance across general reduction workloads.
  • Scalability: The architecture is modular, enabling deployment from low-power embedded scenarios up to high-performance GPU-class systems.
  • Competitive Edge: When scaled to 14nm, NTX is projected to offer a 3x improvement in energy efficiency compared to contemporary GPUs, using 10.4x less silicon area.
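
To make the term "generalized reduction" concrete, the following is a minimal software sketch of the pattern: two input streams are combined element by element and the results are folded into a single accumulator. The function names (`generalized_reduction`, `mac_dot`, `max_abs_diff`) are hypothetical and chosen for illustration; this is not the NTX programming interface, and Python floats are 64-bit rather than the 32-bit values the accelerator operates on.

```python
from functools import reduce
import operator


def generalized_reduction(xs, ys, combine, accumulate, init):
    """Fold `accumulate` over the element-wise results of `combine(x, y)`.

    Hypothetical helper for illustration only, not an NTX API.
    """
    return reduce(accumulate, map(combine, xs, ys), init)


def mac_dot(xs, ys):
    # Multiply-accumulate: combine = multiply, accumulate = add.
    return generalized_reduction(xs, ys, operator.mul, operator.add, 0.0)


def max_abs_diff(xs, ys):
    # Same streaming pattern with a different combine/accumulate pair.
    return generalized_reduction(xs, ys, lambda x, y: abs(x - y), max, 0.0)


if __name__ == "__main__":
    print(mac_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print(max_abs_diff([1.0, 5.0], [2.0, 1.0]))
```

The multiply-accumulate case is the one named in this summary; the maximum-difference variant is only an assumed example of how the same structure can express other reductions.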

Technical Details

  • Target Workloads: Floating-point Generalized Reduction (MAC-intensive kernels).
  • Processing Unit Type: Set of 32-bit floating-point streaming co-processors.
  • Control Plane: The co-processors are loosely coupled to a RISC-V core, which handles orchestration, data movement, and computation coordination.
  • Measured Performance (22nm FD-SOI): Achieves 20 Gflop/s at a clock frequency of 1.25 GHz.
  • Power Consumption (22nm FD-SOI): Operates at 168 mW.
  • Projected Performance (14nm scale): Capable of 1.4 Tflop/s for training state-of-the-art networks with full floating-point precision.
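
The figures above can be turned into a few derived metrics with simple arithmetic. The sketch below uses only the numbers quoted in this summary and assumes that the 168 mW power figure applies at the 20 Gflop/s operating point; the helper names are hypothetical.

```python
# Figures quoted in this summary for the 22nm FD-SOI implementation.
PEAK_FLOPS = 20e9   # 20 Gflop/s
POWER_W = 0.168     # 168 mW (assumed to apply at the 20 Gflop/s point)
CLOCK_HZ = 1.25e9   # 1.25 GHz


def gflops_per_watt(flops, watts):
    """Energy efficiency in Gflop/s per watt."""
    return flops / watts / 1e9


def picojoules_per_flop(flops, watts):
    """Average energy spent per floating-point operation, in pJ."""
    return watts / flops * 1e12


def flops_per_cycle(flops, clock_hz):
    """Floating-point operations completed per clock cycle."""
    return flops / clock_hz


if __name__ == "__main__":
    print(round(gflops_per_watt(PEAK_FLOPS, POWER_W), 2))
    print(round(picojoules_per_flop(PEAK_FLOPS, POWER_W), 2))
    print(flops_per_cycle(PEAK_FLOPS, CLOCK_HZ))
```

With the quoted figures this works out to roughly 119 Gflop/s per watt, 8.4 pJ per flop, and 16 flops per cycle. If one multiply-accumulate is counted as two flops (a common convention, assumed here rather than stated in the summary), that corresponds to 8 MACs per cycle.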

Implications

  • Validation of RISC-V: NTX reinforces the role of RISC-V as a low-overhead control processor for orchestrating parallel streaming accelerators in heterogeneous SoC platforms.
  • Accelerator Design Paradigm: This work makes a strong case for specialized, fixed-precision (32-bit floating-point) streaming engines for reduction operations, suggesting that specialization is key to large energy efficiency gains over general-purpose architectures such as GPUs.
  • Adoption of FD-SOI: Utilizing 22nm FD-SOI demonstrates the viability of specialized silicon technologies for creating power-optimized hardware, potentially accelerating the adoption of such accelerators in embedded and mobile devices where power budget is critical.