Tensor Slicing and Optimization for Multicore NPUs

Abstract

This paper introduces the Tensor Slicing Optimization (TSO) pass for the TensorFlow XLA/LLVM compiler, designed to improve CNN performance on highly constrained Multicore Neural Processor Units (NPUs). TSO efficiently partitions convolution tensors to maximize parallelism and memory utilization while minimizing host-to-NPU data transfers by leveraging DRAM memory burst time estimates. An evaluation on a 32-core RISC-V NeuroMorphic Processor (NMP) demonstrated significant execution-time reductions, with speed-ups of up to 21.7% over standard slicing techniques.

Report

Key Highlights

  • Innovation: Introduction of the Tensor Slicing Optimization (TSO) compiler pass for efficient data parallelization of Convolutional Neural Networks (CNNs).
  • Goal: Minimize memory transactions between the host and NPU on-chip memory while maximizing parallelism and MAC utilization across NPU cores (a rough partitioning sketch follows this list).
  • Methodology: TSO guides tensor slicing based on DRAM memory burst time estimates, optimizing for memory transfer efficiency rather than just core utilization.
  • Performance Gain: Experimental results showed speed-ups of up to 21.7% when comparing the burst-based TSO technique against a non-optimized, no-burst data slicing approach.
  • Toolchain Integration: The TSO optimization was implemented as a pass in the TensorFlow XLA/LLVM compiler and was also successfully ported and validated on the Glow Machine Learning framework.
  • Target Hardware: The approach was tested on the NeuroMorphic Processor (NMP), a multicore NPU.
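
To make the slicing idea concrete, here is a minimal Python sketch of how a convolution's output could be partitioned so that each core's slice, together with the weights and the input window it reads, fits in a small on-chip memory. All names and sizes (slice_conv_output, num_cores, onchip_bytes, the 128 KiB budget) are illustrative assumptions, not the NMP's actual parameters or the paper's implementation.

```python
import math

def slice_conv_output(out_h, out_w, out_c, in_c, k_h, k_w, stride,
                      num_cores=32, onchip_bytes=128 * 1024, elem_bytes=1):
    """Largest per-core output-row count whose working set fits on chip."""
    weight_bytes = k_h * k_w * in_c * out_c * elem_bytes
    max_rows = math.ceil(out_h / num_cores)      # larger slices would idle some cores
    for rows in range(max_rows, 0, -1):          # prefer the biggest slice that fits
        in_rows = (rows - 1) * stride + k_h      # input rows this output slice reads
        in_cols = (out_w - 1) * stride + k_w     # input columns it reads
        input_bytes = in_rows * in_cols * in_c * elem_bytes
        output_bytes = rows * out_w * out_c * elem_bytes
        if weight_bytes + input_bytes + output_bytes <= onchip_bytes:
            return rows
    return 1  # fall back to one output row per slice

# Example: 64x64x64 output of a 3x3 convolution over 32 input channels.
print(slice_conv_output(64, 64, 64, 32, 3, 3, stride=1))  # -> 2 rows per core
```

This only captures the fit-and-balance constraint; the DRAM burst-time estimates that distinguish TSO are sketched after the Technical Details list below.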

Technical Details

  • Optimization Name: Tensor Slicing Optimization (TSO).
  • Framework Integration: Implemented as a compiler optimization pass within the TensorFlow XLA/LLVM stack.
  • Architecture Focus: Highly constrained Multicore Neural Processor Units (NPUs) characterized by small on-chip memory footprints.
  • Core Hardware Used for Evaluation: The NeuroMorphic Processor (NMP), which consists of 32 RISC-V cores.
  • RISC-V Extensions: The RISC-V cores within the NMP are extended with novel CNN-specific instructions that accelerate neural network computations.
  • Optimization Metric: The key differentiator is the use of hardware-specific timing information (DRAM memory burst time estimates) to determine the optimal tensor slice size, ensuring efficient data transfers alongside workload balancing (see the cost-model sketch after this list).
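
The burst-driven part of the slice-size choice can be pictured with a small cost model. The sketch below is a hedged illustration under invented assumptions (BURST_BYTES, BURST_TIME_NS, and the candidate slice sizes are made-up values, not NMP measurements) and ignores the compute-balancing terms the actual pass also considers: it simply charges each slice for the whole DRAM bursts it occupies and keeps the cheapest candidate.

```python
import math

BURST_BYTES = 64        # assumed DRAM burst length
BURST_TIME_NS = 30.0    # assumed time per burst, including activation overhead

def transfer_time_ns(slice_bytes, num_slices):
    """Slices that are not burst-aligned still pay for whole bursts."""
    bursts_per_slice = math.ceil(slice_bytes / BURST_BYTES)
    return num_slices * bursts_per_slice * BURST_TIME_NS

def pick_slicing(total_bytes, candidate_slice_bytes):
    """Among candidate slice sizes, choose the one minimizing estimated DRAM time."""
    best = None
    for slice_bytes in candidate_slice_bytes:
        num_slices = math.ceil(total_bytes / slice_bytes)
        cost = transfer_time_ns(slice_bytes, num_slices)
        if best is None or cost < best[1]:
            best = (slice_bytes, cost)
    return best

# A 1 MiB tensor: a 96-byte slice wastes part of every second burst, while a
# 128-byte slice is burst-aligned and needs fewer bursts overall.
print(pick_slicing(1 << 20, [96, 128, 160]))  # -> (128, 491520.0)
```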

Implications

  • Advancing RISC-V NPUs: This work validates that complex ML optimizations can be effectively integrated into the software stack for RISC-V based NPUs, which often rely on custom instructions (like the CNN extensions mentioned) for competitive performance.
  • Compiler Toolchain Maturation: By integrating TSO into major ML compiler infrastructures (XLA/LLVM and Glow), the paper demonstrates a path toward mature, performance-critical software stacks necessary for deploying ML models efficiently on specialized hardware ecosystems like RISC-V.
  • Solving Heterogeneous Memory Bottlenecks: The emphasis on optimizing for DRAM burst time directly addresses the persistent challenge of latency and bandwidth bottlenecks when bridging large host memory (DRAM) and small, fast on-chip memories, a common issue in specialized accelerators and edge AI devices.
  • Performance Benchmarking: The quantifiable 21.7% speed-up provides strong evidence that hardware-aware compiler optimization is critical for maximizing the utilization of multicore RISC-V accelerators in real-world CNN applications.
