Fused-Tiled Layers: Minimizing Data Movement on RISC-V SoCs with Software-Managed Caches

Abstract

This work introduces Fused-Tiled Layers (FTL), a novel algorithm for automatically fusing tiled layers in Deep Neural Networks (DNNs) to minimize data movement. FTL addresses a common issue in specialized SoCs with software-managed cache hierarchies: layer-wise tiling reduces on-chip memory pressure but increases total memory traffic, creating a transfer bottleneck. Implemented on a heterogeneous RISC-V (RV32) SoC, FTL achieves up to a 60.1% runtime reduction by cutting combined on-chip and off-chip data movement by 47.1%.

Report

Key Highlights

  • Novel Algorithm: Introduces Fused-Tiled Layers (FTL) for automatic fusion of tiled layers in DNN computational graphs.
  • Target Optimization: Specifically designed to counteract the data movement overhead caused by traditional layer-wise tiling in SoCs with software-managed memory hierarchies.
  • Performance Gain: Achieved up to 60.1% reduction in runtime during testing.
  • Data Movement Reduction: Minimized total data transfer (both on-chip and off-chip) by 47.1%.
  • Deployment Focus: Integrated and tuned within an open-source deployment framework for RISC-V targets.

Technical Details

  • Architecture Focus: Heterogeneous RISC-V (RV32) SoCs.
  • Memory Model: The optimizations target multi-level software-managed memory hierarchies, typical in specialized DNN accelerators.
  • Problem Addressed: While layer-wise tiling reduces peak on-chip memory occupation, it traditionally necessitates more memory transfers, because each layer's intermediate results are spilled off-chip and reloaded, which is costly in both latency and energy.
  • Methodology: FTL acts as a compiler optimization pass, automatically fusing consecutive layers so they execute tile-by-tile: each tile's intermediate activations stay on-chip and flow directly into the next layer, eliminating the intermediate I/O.
  • Validation Case: Performance metrics were derived from testing FTL on a typical Multi-Layer Perceptron (MLP) stage, as found in a Vision Transformer (ViT).
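The mechanism above can be sketched with a toy calculation. The snippet below is an illustrative model (not the paper's implementation) of an MLP stage y = relu(x·W1)·W2: in the layer-wise schedule the intermediate activation round-trips off-chip, while in the fused-tiled schedule each row tile's intermediate stays on-chip. The dimensions, tile size, and the byte-counting model (weights assumed resident, only activation traffic counted) are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 64, 32, 128          # tokens, embedding dim, hidden dim (hypothetical)
TILE = 16                      # row-tile size: one tile fits in on-chip memory
x  = rng.standard_normal((N, D)).astype(np.float32)
W1 = rng.standard_normal((D, H)).astype(np.float32)
W2 = rng.standard_normal((H, D)).astype(np.float32)

def layerwise(x):
    """Each layer tiled on its own: the intermediate h round-trips off-chip."""
    moved = x.nbytes                      # load input
    h = np.maximum(x @ W1, 0.0)
    moved += 2 * h.nbytes                 # store h off-chip, then reload it
    y = h @ W2
    moved += y.nbytes                     # store output
    return y, moved

def fused_tiled(x):
    """FTL-style fusion: per row tile, h never leaves on-chip memory."""
    moved = 0
    out = np.empty((N, D), dtype=np.float32)
    for i in range(0, N, TILE):
        xt = x[i:i + TILE]
        moved += xt.nbytes                # load input tile
        ht = np.maximum(xt @ W1, 0.0)     # intermediate stays on-chip
        out[i:i + TILE] = ht @ W2
        moved += out[i:i + TILE].nbytes   # store output tile
    return out, moved

y1, m1 = layerwise(x)
y2, m2 = fused_tiled(x)
assert np.allclose(y1, y2, atol=1e-4)     # fusion preserves the result
print(f"layer-wise: {m1} B, fused: {m2} B, saved {100 * (1 - m2 / m1):.1f}%")
```

Because the hidden dimension is wider than the input and output dimensions, the intermediate dominates traffic in this toy model, so fusion removes the largest term; the actual savings reported in the work (47.1%) depend on the real layer shapes and memory hierarchy.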

Implications

  • RISC-V Ecosystem Enhancement: FTL provides a critical software optimization layer that maximizes the efficiency of RISC-V-based DNN accelerators, making them more competitive against proprietary solutions.
  • Improved Energy Efficiency: By dramatically reducing the volume of off-chip memory transfers, FTL directly translates to lower power consumption, crucial for battery-powered edge AI and IoT devices.
  • Software/Hardware Co-Design: The success of FTL validates the need for joint software (compiler/framework) and hardware (RISC-V SoC architecture) co-design, leveraging the flexibility of the open RISC-V standard.
  • Bottleneck Resolution: This innovation addresses one of the primary performance and energy bottlenecks in modern computing—the memory wall—specifically tailoring the solution for embedded accelerators.
