Fused-Tiled Layers: Minimizing Data Movement on RISC-V SoCs with Software-Managed Caches
Abstract
This work introduces Fused-Tiled Layers (FTL), an algorithm that automatically fuses tiled layers in Deep Neural Networks (DNNs) to reduce data movement. Layer-wise tiling lowers memory occupation on specialized SoCs but increases the total number of memory transfers, which becomes a bottleneck on architectures with software-managed caches. Implemented on a heterogeneous RISC-V (RV32) SoC, FTL reduces off-chip and on-chip data movement by 47.1%, which yields a runtime reduction of up to 60.1%.
Report
Key Highlights
- Novel Algorithm: Introduces Fused-Tiled Layers (FTL) for automatic fusion of tiled layers in DNN computational graphs.
- Target Optimization: Specifically designed to counteract the data movement overhead caused by traditional layer-wise tiling in SoCs with software-managed memory hierarchies.
- Performance Gain: Up to 60.1% runtime reduction on the evaluated Vision Transformer (ViT) MLP stage.
- Data Movement Reduction: Reduced total data transfer (both on-chip and off-chip) by 47.1%.
- Deployment Focus: Integrated and tuned within an open-source deployment framework for RISC-V targets.
Technical Details
- Architecture Focus: Heterogeneous RISC-V (RV32) SoCs.
- Memory Model: The optimizations target multi-level software-managed memory hierarchies, typical in specialized DNN accelerators.
- Problem Addressed: While layer-wise tiling reduces memory occupation, it traditionally necessitates more memory transfers, resulting in costly off-chip copies and energy inefficiency.
- Methodology: FTL operates as a compiler-level optimization that automatically fuses consecutive tiled layers, so that intermediate results can stay in on-chip memory instead of being written out and read back between layers.
- Validation Case: Performance metrics were measured on a typical Multi-Layer Perceptron (MLP) stage of a Vision Transformer (ViT).
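To make the trade-off between layer-wise tiling and fused tiling concrete, the following is a minimal Python sketch of a two-layer MLP tiled along its rows. It is not the FTL algorithm itself, which this summary does not describe in detail. The function names (`mlp_layerwise`, `mlp_fused`), the row-only tiling, the choice to reload weights for every tile, and the element-count traffic model are all assumptions made for illustration.

```python
# Illustrative model only: counts elements moved between off-chip and on-chip
# memory for a two-layer MLP, y = relu(x @ w1) @ w2, tiled along the rows of x.
# All names and the traffic model are hypothetical, not taken from FTL.

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def size(m):
    """Number of elements in a non-empty matrix."""
    return len(m) * len(m[0])

def mlp_layerwise(x, w1, w2, tile_rows):
    """Run each layer to completion; the hidden tensor lives off-chip."""
    moved = 0
    hidden = []
    for i in range(0, len(x), tile_rows):
        x_tile = x[i:i + tile_rows]
        moved += size(x_tile) + size(w1)   # load input tile and weights
        h_tile = relu(matmul(x_tile, w1))
        moved += size(h_tile)              # store hidden tile off-chip
        hidden.extend(h_tile)
    out = []
    for i in range(0, len(hidden), tile_rows):
        h_tile = hidden[i:i + tile_rows]
        moved += size(h_tile) + size(w2)   # reload hidden tile and weights
        y_tile = matmul(h_tile, w2)
        moved += size(y_tile)              # store output tile
        out.extend(y_tile)
    return out, moved

def mlp_fused(x, w1, w2, tile_rows):
    """Run both layers per tile; the hidden tile never leaves on-chip memory."""
    moved = 0
    out = []
    for i in range(0, len(x), tile_rows):
        x_tile = x[i:i + tile_rows]
        moved += size(x_tile) + size(w1) + size(w2)  # load inputs and weights
        h_tile = relu(matmul(x_tile, w1))            # stays on-chip
        y_tile = matmul(h_tile, w2)
        moved += size(y_tile)                        # store output tile
        out.extend(y_tile)
    return out, moved

# Small demo with integer data so results compare exactly.
n, d, h, tile_rows = 8, 4, 16, 2
x = [[(i + j) % 3 - 1 for j in range(d)] for i in range(n)]
w1 = [[(i * j) % 5 - 2 for j in range(h)] for i in range(d)]
w2 = [[(i + 2 * j) % 4 - 1 for j in range(d)] for i in range(h)]

out_lw, moved_lw = mlp_layerwise(x, w1, w2, tile_rows)
out_fused, moved_fused = mlp_fused(x, w1, w2, tile_rows)
print(out_lw == out_fused)      # True
print(moved_lw, moved_fused)    # 832 576
```

Under this model the fused variant saves exactly two passes over the hidden tensor (one store and one load, `2 * n * h` elements), at the cost of holding both weight matrices and the hidden tile on-chip at the same time. A real deployment flow would have to choose tile sizes so that this larger working set still fits the on-chip memory.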
Implications
- RISC-V Ecosystem Enhancement: FTL adds a software optimization layer that improves the efficiency of RISC-V-based DNN accelerators, which can make them more competitive with proprietary solutions.
- Improved Energy Efficiency: By reducing the volume of off-chip memory transfers, FTL is expected to lower power consumption, which matters for battery-powered edge AI and IoT devices.
- Software/Hardware Co-Design: The results support joint design of software (compiler/framework) and hardware (RISC-V SoC architecture), building on the flexibility of the open RISC-V standard.
- Bottleneck Resolution: This work addresses the memory wall, one of the primary performance and energy bottlenecks in modern computing, with a solution tailored to embedded accelerators.