Flare: Flexible In-Network Allreduce

Abstract

Flare is a flexible programmable switch architecture designed to accelerate the allreduce communication operation, a key bottleneck in distributed systems, by offloading aggregation to the network. Existing in-network solutions cannot be customized for specific data types, sparse data, or custom operators; Flare addresses this by leveraging PsPIN, a specialized RISC-V architecture. By designing and analyzing novel aggregation algorithms on this architecture, the work demonstrates improved performance compared to current state-of-the-art approaches.

Report

Key Highlights

  • Target Operation: Focuses on optimizing the allreduce operation, a communication routine critical to distributed applications like deep learning and high-performance computing (HPC); a minimal host-side sketch of the operation follows this list.
  • Innovation: Introduces "Flare," a flexible programmable switch designed for in-network aggregation to reduce bandwidth usage and network traffic.
  • Customization: Addresses the rigidity of current solutions by supporting custom operators, specific data types, and sparse data, and by guaranteeing reproducible aggregation.
  • Hardware Foundation: The programmable switch is built using PsPIN, a RISC-V architecture implementing the sPIN programming model.
  • Results: Demonstrates performance improvements through the analysis and modeling of different aggregation algorithms tailored for this flexible architecture.
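For context, allreduce leaves every participant with the element-wise reduction of all participants' inputs. The minimal MPI sketch below shows the host-side operation that Flare offloads into the switch; the vector length, data type, and sum operator are illustrative assumptions, not details taken from the paper.

```c
/* Minimal host-side allreduce example (illustrative; not Flare code).
 * Every rank contributes a local vector and receives the element-wise
 * sum across all ranks -- the reduction Flare offloads into the switch.
 * Compile with mpicc and launch with mpirun. */
#include <mpi.h>
#include <stdio.h>

#define N 4  /* illustrative vector length */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float local[N], global[N];
    for (int i = 0; i < N; i++)
        local[i] = (float)(rank + i);   /* each rank's local contribution */

    /* Element-wise sum across all ranks; every rank gets the same result. */
    MPI_Allreduce(local, global, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global[0] = %f\n", rank, global[0]);

    MPI_Finalize();
    return 0;
}
```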

Technical Details

  • Architecture Core: The flexible programmable switch architecture relies on PsPIN, which serves as the processing element within the network fabric.
  • CPU/ISA: PsPIN is based on the RISC-V architecture, leveraging its flexibility and extensibility.
  • Programming Model: The system utilizes the sPIN (streaming Processing In the Network) programming model, designed to process packet streams efficiently as they flow through the switch.
  • Methodology: The work involves designing, modeling, and analyzing distinct aggregation algorithms specifically optimized for deployment on the PsPIN RISC-V based programmable switch.
  • Goal of Offloading: By offloading the aggregation (reduction) function to the switch, Flare minimizes data movement to and from host CPUs and reduces overall communication latency; a handler-style sketch of the idea follows this list.
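To make the handler-based processing model concrete, the plain-C sketch below imitates how a per-packet handler could fold an incoming payload into a per-chunk accumulator held in switch memory. All names here (agg_state_t, packet_handler, the chunk size, and the child count) are hypothetical assumptions for illustration; they are not the actual sPIN/PsPIN API or Flare's implementation.

```c
/* Hypothetical per-packet aggregation handler, sketched in plain C.
 * In a real deployment the handler would run on a PsPIN core as each
 * packet of an allreduce message streams through the switch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_ELEMS  8   /* elements carried by one packet (assumed) */
#define NUM_CHILDREN 2   /* ranks/ports feeding this switch (assumed) */

typedef struct {
    float acc[CHUNK_ELEMS];  /* running element-wise sum for this chunk */
    int   arrived;           /* how many children have contributed */
} agg_state_t;

/* Called once per arriving packet; returns 1 when all contributions for
 * the chunk have been seen and the result can be forwarded. */
static int packet_handler(agg_state_t *st, const float *payload, int n) {
    for (int i = 0; i < n; i++)
        st->acc[i] += payload[i];          /* in-network reduction step */
    return ++st->arrived == NUM_CHILDREN;  /* chunk complete? */
}

int main(void) {
    agg_state_t st;
    memset(&st, 0, sizeof st);

    /* Simulate two packets (one per child) carrying gradient chunks. */
    float child0[CHUNK_ELEMS] = {1, 1, 1, 1, 1, 1, 1, 1};
    float child1[CHUNK_ELEMS] = {2, 2, 2, 2, 2, 2, 2, 2};

    packet_handler(&st, child0, CHUNK_ELEMS);
    if (packet_handler(&st, child1, CHUNK_ELEMS))
        printf("chunk reduced, acc[0] = %f\n", st.acc[0]);  /* prints 3.0 */

    return 0;
}
```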

Implications

  • RISC-V Ecosystem Expansion: This work showcases a vital application for RISC-V in the high-performance networking domain, specifically for in-network computing (INC) and hardware accelerators. The use of PsPIN validates RISC-V's role as a viable and customizable instruction set for specialized network processing units.
  • HPC and AI Acceleration: Since allreduce is fundamental to large-scale machine learning training and distributed simulation, Flare provides a necessary architectural evolution. The ability to program custom aggregations directly in the network fabric allows organizations to optimize performance for complex or unusual data formats (e.g., custom floating-point types or sparse gradients); a sparse-reduction sketch follows this list.
  • Enabling Software-Defined Networks (SDN): By using a programmable RISC-V core, Flare contributes to the movement toward more flexible, software-defined network acceleration, allowing network behavior and communication primitives to be updated and customized post-deployment, rather than being fixed by proprietary ASIC designs.
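As an illustration of the kind of custom operator such programmability permits, the sketch below reduces two sparse gradients stored as index-sorted (index, value) pairs, summing values that share an index. The representation and function names are assumptions made for this example and are not taken from the Flare paper.

```c
/* Illustrative sparse-gradient reduction: merge two sparse vectors,
 * each stored as index-sorted (index, value) pairs, summing values
 * with equal indices. Types and names are assumed for this sketch. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t idx;
    float    val;
} sparse_elem_t;

/* Merge a[0..na) and b[0..nb) into out; returns the output length.
 * Inputs must be sorted by idx, as a streaming handler would require. */
static int sparse_reduce(const sparse_elem_t *a, int na,
                         const sparse_elem_t *b, int nb,
                         sparse_elem_t *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i].idx == b[j].idx) {
            out[k].idx = a[i].idx;
            out[k++].val = a[i++].val + b[j++].val;  /* shared index: sum */
        } else if (a[i].idx < b[j].idx) {
            out[k++] = a[i++];
        } else {
            out[k++] = b[j++];
        }
    }
    while (i < na) out[k++] = a[i++];   /* copy any leftovers */
    while (j < nb) out[k++] = b[j++];
    return k;
}

int main(void) {
    sparse_elem_t a[] = {{1, 0.5f}, {4, 1.0f}, {7, 2.0f}};
    sparse_elem_t b[] = {{4, 3.0f}, {9, 1.5f}};
    sparse_elem_t out[8];

    int n = sparse_reduce(a, 3, b, 2, out);
    for (int k = 0; k < n; k++)
        printf("(%u, %.1f)\n", (unsigned)out[k].idx, out[k].val);
    return 0;
}
```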
