Octopus: A Heterogeneous In-network Computing Accelerator Enabling Deep Learning for Network

Abstract

The Octopus accelerator addresses the challenge of deploying Deep Learning (DL) models directly onto programmable in-network computing devices, which typically lack the necessary processing power and generality. It employs a heterogeneous architecture combining a dedicated feature extractor with collaborative vector accelerators and a systolic array, all governed by a RISC-V core. This design achieves high-throughput performance validated on an FPGA, including 31 Mpkt/s feature extraction and 207 ns packet-based computing latency.

Report

Key Highlights

  • Novel Architecture: Proposes Octopus, a specialized heterogeneous in-network computing accelerator designed specifically to run Deep Learning (DL) models on the network data plane.
  • Overcomes Key Limitations: Tackles the constraints in computing power, task granularity, and model generality faced by existing in-network devices.
  • High Throughput: Demonstrated strong performance, including 31 Mpkt/s feature extraction and a 207 ns packet-based computing latency.
  • Heterogeneous Collaboration: The core computing power comes from a Vector Accelerator and a Systolic Array working in collaboration, covering both low-latency packet-based tasks and high-throughput flow-based tasks.

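The split described above pairs a systolic array for throughput-oriented matrix work with a vector unit for low-latency per-packet operations. A minimal software model of an output-stationary systolic array's multiply-accumulate schedule illustrates the matrix side (the dataflow and sizes here are illustrative assumptions, not details from the paper):

```python
# Software model of an output-stationary systolic array computing
# C = A @ B. Each (i, j) processing element keeps one accumulator;
# in hardware the operands of A and B are skewed and streamed past
# the PEs, one multiply-accumulate per PE per cycle.
def systolic_matmul(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]  # per-PE accumulators
    for t in range(k):               # cycle-by-cycle MAC schedule
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

The output-stationary choice keeps partial sums local to each PE, which is what makes such arrays attractive for the dense layers of DL inference.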
Technical Details

  • Core Architecture: Heterogeneous In-Network Computing (INC) design.
  • Primary Components:
    • Feature Extractor: Designed for fast and efficient initial data processing.
    • Vector Accelerator & Systolic Array: Work collaboratively to provide general-purpose, low-latency, high-throughput computing for both packet-based and flow-based tasks.
    • RISC-V Core: Handles global control.
    • Memory: Utilizes an on-chip memory fabric for storage and connectivity.
  • Implementation Platform: The Octopus accelerator design was implemented and validated on an FPGA.
  • Measured Performance:
    • Feature Extraction: 31 Mpkt/s
    • Packet-Based Computing Latency: 207 ns
    • Flow-Based Computing Throughput: 90 kflow/s
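
The feature extractor's job is to turn raw header bytes into a fixed-width feature vector that the downstream compute units can consume. A sketch of that step is shown below; the chosen fields (IP total length, protocol, transport ports) are common traffic-classification features and are assumptions for illustration, not the paper's actual feature set:

```python
import struct

# Illustrative per-packet feature extraction: parse a few IPv4/UDP
# header fields into a fixed-length feature vector. Field choice is
# an assumption, not taken from the Octopus paper.
def extract_features(pkt: bytes):
    total_len, = struct.unpack_from("!H", pkt, 2)   # IPv4 total length
    proto = pkt[9]                                  # protocol number
    ihl = (pkt[0] & 0x0F) * 4                       # IPv4 header bytes
    src_port, dst_port = struct.unpack_from("!HH", pkt, ihl)
    return [total_len, proto, src_port, dst_port]

# Hand-built IPv4 header (20 bytes, proto 17 = UDP) plus UDP header:
ip = bytes([0x45, 0, 0, 28, 0, 0, 0, 0, 64, 17, 0, 0,
            10, 0, 0, 1, 10, 0, 0, 2])
udp = struct.pack("!HHHH", 1234, 53, 8, 0)
print(extract_features(ip + udp))  # [28, 17, 1234, 53]
```

In hardware this parsing happens at line rate, which is what the 31 Mpkt/s extraction figure measures.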

Implications

  • Advancing Intelligent Networks: Octopus enables true deployment of complex DL models (for tasks like traffic classification or intrusion detection) directly within the high-speed data plane, reducing reliance on external CPUs or servers.
  • Validation of RISC-V in INC: The explicit use of a RISC-V core for global control reinforces its growing role as the flexible, open control-plane ISA for highly specialized, domain-specific heterogeneous accelerators in networking hardware.
  • Accelerating Data Plane Customization: This work validates the approach of combining customized processing units (Feature Extractors, Systolic Arrays) with a standard, open control processor (RISC-V) to achieve specialized application performance previously unattainable in general-purpose network processors.
  • Future Hardware Trend: Octopus contributes to the paradigm shift toward integrating advanced AI inference capabilities directly into network switches and SmartNICs.