The nanoPU: Redesigning the CPU-Network Interface to Minimize RPC Tail Latency
Abstract
The nanoPU is a new CPU architecture optimized for low-latency network services, specifically designed to minimize tail latency for Remote Procedure Calls (RPCs). By fundamentally redesigning the CPU-network interface, it achieves a wire-to-wire RPC response latency of just 65ns, a 13x improvement over the state of the art. The redesign bypasses the cache and memory hierarchy, placing arriving network messages directly into the CPU register file, and offloads critical networking and scheduling functions to hardware.
Report
Key Highlights
- Extreme Latency Reduction: The nanoPU achieves a wire-to-wire RPC latency of just 65ns, a 13x speedup over current state-of-the-art systems.
- Cache Bypass Mechanism: The central innovation is removing memory-hierarchy bottlenecks: arriving network messages bypass the cache and main memory entirely and are written directly into the CPU register file (see the sketch after this list).
- Hardware Offloading: Critical software functions, including reliable network transport, congestion control, core selection, and thread scheduling, are moved into specialized hardware.
- Tail Latency Guarantee: The architecture includes a dedicated mechanism that bounds the tail latency experienced by high-priority applications.
- RISC-V Foundation: The prototype nanoPU is built upon a modified RISC-V CPU architecture.
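
To make the register-file interface concrete, here is a minimal C sketch of what a receive-and-reply loop could look like on such a CPU. The reserved register numbers (x30/x31) and the intrinsics are illustrative assumptions, not the actual nanoPU ISA; the point is that message words move between the wire and the register file with no DMA descriptors, ring buffers, or system calls on the critical path.

```c
#include <stdint.h>

/* Hypothetical intrinsics modeling a register-file network interface.
 * The nanoPU reserves dedicated registers for network RX/TX; the
 * specific registers used here (x30/x31) are illustrative assumptions,
 * not the real ISA. Compile for a riscv64 target. */
static inline uint64_t net_rx(void) {
    uint64_t w;
    asm volatile ("mv %0, x30" : "=r"(w));  /* read next 8B of arriving message */
    return w;
}

static inline void net_tx(uint64_t w) {
    asm volatile ("mv x31, %0" : : "r"(w)); /* append next 8B to the reply */
}

/* Echo RPC handler: every word arrives in, and departs from, the
 * register file. No DMA, memcpy, cache traffic, or kernel crossing
 * sits between the wire and the application. */
void echo_rpc(uint64_t words_in_msg) {
    for (uint64_t i = 0; i < words_in_msg; i++)
        net_tx(net_rx());
}
```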
Technical Details
- Base Architecture: Modified RISC-V CPU.
- Data Flow Optimization: The network data path is shortened by injecting messages directly from the network interface into the register file, avoiding the usual DMA transfers, memory copies, cache pollution, and OS overheads.
- Offloaded Functions: Key networking-stack components implemented in hardware include (a software model of the dispatch logic follows this list):
- Reliable Network Transport.
- Congestion Control.
- Application Core Selection.
- Thread/Packet Scheduling.
- Evaluation Environment: Performance was validated using cycle-accurate FireSim simulations of a 324-core system running on AWS FPGAs.
- Benchmark Applications: Tested with real-world, latency-sensitive applications, specifically MICA (an in-memory key-value store) and chain replication.
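
As one example of what "core selection in hardware" means, below is a small software model of a bounded shortest-queue dispatcher, the kind of per-message decision the nanoPU moves from the OS into the NIC pipeline. The policy, queue bound, and core count are illustrative assumptions, not the paper's exact algorithm; in hardware this decision completes in a handful of cycles per message rather than via an OS scheduler invocation.

```c
#include <stdint.h>

#define NCORES 4
#define QBOUND 2  /* illustrative: at most 2 pending messages per core */

/* Software model of hardware core selection: dispatch each arriving
 * message to the least-loaded core, holding it back if every core is
 * already at the queue bound. Constants and policy are assumptions
 * for illustration only. */
typedef struct {
    uint32_t pending[NCORES];  /* messages currently queued at each core */
} core_selector;

int select_core(core_selector *cs) {
    int best = -1;
    uint32_t best_len = QBOUND;    /* only cores below the bound qualify */
    for (int c = 0; c < NCORES; c++) {
        if (cs->pending[c] < best_len) {
            best_len = cs->pending[c];
            best = c;
        }
    }
    if (best >= 0)
        cs->pending[best]++;       /* message dispatched to this core */
    return best;                   /* -1: all cores at bound; message waits */
}
```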
Implications
- Advancing RISC-V Specialization: The nanoPU showcases the power and flexibility of the open RISC-V instruction set architecture, proving its viability as a customizable base for creating highly specialized, domain-specific computing units optimized for data center networking and low-latency RPC fabrics.
- Future Data Center Design: By dramatically reducing the overhead associated with processing network traffic, the nanoPU provides a blueprint for next-generation systems where microseconds, or even nanoseconds, determine service quality and competitiveness (e.g., in cloud microservices and high-frequency trading).
- Shift to Hardware-Software Co-design: This work reinforces the trend that maximizing performance in the face of memory wall limitations requires deeply integrating network processing and scheduling logic directly into the CPU/chip fabric, pushing the boundaries of what is traditionally handled by the OS kernel.
- QoS and Predictability: The mechanism for bounding tail latency is crucial for maintaining Service Level Objectives (SLOs) in complex distributed systems, ensuring predictable performance for critical applications (see the sketch below).
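
To illustrate how a hardware scheduler can enforce such a bound, here is a sketch of a per-cycle check that forces a context switch when a high-priority message has waited too long. The state layout, field names, and the 1µs bound are assumptions for illustration; the paper's mechanism is implemented in hardware logic, not software.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative bound; a real implementation would make this configurable. */
#define TAIL_BOUND_NS 1000

/* Sketch of the state a hardware thread scheduler might track for the
 * highest-priority message queue. All names here are hypothetical. */
typedef struct {
    bool     hp_msg_waiting;    /* a high-priority message is queued    */
    bool     hp_thread_running; /* its handler already owns the core    */
    uint64_t hp_arrival_ns;     /* arrival time of the waiting message  */
} sched_view;

/* Evaluated continuously in hardware (modeled here as a function):
 * preempt the currently running thread once the head-of-line
 * high-priority message has waited longer than the configured bound,
 * keeping worst-case scheduling delay, and hence tail latency, bounded. */
bool must_preempt(const sched_view *s, uint64_t now_ns) {
    return s->hp_msg_waiting &&
           !s->hp_thread_running &&
           (now_ns - s->hp_arrival_ns) >= TAIL_BOUND_NS;
}
```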
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.