Bare-Metal RISC-V + NVDLA SoC for Efficient Deep Learning Inference

Abstract

This paper introduces a System-on-Chip (SoC) architecture that tightly couples a 32-bit Codasip uRISC_V core with the open-source NVDLA for efficient deep learning inference on edge devices. The core innovation is a bare-metal toolflow that generates assembly application code, bypassing operating-system overhead to maximize execution speed and storage efficiency. Benchmarked on an AMD ZCU102 FPGA using the NVDLA-small configuration at a 100 MHz clock frequency, the system achieved an inference time of 16.2 ms for ResNet-18.

Report

Key Highlights

  • Presents a novel SoC architecture combining a RISC-V core with the open-source NVDLA for high-efficiency deep learning inference.
  • Utilizes a bare-metal toolflow that generates optimized assembly code, bypassing the high overhead associated with traditional operating systems to achieve greater execution speed.
  • The tightly coupled hardware and bare-metal software methodology significantly improves storage efficiency, specifically targeting resource-constrained edge computing solutions.
  • The system was successfully implemented and evaluated on an AMD ZCU102 FPGA using the NVDLA-small configuration.

Technical Details

  • Core Architecture: A 32-bit, 4-stage pipelined RISC-V core, specifically the Codasip uRISC_V, is used as the control processor.
  • Accelerator: The open-source NVIDIA Deep Learning Accelerator (NVDLA) is tightly coupled to the CPU.
  • Software Flow: Model acceleration offloading is handled by bare-metal application code generated directly in assembly, circumventing the need for an operating system.
  • Evaluation Platform: AMD ZCU102 FPGA board.
  • Clock Frequency: System evaluation performed at 100 MHz.
  • Performance Benchmarks (Inference Time):
    • LeNet-5: 4.8 ms
    • ResNet-18: 16.2 ms
    • ResNet-50: 1.1 s

Implications

  • Optimization for Edge AI: The bare-metal approach establishes a standard for minimizing latency and maximizing determinism in resource-constrained edge AI, addressing critical real-time performance requirements.
  • RISC-V and Open Hardware Validation: The project successfully demonstrates the feasibility and performance benefits of integrating two key open-source hardware components—the RISC-V ISA and the NVDLA IP—into a competitive, domain-specific accelerator.
  • Low-Overhead Solution: By proving that complex deep learning models can be executed efficiently without OS interference, this work encourages the development of highly specialized, low-power RISC-V based accelerators.