Bare-Metal RISC-V + NVDLA SoC for Efficient Deep Learning Inference
Abstract
This paper introduces a novel System-on-Chip (SoC) architecture that tightly couples a 32-bit Codasip uRISC_V core with the open-source NVDLA for efficient deep learning inference on edge devices. The core innovation is a bare-metal toolflow that generates application code directly in assembly, bypassing operating system overhead to maximize execution speed and storage efficiency. Benchmarked on an AMD ZCU102 FPGA using the NVDLA-small configuration at a clock frequency of 100 MHz, the system achieved an inference time of 16.2 ms for ResNet-18.
Report
Key Highlights
- Presents a novel SoC architecture combining a RISC-V core with the open-source NVDLA for high-efficiency deep learning inference.
- Utilizes a bare-metal toolflow that generates optimized assembly code, bypassing the high overhead associated with traditional operating systems to achieve greater execution speed.
- The tightly coupled hardware and bare-metal software methodology significantly improves storage efficiency, specifically targeting resource-constrained edge computing solutions.
- The system was successfully implemented and evaluated on an AMD ZCU102 FPGA using the NVDLA-small configuration.
Technical Details
- Core Architecture: A 32-bit, 4-stage pipelined RISC-V core, specifically the Codasip uRISC_V, is used as the control processor.
- Accelerator: The open-source NVIDIA Deep Learning Accelerator (NVDLA) is tightly coupled to the CPU.
- Software Flow: Offloading of model inference to the accelerator is handled by bare-metal application code generated directly in assembly, eliminating the need for an operating system.
- Evaluation Platform: AMD ZCU102 FPGA board.
- Clock Frequency: System evaluation performed at 100 MHz.
- Performance Benchmarks (Inference Time):
  - LeNet-5: 4.8 ms
  - ResNet-18: 16.2 ms
  - ResNet-50: 1.1 s
Implications
- Optimization for Edge AI: The bare-metal approach demonstrates an effective way to minimize latency and maximize determinism in resource-constrained edge AI, addressing critical real-time performance requirements.
- RISC-V and Open Hardware Validation: The project successfully demonstrates the feasibility and performance benefits of integrating two key open-source hardware components—the RISC-V ISA and the NVDLA IP—into a competitive, domain-specific accelerator.
- Low-Overhead Solution: By proving that complex deep learning models can be executed efficiently without OS interference, this work encourages the development of highly specialized, low-power RISC-V based accelerators.