Hardware/Software Co-Design of RISC-V Extensions for Accelerating Sparse DNNs on FPGAs

Abstract

This paper proposes novel RISC-V instruction set extensions, designed through hardware/software co-design, to efficiently accelerate sparse Deep Neural Networks (DNNs) on FPGAs. The design introduces two custom functional units: one leveraging reserved bits in weight blocks to skip zero-weight computations in semi-structured sparse models, and another using a variable-cycle MAC unit to exploit unstructured sparsity. The combined accelerator achieves speedups of up to 5x on standard TinyML applications while requiring minimal additional FPGA resources.

Report

Key Highlights

  • Acceleration Target: Deep Neural Networks (DNNs) featuring semi-structured and unstructured sparsity.
  • Platform: RISC-V custom Instruction Set Extensions (ISEs) implemented via hardware/software co-design on FPGAs.
  • Performance Gain: The combined design provides a maximum speedup factor of 5x compared to baseline methods.
  • Resource Efficiency: The designs consume only a small amount of additional FPGA resources, making the solution viable even on small, resource-constrained FPGAs.
  • Benchmarking: Validated using standard TinyML applications, including keyword spotting, image classification, and person detection.

Technical Details

  • Core Methodology: Hardware/Software Co-Design, exploiting RISC-V customizability to integrate DNN model characteristics directly into the architecture.
  • Semi-Structured Sparsity Acceleration: This approach exploits the fine-grained, bit-level configurability of FPGAs. The design reserves specific bits within a block of DNN weights to encode sparsity information about succeeding blocks, so the custom functional unit knows one block ahead which computations are zero and can skip them proactively (a software model of this idea follows the list).
  • Unstructured Sparsity Acceleration: A dedicated variable-cycle, sequential Multiply-and-Accumulate (MAC) unit is proposed. The unit executes only as many multiplications as there are non-zero weights, so even fully random sparsity patterns avoid wasted computation cycles (see the second sketch after the list).
  • Individual Performance: The implementation dedicated to unstructured sparsity yields speedups of up to 3x, while the semi-structured design provides speedups of up to 4x.
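
To make the semi-structured scheme concrete, the following C sketch models one plausible encoding: each 32-bit weight word packs four 7-bit weights plus four reserved bits that form a non-zero mask for the next word. The field widths, the 4-lane block size, and the function names are illustrative assumptions; this summary does not specify the paper's exact encoding.

    /* Software model of the semi-structured scheme (assumed encoding):
     * a 32-bit weight word = four 7-bit signed weights (low 28 bits)
     * plus a 4-bit non-zero mask for the NEXT word (top 4 bits). */
    #include <stdint.h>

    #define LANES 4

    /* Mask describing which lanes of the succeeding block are non-zero. */
    static uint8_t next_block_mask(uint32_t word) {
        return (uint8_t)(word >> 28);
    }

    /* Extract lane i's 7-bit weight and sign-extend it to 8 bits. */
    static int8_t lane_weight(uint32_t word, int i) {
        int8_t w = (int8_t)((word >> (7 * i)) & 0x7F);
        return (w & 0x40) ? (int8_t)(w | 0x80) : w;
    }

    /* Dot product over packed weight words. Lanes flagged as zero are never
     * multiplied, and an all-zero block is skipped outright, modelling how
     * the functional unit can act one block ahead. */
    int32_t sparse_dot(const uint32_t *wwords, const int8_t *acts, int nblocks) {
        int32_t acc = 0;
        uint8_t mask = 0xF;                 /* assume the first block is dense */
        for (int b = 0; b < nblocks; b++) {
            uint8_t next = next_block_mask(wwords[b]);
            if (mask) {
                for (int i = 0; i < LANES; i++)
                    if (mask & (1u << i))
                        acc += lane_weight(wwords[b], i) * acts[b * LANES + i];
            }
            mask = next;                    /* decoded one block in advance */
        }
        return acc;
    }

Decoding the mask one block early is what would let a hardware pipeline squash the following block's multiply cycles before they issue, rather than detecting zeros after the fact.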
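
The variable-cycle MAC for unstructured sparsity can be modelled the same way: iterate over a compressed list of non-zero weights so the loop performs exactly nnz multiplications instead of one per vector element. The (value, index) pair format below is an assumption for illustration, not the paper's storage layout.

    /* Model of the variable-cycle MAC: one multiply per non-zero weight. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        int8_t   value;   /* non-zero weight */
        uint16_t index;   /* its position in the dense activation vector */
    } nz_weight_t;

    int32_t sparse_mac(const nz_weight_t *nz, size_t nnz, const int8_t *acts) {
        int32_t acc = 0;
        for (size_t k = 0; k < nnz; k++)   /* nnz iterations, not vector length */
            acc += (int32_t)nz[k].value * acts[nz[k].index];
        return acc;
    }

For a layer with 90% zero weights, this cuts multiply cycles by roughly 10x; memory traffic and control overhead presumably account for the gap to the reported end-to-end speedup of up to 3x.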

Implications

  • RISC-V Customization Maturity: This work exemplifies the powerful advantages of RISC-V's open instruction set, showcasing how custom extensions can be tailored specifically for computationally intensive, domain-specific tasks like sparse DNN inference.
  • Efficiency in Edge Computing: By achieving significant acceleration (up to 5x) with minimal FPGA resource overhead, this methodology is highly relevant for power- and resource-constrained embedded systems and TinyML applications.
  • Future Sparse Acceleration: The proposed methods for efficiently handling both semi-structured and unstructured pruning provide a new direction for accelerating compressed neural networks, potentially influencing future hardware accelerator standards and instruction sets for AI at the edge.