End-to-end 100-TOPS/W Inference With Analog In-Memory Computing: Are We There Yet?
Abstract
This paper addresses the integration challenges of Analog In-Memory Acceleration (IMA) within digital systems by proposing a heterogeneous architecture coupling 8 RISC-V cores with an IMA in a shared-memory cluster. Analysis of the MobileNetV2 bottleneck layer revealed that while the IMA excels at pointwise convolutions, inefficient parameter mapping severely penalizes depthwise layers. The proposed hybrid solution, which splits computation between the IMA and the RISC-V cores, achieves a 3x speed-up over pure-software execution while using 50% less silicon area than an all-IMA configuration of similar performance.
Report
End-to-end 100-TOPS/W Inference With Analog In-Memory Computing
Key Highlights
- Architectural Innovation: Introduction of a heterogeneous shared-memory cluster coupling 8 RISC-V cores with Analog In-Memory Acceleration (IMA).
- Efficiency Goal: Analyzing pathways toward 100-TOPS/W inference efficiency.
- Performance Result: Achieves a 3x speed-up over pure-software execution for the complex MobileNetV2 bottleneck layer.
- Area Optimization: The final hybrid solution saves 50% of the silicon area compared to an all-IMA implementation of equivalent performance.
- Key Finding: IMA provides significant speed-ups for pointwise layers, but inefficient parameter mapping makes it unsuitable for depthwise layers, necessitating a hybrid approach (quantified in the sketch after this list).
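Two back-of-the-envelope numbers make the key finding concrete. The minimal Python sketch below is our own illustration, not a figure from the paper: the 96-channel count is an assumed MobileNetV2 expansion width, and the block-diagonal layout is the standard way a depthwise convolution lands on a crossbar. It shows that a 100-TOPS/W target implies a budget of roughly 10 fJ per operation, and that a depthwise layer leaves all but 1/C of the crossbar idle.

```python
# Back-of-the-envelope illustration (our numbers, not the paper's).

def energy_budget_fj(tops_per_watt: float) -> float:
    """Energy allowed per operation, in femtojoules, at a given TOPS/W."""
    return 1e15 / (tops_per_watt * 1e12)  # J/op converted to fJ/op

def crossbar_utilization(kernel: int, channels: int, pointwise: bool) -> float:
    """Fraction of crossbar cells holding useful weights.

    A 1x1 pointwise conv is a dense (c_in x c_out) matrix: full utilization.
    A depthwise KxK conv maps block-diagonally -- each of the C output
    columns reads only its own K*K input rows -- so utilization is 1/C.
    """
    if pointwise:
        return 1.0
    rows = kernel * kernel * channels    # im2col'd input rows
    cols = channels                      # one output column per channel
    useful = kernel * kernel * channels  # one K*K block per channel
    return useful / (rows * cols)

print(f"100 TOPS/W budget  : {energy_budget_fj(100):.0f} fJ/op")
print(f"pointwise 1x1      : {crossbar_utilization(1, 96, True):.0%} utilized")
print(f"depthwise 3x3, C=96: {crossbar_utilization(3, 96, False):.2%} utilized")
```

At 96 channels, crossbar utilization for the depthwise layer falls to roughly 1%, which is why mapping it onto the IMA wastes area and why offloading it to the digital cores pays off.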
Technical Details
- Architecture: Heterogeneous shared-memory cluster architecture.
- Core Components: 8 RISC-V processor cores paired with an IMA block.
- Target Application: Deep Neural Network (DNN) inference.
- Use Case Analyzed: The MobileNetV2 bottleneck layer, an inverted-residual block that chains a pointwise (1x1) expansion, a depthwise (3x3) convolution, and a pointwise (1x1) projection.
- Integration Challenge: Efficiently integrating the high-speed analog IMA block into the surrounding digital system's dataflow.
- Optimal Strategy (Hybrid): Pointwise convolutions are executed on the specialized IMA hardware, leveraging its speed and efficiency, while depthwise convolutions are offloaded to the flexible digital RISC-V cluster cores (see the sketch after this list).
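A minimal sketch of this split is shown below. It is an illustration under assumed names, not the paper's API: `ima_matmul` is a hypothetical stand-in for the analog crossbar offload (emulated with a plain NumPy matmul so the example runs anywhere), the depthwise step is the naive loop the RISC-V cores would execute, and activations are simplified to plain ReLU.

```python
import numpy as np

def ima_matmul(x, w):
    """1x1 pointwise conv as a dense matmul -- the shape the IMA maps natively."""
    return x @ w  # x: (pixels, c_in), w: (c_in, c_out)

def cpu_depthwise3x3(x, w, height, width):
    """Naive depthwise 3x3 conv of the kind kept on the digital cores."""
    channels = x.shape[1]
    img = x.reshape(height, width, channels)
    pad = np.pad(img, ((1, 1), (1, 1), (0, 0)))   # zero padding, stride 1
    out = np.zeros_like(img)
    for i in range(3):                            # accumulate the 9 taps
        for j in range(3):
            out += pad[i:i + height, j:j + width, :] * w[i, j, :]
    return out.reshape(-1, channels)

def bottleneck(x, w_expand, w_dw, w_project, height, width):
    """MobileNetV2-style bottleneck with the hybrid IMA/CPU partitioning."""
    x = np.maximum(ima_matmul(x, w_expand), 0.0)                   # 1x1 expand  -> IMA
    x = np.maximum(cpu_depthwise3x3(x, w_dw, height, width), 0.0)  # 3x3 dw      -> cores
    return ima_matmul(x, w_project)                                # 1x1 project -> IMA

# Example shapes: 8x8 feature map, 16 input channels, expansion factor 6.
h = w = 8
x = np.random.randn(h * w, 16)
y = bottleneck(x, np.random.randn(16, 96), np.random.randn(3, 3, 96),
               np.random.randn(96, 16), h, w)
print(y.shape)  # (64, 16)
```

The design point is that only the dense 1x1 matmuls, where the crossbar is fully utilized, cross the analog boundary; the sliding-window depthwise step stays in the flexible digital domain.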
Implications
- Validation of RISC-V Heterogeneity: This work reinforces the RISC-V ecosystem's role as an open platform well suited to integrating highly specialized, experimental accelerators like IMA.
- Shifting AI Accelerator Design: It highlights that future high-efficiency AI systems (aiming for 100 TOPS/W) will not be monolithic. Optimal performance requires intelligent task partitioning, where the RISC-V cores handle the computationally awkward layers (like depthwise) while the analog accelerators handle the matrix-intensive layers (like pointwise).
- Efficiency Benchmark: The paper provides critical trade-off analyses (throughput vs. area) necessary for system architects designing next-generation, ultra-low-power edge computing devices that rely on novel memory technologies.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.