Maestro: A 302 GFLOPS/W and 19.8GFLOPS RISC-V Vector-Tensor Architecture for Wearable Ultrasound Edge Computing

Abstract

Maestro is a RISC-V System-on-Chip (SoC) that combines a unified Vector-Tensor Unit (VTU) with specialized FFT accelerators for efficient on-device signal processing in Wearable Ultrasound (WUS) edge computing. Fabricated in low-cost 65 nm CMOS technology, the VTU reaches a peak efficiency of 302 GFLOPS/W at FP16. Processing on the device avoids the latency and privacy issues of remote offload. The architecture provides a 5x speedup over state-of-the-art counterparts while consuming only 12 mW for the complete ML-based pipeline.

Report

Structured Report: Maestro RISC-V Vector-Tensor Architecture

Key Highlights

  • Record Efficiency: The Vector-Tensor Unit (VTU) achieves a peak energy efficiency of 302 GFLOPS/W at FP16 precision.
  • High Performance: Maestro delivers a peak performance of 19.8 GFLOPS (FP16) from the VTU, and 3.6 GFLOPS (FP16) from the FFT accelerator.
  • Application Speedup: The SoC provides a 5x speedup compared to a state-of-the-art SoC with a similar mission profile.
  • Ultra-Low Power: The entire system consumes only 12 mW while running the ultrasound (US) channel preprocessing and ML-based postprocessing pipeline, which requires only 2.5 mJ of energy per task.
  • Target Domain: Specifically designed to enable true edge computing for Wearable Ultrasound (WUS) devices, removing the need for high-latency, privacy-compromising remote data offload.
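
The highlighted figures can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes that the VTU reaches its peak throughput and its peak efficiency at the same operating point, and that the 12 mW pipeline power is a constant average over one task. The summary confirms neither assumption, and the function names are hypothetical.

```python
def implied_power_mw(gflops: float, gflops_per_watt: float) -> float:
    """Power in mW implied by dividing throughput by efficiency.

    Assumption: both figures describe the same operating point.
    """
    return gflops / gflops_per_watt * 1000.0


def task_duration_ms(energy_mj: float, power_mw: float) -> float:
    """Task duration in ms implied by energy per task and average power.

    Assumption: power is constant over the whole task.
    """
    return energy_mj / power_mw * 1000.0


if __name__ == "__main__":
    # Figures quoted in the highlights: 19.8 GFLOPS, 302 GFLOPS/W, 2.5 mJ, 12 mW.
    print(f"Implied VTU power at peak: {implied_power_mw(19.8, 302.0):.1f} mW")
    print(f"Implied time per task: {task_duration_ms(2.5, 12.0):.0f} ms")
```

Under these assumptions the VTU would draw roughly 65.6 mW at peak, and one pipeline task would take roughly 208 ms. The peak-power estimate is well above the 12 mW pipeline figure, which suggests the two numbers describe different operating conditions.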

Technical Details

  • Architecture: RISC-V System-on-Chip (SoC) integrating specialized accelerators.
  • Core Accelerators: Includes a unified Vector-Tensor Unit (VTU) for high-throughput AI calculations and memory-coupled Fast Fourier Transform (FFT) accelerators for signal processing.
  • Fabrication: Utilizes low-cost, mature TSMC 65nm CMOS technology.
  • Precision Support: Supports multi-precision floating-point operations (FP16 and FP32).
  • FFT Accelerator Performance: Achieves a peak efficiency of 60.6 GFLOPS/W.
  • Workload Metrics (CNN): When running Convolutional Neural Network (CNN) workloads for gesture recognition, the architecture achieves 19.52 GFLOPS at 298.03 GFLOPS/W.
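
To put the workload numbers in context, the following sketch compares the CNN figures with the VTU peak figures and estimates the FFT accelerator's power. It is a rough consistency check, not a measurement: it assumes throughput and efficiency are reported at the same operating point, and the helper names are hypothetical.

```python
def fraction_of_peak(measured: float, peak: float) -> float:
    """Ratio of a measured figure to the corresponding peak figure."""
    return measured / peak


def implied_power_mw(gflops: float, gflops_per_watt: float) -> float:
    """Power in mW implied by throughput / efficiency (same operating point assumed)."""
    return gflops / gflops_per_watt * 1000.0


if __name__ == "__main__":
    # CNN workload vs. VTU peak (values quoted in this report).
    throughput_ratio = fraction_of_peak(19.52, 19.8)
    efficiency_ratio = fraction_of_peak(298.03, 302.0)
    print(f"CNN throughput vs. peak: {throughput_ratio:.1%}")
    print(f"CNN efficiency vs. peak: {efficiency_ratio:.1%}")

    # FFT accelerator: 3.6 GFLOPS at 60.6 GFLOPS/W.
    print(f"Implied FFT power: {implied_power_mw(3.6, 60.6):.1f} mW")
```

By this estimate the CNN workload runs at about 98.6% of peak throughput and 98.7% of peak efficiency, and the FFT accelerator would draw roughly 59.4 mW at its peak operating point.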

Implications

  • Validating RISC-V in Edge AI: Maestro strongly validates the use of customized RISC-V architectures in highly specialized, ultra-low-power AI/signal processing applications, demonstrating that RISC-V can compete with proprietary IPs in the critical wearable and medical tech sectors.
  • Enabling Autonomous Wearables: By combining high peak throughput (19.8 GFLOPS) with a low pipeline power draw (12 mW), Maestro makes complex signal and ML processing feasible directly at the source, enabling autonomous WUS devices that do not incur the latency of remote offload.
  • Optimized Hybrid Processing: The integration of a unified VTU alongside specialized FFT hardware demonstrates an efficient path for complex real-time edge pipelines that require both heavy signal processing and subsequent deep learning analysis.
  • Cost-Effective High Performance: Utilizing mature, low-cost TSMC 65nm technology to achieve record-setting efficiency metrics shows that high-performance edge AI does not necessarily require the latest, most expensive fabrication nodes.