PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures
Abstract
Processing-Using-DRAM (PUD) architectures impose strict data-layout constraints: source and destination operands must reside in the same DRAM subarray and be aligned to DRAM row boundaries, requirements that standard OS memory allocators cannot meet. This paper proposes PUMA, a lazy, kernel-level data allocation routine that uses internal DRAM mapping information and huge pages to create fine-grained, aligned, and virtually contiguous memory objects suitable for PUD systems. Implemented and evaluated on a RISC-V emulation platform that supports RowClone-style row copies and Ambit-style Boolean operations, PUMA enables successful in-memory execution and significantly outperforms baseline memory allocators.
Report
Key Highlights
- Problem Addressed: Standard operating system memory allocation routines (e.g., malloc, posix_memalign) fail to satisfy the restrictive data layout and alignment demands of Processing-Using-DRAM (PUD) architectures.
- Core Requirement: PUD architectures require source and destination operands to (i) reside within the same DRAM subarray and (ii) be aligned precisely to DRAM row boundaries (see the alignment-check sketch after this list).
- Solution: PUMA (Processing-Using-Memory Allocation) is a new lazy data allocation routine, implemented in the kernel, that influences how the OS memory allocator places data so that PUD constraints are met.
- Performance: PUMA successfully enables PUD operations and significantly outperforms baseline memory allocators across all evaluated microbenchmarks and allocation sizes.
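The alignment requirement can be made concrete with a small check routine. The C sketch below is illustrative only and not taken from the paper: the 8 KiB row size, the 512-rows-per-subarray figure, and the subarray_of() mapping are assumptions standing in for the platform's real, vendor-specific DRAM address mapping.

```c
/* Minimal sketch (assumed parameters, not from the paper): tests the two
 * PUD operand constraints for a hypothetical device with 8 KiB DRAM rows
 * and 512 rows per subarray. subarray_of() is a stand-in for the real,
 * vendor-specific physical-to-DRAM address mapping. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DRAM_ROW_BYTES    (8u * 1024u)   /* assumed DRAM row size     */
#define ROWS_PER_SUBARRAY 512u           /* assumed subarray capacity */

/* Hypothetical mapping: consecutive rows fill a subarray before moving on. */
static unsigned subarray_of(uintptr_t paddr)
{
    return (unsigned)((paddr / DRAM_ROW_BYTES) / ROWS_PER_SUBARRAY);
}

/* True if src/dst can serve as PUD operands: both row-aligned and
 * resident in the same DRAM subarray. */
static bool pud_operands_ok(uintptr_t src_paddr, uintptr_t dst_paddr)
{
    bool row_aligned   = (src_paddr % DRAM_ROW_BYTES == 0) &&
                         (dst_paddr % DRAM_ROW_BYTES == 0);
    bool same_subarray = subarray_of(src_paddr) == subarray_of(dst_paddr);
    return row_aligned && same_subarray;
}

int main(void)
{
    uintptr_t a = 0x0000;   /* row 0, row-aligned        */
    uintptr_t b = 0x2000;   /* row 1, row-aligned        */
    uintptr_t c = 0x2100;   /* inside row 1, NOT aligned */

    printf("a,b -> %s\n", pud_operands_ok(a, b) ? "PUD-eligible" : "CPU fallback");
    printf("a,c -> %s\n", pud_operands_ok(a, c) ? "PUD-eligible" : "CPU fallback");
    return 0;
}
```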
Technical Details
- PUMA Mechanism: PUMA combines the system's internal DRAM mapping information with huge pages. It splits each huge page into finer-grained allocation units that are guaranteed to be aligned to their address/size and virtually contiguous, satisfying the PUD hardware constraints (a user-space sketch of this carving idea follows this list).
- Implementation: PUMA is implemented as a kernel module.
- Emulation Environment: The system was emulated using QEMU, targeting a RISC-V machine running the Fedora 33 distribution with Linux kernel v5.9.0.
- Supported PUD Operations: The emulated PUD substrate supports row copy operations (as seen in architectures like RowClone) and Boolean operations (AND/OR/NOT, as seen in architectures like Ambit).
- Fallback Mechanism: If a given memory operation cannot be executed in the PUD substrate due to data misalignment, the operation defaults to execution on the host CPU (see the dispatch sketch below).
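As a rough illustration of the carving idea behind the PUMA Mechanism bullet above, the user-space C sketch below maps a single 2 MiB huge page and splits it into 8 KiB, row-aligned allocation units. This is not PUMA itself (the paper's mechanism runs in the kernel with access to the internal DRAM mapping); the huge-page and row sizes are assumptions, and the program only runs on a Linux host with huge pages reserved.

```c
/* User-space sketch of the huge-page carving idea (assumptions: 2 MiB
 * huge pages, 8 KiB DRAM rows). The real PUMA is a kernel module; this
 * demo only shows how one physically contiguous huge page can be split
 * into row-sized, aligned, virtually contiguous units. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE_BYTES (2u * 1024u * 1024u)  /* assumed huge-page size */
#define DRAM_ROW_BYTES  (8u * 1024u)          /* assumed DRAM row size  */

int main(void)
{
    /* Fails cleanly if no huge pages are reserved on the host
     * (e.g., via /proc/sys/vm/nr_hugepages). */
    void *huge = mmap(NULL, HUGE_PAGE_BYTES, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (huge == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    /* Because the huge page is physically contiguous, row-aligned offsets
     * within it stay row-aligned in physical memory (assuming a DRAM
     * address mapping that preserves this property). */
    size_t n_units = HUGE_PAGE_BYTES / DRAM_ROW_BYTES;
    for (size_t i = 0; i < 4 && i < n_units; i++) {
        uint8_t *unit = (uint8_t *)huge + i * DRAM_ROW_BYTES;
        printf("unit %zu at %p (offset 0x%zx)\n",
               i, (void *)unit, i * DRAM_ROW_BYTES);
    }

    munmap(huge, HUGE_PAGE_BYTES);
    return 0;
}
```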
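The fallback mechanism can be viewed as a simple dispatch: attempt the PUD substrate only when the operands satisfy the layout constraints, otherwise copy on the host CPU. In the C sketch below, pud_row_copy() is a hypothetical stand-in for the real driver call (it always reports failure here), and the same-subarray test is omitted because it would need the platform's DRAM mapping.

```c
/* Sketch of the CPU-fallback path (assumed 8 KiB rows). pud_row_copy()
 * is a placeholder for a hypothetical driver call that would trigger a
 * RowClone-style in-DRAM copy; it is stubbed out so the file compiles
 * and always falls back to the CPU. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DRAM_ROW_BYTES (8u * 1024u)   /* assumed DRAM row size */

/* Stub for the (hypothetical) PUD driver interface. */
static bool pud_row_copy(void *dst, const void *src)
{
    (void)dst; (void)src;
    return false;                      /* pretend the substrate refused */
}

/* Layout check: both operands row-aligned. The same-subarray check is
 * omitted because it requires the internal DRAM mapping. */
static bool pud_layout_ok(const void *dst, const void *src)
{
    return ((uintptr_t)dst % DRAM_ROW_BYTES == 0) &&
           ((uintptr_t)src % DRAM_ROW_BYTES == 0);
}

/* Copy one DRAM row, preferring the PUD substrate and falling back to a
 * plain CPU memcpy when the constraints are not met or the in-memory
 * copy is unavailable. */
static void copy_row(void *dst, const void *src)
{
    if (pud_layout_ok(dst, src) && pud_row_copy(dst, src))
        return;                        /* executed in DRAM  */
    memcpy(dst, src, DRAM_ROW_BYTES);  /* host-CPU fallback */
}

int main(void)
{
    /* aligned_alloc gives row-aligned virtual addresses for the demo;
     * physical row alignment is what PUMA actually has to guarantee. */
    void *src = aligned_alloc(DRAM_ROW_BYTES, DRAM_ROW_BYTES);
    void *dst = aligned_alloc(DRAM_ROW_BYTES, DRAM_ROW_BYTES);
    if (!src || !dst)
        return 1;

    memset(src, 0xAB, DRAM_ROW_BYTES);
    copy_row(dst, src);                /* takes the CPU path here */

    free(src);
    free(dst);
    return 0;
}
```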
Implications
- Enabling Technology for PUD: PUMA provides the critical OS-level software support necessary to bridge the gap between theoretical Processing-Using-Memory (PUM) hardware capabilities and practical application deployment. Without correct memory management, PUD performance gains are lost to frequent CPU fallbacks.
- RISC-V Ecosystem Relevance: By successfully implementing and testing PUMA on a RISC-V machine running a modern Linux kernel, the research demonstrates that these advanced memory architectures are viable within the growing RISC-V hardware ecosystem.
- Performance Maximization: By ensuring data alignment, PUMA allows PUD systems to perform high-speed, parallel in-memory operations efficiently, maximizing throughput and energy savings inherent to PUM designs.