Near-Optimal Cache Sharing through Co-Located Parallel Scheduling of Threads
Abstract
This work introduces a novel mechanism, Co-Located Parallel Scheduling, aimed at achieving near-optimal efficiency in shared cache utilization among competing threads. The innovation focuses on intelligently grouping and scheduling threads whose memory access patterns minimize contention and maximize data reuse within shared cache resources (L2/L3). This scheduling approach promises significant performance gains and higher overall system throughput for multi-threaded workloads.
Report
Key Highlights
- Optimization Target: Near-optimal efficiency in cache sharing, addressing destructive cache interference (thrashing), one of the major bottlenecks in multi-core systems.
- Core Mechanism: The innovation lies in the scheduling policy itself, termed Co-Located Parallel Scheduling (CLPS), which decides where and when threads run relative to their cache dependencies.
- Grouping Strategy: CLPS likely groups threads that exhibit constructive cache interaction (data sharing) or are least disruptive to each other's working sets when sharing the same cache slice (a rough scoring sketch follows this list).
- Application Context: Highly relevant for systems running complex, parallel workloads, particularly in embedded computing environments where resource utilization is critical.
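Since this summary does not disclose how CLPS actually weighs these factors, the sketch below is only a minimal illustration of the idea: a hypothetical scoring function that rewards the estimated volume of data shared between two threads and penalises a combined working set that overflows the shared cache. All names, fields, and weights here are assumptions made for illustration, not the CLPS policy itself.

```c
#include <stddef.h>

/* Hypothetical per-thread profile; the metrics CLPS actually uses are
 * not disclosed in this summary. */
struct thread_profile {
    size_t working_set_bytes;   /* estimated working-set size             */
    double misses_per_insn;     /* observed cache misses per instruction  */
};

/*
 * Score a candidate co-location of two threads on cores that share a
 * cache of shared_cache_bytes.  Higher is better: estimated data sharing
 * is rewarded, overflowing the shared cache is penalised.  The weights
 * below are purely illustrative.
 */
static double colocation_score(const struct thread_profile *a,
                               const struct thread_profile *b,
                               size_t est_shared_bytes,
                               size_t shared_cache_bytes)
{
    /* Combined footprint, counting shared data only once. */
    double combined = (double)a->working_set_bytes
                    + (double)b->working_set_bytes
                    - (double)est_shared_bytes;

    double score = (double)est_shared_bytes;      /* constructive reuse */

    if (combined > (double)shared_cache_bytes)
        /* Working sets will not fit together: expect mutual evictions. */
        score -= combined - (double)shared_cache_bytes;

    /* Threads that already miss heavily make more disruptive neighbours. */
    score -= 1e6 * (a->misses_per_insn + b->misses_per_insn);

    return score;
}
```

A scheduler could evaluate such a score for candidate pairs (or larger groups) and co-locate the highest-scoring combinations on cores that share a cache.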
Technical Details
- Scheduling Focus: This is a software/runtime optimization, requiring modifications to the operating system scheduler or runtime library rather than changes to the cache hardware itself.
- Resource Management: The scheduler must be aware of the underlying Non-Uniform Cache Access (NUCA) or similar hierarchical cache topologies, placing threads onto cores that share the same L2 or L3 cache based on observed or predicted cache usage (see the affinity sketch after this list).
- Thread Metrics: The method likely relies on monitoring metrics such as misses per instruction (MPI), working set size, and data reuse between paired threads to determine the 'optimal' co-location grouping (see the measurement sketch after this list).
- Goal State: The scheduler attempts to form affinity groups whose combined working set fits within the shared cache partition, preventing frequent evictions.
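To make the topology awareness and affinity grouping above concrete, here is a minimal Linux sketch (not the actual CLPS implementation, which this summary does not describe) that reads the sysfs cache topology to find the CPUs sharing a last-level cache and pins the calling thread to that set. Assuming index3 corresponds to the L3 and trimming error handling are simplifications; a robust version would check each index's level and type files.

```c
/* Minimal Linux sketch: find the CPUs that share a given CPU's
 * last-level cache (via sysfs) and pin the calling thread to that set. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Parse a sysfs CPU list such as "0-3,8" into a cpu_set_t. */
static void parse_cpu_list(const char *list, cpu_set_t *set)
{
    char buf[256];
    CPU_ZERO(set);
    strncpy(buf, list, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        int lo, hi;
        if (sscanf(tok, "%d-%d", &lo, &hi) == 2) {
            for (int c = lo; c <= hi; c++)
                CPU_SET(c, set);
        } else if (sscanf(tok, "%d", &lo) == 1) {
            CPU_SET(lo, set);
        }
    }
}

/* Pin the calling thread to all CPUs sharing 'cpu's last-level cache.
 * On many systems index3 is the L3; a robust version would check the
 * 'level' file instead of hard-coding the index. */
static int pin_to_llc_domain(int cpu)
{
    char path[128], line[256];
    cpu_set_t set;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
             cpu);

    FILE *f = fopen(path, "r");
    if (!f || !fgets(line, sizeof(line), f)) {
        if (f) fclose(f);
        return -1;
    }
    fclose(f);

    parse_cpu_list(line, &set);
    /* pid 0 = the calling thread */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Each thread in an affinity group would be pinned to the same CPU set, so the group competes only for its own shared cache slice rather than for the whole machine.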
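Likewise, the per-thread metrics mentioned above can be gathered with standard facilities. The sketch below uses Linux perf_event_open(2) to count cache misses and retired instructions for the calling thread and derives misses per instruction; the event choice and the omitted error handling are simplifications, and this is only one plausible way to obtain the inputs a CLPS-style scheduler would need.

```c
/* Minimal Linux sketch: measure misses per instruction (MPI) for a
 * region of work executed by the calling thread. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: count for the calling thread on any CPU. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Run 'work' and return its observed misses per instruction. */
static double measure_mpi(void (*work)(void))
{
    int miss_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    int inst_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    uint64_t misses = 0, insts = 0;

    ioctl(miss_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(miss_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_ENABLE, 0);

    work();                                   /* region of interest */

    ioctl(miss_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_DISABLE, 0);

    read(miss_fd, &misses, sizeof(misses));
    read(inst_fd, &insts, sizeof(insts));
    close(miss_fd);
    close(inst_fd);

    return insts ? (double)misses / (double)insts : 0.0;
}
```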
Implications
- RISC-V Ecosystem Performance: Cache efficiency is paramount for multi-core and many-core RISC-V designs. This technique allows RISC-V platforms, particularly those targeting server or highly parallel embedded applications, to deliver significantly higher real-world throughput without increasing clock speeds or core counts.
- Power Efficiency: By reducing cache misses, the system spends less time fetching data from slower main memory (DRAM), directly translating into lower energy consumption, a major benefit for power-sensitive RISC-V profiles (e.g., mobile or IoT).
- Software Enablement: The development of such near-optimal scheduling algorithms provides critical intellectual property for operating systems running on RISC-V (such as optimized Linux kernels or specialized RTOSes), improving the overall competitive profile of the architecture.
- Design Freedom: Hardware designers can potentially simplify cache coherence protocols or cache hierarchy designs if the software layer (the scheduler) is guaranteed to handle resource contention intelligently, accelerating future RISC-V core development.
Technical Deep Dive Available
This public summary covers the essentials. The Full Report contains exclusive architectural diagrams, performance audits, and deep-dive technical analysis reserved for our members.