This paper presents a helper thread prefetching scheme designed to work on loosely-coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely-coupled processors have the advantage that fine-grain resources, such as processor and L1 cache resources, are not contended for by the application and helper threads, hence preserving the speed of the application. However, inter-processor communication is expensive in such a system, and we present techniques to alleviate this cost. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can run with respect to the application thread. We found that this control is important for ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely-coupled system can be done effectively, we evaluate our prefetching in a standard, unmodified CMP system and in an intelligent memory system where a simple processor in memory executes the helper thread. Evaluated on nine memory-intensive applications with the memory processor in DRAM, our scheme achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33.