Loop Transformations Leveraging Hardware Prefetching

Memory-bound applications heavily depend on the bandwidth of the system in order to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this depen- dency. However, choosing the right combination of optimizations that should be applied on the target loop nest is not a trivial task, due to the fact that most of these transformations alter the memory access pattern of the application and as a result interfere with the efficiency of the sophisticated hardware prefetching mechanisms present in most modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized stressing its temporal or spatial locality. We detect specific patterns in its definition while also taking the hardware prefetching units of modern processors into account. We implement our proposed technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average of performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.