TY - JOUR
T1 - Efficient execution of recursive programs on commodity vector hardware
AU - Ren, Bin
AU - Jo, Youngjoon
AU - Krishnamoorthy, Sriram
AU - Agrawal, Kunal
AU - Kulkarni, Milind
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/6
Y1 - 2015/6
N2 - The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units, as well as accelerators using Intel's AVX512 units.
AB - The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units, as well as accelerators using Intel's AVX512 units.
KW - Recursive Programs
KW - Task Parallelism
KW - Vectorization
UR - https://www.scopus.com/pages/publications/84951083894
U2 - 10.1145/2737924.2738004
DO - 10.1145/2737924.2738004
M3 - Article
AN - SCOPUS:84951083894
SN - 1523-2867
VL - 50
SP - 509
EP - 520
JO - ACM SIGPLAN Notices
JF - ACM SIGPLAN Notices
IS - 6
ER -