TY - CONF
T1 - Memory-mapping support for reducer hyperobjects
AU - Lee, I. Ting Angelina
AU - Shafi, Aamir
AU - Leiserson, Charles E.
PY - 2012
Y1 - 2012
N2 - Reducer hyperobjects (reducers) provide a linguistic abstraction for dynamic multithreading that allows different branches of a parallel program to maintain coordinated local views of the same nonlocal variable. In this paper, we investigate how thread-local memory mapping (TLMM) can be used to improve the performance of reducers. Existing concurrency platforms that support reducer hyperobjects, such as Intel Cilk Plus and Cilk++, take a hypermap approach in which a hash table is used to map reducer objects to their local views. The overhead of the hash table is costly: roughly 12× that of a normal L1-cache memory access on an AMD Opteron 8354. We replaced the Intel Cilk Plus runtime system with our own Cilk-M runtime system, which uses TLMM to implement a reducer mechanism that supports a reducer lookup using only two memory accesses and a predictable branch, roughly a 3× overhead compared to an ordinary L1-cache memory access. An empirical evaluation shows that the Cilk-M memory-mapping approach is close to 4× faster than the Cilk Plus hypermap approach. Furthermore, the memory-mapping approach admits better locality than the hypermap approach during parallel execution, which allows an application using reducers to scale better.
KW - Cilk
KW - Dynamic multithreading
KW - Memory mapping
KW - Reducer hyperobjects
KW - Reducers
KW - Task parallelism
KW - Thread-local memory mapping (TLMM)
KW - Work stealing
UR - http://www.scopus.com/inward/record.url?scp=84864142628&partnerID=8YFLogxK
DO - 10.1145/2312005.2312056
M3 - Conference contribution
AN - SCOPUS:84864142628
SN - 9781450312134
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 287
EP - 297
BT - SPAA'12 - Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures
T2 - 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12
Y2 - 25 June 2012 through 27 June 2012
ER -