TY - GEN
T1 - Performance modeling for highly-threaded many-core GPUs
AU - Ma, Lin
AU - Chamberlain, Roger D.
AU - Agrawal, Kunal
PY - 2014
Y1 - 2014
N2 - Highly-threaded many-core GPUs can provide high throughput for a wide range of algorithms and applications. Such machines hide memory latencies via the use of a large number of threads and large memory bandwidth. The achieved performance, therefore, depends on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). In this paper, we extend previously proposed analytical models, jointly addressing parallelism, latency-hiding, and occupancy. In particular, the model not only helps to explore and reduce the configuration space for tuning kernel execution on GPUs, but also reflects performance bottlenecks and predicts how the runtime will trend as the problem and other parameters scale. The model is validated with empirical experiments. In addition, the model points to at least one circumstance in which the occupancy decisions automatically made by the scheduler are clearly sub-optimal in terms of runtime.
KW - All-pairs Shortest Paths (APSP)
KW - GPGPU
KW - Performance Model
KW - Threaded Many-core Memory (TMM) Model
UR - https://www.scopus.com/pages/publications/84906336823
U2 - 10.1109/ASAP.2014.6868641
DO - 10.1109/ASAP.2014.6868641
M3 - Conference contribution
AN - SCOPUS:84906336823
SN - 9781479936090
T3 - Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors
SP - 84
EP - 91
BT - ASAP 2014 - Proceedings of the 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2014
Y2 - 18 June 2014 through 20 June 2014
ER -