TY - JOUR

T1 - Reducing queuing impact in streaming applications with irregular dataflow

AU - Timcheck, Stephen

AU - Buhler, Jeremy

N1 - Funding Information:
This work was supported by National Science Foundation awards CNS-1763503 and CNS-1500173.
Publisher Copyright:
© 2021 The Authors

PY - 2022/3

Y1 - 2022/3

N2 - Throughput-oriented streaming applications on massive data sets are a prime candidate for parallelization on wide-SIMD platforms, especially when inputs are independent of one another. Many such applications are represented as a pipeline of compute nodes connected by directed edges. Here, we study applications with irregular dataflow, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori. We consider how to implement such applications on wide-SIMD architectures, such as GPUs, where different nodes of the pipeline execute cooperatively on a single processor. To promote greater SIMD parallelism, irregular application pipelines can utilize queues to gather and compact multiple data items between nodes. However, the decision to introduce a queue between two nodes must trade off benefits to occupancy against costs associated with managing the queue and scheduling the nodes at its endpoints. Moreover, once queues are introduced to an application, their relative sizes impact the frequency with which the application switches between nodes, incurring scheduling and context-switching overhead. This work examines two optimization problems associated with queues. First, given a pipeline with queues between each pair of adjacent nodes and a fixed total budget for queue space, we consider how to choose the relative sizes of inter-node queues to minimize the frequency of switching between nodes. Second, we consider which pairs of successive nodes in a pipeline should have queues between them to maximize overall application throughput. We give an empirically useful approximation to the first problem that admits an analytical solution and formulate a performance model for the second that directs implementation toward higher-performing strategies. We implemented our analyses and resulting optimizations in applications built using Mercator, a framework we designed to support irregular streaming applications on NVIDIA GPUs. We demonstrate that these optimizations yield meaningful performance improvements for several benchmark Mercator applications.

AB - Throughput-oriented streaming applications on massive data sets are a prime candidate for parallelization on wide-SIMD platforms, especially when inputs are independent of one another. Many such applications are represented as a pipeline of compute nodes connected by directed edges. Here, we study applications with irregular dataflow, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori. We consider how to implement such applications on wide-SIMD architectures, such as GPUs, where different nodes of the pipeline execute cooperatively on a single processor. To promote greater SIMD parallelism, irregular application pipelines can utilize queues to gather and compact multiple data items between nodes. However, the decision to introduce a queue between two nodes must trade off benefits to occupancy against costs associated with managing the queue and scheduling the nodes at its endpoints. Moreover, once queues are introduced to an application, their relative sizes impact the frequency with which the application switches between nodes, incurring scheduling and context-switching overhead. This work examines two optimization problems associated with queues. First, given a pipeline with queues between each pair of adjacent nodes and a fixed total budget for queue space, we consider how to choose the relative sizes of inter-node queues to minimize the frequency of switching between nodes. Second, we consider which pairs of successive nodes in a pipeline should have queues between them to maximize overall application throughput. We give an empirically useful approximation to the first problem that admits an analytical solution and formulate a performance model for the second that directs implementation toward higher-performing strategies. We implemented our analyses and resulting optimizations in applications built using Mercator, a framework we designed to support irregular streaming applications on NVIDIA GPUs. We demonstrate that these optimizations yield meaningful performance improvements for several benchmark Mercator applications.

KW - Dataflow

KW - Irregular

KW - Queuing

KW - SIMD

KW - Streaming

UR - http://www.scopus.com/inward/record.url?scp=85119100804&partnerID=8YFLogxK

U2 - 10.1016/j.parco.2021.102863

DO - 10.1016/j.parco.2021.102863

M3 - Article

AN - SCOPUS:85119100804

SN - 0167-8191

VL - 109

JO - Parallel Computing

JF - Parallel Computing

M1 - 102863

ER -