TY - GEN
T1 - Design of throughput-optimized arrays from recurrence abstractions
AU - Jacob, Arpith C.
AU - Buhler, Jeremy D.
AU - Chamberlain, Roger D.
PY - 2010
Y1 - 2010
N2 - Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2x faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2x speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4-5x faster than the currently used latency-optimized array. These novel arrays are 70-172x faster than a software baseline.
KW - Dynamic programming
KW - FPGA
KW - Recurrences
KW - Systolic array
KW - Throughput optimization
UR - http://www.scopus.com/inward/record.url?scp=77955909861&partnerID=8YFLogxK
U2 - 10.1109/ASAP.2010.5540753
DO - 10.1109/ASAP.2010.5540753
M3 - Conference contribution
AN - SCOPUS:77955909861
SN - 9781424469673
T3 - Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors
SP - 133
EP - 140
BT - ASAP 10 - 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Conference Proceedings
T2 - 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2010
Y2 - 7 July 2010 through 9 July 2010
ER -