TY - GEN
T1 - Addressing queuing bottlenecks at high speeds
AU - Kumar, Sailesh
AU - Turner, Jonathan
AU - Crowley, Patrick
PY - 2005
Y1 - 2005
N2 - Modern routers and switch fabrics can have hundreds of input and output ports running at up to 10 Gb/s; 40 Gb/s systems are starting to appear. At these rates, the performance of the buffering and queuing subsystem becomes a significant bottleneck. In high-performance routers with more than a few queues, packet buffering is typically implemented using DRAM for data storage, off-chip SRAM for the linked-list nodes and packet lengths, and on-chip SRAM for the queue headers. This paper focuses on the performance bottlenecks associated with the use of off-chip SRAM. We show how the combination of implicit buffer pointers and multi-buffer list nodes can dramatically reduce the impact of the buffering and queuing subsystem on queuing performance. We also show how combining these techniques with coarse-grained scheduling can improve the performance of fair queuing algorithms while also reducing the amount of off-chip memory and bandwidth needed. These techniques can reduce the amount of SRAM needed to hold the list nodes by a factor of 10 at the cost of wasting about 10% of the DRAM space, assuming an aggregation degree of 16.
AB - Modern routers and switch fabrics can have hundreds of input and output ports running at up to 10 Gb/s; 40 Gb/s systems are starting to appear. At these rates, the performance of the buffering and queuing subsystem becomes a significant bottleneck. In high-performance routers with more than a few queues, packet buffering is typically implemented using DRAM for data storage, off-chip SRAM for the linked-list nodes and packet lengths, and on-chip SRAM for the queue headers. This paper focuses on the performance bottlenecks associated with the use of off-chip SRAM. We show how the combination of implicit buffer pointers and multi-buffer list nodes can dramatically reduce the impact of the buffering and queuing subsystem on queuing performance. We also show how combining these techniques with coarse-grained scheduling can improve the performance of fair queuing algorithms while also reducing the amount of off-chip memory and bandwidth needed. These techniques can reduce the amount of SRAM needed to hold the list nodes by a factor of 10 at the cost of wasting about 10% of the DRAM space, assuming an aggregation degree of 16.
UR - https://www.scopus.com/pages/publications/33751172719
U2 - 10.1109/CONECT.2005.7
DO - 10.1109/CONECT.2005.7
M3 - Conference contribution
AN - SCOPUS:33751172719
SN - 0769524494
SN - 9780769524498
T3 - Proceedings - Symposium on the High Performance Interconnects, Hot Interconnects
SP - 209
EP - 224
BT - Proceedings - 13th Symposium on High Performance Interconnects, Hot Interconnects 13
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th Symposium on High Performance Interconnects, Hot Interconnects 13
Y2 - 17 August 2005 through 19 August 2005
ER -