QFQ+ is a faster variant of QFQ that provides the following two benefits: 1) QFQ+ is faster than QFQ, 2) differently from QFQ, QFQ+ correctly schedules also non-leaves classes in a hierarchical setting. The way how QFQ+ achieves these goals is discussed briefly below. I also report some performance measurements. A detailed description of QFQ+ and of these results can be found in [1].
QFQ+ achieves a higher speed than QFQ by grouping classes into aggregates, and uses the original QFQ scheduling algorithm to schedule aggregates instead of single classes. An aggregate is made of at most M classes, all with the same weight and maximum packet size. M is equal to the minimum between tx_queue_len+1 and 8 (value chosen to get a good trade-off between execution time and service guarantees). QFQ+ associates each aggregate with a budget equal to the maximum packet size for the classes in the aggregate, multiplied by the number of classes of the aggregate. Once selected an aggregate for service, QFQ+ dequeues only the packets of its classes, until the aggregate finishes its budget. Finally, within an aggregate, classes are scheduled with DRR. In my tests, described below, the execution time of QFQ+ with M=8 was from 16% to 31% lower than that of QFQ, and close to that of DRR.
QFQ+ does not use packet lengths for computing aggregate timestamps, but budgets. Hence it does not need to modify any timestamp if the head packet of a class changes. As a consequence, differently from QFQ, which uses head-packet lengths to compute class timestamps, QFQ+ does not need further modifications to correctly schedule also non-leaf classes and classes with non-FIFO qdiscs.
As for service guarantees, thanks to the way how M is computed, the service of QFQ+ is close to the one of QFQ. For example, as proved in [1], under QFQ+ every packet of a given class is guaranteed the same worst-case completion time as under QFQ, plus an additional delay equal to the transmission time, at the rate reserved to the class, of three maximum-size packet. See [1, Section 7.1] for a numerical comparison among the packet delays guaranteed by QFQ+, QFQ and DRR.
I measured the execution time of QFQ+, DRR and QFQ using the testing
environment [2]. In particular, for each scheduler I measured the
average total execution time of a packet enqueue plus a packet
dequeue, without TSO/GSO. What would happen with TSO/GSO is discussed
at the end of this description. For practical reasons, in this testing environment each
enqueue&dequeue is also charged for the cost of generating and
discarding an empty, fixed-size packet (using a free list). The
following table reports the results with an i7-2760QM, against four
different class sets. Time is measured in nanoseconds, while each set
or subset of classes is denoted as
----------------------------------------------- | Set of | QFQ+ (M) | DRR QFQ | | classes | 1 8 | | |-----------------------------------------------| | 1k-w1 | 89 63 | 56 81 | |-----------------------------------------------| | 500-w1, | | | | 250-w2, | 102 71 | 87 103 | | 250-w4 | | | |-----------------------------------------------| | 32k-w1 | 267 225 | 173 257 | |-----------------------------------------------| | 16k-w1, | | | | 8k-w2, | 253 187 | 252 257 | | 8k-w4 | | | -----------------------------------------------
About DRR, it achieves its best performance when all the classes have the same weight. This is fortunate, because in such scenarios it is actually pointless to use a fair-queueing scheduler, as the latter would provide the same quality of service as DRR. In contrast, when classes have differentiated weights and the better service properties of QFQ+ make a difference, QFQ+ has better performance than DRR. It happens mainly because QFQ+ dequeues packets in an order that causes about 8% less cache misses than DRR. As for the number of instructions, QFQ+ executes instead about 7% more instructions than DRR, whereas QFQ executes from 25% to 34% more instructions than DRR.
With TSO/GSO, the number of instructions per packet enqueue/dequeue executed by QFQ and QFQ+ is the same as without TSO/GSO. In contrast, for each dequeue, the number of iterations executed by the main loop of DRR is 64K/quantum times as high as without TSO/GSO.
[1] P. Valente, "Reducing the Execution Time of Fair-Queueing Schedulers"
[2] http://algogroup.unimore.it/people/paolo/agg-sched/test-env.tgz