Intro to QoS // The HQF


The purpose of this post is to provide a basic description and implementation of Quality of Service techniques, specifically in a Cisco deployment. Additionally, this will serve as a QoS study guide for the CCIE R&S exam. Please refer to vendor documentation and third-party literature for more in-depth concepts.

QoS is typically broken out into different aspects, which makes it a little hard to understand at the beginning. The most basic definition would be that QoS is a technique for treating multiple types of traffic (for any kind of application) in different ways by providing different classes or traffic priorities. In other words, we manipulate the traffic in a determined manner depending on the classification or marking that we have previously given to each traffic flow.

Tools that we need to know



Congestion Avoidance

Congestion Management




We then apply these tools in a hierarchical fashion

Hierarchical Queueing Framework or Modular Queueing CLI

Consider this: every bit of information that we attempt to send out an interface needs to be transformed into electrical or optical signals and be put on the interface hardware in a queue called the TxRing or hardware queue. The process of doing this is called serialization. This hardware queue always has FIFO (first in, first out) behavior.

This is the topology that we will use for the examples throughout this post:

Classification and Marking

Let’s look at the definitions found in the End-to-End QoS Network Design book:

Classification tools sort packets into different traffic types, to which different policies then, can be applied. Classifiers inspect one or more fields in a packet to identify the type of traffic that the packet is carrying, after being identified the traffic can then be passed to re-marking, queuing, policing, shaping, etc.

Marking (or re-marking) typically establishes a trust boundary on which scheduling tools later depend. Markers write a field within the packet, frame, cell or label to preserve the classification decision that was reached at the trust boundary, subsequent nodes do not have to perform the same in-depth classification.

What we should take away from here is that classification allows organizing traffic into traffic classes or categories on the basis of whether the traffic matches specific criteria or not. For instance, we can use an ACL to match traffic originating from a certain subnet and put all this traffic in a class named CLASS_SOURCING_A. Such a configuration would look like this:
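A minimal sketch of that kind of configuration, assuming a hypothetical ACL name (ACL_SOURCING_A) and a hypothetical source subnet of 10.1.1.0/24, neither of which comes from the topology used in this post:

```
! Hypothetical ACL and subnet, for illustration only
ip access-list standard ACL_SOURCING_A
 permit 10.1.1.0 0.0.0.255
!
! The class-map references the ACL by name; match-all means every
! match statement in the class must be true for a packet to belong to it
class-map match-all CLASS_SOURCING_A
 match access-group name ACL_SOURCING_A
```

On its own a class-map does nothing to the traffic; it only takes effect once a policy-map references it and that policy-map is applied to an interface.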

This is how we would normally implement classification: we create a class-map and match on a given criterion. These are all the options that we have available on IOS-XE 15.4 code:

On the other hand, marking sets attributes on the traffic, and those attributes determine how the traffic will be treated across the network. When marking, we manipulate the Differentiated Services Code Point (DSCP) field, Class Selector (CS) code point, Class of Service (CoS), IP Precedence (IPP in the ToS field) or Traffic Identifier (TID). These are all different terms used to indicate a designated field in a L2 or L3 header.

For instance, we can modify the ToS field in the IP header to give different priorities to certain traffic flows. RFC-791 defines the priorities and values encoded in the first three bits of the ToS field:

Most of the time, we will mark or re-mark with either the IPP or the DSCP codes (using decimal, binary or Per Hop Behavior (PHB) notation). To make it clear, DSCP is a more granular version of IPP: DSCP is made of 6 bits and IPP of 3 bits, and the IPP bits are the three high-order bits of the DSCP. DSCP also has a PHB nomenclature comprised of BE, AF and EF, with EF being the most critical; its decimal value is 46 and its binary value is 101110. The following is an image from the End-to-End QoS book, whose Classification and Marking section is highly recommended reading in order to understand the DSCP and IPP concepts:

Also, RFC-4594 describes the service classes configured with DiffServ and recommends how they can be used and how to construct them using DSCPs. Cisco QoS marking recommendations follow this RFC with one single exception: the markings for CS5 (DSCP decimal 40 or IPP 5) and CS3 (DSCP decimal 24 or IPP 3) are swapped between call signaling and broadcast video. The following is an image describing these recommendations; another reference for DSCP/PHB/IPP conversion lies here.

Look at the difference in the default DSCP values between an ICMP packet and an EIGRP Hello packet:

Sensibly, ICMP by default receives DSCP CS0 or “Best-Effort” behavior, while EIGRP Hellos, being a necessary keepalive for EIGRP adjacencies to remain established, receive DSCP CS6.

Let’s generate ICMP traffic from OS1 going to CSR2’s loopback. The traffic would go from OS1 to CSR1 and then to CSR2 through the G21 link. We will apply a policy-map to CSR1’s Gi3 interface matching the class-default class (this class is always there and always matches all traffic not matched by another class) and re-mark the traffic with DSCP 46 (Expedited Forwarding). Here’s how:
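A minimal sketch of such a re-marking policy. The policy-map name MARK_EF is hypothetical, and the inbound direction is an assumption based on Gi3 being the interface where OS1's traffic arrives:

```
! Hypothetical policy-map name
policy-map MARK_EF
 class class-default
  ! DSCP 46 in PHB notation is EF (Expedited Forwarding)
  set dscp ef
!
interface GigabitEthernet3
 ! Direction is an assumption: mark traffic as it enters from OS1
 service-policy input MARK_EF
```

Because the only class is class-default, everything arriving on the interface is re-marked, not just ICMP.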

If we now do a capture on CSR1’s Gi2 interface, we can see the re-marked ICMP packets with a DSCP value of EF (PHB notation). We can also see that our policy-map has gotten some hits, based on the show policy-map interface Gi3 output:


Congestion Avoidance

There are a few things we can do to avoid having a congested link. As opposed to congestion management techniques, here we deal with traffic patterns and spikes to avoid collapsing the link. Congestion avoidance is achieved through packet dropping, and some of the common ways of dropping packets are Tail Drop, RED and WRED. Let’s examine each.

Tail Drop

TCP synchronization is what happens when many TCP flows traversing an interface cut their window sizes in half at the same time and slow start begins for all of them together. It is usually triggered when the interface drops a large number of packets across flows at once. This is the so-called sawtooth behavior of TCP traffic, and synchronization results in poor bandwidth utilization on the link.

Tail drop can result in global TCP synchronization, which we need to avoid. Tail drop treats all traffic equally and does not differentiate between classes: when an output queue is full, newly arriving packets are dropped until the congestion ceases and the queue is no longer full. Tail drop is enabled by default.


Random Early Detection (RED) works by signaling the end hosts that they should temporarily slow down their traffic. Because most of the traffic across networks is TCP based, RED takes advantage of this by randomly dropping packets from the queue before the buffer is 100% full, which makes the affected senders back off before the link becomes congested. This results in more even traffic patterns (less sawtooth).

In IOS-XE, we can enable RED by adding the random-detect keyword to a policy-map. For instance, let’s enable RED on CSR2’s Gi1 interface by applying a random-detect enabled policy-map called “RED” on the interface. We will also lower the bandwidth of Gi1 to 10 Kbps and then generate a 100 Kbps TCP flow of traffic from OS1 directed to CSR3’s Loopback. On top of that, CSR1 will ping CSR3’s Loopback with a size of 1500 bytes. Here’s how:
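A minimal sketch of the router-side part of this setup (the traffic generation on OS1 and the ping from CSR1 are not shown). The outbound direction is an assumption, and depending on the release, random-detect may only be accepted in a class that also carries a queueing command such as bandwidth, shape or fair-queue:

```
! CSR2: policy-map named RED, as in the text
policy-map RED
 class class-default
  random-detect
!
interface GigabitEthernet1
 ! The interface bandwidth command takes Kbps, so this is 10 Kbps
 bandwidth 10
 ! Direction is an assumption: queueing features act on egress traffic
 service-policy output RED
```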

So I’ve let the traffic flow for a while and now, if we check the output of show policy-map interface Gi1 on CSR2, we can see that we have randomly dropped some packets, both IPP 0 and IPP 5, and we have also tail dropped some of them:

WRED works in a very similar way, but it leverages the packet marking (DSCP here) for more granularity: instead of dropping randomly across all traffic flows, the dropping is based on the priority of each specific traffic flow. More technically, each marking has a minimum threshold, a maximum threshold and a “Mark Probability Denominator”. The drop probability increases as the average queue depth grows from the minimum to the maximum threshold, where it reaches 1 divided by the mark probability denominator. Again, if the queue exceeds the maximum, tail drop starts occurring. WRED is configured in the same fashion as RED, but we add the dscp-based keyword to the random-detect line:
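A minimal sketch of a DSCP-based WRED policy. The policy-map name WRED is hypothetical, and the per-DSCP thresholds on the last line are illustrative values rather than recommendations:

```
! Hypothetical policy-map name
policy-map WRED
 class class-default
  random-detect dscp-based
  ! Optional per-DSCP tuning (illustrative values):
  ! min threshold 20 packets, max threshold 40 packets,
  ! mark probability denominator 10 (1 in 10 dropped at the max threshold)
  random-detect dscp ef 20 40 10
```

If the per-DSCP line is omitted, the platform defaults for each code point apply.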

Congestion Management

Congestion Management typically deals with the following scenario: “The link is congested, we have 100% utilization on the output queue, what are we going to do now?”. This applies outbound only. Once the software queue is full, we need to figure out what to do with the traffic: do we re-order the frames? Do we sacrifice one for another (drop)? This is what congestion management tools deal with. By default all queues have a FIFO behavior, but we can also leverage Weighted Fair Queueing to prioritize some traffic that is delay sensitive, for instance, VoIP flows.


When using Weighted Fair Queueing, the system automatically allocates an equal share of bandwidth to each traffic flow. Packets with the same source/destination IP addresses and TCP/UDP ports belong to the same flow. WFQ is enabled simply by typing the fair-queue keyword per class under the policy-map:
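A minimal sketch, assuming a hypothetical policy-map name (WFQ) and enabling flow-based fair queueing in class-default:

```
! Hypothetical policy-map name
policy-map WFQ
 class class-default
  ! Flows within this class share the class bandwidth fairly
  fair-queue
```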

We will focus on Class Based Weighted Fair Queueing (CBWFQ), because it is the one we would implement using the HQF. CBWFQ simply designates a weighted queue per user-defined class, meaning every time we specify a class under a policy-map, we are enabling a weighted queue for that class, and the weight for this class is defined by the bandwidth command.

Let’s look at an example: we will create two class-maps matching HTTP and ICMP respectively, and then a policy-map doing bandwidth reservation and LLQ that we will apply to Gi2 outbound (I have deleted every previous class-map and policy-map):
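A minimal sketch of that configuration, using the 50% reservation and 5% priority values discussed in this section. The class-map and policy-map names are hypothetical:

```
! NBAR-based classification (hypothetical class names)
class-map match-all HTTP
 match protocol http
class-map match-all ICMP
 match protocol icmp
!
! Hypothetical policy-map name
policy-map CBWFQ_LLQ
 class HTTP
  ! CBWFQ: minimum guarantee during congestion
  bandwidth percent 50
 class ICMP
  ! LLQ: strict-priority queue, policed to 5% during congestion
  priority percent 5
!
interface GigabitEthernet2
 service-policy output CBWFQ_LLQ
```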

So in the configuration above, we have defined two classes matching the desired type of traffic based on NBAR (try doing a show ip nbar port-map), and then created a policy-map calling both classes.

For the HTTP traffic, we have given it a bandwidth reservation of 50% (this is based on the software-defined bandwidth of the link that we apply this policy-map to). This means that the minimum bandwidth reserved for this type of traffic when the link is congested is 50%. We could have also used a fixed value like 50 Mbps, but I prefer to use a percentage. The default bandwidth reservation for the class-default class in this code is 1%, but the system will let you allocate up to 100% of the link for user-defined classes. Keep in mind that, should the link become congested while 100% of the bandwidth is allocated to user-defined classes, we could start dropping packets belonging to unclassified traffic.

As for the ICMP traffic, we have given it a priority of 5%. Priority (LLQ) means the maximum percentage guaranteed, so in this case up to 5% of the bandwidth of the link is guaranteed for ICMP traffic. If this type of traffic goes above 5% and the link is congested, a built-in policer could start dropping traffic based on whether the TxRing is full at that moment or not. If, on the other hand, the ICMP traffic goes over 5% of the link and the link is not congested, it will start receiving a FIFO treatment.

So if we start generating some HTTP and ICMP traffic from OS1 going to CSR2’s Loopback, we should start seeing some counters incrementing when we do a show policy-map interface Gi2:


Policers are typically used to rate-limit traffic as it enters an interface. They can also be applied outbound, but in most cases we would use them to, for instance, limit the rate at which traffic coming from a customer enters a Provider Edge (PE) router interface. Policers work on the Token Bucket model, which is a formal definition of a rate of transfer. It is composed of three components: the committed burst (bc), the committed information rate (cir) and the time interval (tc).

The cir is in most cases the one we will consider when limiting the traffic. The bc specifies, in bytes or bits, how much traffic can be sent within a given interval, and the tc specifies the amount of time between each burst. Traffic is said to conform if it falls within the cir; otherwise, it is said to exceed.

cir = bc / tc

Let’s look at an example: we will police on CSR3’s Gi3 interface to rate-limit incoming traffic matching everything (class-default) to 8 Kbps, which is the minimum allowed in the code that I’m using. By default, conforming traffic gets marked to transmit and exceeding traffic gets marked to be dropped. Dropping is a marking action; we could instead mark the packet so that it receives a low priority after a certain threshold.
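A minimal sketch of such a policer, assuming a hypothetical policy-map name (POLICE_8K). The conform and exceed actions are spelled out here even though they match the defaults described above:

```
! Hypothetical policy-map name
policy-map POLICE_8K
 class class-default
  ! cir is given in bits per second: 8000 bps = 8 Kbps
  ! bc is left out so the router calculates it
  police cir 8000 conform-action transmit exceed-action drop
!
interface GigabitEthernet3
 service-policy input POLICE_8K
```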

And if we look at the show policy-map interface command, we see that the bc value has been automatically calculated for us:

This is what’s called a 2 color marker; the two colors are conform and exceed. The bc being 1500 bytes means we are allowed to receive 1500 bytes per interval in order to achieve the target cir of 24 Kbps. If we convert 1500 bytes to bits, we get 12000 bits, and because tc = bc/cir, we can conclude that tc = 12000 / 24000 = 0.5 seconds, which is the same as 500 ms. This means the policer will run 2 times per second, because 1000 ms / 2 = 500 ms.

So if we wanted to make this policer more strict, we would lower the bc in order to get a lower tc. In any case, the user-manipulated tc ideally would have to be between 1 and 125 ms; otherwise the router will automatically pick an internal tc value that it considers more stable.

In the output above, we can see that 1726 packets conformed and were transmitted, and 23732 packets did not. We could also have said that, if the cir is exceeded, we will not drop the packets but will instead re-mark them so that they are assigned to a scavenger class or given best-effort behavior, like this:
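A minimal sketch of a re-marking policer. The policy-map name is hypothetical, and the use of CS1 as the scavenger marking is an assumption based on common Cisco practice:

```
! Hypothetical policy-map name
policy-map POLICE_REMARK
 class class-default
  ! Exceeding packets are re-marked and transmitted instead of dropped.
  ! CS1 as the scavenger marking is an assumption; use "default" for best-effort.
  police cir 8000 conform-action transmit exceed-action set-dscp-transmit cs1
```

Re-marked packets are still forwarded, so downstream devices need a policy that actually treats the scavenger marking differently for this to have any effect.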


A shaper typically delays excess traffic by using a buffer or queueing mechanism to hold packets whenever the rate of transmission exceeds a certain user-defined threshold. A good shaper in most cases will match the settings of the policer on the other end of the link. For instance, we could use a shaper outbound to match the cir and bc at which the service provider is policing us on the gateway interface. Shapers work by sending the bc amount of data every tc interval at the physical port speed (serialization). The default queue type on a shaper is FIFO, but we can enable WFQ inside the shaper.

So the goals of a shaper could be summarized as: smooth out traffic bursts, prepare traffic for ingress policing and delay/queue up exceeding traffic.

Let’s go into CSR2 and create a shaper on Gi3. We will assume we want to match a 5 Mbps policer on our gateway to the internet:
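A minimal sketch of that shaper, assuming a hypothetical policy-map name (SHAPE_5M). The bc and be values are left out so the router calculates them:

```
! Hypothetical policy-map name
policy-map SHAPE_5M
 class class-default
  ! Rate is given in bits per second: 5000000 bps = 5 Mbps
  shape average 5000000
!
interface GigabitEthernet3
 ! Shaping only applies to egress traffic
 service-policy output SHAPE_5M
```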

Then if we look at the show policy-map interface Gi3 output:

We can see that the bc and be values have been automatically calculated for us. Notice that be = bc; they are always going to be the same unless be is manually specified. be is the excess burst: back to our Token Bucket model, be is the amount of bits allowed to be transmitted on top of bc if the bc bucket did not get emptied completely after a tc interval. Also notice that the queue limit is 64 packets, but we could have changed this easily by adding the queue-limit keyword to the fair-queue command under the shaper.

So now Gi3 will be allowed to send out bursts of 20 Kbits every tc interval, which at a cir of 5 Mbps works out to a tc of 4 ms (20,000 bits / 5,000,000 bps). Remember that these bursts are always sent at the Access Rate (the actual speed of the interface).


If you have read all the way to this point, HQF is what we have been doing all this time. HQF basically refers to the “new way” or the new syntax for configuring QoS in Cisco devices, which is a nested structure: class-maps match and classify the traffic, policy-maps manipulate these classes, and the service-policy command applies the policy-maps.

This comes in handy when, for instance, we have logical interfaces like subinterfaces and tunnel interfaces: we can implement a traffic-limiting feature at the parent level and queueing at the lower levels. Let’s look at an example.

We will create nested policies: the parent policy will be used to shape the cir on an interface to whatever we want, and then we will allocate different bandwidth reservations to different services, like VoIP or MC PIM traffic:
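A minimal sketch of such a nested policy, using the SHAPE_100M and SERVICES names discussed in this section. The class-map names, the NBAR protocols chosen to represent VoIP and PIM, the percentages and the interface are all hypothetical:

```
! Hypothetical class-maps; rtp stands in for VoIP media here
class-map match-all VOIP
 match protocol rtp
class-map match-all MC_PIM
 match protocol pim
!
! Child policy: per-service reservations (illustrative percentages)
policy-map SERVICES
 class VOIP
  bandwidth percent 20
 class MC_PIM
  bandwidth percent 5
!
! Parent policy: shape everything to 100 Mbps, then hand off to the child
policy-map SHAPE_100M
 class class-default
  shape average 100000000
  service-policy SERVICES
!
! Hypothetical interface
interface GigabitEthernet2
 service-policy output SHAPE_100M
```

Only the parent is attached to the interface; the child is reached through the service-policy line inside the parent's class, so its percentages are taken from the 100 Mbps shaped rate rather than the physical interface speed.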

So we have defined a parent policy-map called SHAPE_100M, which shapes all traffic (because it is matching class-default) to 100 Mbps, and from this parent policy we have called the child policy called SERVICES. The SERVICES policy-map matches on different protocols and allocates a different percentage of bandwidth to each. This is how you would create nested policies using the HQF. The bandwidth allocations for the different services are based on the parent policy, which is shaping at 100 Mbps: