May 01

The problem of unequal-cost load-balancing

Have you ever wondered why, among all the IGPs, only EIGRP supports unequal-cost load balancing (UCLB)? Is there any special reason why only EIGRP supports this feature? Apparently, there is. Let’s start with the basic idea of equal-cost load balancing (ECLB). This one is simple: if there are multiple paths to the same destination with equal costs, it is reasonable to use them all and share traffic equally among them. Alternate paths are guaranteed to be loop-free, as they are “symmetric” to the primary path with respect to cost. If there are multiple paths of unequal cost, the same idea cannot be applied as easily. For example, consider the figure below:

[Figure uclb1: R1 reaches R2 either directly (path cost 10) or via R3 (path cost 120)]

Suppose there is a destination behind R2 that R1 routes to. There are two paths from R1 to R2: one directly to R2, and another via R3. The cost of the primary path is 10 and the cost of the secondary path is 120. Intuitively, it would make sense to send traffic across both paths in the proportion 12:1 to make the most use of the network. However, if R3 implements the same idea of unequal-cost load balancing, we’ve got a problem. R3’s primary path to R2 is via R1, so some of the packets that R1 sends to R2 via R3 will be routed back to R1. This is the core problem of UCLB: some secondary paths may result in routing loops, as a node on the path may prefer to route back to the origin.

EIGRP’s solution to the problem

As you remember, EIGRP only uses an alternate path if it satisfies the feasibility condition: AD < FD, where AD is the “advertised distance” (the peer’s metric to reach the destination) and FD is the “feasible distance” (the local best metric to the destination). The condition ensures that the path chosen by the peer can never lead into a loop, as our primary path is not part of it (otherwise, AD would be greater than or equal to FD). In our case, if we look at R1, FD=10 to reach R2 while R3’s AD=100, so the alternate path may happen to lead into a loop. It has been proven by Garcia-Luna-Aceves that the feasibility condition is sufficient to always select loop-free alternate paths; the interested reader may look at the original DUAL paper for the proof. In addition, if you still think that EIGRP is history, I recommend reading RFC 5286 to find the same loop-free condition used in the basic IP Fast Reroute procedure (there are alternate approaches to IP FRR, though). Since the feasibility condition only allows EIGRP to select loop-free alternatives, it is safe to enable UCLB on EIGRP routers: provided that EIGRP is the only routing protocol in use, the alternate paths will never result in a routing loop.
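To make the condition concrete, here is a tiny Python sketch of the feasibility check with the numbers from the figure above (an illustration of the rule, not EIGRP’s actual implementation):

def is_feasible_successor(ad, fd):
    # Feasibility condition: the peer must be strictly closer to the
    # destination than our own best metric; this rules out any peer
    # whose path runs back through us.
    return ad < fd

print(is_feasible_successor(ad=100, fd=10))  # False: R3 fails the check on R1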

Configuring EIGRP for Unequal-Cost Load Balancing

A few words about configuring UCLB in EIGRP. You achieve this by setting the “variance” value to something greater than 1. The EIGRP routing process will install into the local routing table all feasible paths with metric < best_metric*variance, where metric is the full metric of the alternate path and best_metric is the metric of the primary path. By default, the variance value is 1, meaning that only equal-cost paths are used. Let’s configure a quick test-bed scenario. We will use EIGRP as the routing protocol on all routers and set the metric weights so that K1=0, K2=0, K3=1, K4=K5=0. This means that only the link delay is used for EIGRP metric calculations; such a configuration makes the EIGRP metric purely additive and easy to work with.
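For reference, a minimal sketch of what the corresponding R1 configuration might look like (classic-mode EIGRP assumed; the AS number matches the show output below, the network statement is illustrative, and the variance value is derived in the paragraph after the figure):

router eigrp 100
 network 155.1.0.0 0.0.255.255
 metric weights 0 0 0 1 0 0
 variance 4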

[Figure uclb2: EIGRP test topology – R1 reaches R2’s Loopback0 over the direct R1-R2 link or via R3; the link delays drive the metrics computed below]

The metric to reach R2’s Loopback0 from R1 via the directly connected link is FD = 256*(10+10) = 5120. R3 is a feasible successor for the same destination, as its AD = 5*256 = 1280 < FD. Thus, R1 may use it for unequal-cost load balancing. We should set the variance to satisfy the requirement (50+10+5)*256 < 256*(10+10)*variance, where (50+10+5)*256 = 16640 is the alternate path’s full metric. From this inequality, variance > 65/20 = 3.25, and thus we need to set the value to at least 4 in order to utilize the alternate path. If we look at R1’s routing table, we see the following:

Rack1R1#show ip route 150.1.2.2
Routing entry for 150.1.2.0/24
Known via "eigrp 100", distance 90, metric 5120, type internal
Redistributing via eigrp 100
Last update from 155.1.13.3 on Serial0/0.13, 00:00:04 ago
Routing Descriptor Blocks:
155.1.13.3, from 155.1.13.3, 00:00:04 ago, via Serial0/0.13
Route metric is 16640, traffic share count is 37
Total delay is 650 microseconds, minimum bandwidth is 128 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 2
* 155.1.12.2, from 155.1.12.2, 00:00:04 ago, via Serial0/0.12
Route metric is 5120, traffic share count is 120
Total delay is 200 microseconds, minimum bandwidth is 1544 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 1

Notice the traffic share counters – essentially, they represent the ratios of traffic shared between the two paths. You may notice that 120:37 is almost exactly 16640:5120 – that is, the amount of traffic sent across a particular path is inversely proportional to the path’s metric. This is the default EIGRP behavior, set by the EIGRP process command traffic-share balanced. You may change this behavior using the command traffic-share min across-interfaces, which instructs EIGRP to use only the minimum-cost path (or paths, if there are several). Other feasible paths will be kept in the routing table but not used until the primary path fails. The benefit is slightly faster convergence, as there is no need to insert the alternate path into the RIB upon failure, compared to scenarios where UCLB is disabled. This is how the routing table entry looks when you enable minimal-metric path routing:

Rack1R1#sh ip route 150.1.2.2
Routing entry for 150.1.2.0/24
Known via "eigrp 100", distance 90, metric 130560, type internal
Redistributing via eigrp 100
Last update from 155.1.13.3 on Serial0/0.13, 00:00:14 ago
Routing Descriptor Blocks:
155.1.13.3, from 155.1.13.3, 00:00:14 ago, via Serial0/0.13
Route metric is 142080, traffic share count is 0
Total delay is 5550 microseconds, minimum bandwidth is 128 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 2
* 155.1.12.2, from 155.1.12.2, 00:00:14 ago, via Serial0/0.12
Route metric is 130560, traffic share count is 1
Total delay is 5100 microseconds, minimum bandwidth is 1544 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 1
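For reference, the minimal-metric behavior shown above is enabled under the EIGRP process (a sketch, classic-mode syntax assumed); configuring traffic-share balanced instead restores the default proportional sharing:

router eigrp 100
 traffic-share min across-interfaces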

How CEF implements Unequal-Cost Load-Balancing

And now, a few words about the data plane implementation of UCLB. It is not documented on Cisco’s documentation site; I first read about it in the Cisco Press book “Traffic Engineering with MPLS” by Eric Osborne and Ajay Simha, and the method does not seem to have changed since then.

As you know, the most prevalent switching method in Cisco IOS is CEF (we don’t consider distributed platforms here). Layer 3 load balancing is similar to the load balancing used with Ethernet port-channels. The router takes the ingress packet, hashes the source and destination IP addresses (and possibly L4 ports), and normalizes the result to a fixed range, say [1-16] (the result is often called a “hash bucket”). The router then uses the hash bucket as a selector for one of the alternate paths. If the hash function distributes IP src/dst combinations evenly across the result space, all paths will be utilized equally. In order to implement unequal balancing, an additional level of indirection is needed. Suppose you have 3 alternate paths, and you want to balance them in the proportion 1:2:3. You need to fill the 16 hash buckets with path selectors that maintain that proportion. Solving the simple equation 1*x + 2*x + 3*x = 16, we find that x = 2.67, or 3 when rounded. Thus, to maintain the proportions 1:2:3 we need to associate 3 hash buckets with the first path, 6 hash buckets with the second path and the remaining 7 hash buckets with the third path. This is not exactly the desired proportion, but it is close. Here is how the “hash bucket” to “path ID” mapping may look to implement the above proportion of 1:2:3:

Hash | Path ID
--------------
[01] -> path 1
[02] -> path 2
[03] -> path 3
[04] -> path 1
[05] -> path 2
[06] -> path 3
[07] -> path 1
[08] -> path 2
[09] -> path 3
[10] -> path 2
[11] -> path 3
[12] -> path 2
[13] -> path 3
[14] -> path 2
[15] -> path 3
[16] -> path 3

Once again, provided that the hash function distributes inputs evenly among all buckets, the paths will be used in the desired proportions. As you can see, the way the control plane divides traffic flows among different paths may be severely affected by the data plane implementation.
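The bucket-filling arithmetic described above is easy to sketch in Python (an illustration of the idea, not Cisco’s actual code):

def fill_buckets(shares, num_buckets=16):
    # Common multiplier, as in 1*x + 2*x + 3*x = 16: x = 16/6 ~ 2.67,
    # rounded to 3. Each path gets share*x buckets, and the last path
    # absorbs the rounding remainder so the total stays at num_buckets.
    x = round(num_buckets / sum(shares))
    counts = [s * x for s in shares[:-1]]
    counts.append(num_buckets - sum(counts))
    return counts

print(fill_buckets([1, 2, 3]))  # [3, 6, 7], matching the table above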

CEF may load-balance with per-packet or per-flow granularity. In the first case, successive packets of the same flow (src/dst IP and possibly src/dst ports) are routed across different paths. While this may look like a good way to better utilize all paths, it usually results in packets arriving out of order. The result is poor application performance, since many L4 protocols behave better when packets arrive in order. Thus, in the real world, the preferred and default load-balancing mode is per-flow (called per-destination in CEF terminology). To change the CEF load-balancing mode on an interface, use the interface-level command ip load-sharing per-packet|per-destination. Notice that it only affects packets ingressing the configured interface.
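For example, switching to per-packet balancing for traffic entering R1 would look like the following (a sketch; the ingress interface name is illustrative, and per-destination is the default, so it rarely needs to be configured explicitly):

interface FastEthernet0/0
 ip load-sharing per-packet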

Let’s look at the CEF data structures that reveal how load balancing is implemented. The following command shows all the alternate adjacencies used to route packets toward the prefix in question. This is the first part of the output:

Rack1R1#show ip cef 150.1.2.2 internal
150.1.2.0/24, version 77, epoch 0, per-destination sharing
0 packets, 0 bytes
via 155.1.13.3, Serial0/0.13, 0 dependencies
traffic share 37
next hop 155.1.13.3, Serial0/0.13
valid adjacency
via 155.1.12.2, Serial0/0.12, 0 dependencies
traffic share 120
next hop 155.1.12.2, Serial0/0.12
valid adjacency

The next block of information is more interesting. It reveals those 16 hash buckets, loaded with the path indexes (after the “Load distribution” string). Here zero means the first path, and 1 means the second, alternate path. As we can see, 4 buckets are loaded with path “zero” and 12 buckets are loaded with path “one”. The resulting share 4:12 = 1:3 is very close to 37:120, so in this case the data plane implementation did not change the load-balancing logic too much. The rest of the output details every hash bucket index and the associated output interface, as well as the number of packets switched using each particular hash bucket. Keep in mind that only CEF-switched packets are reflected in the statistics, so if you use the ping command from the router itself, you will not see any counters incrementing.

[...Output Continues...]
0 packets, 0 bytes switched through the prefix
tmstats: external 0 packets, 0 bytes
internal 0 packets, 0 bytes
Load distribution: 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 (refcount 1)

Hash OK Interface Address Packets
1 Y Serial0/0.13 point2point 0
2 Y Serial0/0.12 point2point 0
3 Y Serial0/0.13 point2point 0
4 Y Serial0/0.12 point2point 0
5 Y Serial0/0.13 point2point 0
6 Y Serial0/0.12 point2point 0
7 Y Serial0/0.13 point2point 0
8 Y Serial0/0.12 point2point 0
9 Y Serial0/0.12 point2point 0
10 Y Serial0/0.12 point2point 0
11 Y Serial0/0.12 point2point 0
12 Y Serial0/0.12 point2point 0
13 Y Serial0/0.12 point2point 0
14 Y Serial0/0.12 point2point 0
15 Y Serial0/0.12 point2point 0
16 Y Serial0/0.12 point2point 0
refcount 6
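As a quick sanity check, the 4:12 split is just the 37:120 traffic shares normalized into 16 buckets (illustrative arithmetic):

print(round(16 * 37 / (37 + 120)))       # 4 buckets for the higher-metric path
print(16 - round(16 * 37 / (37 + 120)))  # 12 buckets for the lower-metric path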

Any other protocols supporting Unequal Cost Load-Balancing?

As we already know, neither OSPF, IS-IS nor RIP can support UCLB, as they cannot verify that alternate paths are loop-free. In fact, all IGPs could be extended to support this feature, as specified in the above-mentioned RFC 5286; however, this has not (yet?) been implemented by any well-known vendor. Still, one protocol besides EIGRP supports UCLB. As you may have guessed, it is BGP. What makes BGP so special? Nothing else but the routing-loop detection feature implemented for eBGP sessions. When a BGP speaker receives a route from an external AS, it looks for its own AS number in the AS_PATH attribute and discards matching routes. This prevents routing loops at the AS scale. Additionally, it allows BGP to use alternative eBGP paths for unequal-cost load balancing. The proportions for the alternative paths are chosen based on a special BGP extended community attribute called DMZ Link Bandwidth. By default, this attribute value is copied from the bandwidth of the interface connecting to the eBGP peer. To configure UCLB with BGP, you need a configuration similar to the following:

router bgp 100
maximum-paths 3
bgp dmzlink-bw
neighbor 1.1.1.1 remote-as 200
neighbor 1.1.1.1 dmzlink-bw
neighbor 2.2.2.2 remote-as 200
neighbor 2.2.2.2 dmzlink-bw

In order for paths to be eligible for UCLB, they must have the same weight, local preference, AS_PATH length, Origin and MED. The local speaker may then utilize the paths in proportion to the values of their DMZ Link Bandwidth attributes. Keep in mind that BGP multipathing is disabled by default, until you enable it with the maximum-paths command. iBGP speakers may use the DMZ Link Bandwidth feature as well, for the paths injected into the local AS via eBGP. In order for this to work, the DMZ Link Bandwidth attribute must be propagated across the local AS (the send-community extended command) and the exit points for every path must have equal IGP costs in the iBGP speaker’s RIB. The data-plane implementation remains the same as for EIGRP multipathing, as CEF is the same underlying switching method.
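For the iBGP case, the attribute propagation mentioned above would be configured along these lines (a sketch; 3.3.3.3 is an illustrative iBGP peer address, and the rest of the multipath configuration is omitted):

router bgp 100
 neighbor 3.3.3.3 remote-as 100
 neighbor 3.3.3.3 send-community extended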

So how should I do that in production?

Surprisingly enough, the best way to implement UCLB in real life is by using MPLS Traffic Engineering. This solution allows for a high level of administrative control and is IGP-agnostic (strictly speaking, you need OSPF or IS-IS for dynamic tunnels, but you may use verbatim MPLS TE tunnels even with RIP or EIGRP). It is safe to use unequal-cost load balancing with MPLS TE tunnels because they connect two nodes using “virtual circuits”, and no transit node ever performs a routing lookup. Thus, you may create as many tunnels between the source and the destination as you want and assign the tunnel bandwidth values appropriately. After this, CEF switching will take care of the rest for you.
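A minimal sketch of one such tunnel head-end (assuming MPLS TE is already enabled in the IGP and on the transit links; the destination and bandwidth values are illustrative):

interface Tunnel0
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 150.1.2.2
 tunnel mpls traffic-eng bandwidth 1000
 tunnel mpls traffic-eng path-option 10 dynamic
 tunnel mpls traffic-eng autoroute announce

A second tunnel to the same destination with a different bandwidth value would then be load-balanced by CEF in proportion to the configured bandwidths.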

Further Reading

Load Balancing with CEF
Unequal Cost Load Sharing
CEF Load Sharing Details

Apr 24

GLBP, an acronym for Gateway Load Balancing Protocol, is a virtual gateway protocol similar to HSRP and VRRP. However, unlike its little brothers, GLBP is capable of using multiple physical gateways at the same time. As we know, a single HSRP or VRRP group represents one virtual gateway, with a single virtual IP address and a single virtual MAC address. Only one physical gateway in a standby/redundancy group is responsible for packet forwarding; the others remain inactive in the standby/backup state. If you have R1, R2 and R3 sharing the segment 174.X.123.0/24 with the physical IP addresses 174.X.123.1, 174.X.123.2 and 174.X.123.3, you may configure them to represent one single virtual gateway with the IP address 174.X.123.254. The physical gateway priority settings determine which physical gateway takes the role of the active packet forwarder. The hosts on the segment set their default gateway to 174.X.123.254, staying insulated from physical gateway failures.

GLBP takes this idea to a new level by allowing multiple physical gateways to participate in packet forwarding simultaneously. Considering the example above, imagine you need the hosts on the segment to fully utilize all existing physical gateways, yet still recover from a gateway failure. For instance, you may want 50% of outgoing packets to be sent to R1, 30% to R2 and 20% to R3. At the same time, you want to ensure that hosts using any of the gateways automatically switch to another one if their gateway fails. On top of that, all hosts in the segment should reference the virtual gateway using the same IP address, 174.X.123.254. This is a complicated task, which the GLBP protocol design addresses.

To begin with, recall that every host on the segment needs to resolve the virtual gateway IP address 174.X.123.254 to a MAC address using the ARP protocol. With HSRP or VRRP, the ARP response contains the single virtual MAC address, which is bound to the active physical gateway. GLBP, in contrast, responds with the different virtual MAC addresses of the physical gateways in the GLBP group. So the key idea of GLBP is accomplishing load balancing by responding to ARP requests with different virtual MAC addresses.

Here are the implementation details. One of the routers in a GLBP group is elected as the AVG – Active Virtual Gateway. There is only one active AVG in a group, and its task is to respond to ARP requests sent to the virtual gateway IP address (e.g. 174.X.123.254), replying with different virtual MAC addresses in the response packets. The AVG also implements the load-sharing algorithm, e.g. by sending the replies in proportion to the weights configured for the physical gateways. Aside from the AVG, the other logical component of the GLBP implementation is the AVF – Active Virtual Forwarder. Any physical gateway in a GLBP group may act as an AVF – in fact, all physical gateways usually are AVFs. Every AVF has a virtual MAC address assigned by the AVG and a weight value configured by the operator.
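The AVG’s weighted reply selection can be sketched as follows (a toy Python illustration of the idea, not the actual IOS algorithm; the vMAC-to-weight mapping matches the example below):

import random

# AVF virtual MACs and their configured weights (R1=50, R2=30, R3=20).
avfs = {"0007.b400.7b02": 50,   # owned by R1
        "0007.b400.7b01": 30,   # owned by R2
        "0007.b400.7b03": 20}   # owned by R3

def choose_vmac_for_arp_reply():
    # Reply with a virtual MAC chosen in proportion to the AVF weights,
    # so roughly 50% of hosts resolve the gateway to R1, 30% to R2, etc.
    vmacs, weights = zip(*avfs.items())
    return random.choices(vmacs, weights=weights)[0]

print(choose_vmac_for_arp_reply())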

Now let’s discuss redundancy – the primary goal of any virtual router protocol. There are two logical entities in a GLBP group, AVGs and AVFs, and each of them needs a backup scheme. Since there is just one AVG per GLBP group, the procedure is pretty simple: each candidate AVG has a priority value assigned; the highest-priority router becomes the active AVG, and the next by priority becomes the standby AVG. You may configure AVG preemption, so that a newly added router with the highest priority value becomes the AVG, preempting the old one.

What about AVF redundancy? First, we need to understand that AVFs are always “active” in the sense that they are always used by the load-balancing algorithm. (However, by setting an AVF’s weight value below the threshold, we may effectively take the AVF out of service. The weight value can be combined with object tracking to provide powerful traffic manipulation options.) Next, with respect to redundancy, all AVFs back each other up. Take any AVF: with respect to its virtual MAC address it is “Active”, and the remaining AVFs are in the “Listen” state. If the AVF fails, the other gateways detect the event via Hold timer expiration and immediately try to take over the failed AVF’s virtual MAC address. Among the competitors, the AVF with the highest weight value wins, and the remaining AVFs switch back to the “Listen” state. At this point, the “winner” starts accepting packets for two virtual MAC addresses: its own, and the one it has taken over from the failed AVF. At the same moment, two timers start: Redirect and Secondary Hold. The Redirect timer determines how long the AVG will continue to respond to ARP requests with the virtual MAC of the failed AVF. The Secondary Hold timer sets the amount of time the backup AVF will continue to accept packets for the virtual MAC address taken over from the failed AVF.
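Both timers show up in the output below (600 and 14400 seconds by default) and can be tuned per interface; a sketch, reusing the group number from the example:

interface FastEthernet0/0
 glbp 123 timers redirect 600 14400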

This is basically how GLBP works. Different load-balancing algorithms are supported – the default is round-robin, with additional options for weighted and source-MAC-based load balancing. The latter always responds with the same vMAC for a given source MAC address, thereby providing a sort of host-gateway “stickiness”. Now for a sample GLBP configuration, for the above-mentioned R1, R2 and R3:

!
! We set load-balancing to weighted only on R1
! So if R2 becomes the AVG, it will use the round-robin
! load-balancing technique
!
R1:
interface FastEthernet0/0
ip address 174.1.123.1 255.255.255.0
glbp 123 ip 174.1.123.254
glbp 123 preempt
glbp 123 weighting 50
glbp 123 load-balancing weighted
!
!
!
R2:
interface FastEthernet0/0
ip address 174.1.123.2 255.255.255.0
glbp 123 ip 174.1.123.254
glbp 123 priority 50
glbp 123 preempt
glbp 123 weighting 30
!
!
!
R3:
interface Ethernet0/0
ip address 174.1.123.3 255.255.255.0
glbp 123 ip 174.1.123.254
glbp 123 priority 25
glbp 123 preempt
glbp 123 weighting 20

Some show output:

Rack1R1#show glbp
FastEthernet0/0 - Group 123
State is Active
2 state changes, last state change 03:12:05
Virtual IP address is 174.1.123.254
Hello time 3 sec, hold time 10 sec
Next hello sent in 0.916 secs
Redirect time 600 sec, forwarder time-out 14400 sec
Preemption enabled, min delay 0 sec
Active is local
Standby is 174.1.123.2, priority 50 (expires in 8.936 sec) <-- Standby AVG
Priority 100 (default)
Weighting 50 (configured 50), thresholds: lower 1, upper 50 <--
<-- Should the weight drop below the lower threshold, the AVF is taken out of service
Load balancing: weighted
Group members:
ca00.0156.0000 (174.1.123.1) local <-- Hardware MACs
ca01.0156.0000 (174.1.123.2)
cc02.0156.0000 (174.1.123.3)
There are 3 forwarders (1 active)
Forwarder 1
State is Listen <-- We are backup (Listen) for this vMAC
MAC address is 0007.b400.7b01 (learnt) <-- Virtual MAC
Owner ID is ca01.0156.0000 <-- This is R2
Redirection enabled, 598.928 sec remaining (maximum 600 sec) <--
<-- ARP replies with this vMAC are being sent by AVG
Time to live: 14398.376 sec (maximum 14400 sec)
Preemption enabled, min delay 30 sec
Active is 174.1.123.2 (primary), weighting 30 (expires in 8.368 sec) <--
<-- The AVF reports its own IP as active
Arp replies sent: 1
Forwarder 2
State is Active <-- Active means it’s us
1 state change, last state change 03:12:45
MAC address is 0007.b400.7b02 (default)
Owner ID is ca00.0156.0000 <-- R1 MAC address
Redirection enabled
Preemption enabled, min delay 30 sec
Active is local, weighting 50
Arp replies sent: 1
Forwarder 3
State is Listen <-- We are backup (Listen) for this vMAC
MAC address is 0007.b400.7b03 (learnt)
Owner ID is cc02.0156.0000 <-- This is R3
Redirection enabled, 597.916 sec remaining (maximum 600 sec)
Time to live: 14397.916 sec (maximum 14400 sec)
Preemption enabled, min delay 30 sec
Active is 174.1.123.3 (primary), weighting 20 (expires in 7.916 sec)

Rack1R2#show glbp
FastEthernet0/0 - Group 123
State is Standby
4 state changes, last state change 03:16:56
Virtual IP address is 174.1.123.254
Hello time 3 sec, hold time 10 sec
Next hello sent in 0.236 secs
Redirect time 600 sec, forwarder time-out 14400 sec
Preemption enabled, min delay 0 sec
Active is 174.1.123.1, priority 100 (expires in 9.148 sec)
Standby is local <-- We are the standby AVG
Priority 50 (configured)
Weighting 30 (configured 30), thresholds: lower 1, upper 30
Load balancing: round-robin
Group members:
ca00.0156.0000 (174.1.123.1)
ca01.0156.0000 (174.1.123.2) local
cc02.0156.0000 (174.1.123.3)
There are 3 forwarders (1 active)
Forwarder 1
State is Active
1 state change, last state change 03:18:06
MAC address is 0007.b400.7b01 (default)
Owner ID is ca01.0156.0000 <-- This is R2
Preemption enabled, min delay 30 sec
Active is local, weighting 30
Forwarder 2
State is Listen
MAC address is 0007.b400.7b02 (learnt)
Owner ID is ca00.0156.0000
Time to live: 14398.644 sec (maximum 14400 sec)
Preemption enabled, min delay 30 sec
Active is 174.1.123.1 (primary), weighting 50 (expires in 8.636 sec)
Forwarder 3
State is Listen
MAC address is 0007.b400.7b03 (learnt)
Owner ID is cc02.0156.0000
Time to live: 14399.260 sec (maximum 14400 sec)
Preemption enabled, min delay 30 sec
Active is 174.1.123.3 (primary), weighting 20 (expires in 9.260 sec)

Now let’s check how ARP redirection works:

Rack1SW1#ping 174.1.123.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 174.1.123.254, timeout is 2 seconds:
..!!!
Success rate is 60 percent (3/5), round-trip min/avg/max = 8/12/16 ms

Rack1SW1#sh ip arp
Protocol Address Age (min) Hardware Addr Type Interface
Internet 174.1.123.254 0 0007.b400.7b01 ARPA Vlan1
Internet 174.1.123.7 - cc06.0156.0000 ARPA Vlan1
Internet 174.1.123.2 0 ca01.0156.0000 ARPA Vlan1
Rack1SW1#clear arp-cache

Rack1SW1#ping 174.1.123.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 174.1.123.254, timeout is 2 seconds:
.!!!!
Success rate is 80 percent (4/5), round-trip min/avg/max = 4/13/32 ms

Rack1SW1#sh ip arp
Protocol Address Age (min) Hardware Addr Type Interface
Internet 174.1.123.254 0 0007.b400.7b02 ARPA Vlan1
Internet 174.1.123.7 - cc06.0156.0000 ARPA Vlan1
Internet 174.1.123.2 0 ca01.0156.0000 ARPA Vlan1

Repeating the above steps a few more times:

Rack1SW1#sh ip arp
Protocol Address Age (min) Hardware Addr Type Interface
Internet 174.1.123.254 0 0007.b400.7b03 ARPA Vlan1
Internet 174.1.123.7 - cc06.0156.0000 ARPA Vlan1
Internet 174.1.123.2 0 ca01.0156.0000 ARPA Vlan1
Internet 174.1.123.3 0 cc02.0156.0000 ARPA Vlan1

To summarize: GLBP is a virtual gateway protocol with built-in load-balancing capabilities. Load balancing is based on manipulating ARP responses to the requests sent to the virtual gateway IP address. The AVG role is responsible for load balancing and responding to ARP requests, while the AVF role owns one or more virtual MACs and performs the actual packet forwarding. AVG redundancy is controlled by the GLBP priority value, and AVF redundancy is implemented using the AVF weight value and two additional timers.

Further reading:

GLBP Overview
