Oct 28

INE's long awaited CCIE Service Provider Advanced Technologies Class is now available! But first, congratulations to Tedhi Achdiana who just passed the CCIE Service Provider Lab Exam! Here's what Tedhi had to say about his preparation:

Finally I passed my CCIE Service Provider Lab Exam in Hong Kong on October 17, 2011. I used your CCIE Service Provider Printed Materials Bundle. This product gave me a deep understanding of how Service Provider technology works, so it doesn't matter that Cisco has changed the SP Blueprint. You just need to practice with IOS XR and find the similar commands on the IOS platform.

Thanks to INE, and keep up the good work!

Tedhi Achdiana
CCIE#30949 - Service Provider

The CCIE Service Provider Advanced Technologies Class covers the newest CCIE SP Version 3.0 Blueprint, including the addition of IOS XR hardware. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS. Understanding the topics covered in this class will ensure that students are ready to tackle the next step in their CCIE preparation, applying the technologies themselves with INE's CCIE Service Provider Lab Workbook, and then finally taking and passing the CCIE Service Provider Lab Exam!

Streaming access is available for All Access Pass subscribers for as low as $65/month! Download access can be purchased here for $299. AAP members can additionally upgrade to the download version for $149.

Sample videos from class can be found after the break:

 

The detailed outline of the class is as follows:

  • Introduction
  • Catalyst ME3400 Switching
  • Frame Relay / HDLC / PPP & PPPoE
  • IS-IS Overview / Level 1 & Level 2 Routing
  • IS-IS Network Types / Path Selection & Route Leaking
  • IS-IS Route Leaking on IOS XR / IOS XR Routing Policy Language (RPL)
  • IS-IS IPv6 Routing / IS-IS Multi Topology
  • MPLS Overview / LDP Overview
  • Basic MPLS Configuration
  • MPLS Tunnels
  • MPLS Layer 3 VPN (L3VPN) Overview / VRF Overview
  • VPNv4 BGP Overview / Route Distinguishers vs. Route Targets
  • Basic MPLS L3VPN Configuration
  • MPLS L3VPN Verification & Troubleshooting
  • VPNv4 Route Reflectors
  • BGP PE-CE Routing / BGP AS Override
  • RIP PE-CE Routing
  • EIGRP PE-CE Routing
  • OSPF PE-CE Routing / OSPF Domain IDs / Domain Tags & Sham Links
  • OSPF Multi VRF CE Routing
  • MPLS Central Services L3VPNs
  • IPv6 over MPLS with 6PE & 6VPE
  • Inter AS MPLS L3VPN Overview
  • Inter AS MPLS L3VPN Option A - Back-to-Back VRF Exchange Part 1
  • Inter AS MPLS L3VPN Option A - Back-to-Back VRF Exchange Part 2
  • Inter AS MPLS L3VPN Option B - ASBRs Peering VPNv4
  • Inter AS MPLS L3VPN Option C - ASBRs Peering BGP+Label Part 1
  • Inter AS MPLS L3VPN Option C - ASBRs Peering BGP+Label Part 2
  • Carrier Supporting Carrier (CSC) MPLS L3VPN
  • MPLS Layer 2 VPN (L2VPN) Overview
  • Ethernet over MPLS L2VPN AToM
  • PPP & Frame Relay over MPLS L2VPN AToM
  • MPLS L2VPN AToM Interworking
  • Virtual Private LAN Services (VPLS)
  • MPLS Traffic Engineering (TE) Overview
  • MPLS TE Configuration
  • MPLS TE on IOS XR / LDP over MPLS TE Tunnels
  • MPLS TE Fast Reroute (FRR) Link and Node Protection
  • Service Provider Multicast
  • Service Provider QoS

Additionally, completely new versions of INE's CCIE Service Provider Lab Workbook Volumes I & II are on their way, and should be released before the end of the year. Stay tuned for more information on the workbook and rack rental availability!

Oct 18

One of our most anticipated products of the year - INE's CCIE Service Provider v3.0 Advanced Technologies Class - is now complete!  The videos from class are in the final stages of post production and will be available for streaming and download access later this week.  Download access can be purchased here for $299.  Streaming access is available for All Access Pass subscribers for as low as $65/month!  AAP members can additionally upgrade to the download version for $149.

At roughly 40 hours, the CCIE SPv3 ATC covers the newly released CCIE Service Provider version 3 blueprint, which includes the addition of IOS XR hardware. This class includes both technology lectures and hands on configuration, verification, and troubleshooting on both regular IOS and IOS XR. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS.

Below you can see a sample video from the class, which covers IS-IS Route Leaking and its implementation on IOS XR with the Routing Policy Language (RPL).

Aug 16

Abstract

In this blog post we are going to review a number of MPLS scaling techniques. Theoretically, the main factors that limit MPLS network growth are:

  1. IGP Scaling. Route summarization, which is the core procedure for scaling all commonly used IGPs, does not work well with MPLS LSPs. We’ll discuss the reasons for this and see what solutions are available to deploy MPLS in the presence of IGP route summarization.
  2. Forwarding State Growth. Deploying MPLS TE may be challenging in large networks, as the number of tunnels grows as O(N^2), where N is the number of TE endpoints (typically the number of PE routers). While most networks are nowhere near the breaking point, we are still going to review techniques that allow MPLS-TE to scale to very large networks (tens of thousands of routers).
  3. Management Overhead. MPLS requires additional control plane components and is therefore more difficult to manage than classic IP networks. This becomes more pronounced as the network grows.

The blog post summarizes some recently developed approaches that address the first two of the above-mentioned issues. Before we begin, I would like to thank Daniel Ginsburg for introducing me to this topic back in 2007.

IGP Scalability and Route Summarization

IP networks were built around the concept of hierarchical addressing and routing, first introduced by Kleinrock [KLEIN]. Scaling for hierarchically addressed networks is achieved by using topologically aware addresses that allow for network clustering and routing-information hiding by summarizing contiguous address blocks. While other approaches to route scaling have been developed since, the hierarchical approach remains the prevalent one in modern networks. Modern IGPs used in SP networks are link-state (IS-IS and OSPF) and hence maintain network topology information in addition to network reachability information. Route summarization creates topological domain boundaries and condenses network reachability information, which has the following important effects on link-state routing protocols:

  1. Topological database size is decreased. This reduces the impact on router memory and CPU, as a smaller database requires less maintenance effort and consumes less memory.
  2. Convergence time within every area is improved as a result of faster SPF, smaller flooding scope and decreased FIB size.
  3. Flooding is reduced, since events in one routing domain do not propagate into another as a result of information hiding. This improves routing process stability.

The above positive effects were very important during the earlier days of the Internet, when hardware did not have enough power to easily run link-state routing protocols. Modern advances in router hardware allow link-state protocols to scale to single-area networks consisting of thousands of routers. Certain optimization procedures mentioned, for instance, in [OSPF-FAST] allow such large single-area networks to remain stable and converge on a sub-second time scale. Many ISP networks indeed use a single-area design for their IGPs, enjoying simplified operational procedures. However, scaling IGPs to tens of thousands of nodes will most likely require the use of routing areas while still allowing for end-to-end MPLS LSPs. Another trend that may call for support of route summarization in MPLS networks is the fact that many enterprises, which typically employ area-based designs, are starting their own MPLS deployments.

Before we go into the details of the effect route summarization has on MPLS LSPs, it is worth recalling what negative effects route summarization has in pure IP networks. Firstly, summarization hides network topology and detailed routing information and thus results in suboptimal routing, and even routing loops, in the presence of route redistribution between different protocols. This problem is especially visible at larger scale, e.g. for the Internet as a whole. The other serious problem is that summarization hides reachability information. For example, if a single host goes down, the summarized prefix encompassing this host’s IP address will remain the same and the failure event will not leak beyond the summarization domain. While this is a positive effect of summarization from the stability standpoint, it has a negative impact on inter-area network convergence. For example, it affects the BGP next-hop tracking process (see [BGP-NEXTHOP]).

MPLS LSPs and Route Summarization

An MPLS LSP is a unidirectional tunnel built on the switching of locally significant labels. There are multiple ways of constructing MPLS LSPs, but we are concerned with classic LDP signaling at the moment. In order for an LDP-signaled LSP to destination (FEC, forwarding equivalence class) X to be contiguous, every hop on the path to X should have forwarding state for this FEC. Essentially, the MPLS forwarding component treats X, the FEC, as an endpoint identifier. If two nodes cannot match information about X, they cannot “stitch” the LSP in a contiguous manner. In the case of MPLS deployment in IP networks, the FEC is typically the PE’s /32 IP address. An IP address has the overloaded functions of both location pointer and endpoint identifier. Summarizing the /32 addresses hides the endpoint identity and prevents LDP nodes from consistently mapping labels for a given FEC. Look at the diagram below: ABR2 summarizes the PE1-PE3 Loopback prefixes, and thus PE1 cannot construct an end-to-end LSP for 10.0.0.1/32, 10.0.0.2/32 or 10.0.0.3/32.

mpls-summarization-1

One obvious solution to this problem would be changing the underlying MPLS transport to IP-based tunnels, such as mGRE or L2TPv3. Such solutions are available for deployment with L3 and L2 VPNs (see [L3VPN-MGRE] for example) and work perfectly with route summarization. However, by relying on conventional IP routing, these solutions lose the powerful feature of MPLS Traffic Engineering. While some traffic engineering solutions are available for IP networks, they are not as flexible and feature-rich as MPLS-TE. Therefore, we are not going to look into IP-based tunneling in this publication.

Route Leaking

Route leaking is the most widely deployed technique to allow for end-to-end LSP construction in the presence of route summarization. This technique is based on the fact that LSPs typically need to be constructed only for the PE addresses. Provided that a separate subnet is selected for the PE Loopback interfaces, it is easy to apply fine-tuned control to the PE prefixes in the network. Referring to the figure below, all transit link prefixes (10.0.0.0/16) could be summarized (and even suppressed) and only the necessary PE prefixes (subnet 20.0.0.0/24 on the diagram below) will be leaked. It is also possible to fine-tune LDP to propagate only the prefixes from the selected subnet, reducing LDP signaling overhead.

mpls-summarization-2

Leaking /32s also allows for perfect next-hop reachability tracking, as a route for every PE router is individually present in all routing tables. Of course, such granularity significantly reduces the positive effect of a multi-area design and summarization. If the number of PE routers grows into the tens of thousands, network stability would be at risk. However, this approach is by far the most commonly used with multi-area designs nowadays.
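
For illustration only, here is a minimal IOS sketch of this idea on an ABR, using the hypothetical addressing from the diagrams (transit links in 10.0.0.0/16, PE Loopbacks in 20.0.0.0/24); the OSPF process number, area numbers and ACL are assumptions:

router ospf 1
 ! Summarize the transit-link space toward the other areas. The PE Loopback
 ! /32s live in 20.0.0.0/24, are not covered by the range, and keep leaking
 ! into other areas as individual host routes.
 area 1 range 10.0.0.0 255.255.0.0
!
! Optionally advertise LDP label bindings only for the PE Loopback subnet,
! reducing LDP signaling overhead.
no mpls ldp advertise-labels
mpls ldp advertise-labels for 1
access-list 1 permit 20.0.0.0 0.0.0.255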

Inter-Area RSVP-TE LSPs

MPLS transport is often deployed for the purpose of traffic engineering. Typically, a full mesh of MPLS TE LSPs is provisioned between the PE routers in the network to accomplish this goal (strategic Traffic Engineering). The MPLS TE LSPs are signaled by means of TE extensions to the RSVP protocol (RSVP-TE), which allows for source-routed LSP construction. A typical RSVP-TE LSP is explicitly routed by the source PE, where the explicit route is either manually specified or computed dynamically by the constrained SPF (cSPF) algorithm. Notice that it is impossible for a router within one area to run cSPF for another area, as the topology information is hidden (there are certain solutions to overcome this limitation, e.g. [PCE-ARCH], but they are out of the scope of this publication).

Based on the single-area limitation, it may seem that MPLS-TE is not applicable to multi-area designs. However, a number of extensions to RSVP signaling have been proposed to allow for inter-area LSP construction (see [RSVP-TE-INTERAREA]). In short, such extensions allow for explicitly programming loose inter-area hops (ABRs or ASBRs) and let every ABR expand the loose next hop (ERO, explicit route object, expansion). The headend router and every ABR run the cSPF algorithm for their own areas, such that every segment of the inter-area LSP is locally optimal.
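
As a rough illustration, a head-end PE running IOS could program such a loose-hop inter-area tunnel as shown below; all names and addresses are hypothetical (10.0.255.1 and 10.0.255.2 stand for the ABRs, 20.0.0.2 for the tail-end PE), and mpls traffic-eng tunnels must also be enabled globally and on the transit interfaces:

ip explicit-path name INTER-AREA-TO-PE2 enable
 ! Loose hops only: each ABR expands the path through its own area via cSPF
 next-address loose 10.0.255.1
 next-address loose 10.0.255.2
 next-address loose 20.0.0.2
!
interface Tunnel0
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 20.0.0.2
 tunnel mpls traffic-eng path-option 10 explicit name INTER-AREA-TO-PE2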

mpls-summarization-3

This approach does not necessarily result in a globally optimal path, i.e. the resulting LSP may differ from the shortest IGP path between the two endpoints. This is a serious problem, and it requires considerable modification to RSVP-TE signaling to be resolved. See [PCE-ARCH] for an approach to making inter-area MPLS LSPs globally optimal. Next, inter-area TE involves additional management overhead, as the ABR loose next hops need to be programmed manually, possibly covering backup paths via alternate ABRs. The final issue is less obvious, but may have a significant impact in large networks.

Every MPLS-TE constructed LSP is point-to-point (P2P) by nature. This means that for N endpoints there is a total of O(N^2) LSPs connecting them, which results in a rapidly growing number of forwarding states (e.g. CEF table entries) in the core routers, not to mention the control-plane overhead. This issue becomes serious in very large networks, as has been thoroughly analyzed in [FARREL-MPLS-TE-SCALE-1] and [FARREL-MPLS-TE-SCALE-2]. It is interesting to compare the MPLS-TE scaling properties to those of LDP. The former constructs P2P LSPs; the latter builds Multipoint-to-Point (MP2P) LSPs that merge toward the tail-end router (see [MINEI-MPLS-SCALING]). This is a direct result of the signaling behavior: RSVP is end-to-end, while LDP signaling is local between every pair of routers. As a result, the number of forwarding states with LDP grows as O(N), compared to O(N^2) with MPLS-TE. Look at the diagram below, which illustrates LSPs constructed using RSVP-TE and LDP.

mpls-summarization-4

One solution to the state growth problem would be shrinking the full mesh of PE-to-PE LSPs, e.g. pushing the MPLS-TE endpoints from the PEs deeper into the network core, say up to the aggregation layer, and then deploying LDP on the PEs and over the MPLS-TE tunnels. This would reduce the size of the tunnel full mesh significantly, but it prevents the use of peripheral connections for MPLS-TE. In other words, it reduces the effectiveness of using MPLS-TE for network resource optimization.

mpls-summarization-5

Besides losing full MPLS-TE functionality, the use of unmodified LDP means that route summarization could not be deployed in such a scenario. Based on these limitations, we are not going to discuss this approach to MPLS-TE scaling in this publication. Notice that replacing LDP with MPLS-TE in the non-core sectors of the network does not solve the global traffic engineering problem, though it allows for the construction of locally optimal LSPs at every level. However, similar to the "LDP-in-the-edge" approach, this solution reduces the overall effectiveness of MPLS-TE in the network. There are other ways to improve the MPLS-TE scaling properties, which we are going to discuss further.

Hierarchical RSVP-TE LSPs

As mentioned previously, hierarchies are the key to IP routing scalability. Based on this, it is reasonable to assume that hierarchical LSPs could be used to solve the MPLS-TE scaling problems. Hierarchical LSPs have been in the MPLS standards almost since their inception, and their definition was finalized in RFC 4206. The idea behind hierarchical LSPs is the use of layered MPLS-TE tunnels. The first layer of MPLS-TE tunnels is used to establish forwarding adjacencies for the link-state IGP. These forwarding adjacencies are then flooded in link-state advertisements and added into the Traffic Engineering Database (TED), though not into the LSDB. In fact, the first mesh looks exactly like a set of IGP logical connections, but it gets advertised only into the TED. It is important that link-state information is not flooded across this mesh, hence the name “forwarding adjacency”. This first layer of MPLS LSPs typically overlays the network core.

mpls-summarization-6

The second level of MPLS-TE tunnels is constructed over the first-level mesh that was added to the TED, using cSPF or manual programming. From the standpoint of the second-level mesh, the physical core topology is hidden and replaced with the full mesh of RSVP-TE signaled “links”. Effectively, the second-level tunnels are nested within the first-level mesh and are never visible to the core network. The second-level mesh normally spans edge to edge and connects the PEs. At the data plane, hierarchical LSPs are realized by means of label stacking.

mpls-summarization-7

Not every instance of IOS code supports the classic RFC 4206 FAs. However, it is possible to add an mGRE tunnel spanning the P-routers that form the first-level mesh and use it to route the PE-PE MPLS TE LSPs (second level). This solution does not require running an additional set of IGP adjacencies over the mGRE tunnel and therefore has acceptable scalability properties. The mGRE tunnel could be overlaid over a classic, non-FA RSVP-TE tunnel mesh used for traffic engineering between the P-routers. This solution creates additional overhead for the hierarchical tunnel mesh, but allows for implementing hierarchical LSPs in IOS code that does not support the RFC 4206 FAs.

Lastly, it is worth pointing out that Cisco puts a different meaning into the term Forwarding Adjacency, as implemented in IOS software. Instead of advertising the MPLS-TE tunnels into the TED, Cisco advertises them into the LSDB and uses them for shortest-path construction. These virtual connections are not used for LSA/LSP flooding though, just like the “classic” FAs. Such a technique does not allow for hierarchical LSPs, as a second-level mesh cannot be signaled over an FA defined in this manner, but it does allow for global visibility of MPLS TE tunnels within a single IGP area, compared to Auto-Route, which only supports local visibility.
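
For reference, here is a hypothetical IOS snippet for one tunnel of a first-level P-router mesh; addresses and metrics are assumptions, and an FA tunnel must be configured in both directions between the two endpoints:

interface Tunnel1
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 10.0.255.2
 tunnel mpls traffic-eng path-option 10 dynamic
 ! Cisco-style FA: advertise the tunnel as a link into the IS-IS LSDB
 tunnel mpls traffic-eng forwarding-adjacency
 isis metric 15 level-2
 ! Alternatively, "tunnel mpls traffic-eng autoroute announce" would only
 ! influence the SPF of this local head-end router.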

Do Hierarchical RSVP-TE LSPs Solve the Scaling Problem?

It may look like using hierarchical LSPs along with route summarization solves the MPLS scaling problems at once. The network core has to deal with the first level of MPLS-TE LSPs only, and it perfectly allows for route summarization, as illustrated in the diagram below:

mpls-summarization-8

However, a more detailed analysis performed in [FARREL-MPLS-TE-SCALE-2] shows that in popular ISP network topologies deploying multi-level LSP hierarchies does not result in significant benefits. Firstly, there are still “congestion points” remaining at the level where the 1st and 2nd layer meshes are joined. This layer has to struggle with a fast-growing number of LSP states, as it has to support both levels of MPLS TE meshes. Furthermore, the management burden associated with deploying multiple MPLS-TE meshes along with inter-area MPLS TE tunnels is significant, and it seriously impacts network growth. The same arguments apply to the “hybrid” solution that uses an mGRE tunnel in the core, as every P-router needs to have a CEF entry for every other P-router connected to the mGRE tunnel, not to mention the underlying P-P RSVP-TE mesh adding even more forwarding states in the P-routers.

Multipoint-to-Point RSVP-TE LSPs

What is the key issue that makes MPLS-TE hard to scale? The root cause is the point-to-point nature of MPLS-TE LSPs, which results in rapid forwarding state proliferation in transit nodes. An alternative to using LSP hierarchies is the use of Multipoint-to-Point RSVP-TE LSPs. Similar to LDP, it is possible to merge the LSPs that traverse the same egress link and terminate on the same endpoint. With the existing protocol design, RSVP-TE cannot do that, but certain protocol extensions proposed in [YASUKAWA-MP2P-LSP] allow for automatic RSVP-TE LSP merging. Intuitively it is clear that the number of forwarding states with MP2P LSPs grows as O(N), where N is the number of PE routers, compared to O(N^2) in classic MPLS-TE. Effectively, MP2P RSVP-TE signaled LSPs have the same scaling properties as LDP-signaled LSPs, with the added benefits of MPLS-TE functionality and inter-area signaling. Based on these scaling properties, the use of MP2P inter-area LSPs seems to be a promising direction toward scaling MPLS networks.

BGP-Based Hierarchical LSPs

As we have seen, the default mode of operation for MPLS-TE does not offer enough scaling even with LSP hierarchies. It is worth asking whether it is possible to create hierarchical LSPs using signaling other than RSVP-TE. As you remember, BGP extensions could be used to transport MPLS labels. This suggests creating nested LSPs by overlaying a BGP mesh over the IGP areas. Here is an illustration of this concept:

mpls-summarization-9

In this sample scenario, there are three IGP areas, with the ABRs summarizing their area address ranges and therefore hiding the PE /32 prefixes. Inside every area, LDP or RSVP-TE could be used for constructing intra-area LSPs, for example LSPs from the PEs to the ABRs. At the same time, all PEs establish BGP sessions with their nearest ABRs, and the ABRs connect in a full iBGP mesh, treating the PEs as route-reflector clients. This allows the PEs to propagate their Loopback /32 prefixes via BGP. The iBGP peering should be done using another set of Loopback interfaces (call them IGP-routed) that are used to build the transport LSPs inside every area.

mpls-summarization-10

The only routers that will see the PE loopback prefixes (BGP-routed) are the other PEs and the ABRs. The next step is configuring the ABRs that act as route reflectors for the PE routers to change the BGP next hop to self and to activate MPLS label propagation over all iBGP peering sessions. The net result is an overlay label distribution process. Every PE would use two labels in the stack to get to another PE’s Loopback (BGP-propagated) by means of recursive BGP next-hop resolution. The topmost label (LDP or RSVP-TE) is used to steer the packet to the nearest ABR, using the transport LSP built toward the IGP-routed Loopback interface. The bottom label (BGP-propagated) identifies the PE prefix within the context of a given ABR. Every ABR will pop the incoming LDP/RSVP-TE label, then swap the PE label to the label understood by the next ABR/PE (as signaled via BGP) and then push a new LDP label that starts a new LSP to reach that next ABR/PE. Effectively, this implements a two-level hierarchical LSP end-to-end between any pair of PEs. This behavior is a result of BGP’s ability to propagate label information and the recursive next-hop resolution process.
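
A rough IOS sketch of the ABR side of this design is shown below; the AS number and addresses are assumptions. The key pieces are the route-reflector client configuration, next-hop-self, and send-label (BGP+label, RFC 3107) toward the PE:

router bgp 100
 neighbor 20.0.0.1 remote-as 100
 neighbor 20.0.0.1 update-source Loopback0
 !
 address-family ipv4
  neighbor 20.0.0.1 activate
  neighbor 20.0.0.1 route-reflector-client
  ! Rewrite the next hop to the ABR itself and exchange an MPLS label with
  ! every IPv4 prefix. Depending on the IOS release, reflected iBGP routes
  ! may require "next-hop-self all" instead.
  neighbor 20.0.0.1 next-hop-self
  neighbor 20.0.0.1 send-label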

How well does this approach scale? Firstly, using BGP for prefix distribution ensures that we may advertise a truly large number of PE prefixes without any serious problems (though DFZ system operators may disagree). At first sight, routing convergence may seem to be a problem, as the loss of any PE router would result in the iBGP session timing out based on BGP keepalives. However, if BGP next-hop tracking (see [BGP-NEXTHOP]) is used within every area, the ABRs will be able to detect the loss of a PE at the pace of IGP convergence. Link failures within an area will also be handled by the IGP or, possibly, by intra-area protection mechanisms, such as MPLS/IP Fast Reroute. Now for the drawbacks of BGP-signaled hierarchical LSPs:

  1. Configuration and management overhead. In the popular BGP-free core design, P-routers (which typically include the ABRs) do not run BGP. Adding an extra mesh of BGP peering sessions requires configuring all PEs and, specifically, the ABRs, which involves non-trivial effort. This slows the initial deployment process and complicates further operations.
  2. ABR failure protection. Hierarchical LSPs constructed using BGP consist of multiple disconnected LDP/RSVP-TE signaled segments, e.g. PE-ABR1, ABR1-ABR2. Within the current set of RSVP-TE FRR features, it is not possible to protect the LSP endpoint nodes, due to the local significance of the exposed label. There is work in progress to implement endpoint-node FRR protection, but this feature is not yet available. This might be a problem, as it makes the network core vulnerable to ABR failures.
  3. The amount of forwarding state increases in the PE and (additionally) ABR nodes. However, unlike with MPLS TE LSPs, this growth is proportional to the number of PE routers, which is approximately the same scaling property we would have with MP2P MPLS-TE LSPs. Therefore, the growth of forwarding state in the ABRs should not be a big problem, especially since no nodes other than the PEs/ABRs are affected.

To summarize, it is possible to deploy hierarchical LSPs and get LDP/single-area TE working with multi-area IGPs without any updates to the existing protocols (LDP, BGP, RSVP-TE). The main drawbacks are the excessive management overhead and the lack of MPLS FRR protection for the ABRs. However, this is the only approach that does not require any changes to the running software, as all functionality is implemented using existing protocol features.

LDP Extensions to work with Route Summarization

In order to make native LDP (RFC 3036) work with IGP route summarization, it is possible to extend the protocol in some way. To date, there are two main approaches: one of them utilizes prefix leaking via LDP and the other implements hierarchical LSPs using LDP.

LDP Prefix Leaking (Interarea LDP)

A very simple extension to LDP allows it to work with route summarization. Per the LDP RFC, prior to installing a label mapping for prefix X, the local router needs to ensure there is an exact match for X in the RIB. RFC 5283 suggests changing this verification procedure to a longest match: if there is a prefix Z in the RIB such that X is a subnet of Z, then the label mapping is kept. It is important to notice that the LFIB is not aggregated in any manner – all label mappings received via LDP are maintained and propagated further. Things are different at the RIB level, however: prefixes are allowed to be summarized and routing protocol operations are simplified. No end-to-end LSPs are broken, because label mappings for the specific prefixes are maintained along the path.

mpls-summarization-11

What are the drawbacks of this approach? Obviously, LFIB size growth (forwarding state) is the first one. It is possible to argue that maintaining the LFIB is less burdensome than maintaining the IGP databases, so this could be acceptable. However, it is well known that IGP convergence is seriously affected by FIB size, not just by the IGP protocol data structures, as updating a large FIB takes considerable time. Based on this, the LDP “leaking” approach does not solve all scalability issues. On the other hand, keeping detailed information in the LFIB allows for end-to-end connectivity tracking, thanks to LDP ordered label propagation. If one area loses a prefix, LDP will signal the loss of the label mapping, even though no specific information ever leaks into the IGP. This is the flip side of having detailed information at the forwarding-plane level. The other problem that could be pointed out is LDP signaling overhead. However, since LDP is an “on-demand”, “distance-vector” protocol, it does not pose as many problems as, say, link-state IGP flooding.

Aggregated FEC Approach

This approach has been suggested by G. Swallow of Cisco Systems – see [SWALLOW-AGG-FEC]. It requires modifications to LDP and slight changes to the forwarding-plane behavior. Here is how it works. When an ABR aggregates prefixes {X1…Xn} into a new summary prefix X, it generates a label for X (the aggregate FEC) and propagates it to the other areas, creating an LSP for the aggregate FEC that terminates at the ABR. The LDP mappings are propagated using an “aggregate FEC” type to signal special processing for packets matching this prefix. The LSP constructed for such a FEC has PHP (penultimate hop popping) disabled, for a reason we’ll explain later. All that other routers see in their RIB/FIB is the summary prefix X and the corresponding LFIB element (aggregate FEC/label). In addition to propagating the route/label for X, the same ABR also applies a special hash function to the IP addresses {X1…Xn} (the specific prefixes) and generates local labels based on the result of this function. These new algorithmic labels are stored under the context of the “aggregate” label generated for prefix X; that is, these labels should only be interpreted in association with the “aggregate” label. The algorithmic labels are further stitched with the labels the ABR learns via LDP from the source area for the prefixes {X1…Xn}.

The last piece of the puzzle is how a PE creates the label stack for a specific prefix Xi. When a PE attempts to encapsulate a packet destined to Xi at the ip2mpls edge, it looks up the LFIB for an exact match. If no exact match is found, but there is a matching aggregate FEC X, the PE applies the same hash function that the ABR used previously on Xi to create the algorithmic label for Xi. The PE then stacks this “algorithmic” label under the label for the aggregate FEC X and sends the packet with two labels – the topmost for the aggregate X and the bottom one for Xi. The packet will arrive at the ABR that originated the summary prefix X, with the topmost label NOT removed by the PHP mechanism (as mentioned previously). This allows the ABR to correctly determine the context for the bottom label. The topmost label is removed, and the de-aggregation label for Xi is used to look up the real label (stitched with the de-aggregation label for Xi) to be used for further switching.

mpls-summarization-12

This method is backward compatible with the classic LDP implementation and will interoperate with other LDP deployments. Notice that there is no control-plane correlation between the ABRs and the PEs, as there is in the case of BGP-signaled hierarchical LSPs. Instead, synchronization is achieved by using the same globally known hash function that produces the de-aggregation labels. This method reduces the control-plane overhead associated with hierarchical LSP construction, but it has one drawback – there is no end-to-end reachability signaling, as there was in the RFC 5283 approach. That is, if an area loses the prefix for a PE, there is no way to signal this via LDP, as only the aggregated FEC is being propagated. The presentation [SWALLOW-SCALING-MPLS] suggests a generic solution to this problem by means of an IGP protocol extension. In addition to flooding a summary prefix, the ABR is responsible for flooding a bit vector that corresponds to every possible /32 under the summary. For example, for a /16 prefix there should be a 2^16-bit vector, where a bit equal to one means the corresponding /32 prefix is reachable and zero means it is unreachable. This scheme allows for certain optimizations, such as using Bloom filters (see [BLOOM-FILTER]) for information compression. This approach is known as Summarized Route Detailed Reachability (SRDR). The SRDR approach solves the problem of hidden reachability information at the cost of modifications to IGP signaling. An alternative is using tuned BGP keepalives; this, however, puts high stress on the router’s control plane. A better alternative is data-plane reachability discovery, such as multihop BFD ([BFD-MULTIHOP]). The last two approaches do not require any modifications to the IGP protocols and therefore interoperate better with existing networks.

Hierarchical LDP

This approach, an extension to RFC 5283, has been proposed by Kireeti Kompella of Juniper, but it has never been officially documented and presented for peer review. There were only a few presentations made at various conferences, such as [KOMPELLA-HLDP]. No IETF draft is available, so we can only guess at the protocol details. In a nutshell, it seems that the idea is running an overlay mesh of LDP sessions between the PEs/ABRs, similar to the BGP approach, and using stacked FEC advertisements. The topmost FEC in such an advertisement corresponds to the summarized prefix advertised by the ABR. This FEC is flooded across all areas and local mappings are used to construct an LSP terminating at the ABR. So far, this looks similar to the aggregated FEC approach. However, instead of using algorithmic label generation, the PE and ABR directly exchange their bindings for the specific prefixes, using a new form of FEC announcement – hierarchical FEC stacking. The ABR advertises the aggregated FEC along with the aggregated label and the nested specific labels. The PE then knows what labels the ABR is expecting for the specific prefixes, and may construct a two-label stack consisting of the “aggregate” label and the “specific” label learned via the directed LDP session. The specific prefixes are accepted by virtue of the RFC 5283 extension, which allows detailed FEC information if there is a summary prefix in the RIB matching the specific prefix.

mpls-summarization-13

The hierarchical LDP approach maintains a control-plane connection between the PE and the ABRs. Most likely, this means manual configuration of directed LDP sessions, very much like the BGP approach. The benefits are control-plane reachability signaling and better extensibility compared to the aggregated FEC approach. Another benefit is that the BGP mesh is left intact and only the LDP configuration has to be modified. However, it seems that further work on hierarchical LDP extensions has been abandoned, as there are no publications or discussions on this subject.

Hierarchical LSP Commonalities

So far we have reviewed several approaches to constructing hierarchical LSPs: the first uses RSVP-TE forwarding adjacencies, the second uses BGP label propagation, and the last two use LDP extensions. All of these approaches result in transport LSP segments terminating at the ABRs. For example, in the RSVP-TE approach there are LSPs connecting the ABRs, and in the BGP approach there are LSP segments connecting the PEs to the ABRs. As we mentioned previously, the current set of MPLS FRR features does not protect LSP endpoints. As a direct result, using hierarchical LSPs decreases the effectiveness of MPLS FRR protection. There is work in progress on extending FRR protection to LSP endpoints, but there are no complete working solutions at the moment.

Summary

We have reviewed various aspects of scaling MPLS technology. The two main ones are scaling the IGP by using route summarization/areas and getting MPLS to work with summarization. A number of approaches are available to solve the latter problem, and practically all of them (with the exception of inter-area MPLS TE) are based on hierarchical LSP construction. Some approaches, such as BGP-signaled hierarchical LSPs, are ready to be deployed using existing protocol functionality, at the expense of added management overhead. Others require modifications to control-plane or forwarding-plane behavior.

There was high interest in MPLS scaling problems about 3-4 years ago (2006-2007), but the topic appears to have been largely abandoned nowadays. There is no active work in progress on the LDP extensions mentioned above; however, the Multipoint-to-Point RSVP-TE LSP draft [YASUKAWA-MP2P-LSP] seems to be making progress through the IETF. Based on this, it looks like using inter-area RSVP-TE with MP2P extensions is going to be the main solution to scaling the MPLS networks of the future.

Further Reading

RFC 3036: LDP Specification
[FARREL-MPLS-TE-SCALE-1] MPLS-TE Does Not Scale
[RSVP-TE-INTERAREA] Requirements for Inter-Area MPLS TE Traffic Engineering
Inter-Area Traffic Engineering
[PCE-ARCH] A Path-Computation Element Based Architecture
[YASUKAWA-MP2P-LSP] Multipoint-to-Point RSVP-TE Signaled LSPs
[MINEI-MPLS-SCALING] Scaling Considerations in MPLS Networks
[SWALLOW-SCALING-MPLS] George Swallow: Scaling MPLS, MPLS2007 (no public link available)
[KOMPELLA-HLDP] Techniques for Scaling LDP, Kireeti Kompella, MPLS2007 (no public link available)
[SWALLOW-AGG-FEC] Network Scaling with Aggregated IP LSPs
LDP Extensions for Inter-Area LSPs
[KLEIN] Hierarchical Routing for Large Networks
[OSPF-FAST] OSPF Fast Convergence
[L3VPN-MGRE] Layer 3 VPNs over Multipoint GRE
[FARREL-MPLS-TE-SCALE-2] MPLS TE Scaling Analysis
RFC 4206: LSP Hierarchy
[BGP-NEXTHOP] BGP Next-Hop Tracking
[BLOOM-FILTER] Bloom Filter
[BFD-MULTIHOP] Multihop Extension to BFD

May 01

The problem of unequal-cost load-balancing

Have you ever wondered why, among all IGPs, only EIGRP supports unequal-cost load balancing (UCLB)? Is there any special reason why only EIGRP supports this feature? Apparently, there is. Let’s start with the basic idea of equal-cost load balancing (ECLB). This one is simple: if there are multiple paths to the same destination with equal costs, it is reasonable to use them all and share traffic equally among them. The alternate paths are guaranteed to be loop-free, as they are “symmetric” with respect to cost to the primary path. If there are multiple paths of unequal cost, the same idea cannot be applied as easily. For example, consider the figure below:

uclb1

Suppose there is a destination behind R2 that R1 routes to. There are two paths to reach R2 from R1: one directly to R2, and another via R3. The cost of the primary path is 10 and the cost of the secondary path is 120. Intuitively, it would make sense to send traffic across both paths in proportion 12:1 to make the most use of the network. However, if R3 implements the same idea of unequal-cost load balancing, we’ve got a problem. The primary path from R3 to R2 is via R1. Thus, some of the packets that R1 sends to R2 via R3 will be routed back to R1. This is the core problem of UCLB: some secondary paths may result in routing loops, as a node on the path may prefer to route back toward the origin.

EIGRP’s solution to the problem

As you remember, EIGRP only uses an alternate path if it satisfies the feasibility condition: AD < FD, where AD is the “advertised distance” (the peer’s metric to reach the destination) and FD is the feasible distance (the local best metric to the destination). The condition ensures that the path chosen by the peer will never lead into a loop, as our primary path is not a subset of it (otherwise, AD would be higher than or equal to FD). In our case, if we look at R1, FD=10 to reach R2 and R3’s AD=100; thus the alternate path may happen to lead into a loop. It has been proven by Garcia-Luna-Aceves that the feasibility condition is enough to always select loop-free alternate paths; the interested reader may look at the original DUAL paper for the proof. In addition, if you are still thinking that EIGRP is history, I recommend reading RFC 5286 to find the same loop-free condition used for the Basic IP Fast Reroute procedure (there are alternative approaches to IP FRR though). Since the feasibility condition used by EIGRP selects only loop-free alternatives, it is safe to enable UCLB on EIGRP routers: provided that EIGRP is the only routing protocol, the alternate paths will never result in a routing loop.

Configuring EIGRP for Unequal-Cost Load Balancing

A few words about configuring UCLB in EIGRP. You achieve this by setting the “variance” value to something greater than 1. The EIGRP routing process will install into the local routing table all feasible paths with metric < best_metric*variance. Here, metric is the full metric of the alternate path and best_metric is the metric of the primary path. By default, the variance value is 1, meaning that only equal-cost paths are used. Let’s configure a quick test-bed scenario. We will use EIGRP as the routing protocol running on all routers, and set the metric weights so that K1=0, K2=0, K3=1, K4=K5=0. This means that only the link delay is used for EIGRP metric calculations. Such a configuration makes the EIGRP metric purely additive and easy to work with.
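
A minimal sketch of what the R1 configuration for this test bed might look like; the AS number, network statements and variance value are assumptions tied to the topology below:

router eigrp 100
 network 150.1.0.0 0.0.255.255
 network 155.1.0.0 0.0.255.255
 ! Use only delay in the composite metric: K1=0, K2=0, K3=1, K4=K5=0
 metric weights 0 0 0 1 0 0
 ! Install feasible-successor paths with metric < 4 x best metric
 variance 4
 no auto-summary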

uclb2

The metric to reach R2’s Loopback0 from R1 via the directly connected link is FD = 256*(10+10) = 5120. R3 is a feasible successor for the same destination, as its AD = 5*256 = 1280 < FD. Thus, R1 may use it for unequal-cost load balancing. We should set the variance to satisfy the requirement (50+10+5)*256 < 256*(10+10)*variance, where (50+10+5)*256 = 16640 is the alternate path’s full metric. From this equation, variance > 65/20 = 3.25, and thus we need to set the value to at least 4 in order to utilize the alternate path. If we look at R1’s routing table we will see the following:

Rack1R1#show ip route 150.1.2.2
Routing entry for 150.1.2.0/24
Known via "eigrp 100", distance 90, metric 5120, type internal
Redistributing via eigrp 100
Last update from 155.1.13.3 on Serial0/0.13, 00:00:04 ago
Routing Descriptor Blocks:
155.1.13.3, from 155.1.13.3, 00:00:04 ago, via Serial0/0.13
Route metric is 16640, traffic share count is 37
Total delay is 650 microseconds, minimum bandwidth is 128 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 2
* 155.1.12.2, from 155.1.12.2, 00:00:04 ago, via Serial0/0.12
Route metric is 5120, traffic share count is 120
Total delay is 200 microseconds, minimum bandwidth is 1544 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 1

Notice the traffic share counters – essentially they represent the ratio of traffic shared between the two paths. You may notice that 120:37 is almost the same as 16640:5120 – that is, the amount of traffic sent across a particular path is inversely proportional to the path’s metric. This is the default EIGRP behavior, set by the EIGRP process command traffic-share balanced. You may change this behavior using the command traffic-share min across-interfaces, which instructs EIGRP to use only the minimum-cost path (or paths, if there are several). Other feasible paths will be kept in the routing table but not used until the primary path fails. The benefit is slightly faster convergence, as there is no need to insert the alternate path into the RIB upon failure, as compared to scenarios where UCLB is disabled. This is how the routing table entry looks when you enable minimal-metric path routing:

Rack1R1#sh ip route 150.1.2.2
Routing entry for 150.1.2.0/24
Known via "eigrp 100", distance 90, metric 130560, type internal
Redistributing via eigrp 100
Last update from 155.1.13.3 on Serial0/0.13, 00:00:14 ago
Routing Descriptor Blocks:
155.1.13.3, from 155.1.13.3, 00:00:14 ago, via Serial0/0.13
Route metric is 142080, traffic share count is 0
Total delay is 5550 microseconds, minimum bandwidth is 128 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 2
* 155.1.12.2, from 155.1.12.2, 00:00:14 ago, via Serial0/0.12
Route metric is 130560, traffic share count is 1
Total delay is 5100 microseconds, minimum bandwidth is 1544 Kbit
Reliability 255/255, minimum MTU 1500 bytes
Loading 1/255, Hops 1

How CEF implements Unequal-Cost Load-Balancing

And now, a few words about the data-plane implementation of UCLB. It is not documented on Cisco’s documentation site; I first read about it in the Cisco Press book “Traffic Engineering with MPLS” by Eric Osborne and Ajay Simha. The method does not seem to have changed since then.

As you know, the most prevalent switching method in Cisco IOS is CEF (we don’t consider distributed platforms here). Layer 3 load balancing is similar to the load balancing used with Ethernet port channels. The router takes the ingress packet, hashes the source and destination IP addresses (and maybe the L4 ports) and normalizes the result to be, say, in the range [1-16] (the result is often called the “hash bucket”). The router then uses the hash bucket as the selector for one of the alternate paths. If the hash function distributes IP src/dst combinations evenly across the result space, then all paths will be utilized equally. In order to implement unequal balancing, an additional level of indirection is needed. Suppose you have 3 alternate paths, and you want to balance them in the proportion 1:2:3. You need to fill the 16 hash buckets with path selectors so as to maintain that proportion. Solving the simple equation 1*x+2*x+3*x=16, we find that x=2.67, or 3 rounded up. Thus, to maintain the proportion 1:2:3 we need to associate 3 hash buckets with the first path, 6 hash buckets with the second path and the remaining 7 hash buckets with the third path. This is not exactly the desired proportion, but it is close. Here is how the “hash bucket” to “path ID” mapping may look to implement the above proportion of 1:2:3.

Hash | Path ID
--------------
[01] -> path 1
[02] -> path 2
[03] -> path 3
[04] -> path 1
[05] -> path 2
[06] -> path 3
[07] -> path 1
[08] -> path 2
[09] -> path 3
[10] -> path 2
[11] -> path 3
[12] -> path 2
[13] -> path 3
[14] -> path 2
[15] -> path 3
[16] -> path 3

Once again, provided that the hash function distributes inputs evenly among all buckets, the paths will be used in the desired proportions. As you can see, the way that control plane divides traffic flows among different paths may be severely affected by the data plane implementation.

CEF may load-balance with per-packet or per-flow granularity. In the first case, each successive packet of the same flow (src/dst IP and maybe src/dst ports) is routed across a different path. While this may look like a good idea for better utilizing all paths, it usually results in packets arriving out of order. The result is poor application performance, since many L4 protocols behave better when packets arrive in order. Thus, in the real world, the preferred and default load-balancing mode is per-flow (often called per-destination in CEF terminology). To change the CEF load-balancing mode on an interface, use the interface-level command ip load-sharing per-packet|per-destination. Notice that it only affects packets ingressing the configured interface.

Let’s look at the CEF data structures that reveal information about the load-balancing implementation. The following command displays all the alternative adjacencies used to route packets toward the prefix in question. This is the first part of the output:

Rack1R1#show ip cef 150.1.2.2 internal
150.1.2.0/24, version 77, epoch 0, per-destination sharing
0 packets, 0 bytes
via 155.1.13.3, Serial0/0.13, 0 dependencies
traffic share 37
next hop 155.1.13.3, Serial0/0.13
valid adjacency
via 155.1.12.2, Serial0/0.12, 0 dependencies
traffic share 120
next hop 155.1.12.2, Serial0/0.12
valid adjacency

The next block of information is more interesting. It reveals the 16 hash buckets loaded with the path indexes (after the “Load distribution” string). Here zero means the first path and 1 means the second, alternate path. As we can see, 4 buckets are loaded with path “zero” and 12 buckets are loaded with path “one”. The resulting share 4:12 = 1:3 is very close to 37:120, so in this case the data-plane implementation did not change the load-balancing logic too much. The rest of the output details every hash bucket index and the associated output interface, as well as the number of packets switched using each particular hash bucket. Keep in mind that only CEF-switched packets are reflected in these statistics, so if you use the ping command from the router itself, you will not see any counters incrementing.

[...Output Continues...]
0 packets, 0 bytes switched through the prefix
tmstats: external 0 packets, 0 bytes
internal 0 packets, 0 bytes
Load distribution: 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 (refcount 1)

Hash OK Interface Address Packets
1 Y Serial0/0.13 point2point 0
2 Y Serial0/0.12 point2point 0
3 Y Serial0/0.13 point2point 0
4 Y Serial0/0.12 point2point 0
5 Y Serial0/0.13 point2point 0
6 Y Serial0/0.12 point2point 0
7 Y Serial0/0.13 point2point 0
8 Y Serial0/0.12 point2point 0
9 Y Serial0/0.12 point2point 0
10 Y Serial0/0.12 point2point 0
11 Y Serial0/0.12 point2point 0
12 Y Serial0/0.12 point2point 0
13 Y Serial0/0.12 point2point 0
14 Y Serial0/0.12 point2point 0
15 Y Serial0/0.12 point2point 0
16 Y Serial0/0.12 point2point 0
refcount 6

Any other protocols supporting Unequal Cost Load-Balancing?

As we already know, neither OSPF, IS-IS nor RIP can support UCLB, as they cannot test whether the alternate paths are loop-free. In fact, all IGP protocols could be extended to support this feature, as specified in the above-mentioned RFC 5286. However, this is not (yet?) implemented by any well-known vendor. Still, one protocol supports UCLB in addition to EIGRP. As you may have guessed, this is BGP. What makes BGP so special? Nothing else but the routing-loop detection feature implemented for eBGP sessions. When a BGP speaker receives a route from an external AS, it looks for its own AS number in the AS_PATH attribute and discards matching routes. This prevents routing loops at the AS scale. Additionally, this allows BGP to use alternative eBGP paths for unequal-cost load balancing. The proportions for the alternative paths are chosen based on a special BGP extended community attribute called DMZ Link Bandwidth. By default, this attribute value is copied from the bandwidth of the interface connecting to the eBGP peer. To configure UCLB with BGP you need a configuration similar to the following:

router bgp 100
maximum-paths 3
bgp dmzlink-bw
neighbor 1.1.1.1 remote-as 200
neighbor 1.1.1.1 dmzlink-bw
neighbor 2.2.2.2 remote-as 200
neighbor 2.2.2.2 dmzlink-bw

In order for paths to be eligible for UCLB, they must have the same weight, local preference, AS_PATH length, origin and MED. The local speaker may then utilize the paths in proportion to the value of the DMZ Link Bandwidth attribute. Keep in mind that BGP multipathing is disabled by default, until you enable it with the maximum-paths command. iBGP speakers may use the DMZ Link Bandwidth feature as well, for paths injected into the local AS via eBGP. In order for this to work, the DMZ Link Bandwidth attribute must be propagated across the local AS (via the send-community extended command) and the exit points for every path must have equal IGP costs in the iBGP speaker’s RIB. The data-plane implementation remains the same as for EIGRP multipathing, as CEF is the same underlying switching method.
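
A hedged sketch of the iBGP side follows; the AS number and addresses are assumptions. The border router propagates the extended community, while the internal speaker enables iBGP multipath and bandwidth-based sharing:

! On the border router injecting the eBGP-learned paths:
router bgp 100
 bgp dmzlink-bw
 neighbor 10.0.0.3 remote-as 100
 neighbor 10.0.0.3 update-source Loopback0
 ! Carry the DMZ Link Bandwidth extended community into the local AS
 neighbor 10.0.0.3 send-community extended
!
! On the internal speaker that load-balances across the exit points:
router bgp 100
 bgp dmzlink-bw
 maximum-paths ibgp 3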

So how should I do that in production?

Surprisingly enough, the best way to implement UCLB in real life is by using MPLS Traffic Engineering. This solution allows for a high level of administrative control and is IGP-agnostic (well, they say you need OSPF or IS-IS, but you may use verbatim MPLS TE tunnels even with RIP or EIGRP). It is safe to use unequal-cost load balancing with MPLS TE tunnels because they connect two nodes using “virtual circuits” and no transit node ever performs a routing lookup. Thus, you may create as many tunnels between the source and the destination as you want, and assign the tunnel bandwidth values appropriately. After this, CEF switching will take care of the rest for you.
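
As a hypothetical illustration, two parallel TE tunnels toward the same tail end with bandwidths in the desired ratio; interface names, addresses and values are assumptions, and in practice you would typically pin the tunnels to diverse explicit paths:

interface Tunnel1
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 150.1.2.2
 ! Reserve 1200 kbps along this path
 tunnel mpls traffic-eng bandwidth 1200
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng path-option 10 dynamic
!
interface Tunnel2
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 150.1.2.2
 ! Reserve 400 kbps along an alternate path
 tunnel mpls traffic-eng bandwidth 400
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng path-option 10 dynamic

With both tunnels announced into the IGP, CEF should share flows between Tunnel1 and Tunnel2 roughly in proportion to the configured tunnel bandwidths.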

Further Reading

Load Balancing with CEF
Unequal Cost Load Sharing
CEF Load Sharing Details
