Jul
09

Tomorrow (2014-07-09) at 08:00 PDT (15:00 UTC) I will be starting our next major section of the CCIE Routing & Switching Version 5 Advanced Technologies Class - MPLS. This class is free for everyone to attend at http://live.ine.com. Simply sign up for a free INE member account here, or sign up for a free trial of our All Access Pass, which includes streaming access to our entire video library, including all of the new CCIE RSv5 ATC videos released so far.

For me personally, when I was first learning MPLS the biggest hurdle was sorting through all the buzzwords and acronyms. For the life of me, no matter how many books I read, I couldn't figure out why MPLS would even be needed in the first place. Tomorrow's class will cut to the chase; think of it as MPLS 101 for CCIE candidates.

Specifically, I will start with the main MPLS use case: tunneling BGP over the core. Through live examples on the Cisco IOS CLI I will show why MPLS is the preferred transport method for Service Providers that offer both public and private IPv4 & IPv6 transit services, then expand into further use cases such as Layer 3 VPN and Layer 2 VPN services, and talk about where MPLS is applicable even in the Enterprise. As always, questions are welcomed and encouraged during class - the more you put into class, the more you ultimately get out of it.

I hope to see you live during class tomorrow at http://live.ine.com!

Aug
17

Edit: For those of you that want to take a look first-hand at these packets, the Wireshark PCAP files referenced in this post can be found here

One of the hottest topics in networking today is Data Center Virtualized Workload Mobility (VWM). For those of you that have been hiding under a rock for the past few years, workload mobility basically means the ability to dynamically and seamlessly reassign hardware resources to virtualized machines, often between physically disparate locations, while keeping this transparent to the end users. This is often accomplished through VMware vMotion, which allows for live migration of virtual machines between sites, or through the similar features in Microsoft's Hyper-V and Citrix's Xen hypervisors.

One of the typical requirements of workload mobility is that the hardware resources used must be on the same layer 2 network segment. E.g. the VMware host machines must be in the same IP subnet and VLAN in order to allow for live migration of their VMs. The big design challenge then becomes: how do we allow for live migrations of VMs between Data Centers that are not in the same layer 2 network? One solution to this problem that Cisco has devised is a relatively new technology called Overlay Transport Virtualization (OTV).

As a side result of preparing for INE’s upcoming CCIE Data Center Nexus Bootcamp I’ve had the privilege (or punishment depending on how you look at it ;) ) of delving deep into the OTV implementation on Nexus 7000. My goal was to find out exactly what was going on behind the scenes with OTV. The problem I ran into though was that none of the external Cisco documentation, design guides, white papers, Cisco Live presentations, etc. really contained any of this information. The only thing that is out there on OTV is mainly marketing info, i.e. buzzword bingo, or very basic config snippets on how to implement OTV. In this blog post I’m going to discuss the details of my findings about how OTV actually works, with the most astonishing of these results being that OTV is in fact, a fancy GRE tunnel.

From a high level overview, OTV is basically a layer 2 over layer 3 tunneling protocol. In essence OTV accomplishes the same goal as other L2 tunneling protocols such as L2TPv3, Any Transport over MPLS (AToM), or Virtual Private LAN Services (VPLS). For OTV specifically this goal is to take Ethernet frames from an end station, like a virtual machine, encapsulate them inside IPv4, transport them over the Data Center Interconnect (DCI) network, decapsulate them on the other side, and out pops your original Ethernet frame.

For this specific application OTV has some inherent benefits over other designs such as MPLS L2VPN with AToM or VPLS. The first of these is that OTV is transport agnostic. As long as there is IPv4 connectivity between Data Centers, OTV can be used. AToM and VPLS both require that the transport network be MPLS aware, which can limit your selection of Service Providers for the DCI. OTV, on the other hand, can technically be used over any regular Internet connectivity.

Another advantage of OTV is that provisioning is simple. AToM and VPLS tunnels are Provider Edge (PE) side protocols, while OTV is a Customer Edge (CE) side protocol. This means that for AToM and VPLS the Service Provider has to pre-provision the pseudowires. Even though VPLS supports enhancements like BGP auto-discovery, provisioning MPLS L2VPN still requires administrative overhead on the provider side. OTV is much simpler in this case, because as we'll see shortly, the configuration is just a few commands that are controlled by the CE router, not the PE router.

The next thing we have to consider with OTV is how exactly this layer 2 tunneling is accomplished. After all, we could just configure static GRE tunnels on our DCI edge routers and bridge Ethernet over them, but this is probably not the best design option for either control plane or data plane scalability.
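
To make that comparison concrete, a bare-bones "roll your own" version of this on classic IOS would look something like the sketch below. The interface names and addresses are placeholders, and bridging over a GRE tunnel this way is platform and release dependent, so treat it as an illustration of the idea rather than a recommended config:

DCI-Edge-1:
bridge 1 protocol ieee
!
interface Tunnel0
 tunnel source 192.0.2.1
 tunnel destination 192.0.2.2
 bridge-group 1
!
interface GigabitEthernet0/1
 description Layer 2 link into the local Data Center VLAN
 bridge-group 1

This gets Ethernet frames across an IP-only core, but with none of the control plane intelligence that OTV layers on top, which is exactly what the rest of this post digs into.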

The way that OTV implements the control plane portion of its layer 2 tunnel is what is sometimes described as “MAC in IP Routing”. Specifically OTV uses Intermediate System to Intermediate System (IS-IS) to advertise the VLAN and MAC address information of the end hosts over the Data Center Interconnect. For those of you that are familiar with IS-IS, immediately this should sound suspect. After all, IS-IS isn’t an IP protocol, it’s part of the legacy OSI stack. This means that IS-IS is directly encapsulated over layer 2, unlike OSPF or EIGRP which ride over IP at layer 3. How then can IS-IS be encapsulated over the DCI network that is using IPv4 for transport? The answer? A fancy GRE tunnel.

The next portion that is significant about OTV’s operation is how it actually sends packets in the data plane. Assuming for a moment that the control plane “just works”, and the DCI edge devices learn about all the MAC addresses and VLAN assignments of the end hosts, how do we actually encapsulate layer 2 Ethernet frames inside of IP to send over the DCI? What if there is multicast traffic that is running over the layer 2 network? Also what if there are multiple sites reachable over the DCI? How does it know specifically where to send the traffic? The answer? A fancy GRE tunnel.

Next I want to introduce the specific topology that will be used for us to decode the details of how OTV is working behind the scenes. Within the individual Data Center sites, the layer 2 configuration and physical wiring is not relevant to our discussion of OTV. Assume simply that the end hosts have layer 2 connectivity to the edge routers. Additionally assume that the edge routers have IPv4 connectivity to each other over the DCI network. In this specific case I chose to use RIPv2 for routing over the DCI (yes, you read that correctly), simply so I could filter it from my packet capture output, and easily differentiate between the routing control plane in the DCI transport network vs. the routing control plane that was tunneled inside OTV between the Data Center sites.

What we are mainly concerned with in this topology is as follows:

  • OTV Edge Devices N7K1-3 and N7K2-7
    • These are the devices that actually encapsulate the Ethernet frames from the end hosts into the OTV tunnel. I.e. this is where the OTV config goes.
  • DCI Transport Device N7K2-8
    • This device represents the IPv4 transit cloud between the DC sites. From this device’s perspective it sees only the tunnel encapsulated traffic, and does not know the details about the hosts inside the individual DC sites. Additionally this is where packet capture is occurring so we can view the actual payload of the OTV tunnel traffic.
  • End Hosts R2, R3, Server 1, and Server 3
    • These are the end devices used to generate data plane traffic that ultimately flows over the OTV tunnel.

Now let’s look at the specific configuration on the edge routers that is required to form the OTV tunnel.

N7K1-3:
vlan 172
name OTV_EXTEND_VLAN
!
vlan 999
name OTV_SITE_VLAN
!
spanning-tree vlan 172 priority 4096
!
otv site-vlan 999
otv site-identifier 0x101
!
interface Overlay1
otv join-interface Ethernet1/23
otv control-group 224.100.100.100
otv data-group 232.1.2.0/24
otv extend-vlan 172
no shutdown
!
interface Ethernet1/23
ip address 150.1.38.3/24
ip igmp version 3
ip router rip 1
no shutdown

N7K2-7:
vlan 172
name OTV_EXTEND_VLAN
!
vlan 999
name OTV_SITE_VLAN
!
spanning-tree vlan 172 priority 4096
!
otv site-vlan 999
otv site-identifier 0x102
!
interface Overlay1
otv join-interface port-channel78
otv control-group 224.100.100.100
otv data-group 232.1.2.0/24
otv extend-vlan 172
no shutdown
!
interface port-channel78
ip address 150.1.78.7/24
ip igmp version 3
ip router rip 1

As you can see the configuration for OTV really isn’t that involved. The specific portions of the configuration that are relevant are as follows:

  • Extend VLANs
    • These are the layer 2 segments that will actually get tunneled over OTV. Basically these are the VLANs that your virtual machines reside on and that you want to do VM mobility between. In our case this is VLAN 172, which maps to the IP subnet 172.16.0.0/24.
  • Site VLAN
    • Used to synchronize the Authoritative Edge Device (AED) role within an OTV site. This matters when you have more than one edge router per site. OTV only allows a specific Extend VLAN to be tunneled by one edge router at a time for the purpose of loop prevention. Essentially the Site VLAN lets the edge routers talk to each other and figure out which one is active/standby on a per-VLAN basis for the OTV tunnel. The Site VLAN should not be included in the Extend VLAN list.
  • Site Identifier
    • Should be unique per DC site. If you have more than one edge router per site, they must agree on the Site Identifier, as it’s used in the AED election.
  • Overlay Interface
    • The logical OTV tunnel interface.
  • OTV Join Interface
    • The physical link or port-channel that you use to route upstream towards the DCI.
  • OTV Control Group
    • Multicast address used to discover the remote sites in the control plane.
  • OTV Data Group
    • Used when you’re tunneling multicast traffic over OTV in the data plane.
  • IGMP Version 3
    • Needed to send (S,G) IGMP Report messages towards the DCI network on the Join Interface.
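
Before even looking at end-host traffic, the state of the overlay itself can be sanity checked from the edge devices. I'm not pasting the output here, but these are the commands to reach for (exact fields vary a bit between NX-OS releases):

N7K1-3# show otv overlay 1
N7K1-3# show otv adjacency
N7K1-3# show otv vlan

The first confirms the join interface and control group, the second confirms that the remote edge device was discovered, and the third shows which Extend VLANs this box is the AED for.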

At this point that’s basically all that’s involved in the implementation of OTV. It “just works”, because all the behind the scenes stuff is hidden from us from a configuration point of view. A quick test of this from the end hosts shows us that:

R2#ping 255.255.255.255
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 255.255.255.255, timeout is 2 seconds:

Reply to request 0 from 172.16.0.3, 4 ms
Reply to request 1 from 172.16.0.3, 1 ms
Reply to request 2 from 172.16.0.3, 1 ms
Reply to request 3 from 172.16.0.3, 1 ms
Reply to request 4 from 172.16.0.3, 1 ms

R2#traceroute 172.16.0.3
Type escape sequence to abort.
Tracing the route to 172.16.0.3
VRF info: (vrf in name/id, vrf out name/id)
1 172.16.0.3 0 msec * 0 msec

The fact that R3 responds to R2's packets going to the all hosts broadcast address (255.255.255.255) implies that they are in the same broadcast domain. How specifically is it working though? That's what took a lot of further investigation.

To simplify the packet level verification a little further, I changed the MAC addresses of the four end devices that are used to generate the actual data plane traffic. The Device, IP address, and MAC address assignments are as follows:

The first thing I wanted to verify in detail was what the data plane looked like, and specifically what type of tunnel encapsulation was used. With a little searching I found that OTV is currently on the IETF standards track in draft format. As of writing, the newest draft is draft-hasmit-otv-03. Section 3.1 Encapsulation states:

3.  Data Plane

3.1. Encapsulation

The overlay encapsulation format is a Layer-2 ethernet frame
encapsulated in UDP inside of IPv4 or IPv6.

The format of OTV UDP IPv4 encapsulation is as follows:

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live | Protocol = 17 |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Source-site OTV Edge Device IP Address             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Destination-site OTV Edge Device (or multicast) Address    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Source Port = xxxx       |        Dest Port = 8472       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           UDP length          |        UDP Checksum = 0       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R|                  Overlay ID                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Instance ID                  |    Reserved   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Frame in Ethernet or 802.1Q Format               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A quick PING sweep of packet lengths with the Don't Fragment bit set allowed me to find the encapsulation overhead, which turns out to be 42 bytes (a 1458-byte IP packet plus 42 bytes of overhead equals the 1500-byte MTU of the transit links), as seen below:

R3#ping 172.16.0.2 size 1459 df-bit 

Type escape sequence to abort.
Sending 5, 1459-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:
Packet sent with the DF bit set
.....
Success rate is 0 percent (0/5)

R3#ping 172.16.0.2 size 1458 df-bit

Type escape sequence to abort.
Sending 5, 1458-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms
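
Since the OTV edge devices do not fragment the encapsulated traffic, the practical fix is to raise the MTU in the DCI transit path by at least those 42 bytes rather than shrinking the MTU of every end host. A minimal sketch on the join interface (1542 is simply 1500 + 42; in practice you would usually just enable jumbo frames across the transit):

N7K1-3(config)# interface Ethernet1/23
N7K1-3(config-if)# mtu 1542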

None of my testing, however, could verify what the encapsulation header actually was. The draft says that the transport is supposed to be UDP port 8472, but none of my logging produced results showing that any UDP traffic was even in the transit network (save for my RIPv2 routing ;) ). After much frustration, I finally broke out the sniffer and took some packet samples. The first capture below shows a normal ICMP ping between R2 and R3.

MPLS? GRE? Where did those come from? That’s right, OTV is in fact a fancy GRE tunnel. More specifically it is an Ethernet over MPLS over GRE tunnel. My poor little PINGs between R2 and R3 are in fact encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet (IoIoEoMPLSoGREoIP for short). Let’s take a closer look at the encapsulation headers now:

In the detailed header output we see our transport Ethernet header, which in a real deployment can be anything depending on what the transport of your DCI is (Ethernet, POS, ATM, Avian Carrier, etc.) Next we have the IP OTV tunnel header, which surprised me in a few aspects. First, all documentation I read said that without the use of an OTV Adjacency Server, unicast can’t be used for transport. This is true... up to a point. Multicast it turns out is only used to establish the control plane, and to tunnel multicast over multicast in the data plane. Regular unicast traffic over OTV will be encapsulated as unicast, as seen in this capture.

The next header after IP is GRE. In other words, OTV is basically the same as configuring a static GRE tunnel between the edge routers and then bridging over it, along with some enhancements (hence fancy GRE). The OTV enhancements (which we'll talk about shortly) are the reason why you wouldn't just configure GRE statically. Nevertheless this surprised me, because even in hindsight the only mention of OTV using GRE I found was here. What's really strange about this is that Cisco's OTV implementation doesn't follow what the standards track draft says, which is UDP, even though the authors of the OTV draft are Cisco engineers. Go figure.

The next header, MPLS, makes sense since the prior encapsulation is already GRE. Ethernet over MPLS over GRE is already well defined and used in deployment, so there's no real reason to reinvent the wheel here. I haven't verified this in detail yet, but I'm assuming that the MPLS label value would be used in cases where the edge router has multiple Overlay interfaces, in which case the label in the data plane would quickly tell it which Overlay interface the incoming packet is destined for. This logic is similar to MPLS L3VPN, where the bottom of stack VPN label tells a PE router which CE facing link the packet is ultimately destined for. I'm going to do some more testing later with a larger, more complex topology to actually verify this, as all data plane traffic over this tunnel currently shares the same MPLS label value.

Next we see the original Ethernet header, which is sourced from R2's MAC address 0000.0000.0002 and destined to R3's MAC address 0000.0000.0003. Finally we have the original IP header and the ICMP payload. The key with OTV is that this inner Ethernet header and its payload remain untouched, so from the end host perspective it looks like all the devices are just on the same LAN.

Now that it was apparent that OTV was just a fancy GRE tunnel, the IS-IS piece fell into place. Since IS-IS runs directly over layer 2 (e.g. Ethernet), and OTV is an Ethernet over MPLS over GRE tunnel, IS-IS can be encapsulated as IS-IS over Ethernet over MPLS over GRE (phew!). To test this, I changed the MAC address of one of the end hosts and looked at the IS-IS LSP generation of the edge devices. After all, the goal of the OTV control plane is to use IS-IS to advertise the MAC addresses of the end hosts in a particular site, as well as the VLAN that they reside in. The configuration steps and packet capture result are as follows:

R3#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#int gig0/0
R3(config-if)#mac-address 1234.5678.9abc
R3(config-if)#
*Aug 17 22:17:10.883: %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to reset
*Aug 17 22:17:11.883: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to down
*Aug 17 22:17:16.247: %LINK-3-UPDOWN: Interface GigabitEthernet0/0, changed state to up
*Aug 17 22:17:17.247: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to up

The first thing I noticed about the IS-IS encoding over OTV is that it uses IPv4 Multicast. This makes sense, because if you have 3 or more OTV sites you don’t want to have to send your IS-IS LSPs as replicated Unicast. As long as all of the AEDs on all sites have joined the control group (224.100.100.100 in this case), the LSP replication should be fine. This multicast forwarding can also be verified in the DCI transport network core in this case as follows:

N7K2-8#show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 224.100.100.100/32), uptime: 20:59:33, ip pim igmp
Incoming interface: Null, RPF nbr: 0.0.0.0
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, igmp
Ethernet1/29, uptime: 20:58:53, igmp

(150.1.38.3/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib
Incoming interface: Ethernet1/29, RPF nbr: 150.1.38.3
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, mrib
Ethernet1/29, uptime: 20:58:53, mrib, (RPF)

(150.1.78.7/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib
Incoming interface: port-channel78, RPF nbr: 150.1.78.7
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, mrib, (RPF)
Ethernet1/29, uptime: 20:58:53, mrib

(*, 232.0.0.0/8), uptime: 21:00:05, pim ip
Incoming interface: Null, RPF nbr: 0.0.0.0
Outgoing interface list: (count: 0)

Note that N7K1-3 (150.1.38.3) and N7K2-7 (150.1.78.7) have both joined the (*, 224.100.100.100). A very important point about this is that the control group for OTV is an Any Source Multicast (ASM) group, not a Source Specific Multicast (SSM) group. This implies that your DCI transit network must run PIM Sparse Mode and have a Rendezvous Point (RP) configured in order to build the shared tree (RPT) for the OTV control group used by the AEDs. You technically could use Bidir but you really wouldn't want to for this particular application. This kind of surprised me how they chose to implement it, because there are already more efficient ways of doing source discovery for SSM, for example how Multicast MPLS L3VPN uses the BGP AFI/SAFI Multicast MDT to advertise the (S,G) pairs of the PE routers. I suppose the advantage of doing OTV this way though is that it makes the OTV config very straightforward from an implementation point of view on the AEDs, and you don’t need an extra control plane protocol like BGP to exchange the (S,G) pairs before you actually join the tree. The alternative to this of course is to use the Adjacency Server and just skip using multicast all together. This however will result in unicast replication in the core, which can be bad, mkay?
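
To put some configuration behind that, the DCI core device in this topology needs something along the lines of the following before the edge devices can even find each other. This is a sketch rather than my exact config, and the RP address is a placeholder:

N7K2-8:
feature pim
ip pim rp-address 150.1.8.8 group-list 224.0.0.0/4
ip pim ssm range 232.0.0.0/8
!
interface Ethernet1/29
 ip pim sparse-mode
!
interface port-channel78
 ip pim sparse-mode

If you would rather skip multicast in the transit entirely, the unicast-only Adjacency Server alternative mentioned above looks roughly like this on the overlay interfaces (supported in later NX-OS releases), with the control-group and data-group commands removed:

N7K1-3:
interface Overlay1
 otv adjacency-server unicast-only

N7K2-7:
interface Overlay1
 otv use-adjacency-server 150.1.38.3 unicast-only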

Also for added fun in the IS-IS control plane the actual MAC address routing table can be verified as follows:

N7K2-7# show otv route

OTV Unicast MAC Routing Table For Overlay1

VLAN MAC-Address Metric Uptime Owner Next-hop(s)
---- -------------- ------ -------- --------- -----------
172 0000.0000.0002 1 01:22:06 site port-channel27
172 0000.0000.0003 42 01:20:51 overlay N7K1-3
172 0000.0000.000a 42 01:18:11 overlay N7K1-3
172 0000.0000.001e 1 01:20:36 site port-channel27
172 1234.5678.9abc 42 00:19:09 overlay N7K1-3

N7K2-7# show otv isis database detail | no-more
OTV-IS-IS Process: default LSP database VPN: Overlay1

OTV-IS-IS Level-1 Link State Database
LSPID Seq Number Checksum Lifetime A/P/O/T
N7K2-7.00-00 * 0x000000A3 0xA36A 893 0/0/0/1
Instance : 0x000000A3
Area Address : 00
NLPID : 0xCC 0x8E
Hostname : N7K2-7 Length : 6
Extended IS : N7K1-3.01 Metric : 40
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.001e
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.0002
Digest Offset : 0
N7K1-3.00-00 0x00000099 0xBAA4 1198 0/0/0/1
Instance : 0x00000094
Area Address : 00
NLPID : 0xCC 0x8E
Hostname : N7K1-3 Length : 6
Extended IS : N7K1-3.01 Metric : 40
Vlan : 172 : Metric : 1
MAC Address : 1234.5678.9abc
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.000a
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.0003
Digest Offset : 0
N7K1-3.01-00 0x00000090 0xCBAB 718 0/0/0/1
Instance : 0x0000008E
Extended IS : N7K2-7.00 Metric : 0
Extended IS : N7K1-3.00 Metric : 0
Digest Offset : 0

So at this point we've seen that our ICMP PING was actually ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet, and our routing protocol was IS-IS over Ethernet over MPLS over GRE over IP over Ethernet :/ What about multicast in the data plane though? At this point verification of multicast over the DCI core is pretty straightforward, since we can just enable a routing protocol that uses multicast for its updates, like EIGRP, and look at the result. This can be seen below:

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#router eigrp 1
R2(config-router)#no auto-summary
R2(config-router)#network 0.0.0.0
R2(config-router)#end
R2#

R3#config t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#router eigrp 1
R3(config-router)#no auto-summary
R3(config-router)#network 0.0.0.0
R3(config-router)#end
R3#
*Aug 17 22:39:43.419: %SYS-5-CONFIG_I: Configured from console by console
*Aug 17 22:39:43.423: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.0.2 (GigabitEthernet0/0) is up: new adjacency

R3#show ip eigrp neighbors
IP-EIGRP neighbors for process 1
H Address Interface Hold Uptime SRTT RTO Q Seq
(sec) (ms) Cnt Num
0 172.16.0.2 Gi0/0 11 00:00:53 1 200 0 1

Our EIGRP adjacency came up, so multicast obviously is being tunneled over OTV. Let’s see the packet capture result:

We can see EIGRP being tunneled inside the OTV payload, but what's with the outer header? Why is EIGRP using the ASM 224.100.100.100 group instead of the SSM 232.1.2.0/24 data group? My first guess was that link local multicast (i.e. 224.0.0.0/24) gets encapsulated as control plane traffic instead of as data plane traffic. This would make sense, because you would want control plane protocols like OSPF, EIGRP, PIM, etc. tunneled to all OTV sites, not just the ones that joined the SSM feeds. To test if this was the case, the only change I needed to make was to have one router join a non-link-local multicast group, and have the other router send ICMP pings. Since they're effectively in the same LAN segment, no PIM routing is needed in the DC sites, just basic IGMP Snooping, which is enabled in NX-OS by default. The config on the IOS routers is as follows:

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#ip multicast-routing
R2(config)#int gig0/0
R2(config-if)#ip igmp join-group 224.10.20.30
R2(config-if)#end
R2#

R3#ping 224.10.20.30 repeat 1000 size 1458 df-bit

Type escape sequence to abort.
Sending 1000, 1458-byte ICMP Echos to 224.10.20.30, timeout is 2 seconds:
Packet sent with the DF bit set

Reply to request 0 from 172.16.0.2, 1 ms
Reply to request 1 from 172.16.0.2, 1 ms
Reply to request 2 from 172.16.0.2, 1 ms

The packet capture result was as follows:

This was more like what I expected. Now the multicast data plane packet was getting encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP multicast over Ethernet, with the OTV group as the outer destination. The payload wasn't decoded, as I think even Wireshark was dumbfounded by this string of encapsulations.

In summary we can make the following observations about OTV:

  • OTV encapsulation has 42 bytes of overhead that consists of:
    • New Outer Ethernet II Header - 14 Bytes
    • New Outer IP Header - 20 Bytes
    • GRE Header - 4 Bytes
    • MPLS Header - 4 Bytes
  • OTV uses both Unicast and Multicast transport
    • ASM Multicast is used to build the control plane for OTV IS-IS, ARP, IGMP, EIGRP, etc.
    • Unicast is used for normal unicast data plane transmission between sites
    • SSM Multicast is used for normal multicast data plane transmission between sites
    • Optionally ASM & SSM can be replaced with the Adjacency Server
  • GRE is the ultimate band-aid of networking

Now the next time someone is throwing around fancy buzzwords about OTV, DCI, VWM, etc. you can say “oh, you mean that fancy GRE tunnel”? ;)

I'll be continuing this series in the coming days and weeks with other Data Center and specifically CCIE Data Center related technologies. If you have a request for a specific topic or protocol that you'd like to see the behind-the-scenes details of, drop me a line at bmcgahan@ine.com.

Happy Labbing!

May
10

Update: Congrats to Mark, our winner of 100 rack rental tokens for the first correct answer: XR2 is missing a BGP router-id. In regular IOS, the router-id is chosen from the highest IP address on a Loopback interface. If there is no Loopback interface, the highest IP address of all up/up interfaces is chosen. In IOS XR, however, the router-id will not be chosen from a physical link; it is only chosen from the highest Loopback interface address, or set manually with the router-id command. Per the Cisco documentation:

BGP Router Identifier

For BGP sessions between neighbors to be established, BGP must be assigned a router ID. The router ID is sent to BGP peers in the OPEN message when a BGP session is established.

BGP attempts to obtain a router ID in the following ways (in order of preference):

  • By means of the address configured using the bgp router-id command in router configuration mode.
  • By using the highest IPv4 address on a loopback interface in the system if the router is booted with saved loopback address configuration.
  • By using the primary IPv4 address of the first loopback address that gets configured if there are not any in the saved configuration.

If none of these methods for obtaining a router ID succeeds, BGP does not have a router ID and cannot establish any peering sessions with BGP neighbors. In such an instance, an error message is entered in the system log, and the show bgp summary command displays a router ID of 0.0.0.0.

After BGP has obtained a router ID, it continues to use it even if a better router ID becomes available. This usage avoids unnecessary flapping for all BGP sessions. However, if the router ID currently in use becomes invalid (because the interface goes down or its configuration is changed), BGP selects a new router ID (using the rules described) and all established peering sessions are reset.

Since XR2 in this case does not have a Loopback configured, the BGP process cannot initialize. The kicker with this problem is that the documentation states that when it occurs "an error message is entered in the system log"; however, in this case no Syslog message about the error was generated. At least this is the last time this problem will bite me ;)
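
For completeness, the fix is either to set the router-id manually or to give XR2 a Loopback with an IPv4 address. A sketch of both (the addresses are just example values):

RP/0/3/CPU0:XR2(config)# router bgp 65001
RP/0/3/CPU0:XR2(config-bgp)# bgp router-id 10.20.20.20
RP/0/3/CPU0:XR2(config-bgp)# commit

or

RP/0/3/CPU0:XR2(config)# interface Loopback0
RP/0/3/CPU0:XR2(config-if)# ipv4 address 20.20.20.20 255.255.255.255
RP/0/3/CPU0:XR2(config-if)# commit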

 


Today while working on additional content for our CCIE Service Provider Version 3.0 Lab Workbook I had one of those epic brain fart moments.  What started off as work on (what I thought was) a fairly simple design ended up as a 2-hour troubleshooting rabbit hole of rolling back config snippets one by one, debugging, and basically overall misery that can be perfectly summed up by this GIF of a guy smashing his head against his keyboard. :)

The scenario in question was a BGP peering between two IOS XR routers.  One was the PE of an MPLS L3VPN network and one was the CE.  As I've done this config literally hundreds of times in the past I could not for the life of me figure out why the BGP peering would not establish.  The relevant snippet of the topology diagram is as follows:

Since this scenario caused me so much pleasure, I am offering 100 tokens good for CCIE Service Provider Version 3.0 Rack Rentals - or any of our other Routing & Switching rack rentals & mock labs, Security rack rentals, or Voice rack rentals - to the first person who can tell me why these neighbors did not establish a BGP peering.  The relevant outputs needed to troubleshoot the problem can be found below.  I still haven't decided whether I'm going to leave this problem in the workbook or not, since it's such a mean one :)  Good luck!

 

 

RP/0/0/CPU0:XR1#show run
Fri May 11 00:34:38.563 UTC
Building configuration...
!! IOS XR Configuration 3.9.1
!! Last configuration change at Fri May 11 00:32:50 2012 by xr1
!
hostname XR1
username xr1
group root-lr
password 7 13061E010803
!
vrf ABC
address-family ipv4 unicast
import route-target
26:65001
!
export route-target
26:65001
!
!
!
line console
exec-timeout 0 0
!
ipv4 access-list PE_ROUTERS
10 permit ipv4 host 1.1.1.1 any
20 permit ipv4 host 2.2.2.2 any
30 permit ipv4 host 5.5.5.5 any
40 permit ipv4 host 19.19.19.19 any
!
interface Loopback0
ipv4 address 19.19.19.19 255.255.255.255
!
interface GigabitEthernet0/1/0/0
ipv4 address 172.19.10.19 255.255.255.0
!
interface GigabitEthernet0/1/0/1
ipv4 address 26.3.19.19 255.255.255.0
!
interface POS0/6/0/0
vrf ABC
ipv4 address 10.19.20.19 255.255.255.0
!
route-policy PASS
pass
end-policy
!
router isis 1
is-type level-2-only
net 49.0001.0000.0000.0019.00
address-family ipv4 unicast
mpls ldp auto-config
!
interface Loopback0
passive
address-family ipv4 unicast
!
!
interface GigabitEthernet0/1/0/1
point-to-point
hello-password hmac-md5 encrypted 022527722E
address-family ipv4 unicast
!
!
!
router bgp 26
address-family ipv4 unicast
!
! address-family ipv4 unicast
address-family vpnv4 unicast
!
neighbor-group PE_ROUTERS
remote-as 26
update-source Loopback0
address-family vpnv4 unicast
!
!
neighbor 1.1.1.1
use neighbor-group PE_ROUTERS
!
neighbor 2.2.2.2
use neighbor-group PE_ROUTERS
!
neighbor 5.5.5.5
use neighbor-group PE_ROUTERS
!
vrf ABC
rd 26:65001
address-family ipv4 unicast
!
neighbor 10.19.20.20
remote-as 65001
address-family ipv4 unicast
route-policy PASS in
route-policy PASS out
as-override
!
!
!
!
mpls ldp
label
allocate for PE_ROUTERS
!
!
end

RP/0/0/CPU0:XR1#

RP/0/3/CPU0:XR2#show run
Fri May 11 00:35:04.932 UTC
Building configuration...
!! IOS XR Configuration 3.9.1
!! Last configuration change at Fri May 11 00:30:30 2012 by xr2
!
hostname XR2
logging console debugging
username xr2
group root-lr
password 7 00071A150754
!
cdp
line console
exec-timeout 0 0
!
interface GigabitEthernet0/4/0/0
ipv4 address 10.20.20.20 255.255.255.0
ipv6 address 2001:10:20:20::20/64
!
interface POS0/7/0/0
ipv4 address 10.19.20.20 255.255.255.0
ipv6 address 2001:10:19:20::20/64
!
route-policy PASS
pass
end-policy
!
router bgp 65001
address-family ipv4 unicast
!
neighbor 10.19.20.19
remote-as 26
address-family ipv4 unicast
route-policy PASS in
route-policy PASS out
!
!
!
end

RP/0/3/CPU0:XR2#

RP/0/0/CPU0:XR1#show bgp vrf ABC ipv4 unicast summary 
Fri May 11 00:34:29.712 UTC
BGP VRF ABC, state: Active
BGP Route Distinguisher: 26:65001
VRF ID: 0x60000002
BGP router identifier 19.19.19.19, local AS number 26
BGP table state: Active
Table ID: 0xe0000002
BGP main routing table version 1

BGP is operating in STANDALONE mode.

Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1 1 1 1 1 1

Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.19.20.20 0 65001 2 7 0 0 0 00:03:59 Idle

 
RP/0/3/CPU0:XR2#show bgp ipv4 unicast summary
Fri May 11 00:35:02.278 UTC
BGP router identifier 0.0.0.0, local AS number 65001
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0000000
BGP main routing table version 1
BGP scan interval 60 secs

BGP is operating in STANDALONE mode.

Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1 1 1 1 1 1

Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.19.20.19 0 26 2 2 0 0 0 00:04:31 Active

 
RP/0/0/CPU0:XR1#show bgp vrf ABC ipv4 unicast neighbors 
Fri May 11 00:34:18.708 UTC

BGP neighbor is 10.19.20.20, vrf ABC
Remote AS 65001, local AS 26, external link
Remote router ID 0.0.0.0
BGP state = Idle
Last read 00:00:00, Last read before reset 00:04:10
Hold time is 180, keepalive interval is 60 seconds
Configured hold time: 180, keepalive: 60, min acceptable hold time: 3
Last write 00:00:15, attempted 53, written 53
Second last write 00:01:01, attempted 53, written 53
Last write before reset 00:04:10, attempted 72, written 72
Second last write before reset 00:04:15, attempted 53, written 53
Last write pulse rcvd May 11 00:34:02.927 last full not set pulse count 9
Last write pulse rcvd before reset 00:04:10
Socket not armed for io, not armed for read, not armed for write
Last write thread event before reset 00:04:10, second last 00:04:10
Last KA expiry before reset 00:00:00, second last 00:00:00
Last KA error before reset 00:00:00, KA not sent 00:00:00
Last KA start before reset 00:00:00, second last 00:00:00
Precedence: internet
Enforcing first AS is enabled
Received 2 messages, 0 notifications, 0 in queue
Sent 7 messages, 0 notifications, 0 in queue
Minimum time between advertisement runs is 0 secs

For Address Family: IPv4 Unicast
BGP neighbor version 0
Update group: 0.2
Route refresh request: received 0, sent 0
Policy for incoming advertisements is PASS
Policy for outgoing advertisements is PASS
0 accepted prefixes, 0 are bestpaths
Cumulative no. of prefixes denied: 0.
Prefix advertised 0, suppressed 0, withdrawn 0
Maximum prefixes allowed 524288
Threshold for warning message 75%, restart interval 0 min
AS override is set
An EoR was not received during read-only mode
Last ack version 0, Last synced ack version 0
Outstanding version objects: current 0, max 0

Connections established 1; dropped 1
Local host: 10.19.20.19, Local port: 19432
Foreign host: 10.19.20.20, Foreign port: 179
Last reset 00:00:15, due to Peer closing down the session
Peer reset reason: Remote closed the session (Connection timed out)
Time since last notification sent to neighbor: 00:02:11
Error Code: administrative shutdown
Notification data sent:
None

RP/0/3/CPU0:XR2#show bgp ipv4 unicast neighbors
Fri May 11 00:34:58.427 UTC

BGP neighbor is 10.19.20.19
Remote AS 26, local AS 65001, external link
Remote router ID 0.0.0.0
BGP state = Active
Last read 00:00:00, Last read before reset 00:04:50
Hold time is 180, keepalive interval is 60 seconds
Configured hold time: 180, keepalive: 60, min acceptable hold time: 3
Last write 00:04:50, attempted 19, written 19
Second last write 00:04:50, attempted 53, written 53
Last write before reset 00:04:50, attempted 19, written 19
Second last write before reset 00:04:50, attempted 53, written 53
Last write pulse rcvd May 11 00:30:08.305 last full not set pulse count 4
Last write pulse rcvd before reset 00:04:50
Socket not armed for io, not armed for read, not armed for write
Last write thread event before reset 00:04:50, second last 00:04:50
Last KA expiry before reset 00:00:00, second last 00:00:00
Last KA error before reset 00:00:00, KA not sent 00:00:00
Last KA start before reset 00:04:50, second last 00:00:00
Precedence: internet
Enforcing first AS is enabled
Received 2 messages, 0 notifications, 0 in queue
Sent 2 messages, 0 notifications, 0 in queue
Minimum time between advertisement runs is 30 secs

For Address Family: IPv4 Unicast
BGP neighbor version 0
Update group: 0.2
Route refresh request: received 0, sent 0
Policy for incoming advertisements is PASS
Policy for outgoing advertisements is PASS
0 accepted prefixes, 0 are bestpaths
Cumulative no. of prefixes denied: 0.
Prefix advertised 0, suppressed 0, withdrawn 0
Maximum prefixes allowed 524288
Threshold for warning message 75%, restart interval 0 min
An EoR was not received during read-only mode
Last ack version 0, Last synced ack version 0
Outstanding version objects: current 0, max 0

Connections established 1; dropped 1
Local host: 10.19.20.20, Local port: 60056
Foreign host: 10.19.20.19, Foreign port: 179
Last reset 00:02:27, due to Interface flap
Time since last notification sent to neighbor: 00:05:07
Error Code: administrative reset
Notification data sent:
None


                        
Oct
28

INE's long awaited CCIE Service Provider Advanced Technologies Class is now available! But first, congratulations to Tedhi Achdiana who just passed the CCIE Service Provider Lab Exam! Here's what Tedhi had to say about his preparation:

Finally I passed my CCIE Service Provider Lab exam in Hong Kong on Oct 17, 2011. I used your CCIE Service Provider Printed Materials Bundle. This product gave me a deep understanding of how Service Provider technology works, so it didn't matter when Cisco changed the SP Blueprint. You just need to practice with IOS XR and find the similar commands on the IOS platform.

Thanks to INE, and keep up the good work!

Tedhi Achdiana
CCIE#30949 - Service Provider

The CCIE Service Provider Advanced Technologies Class covers the newest CCIE SP Version 3.0 Blueprint, including the addition of IOS XR hardware. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS. Understanding the topics covered in this class will ensure that students are ready to tackle the next step in their CCIE preparation, applying the technologies themselves with INE's CCIE Service Provider Lab Workbook, and then finally taking and passing the CCIE Service Provider Lab Exam!

Streaming access is available for All Access Pass subscribers for as low as $65/month! Download access can be purchased here for $299. AAP members can additionally upgrade to the download version for $149.

Sample videos from class can be found after the break:

 

The detailed outline of class is as follows:

  • Introduction
  • Catalyst ME3400 Switching
  • Frame Relay / HDLC / PPP & PPPoE
  • IS-IS Overview / Level 1 & Level 2 Routing
  • IS-IS Network Types / Path Selection & Route Leaking
  • IS-IS Route Leaking on IOS XR / IOS XR Routing Policy Language (RPL)
  • IS-IS IPv6 Routing / IS-IS Multi Topology
  • MPLS Overview / LDP Overview
  • Basic MPLS Configuration
  • MPLS Tunnels
  • MPLS Layer 3 VPN (L3VPN) Overview / VRF Overview
  • VPNv4 BGP Overview / Route Distinguishers vs. Route Targets
  • Basic MPLS L3VPN Configuration
  • MPLS L3VPN Verification & Troubleshooting
  • VPNv4 Route Reflectors
  • BGP PE-CE Routing / BGP AS Override
  • RIP PE-CE Routing
  • EIGRP PE-CE Routing
  • OSPF PE-CE Routing / OSPF Domain IDs / Domain Tags & Sham Links
  • OSPF Multi VRF CE Routing
  • MPLS Central Services L3VPNs
  • IPv6 over MPLS with 6PE & 6VPE
  • Inter AS MPLS L3VPN Overview
  • Inter AS MPLS L3VPN Option A - Back-to-Back VRF Exchange Part 1
  • Inter AS MPLS L3VPN Option A - Back-to-Back VRF Exchange Part 2
  • Inter AS MPLS L3VPN Option B - ASBRs Peering VPNv4
  • Inter AS MPLS L3VPN Option C - ASBRs Peering BGP+Label Part 1
  • Inter AS MPLS L3VPN Option C - ASBRs Peering BGP+Label Part 2
  • Carrier Supporting Carrier (CSC) MPLS L3VPN
  • MPLS Layer 2 VPN (L2VPN) Overview
  • Ethernet over MPLS L2VPN AToM
  • PPP & Frame Relay over MPLS L2VPN AToM
  • MPLS L2VPN AToM Interworking
  • Virtual Private LAN Services (VPLS)
  • MPLS Traffic Engineering (TE) Overview
  • MPLS TE Configuration
  • MPLS TE on IOS XR / LDP over MPLS TE Tunnels
  • MPLS TE Fast Reroute (FRR) Link and Node Protection
  • Service Provider Multicast
  • Service Provider QoS

Additionally completely new versions of INE CCIE Service Provider Lab Workbook Volumes I & II are on their way, and should be released before the end of the year. Stay tuned for more information on the workbook and rack rental availability!

Oct
18

One of our most anticipated products of the year - INE's CCIE Service Provider v3.0 Advanced Technologies Class - is now complete!  The videos from class are in the final stages of post production and will be available for streaming and download access later this week.  Download access can be purchased here for $299.  Streaming access is available for All Access Pass subscribers for as low as $65/month!  AAP members can additionally upgrade to the download version for $149.

At roughly 40 hours, the CCIE SPv3 ATC covers the newly released CCIE Service Provider version 3 blueprint, which includes the addition of IOS XR hardware. This class includes both technology lectures and hands on configuration, verification, and troubleshooting on both regular IOS and IOS XR. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS.

Below you can see a sample video from the class, which covers IS-IS Route Leaking and its implementation on IOS XR with the Routing Policy Language (RPL).

Aug
30

One of the frequent questions I hear regarding L3VPNs concerns the bottom VPN label.  In this article, we will focus on the control plane that provides both the VPN and transit labels, and then look at the data plane that results from those labels.

In the topology, there are two customer sites (bottom left and bottom right).  BGP, the VRFs, redistribution, etc. are all configured to allow us to focus on the control and data planes.  Let's begin by verifying that R1 is sourcing the network 1.1.1.1/32.

[Image: MPLS L3VPN lab topology]

A debug verifies that R1 is sending the updates for 1.1.1.1 to R2.

[Image: debug output showing R1 sending the update for 1.1.1.1 to R2]

R2 has learned the route from R1, has assigned a VPN label for it, and has exported it from the VRF into BGP.  This lucky route was assigned the local label of 16 by R2.

[Image: R2's BGP table showing the route and its local VPN label]

We can also look at the MPLS forwarding table on R2 to see the same tag information.

[Image: MPLS forwarding table on R2]

This prefix, as a VPNv4 route, is sent as an update to the iBGP peer R4.  We can force an update with a route refresh.

[Image: clearing the BGP session on R2 to force a route refresh]
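
In case the screenshot is hard to read, the refresh in question is just a soft clear toward the iBGP peer, something along the lines of (4.4.4.4 being R4's peering address in this topology):

R2#clear ip bgp 4.4.4.4 soft out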

The update can be seen on the wire between R2 and R3 (on its way to R4) using a protocol analyzer.  You may also notice that R2 uses outgoing label 19 to forward this update to 4.4.4.4; that label can be seen in the MPLS forwarding output above.

[Image: Wireshark capture of the VPNv4 update from R2 to R4]

The VPN label being advertised in the update is Label 16, which is R2's local label for the 1.1.1.1 network.

On R4, which will be the ingress PE for transit traffic destined to 1.1.1.1, we can see that the VPN label of 16 is associated with the destination network 1.1.1.1.  The next hop of 2.2.2.2 to reach the 1.1.1.1 network is due to R2 setting next-hop-self on the updates it sends to R4.

[Image: R4's BGP table showing the VPN label learned from R2]

We can also see the outgoing MPLS label that R4 will use to reach the next hop of 2.2.2.2.  The label of 18 below was advertised by R3 as the label to use to reach 2.2.2.2.

[Image: R4's outgoing label toward the next hop 2.2.2.2]

We can also verify that the route (1.1.1.1) has been imported by R4 into the customer VRF.

[Image: R4's customer VRF routing table containing 1.1.1.1]

So when a transit packet is sent from R5 to 1.1.1.1, R4 should impose two labels.  The bottom label will be 16 (the VPN label for the 1.1.1.1 network, which R2 told us about via the iBGP update), and the top label should be 18 (advertised via LDP by R3) to reach the next hop of 2.2.2.2.

On R4, a quick check of the CEF table for the VRF can verify both labels.

[Image: CEF table entry on R4 showing both imposed labels]
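
For reference, the command behind that screenshot is along these lines, where CUSTOMER stands in for whatever the VRF is actually named in this lab:

R4#show ip cef vrf CUSTOMER 1.1.1.1 detail

The detail output lists the label stack that will be imposed on traffic toward 1.1.1.1, which here should read {18 16}: the LDP-learned transit label on top and the VPN label underneath.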

A simple trace from R5 to the destination network 1.1.1.1 should show all the labels in the path.

[Image: traceroute from R5 to 1.1.1.1 showing the label stack]

The top label of 18 is used to reach the next hop of 2.2.2.2.  The bottom label of 16, which is meaningful to R2 because R2 allocated it, will be used by R2 to forward the transit traffic destined to 1.1.1.1 on to the next hop, which is R1.

R3 will pop the transit label off, due to R2 advertising implicit-null for the network 2.2.2.2 (itself).
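
If you want to see that advertisement directly, the LDP binding for R2's loopback can be checked from R3 with something like:

R3#show mpls ldp bindings 2.2.2.2 32

The remote binding learned from R2 for 2.2.2.2/32 should show imp-null, which is what instructs R3 to perform penultimate hop popping before handing the packet to R2.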

For more information and step-by-step training on MPLS, take a look at our newest MPLS self-paced course!

If you like, an 8-minute video that reviews the same steps may be viewed here.

Thanks for reading!

 

Aug
26

[Image: basic MPLS example topology]

In this blog post we’re going to discuss the fundamental logic of how MPLS tunnels allow applications such as L2VPN & L3VPN to work, and how MPLS tunnels enable Service Providers to run what is known as the “BGP Free Core”. In a nutshell, MPLS tunnels allow traffic to transit over devices that have no knowledge of the traffic’s final destination, similar to how GRE tunnels and site-to-site IPsec VPN tunnels work. To accomplish this, MPLS tunnels use a combination of IGP learned information, BGP learned information, and MPLS labels.

In normal IP transit networks, each device in the topology makes a routing decision on a hop-by-hop basis by comparing the destination IP address of the packet to the routing or forwarding table. In MPLS networks, devices make their decision based on the MPLS label contained in the packet that is received. In this manner, MPLS enabled Label Switch Routers (LSRs for short) do not necessarily need IP routing information about all destinations, as long as they know how to forward traffic based on an MPLS label. To demonstrate how this process works, we’ll examine the above topology in two sample cases, first with normal IP packet forwarding, and secondly with IP packet forwarding plus MPLS.

In this topology R4, R5, and R6 represent the Service Provider network, while SW1 and SW2 represent customers that are using the Service Provider for transit. In each example case, the goal of our scenario will be to provide IP packet transport between the 10.1.7.0/24 network attached to SW1, and the 10.1.8.0/24 network attached to SW2.

Case 1: Normal IP Forwarding

In case 1, the devices in the topology are configured as follows. AS 100, which consists of R4, R5, and R6, runs OSPF as an IGP on all internal interfaces, along with a full mesh of iBGP. AS 7, which consists of SW1, and AS 8, which consists of SW2, peer EBGP with AS 100 via R4 and R6 respectively, and advertise the prefixes 10.1.7.0/24 & 10.1.8.0/24 respectively into BGP. The relevant device configurations are as follows. Note that these configs assume that layer 2 connectivity has already been established between the devices.

R4#
interface Loopback0
ip address 10.1.4.4 255.255.255.255
ip ospf 1 area 0
!
interface FastEthernet0/0
ip address 10.1.47.4 255.255.255.0
!
interface FastEthernet0/1
ip address 10.1.45.4 255.255.255.0
ip ospf 1 area 0
!
router bgp 100
neighbor 10.1.6.6 remote-as 100
neighbor 10.1.6.6 update-source Loopback0
neighbor 10.1.6.6 next-hop-self
neighbor 10.1.45.5 remote-as 100
neighbor 10.1.45.5 next-hop-self
neighbor 10.1.47.7 remote-as 7

R5#
interface FastEthernet0/0
ip address 10.1.45.5 255.255.255.0
ip ospf 1 area 0
!
interface FastEthernet0/1
ip address 10.1.56.5 255.255.255.0
ip ospf 1 area 0
!
router bgp 100
neighbor 10.1.45.4 remote-as 100
neighbor 10.1.56.6 remote-as 100

R6#
interface Loopback0
ip address 10.1.6.6 255.255.255.255
ip ospf 1 area 0
!
interface FastEthernet0/0
ip address 10.1.56.6 255.255.255.0
ip ospf 1 area 0
!
interface FastEthernet0/1
ip address 10.1.68.6 255.255.255.0
!
router bgp 100
neighbor 10.1.4.4 remote-as 100
neighbor 10.1.4.4 update-source Loopback0
neighbor 10.1.4.4 next-hop-self
neighbor 10.1.56.5 remote-as 100
neighbor 10.1.56.5 next-hop-self
neighbor 10.1.68.8 remote-as 8

SW1#
interface Loopback0
ip address 10.1.7.7 255.255.255.0
!
interface Vlan47
ip address 10.1.47.7 255.255.255.0
!
router bgp 7
network 10.1.7.0 mask 255.255.255.0
neighbor 10.1.47.4 remote-as 100

SW2#
interface Loopback0
ip address 10.1.8.8 255.255.255.0
!
interface Vlan68
ip address 10.1.68.8 255.255.255.0
!
router bgp 8
network 10.1.8.0 mask 255.255.255.0
neighbor 10.1.68.6 remote-as 100

Next, let’s examine the hop-by-hop packet flow as traffic moves between the 10.1.7.0/24 and 10.1.8.0/24 networks, starting at SW1 towards SW2, and then back in the reverse direction. Note that verification should be done in both directions, as packet flow from the source to destination is independent of packet flow from the destination back to the source.

SW1#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "bgp 7", distance 20, metric 0
Tag 100, type external
Last update from 10.1.47.4 00:21:13 ago
Routing Descriptor Blocks:
* 10.1.47.4, from 10.1.47.4, 00:21:13 ago
Route metric is 0, traffic share count is 1
AS Hops 2
Route tag 100

On SW1, the prefix 10.1.8.0 is learned via BGP from R4, with a next-hop of 10.1.47.4. Next, SW1 performs a second recursive lookup on the next-hop to see which interface must be used for packet forwarding.

SW1#show ip route 10.1.47.4
Routing entry for 10.1.47.0/24
Known via "connected", distance 0, metric 0 (connected, via interface)
Routing Descriptor Blocks:
* directly connected, via Vlan47
Route metric is 0, traffic share count is 1

The result of this lookup is that SW1 should use interface Vlan47, which connects towards R4. Assuming that underlying IP address to MAC address resolution is successful, packets going to 10.1.8.0 should be properly routed towards R4. Next, the lookup process continues on R4.

R4#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "bgp 100", distance 200, metric 0
Tag 8, type internal
Last update from 10.1.6.6 00:25:19 ago
Routing Descriptor Blocks:
* 10.1.6.6, from 10.1.6.6, 00:25:19 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 8

R4 is learning the prefix 10.1.8.0 via iBGP from R6, with a next-hop value of 10.1.6.6, R6's Loopback0 interface. R4 must now perform an additional recursive lookup to figure out which interface 10.1.6.6 is reachable out of.

R4#show ip route 10.1.6.6
Routing entry for 10.1.6.6/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 10.1.45.5 on FastEthernet0/1, 00:25:26 ago
Routing Descriptor Blocks:
* 10.1.45.5, from 10.1.6.6, 00:25:26 ago, via FastEthernet0/1
Route metric is 3, traffic share count is 1

R4 knows 10.1.6.6 via OSPF from R5, which uses interface FastEthernet0/1. Assuming layer 2 connectivity is working properly, packets towards 10.1.8.0 are now routed to R5, and the lookup process continues.

R5#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "bgp 100", distance 200, metric 0
Tag 8, type internal
Last update from 10.1.56.6 00:24:36 ago
Routing Descriptor Blocks:
* 10.1.56.6, from 10.1.56.6, 00:24:36 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 8

R5 is learning the prefix 10.1.8.0 via iBGP from R6, with a next-hop of 10.1.56.6. A recursive lookup on the next-hop, as seen below, indicates that R5 should use interface FastEthernet0/1 to forward packets towards 10.1.8.0.

R5#show ip route 10.1.56.6
Routing entry for 10.1.56.0/24
Known via "connected", distance 0, metric 0 (connected, via interface)
Routing Descriptor Blocks:
* directly connected, via FastEthernet0/1
Route metric is 0, traffic share count is 1

The lookup process now continues on R6, as seen below.

R6#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "bgp 100", distance 20, metric 0
Tag 8, type external
Last update from 10.1.68.8 00:28:58 ago
Routing Descriptor Blocks:
* 10.1.68.8, from 10.1.68.8, 00:28:58 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 8

R6 is learning the prefix 10.1.8.0 via EBGP from SW2, with a next-hop of 10.1.68.8.

R6#show ip route 10.1.68.8
Routing entry for 10.1.68.0/24
Known via "connected", distance 0, metric 0 (connected, via interface)
Routing Descriptor Blocks:
* directly connected, via FastEthernet0/1
Route metric is 0, traffic share count is 1

A recursive lookup on 10.1.68.8 from R6 dictates that interface FastEthernet0/1 should be used to forward traffic on to SW2.

SW2#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "connected", distance 0, metric 0 (connected, via interface)
Advertised by bgp 8
Routing Descriptor Blocks:
* directly connected, via Loopback0
Route metric is 0, traffic share count is 1

SW2’s lookup for 10.1.8.0 indicates that the destination is directly connected, and packets are routed to the final destination. For return traffic back to 10.1.7.0, a lookup occurs in the reverse direction similar to what we saw above, starting as SW2, and moving to R6, R5, R4, and then finally SW1.

This example shows how the hop-by-hop routing paradigm works in IPv4 networks. While this type of design works, one of the limitations of IPv4 forwarding is that all devices in the transit path must have routing information for all destinations they are forwarding packets towards. If AS 100 were used for Internet transit in this example, each router in the transit path would need 300,000+ routes in its routing table in order to provide transit to all Internet destinations. This is just one of the many applications for which MPLS can be helpful. By introducing MPLS into this design, the need for large routing tables can be avoided in the core of the Service Provider network.

Case 2: MPLS Forwarding

By enabling MPLS in the Service Provider network of AS 100, BGP can be disabled in the core, lightening the load on devices that are possibly already taxed for resources. The configuration for MPLS in this scenario is very simple, but the understanding of what happens behind the scenes can be intimidating when learning the technology for the first time. To help with this learning curve, we’ll look at the step by step process that occurs when an MPLS tunnel is functional in AS 100.

The configuration changes necessary to implement MPLS in AS 100 are as follows:

R4#
mpls label protocol ldp
!
interface FastEthernet0/1
mpls ip
!
router bgp 100
no neighbor 10.1.45.5 remote-as 100

R5#
mpls label protocol ldp
!
interface FastEthernet0/0
mpls ip
!
interface FastEthernet0/1
mpls ip
!
no router bgp 100

R6#
mpls label protocol ldp
!
interface FastEthernet0/0
mpls ip
!
router bgp 100
no neighbor 10.1.56.5 remote-as 100

Once MPLS is enabled inside AS 100, BGP can be disabled on R5, and the additional BGP peering statements removed on R4 and R6. The end result of this change is surprising for some, as seen below.

R5#show ip route 10.1.7.0
% Subnet not in table
R5#show ip route 10.1.8.0
% Subnet not in table

SW1#ping 10.1.8.8 source 10.1.7.7

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.1.8.8, timeout is 2 seconds:
Packet sent with a source address of 10.1.7.7
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms

Although R5 no longer has a route to the prefixes 10.1.7.0/24 or 10.1.8.0/24, it can still provide transit for traffic between them. How is this possible you may ask? The key is that an MPLS tunnel has now been formed between the ingress and egress routers of the Service Provider network, which are R4 and R6 in this case.  To see the operation of the MPLS tunnel, we'll follow the same lookup process as before, but now R4, R5, and R6 will add an additional MPLS label lookup.

Below SW1 looks up the route for 10.1.8.0/24, and finds that it recurses to R4's next-hop value reachable via the Vlan47 interface.

SW1#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "bgp 7", distance 20, metric 0
Tag 100, type external
Last update from 10.1.47.4 01:02:56 ago
Routing Descriptor Blocks:
* 10.1.47.4, from 10.1.47.4, 01:02:56 ago
Route metric is 0, traffic share count is 1
AS Hops 2
Route tag 100

SW1#show ip route 10.1.47.4
Routing entry for 10.1.47.0/24
Known via "connected", distance 0, metric 0 (connected, via interface)
Routing Descriptor Blocks:
* directly connected, via Vlan47
Route metric is 0, traffic share count is 1

Next, R4 receives packets for this destination and performs its own lookup.

R4#show ip route 10.1.8.0
Routing entry for 10.1.8.0/24
Known via "bgp 100", distance 200, metric 0
Tag 8, type internal
Last update from 10.1.6.6 01:05:15 ago
Routing Descriptor Blocks:
* 10.1.6.6, from 10.1.6.6, 01:05:15 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 8

Like before, R4 finds the route via BGP from R6, with a next-hop of 10.1.6.6.  R4 must now perform a recursive lookup on 10.1.6.6 to find the outgoing interface to reach R6.

R4#show ip route 10.1.6.6
Routing entry for 10.1.6.6/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 10.1.45.5 on FastEthernet0/1, 01:06:22 ago
Routing Descriptor Blocks:
* 10.1.45.5, from 10.1.6.6, 01:06:22 ago, via FastEthernet0/1
Route metric is 3, traffic share count is 1

R4's recursive lookup finds the outgoing interface FastEthernet0/1 with a next-hop of 10.1.45.5.  In normal IP forwarding, the packet would now be sent to the interface driver for layer 2 encapsulation.  However in this case, R4 first checks to see if the interface FastEthernet0/1 is MPLS enabled, as seen below.

R4#show mpls interfaces
Interface              IP            Tunnel   BGP Static Operational
FastEthernet0/1        Yes (ldp)     No       No  No     Yes

Since interface FastEthernet0/1 is running MPLS via Label Distribution Protocol (LDP), R4 now consults the MPLS Label Forwarding Information Base (LFIB) to see if there is an MPLS label assigned to the next-hop we're trying to reach, 10.1.6.6.

R4#show mpls forwarding-table
Local  Outgoing      Prefix            Bytes Label   Outgoing   Next Hop
Label  Label or VC   or Tunnel Id      Switched      interface
16     Pop Label     10.1.56.0/24      0             Fa0/1      10.1.45.5
17     17            10.1.6.6/32       0             Fa0/1      10.1.45.5
18     18            10.1.68.0/24      0             Fa0/1      10.1.45.5

R4 finds an entry for 10.1.6.6/32 in the LFIB, and uses the outgoing label value of 17.  This means that for traffic going to 10.1.8.0/24, the label 17 will be added to the packet header.  In reality this lookup process occurs as one step, which is the lookup in the CEF table.  The below output of the CEF table for the final destination on R4 shows that label 17 will be used, because it is inherited from the next-hop of 10.1.6.6.

R4#show ip cef 10.1.8.0 detail
10.1.8.0/24, epoch 0
recursive via 10.1.6.6
nexthop 10.1.45.5 FastEthernet0/1 label 17

Now that the MPLS label lookup is successful, the packet is label switched to R5, which leads us to the key step in this example.  When R5 receives the packet, it sees that it has an MPLS label in the header.  This means that R5 performs a lookup in the MPLS LFIB first, and not in the regular IP routing table.  Specifically R5 sees the label number 17 coming in, which has a match in the LFIB as seen below.

R5#show mpls forwarding-table
Local  Outgoing      Prefix            Bytes Label   Outgoing   Next Hop
Label  Label or VC   or Tunnel Id      Switched      interface
16     Pop Label     10.1.4.4/32       15447         Fa0/0      10.1.45.4
17     Pop Label     10.1.6.6/32       15393         Fa0/1      10.1.56.6
18     Pop Label     10.1.68.0/24      0             Fa0/1      10.1.56.6

The local label 17 is associated with the destination 10.1.6.6/32.  Although our packets are going to the final destination 10.1.8.0/24, knowing how to get towards the next-hop 10.1.6.6/32 is sufficient for R5, because R6 actually does have the route for the final destination.  Specifically, R5's operation at this point is to remove the label 17 from the packet and continue forwarding the packet towards R6 without an additional label.  This is known as the "pop" operation, or label disposition.  It occurs because R5's LFIB shows no real outgoing label for this prefix (displayed as "Pop Label" or "No Label" in the outputs above and below), which instructs R5 to strip the top MPLS label and forward the packet on.
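
As a side note, the pop happens at R5 (the penultimate hop) because the egress router advertises the implicit-null label for its own prefix, requesting penultimate hop popping (PHP). Purely as an illustration (this is not part of the scenario above), the egress routers could be told to advertise explicit-null instead, so that the last hop still receives a labeled packet:

R6#
! Illustrative only: advertise explicit-null rather than implicit-null,
! so R5 keeps a label (value 0) on the packet instead of popping the stack.
mpls ldp explicit-null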

On the return trip for packets from 10.1.8.0/24 back to 10.1.7.0/24, R6 adds the label 16 and forwards the packet to R5, then R5 removes the label 16 and forwards the packet to R4.  This can be inferred from the LFIB and CEF table verifications below.

R6#show mpls forwarding-table
Local  Outgoing      Prefix            Bytes Label   Outgoing   Next Hop
Label  Label or VC   or Tunnel Id      Switched      interface
16     16            10.1.4.4/32       0             Fa0/0      10.1.56.5
17     Pop Label     10.1.45.0/24      0             Fa0/0      10.1.56.5

R6#show ip cef 10.1.7.0 detail
10.1.7.0/24, epoch 0
recursive via 10.1.4.4
nexthop 10.1.56.5 FastEthernet0/0 label 16

R5#show mpls forwarding-table
Local  Outgoing      Prefix            Bytes Label   Outgoing   Next Hop
Label  Label or VC   or Tunnel Id      Switched      interface
16     No Label      10.1.4.4/32       17606         Fa0/0      10.1.45.4
17     No Label      10.1.6.6/32       17552         Fa0/1      10.1.56.6
18     Pop Label     10.1.68.0/24      0             Fa0/1      10.1.56.6

To see this operation in action, we can send traffic from 10.1.7.0/24 to 10.1.8.0/24, and look at the debug mpls packet output on R5.

R5#debug mpls packet
Packet debugging is on

SW1#ping 10.1.8.8 source 10.1.7.7

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.1.8.8, timeout is 2 seconds:
Packet sent with a source address of 10.1.7.7
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms

R5#
MPLS les: Fa0/0: rx: Len 118 Stack {17 0 254} - ipv4 data
MPLS les: Fa0/1: rx: Len 118 Stack {16 0 254} - ipv4 data
MPLS les: Fa0/0: rx: Len 118 Stack {17 0 254} - ipv4 data
MPLS les: Fa0/1: rx: Len 118 Stack {16 0 254} - ipv4 data
MPLS les: Fa0/0: rx: Len 118 Stack {17 0 254} - ipv4 data
MPLS les: Fa0/1: rx: Len 118 Stack {16 0 254} - ipv4 data
MPLS les: Fa0/0: rx: Len 118 Stack {17 0 254} - ipv4 data
MPLS les: Fa0/1: rx: Len 118 Stack {16 0 254} - ipv4 data
MPLS les: Fa0/0: rx: Len 118 Stack {17 0 254} - ipv4 data
MPLS les: Fa0/1: rx: Len 118 Stack {16 0 254} - ipv4 data

The beauty of this MPLS design is that for any new routes AS 7 or AS 8 advertise into the network, AS 100 does not need to allocate new MPLS labels in the core.  As long as MPLS transport is established between the BGP peering address of the Provider Edge routers (R4 and R6 in this case), traffic for any destinations can transit over the Service Provider's network without the core needing any further forwarding information.
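
Taking this one step further, the core really only needs label bindings for the BGP next-hop addresses themselves. A minimal, purely illustrative sketch of constraining LDP on R5 to the PE loopbacks used in this topology (10.1.4.4 and 10.1.6.6) might look like the following; it is an optimization, not something the scenario above requires:

R5#
! Illustrative sketch: advertise LDP labels only for the PE loopbacks
! that serve as BGP next-hops, rather than for every IGP prefix.
ip access-list standard PE_LOOPBACKS
permit 10.1.4.4
permit 10.1.6.6
!
no mpls ldp advertise-labels
mpls ldp advertise-labels for PE_LOOPBACKS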

We recently created a new self-paced MPLS course, which walks the learner step by step from concept to implementation for MPLS and L3 VPNs.  Click here for more information.

Aug
16

Abstract

In this blog post we are going to review a number of MPLS scaling techniques. Theoretically, the main factors that limit MPLS network growth are:

  1. IGP Scaling. Route Summarization, which is the core procedure for scaling all commonly used IGPs, does not work well with MPLS LSPs. We’ll discuss the reasons for this and see what solutions are available to deploy MPLS in the presence of IGP route summarization.
  2. Forwarding State growth. Deploying MPLS TE may be challenging in a large network, as the number of tunnels grows like O(N^2), where N is the number of TE endpoints (typically the number of PE routers). While most networks are not even near the breaking point, we are still going to review techniques that allow MPLS-TE to scale to very large networks (tens of thousands of routers).
  3. Management Overhead. MPLS requires additional control-plane components and is therefore more difficult to manage than classic IP networks, and this only becomes more complicated as the network grows.

The blog post summarizes some recently developed approaches that address the first two of the above mentioned issues. Before we begin, I would like to thank Daniel Ginsburg for introducing me to this topic back in 2007.

IGP Scalability and Route Summarization

IP networks were built around the concept of hierarchical addressing and routing, first introduced by Kleinrock [KLEIN]. Scaling for hierarchically addressed networks is achieved by using topologically-aware addresses that allow for network clustering and routing information hiding by summarizing contiguous address blocks. While other approaches to route scaling have been developed since, the hierarchical approach remains the prevalent one in modern networks. Modern IGPs used in SP networks are link-state (ISIS and OSPF) and hence maintain network topology information in addition to network reachability information. Route summarization creates topological domain boundaries and condenses network reachability information, which has the following important effects on link-state routing protocols:

  1. Topological database size is decreased. This reduces the impact on router memory and CPU, as a smaller database requires less maintenance effort and consumes less memory.
  2. Convergence time within every area is improved as a result of faster SPF, smaller flooding scope and decreased FIB size.
  3. Flooding is reduced, since events in one routing domain do not propagate into another as a result of information hiding. This improves routing process stability.

The above positive effects were very important during the earlier days of the Internet, when hardware did not have enough power to easily run link-state routing protocols. Modern advances in router hardware allow link-state protocols to scale to single-area networks consisting of thousands of routers. Certain optimization procedures mentioned, for instance, in [OSPF-FAST] allow such large single-area networks to remain stable and converge on a sub-second time scale. Many ISP networks indeed use a single-area design for their IGPs, enjoying simplified operational procedures. However, scaling IGPs to tens of thousands of nodes will most likely require the use of routing areas while still allowing for end-to-end MPLS LSPs. Another trend that may call for support of route summarization in MPLS networks is the fact that many enterprises, which typically employ area-based designs, are starting their own MPLS deployments.

Before we go into the details of how route summarization affects MPLS LSPs, it is worth recalling what negative effects route summarization has in pure IP networks. Firstly, summarization hides network topology and detailed routing information and thus results in suboptimal routing, and even routing loops, in the presence of route redistribution between different protocols. This problem is especially visible at larger scale, e.g. for the Internet as a whole. The other serious problem is that summarization hides reachability information. For example, if a single host goes down, the summary prefix encompassing this host’s IP address remains unchanged, and the failure event is not propagated beyond the summarization domain. While this is a positive effect of summarization from a stability standpoint, it has a negative impact on inter-area network convergence. For example, this affects the BGP next-hop tracking process (see [BGP-NEXTHOP]).
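
For reference, BGP next-hop tracking is enabled by default in modern IOS; a minimal, purely illustrative sketch of tuning it is shown below (the AS number and delay value are arbitrary):

router bgp 100
! Illustrative: react to IGP events that affect BGP next-hops,
! dampening the reaction with a short delay.
bgp nexthop trigger enable
bgp nexthop trigger delay 5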

MPLS LSPs and Route Summarization

An MPLS LSP is a unidirectional tunnel built on the switching of locally significant labels. There are multiple ways of constructing MPLS LSPs, but we are concerned with classic LDP signaling at the moment. In order for an LDP-signaled LSP to destination (FEC, forwarding equivalence class) X to be contiguous, every hop on the path to X must have forwarding state for this FEC. Essentially, the MPLS forwarding component treats X, the FEC, as an endpoint identifier. If two nodes cannot match information about X, they cannot “stitch” the LSP in a contiguous manner. In the case of MPLS deployment in IP networks, the FEC is typically the PE’s /32 IP address. An IP address has the overloaded functions of both location pointer and endpoint identifier. Summarizing the /32 addresses hides the endpoint identity and prevents LDP nodes from consistently mapping labels for a given FEC. Look at the diagram below: ABR2 summarizes the PE1-PE3 Loopback prefixes, and thus PE1 cannot construct an end-to-end LSP for 10.0.0.1/32, 10.0.0.2/32 or 10.0.0.3/32.

mpls-summarization-1

One obvious solution to this problem would be changing the underlying MPLS transport to IP-based tunnels, such as mGRE or L2TPv3. Such solutions are available for deployment with L3 and L2 VPNs (see [L3VPN-MGRE] for example) and work perfectly with route summarization. However, because they rely on conventional IP routing, these solutions lose the powerful feature set of MPLS Traffic Engineering. While some Traffic Engineering solutions are available for IP networks, they are not as flexible and feature-rich as MPLS-TE. Therefore, we are not going to look into IP-based tunneling in this publication.

Route Leaking

Route leaking is the most widely deployed technique to allow for end-to-end LSP construction in the presence of route summarization. This technique is based on the fact that LSPs typically need only be constructed for the PE addresses. Provided that a separate subnet is selected for the PE Loopback interfaces, it is easy to apply fine-tuned control to the PE prefixes in the network. Referring to the figure below, all transit link prefixes (10.0.0.0/16) could be summarized (and even suppressed) and only the necessary PE prefixes (subnet 20.0.0.0/24 on the diagram below) are leaked. It is also possible to fine-tune LDP to propagate only the prefixes from the selected subnet, reducing LDP signaling overhead.

mpls-summarization-2

Leaking /32’s also allows for perfect next-hop reachability tracking, as a route for every PE router is individually present in all routing tables. Of course, such granularity significantly reduces the positive effect of multi-area design and summarization. If the number of PE routers grows to tens of thousands, network stability would be at risk. However, this approach is by far the most commonly used with multi-area designs nowadays.
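
A minimal IOS sketch of this design on an ABR, assuming (as in the figures) that the transit links live in 10.0.0.0/16, the PE loopbacks live in 20.0.0.0/24, and OSPF is the IGP; the device name, area number and ACL name are purely illustrative:

ABR1#
! Illustrative: summarize the transit-link space between areas, but leave
! the PE loopback /32s (the 20.0.0.0/24 block) unsummarized so that
! end-to-end LSPs can still be built toward them.
router ospf 1
area 1 range 10.0.0.0 255.255.0.0
!
! Optionally limit LDP label advertisement to the PE loopback block.
ip access-list standard PE_LOOPBACKS
permit 20.0.0.0 0.0.0.255
!
no mpls ldp advertise-labels
mpls ldp advertise-labels for PE_LOOPBACKS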

Inter-Area RSVP-TE LSPs

MPLS transport is often deployed for the purpose of traffic engineering. Typically, a full mesh of MPLS TE LSPs is provisioned between the PE routers in the network to accomplish this goal (strategic Traffic Engineering). The MPLS TE LSPs are signaled by means of TE extensions to the RSVP protocol (RSVP-TE), which allows for source-routed LSP construction. A typical RSVP-TE LSP is explicitly routed by the source PE, where the explicit route is either manually specified or computed dynamically by the constrained SPF (cSPF) algorithm. Notice that it is impossible for a router within one area to run cSPF for another area, as the topology information is hidden (there are certain solutions to overcome this limitation, e.g. [PCE-ARCH], but they are outside the scope of this publication).

Based on the single-area limitation, it may seem that MPLS-TE is not applicable to multi-area designs. However, a number of extensions to RSVP signaling have been proposed to allow for inter-area LSP construction (see [RSVP-TE-INTERAREA]). In short, such extensions allow for explicitly programming loose inter-area hops (ABRs or ASBRs) and let every ABR expand the loose next-hop (ERO, explicit route object, expansion). The headend router and every ABR run the cSPF algorithm for their own areas, such that every segment of the inter-area LSP is locally optimal.
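
A minimal head-end sketch of such an inter-area tunnel is shown below; all addresses and names are hypothetical, and the global mpls traffic-eng tunnels and IGP TE configuration are omitted:

PE1#
! Illustrative: loose hops (ABR1, ABR2, then the tail-end PE) let each
! ABR perform ERO expansion within its own area.
ip explicit-path name VIA-ABRS enable
next-address loose 10.255.0.1
next-address loose 10.255.0.2
next-address loose 20.0.0.9
!
interface Tunnel1
ip unnumbered Loopback0
tunnel mode mpls traffic-eng
tunnel destination 20.0.0.9
tunnel mpls traffic-eng path-option 10 explicit name VIA-ABRS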

mpls-summarization-3

This approach does not necessarily result in a globally optimal path, i.e. the resulting LSP may differ from the shortest IGP path between the two destinations. This is a serious problem, and it requires considerable modification to RSVP-TE signaling to be resolved. See the document [PCE-ARCH] for an approach to making MPLS LSPs globally optimal. Next, inter-area TE involves additional management overhead, as the ABR loose next-hops need to be programmed manually, possibly covering backup paths via alternate ABRs. The final issue is less obvious, but may have a significant impact in large networks.

Every MPLS-TE constructed LSP is point-to-point (P2P) by nature. This means that for N endpoints there are on the order of N^2 LSPs connecting them, which results in a rapidly growing number of forwarding states (e.g. CEF table entries) in the core routers, not to mention the control-plane overhead. This issue becomes serious in very large networks, as has been thoroughly analyzed in [FARREL-MPLS-TE-SCALE-1] and [FARREL-MPLS-TE-SCALE-2]. It is interesting to compare the MPLS-TE scaling properties to those of MPLS LDP. The former constructs P2P LSPs, while the latter builds Multipoint-to-Point (MP2P) LSPs that merge toward the tail-end router (see [MINEI-MPLS-SCALING]). This is a direct result of signaling behavior: RSVP is end-to-end, while LDP signaling is local between every pair of routers. As a result, the number of forwarding states with LDP grows as O(N), compared to O(N^2) with MPLS-TE. Look at the diagram below, which illustrates LSPs constructed using RSVP-TE and LDP.

mpls-summarization-4
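
To put rough numbers on this: with N = 1,000 PE routers, a full mesh of P2P TE LSPs means N x (N - 1), or roughly one million, unidirectional LSPs threading the core, while merged MP2P LSPs (one per tail-end) stay on the order of one thousand.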

One solution to the state growth problem would be shrinking the full mesh of PE-to-PE LSPs, e.g. pushing the MPLS-TE endpoints from the PEs deeper into the network core, say up to the aggregation layer, and then deploying LDP on the PEs and over the MPLS-TE tunnels. This would reduce the size of the tunnel full mesh significantly, but it would prevent the use of peripheral connections for MPLS-TE. In other words, this reduces the effect of using MPLS-TE for network resource optimization.

mpls-summarization-5

Besides losing full MPLS-TE functionality, the use of unmodified LDP means that route summarization could not be deployed in such a scenario. Based on these limitations, we are not going to discuss this approach to MPLS-TE scaling in this publication. Notice that replacing LDP with MPLS-TE in the non-core sectors of the network does not solve the global traffic engineering problem, though it allows for the construction of locally optimal LSPs at every level. However, similar to the "LDP-in-the-edge" approach, this solution reduces the overall effectiveness of MPLS-TE in the network. There are other ways to improve MPLS-TE scaling properties, which we are going to discuss next.

Hierarchical RSVP-TE LSPs

As mentioned previously, hierarchies are the key to IP routing scalability. Based on this, it is reasonable to assume that hierarchical LSPs could be used to solve the MPLS-TE scaling problems. Hierarchical LSPs have been in the MPLS standards almost since the beginning, and their definition was finalized in RFC 4206. The idea behind hierarchical LSPs is the use of layered MPLS-TE tunnels. The first layer of MPLS-TE tunnels is used to establish forwarding adjacencies for the link-state IGP. These forwarding adjacencies are then flooded in link-state advertisements and added into the Traffic Engineering Database (TED), though not the LSDB. In fact, the first mesh looks exactly like a set of IGP logical connections, but it gets advertised only into the TED. It is important that link-state information is not flooded across this mesh, hence the name “forwarding adjacency”. This first layer of MPLS LSPs typically overlays the network core.

mpls-summarization-6

The second level of MPLS-TE tunnels is then constructed over the first-level mesh that has been added to the TED, using cSPF or manual programming. From the standpoint of the second-level mesh, the physical core topology is hidden and replaced with the full mesh of RSVP-TE signaled “links”. Effectively, the second-level tunnels are nested within the first-level mesh and are never visible to the core network. The second-level mesh normally spans edge-to-edge and connects the PEs. At the data plane, hierarchical LSPs are realized by means of label stacking.

mpls-summarization-7

Not every instance of IOS code supports the classic RFC 4206 FAs. However, it is possible to add an mGRE tunnel spanning the P-routers that form the first-level mesh and use it to route the PE-PE MPLS TE LSPs (the second level). This solution does not require running an additional set of IGP adjacencies over the mGRE tunnel and therefore has acceptable scalability properties. The mGRE tunnel could be overlaid over a classic, non-FA RSVP-TE tunnel mesh used for traffic engineering between the P-routers. This solution creates additional overhead for the hierarchical tunnel mesh, but it allows for implementing hierarchical LSPs in IOS code that does not support the RFC 4206 FAs.

Lastly, it is worth pointing out that Cisco puts a different meaning into the term Forwarding Adjacency as it is implemented in IOS software. Instead of advertising the MPLS-TE tunnels into the TED, Cisco advertises them into the LSDB and uses them for shortest-path construction. These virtual connections are not used for LSA/LSP flooding, though, just like the “classic” FAs. Such a technique does not allow for hierarchical LSPs, as a second-level mesh cannot be signaled over an FA defined in this manner, but it does allow for global visibility of MPLS TE tunnels within a single IGP area, compared to Auto-Route, which only provides local visibility.
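
For reference, a minimal sketch of Cisco's forwarding-adjacency knob on an MPLS-TE tunnel is shown below; the addresses are hypothetical, and the global mpls traffic-eng tunnels plus interface-level TE configuration are omitted. A matching tunnel must be configured in the reverse direction so the IGP sees a bidirectional link:

P1#
! Illustrative: advertise this TE tunnel into the IGP as a link
! (the Cisco-style forwarding adjacency described above).
interface Tunnel100
ip unnumbered Loopback0
tunnel mode mpls traffic-eng
tunnel destination 10.255.1.2
tunnel mpls traffic-eng path-option 10 dynamic
tunnel mpls traffic-eng forwarding-adjacency
!
router ospf 1
mpls traffic-eng router-id Loopback0
mpls traffic-eng area 0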

Do Hierarchical RSVP-TE LSPs Solve the Scaling Problem?

It may look like using hierarchical LSPs along with route summarization solves the MPLS scaling problems at once. The network core has to deal with the first level of MPLS-TE LSPs only, and this design allows for route summarization, as illustrated in the diagram below:

mpls-summarization-8

However, a more detailed analysis performed in [FARREL-MPLS-TE-SCALE-2] shows that in popular ISP network topologies, deploying multi-level LSP hierarchies does not result in significant benefits. Firstly, there are still “congestion points” remaining at the level where the first- and second-layer meshes are joined. This layer has to struggle with a fast-growing number of LSP states, as it has to support both levels of MPLS TE meshes. Furthermore, the management burden associated with deploying multiple MPLS-TE meshes along with Inter-Area MPLS TE tunnels is significant, and it seriously impacts network growth. The same arguments apply to the “hybrid” solution that uses an mGRE tunnel in the core, as every P router needs to have a CEF entry for every other P router connected to the mGRE tunnel, not to mention the underlying P-P RSVP-TE mesh adding even more forwarding states in the P-routers.

Multipoint-to-Point RSVP-TE LSPs

What is the key issue that makes MPLS-TE hard to scale? The root cause is the point-to-point nature of MPLS-TE LSPs, which results in rapid forwarding-state proliferation in transit nodes. An alternative to using LSP hierarchies is the use of Multipoint-to-Point RSVP-TE LSPs. Similar to LDP, it is possible to merge the LSPs that traverse the same egress link and terminate on the same endpoint. With the existing protocol design, RSVP-TE cannot do that, but certain protocol extensions proposed in [YASUKAWA-MP2P-LSP] allow for automatic RSVP-TE merging. Intuitively it is clear that the number of forwarding states with MP2P LSPs grows like O(N), where N is the number of PE routers, compared to O(N^2) in classic MPLS-TE. Effectively, MP2P RSVP-TE signaled LSPs have the same scaling properties as LDP-signaled LSPs, with the added benefits of MPLS-TE functionality and inter-area signaling. Based on these scaling properties, the use of MP2P inter-area LSPs seems to be a promising direction toward scaling MPLS networks.

BGP-Based Hierarchical LSPs

As we have seen, the default mode of operation for MPLS-TE does not offer enough scaling even with LSP hierarchies. It is worth asking whether it is possible to create hierarchical LSPs using signaling other than RSVP-TE. As you remember, BGP extensions can be used to transport MPLS labels. This gives rise to the idea of creating nested LSPs by overlaying a BGP mesh over the IGP areas. Here is an illustration of this concept:

mpls-summarization-9

In this sample scenario, there are three IGP areas, with the ABRs optimally summarizing their area address ranges and therefore hiding the PE /32 prefixes. Inside every area, LDP or RSVP-TE could be used for constructing intra-area LSPs, for example LSPs from the PEs to the ABRs. At the same time, all PEs establish BGP sessions with their nearest ABRs and the ABRs connect in a full iBGP mesh, treating the PEs as route-reflector clients. This allows the PEs to propagate their Loopback /32 prefixes via BGP. The iBGP peering should be done using another set of Loopback interfaces (call them IGP-routed) that are used to build the transport LSPs inside every area.

mpls-summarization-10

The only routers that will see the PE loopback prefixes (BGP-routed) are the other PEs and the ABRs. The next step is configuring the ABRs that act as route-reflectors for the PE routers to change the BGP IP next-hop to self and to activate MPLS label propagation over all iBGP peering sessions. The net result is an overlay label distribution process. Every PE would use two labels in the stack to get to another PE’s Loopback (BGP propagated) by means of BGP next-hop recursive resolution. The topmost label (LDP or RSVP-TE) is used to steer the packet to the nearest ABR, using the transport LSP built toward the IGP-routed Loopback interface. The bottom label (BGP propagated) identifies the PE prefix within the context of a given ABR. Every ABR will pop the incoming LDP/RSVP-TE label, swap the PE label to the label understood by the next ABR/PE (as signaled via BGP), and then push a new LDP label that starts a new LSP toward that next ABR/PE. Effectively, this implements a two-level hierarchical LSP end-to-end between any pair of PEs. This behavior is a result of BGP’s ability to propagate label information and the recursive next-hop resolution process.
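
A minimal sketch of the ABR side of this design in IOS is shown below; the AS number and peer address (a PE's BGP-routed loopback) are hypothetical, and the intra-area LDP/RSVP-TE configuration is omitted:

ABR1#
! Illustrative: reflect the PE /32s with next-hop-self and attach MPLS
! labels to the BGP updates (RFC 3107 style labeled unicast).
router bgp 100
neighbor 20.0.0.1 remote-as 100
neighbor 20.0.0.1 update-source Loopback1
address-family ipv4
neighbor 20.0.0.1 activate
neighbor 20.0.0.1 route-reflector-client
neighbor 20.0.0.1 next-hop-self
neighbor 20.0.0.1 send-label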

How well does this approach scale? Firstly, using BGP for prefix distribution ensures that we may advertise a truly large number of PE prefixes without any serious problems (though operators of DFZ systems may disagree). At first sight, routing convergence may seem to be a problem, as the loss of any PE router would be detected only when the iBGP session times out based on BGP keepalives. However, if BGP next-hop tracking (see [BGP-NEXTHOP]) is used within every area, then the ABRs will be able to detect the loss of a PE at the pace the IGP converges. Link failures within an area will also be handled by the IGP or, possibly, by intra-area protection mechanisms such as MPLS/IP Fast Re-Route. Now for the drawbacks of BGP-signaled hierarchical LSPs:

  1. Configuration and Management overhead. In the popular BGP-free core design, P routers (which typically include the ABRs) do not run BGP. Adding an extra mesh of BGP peering sessions requires configuring all PEs and, specifically, the ABRs, which involves non-trivial effort. This slows the initial deployment process and complicates further operations.
  2. ABR failure protection. Hierarchical LSPs constructed using BGP consist of multiple disconnected LDP/RSVP-TE signaled segments, e.g. PE-ABR1, ABR1-ABR2. Within the current set of RSVP-TE FRR features, it is not possible to protect the LSP endpoint nodes, due to the local significance of the exposed label. There is work in progress to implement endpoint node FRR protection, but this feature is not yet available. This might be a problem, as it makes the network core vulnerable to ABR failures.
  3. The amount of forwarding state increases in the PE and (additionally) ABR nodes. However, unlike with MPLS TE LSPs, this growth is proportional to the number of PE routers, which is approximately the same scaling limitation we would have with MP2P MPLS-TE LSPs. Therefore, the growth of forwarding state in the ABRs should not be a big problem, especially since no nodes other than the PEs/ABRs are affected.

To summarize, it is possible to deploy hierarchical LSPs and get LDP/single-area TE working with multi-area IGPs without any updates to existing protocols (LDP, BGP, RSVP-TE). The main drawback is excessive management overhead and lack of MPLS FRR protection features for the ABRs. However, this is the only approach that does not require any changes to the running software, as all functionality is implemented using existing protocol features.

LDP Extensions to work with Route Summarization

In order to make native LDP (RFC 3036) work with IGP route summarization, the protocol itself can be extended. To date, there are two main approaches: one utilizes prefix leaking via LDP and the other implements hierarchical LSPs using LDP.

LDP Prefix Leaking (Interarea LDP)

A very simple extension to LDP allows route summarization to be used. Per the LDP RFC, prior to installing a label mapping for prefix X, the local router needs to ensure there is an exact match for X in the RIB. RFC 5283 suggests changing this verification procedure to a longest match: if there is a prefix Z in the RIB such that X is a subnet of Z, then the label mapping is kept. It is important to notice that the LFIB is not aggregated in any manner – all label mappings received via LDP are maintained and propagated further. Things are different at the RIB level, however. Prefixes are allowed to be summarized and routing protocol operations are simplified. No end-to-end LSPs are broken, because label mappings for the specific prefixes are maintained along the path.

mpls-summarization-11

What are the drawbacks of this approach? Obviously, LFIB size growth (forwarding state) is the first one. It is possible to argue that maintaining the LFIB is less burdensome than maintaining IGP databases, so this could be acceptable. However, it is well known that IGP convergence is seriously affected by FIB size, not just IGP protocol data structures, as updating a large FIB takes considerable time. Based on this, the LDP “leaking” approach does not solve all scalability issues. On the other hand, keeping detailed information in the LFIB allows for end-to-end connectivity tracking, thanks to LDP’s ordered label propagation. If an area loses a prefix, LDP will signal the loss of the label mapping, even though no specific information ever leaks into the IGP. This is the flip side of having detailed information at the forwarding-plane level. The other problem that could be pointed out is LDP signaling overhead. However, since LDP is an “on-demand”, “distance-vector” style protocol, it does not pose as many problems as, say, link-state IGP flooding.

Aggregated FEC Approach

This approach has been suggested by G. Swallow of Cisco Systems – see [SWALLOW-AGG-FEC]. It requires modifications to LDP and slight changes to forwarding-plane behavior. Here is how it works. When an ABR aggregates prefixes {X1…Xn} into a new summary prefix X, it generates a label for X (the aggregate FEC) and propagates it to other areas, creating an LSP for the aggregate FEC that terminates at the ABR. The LDP mappings are propagated using an “aggregate FEC” type to signal special processing for packets matching this prefix. The LSP constructed for such a FEC has PHP (penultimate hop popping) disabled, for a reason we’ll explain shortly. All that other routers see in their RIB/FIB is the summary prefix X and the corresponding LFIB element (aggregate FEC/label). In addition to propagating the route/label for X, the same ABR also applies a special hash function to the IP addresses {X1…Xn} (the specific prefixes) and generates local labels based on the result of this function. These new algorithmic labels are stored under the context of the “aggregate” label generated for prefix X. That is, these labels should only be interpreted in association with the “aggregate” label. The algorithmic labels are further stitched with the labels the ABR learns via LDP from the source area for the prefixes {X1…Xn}.

The last piece of the puzzle is how a PE creates the label stack for a specific prefix Xi. When a PE attempts to encapsulate a packet destined to Xi at the ip2mpls edge, it looks up the LFIB for an exact match. If no exact match is found, but there is a matching aggregate FEC X, the PE will use the same hash function that the ABR used previously on Xi to create an algorithmic label for Xi. The PE then stacks this “algorithmic” label under the label for the aggregate FEC X and sends the packet with two labels – the topmost for the aggregate X and the bottom for Xi. The packet will arrive at the ABR that originated the summary prefix X with the topmost label NOT removed by the PHP mechanism (as mentioned previously). This allows the ABR to correctly determine the context for the bottom label. The topmost label is removed, and the de-aggregation label for Xi is then used to look up the real label (stitched with the de-aggregation label for Xi) to be used for further switching.

mpls-summarization-12

This method is backward compatible with classic LDP implementations and will interoperate with other LDP deployments. Notice that there is no control-plane correlation between the ABRs and the PEs, as there is in the case of BGP-signaled hierarchical LSPs. Instead, synchronization is achieved by using the same globally known hash function that produces the de-aggregation labels. This method reduces the control-plane overhead associated with hierarchical LSP construction, but it has one drawback – there is no end-to-end reachability signaling, as there was in the RFC 5283 approach. That is, if an area loses the prefix for a PE, there is no way to signal this via LDP, as only the aggregate FEC is propagated. The presentation [SWALLOW-SCALING-MPLS] suggests a generic solution to this problem by means of an IGP protocol extension. In addition to flooding a summary prefix, the ABR is responsible for flooding a bit-vector that corresponds to every possible /32 under the summary. For example, for a /16 prefix there should be a 2^16-bit vector, where a bit set to one means the corresponding /32 prefix is reachable and a bit set to zero means it is unreachable. This scheme allows for certain optimizations, such as using Bloom filters (see [BLOOM-FILTER]) for information compression. This approach is known as Summarized Route Detailed Reachability (SRDR). The SRDR approach solves the problem of hiding the reachability information at the cost of modifications to IGP signaling. An alternative is using tuned BGP keepalives; this, however, puts high stress on the router’s control plane. A better alternative is using data-plane reachability discovery, such as multi-hop BFD ([BFD-MULTIHOP]). The last two approaches do not require any modifications to IGP protocols and therefore interoperate better with existing networks.

Hierarchical LDP

This approach, an extension to RFC 5283, has been proposed by Kireeti Kompella of Juniper, but it has never been officially documented and presented for peer review. There were only a few presentations made at various conferences, such as [KOMPELLA-HLDP]. No IETF draft is available, so we can only guess about the protocol details. In a nutshell, it seems that the idea is to run an overlay mesh of LDP sessions between the PEs/ABRs, similar to the BGP approach, and to use stacked FEC advertisements. The topmost FEC in such an advertisement corresponds to the summarized prefix advertised by the ABR. This FEC is flooded across all areas, and local mappings are used to construct an LSP terminating at the ABR. So far, this looks similar to the aggregated FEC approach. However, instead of using algorithmic label generation, the PE and ABR directly exchange their bindings for the specific prefixes, using a new form of FEC announcement – hierarchical FEC stacking. The ABR advertises the aggregate FEC along with the aggregate label and the nested specific labels. The PE knows what labels the ABR is expecting for the specific prefixes, and it may construct a two-label stack consisting of the “aggregate” label and the “specific” label learned via the directed LDP session. The specific prefixes are accepted by virtue of the RFC 5283 extension, which allows detailed FEC information to be accepted if there is a summary prefix in the RIB matching the specific prefix.

mpls-summarization-13

The hierarchical LDP approach maintains a control-plane connection between the PE and the ABRs. Most likely, this means manual configuration of directed LDP sessions, very much like the BGP approach. The benefit is control-plane reachability signaling and better extensibility compared to the Aggregated FEC approach. Another benefit is that the BGP mesh is left intact and only the LDP configuration has to be modified. However, it seems that further work on Hierarchical LDP extensions has been abandoned, as there are no recent publications or discussions on this subject.

Hierarchical LSP Commonalities

So far we have reviewed several approaches to constructing hierarchical LSPs: the first uses RSVP-TE forwarding adjacencies, the second uses BGP label propagation, and the last two use LDP extensions. All of these approaches result in the construction of transport LSP segments terminating at the ABRs. For example, in the RSVP-TE approach there are LSPs connecting the ABRs; in the BGP approach, there are LSP segments connecting the PEs to the ABRs. As we mentioned previously, the current set of MPLS FRR features does not protect LSP endpoints. As a direct result, using hierarchical LSPs decreases the effectiveness of MPLS FRR protection. There is work in progress on extending FRR protection to LSP endpoints, but there are no complete working solutions at the moment.

Summary

We have reviewed various aspects of scaling MPLS technology. The two main ones are scaling IGP protocols by using route summarization/areas and getting MPLS to work with summarization. A number of approaches are available to solve the latter problem, and practically all of them (with the exception of inter-area MPLS TE) are based on hierarchical LSP construction. Some approaches, such as BGP-signaled hierarchical LSPs, are ready to be deployed using existing protocol functionality, at the expense of added management overhead. Others require modifications to control-plane or forwarding-plane behavior.

There seemed to be high interest in MPLS scaling problems about 3-4 years ago (2006-2007), but the topic appears to have been largely abandoned since. There is no active work in progress on the LDP extensions mentioned above; however, the Multipoint-to-Point RSVP-TE LSP draft [YASUKAWA-MP2P-LSP] seems to be making progress through the IETF. Based on this, it looks like Inter-Area RSVP-TE with MP2P extensions is going to be the main solution for scaling the MPLS networks of the future.

Further Reading

RFC 3036: LDP Specification
[FARREL-MPLS-TE-SCALE-1] MPLS-TE Does Not Scale
[RSVP-TE-INTERAREA] Requirements for Inter-Area MPLS Traffic Engineering
Inter-Area Traffic Engineering
[PCE-ARCH] A Path-Computation Element Based Architecture
[YASUKAWA-MP2P-LSP] Multipoint-to-Point RSVP-TE Signaled LSPs
[MINEI-MPLS-SCALING] Scaling Considerations in MPLS Networks
[SWALLOW-SCALING-MPLS] George Swallow: Scaling MPLS, MPLS2007 (no public link available)
[KOMPELLA-HLDP] Techniques for Scaling LDP, Kireeti Kompella MPLS2007 (no public link available)
[SWALLOW-AGG-FEC] Network Scaling with Aggregated IP LSPs
LDP Extensions for Inter-Area LSPs
[KLEIN] Hierarchical Routing for Large Networks
[OSPF-FAST] OSPF Fast Convergence
[L3VPN-MGRE] Layer 3 VPNs over Multipoint GRE
[FARREL-MPLS-TE-SCALE-2] MPLS TE Scaling Analysis
RFC 4206: LSP Hierarchy
[BGP-NEXTHOP] BGP Next-Hop Tracking
[BLOOM-FILTER] Bloom Filter
[BFD-MULTIHOP] Multihop Extension to BFD

Aug
16

Last week we wrapped up the MPLS bootcamp, and it was a blast! A big shout out to all the students who attended, as well as to many of the INE staff who stopped by (you know who you are :)). Thank you all.

Here is the topology we used for the class, as we built the network, step by step.

MPLS-class blog

The class was organized and delivered in 30 specific lessons. Here is the "overview" slide from class:

MPLS Journey Statement

One of the important items we discussed was troubleshooting. When we understand all the components of Layer 3 VPNs, the troubleshooting is easy. Here are the steps (a rough sketch of the corresponding show commands follows the list):

  • Can PE see CE’s routes?
  • Are VPN routes going into MP-BGP?  (The Export)
  • Are remote PEs seeing the VPN routes?
  • Are remote PEs inserting the VPN routes into the correct local VRF? (The Import)
  • Are the remote PEs advertising these to the remote CEs?
  • Are the remote CEs seeing the routes?
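
One rough way to map these checks to IOS show commands (the VRF name and prefix are placeholders, and the comment lines are just annotations):

! On the local PE: can we see the CE's routes in the VRF?
show ip route vrf CUSTOMER
! Are the VPN routes being exported into MP-BGP?
show ip bgp vpnv4 vrf CUSTOMER
! Are the remote PEs receiving them over the VPNv4 session?
show ip bgp vpnv4 all summary
show ip bgp vpnv4 all 10.1.1.0
! On the remote PE: did the routes import into the correct VRF,
! and are they being advertised on to the remote CE?
show ip route vrf CUSTOMER
! On the remote CE: are the routes present?
show ip route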

We had lots of fun, and included Wireshark protocol analysis so we could see and verify what we were learning. Here is one example of a BGP update from a downstream iBGP neighbor, which includes the VPN label:

VPN Label

If you missed the class but still want to benefit from it, we have recorded all 30 sessions, and they are available as an on-demand version of the class.

Next week the BGP bootcamp is running, so if you need to brush up on BGP, we will be covering the following topics, also in 30 easy-to-digest lessons:

  • Monitoring and Troubleshooting BGP
  • Multi-Homed BGP Networks
  • AS-Path Filters
  • Prefix-List Filters
  • Outbound Route Filtering
  • Route-Maps as BGP Filters
  • BGP Path Attributes
  • BGP Local Preference
  • BGP Multi-Exit-Discriminator (MED)
  • BGP Communities
  • BGP Customer Multi-Homed to a Single Service Provider
  • BGP Customer Multi-Homed to Multiple Service Providers
  • Transit Autonomous System Functions
  • Packet Forwarding in Transit Autonomous Systems
  • Monitoring and Troubleshooting IBGP in Transit AS
  • Network Design with Route Reflectors
  • Limiting the Number of Prefixes Received from a BGP Neighbor
  • AS-Path Prepending
  • BGP Peer Group
  • BGP Route Flap Dampening
  • Troubleshooting Routing Issues
  • Scaling BGP

I look forward to seeing you in class!

Best wishes in all of your learning.

Jul
19

Can you solve this puzzle?

R2, R3 and R4 make up the service provider network, with MPLS on all three routers and iBGP on the PE routers. R1 and R5 are the CE routers.

R2 prefers the BGP next hop of 4.4.4.4 for network 5.5.5.0/24 (R5's loopback). R4, at 4.4.4.4, is an iBGP neighbor.

R2#show ip route vrf v | inc 5.5.5.0
B 5.5.5.0 [200/409600] via 4.4.4.4, 00:06:47

Is R2 preferring an iBGP learned route, which has an AD of 200, over an EIGRP route, which would have an AD of 90?

Can you identify why the routing for 5.5.5.0 on the VRF of R2 is using BGP instead of EIGRP?

EIGRP PATH with MPLS

Below are the relevant portions of the configuration, which also can serve as a great review of how to configure MPLS VPNs.
R1, CE router:

R1#show run
interface Loopback0
ip address 1.1.1.1 255.255.255.0
!
interface FastEthernet0/0
ip address 10.1.12.1 255.255.255.0
duplex auto
speed auto
!
interface Serial0/0
ip address 10.1.215.1 255.255.255.0
!

router eigrp 1
network 0.0.0.0
no auto-summary

R2, PE Router:

R2#show run
!
ip vrf v
rd 1:1
route-target export 1:1
route-target import 1:1
!
!
interface Loopback0
ip address 2.2.2.2 255.255.255.255
ip ospf 1 area 0
!
interface FastEthernet0/0
ip vrf forwarding v
ip address 10.1.12.2 255.255.255.0
!
interface FastEthernet0/1
ip address 10.1.23.2 255.255.255.0
ip ospf 1 area 0
mpls ip
!
router eigrp 1
no auto-summary
!
address-family ipv4 vrf v
redistribute bgp 234 metric 1 10000 1 1 1
network 10.1.12.2 0.0.0.0
auto-summary
autonomous-system 1
exit-address-family
!
router ospf 1
log-adjacency-changes
!
router bgp 234
no bgp default ipv4-unicast
bgp log-neighbor-changes
neighbor 4.4.4.4 remote-as 234
neighbor 4.4.4.4 update-source Loopback0
!
address-family vpnv4
neighbor 4.4.4.4 activate
neighbor 4.4.4.4 send-community extended
exit-address-family
!
address-family ipv4 vrf v
redistribute eigrp 1
no synchronization
exit-address-family
!
ip forward-protocol nd
!

R3, P router:

R3#show run

interface Loopback0
ip address 3.3.3.3 255.255.255.255
!
interface FastEthernet0/0
ip address 10.1.34.3 255.255.255.0
mpls ip
!
interface FastEthernet0/1
ip address 10.1.23.3 255.255.255.0
mpls ip
!
router ospf 1
log-adjacency-changes
network 0.0.0.0 255.255.255.255 area 0
!

R4: PE Router

R4#show run
!
ip vrf v
rd 1:1
route-target export 1:1
route-target import 1:1
!
!
interface Loopback0
ip address 4.4.4.4 255.255.255.255
ip ospf 1 area 0
!
interface FastEthernet0/0
ip address 10.1.34.4 255.255.255.0
ip ospf 1 area 0
mpls ip
!
interface FastEthernet0/1
ip vrf forwarding v
ip address 10.1.45.4 255.255.255.0
!
router eigrp 1
no auto-summary
!
address-family ipv4 vrf v
redistribute bgp 234 metric 1 1 1 1 1
network 10.1.45.4 0.0.0.0
auto-summary
autonomous-system 1
exit-address-family
!
router ospf 1
log-adjacency-changes
!
router bgp 234
no bgp default ipv4-unicast
bgp log-neighbor-changes
neighbor 2.2.2.2 remote-as 234
neighbor 2.2.2.2 update-source Loopback0
!
address-family vpnv4
neighbor 2.2.2.2 activate
neighbor 2.2.2.2 send-community extended
exit-address-family
!
address-family ipv4 vrf v
redistribute eigrp 1
no synchronization
exit-address-family

R5: CE Router

R5#show run
!
interface Loopback0
ip address 5.5.5.5 255.255.255.0
!
interface Serial0/0
ip address 10.1.215.5 255.255.255.0
clock rate 64000
!
interface FastEthernet0/1
ip address 10.1.45.5 255.255.255.0
!
router eigrp 1
network 0.0.0.0
no auto-summary
!

Now for a couple show commands on R1:

R1#show ip route eigrp
5.0.0.0/24 is subnetted, 1 subnets
D 5.5.5.0 [90/435200] via 10.1.12.2, 00:19:08, FastEthernet0/0
10.0.0.0/24 is subnetted, 3 subnets
D 10.1.45.0 [90/307200] via 10.1.12.2, 00:19:08, FastEthernet0/0
R1#

R1#show ip eigrp topology
IP-EIGRP Topology Table for AS(1)/ID(10.1.215.1)

Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
r - reply Status, s - sia Status

P 1.1.1.0/24, 1 successors, FD is 128256
via Connected, Loopback0
P 5.5.5.0/24, 1 successors, FD is 435200
via 10.1.12.2 (435200/409600), FastEthernet0/0
via 10.1.215.5 (2297856/128256), Serial0/0
P 10.1.12.0/24, 1 successors, FD is 281600
via Connected, FastEthernet0/0
P 10.1.45.0/24, 1 successors, FD is 307200
via 10.1.12.2 (307200/281600), FastEthernet0/0
via 10.1.215.5 (2195456/281600), Serial0/0
P 10.1.215.0/24, 1 successors, FD is 2169856
via Connected, Serial0/0
R1#

And some on R2, the PE router:

R2#show ip route vrf v

Routing Table: v
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
ia - IS-IS inter area, * - candidate default, U - per-user static route
o - ODR, P - periodic downloaded static route

Gateway of last resort is not set

1.0.0.0/24 is subnetted, 1 subnets
D 1.1.1.0 [90/409600] via 10.1.12.1, 00:31:48, FastEthernet0/0
5.0.0.0/24 is subnetted, 1 subnets
B 5.5.5.0 [200/409600] via 4.4.4.4, 00:02:34
10.0.0.0/24 is subnetted, 3 subnets
C 10.1.12.0 is directly connected, FastEthernet0/0
B 10.1.45.0 [200/0] via 4.4.4.4, 00:31:48
D 10.1.215.0 [90/2195456] via 10.1.12.1, 00:31:21, FastEthernet0/0

R2#show ip eigrp vrf v topology
IP-EIGRP Topology Table for AS(1)/ID(10.1.12.2) Routing Table: v

Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
r - reply Status, s - sia Status

P 1.1.1.0/24, 1 successors, FD is 409600
via 10.1.12.1 (409600/128256), FastEthernet0/0
P 5.5.5.0/24, 1 successors, FD is 409600
via VPNv4 Sourced (409600/0)
P 10.1.12.0/24, 1 successors, FD is 281600
via Connected, FastEthernet0/0
P 10.1.45.0/24, 1 successors, FD is 281600
via VPNv4 Sourced (281600/0)
P 10.1.215.0/24, 1 successors, FD is 2195456
via 10.1.12.1 (2195456/2169856), FastEthernet0/0
R2#

Take a minute to post your thoughts, and as always, happy studies.

....

It has been a few days, and we have received lots of great ideas.   Thank you.

When R4 receives the routes in VRF v, the EIGRP metrics are copied into BGP extended community attributes, which carry the metric, AS, route type and more. The iBGP updates from R4 to R2 contain all of those attributes. When R2 receives the updates, if the route type is internal (from the EIGRP attributes) and the source EIGRP AS matches the local EIGRP AS we are importing into, it then comes down to the metric to determine the best path.

If we decreased the bandwidth statement on R4 Fa0/1, or used an offset list (2,000,000 more should do the trick) on R5 out Fa0/1 (towards R4), the increase in metric would cause R2 to prefer the path through R1 for 5.5.5.0/24 instead of using the MPLS backbone.

Because these BGP updates carry the cost community attribute, the comparison bypasses the usual administrative distance contest (iBGP AD of 200 versus EIGRP AD of 90) and comes down to metric alone. In that light, another option would be to tell R2 to ignore the cost community, with the BGP router command:

bgp bestpath cost-community ignore

Let's take a look at the results.

Here is the baseline for before any changes:

R2#show ip route vrf v | inc 5.5.5
B 5.5.5.0 [200/409600] via 4.4.4.4, 00:02:29
R2#show ip bgp vpnv4 all 5.5.5.0
BGP routing table entry for 1:1:5.5.5.0/24, version 8
Paths: (1 available, best #1, table v)
Flag: 0x820
Not advertised to any peer
Local
4.4.4.4 (metric 21) from 4.4.4.4 (4.4.4.4)
Origin incomplete, metric 409600, localpref 100, valid, internal, best
Extended Community: RT:1:1 Cost:pre-bestpath:128:409600 0x8800:32768:0
0x8801:1:153600 0x8802:65281:256000 0x8803:65281:1500
mpls labels in/out nolabel/19
R2#

Now we will remove the default behavior:

R2(config)#router bgp 234
R2(config-router)#bgp bestpath cost-community ignore

Cleared BGP sessions and routing tables, and waited a minute before the following show commands:

R2#show ip route vrf v | inc 5.5.5
D 5.5.5.0 [90/2323456] via 10.1.12.1, 00:00:08, FastEthernet0/0
R2#show ip bgp vpnv4 all 5.5.5.0
BGP routing table entry for 1:1:5.5.5.0/24, version 8
Paths: (2 available, best #2, table v)
Flag: 0x820
Advertised to update-groups:
1
Local
4.4.4.4 (metric 21) from 4.4.4.4 (4.4.4.4)
Origin incomplete, metric 409600, localpref 100, valid, internal
Extended Community: RT:1:1 Cost:pre-bestpath:128:409600 0x8800:32768:0
0x8801:1:153600 0x8802:65281:256000 0x8803:65281:1500
mpls labels in/out 20/19
Local
10.1.12.1 from 0.0.0.0 (2.2.2.2)
Origin incomplete, metric 2323456, localpref 100, weight 32768, valid, sourced, best
Extended Community: RT:1:1
Cost:pre-bestpath:128:2323456 (default-2145160191) 0x8800:32768:0
0x8801:1:665600 0x8802:65282:1657856 0x8803:65281:1500
mpls labels in/out 20/nolabel
R2#

After setting it back to defaults, we could then try an offset list on R5 advertising to R4:

R5(config)#router eigrp 1
R5(config-router)#offset-list 0 out 2000000 fastEthernet 0/1

Cleared BGP sessions and routing tables, and waited a minute before the following show commands:

R2#show ip route vrf v | inc 5.5.5
D 5.5.5.0 [90/2323456] via 10.1.12.1, 00:06:28, FastEthernet0/0
R2#show ip bgp vpnv4 all 5.5.5.0
BGP routing table entry for 1:1:5.5.5.0/24, version 12
Paths: (1 available, best #1, table v)
Flag: 0x820
Advertised to update-groups:
1
Local
10.1.12.1 from 0.0.0.0 (2.2.2.2)
Origin incomplete, metric 2323456, localpref 100, weight 32768, valid, sourced, best
Extended Community: RT:1:1
Cost:pre-bestpath:128:2323456 (default-2145160191) 0x8800:32768:0
0x8801:1:665600 0x8802:65282:1657856 0x8803:65281:1500
mpls labels in/out 31/nolabel
R2#

After resetting all that, implementing the following on R4, and then clearing BGP and routing, we issue the show commands again.

R4(config)#int fa 0/1
R4(config-if)#bandwidth 100

R2#show ip route vrf v | inc 5.5.5
D 5.5.5.0 [90/2323456] via 10.1.12.1, 00:00:05, FastEthernet0/0
R2#show ip bgp vpnv4 all 5.5.5.0
BGP routing table entry for 1:1:5.5.5.0/24, version 20
Paths: (1 available, best #1, table v)
Flag: 0x820
Advertised to update-groups:
1
Local
10.1.12.1 from 0.0.0.0 (2.2.2.2)
Origin incomplete, metric 2323456, localpref 100, weight 32768, valid, sourced, best
Extended Community: RT:1:1
Cost:pre-bestpath:128:2323456 (default-2145160191) 0x8800:32768:0
0x8801:1:665600 0x8802:65282:1657856 0x8803:65281:1500
mpls labels in/out 23/nolabel
R2#

Thanks again to all who contributed. I encourage all RS candidates to lab this up, as well as practice MPLS with OSPF at the CEs.

Marcel posted a comment reminding us of an excellent document written by Petr on this topic and more. The original post from Petr, which includes the link to the free PDF of this document, may be found by clicking here. Thanks Marcel!
