Mar 01

One of the most anticipated video series in INE history is now available in our streaming library - Cisco’s Application Centric Infrastructure (ACI) Part 1 - Network Centric Mode!

This course is part of our new CCIE Data Center v2 Advanced Technologies Series, which also currently includes several other newly released courses.

Access to these courses and more is now available through INE’s All Access Pass subscription.

The DCv2 Advanced Technologies Series also has additional upcoming courses scheduled that include:

  • Application Centric Infrastructure (ACI) Part 2 - Application Centric Mode
  • Nexus Overlay Transport Virtualization (OTV)
  • Locator/ID Separation Protocol (LISP) on Nexus NX-OS
  • Storage Area Network (SAN) Switching on Nexus NX-OS
  • Cisco Unified Computing System (UCS)
  • Nexus NX-OS Security
  • Quality of Service (QoS) on Nexus NX-OS
  • Network Services & Management on Nexus NX-OS
  • Automation and Orchestration with Nexus NX-OS

In addition to the video courses, INE’s CCIE DCv2 Lab Workbook is currently available in beta testing, along with our CCIE DCv2 Rack Rentals. The public rack rental scheduler will be posted shortly, and a separate announcement will be posted about its availability and how to use it.

Happy Studying!

Aug 17

Edit: For those of you that want to take a look first-hand at these packets, the Wireshark PCAP files referenced in this post can be found here.

One of the hottest topics in networking today is Data Center Virtualized Workload Mobility (VWM). For those of you that have been hiding under a rock for the past few years, workload mobility basically means the ability to dynamically and seamlessly reassign hardware resources to virtualized machines, often between physically disparate locations, while keeping this transparent to the end users. This is often accomplished through VMware vMotion, which allows for live migration of virtual machines between sites, or as similarly implemented in Microsoft’s Hyper-V and Citrix’s Xen hypervisors.

One of the typical requirements of workload mobility is that the hardware resources used must be on the same layer 2 network segment. For example, the VMware host machines must be in the same IP subnet and VLAN in order to allow for live migration of their VMs. The big design challenge then becomes: how do we allow for live migrations of VMs between Data Centers that are not in the same layer 2 network? One solution to this problem that Cisco has devised is a relatively new technology called Overlay Transport Virtualization (OTV).

As a side result of preparing for INE’s upcoming CCIE Data Center Nexus Bootcamp I’ve had the privilege (or punishment depending on how you look at it ;) ) of delving deep into the OTV implementation on Nexus 7000. My goal was to find out exactly what was going on behind the scenes with OTV. The problem I ran into though was that none of the external Cisco documentation, design guides, white papers, Cisco Live presentations, etc. really contained any of this information. The only thing that is out there on OTV is mainly marketing info, i.e. buzzword bingo, or very basic config snippets on how to implement OTV. In this blog post I’m going to discuss the details of my findings about how OTV actually works, with the most astonishing of these results being that OTV is in fact, a fancy GRE tunnel.

From a high level overview, OTV is basically a layer 2 over layer 3 tunneling protocol. In essence OTV accomplishes the same goal as other L2 tunneling protocols such as L2TPv3, Any Transport over MPLS (AToM), or Virtual Private LAN Services (VPLS). For OTV specifically this goal is to take Ethernet frames from an end station, like a virtual machine, encapsulate them inside IPv4, transport them over the Data Center Interconnect (DCI) network, decapsulate them on the other side, and out pops your original Ethernet frame.

For this specific application OTV has some inherent benefits over other designs such as MPLS L2VPN with AToM or VPLS. The first of these is that OTV is transport agnostic. As long as there is IPv4 connectivity between Data Centers, OTV can be used. AToM and VPLS both require that the transport network be MPLS aware, which can limit your selection of Service Providers for the DCI. OTV, on the other hand, can technically be used over any regular Internet connectivity.

Another advantage of OTV is that provisioning is simple. AToM and VPLS are Provider Edge (PE) side protocols, while OTV is a Customer Edge (CE) side protocol. This means that for AToM and VPLS the Service Provider has to pre-provision the pseudowires. Even though VPLS supports enhancements like BGP auto-discovery, provisioning MPLS L2VPN still requires administrative overhead. OTV is much simpler in this case because, as we’ll see shortly, the configuration is just a few commands that are controlled by the CE router, not the PE router.

The next thing we have to consider with OTV is how exactly this layer 2 tunneling is accomplished. After all, we could just configure static GRE tunnels on our DCI edge routers and bridge over them, but this is probably not the best design option for either control plane or data plane scalability.

The way that OTV implements the control plane portion of its layer 2 tunnel is what is sometimes described as “MAC in IP Routing”. Specifically OTV uses Intermediate System to Intermediate System (IS-IS) to advertise the VLAN and MAC address information of the end hosts over the Data Center Interconnect. For those of you that are familiar with IS-IS, immediately this should sound suspect. After all, IS-IS isn’t an IP protocol, it’s part of the legacy OSI stack. This means that IS-IS is directly encapsulated over layer 2, unlike OSPF or EIGRP which ride over IP at layer 3. How then can IS-IS be encapsulated over the DCI network that is using IPv4 for transport? The answer? A fancy GRE tunnel.

The next portion that is significant about OTV’s operation is how it actually sends packets in the data plane. Assuming for a moment that the control plane “just works”, and the DCI edge devices learn about all the MAC addresses and VLAN assignments of the end hosts, how do we actually encapsulate layer 2 Ethernet frames inside of IP to send over the DCI? What if there is multicast traffic that is running over the layer 2 network? Also what if there are multiple sites reachable over the DCI? How does it know specifically where to send the traffic? The answer? A fancy GRE tunnel.

Next I want to introduce the specific topology that will be used for us to decode the details of how OTV is working behind the scenes. Within the individual Data Center sites, the layer 2 configuration and physical wiring is not relevant to our discussion of OTV. Assume simply that the end hosts have layer 2 connectivity to the edge routers. Additionally assume that the edge routers have IPv4 connectivity to each other over the DCI network. In this specific case I chose to use RIPv2 for routing over the DCI (yes, you read that correctly), simply so I could filter it from my packet capture output, and easily differentiate between the routing control plane in the DCI transport network vs. the routing control plane that was tunneled inside OTV between the Data Center sites.

What we are mainly concerned with in this topology is as follows:

  • OTV Edge Devices N7K1-3 and N7K2-7
    • These are the devices that actually encapsulate the Ethernet frames from the end hosts into the OTV tunnel. I.e. this is where the OTV config goes.
  • DCI Transport Device N7K2-8
    • This device represents the IPv4 transit cloud between the DC sites. From this device’s perspective it sees only the tunnel encapsulated traffic, and does not know the details about the hosts inside the individual DC sites. Additionally this is where packet capture is occurring so we can view the actual payload of the OTV tunnel traffic.
  • End Hosts R2, R3, Server 1, and Server 3
    • These are the end devices used to generate data plane traffic that ultimately flows over the OTV tunnel.

Now let’s look at the specific configuration on the edge routers that is required to form the OTV tunnel.

N7K1-3:
vlan 172
name OTV_EXTEND_VLAN
!
vlan 999
name OTV_SITE_VLAN
!
spanning-tree vlan 172 priority 4096
!
otv site-vlan 999
otv site-identifier 0x101
!
interface Overlay1
otv join-interface Ethernet1/23
otv control-group 224.100.100.100
otv data-group 232.1.2.0/24
otv extend-vlan 172
no shutdown
!
interface Ethernet1/23
ip address 150.1.38.3/24
ip igmp version 3
ip router rip 1
no shutdown

N7K2-7:
vlan 172
name OTV_EXTEND_VLAN
!
vlan 999
name OTV_SITE_VLAN
!
spanning-tree vlan 172 priority 4096
!
otv site-vlan 999
otv site-identifier 0x102
!
interface Overlay1
otv join-interface port-channel78
otv control-group 224.100.100.100
otv data-group 232.1.2.0/24
otv extend-vlan 172
no shutdown
!
interface port-channel78
ip address 150.1.78.7/24
ip igmp version 3
ip router rip 1

As you can see the configuration for OTV really isn’t that involved. The specific portions of the configuration that are relevant are as follows:

  • Extend VLANs
    • These are the layer 2 segments that will actually get tunneled over OTV. Basically these are the VLANs that your virtual machines reside on that you want to do VM mobility between. In our case this is VLAN 172, which maps to the IP subnet 172.16.0.0/24.
  • Site VLAN
    • Used to synchronize the Authoritative Edge Device (AED) role within an OTV site. This comes into play when you have more than one edge router per site. OTV only allows a specific Extend VLAN to be tunneled by one edge router at a time for the purpose of loop prevention. Essentially this Site VLAN lets the edge routers talk to each other and figure out which one is active/standby on a per-VLAN basis for the OTV tunnel. The Site VLAN should not be included in the extend VLAN list.
  • Site Identifier
    • Should be unique per DC site. If you have more than one edge router per site, they must agree on the Site Identifier, as it’s used in the AED election.
  • Overlay Interface
    • The logical OTV tunnel interface.
  • OTV Join Interface
    • The physical link or port-channel that you use to route upstream towards the DCI.
  • OTV Control Group
    • Multicast address used to discover the remote sites in the control plane. (A unicast-only alternative using the Adjacency Server is sketched after this list.)
  • OTV Data Group
    • Used when you’re tunneling multicast traffic over OTV in the data plane.
  • IGMP Version 3
    • Needed to send (S,G) IGMP Report messages towards the DCI network on the Join Interface.
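
If your DCI transport can’t offer multicast at all, NX-OS also supports a unicast-only mode in which one edge device acts as an OTV Adjacency Server (more on that trade-off later in this post). The snippet below is only a rough sketch of that alternative based on the documented command syntax, not something configured in this lab, so treat the exact keywords and addressing as assumptions:

! On the edge device acting as the adjacency server (e.g. N7K1-3)
interface Overlay1
otv adjacency-server unicast-only
!
! On the remaining edge devices, pointing at the server's join-interface address
interface Overlay1
otv use-adjacency-server 150.1.38.3 unicast-only

In this mode the control-group and data-group commands are no longer needed, since discovery and flooding are handled over unicast instead of IP multicast.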

At this point that’s basically all that’s involved in the implementation of OTV. It “just works”, because all the behind the scenes stuff is hidden from us from a configuration point of view. A quick test of this from the end hosts shows us that:

R2#ping 255.255.255.255
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 255.255.255.255, timeout is 2 seconds:

Reply to request 0 from 172.16.0.3, 4 ms
Reply to request 1 from 172.16.0.3, 1 ms
Reply to request 2 from 172.16.0.3, 1 ms
Reply to request 3 from 172.16.0.3, 1 ms
Reply to request 4 from 172.16.0.3, 1 ms

R2#traceroute 172.16.0.3
Type escape sequence to abort.
Tracing the route to 172.16.0.3
VRF info: (vrf in name/id, vrf out name/id)
1 172.16.0.3 0 msec * 0 msec

The fact that R3 responds to R2’s packets going to the all hosts broadcast address (255.255.255.255) implies that they are in the same broadcast domain. How specifically is it working though? That’s what took a lot of further investigation.
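
Before digging into packet captures, it’s worth noting that the state of the overlay itself can be checked directly on the edge devices. The following is just a sketch of the relevant NX-OS show commands (output omitted here, and exact keywords can vary slightly between releases):

show otv overlay
show otv adjacency
show otv vlan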

To simplify the packet level verification a little further, I changed the MAC address of the four end devices that are used to generate the actual data plane traffic. The Device, IP address, and MAC address assignments are as follows:

The first thing I wanted to verify in detail was what the data plane looked like, and specifically what type of tunnel encapsulation was used. With a little searching I found that OTV is currently on the IETF standards track in draft format. As of writing, the newest draft is draft-hasmit-otv-03. Section 3.1 Encapsulation states:

3.  Data Plane

3.1. Encapsulation

The overlay encapsulation format is a Layer-2 ethernet frame
encapsulated in UDP inside of IPv4 or IPv6.

The format of OTV UDP IPv4 encapsulation is as follows:

1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| IHL |Type of Service| Total Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identification |Flags| Fragment Offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Time to Live | Protocol = 17 | Header Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source-site OTV Edge Device IP Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Destination-site OTV Edge Device (or multicast) Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port = xxxx | Dest Port = 8472 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| UDP length | UDP Checksum = 0 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R| Overlay ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Instance ID | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| Frame in Ethernet or 802.1Q Format |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A quick PING sweep of packet lengths with the Don’t Fragment bit set allowed me to find the encapsulation overhead, which turns out to be 42 bytes (the largest payload that made it across was 1458 bytes, and 1458 + 42 = 1500), as seen below:

R3#ping 172.16.0.2 size 1459 df-bit 

Type escape sequence to abort.
Sending 5, 1459-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:
Packet sent with the DF bit set
.....
Success rate is 0 percent (0/5)

R3#ping 172.16.0.2 size 1458 df-bit

Type escape sequence to abort.
Sending 5, 1458-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms

None of my testing, however, could verify what the encapsulation header actually was. The draft says that the transport is supposed to be UDP port 8472, but none of my logging produced results showing that any UDP traffic was even in the transit network (save for my RIPv2 routing ;) ). After much frustration, I finally broke out the sniffer and took some packet samples. The first capture below shows a normal ICMP ping between R2 and R3.

MPLS? GRE? Where did those come from? That’s right, OTV is in fact a fancy GRE tunnel. More specifically it is an Ethernet over MPLS over GRE tunnel. My poor little PINGs between R2 and R3 are in fact encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet (IoIoEoMPLSoGREoIP for short). Let’s take a closer look at the encapsulation headers now:

In the detailed header output we see our transport Ethernet header, which in a real deployment can be anything depending on what the transport of your DCI is (Ethernet, POS, ATM, Avian Carrier, etc.) Next we have the IP OTV tunnel header, which surprised me in a few aspects. First, all documentation I read said that without the use of an OTV Adjacency Server, unicast can’t be used for transport. This is true... up to a point. Multicast it turns out is only used to establish the control plane, and to tunnel multicast over multicast in the data plane. Regular unicast traffic over OTV will be encapsulated as unicast, as seen in this capture.

The next header after IP is GRE. In other words, OTV is basically the same as configuring static GRE tunnels between the edge routers and then bridging over them, along with some enhancements (hence fancy GRE). The OTV enhancements (which we’ll talk about shortly) are the reason why you wouldn’t just configure GRE statically. Nevertheless this surprised me because even in hindsight the only mention of OTV using GRE I found was here. What’s really strange about this is that Cisco’s OTV implementation doesn’t follow what the standards track draft says, which is UDP, even though the authors of the OTV draft are Cisco engineers. Go figure.

The next header, MPLS, makes sense since the prior encapsulation is already GRE. Ethernet over MPLS over GRE is already well defined and used in deployment, so there’s no real reason to reinvent the wheel here. I haven’t verified this in detail yet, but I’m assuming that the MPLS label value would be used in cases where the edge router has multiple overlay interfaces, in which case the label in the data plane would quickly tell it which overlay interface the incoming packet is destined for. This logic is similar to MPLS L3VPN, where the bottom-of-stack VPN label tells a PE router which CE facing link the packet is ultimately destined for. I’m going to do some more testing later with a larger, more complex topology to actually verify this, though, since all data plane traffic over this tunnel currently shares the same MPLS label value.

Next we see the original Ethernet header, which is sourced from R2’s MAC address 0000.0000.0002 and going to R3’s MAC address 0000.0000.0003. Finally we have the original IP header and the final ICMP payload. The key with OTV is that this inner Ethernet header and its payload remain untouched, so it looks like from the end host perspective that all the devices are just on the same LAN.

Now that it was apparent that OTV was just a fancy GRE tunnel, the IS-IS piece fell into place. Since IS-IS runs directly over layer 2 (e.g. Ethernet), and OTV is an Ethernet over MPLS over GRE tunnel, then IS-IS can encapsulate as IS-IS over Ethernet over MPLS over GRE (phew!). To test this, I changed the MAC address of one of the end hosts, and looked at the IS-IS LSP generation of the edge devices. After all the goal of the OTV control plane is to use IS-IS to advertise the MAC addresses of end hosts in that particular site, as well as the particular VLAN that they reside in. The configuration steps and packet capture result of this are as follows:

R3#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#int gig0/0
R3(config-if)#mac-address 1234.5678.9abc
R3(config-if)#
*Aug 17 22:17:10.883: %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to reset
*Aug 17 22:17:11.883: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to down
*Aug 17 22:17:16.247: %LINK-3-UPDOWN: Interface GigabitEthernet0/0, changed state to up
*Aug 17 22:17:17.247: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to up

The first thing I noticed about the IS-IS encoding over OTV is that it uses IPv4 Multicast. This makes sense, because if you have 3 or more OTV sites you don’t want to have to send your IS-IS LSPs as replicated Unicast. As long as all of the AEDs on all sites have joined the control group (224.100.100.100 in this case), the LSP replication should be fine. This multicast forwarding can also be verified in the DCI transport network core in this case as follows:

N7K2-8#show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 224.100.100.100/32), uptime: 20:59:33, ip pim igmp
Incoming interface: Null, RPF nbr: 0.0.0.0
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, igmp
Ethernet1/29, uptime: 20:58:53, igmp

(150.1.38.3/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib
Incoming interface: Ethernet1/29, RPF nbr: 150.1.38.3
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, mrib
Ethernet1/29, uptime: 20:58:53, mrib, (RPF)

(150.1.78.7/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib
Incoming interface: port-channel78, RPF nbr: 150.1.78.7
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, mrib, (RPF)
Ethernet1/29, uptime: 20:58:53, mrib

(*, 232.0.0.0/8), uptime: 21:00:05, pim ip
Incoming interface: Null, RPF nbr: 0.0.0.0
Outgoing interface list: (count: 0)

Note that N7K1-3 (150.1.38.3) and N7K2-7 (150.1.78.7) have both joined the (*, 224.100.100.100). A very important point about this is that the control group for OTV is an Any Source Multicast (ASM) group, not a Source Specific Multicast (SSM) group. This implies that your DCI transit network must run PIM Sparse Mode and have a Rendezvous Point (RP) configured in order to build the shared tree (RPT) for the OTV control group used by the AEDs. You technically could use Bidir, but you really wouldn’t want to for this particular application.

The way they chose to implement this surprised me a bit, because there are already more efficient ways of doing source discovery for SSM, for example how Multicast MPLS L3VPN uses the BGP AFI/SAFI Multicast MDT to advertise the (S,G) pairs of the PE routers. I suppose the advantage of doing OTV this way is that it keeps the OTV config very straightforward from an implementation point of view on the AEDs, and you don’t need an extra control plane protocol like BGP to exchange the (S,G) pairs before you actually join the tree. The alternative to this of course is to use the Adjacency Server and skip using multicast altogether. This however will result in head-end unicast replication into the core, which can be bad, mkay?
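
Since the transit device’s multicast configuration isn’t shown anywhere above, here is a rough sketch of what the PIM side of N7K2-8 might look like to support this. The RP address and its placement are purely illustrative assumptions on my part, and the 232.0.0.0/8 SSM range shown is simply the NX-OS default:

feature pim
!
ip pim rp-address 150.1.88.8 group-list 224.0.0.0/4
ip pim ssm range 232.0.0.0/8
!
interface Ethernet1/29
ip pim sparse-mode
!
interface port-channel78
ip pim sparse-mode

With sparse mode on the DCI-facing interfaces and an RP reachable for the ASM control group, the IGMPv3 reports sent by the edge devices on their join interfaces take care of the rest.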

Also, for added fun, the actual MAC address routing table built by the IS-IS control plane can be verified as follows:

N7K2-7# show otv route

OTV Unicast MAC Routing Table For Overlay1

VLAN MAC-Address Metric Uptime Owner Next-hop(s)
---- -------------- ------ -------- --------- -----------
172 0000.0000.0002 1 01:22:06 site port-channel27
172 0000.0000.0003 42 01:20:51 overlay N7K1-3
172 0000.0000.000a 42 01:18:11 overlay N7K1-3
172 0000.0000.001e 1 01:20:36 site port-channel27
172 1234.5678.9abc 42 00:19:09 overlay N7K1-3

N7K2-7# show otv isis database detail | no-more
OTV-IS-IS Process: default LSP database VPN: Overlay1

OTV-IS-IS Level-1 Link State Database
LSPID Seq Number Checksum Lifetime A/P/O/T
N7K2-7.00-00 * 0x000000A3 0xA36A 893 0/0/0/1
Instance : 0x000000A3
Area Address : 00
NLPID : 0xCC 0x8E
Hostname : N7K2-7 Length : 6
Extended IS : N7K1-3.01 Metric : 40
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.001e
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.0002
Digest Offset : 0
N7K1-3.00-00 0x00000099 0xBAA4 1198 0/0/0/1
Instance : 0x00000094
Area Address : 00
NLPID : 0xCC 0x8E
Hostname : N7K1-3 Length : 6
Extended IS : N7K1-3.01 Metric : 40
Vlan : 172 : Metric : 1
MAC Address : 1234.5678.9abc
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.000a
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.0003
Digest Offset : 0
N7K1-3.01-00 0x00000090 0xCBAB 718 0/0/0/1
Instance : 0x0000008E
Extended IS : N7K2-7.00 Metric : 0
Extended IS : N7K1-3.00 Metric : 0
Digest Offset : 0

So at this point we’ve seen that our ICMP PING was actually ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet, and our routing protocol was IS-IS over Ethernet over MPLS over GRE over IP over Ethernet :/ What about multicast in the data plane though? At this point verification of multicast over the DCI core is pretty straightforward, since we can just enable a routing protocol that uses multicast, like EIGRP, and look at the result. This can be seen below:

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#router eigrp 1
R2(config-router)#no auto-summary
R2(config-router)#network 0.0.0.0
R2(config-router)#end
R2#

R3#config t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#router eigrp 1
R3(config-router)#no auto-summary
R3(config-router)#network 0.0.0.0
R3(config-router)#end
R3#
*Aug 17 22:39:43.419: %SYS-5-CONFIG_I: Configured from console by console
*Aug 17 22:39:43.423: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.0.2 (GigabitEthernet0/0) is up: new adjacency

R3#show ip eigrp neighbors
IP-EIGRP neighbors for process 1
H Address Interface Hold Uptime SRTT RTO Q Seq
(sec) (ms) Cnt Num
0 172.16.0.2 Gi0/0 11 00:00:53 1 200 0 1

Our EIGRP adjacency came up, so multicast obviously is being tunneled over OTV. Let’s see the packet capture result:

We can see EIGRP being tunneled inside the OTV payload, but what’s with the outer header? Why is EIGRP using the ASM 224.100.100.100 group instead of the SSM 232.1.2.0/24 data group? My first guess was that link-local multicast (i.e. 224.0.0.0/24) gets encapsulated as control plane traffic instead of as data plane traffic. This would make sense, because you would want control plane protocols like OSPF, EIGRP, PIM, etc. tunneled to all OTV sites, not just the ones that joined the SSM feeds. To test if this was the case, the only change I needed to make was to have one router join a non-link-local multicast group, and have the other router send ICMP pings. Since they’re effectively in the same LAN segment, no PIM routing is needed in the DC sites, just basic IGMP Snooping, which is enabled in NX-OS by default. The config on the IOS routers is as follows:

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#ip multicast-routing
R2(config)#int gig0/0
R2(config-if)#ip igmp join-group 224.10.20.30
R2(config-if)#end
R2#

R3#ping 224.10.20.30 repeat 1000 size 1458 df-bit

Type escape sequence to abort.
Sending 1000, 1458-byte ICMP Echos to 224.10.20.30, timeout is 2 seconds:
Packet sent with the DF bit set

Reply to request 0 from 172.16.0.2, 1 ms
Reply to request 1 from 172.16.0.2, 1 ms
Reply to request 2 from 172.16.0.2, 1 ms

The packet capture result was as follows:

This was more like what I expected. Now the multicast data plane packet was getting encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP *multicast* over Ethernet, this time using the OTV data group. The payload wasn’t decoded, as I think even Wireshark was dumbfounded by this string of encapsulations.

In summary we can make the following observations about OTV:

  • OTV encapsulation has 42 bytes of overhead (with an MTU implication sketched after this list) that consists of:
    • New Outer Ethernet II Header - 14 Bytes
    • New Outer IP Header - 20 Bytes
    • GRE Header - 4 Bytes
    • MPLS Header - 4 Bytes
  • OTV uses both Unicast and Multicast transport
    • ASM Multicast is used to build the OTV control plane (IS-IS) and to carry flooded traffic such as ARP, IGMP, EIGRP, etc.
    • Unicast is used for normal unicast data plane transmission between sites
    • SSM Multicast is used for normal multicast data plane transmission between sites
    • Optionally ASM & SSM can be replaced with the Adjacency Server
  • GRE is the ultimate band-aid of networking
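
One practical consequence of that 42-byte overhead is MTU. The DF-bit sweep earlier is consistent with the edge devices not fragmenting the encapsulated frames, so if your hosts need a full 1500-byte payload end to end, the join interfaces and the DCI transport links need their MTU raised by at least 42 bytes. A minimal sketch, with the exact value and interfaces being illustrative (many deployments simply enable jumbo frames along the whole transport path):

interface Ethernet1/23
mtu 1542
!
interface port-channel78
mtu 1542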

Now the next time someone is throwing around fancy buzzwords about OTV, DCI, VWM, etc. you can say “oh, you mean that fancy GRE tunnel”? ;)

I’ll be continuing this series in the coming days and weeks on other Data Center and specifically CCIE Data Center related technologies. If you have a request for a specific topic or protocol that you’d like to see the behind-the-scenes details of, drop me a line at bmcgahan@ine.com.

Happy Labbing!

Feb 15

Introduction

Recently, there have been discussions going around about Cisco’s new data-center technology – Overlay Transport Virtualization (OTV), implemented in Nexus 7K data-center switches (limited demo deployments only). The purpose of this technology is connecting separated data-center islands over a generic packet-switched network. It is said that OTV is a better solution compared to the well-known VPLS, or any other Layer 2 VPN technology. In this post we are going to give a brief comparison of the two technologies and see what benefits OTV may actually bring to data centers.

VPLS Overview

We are going to give a rather condensed overview of VPLS functionality here, just to have a baseline to compare OTV with. The reader is assumed to have a solid understanding of MPLS and Layer 2 VPNs, as technology fundamentals are not described here.

[Image: otv-blog-post-vpls]

  • VPLS provides multipoint LAN services by extending a LAN cloud over a packet-switched network. MPLS is used as the primary transport for tunneling Ethernet frames; however, it could be replaced with any suitable tunneling solution, such as GRE or L2TPv3, that runs over a generic packet-switched network. That is to say, VPLS could be made transport agnostic if required. The main benefit of label-switched paths is the ability to leverage MPLS-TE, which is very important to service providers.
  • The core of VPLS functionality is the process of establishing a full mesh of pseudowires between all participating PEs. If a VPLS deployment has N PEs, every PE needs to allocate and advertise N-1 different labels to the N-1 remote PEs, which they should use when encapsulating L2 packets sent to the advertising PE. This is required in order for the advertising PE to be able to distinguish frames coming from different remote PEs and properly perform MAC-address learning. (A minimal configuration sketch appears after this list.)
  • There are two main methods for MPLS tunnel signaling: a full mesh of LDP sessions and a full mesh of BGP peerings. Notice that the LDP-based standard does not specify any auto-discovery technique, and one could be selected at the vendor’s discretion (e.g. RADIUS). The BGP-based signaling technique allows for auto-discovery and automatic label allocation using BGP multiprotocol extensions.
  • VPLS utilizes the same data-plane based learning that "classic" Ethernet uses. When a frame is received over a pseudowire, its source MAC address is associated with the “virtual port” and corresponding MPLS label. When a multicast frame or a frame with an unknown unicast destination address is received, it is flooded out of all ports, including the pseudowires (virtual ports).
  • The full mesh of pseudowires allows for simplified forwarding in the VPLS core network. Instead of running STP to block redundant connections, a split-horizon data-plane forwarding rule is used: a frame received on a pseudowire is forwarded only out of physical ports, and not out of any other pseudowires.
  • VPLS does not facilitate control-plane address learning, but may facilitate some special signaling to explicitly withdraw MAC addresses from remote PEs when a topology change is detected on the local site. This is especially important in multi-homed scenarios, where MAC address mobility could be a reality.
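
To make the notion of a manually provisioned full mesh of pseudowires more concrete, here is a rough sketch of what an LDP-signaled VPLS instance looks like on a classic IOS-style PE. The VFI name, VPN ID, and peer addresses are made up purely for illustration:

l2 vfi DC_EXTEND manual
vpn id 172
neighbor 10.0.0.2 encapsulation mpls
neighbor 10.0.0.3 encapsulation mpls
!
interface Vlan172
xconnect vfi DC_EXTEND

Each neighbor statement corresponds to one pseudowire, which is exactly the per-PE provisioning burden that BGP auto-discovery (and, at the customer edge, OTV) tries to avoid.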

VPLS Limitations

The following is the list of VPLS limitations, which are inherently rooted in the Ethernet technology:

  • Data-plane learning and flooding are still in use. This fundamental property is what makes Ethernet a plug-and-play technology, but it results in poor scalability. Essentially, a single LAN segment is one fault domain, as any change in the topology requires re-flooding of frames and generates sudden bursts of traffic. A topology change in a single site may propagate to other sites across the VPLS core (by use of signaling extensions) and cause MAC address re-learning and flooding.
  • Ethernet addressing is inherently flat and non-hierarchical. MAC addresses serve the purpose of identifying endpoints, not pointing to their locations. This makes troubleshooting large Ethernet topologies very challenging. In addition, MAC addresses are non-summarizable, which, coupled with the learning and flooding behavior, results in uncontrolled address-table growth for all devices in the Ethernet domain.
  • Spanning Tree is still used in CE domains, which results in slower convergence and suboptimal bandwidth usage. Since STP blocks all redundant links, their bandwidth is effectively wasted. Plus, the links close to the root bridge are more congested than the STP leaves. The use of MSTP or PVST could alleviate this problem, but traffic engineering becomes complicated due to multiple logical topologies.
  • If a customer site is multihomed to the VPLS core network, STP needs to be running across the core to ensure blocking of redundant paths, as VPLS does not detect this itself. This further worsens convergence times and affects stability. There is an alternative, not yet standardized, approach for selecting a designated forwarder for multi-homed sites, which does not rely on STP.
  • VPLS does not have native multicast replication support. Multicast frames are flooded out of all pseudowires, even though IGMP snooping may reduce the amount of flooding on local physical ports. Ineffective PE-based multicast replication limits the use of heavyweight multicast applications in VPLS.

Of course, the community came up with some solutions to the above problems. Firstly, the problem of address-table growth could be alleviated using MAC-in-MAC (IEEE 802.1ah) encapsulation for customer frames. In this solution, PE devices only have to learn other PEs’ MAC addresses, or the MAC-in-MAC stacking devices could be pushed down to the customer network. Even simpler, CE devices could be replaced with routers, thus exposing only a single CE MAC address to the VPLS cloud. Next, there is a lot of work being done on multicast optimization in VPLS. The root cause lies in adapting the main underlying transport – MPLS label switching – to effectively handle multicast traffic. The solution uses point-to-multipoint LSPs and either M-LDP or RSVP-TE extensions to signal them. It is not yet completely standardized and widely adopted, but the work is definitely in progress. However, the main scaling factor of Ethernet – a topology-agnostic control plane with uncontrolled data-plane learning – is not addressed in the VPLS extensions so far.

As an alternative to using sophisticated M-LSPs, the multicast problem could be “resolved” by co-locating an additional Layer 3 VPN topology and running Layer 3 mVPNs. A “proxy” PIM router deployed at every site would process the IGMP messages and signal multicast trees in the core network. While this solution is not as elegant as using P2MP LSPs and introduces additional operational expense, it nonetheless offers a working alternative.

Overlay Transport Virtualization

Regardless of the loud name, from a technical standpoint OTV looks like nothing more than VPLS stripped of its MPLS transport, with optimized multicast handling similar to that used in Draft Rosen Layer 3 mVPNs. There are some Ethernet flooding behavior improvements, but those are questionable. Let’s see how OTV works. Notice that this time we use the notion of CE devices, not PE, as OTV is a technology to be deployed at the customer’s edge.

[Image: otv-blog-post-otv]

  • CE devices connect to customer Ethernet clouds and face the provider packet-switched network (or enterprise network core) using a Layer 3 interface. Every CE device creates an overlay tunnel interface that is very similar to the MDT Tunnel interface specified in Draft Rosen. Only IP transport is required in the core, and thus the technology does not depend on MPLS transport.
  • Just like with Rosen’s mVPNs, all CEs need to join a P (provider) multicast group to discover other CEs over the overlay tunnel. The CEs then establish a full mesh of IS-IS adjacencies with other CEs on the overlay tunnel. This could be thought of as the equivalent of the full mesh of control-plane links in VPLS. The provider multicast group is normally an ASM or Bidir group.
  • IS-IS nodes (CEs) flood LSP information, including the MAC addresses known via attached physical Ethernet ports, to all other CEs. This is possible due to the flexible TLV structure found in IS-IS LSPs. The same LSP flooding could be used to remove or unlearn a MAC address if the local CE finds it unavailable.
  • Ethernet forwarding follows the normal rules, but GRE encapsulation (or any other IP tunneling) is used when sending Ethernet frames over the provider IP cloud to a remote CE. Notice that GRE packets received on overlay tunnel interfaces do NOT result in MAC-address learning for the encapsulated frames. Furthermore, unknown unicast frames are NOT flooded out of the overlay tunnel interface – it is assumed that all remote MAC addresses are signaled via the control plane and should be known.
  • Multicast Ethernet frames are encapsulated using multicast IP packets and forwarded using provider multicast services. Furthermore, it is possible to specify a set of “selective” or “data” multicast groups that are used for flooding specific multicast flows, just like in mVPNs. Upon reception of an IGMP join, a CE will snoop on it (similar to classic IGMP snooping) and translate it into a PIM join for a core multicast group. All CEs receiving IGMP reports for the same group will join the same core multicast tree and form an optimal multicast distribution structure in the core. The actual multicast flow frames will then get flooded down the selective multicast tree to the participating nodes only. Notice one important difference from mVPNs – the CE devices are not multicast routers; they are effectively virtual switches performing IGMP snooping and extended signaling into the provider core.
  • OTV handles multi-homed scenarios properly, without running STP on top of the overlay tunnel. If two CEs share the same logical backdoor link (i.e. they hear each other’s IS-IS hello packets over the link), one of the devices is elected as the appointed (aka authoritative) forwarder for the given link (e.g. VLAN). Only this device actually floods and forwards the frames on the given segment, thus eliminating the Layer 2 forwarding loop. This concept is very similar to electing a PIM Assert winner on a shared link. Notice that this approach is similar to the VPLS draft proposal for multihoming, but uses IGP signaling instead of BGP.
  • OTV supports ARP optimization in order to reduce the amount of broadcast traffic flooded across the overlay tunnel. Every CE may snoop on local ARP replies and use the IS-IS extensions to signal IP-to-MAC bindings to remote nodes. Every CE will then attempt to respond to an ARP request using its local cache, instead of forwarding the ARP packet over the core network. This does not eliminate ARP, it just reduces the amount of broadcast flooding.

Now for some conclusions. OTV claims to be better than VPLS, but this could be argued. To begin with, VPLS is positioned as a provider-edge technology while OTV is a customer-edge technology. Next, the following list captures similarities and differences between the two technologies:

  • The same logical full mesh of signaling is used in the core. IS-IS is outlined in the patent document, but any other protocol could obviously be used here, e.g. LDP or BGP. Even the patent document mentions that. What was the reason for reinventing the wheel? The answer could be “TRILL”, as we see in the following section. But so far, switching to new signaling makes little sense in terms of benefits.
  • OTV runs over native IP and does not require underlying MPLS. Like we said before, it was possible to simply change the VPLS transport to any IP tunneling technique instead of coming up with a new technology. By forgoing MPLS, OTV loses the important ability to signal optimal path selection in provider networks at the PE edge.
  • Control-plane MAC-address learning is said to reduce flooding in the core network. This is indeed accomplished, but at a significant price. Here is the problem: if a topology change in one site is to be propagated to other sites, the control plane must signal the removal of locally learned MAC addresses to the remote sites. Effectively, this translates into data-plane “black-holing” until the MAC addresses are re-learned and signaled again, as OTV does not flood unknown unicast over the IP core. Things are even worse in the control plane. A topology change will flush all MAC addresses known to a CE and result in LSP flooding to all adjacent nodes. The amount of LSP replication could be optimized using IS-IS mesh groups, but at least N copies of the LSP must be sent, where N is the number of adjacencies. As soon as new MACs are learned, additional LSPs will be flooded out to all neighbors! Properly controlling LSP generation, i.e. delaying LSP origination, may help reduce this flooding, but will again result in convergence issues in the data plane.

    To summarize, the price paid for flooding reduction is slower convergence in the presence of topology changes, plus control-plane scalability challenges. The main problem – topology unawareness that leads to the need to re-learn MAC addresses – is not addressed in OTV (yet). However, considering that data-plane flooding in data centers can be very intensive, the amount of control-plane flooding introduced may be acceptable.

  • Optimized multicast support seems to be the only big benefit of OTV that does not come with significant trade-offs. Introducing native multicast is probably due to the fact that VPLS multicasting is still not standardized, while data centers need it now. The multicast solution is a copy of the mVPN model and not something new and exciting, like M-LSPs are. Like we said before, the same idea could be deployed in VPLS scenarios by means of a co-located mVPN. Also, when deployed over SP networks, this feature requires SP multicast support for auto-discovery and optimized forwarding. This is not a major problem though, and OTV has support for unicast static discovery.

To summarize, it looks like OTV is an attempt to fast-track a slightly better “customer” VPLS into data centers, while IETF folks struggle with actual VPLS standardization. The technology is “CE-centric”, in the sense that it does not require any provider intervention, with the exception of providing multicast and L3 services. It is most likely that the OTV and VPLS projects are being carried out by different teams that are time-pressed and thus don’t have the resources to coordinate their efforts and come up with a unified solution. There are no huge improvements so far in terms of Ethernet optimization, except for reduced flooding in the network core, traded for control-plane complexity. In its current form, OTV might look a bit disappointing, unless it is a first step toward implementing TRILL (Transparent Interconnection of Lots of Links) – the new IETF standard for routing bridges.

Meet TRILL

Before we begin, it is worth noting that the IETF TRILL project somewhat parallels the IEEE 802.1aq standard development. Both standards propose replacing STP with a link-state routing protocol. We’ll be discussing mainly RBridges in this post, due to the fact that more open information is available on TRILL. Plus, IETF papers are much easier to read compared to IEEE documents!

As mentioned above, RBridges is short for Routing Bridges, a project pioneered by Radia Perlman of Sun Microsystems. If you remember, she is the person who invented the original (DEC) STP protocol. In the new proposal, all bridges become aware of each other and of the whole topology by using the IS-IS routing protocol. Every bridge has a “nickname” – a 16-bit identifier which addresses the bridge in the global topology (similar to an OSPF router-id). Once the topology is built, the switches operate as follows:

[Image: otv-blog-post-trill]

  • RBridges dynamically learn MAC addresses and associate them with respective VLANs on the customer-facing ports. When a frame needs to be switched out, the RBridge looks up the destination MAC address and finds the remote bridge nickname associated with this MAC. The frame is then encapsulated using three headers (a sketch of the TRILL header follows this list). The intermediate RBridges then use the TRILL header for actual hop-by-hop routing to the egress RBridge. Notice that the outer header contains the MAC addresses of two directly connected RBridges, just like it would in the case of routing. The TRILL header has a hop-count field that is similar to the IP TTL. Every bridge in the path decrements this field and drops the frame if the hop count falls to zero, just like conventional routers do.
  • If the destination address is not known to the ingress RBridge, the frame is flooded in a controlled manner, using a special destination TRILL address. The shortest-path trees constructed from the link-state information are used for flooding, and an RPF check is performed on every RBridge to eliminate transient routing loops. It is important to point out that a transit RBridge on the path of the flooded frame does not learn the inner MAC address unless it has the inner VLAN locally configured. Thus, transit RBridges limit their address table sizes to only the addresses of other RBridges. Every egress RBridge receiving the TRILL-encapsulated frame will learn the inner source MAC address and associate it with the sending RBridge nickname. This allows the egress RBridges to properly route the response frames.
  • Any link failure will result in SPF topology re-computation but will not cause MAC-address table flushes, unless the particular part of the network is completely isolated by the failure. This is due to the fact that destination MAC addresses are not associated with a local port, but with the remote RBridge that has the destination MAC connected. The RBridges will simply recalculate the routes to the destination. This behavior is a direct result of the topology-aware learning process described above. The net result is that topology changes do not have the devastating effect they have on STP-based networks.
  • On a segment that has multiple RBridges connected, one is elected as “appointed forwarder” for every VLAN. Only the appointed forwarder is allowed to send/receive frames from a shared link, all others remain standby. This is required to ensure loop-free forwarding and avoid excessive flooding. This is very similar to the OTV behavior that we described above.
  • In addition to data-plane MAC address learning, RBridges may explicitly propagate MAC addresses found on a locally connected VLAN to all other RBridges. This is an optional feature, but it allows for faster convergence and less flooding in the topology. The protocol is known as ESADI - End Station Address Distribution Information and looks very similar to control-plane learning found in OTV.
  • The use of shortest-path routing allows for equal-cost multipath load balancing and traffic engineering features found in routed IP networks, thus resolving the problem of STP bottlenecks. Additionally, the use of appointed forwarders for every VLAN ensures better load balancing for multi-homed segments – again, a feature copied by OTV.
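
For reference, the TRILL header mentioned in the first bullet above is quite compact. The sketch below follows the field layout from the RBridges drafts (later standardized as RFC 6325); treat the exact bit positions as indicative rather than authoritative:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| V | R |M| Op-Length | Hop Count |   Egress RBridge Nickname   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Ingress RBridge Nickname    | Options (if Op-Length is > 0) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The 6-bit Hop Count is what gives RBridges the TTL-like loop protection mentioned above, and the 16-bit nicknames are what let transit RBridges route on small identifiers instead of on end-host MAC addresses.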

To summarize, TRILL keeps intact Ethernet’s dynamic data-plane learning that made the technology so “plug-and-play”. However, the amount of flooding is now controlled by the use of distribution trees and hop counting. The net effect of flooding is significantly reduced due to the fact that topology changes do not flush the MAC address tables. Load balancing is much more effective and deterministic in TRILL networks. TRILL networks are also easier to troubleshoot, as every RBridge associates the “flat” MAC address with a “location” in the network defined via the remote bridge nickname. Lastly, the problem of address table growth is somewhat resolved, due to the fact that MAC addresses need not be known on every switch in the domain, but only on the switches that actually have connections to the end equipment.

Even though TRILL offers some benefits, MAC address learning and frame flooding remain. Furthermore, the problem of address space growth is not fully resolved, as with TRILL it results in a “core-edge” address table asymmetry. If you are looking for a complete solution to all Ethernet issues, it is recommended to read the paper “Floodless in SEATTLE” (see Further Reading below), which offers a significantly re-engineered Ethernet technology utilizing DHTs (Distributed Hash Tables), as found in peer-to-peer networks, together with link-state routing. SEATTLE offers truly floodless Ethernet and resolves the address space growth problem by making the global MAC address table work like a distributed database. Thanks to Daniel Ginsburg aka dbg for referring me to this wonderful reading!

Conclusions

Right now, OTV looks like an attempt to rapidly deploy VPLS functionality without relying on MPLS transport. This is probably driven by the growing need for deploying large data centers and interconnecting them across generic packet-switched networks at the customer edge. OTV reduces flooding in network cores, but makes the convergence process slower in the presence of topology changes. Multicast traffic is forwarded in an optimal manner using the core network’s multicast services. If OTV is a first step toward TRILL, then it looks like a very promising technology. Otherwise it is just a VPLS replica with some optimizations. Still, I hope that one day the OTV and VPLS branches will be merged and TRILL will be implemented in one common VPLS framework!

Further Reading

OTV Patent Paper
RBridges Draft Document
SEATTLE Technology
VPLS using LDP Signaling
VPLS using BGP Signaling
Multicasting in VPLS
Multihoming in VPLS
Multicast VPNs using Draft Rosen
Introduction to M-LSPs and Practical Examples
