Aug
17

Edit: For those of you that want to take a look first-hand at these packets, the Wireshark PCAP files referenced in this post can be found here

One of the hottest topics in networking today is Data Center Virtualized Workload Mobility (VWM). For those of you that have been hiding under a rock for the past few years, workload mobility basically means the ability to dynamically and seamlessly reassign hardware resources to virtualized machines, often between physically disparate locations, while keeping this transparent to the end users. This is often accomplished through VMware vMotion, which allows for live migration of virtual machines between sites, or through the similar implementations in Microsoft's Hyper-V and Citrix's Xen hypervisors.

One of the typical requirements of workload mobility is that the hardware resources used must be on the same layer 2 network segment. E.g. the VMware host machines must be in the same IP subnet and VLAN in order to allow for live migration of their VMs. The big design challenge then becomes: how do we allow for live migrations of VMs between Data Centers that are not in the same layer 2 network? One solution to this problem that Cisco has devised is a relatively new technology called Overlay Transport Virtualization (OTV).

As a side result of preparing for INE’s upcoming CCIE Data Center Nexus Bootcamp I’ve had the privilege (or punishment depending on how you look at it ;) ) of delving deep into the OTV implementation on Nexus 7000. My goal was to find out exactly what was going on behind the scenes with OTV. The problem I ran into though was that none of the external Cisco documentation, design guides, white papers, Cisco Live presentations, etc. really contained any of this information. The only thing that is out there on OTV is mainly marketing info, i.e. buzzword bingo, or very basic config snippets on how to implement OTV. In this blog post I’m going to discuss the details of my findings about how OTV actually works, with the most astonishing of these results being that OTV is in fact, a fancy GRE tunnel.

From a high level overview, OTV is basically a layer 2 over layer 3 tunneling protocol. In essence OTV accomplishes the same goal as other L2 tunneling protocols such as L2TPv3, Any Transport over MPLS (AToM), or Virtual Private LAN Services (VPLS). For OTV specifically this goal is to take Ethernet frames from an end station, like a virtual machine, encapsulate them inside IPv4, transport them over the Data Center Interconnect (DCI) network, decapsulate them on the other side, and out pops your original Ethernet frame.

For this specific application OTV has some inherent benefits over other designs such as MPLS L2VPN with AToM or VPLS. The first of these is that OTV is transport agnostic. As long as there is IPv4 connectivity between Data Centers, OTV can be used. AToM and VPLS both require that the transport network be MPLS aware, which can limit your selection of Service Providers for the DCI. OTV, on the other hand, can technically run over any regular Internet connectivity.

Another advantage of OTV is that provisioning is simple. AToM and VPLS tunnels are Provider Edge (PE) side protocols, while OTV is a Customer Edge (CE) side protocol. This means that for AToM and VPLS the Service Provider has to pre-provision the pseudowires. Even though VPLS supports enhancements like BGP auto-discovery, provisioning of MPLS L2VPN still requires administrative overhead. OTV is much simpler in this case, because as we'll see shortly, the configuration is just a few commands that are controlled by the CE router, not the PE router.

The next thing we have to consider with OTV is how exactly this layer 2 tunneling is accomplished. After all, we could just configure static GRE tunnels on our DCI edge routers and bridge over them, but this is probably not the best design option for either control plane or data plane scalability.
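For comparison, the manual version of that idea would be something like transparent bridging glued across a point-to-point GRE tunnel. The sketch below is hypothetical (the interface names are made up and the bridged-GRE syntax is from memory, so treat it as an illustration rather than a tested config); the tunnel addresses are borrowed from the lab used later in this post:

bridge 1 protocol ieee
!
interface Tunnel0
 tunnel source 150.1.38.3
 tunnel destination 150.1.78.7
 bridge-group 1
!
interface GigabitEthernet0/1
 description Layer 2 link toward the DC access switches
 bridge-group 1

Every additional site needs another tunnel, spanning tree has to be carried across the DCI, and unknown unicast floods to every site, which is exactly the kind of scaling and loop-prevention problem that OTV's control plane is designed to solve.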

The way that OTV implements the control plane portion of its layer 2 tunnel is what is sometimes described as “MAC in IP Routing”. Specifically OTV uses Intermediate System to Intermediate System (IS-IS) to advertise the VLAN and MAC address information of the end hosts over the Data Center Interconnect. For those of you that are familiar with IS-IS, immediately this should sound suspect. After all, IS-IS isn’t an IP protocol, it’s part of the legacy OSI stack. This means that IS-IS is directly encapsulated over layer 2, unlike OSPF or EIGRP which ride over IP at layer 3. How then can IS-IS be encapsulated over the DCI network that is using IPv4 for transport? The answer? A fancy GRE tunnel.

The next portion that is significant about OTV’s operation is how it actually sends packets in the data plane. Assuming for a moment that the control plane “just works”, and the DCI edge devices learn about all the MAC addresses and VLAN assignments of the end hosts, how do we actually encapsulate layer 2 Ethernet frames inside of IP to send over the DCI? What if there is multicast traffic that is running over the layer 2 network? Also what if there are multiple sites reachable over the DCI? How does it know specifically where to send the traffic? The answer? A fancy GRE tunnel.

Next I want to introduce the specific topology that will be used for us to decode the details of how OTV is working behind the scenes. Within the individual Data Center sites, the layer 2 configuration and physical wiring is not relevant to our discussion of OTV. Assume simply that the end hosts have layer 2 connectivity to the edge routers. Additionally assume that the edge routers have IPv4 connectivity to each other over the DCI network. In this specific case I chose to use RIPv2 for routing over the DCI (yes, you read that correctly), simply so I could filter it from my packet capture output, and easily differentiate between the routing control plane in the DCI transport network vs. the routing control plane that was tunneled inside OTV between the Data Center sites.

What we are mainly concerned with in this topology is as follows:

  • OTV Edge Devices N7K1-3 and N7K2-7
    • These are the devices that actually encapsulate the Ethernet frames from the end hosts into the OTV tunnel. I.e. this is where the OTV config goes.
  • DCI Transport Device N7K2-8
    • This device represents the IPv4 transit cloud between the DC sites. From this device’s perspective it sees only the tunnel encapsulated traffic, and does not know the details about the hosts inside the individual DC sites. Additionally this is where packet capture is occurring so we can view the actual payload of the OTV tunnel traffic.
  • End Hosts R2, R3, Server 1, and Server 3
    • These are the end devices used to generate data plane traffic that ultimately flows over the OTV tunnel.

Now let’s look at the specific configuration on the edge routers that is required to form the OTV tunnel.

N7K1-3:
vlan 172
name OTV_EXTEND_VLAN
!
vlan 999
name OTV_SITE_VLAN
!
spanning-tree vlan 172 priority 4096
!
otv site-vlan 999
otv site-identifier 0x101
!
interface Overlay1
otv join-interface Ethernet1/23
otv control-group 224.100.100.100
otv data-group 232.1.2.0/24
otv extend-vlan 172
no shutdown
!
interface Ethernet1/23
ip address 150.1.38.3/24
ip igmp version 3
ip router rip 1
no shutdown

N7K2-7:
vlan 172
name OTV_EXTEND_VLAN
!
vlan 999
name OTV_SITE_VLAN
!
spanning-tree vlan 172 priority 4096
!
otv site-vlan 999
otv site-identifier 0x102
!
interface Overlay1
otv join-interface port-channel78
otv control-group 224.100.100.100
otv data-group 232.1.2.0/24
otv extend-vlan 172
no shutdown
!
interface port-channel78
ip address 150.1.78.7/24
ip igmp version 3
ip router rip 1

As you can see the configuration for OTV really isn’t that involved. The specific portions of the configuration that are relevant are as follows:

  • Extend VLANs
    • These are the layer 2 segments that will actually get tunneled over OTV. Basically these are the VLANs that your virtual machines reside on that you want to do the VM mobility between. In our case this is VLAN 172, which maps to the IP subnet 172.16.0.0/24.
  • Site VLAN
    • Used to synchronize the Authoritative Edge Device (AED) role within an OTV site. This comes into play when you have more than one edge router per site. OTV only allows a specific Extend VLAN to be tunneled by one edge router at a time for the purpose of loop prevention. Essentially the Site VLAN lets the edge routers talk to each other and figure out which one is active/standby on a per-VLAN basis for the OTV tunnel. The Site VLAN should not be included in the extend VLAN list.
  • Site Identifier
    • Should be unique per DC site. If you have more than one edge router per site, they must agree on the Site Identifier, as it’s used in the AED election.
  • Overlay Interface
    • The logical OTV tunnel interface.
  • OTV Join Interface
    • The physical link or port-channel that you use to route upstream towards the DCI.
  • OTV Control Group
    • Multicast address used to discover the remote sites in the control plane. Since this is an ASM group, the DCI transit network needs matching multicast support; see the sketch just after this list.
  • OTV Data Group
    • Used when you’re tunneling multicast traffic over OTV in the data plane.
  • IGMP Version 3
    • Needed to send (S,G) IGMP Report messages towards the DCI network on the Join Interface.
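One piece the edge router configuration above does not show is the multicast support that the DCI transit network itself needs for these groups. A minimal sketch for the transit device (N7K2-8 in this topology) might look like the following; the RP placement on N7K2-8 itself and the 150.1.38.8 address are my assumptions, not something taken from the lab:

feature pim
!
ip pim rp-address 150.1.38.8 group-list 224.0.0.0/4
ip pim ssm range 232.0.0.0/8
!
interface Ethernet1/29
 ip pim sparse-mode
!
interface port-channel78
 ip pim sparse-mode

The SSM range shown is the NX-OS default, so that line is really only there to make the intent for the data group explicit. We'll see why an RP is required for the control group when we look at the transit mroute table later on.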

At this point that's basically all that's involved in the implementation of OTV. It "just works", because all the behind the scenes stuff is hidden from us from a configuration point of view. A quick test from the end hosts confirms this:

R2#ping 255.255.255.255
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 255.255.255.255, timeout is 2 seconds:

Reply to request 0 from 172.16.0.3, 4 ms
Reply to request 1 from 172.16.0.3, 1 ms
Reply to request 2 from 172.16.0.3, 1 ms
Reply to request 3 from 172.16.0.3, 1 ms
Reply to request 4 from 172.16.0.3, 1 ms

R2#traceroute 172.16.0.3
Type escape sequence to abort.
Tracing the route to 172.16.0.3
VRF info: (vrf in name/id, vrf out name/id)
1 172.16.0.3 0 msec * 0 msec

The fact that R3 responds to R2's packets going to the all hosts broadcast address (255.255.255.255) implies that they are in the same broadcast domain. How specifically is it working though? That's what took a lot of further investigation.

To simplify the packet level verification a little further, I changed the MAC address of the four end devices that are used to generate the actual data plane traffic. The Device, IP address, and MAC address assignments are as follows:

The first thing I wanted to verify in detail was what the data plane looked like, and specifically what type of tunnel encapsulation was used. With a little searching I found that OTV is currently on the IETF standards track in draft format. As of writing, the newest draft is draft-hasmit-otv-03. Section 3.1 Encapsulation states:

3.  Data Plane

3.1. Encapsulation

The overlay encapsulation format is a Layer-2 ethernet frame
encapsulated in UDP inside of IPv4 or IPv6.

The format of OTV UDP IPv4 encapsulation is as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|     Fragment Offset     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live | Protocol = 17 |        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Source-site OTV Edge Device IP Address              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Destination-site OTV Edge Device (or multicast) Address   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Source Port = xxxx      |        Dest Port = 8472       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           UDP length          |        UDP Checksum = 0       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R|                  Overlay ID                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Instance ID                    |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|             Frame in Ethernet or 802.1Q Format                |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A quick PING sweep of packet lengths with the Don't Fragment bit set allowed me to find the encapsulation overhead, which turns out to be 42 bytes: a 1458-byte packet makes it through while a 1459-byte packet does not, and 1458 + 42 = 1500, the standard Ethernet MTU of the transport. This can be seen below:

R3#ping 172.16.0.2 size 1459 df-bit 

Type escape sequence to abort.
Sending 5, 1459-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:
Packet sent with the DF bit set
.....
Success rate is 0 percent (0/5)

R3#ping 172.16.0.2 size 1458 df-bit

Type escape sequence to abort.
Sending 5, 1458-byte ICMP Echos to 172.16.0.2, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms

None of my testing, however, could verify what the encapsulation header actually was. The draft says that the transport is supposed to be UDP port 8472, but none of my logging produced results showing that any UDP traffic was even in the transit network (save for my RIPv2 routing ;) ). After much frustration, I finally broke out the sniffer and took some packet samples. The first capture below shows a normal ICMP ping between R2 and R3.

MPLS? GRE? Where did those come from? That’s right, OTV is in fact a fancy GRE tunnel. More specifically it is an Ethernet over MPLS over GRE tunnel. My poor little PINGs between R2 and R3 are in fact encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet (IoIoEoMPLSoGREoIP for short). Let’s take a closer look at the encapsulation headers now:

In the detailed header output we see our transport Ethernet header, which in a real deployment could be anything depending on what the transport of your DCI is (Ethernet, POS, ATM, Avian Carrier, etc.). Next we have the IP OTV tunnel header, which surprised me in a few aspects. First, all documentation I read said that without the use of an OTV Adjacency Server, unicast can't be used for transport. This is true... up to a point. Multicast, it turns out, is only used to establish the control plane, and to tunnel multicast over multicast in the data plane. Regular unicast traffic over OTV is encapsulated as unicast, as seen in this capture.

The next header after IP is GRE. In other words, OTV is basically the same as configuring a static GRE tunnel between the edge routers and then bridging over them, along with some enhancements (hence fancy GRE). The OTV enhancements (which we’ll talk about shortly) are the reason why you wouldn’t just configure GRE statically. Nevertheless this surprised me because even in hindsight the only mention of OTV using GRE I found was here. What’s really strange about this is that Cisco’s OTV implementation doesn’t follow what the standards track draft says, which is UDP, even though the authors of the OTV draft are Cisco engineers. Go figure.

The next header, MPLS, makes sense since the prior encapsulation is already GRE. Ethernet over MPLS over GRE is already well defined and used in deployment, so there's no real reason to reinvent the wheel here. I haven't verified this in detail yet, but I'm assuming that the MPLS label value would be used in cases where the edge router has multiple overlay interfaces, in which case the label in the data plane would quickly tell it which overlay interface the incoming packet is destined for. This logic is similar to MPLS L3VPN, where the bottom of the stack VPN label tells a PE router which CE facing link the packet is ultimately destined for. I'm going to do some more testing later with a larger, more complex topology to actually verify this, as all data plane traffic over this particular tunnel shares the same MPLS label value.

Next we see the original Ethernet header, which is sourced from R2’s MAC address 0000.0000.0002 and going to R3’s MAC address 0000.0000.0003. Finally we have the original IP header and the final ICMP payload. The key with OTV is that this inner Ethernet header and its payload remain untouched, so it looks like from the end host perspective that all the devices are just on the same LAN.

Now that it was apparent that OTV was just a fancy GRE tunnel, the IS-IS piece fell into place. Since IS-IS runs directly over layer 2 (e.g. Ethernet), and OTV is an Ethernet over MPLS over GRE tunnel, IS-IS can be encapsulated as IS-IS over Ethernet over MPLS over GRE (phew!). To test this, I changed the MAC address of one of the end hosts, and looked at the IS-IS LSP generation of the edge devices. After all, the goal of the OTV control plane is to use IS-IS to advertise the MAC addresses of end hosts in that particular site, as well as the particular VLAN that they reside in. The configuration steps and packet capture result of this are as follows:

R3#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#int gig0/0
R3(config-if)#mac-address 1234.5678.9abc
R3(config-if)#
*Aug 17 22:17:10.883: %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to reset
*Aug 17 22:17:11.883: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to down
*Aug 17 22:17:16.247: %LINK-3-UPDOWN: Interface GigabitEthernet0/0, changed state to up
*Aug 17 22:17:17.247: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to up

The first thing I noticed about the IS-IS encoding over OTV is that it uses IPv4 Multicast. This makes sense, because if you have 3 or more OTV sites you don’t want to have to send your IS-IS LSPs as replicated Unicast. As long as all of the AEDs on all sites have joined the control group (224.100.100.100 in this case), the LSP replication should be fine. This multicast forwarding can also be verified in the DCI transport network core in this case as follows:

N7K2-8#show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 224.100.100.100/32), uptime: 20:59:33, ip pim igmp
Incoming interface: Null, RPF nbr: 0.0.0.0
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, igmp
Ethernet1/29, uptime: 20:58:53, igmp

(150.1.38.3/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib
Incoming interface: Ethernet1/29, RPF nbr: 150.1.38.3
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, mrib
Ethernet1/29, uptime: 20:58:53, mrib, (RPF)

(150.1.78.7/32, 224.100.100.100/32), uptime: 21:00:05, ip pim mrib
Incoming interface: port-channel78, RPF nbr: 150.1.78.7
Outgoing interface list: (count: 2)
port-channel78, uptime: 20:58:46, mrib, (RPF)
Ethernet1/29, uptime: 20:58:53, mrib

(*, 232.0.0.0/8), uptime: 21:00:05, pim ip
Incoming interface: Null, RPF nbr: 0.0.0.0
Outgoing interface list: (count: 0)

Note that N7K1-3 (150.1.38.3) and N7K2-7 (150.1.78.7) have both joined the (*, 224.100.100.100) group. A very important point about this is that the control group for OTV is an Any Source Multicast (ASM) group, not a Source Specific Multicast (SSM) group. This implies that your DCI transit network must run PIM Sparse Mode and have a Rendezvous Point (RP) configured in order to build the shared tree (RPT) for the OTV control group used by the AEDs. You technically could use Bidir, but you really wouldn't want to for this particular application. How they chose to implement this kind of surprised me, because there are already more efficient ways of doing source discovery for SSM, for example how Multicast MPLS L3VPN uses the BGP AFI/SAFI Multicast MDT to advertise the (S,G) pairs of the PE routers. I suppose the advantage of doing OTV this way though is that it makes the OTV config very straightforward from an implementation point of view on the AEDs, and you don't need an extra control plane protocol like BGP to exchange the (S,G) pairs before you actually join the tree. The alternative to this of course is to use the Adjacency Server and just skip using multicast altogether. This however will result in unicast replication in the core, which can be bad, mkay?
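For completeness, the unicast-only alternative mentioned above looks roughly like this. This is a sketch of the adjacency server feature from memory, so verify the exact syntax against your NX-OS release:

N7K1-3 (acting as the adjacency server):
interface Overlay1
 otv join-interface Ethernet1/23
 otv adjacency-server unicast-only
 otv extend-vlan 172
 no shutdown

N7K2-7 (pointing at the adjacency server):
interface Overlay1
 otv join-interface port-channel78
 otv use-adjacency-server 150.1.38.3 unicast-only
 otv extend-vlan 172
 no shutdown

The control-group and data-group commands go away in this mode and head-end unicast replication takes their place, which is why it only really makes sense for a small number of sites.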

Also, for added fun, the actual MAC address routing table built by the IS-IS control plane can be verified as follows:

N7K2-7# show otv route

OTV Unicast MAC Routing Table For Overlay1

VLAN MAC-Address Metric Uptime Owner Next-hop(s)
---- -------------- ------ -------- --------- -----------
172 0000.0000.0002 1 01:22:06 site port-channel27
172 0000.0000.0003 42 01:20:51 overlay N7K1-3
172 0000.0000.000a 42 01:18:11 overlay N7K1-3
172 0000.0000.001e 1 01:20:36 site port-channel27
172 1234.5678.9abc 42 00:19:09 overlay N7K1-3

N7K2-7# show otv isis database detail | no-more
OTV-IS-IS Process: default LSP database VPN: Overlay1

OTV-IS-IS Level-1 Link State Database
LSPID Seq Number Checksum Lifetime A/P/O/T
N7K2-7.00-00 * 0x000000A3 0xA36A 893 0/0/0/1
Instance : 0x000000A3
Area Address : 00
NLPID : 0xCC 0x8E
Hostname : N7K2-7 Length : 6
Extended IS : N7K1-3.01 Metric : 40
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.001e
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.0002
Digest Offset : 0
N7K1-3.00-00 0x00000099 0xBAA4 1198 0/0/0/1
Instance : 0x00000094
Area Address : 00
NLPID : 0xCC 0x8E
Hostname : N7K1-3 Length : 6
Extended IS : N7K1-3.01 Metric : 40
Vlan : 172 : Metric : 1
MAC Address : 1234.5678.9abc
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.000a
Vlan : 172 : Metric : 1
MAC Address : 0000.0000.0003
Digest Offset : 0
N7K1-3.01-00 0x00000090 0xCBAB 718 0/0/0/1
Instance : 0x0000008E
Extended IS : N7K2-7.00 Metric : 0
Extended IS : N7K1-3.00 Metric : 0
Digest Offset : 0

So at this point we've seen that our ICMP PING was actually ICMP over IP over Ethernet over MPLS over GRE over IP over Ethernet, and our routing protocol was IS-IS over Ethernet over MPLS over GRE over IP over Ethernet :/ What about multicast in the data plane though? At this point verification of multicast over the DCI core is pretty straightforward, since we can just enable a routing protocol that uses multicast for its transport, like EIGRP, and look at the result. This can be seen below:

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#router eigrp 1
R2(config-router)#no auto-summary
R2(config-router)#network 0.0.0.0
R2(config-router)#end
R2#

R3#config t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#router eigrp 1
R3(config-router)#no auto-summary
R3(config-router)#network 0.0.0.0
R3(config-router)#end
R3#
*Aug 17 22:39:43.419: %SYS-5-CONFIG_I: Configured from console by console
*Aug 17 22:39:43.423: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.0.2 (GigabitEthernet0/0) is up: new adjacency

R3#show ip eigrp neighbors
IP-EIGRP neighbors for process 1
H   Address                 Interface       Hold  Uptime   SRTT   RTO  Q   Seq
                                            (sec)          (ms)        Cnt Num
0   172.16.0.2              Gi0/0             11  00:00:53    1    200  0   1

Our EIGRP adjacency came up, so multicast obviously is being tunneled over OTV. Let’s see the packet capture result:

We can see EIGRP being tunneled inside the OTV payload, but what's with the outer header? Why is EIGRP using the ASM 224.100.100.100 group instead of the SSM 232.1.2.0/24 data group? My first guess was that link local multicast (i.e. 224.0.0.0/24) gets encapsulated as control plane instead of as data plane. This would make sense, because you would want control plane protocols like OSPF, EIGRP, PIM, etc. tunneled to all OTV sites, not just the ones that joined the SSM feeds. To test if this was the case, the only change I needed to make was to have one router join a non-link-local multicast group, and have the other router send ICMP pings. Since they're effectively in the same LAN segment, no PIM routing is needed in the DC sites, just basic IGMP Snooping, which is enabled in NX-OS by default. The config on the IOS routers is as follows:

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#ip multicast-routing
R2(config)#int gig0/0
R2(config-if)#ip igmp join-group 224.10.20.30
R2(config-if)#end
R2#

R3#ping 224.10.20.30 repeat 1000 size 1458 df-bit

Type escape sequence to abort.
Sending 1000, 1458-byte ICMP Echos to 224.10.20.30, timeout is 2 seconds:
Packet sent with the DF bit set

Reply to request 0 from 172.16.0.2, 1 ms
Reply to request 1 from 172.16.0.2, 1 ms
Reply to request 2 from 172.16.0.2, 1 ms

The packet capture result was as follows:

This was more as expected. Now the multicast data plane packet was getting encapsulated as ICMP over IP over Ethernet over MPLS over GRE over IP multicast (the OTV data group) over Ethernet. The payload wasn't decoded, as I think even Wireshark was dumbfounded by this string of encapsulations.

In summary we can make the following observations about OTV:

  • OTV encapsulation has 42 bytes of overhead that consists of:
    • New Outer Ethernet II Header - 14 Bytes
    • New Outer IP Header - 20 Bytes
    • GRE Header - 4 Bytes
    • MPLS Header - 4 Bytes
  • OTV uses both Unicast and Multicast transport
    • ASM Multicast is used to build the control plane for OTV IS-IS, ARP, IGMP, EIGRP, etc.
    • Unicast is used for normal unicast data plane transmission between sites
    • SSM Multicast is used for normal multicast data plane transmission between sites
    • Optionally ASM & SSM can be replaced with the Adjacency Server
  • GRE is the ultimate band-aid of networking

Now the next time someone is throwing around fancy buzzwords about OTV, DCI, VWM, etc. you can say “oh, you mean that fancy GRE tunnel”? ;)

I'll be continuing this series in the coming days and weeks on other Data Center and specifically CCIE Data Center related technologies. If you have a request for a specific topic or protocol that you'd like to see the behind-the-scenes details of, drop me a line at bmcgahan@ine.com.

Happy Labbing!

Oct
18

One of our most anticipated products of the year - INE's CCIE Service Provider v3.0 Advanced Technologies Class - is now complete!  The videos from class are in the final stages of post production and will be available for streaming and download access later this week.  Download access can be purchased here for $299.  Streaming access is available for All Access Pass subscribers for as low as $65/month!  AAP members can additionally upgrade to the download version for $149.

At roughly 40 hours, the CCIE SPv3 ATC covers the newly released CCIE Service Provider version 3 blueprint, which includes the addition of IOS XR hardware. This class includes both technology lectures and hands on configuration, verification, and troubleshooting on both regular IOS and IOS XR. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS.

Below you can see a sample video from the class, which covers IS-IS Route Leaking and its implementation on IOS XR with the Routing Policy Language (RPL).

Jul
31

In this short article we'll take a look at Cisco IOS static multicast routes (mroutes) and the way they are used for RPF information selection. Multicast routing using PIM is not based on the propagation of any type of multicast routes, a process that was used, say, in DVMRP. Instead, the router performs RPF checks based on the contents of the unicast routing table, populated by regular routing protocols. The RPF checks can be classified as either data-plane or control-plane. A data-plane RPF check applies when the router receives a multicast packet, to validate that the interface and upstream neighbor sending the packet match the RPF information. For data-plane multicast, the packet must be received from an active PIM neighbor on the interface that is on the shortest path to the packet's source IP address, or the RPF check fails. A control-plane RPF check is performed when originating/receiving control-plane messages, such as sending a PIM Join or receiving an MSDP SA message. For example, PIM needs to know where to send the Join message for a particular (S,G) or (*,G) tree, and this is done based on an RPF lookup for the source IP or RP address. Effectively, for PIM, the RPF check influences the actual multicast path selection in a "reversed" way: it carves the route that the PIM Join message will take and thus affects the tree construction. In both the control-plane and data-plane RPF check cases, the process is similar, and based on looking through all available RPF sources.

The following is the list of possible RPF sources:

  1. Unicast routes, static/dynamic (e.g. via OSPF). This is the normal source of RPF information, and the only one you need in a properly configured multicast network, where a single routing protocol is used and multicast is enabled on all links.
  2. Static mroutes, which are "hints" for the RPF check. These can be used in situations where you need to engineer multicast traffic flow over links that don't run an IGP, such as tunnels, or to fix RPF failures in situations where multicast routing is not enabled on all links or you have route redistribution configured.
  3. Multicast extension routes, such as those learned via M-BGP. While these belong mainly to the SP domain, M-BGP could be used within the scope of the CCIE R&S exam to creatively influence path selection and perform RPF fixups without resorting to static mroutes.

You may find out which source is used for a particular address by using the command show ip rpf [Address]. The process of finding the RPF information is different from a simple unicast routing table lookup. It is not based solely on the longest-match rule across all RPF sources; rather, the best match is selected within every group and then the winner is elected based on administrative distance. The router selects the best matching prefix from both the unicast table (based on longest match) and the static multicast routing table and compares their ADs to select the best one. For the mroute table, the order in which you create static mroutes is important - the first matching route is selected, not the longest-matching one.

By default, when you configure a static mroute, its admin distance is zero. For example, if you have a static default mroute ip mroute 0.0.0.0 0.0.0.0 it will always be used over any longer-matching unicast prefix, since it matches everything and has an AD of zero. As another example, assume that you want the prefix 192.168.1.0/24 to be RPF checked against the unicast table, while all other sources are matched against the default mroute. You may configure something like this:

ip mroute 192.168.1.0 255.255.255.0 null 0 255
ip mroute 0.0.0.0 0.0.0.0 Tunnel 0

Like we mentioned before, the order of mroute statements is important here, and for sources in the range 192.168.1.0/24 the first matching static mroute has an AD of 255 and thus will always be less preferred than unicast table routes (but not ignored or black-holed!). However, for all other sources, the default mroute will be selected over any unicast information. Notice that if you put the static default mroute ahead of the specific mroute the trick will not work - the default mroute will always match everything and prevent further search through the mroute table. What if an mroute and a unicast route both have the same admin distance? In this case, the static mroute wins, unless it is compared against a directly attached route or a default route. In the latter case, the unicast direct or unicast default route will beat the mroute for the RPF check.

NOTE:
It seems that in recent IOS versions the linearly ordered match has been replaced with a longest-match lookup across the mroute table. CCO documentation and examples still state that the ordered match is in use, but actual testing shows it is, in fact, longest match. Thanks to David Serra for pointing this out.

Finally, what about M-BGP, which is another common source for RPF information? M-BGP routes are treated the same way as static mroutes, but with the distance of the BGP process - 200 for iBGP or 20 for eBGP. They don't show up in the unicast routing table, but they are used as an RPF information source. However, when looking up the best matching M-BGP prefix, a longest match is performed and selected for comparison, unlike the linear ordering used for mroutes. Think of the following scenario: your router receives a unicast default route via OSPF and the prefix 192.168.1.0/24 via an M-iBGP session. A packet with the source address 192.168.1.100 arrives - what would be used for the RPF check? Somewhat counter-intuitively, it would be the OSPF default route, because of OSPF's admin distance of 110 versus BGP's distance of 200 for iBGP. You can solve this problem by lowering BGP's distance, increasing OSPF's distance, or resorting to a static mroute for the source prefix. Keep in mind, though, that in case of equal AD - e.g. when the same prefix is received via both the unicast and multicast BGP address families - the multicast route takes precedence, per the general comparison rule.
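To make that concrete, the simplest fix with the tools already covered is a static mroute for the source prefix that points RPF at the M-BGP peer. The 10.0.0.1 next hop below is a made-up address, purely for illustration:

ip mroute 192.168.1.0 255.255.255.0 10.0.0.1

Since the static mroute's default AD of zero beats both the OSPF default route and the M-iBGP prefix, sources in 192.168.1.0/24 will now RPF toward 10.0.0.1 regardless of what the unicast table says.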

In the end, let's briefly talk about what happens if the router has multiple equal-cost paths to the source. Firstly, only those routes that point to active PIM neighbors will be used. Secondly, the router will use the entry with the highest PIM neighbor IP address. This effectively eliminates uncertainty in the RPF decision. It is possible to use equal-cost multicast splitting, but this is a separate IOS feature:

Load Splitting IP Multicast traffic over ECMP

This feature allows splitting (not load-balancing) multicast trees among different RPF paths and accepting packets from multiple RPF sources. However, for the classic multicast, there is only one RPF interface.
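If you do want that behavior, the feature is enabled with a single global command. A minimal sketch (newer IOS releases also offer s-g-hash variants of the same command):

ip multicast multipath

With this enabled, the router load-splits RPF selection for different sources across the equal-cost paths instead of always picking the path through the highest PIM neighbor IP address.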

Dec
02

After working with the December 2010 London Bootcamp on Multicast for the better part of Day 4 in our 12-day bootcamp, I returned to the hotel to find the following post on my Facebook page - "Multicast is EVIL!"

Why do so many students feel this way about this particular technology? I think one of the biggest challenges is that troubleshooting Multicast definitely reminds us of just what an "art" solving network issues can become. And speaking of troubleshooting, in the Version 4 Routing and Switching exam, we may have to contend with fixing problems beyond the scope of our own "self-induced" variety. This is, of course, thanks to the initial 2 hour Troubleshooting section which may indeed include Multicast-related Trouble Tickets.

Your very best defense against any issues in the lab exam regarding this technology - the new 3-Day Multicast technology bootcamp. Also, be sure to enjoy the latest free vSeminar from Brian McGahan - Troubleshooting IP Multicast Routing.

Oct
12

INE is proud to announce the release of our Multicast Class-on-Demand! Taught by myself, this 15-hour Class-on-Demand series covers IPv4 and IPv6 Multicast Routing on Cisco IOS, including both technology lectures and hands-on CLI examples. More information on the Multicast Class-on-Demand can be found here.

To celebrate the release of this new Class-on-Demand, I will be running a free vSeminar on Troubleshooting IP Multicast Routing this Thursday, October 14th, at 10:00 am PDT (17:00 GMT). This free seminar will run about an hour long, and will cover CLI examples of how to troubleshoot common IP Multicast problems, including RPF failure and the use of static multicast routes & mtrace. For those unable to attend, this vSeminar will be available in recorded Class-on-Demand format at a later date. The url to attend is http://ieclass.internetworkexpert.com/tshootipmulticast/

Click here to register for notifications about new upcoming vSeminars.

Hope to see you there!

Sep
18

When we ask students "what are your weakest areas" or "what are your biggest areas of concern" for the CCIE Lab Exam, we typically always hear non-core topics like Multicast, Security, QoS, BGP, etc. As such, INE has responded with a series of bootcamps focused on these disciplines.

The IPv4/IPv6 Multicast 3-Day live, online bootcamp, and the associated Class On-Demand version, seek to address the often confounding subject of Multicast. Detailed coverage of Multicast topics for the following certifications is provided:

Cisco Certified Network Professional (CCNP)

Cisco Certified Design Associate (CCDA)

Cisco Certified Design Professional (CCDP)

Cisco Certified Design Expert (CCDE)

Cisco Certified Internetwork Expert Routing & Switching (CCIE R&S)

Cisco Certified Internetwork Expert Service Provider (CCIE Service Provider)

Cisco Certified Internetwork Expert Security (CCIE Security)

To purchase the live and on-demand versions of the course for an amazing $295, just click here. The live course runs 11 AM to 6 PM EST US on September 29, 30, and October 1.

The preliminary course outline is as follows:

  • Module 1 Introduction to Multicast

Lesson 1 The Need for Multicast

Lesson 2 Multicast Traffic Characteristics and Behavior

Lesson 3 Multicast Addressing

Lesson 4 IGMP

Lesson 5 Protocol Independent Multicast

  • Module 2 IGMP

Lesson 1 IGMP Version 1

Lesson 2 IGMP Version 2

Lesson 3 IGMP Version 3

Lesson 4 CGMP

Lesson 5 IGMP Snooping

Lesson 6 IGMP Optimization

Lesson 7 IGMP Security

Lesson 8 Advanced IGMP Mechanisms


  • Module 3 Protocol Independent Multicast Forms

Lesson 1 Dense Mode

Lesson 2 Sparse Mode

Lesson 3 Sparse-Dense Mode

Lesson 4 Bidirectional PIM

Lesson 5 PIM on NBMA Networks

  • Module 4 Rendezvous Points

Lesson 1 Static Configuration

Lesson 2 AUTO-RP

Lesson 3 BSR

Lesson 4 Hybrid RP Assignment Approaches

  • Module 5 Connecting PIM Domains

Lesson 1 MSDP

Lesson 2 MSDP Configurations

Lesson 3 MSDP to Anycast

  • Module 6 Multicast Tools

Lesson 1 Rate Limiting

Lesson 2 Multicasting with Tunnels

Lesson 3 Multicast Helper

Lesson 4 Miscellaneous Other Tools and Features

  • Module 7 IPv6 Multicast

Lesson 1 PIM

Lesson 2 RP Assignments

Lesson 3 MLD

Mar
24

IPv6 multicast routing is a fun topic, and is often either loved or avoided :). Here is a jump-start for all my CCIE candidate friends.

Reader's Digest version: Auto-RP is out, Dense mode is out, and IGMP is replaced with Multicast Listener Discovery (MLD). MLDv2 supports SSM. RPs, Bidirectional PIM, SSM, ASM and BSRs are still alive and well, and we can now avoid static RPs and BSR if we choose to use embedded RP, where the RP address is carried inside the multicast group address itself. (Crazy and amazing stuff.)

Want a little more? Then read on. In this multi-part blog, we will discuss static RP, BSR, and Embedded RP. This first blog will discuss static RP, with some examples that will assist you in getting started. For those of you who subscribe to the open lecture series, I will be including all three RP options in a discussion there as well.

Here is the topology we will use:

IPV6 Multicasting

Here is some additional info on the topology. There is a loopback 0 interface on each router using 2002:yyyy::y/64, where y = the router number. We also hard coded the MAC addresses to 00yy.yyyy.yyyy so that they would be easy to spot. OSPFv3 is running on each interface shown in the diagram including the loopbacks. Let’s verify some basic addressing and connectivity before we add IPv6 multicast routing to the mix.
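For reference, R3's side of that baseline would look roughly like this. It is reconstructed from the outputs that follow rather than copied from the lab, so the /64 masks on the transit links, the OSPFv3 process number, and the router-id are my assumptions:

ipv6 unicast-routing
!
interface Loopback0
 ipv6 address 2002:3333::3/64
 ipv6 ospf 1 area 0
!
interface FastEthernet0/0
 mac-address 0033.3333.3333
 ipv6 address 2002:34::3/64
 ipv6 ospf 1 area 0
!
interface Serial0/1.23
 ipv6 address 2002:23::3/64
 ipv6 ospf 1 area 0
!
ipv6 router ospf 1
 router-id 3.3.3.3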

R3#show ipv6 int brief
FastEthernet0/0 [up/up]
FE80::233:33FF:FE33:3333
2002:34::3
Serial0/1 [up/up]
Serial0/1.23 [up/up]
FE80::C003:8FF:FECC:0
2002:23::3
Loopback0 [up/up]
FE80::C003:8FF:FECC:0
2002:3333::3

R3#show ipv6 route ospf
IPv6 Routing Table - 17 entries
Codes: C - Connected, L - Local, S - Static, R - RIP, B - BGP
U - Per-user Static route, M - MIPv6
I1 - ISIS L1, I2 - ISIS L2, IA - ISIS interarea, IS - ISIS summary
O - OSPF intra, OI - OSPF inter, OE1 - OSPF ext 1, OE2 - OSPF ext 2
ON1 - OSPF NSSA ext 1, ON2 - OSPF NSSA ext 2
D - EIGRP, EX - EIGRP external
O 2002:12::/64 [110/5]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:14::/64 [110/65]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:16::/64 [110/4]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:45::/64 [110/2]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:56::/64 [110/3]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:1111::1/128 [110/4]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:2222::2/128 [110/5]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:4444::4/128 [110/1]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:5555::5/128 [110/2]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
O 2002:6666::6/128 [110/3]
via FE80::244:44FF:FE44:4444, FastEthernet0/0
R3#

Looks like everything is there. Let's do an IPv6 traceroute to verify. Note the path that begins from R3, then goes through R4 -> R5 -> R6 -> R1 -> R2. That is the path the OSPFv3 control plane sorted out. It will be important later, when we see what the RPF path is from R3 to R2.

R3#traceroute ipv6 2002:2222::2

Type escape sequence to abort.
Tracing the route to 2002:2222::2

1 2002:34::4 80 msec 28 msec 4 msec
2 2002:45::5 44 msec 40 msec 24 msec
3 2002:56::6 4 msec 32 msec 32 msec
4 2002:16::1 40 msec 28 msec 24 msec
5 2002:2222::2 84 msec 8 msec 20 msec
R3#

OK, I am sold that we have connectivity. Let's verify that PIM, MLD, and multicast routing are not currently enabled.

R3#show ipv6 pim interface
No interfaces found.

R3#show ipv6 mld interface

FastEthernet0/0 is up, line protocol is up
Internet address is FE80::233:33FF:FE33:3333/10
MLD is disabled on interface

Loopback0 is up, line protocol is up
Internet address is FE80::C003:8FF:FECC:0/10
MLD is disabled on interface
Serial0/1.23 is up, line protocol is up
Internet address is FE80::C003:8FF:FECC:0/10
MLD is disabled on interface
R3#

Let's enable IPv6 multicast routing and see the difference on R3.

R3(config)#ipv6 multicast-routing

R3#show ipv6 pim interface
Interface PIM Nbr Hello DR
Count Intvl Prior

Tunnel0 off 0 30 1
Address: FE80::C003:8FF:FECC:0
DR : not elected

FastEthernet0/0 on 0 30 1
Address: FE80::233:33FF:FE33:3333
DR : this system

Loopback0 on 0 30 1
Address: FE80::C003:8FF:FECC:0
DR : this system
Serial0/1.23 on 0 30 1
Address: FE80::C003:8FF:FECC:0
DR : this system

R3#show ipv6 mld interface
Tunnel0 is up, line protocol is up
Internet address is FE80::C003:8FF:FECC:0/10
MLD is disabled on interface

FastEthernet0/0 is up, line protocol is up
Internet address is FE80::233:33FF:FE33:3333/10
MLD is enabled on interface
Current MLD version is 2
MLD query interval is 125 seconds
MLD querier timeout is 255 seconds
MLD max query response time is 10 seconds
Last member query response interval is 1 seconds
MLD activity: 9 joins, 0 leaves
MLD querying router is FE80::233:33FF:FE33:3333 (this system)

Loopback0 is up, line protocol is up
Internet address is FE80::C003:8FF:FECC:0/10
MLD is enabled on interface
Current MLD version is 2
MLD query interval is 125 seconds
MLD querier timeout is 255 seconds
MLD max query response time is 10 seconds
Last member query response interval is 1 seconds
MLD activity: 6 joins, 0 leaves
MLD querying router is FE80::C003:8FF:FECC:0 (this system)
Serial0/1.23 is up, line protocol is up
Internet address is FE80::C003:8FF:FECC:0/10
MLD is enabled on interface
Current MLD version is 2
MLD query interval is 125 seconds
MLD querier timeout is 255 seconds
MLD max query response time is 10 seconds
Last member query response interval is 1 seconds
MLD activity: 7 joins, 0 leaves
MLD querying router is FE80::C003:8FF:FECC:0 (this system)

R3#

Just adding the global command "ipv6 multicast-routing" enabled MLD on the interfaces, and enabled PIM as well. The tunnel created is used to send register messages to RPs when multicast content is seen. The command was so easy, I added it to the other 5 routers as well. (Not shown here, but if you want to see it, just revisit the command in the previous example 5 more times. :) )

Now we can take a look at some additional information such as the tunnel interface that was created and neighborships from a PIM perspective.

R3#show ipv6 int brief
FastEthernet0/0 [up/up]
FE80::233:33FF:FE33:3333
2002:34::3
Serial0/1 [up/up]
Serial0/1.23 [up/up]
FE80::C003:8FF:FECC:0
2002:23::3
Loopback0 [up/up]
FE80::C003:8FF:FECC:0
2002:3333::3
Tunnel0 [up/up] FE80::C003:8FF:FECC:0 unnumbered (FastEthernet0/0)

R3#show ipv6 pim neighbor
Neighbor Address Interface Uptime Expires DR pri Bidir

FE80::244:44FF:FE44:4444 FastEthernet0/0 00:00:55 00:01:21 1 (DR) B
FE80::C002:8FF:FECC:0 Serial0/1.23 00:00:57 00:01:18 1 B

R3#

Looks like R4 won the DR election on the FA0/0 segment. All is fair in love and multicast. (Earlier, before R4 was enabled for multicast routing, R3 was the DR; R4 is now the DR for the segment due to its higher IP address.)
Next we will hard-code an RP for the entire domain to use. R2's Loopback0 will do fine. We will use the following command on R2, as well as on the other 5 routers.

R2(config)#ipv6 pim rp-address 2002:2222::2
%SYS-5-CONFIG_I: Configured from console by console
%LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel1, changed state to up
%LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel2, changed state to up
R2#

On R2, PIM tunnels are automagically created. These are used for initial register messages. In sparse mode, the RP will build a shortest path tree (by default), in the direction of the multicast source.

R2#show ipv6 pim tunnel
Tunnel0*
Type : PIM Encap
RP : Embedded RP Tunnel
Source: 2002:12::2
Tunnel1*
Type : PIM Encap
RP : 2002:2222::2*
Source: 2002:12::2
Tunnel2*
Type : PIM Decap
RP : 2002:2222::2*
Source: -

R2#

Let's see if everyone agrees that R2 should be the RP. They should, since we hard coded it on all 6 routers. We will do a quick check on R3. All the others should be similar.

R3#show ipv6 pim group-map ff00::/8
IP PIM Group Mapping Table
(* indicates group mappings being used)

FF00::/8*
SM, RP: 2002:2222::2
RPF: Fa0/0,FE80::244:44FF:FE44:4444
Info source: Static
Uptime: 00:04:44, Groups: 0
FF00::/8
SM
Info source: Default
Uptime: 00:17:22, Groups: 0

R3#

Sweet! Since we are on R3, let's have R3's Loopback0 interface join a multicast group. IGMP has been replaced in IPv6 multicast with Multicast Listener Discovery (MLD), which uses IPv6 ICMP for its communications. Let's use the group FF08:AAAA::1.

R3(config)#int lo 0
R3(config-if)#ipv6 mld join-group FF08:AAAA::1

Let's look at the mroute table. It appears that R3 is ready, willing, and able to receive that stream of traffic. We should follow the path upstream, and see if the join went all the way to R2 (the RP).

R3#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:00:32/never, RP 2002:2222::2, flags: SCLJ
Incoming interface: FastEthernet0/0
RPF nbr: FE80::244:44FF:FE44:4444
Immediate Outgoing interface list:
Loopback0, Forward, 00:00:32/never
R3#

R4#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:02:40/00:02:47, RP 2002:2222::2, flags: S
Incoming interface: FastEthernet0/1
RPF nbr: FE80::255:55FF:FE55:5555
Immediate Outgoing interface list:
FastEthernet0/0, Forward, 00:02:40/00:02:47
R4#

R5#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:02:57/00:03:27, RP 2002:2222::2, flags: S
Incoming interface: FastEthernet0/0
RPF nbr: FE80::266:66FF:FE66:6666
Immediate Outgoing interface list:
FastEthernet0/1, Forward, 00:02:57/00:03:27
R5#

R6#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:03:17/00:03:08, RP 2002:2222::2, flags: S
Incoming interface: FastEthernet0/1
RPF nbr: FE80::211:11FF:FE11:1111
Immediate Outgoing interface list:
FastEthernet0/0, Forward, 00:03:17/00:03:08
R6#

R1#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:03:39/00:02:47, RP 2002:2222::2, flags: S
Incoming interface: FastEthernet0/0
RPF nbr: FE80::222:22FF:FE22:2222
Immediate Outgoing interface list:
FastEthernet0/1, Forward, 00:03:39/00:02:47
R1#

R2#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:04:05/00:03:25, RP 2002:2222::2, flags: S
Incoming interface: Tunnel2
RPF nbr: 2002:2222::2
Immediate Outgoing interface list:
FastEthernet0/0, Forward, 00:04:05/00:03:25
R2#

Looks like the join for that group went all the way back to the RP. (One of the reasons I did the traceroute earlier was to show the unicast routing path for IPv6.) Notice that the RPF path back to the RP of 2002:2222::2 follows the same path the traceroute did. Convenient, isn't it?

OK, so now we just need some content. We can use R6 to emulate a multicast server for that group by doing a ping. First, let's enable a debug of IPv6 PIM on the RP, R2.

R2#debug ipv6 pim
IPv6 PIM debugging is on

Now a ping from R6.

R6#ping ff08:aaaa::1
Output Interface: Fastethernet0/0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to FF08:AAAA::1, timeout is 2 seconds:
Packet sent with a source address of 2002:56::6

Reply to request 0 received from 2002:3333::3, 248 ms
Reply to request 1 received from 2002:3333::3, 64 ms
Reply to request 2 received from 2002:3333::3, 80 ms
Reply to request 3 received from 2002:3333::3, 72 ms
Reply to request 4 received from 2002:3333::3, 84 ms
Success rate is 100 percent (5/5), round-trip min/avg/max = 64/109/248 ms
5 multicast replies and 0 errors.

Cool. Let's look at the debug output on the RP.

IPv6 PIM: Received J/P on FastEthernet0/0 from FE80::211:11FF:FE11:1111 target: FE80::222:22FF:FE22:2222 (to us)
IPv6 PIM: J/P entry: Join root: 2002:2222::2 group: FF08:AAAA::1 flags: RPT WC S
IPv6 PIM: (*,FF08:AAAA::1) FastEthernet0/0 Raise J/P expiration timer to 210 seconds

Let's also investigate R5, which sits in the path between R6 (the multicast server) and R3 (which joined the group).

R5#show ipv6 mroute
Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group,
C - Connected, L - Local, I - Received Source Specific Host Report,
P - Pruned, R - RP-bit set, F - Register flag, T - SPT-bit set,
J - Join SPT
Timers: Uptime/Expires
Interface state: Interface, State

(*, FF08:AAAA::1), 00:11:18/00:03:06, RP 2002:2222::2, flags: S
Incoming interface: FastEthernet0/0
RPF nbr: FE80::266:66FF:FE66:6666
Immediate Outgoing interface list:
FastEthernet0/1, Forward, 00:11:18/00:03:06

(2002:56::6, FF08:AAAA::1), 00:03:09/00:00:20, flags: ST
Incoming interface: FastEthernet0/0
RPF nbr: FE80::266:66FF:FE66:6666
Immediate Outgoing interface list:
FastEthernet0/1, Forward, 00:03:08/00:03:16
R5#

Another command that provides additional information is the show ipv6 pim topology command.

R5#show ipv6 pim topology
IP PIM Multicast Topology Table
Entry state: (*/S,G)[RPT/SPT] Protocol Uptime Info
Entry flags: KAT - Keep Alive Timer, AA - Assume Alive, PA - Probe Alive,
RA - Really Alive, LH - Last Hop, DSS - Don't Signal Sources,
RR - Register Received, SR - Sending Registers, E - MSDP External,
DCC - Don't Check Connected
Interface state: Name, Uptime, Fwd, Info
Interface flags: LI - Local Interest, LD - Local Disinterest,
II - Internal Interest, ID - Internal Disinterest,
LH - Last Hop, AS - Assert, AB - Admin Boundary

(*,FF08:AAAA::1)
SM UP: 00:11:53 JP: Join(now) Flags:
RP: 2002:2222::2
RPF: FastEthernet0/0,FE80::266:66FF:FE66:6666
FastEthernet0/1 00:11:53 fwd Join(00:02:31)

(2002:56::6,FF08:AAAA::1)
SM SPT UP: 00:03:44 JP: Join(never) Flags: KAT(00:00:50) AA PA RA
RPF: FastEthernet0/0,FE80::266:66FF:FE66:6666*
FastEthernet0/1 00:03:44 fwd Join(00:02:41)
R5#

That’s a great little jumpstart into IPv6 multicast.

One item of note is that hard-coding the RP is not very scalable. The other options include BSR and Embedded RP (where the RP address is embedded within the multicast group address itself). I will include those examples in another blog post. Or if you can't wait, jump right into our RS workbooks for a wealth of information, insight and practice labs to improve your Tier1, Tier2 and Tier3 skills.
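As a quick preview of the BSR option, making R2 both the candidate BSR and a candidate RP would look something like the following. This is a rough sketch from memory rather than a verified config, so double-check the syntax (and the optional priority and group-range arguments) on your platform:

R2(config)#ipv6 pim bsr candidate bsr 2002:2222::2
R2(config)#ipv6 pim bsr candidate rp 2002:2222::2

With something like that in place, the other routers learn the RP dynamically and the static ipv6 pim rp-address command is no longer needed on every box.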

Any skills you develop in IPv4 multicasting will greatly assist you with IPv6 multicasting, and they are both on the blueprint.

Best wishes.

Mar
20

A pretty important topic that is very easy to overlook when studying multicast is the PIM Assert Mechanism. After working with the TechEdit Team in the IEOC, it is obvious that more than just a handful of students are confused about what this mechanism does and how it works. In this blog post (the first of many dedicated to multicasting), we will examine the PIM Assert mechanism and put this topic behind us in our preparation for mastering multicast.

In Figure 1, R1 and R4 have a route to the source 150.1.5.5 (the multicast source), and share a multi-access connection to R6. R6’s FastEthernet0/0 interface has joined the multicast group 239.6.6.6.

Figure 1

Both R1 and R4 are receiving copies of the same multicast packets from the source (illustrated by the yellow arrows), but it’s not very efficient for both routers to forward the packets onto the same network segment.  This would result in duplicate traffic and a waste of bandwidth and processing power.

To stop this duplication of shared traffic, PIM routers connected to a shared segment will elect a single forwarder for that particular segment.  Since PIM does not have its own routing protocol that could be used to determine the best path to send data across, it relies on a special process called the PIM Assert Mechanism to make this determination.

This mechanism tells a router that when it receives a multicast packet from a particular source on an interface that is already listed in its own Outgoing Interface List (OIL) for the same (S,G) pair, it needs to send an Assert Message. Assert Messages contain the metric of the unicast route to the source in question, the Administrative Distance of the protocol that discovered the route, and the multicast source and group themselves, and are used to elect what is called the PIM Forwarder.

In the scenario in Figure 1, both R1 and R4 will send the same multicast stream to R6. This means they will put their VLAN 146 interfaces into the OIL for the (S,G) pair (150.1.5.5, 239.6.6.6), and because this is a LAN segment each device will see the other's stream. This condition, each router producing duplicate packets on the segment, will trigger the Assert Mechanism.

These Assert Messages are used to elect the PIM forwarder using the following three rules:

  1. The router generating an Assert with the lowest Administrative Distance is elected the forwarder.  The AD would only differ if the routes to R5 were from different routing protocols.  If the Administrative Distances are the same then we move to step 2.
  2. The best unicast routing metric is used to break the tie if the Administrative Distances are the same.  The combination of AD and the unicast routing metric is referred to as a "tuple". If the metrics are also the same then we move on to step 3.
  3. The device with the highest IP Address will be elected as the PIM Forwarder.

When a device is elected to be the PIM Forwarder it will continue to send the multicast stream while the other device stops forwarding that group's traffic.  Furthermore, the “Assert Loser” will prune its physical interface connected to the shared media.

Using the following show commands we can see the outcome of the election.  R4 shows that it won the election by displaying the “A” associated with the interface that is forwarding multicasts, and R1 prunes its interface exactly as we discussed.

R4#show ip mroute 239.6.6.6
<output omitted for clarity>
(150.1.5.5, 239.6.6.6), 00:01:04/00:01:12, flags: T
  Incoming interface: Serial0/1/0, RPF nbr 155.1.45.5
  Outgoing interface list:
    FastEthernet0/0, Forward/Sparse-Dense, 00:00:39/00:00:00, A

R1#show ip mroute 239.6.6.6
<output omitted for clarity>
(150.1.5.5, 239.6.6.6), 00:01:15/00:01:24, flags: PT
  Incoming interface: Serial0/0.1, RPF nbr 155.1.0.5
  Outgoing interface list:
    FastEthernet0/0, Prune/Sparse-Dense, 00:01:27/00:01:32

Do not confuse the PIM Forwarder with the Designated Router; the PIM Forwarder’s job is simply to forward multicast traffic onto a shared segment. We will cover the Designated Router in another blog post.  For a more detailed explanation of this process and its application, please check out the newly revised Multicast section of our Volume 1 Workbook.

Dec
26

IPv6 multicast renames IGMP to the Multicast Listener Discovery (MLD) protocol. Version 1 of MLD is similar to IGMP Version 2, while Version 2 of MLD is similar to IGMP Version 3. As such, MLD Version 2 supports Source Specific Multicast (SSM) for IPv6 environments.

Using MLD, hosts can indicate they want to receive multicast transmissions for select groups. Routers (queriers) can control the flow of multicast in the network through the use of MLD.

MLD uses the Internet Control Message Protocol for IPv6 (ICMPv6) to carry its messages. All such messages are link-local in scope, and they all have the Router Alert option set.

MLD uses three types of messages - Query, Report, and Done. The Done message is like the Leave message in IGMP version 2. It indicates a host no longer wants to receive the multicast transmission. This triggers a Query to check for any more receivers on the segment.
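For reference, these messages are carried as ICMPv6 types: the MLDv1 Query, Report, and Done use ICMPv6 types 130, 131, and 132 respectively, while the MLDv2 Report uses type 143.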

Configuration options for MLD will be very similar to the configuration tasks we needed to master for IGMP. You can limit the number of MLD states (groups joined) on an interface with the ipv6 mld limit command. If you want the interface to "permanently" subscribe to a group, you can use the ipv6 mld join-group command. Also, like in IGMP, there are several timers you may manipulate for the protocol's mechanics.
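As a minimal sketch (the interface name and group address here are just assumptions for illustration), those two interface-level commands might look like this:

interface FastEthernet0/0
 ipv6 mld join-group FF0E::101
 ipv6 mld limit 10

You could then confirm the resulting group membership with a command such as show ipv6 mld groups.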

Configuring IPv6 multicast routing with the global configuration command ipv6 multicast-routing automatically enables Protocol Independent Multicast (PIM) on all active interfaces. This also includes the automatic configuration of MLD.
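The enabling step itself is just that one global command. Here is a minimal sketch on R0 (ipv6 unicast-routing is shown as well on the assumption that the box also routes IPv6 unicast traffic):

R0#configure terminal
R0(config)#ipv6 unicast-routing
R0(config)#ipv6 multicast-routing
R0(config)#end

With that in place, here are the verifications: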

R0#show ipv6 pim interface
Interface          PIM  Nbr   Hello  DR
                        Count Intvl  Prior

Tunnel0            off  0     30     1
    Address: FE80::C000:2FF:FE97:0
    DR     : not elected
VoIP-Null0         off  0     30     1
    Address: ::
    DR     : not elected
FastEthernet0/0    on   0     30     1
    Address: FE80::C000:2FF:FE97:0
    DR     : this system
FastEthernet0/1    off  0     30     1
    Address: ::
    DR     : not elected

Notice that PIM is indeed enabled on the Fa0/0 interface we have configured in this scenario. Now for the verification of MLD:

R0#show ipv6 mld interface
Tunnel0 is up, line protocol is up
  Internet address is FE80::C000:2FF:FE97:0/10
  MLD is disabled on interface
VoIP-Null0 is up, line protocol is up
  Internet address is ::/0
  MLD is disabled on interface
FastEthernet0/0 is up, line protocol is up
  Internet address is FE80::C000:2FF:FE97:0/10
  MLD is enabled on interface
  Current MLD version is 2
  MLD query interval is 125 seconds
  MLD querier timeout is 255 seconds
  MLD max query response time is 10 seconds
  Last member query response interval is 1 seconds
  MLD activity: 5 joins, 0 leaves
  MLD querying router is FE80::C000:2FF:FE97:0 (this system)
FastEthernet0/1 is administratively down, line protocol is down
  Internet address is ::/0
  MLD is disabled on interface

Notice that the similarities to IGMP are striking.

Thanks for reading, and I hope to "see you" again soon here at the INE blog.

Dec
16

IPv6 multicast is an important new blueprint topic for the Version 4.X CCIE R&S Lab Exam as well as the Written Qualification Exam. In this post, we will start at the most logical starting point for this topic - the IPv6 multicast addressing in use.

Like in IP version 4, multicast refers to addressing nodes so that a copy of data will be sent to all nodes that possess the address. Multicast allows for the elimination of broadcasts in IPv6. Broadcasts in IP version 4 were problematic, since a copy of the data was delivered to every node on the segment, whether the node cared to receive the information or not.

Multicast addresses are quickly detected by the initial bit settings. A multicast address begins with the first 8 bits set to 1 (11111111). The corresponding IPv6 prefix notation is FF00::/8.

Following the initial 8 bits, there are 4 bits (labeled 0RPT) which are flag fields. The high-order flag is reserved, and must be initialized to 0. If the R bit is set to 1, then the P and T bits must also be set to 1. This indicates there is an embedded Rendezvous Point (RP) address in the multicast address.

The next four bits are scope. The possible scope values are:

0  reserved
1  Interface-Local scope
2  Link-Local scope
3  reserved
4  Admin-Local scope
5  Site-Local scope
6  (unassigned)
7  (unassigned)
8  Organization-Local scope
9  (unassigned)
A  (unassigned)
B  (unassigned)
C  (unassigned)
D  (unassigned)
E  Global scope
F  reserved

The remaining 112 bits of the address make up the multicast Group ID. An example of an IPv6 multicast address would be all of the NTP servers on the Internet - FF0E:0:0:0:0:0:0:101.
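Breaking that example down: the leading FF marks the address as multicast, the flags nibble is 0, the scope nibble is E (Global scope from the table above), and the remaining 112 bits (0:0:0:0:0:0:101) form the group ID.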

Notice, like in IPv4 multicast, there are many reserved addresses of link-local scope. Here are some examples:

FF02:0:0:0:0:0:0:1 - all nodes
FF02:0:0:0:0:0:0:2 - all routers
FF02:0:0:0:0:0:0:9 - all RIP routers

A special, reserved IPv6 multicast address that you should be aware of is the Solicited-Node multicast address:

FF02:0:0:0:0:1:FFXX:XXXX

A Solicited-Node multicast address is created automatically for you by the router. It takes the low-order 24 bits of the IPv6 address (unicast or anycast) and appends those bits to the prefix FF02:0:0:0:0:1:FF00::/104. This results in a multicast address within the range FF02:0:0:0:0:1:FF00:0000 to FF02:0:0:0:0:1:FFFF:FFFF. These addresses are used by the IPv6 Neighbor Discovery (ND) protocol in order to provide a much more efficient address resolution protocol than Address Resolution Protocol (ARP) of IPv4.
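For example, for the unicast address 2001:1::C001:1FF:FE47:0 that we will configure below, the low-order 24 bits are 47:0000, so the corresponding Solicited-Node address is FF02::1:FF47:0 - exactly the group you will see joined in the router output that follows.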

Now that we understand the addressing, let us see it in action on a Cisco router.

R1#configure terminal
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#ipv6 unicast-routing
R1(config)#interface fa0/0
R1(config-if)#ipv6 address 2001:1::/64 eui-64
R1(config-if)#no shutdown
R1(config-if)#
*Mar  1 00:03:32.627: %LINK-3-UPDOWN: Interface FastEthernet0/0, changed state to up
*Mar  1 00:03:33.627: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/0, changed state to up
R1(config-if)#do show ipv6 interface fa0/0
FastEthernet0/0 is up, line protocol is up
  IPv6 is enabled, link-local address is FE80::C001:1FF:FE47:0
  No Virtual link-local address(es):
  Global unicast address(es):
    2001:1::C001:1FF:FE47:0, subnet is 2001:1::/64 [EUI]
  Joined group address(es):
    FF02::1
    FF02::2
    FF02::1:FF47:0
  MTU is 1500 bytes
  ICMP error messages limited to one every 100 milliseconds
  ICMP redirects are enabled
  ICMP unreachables are sent
  ND DAD is enabled, number of DAD attempts: 1
  ND reachable time is 30000 milliseconds
  ND advertised reachable time is 0 milliseconds
  ND advertised retransmit interval is 0 milliseconds
  ND router advertisements are sent every 200 seconds
  ND router advertisements live for 1800 seconds
  ND advertised default router preference is Medium
  Hosts use stateless autoconfig for addresses.

Notice that because I enabled IPv6 routing capabilities on this device, one of the multicast groups joined is the all-routers group for the local link (FF02::2). Also note the Solicited-Node multicast address of FF02::1:FF47:0.

I hope you have enjoyed this presentation on IPv6 multicast and will be joining us for more. If you want practice right away with these topics, check out any of our CCIE R&S products.
