May
05

This weekend while working on content updates for CCIE R&S Version 5, I ran into an interesting problem.  In order to test some nuances of routing protocol updates and packet fragmentation, I was trying to generate BGP UPDATE messages that would exceed the transit MTU.  To do this I manually created a bunch of Loopback interfaces and did a redistribute connected into BGP.  When I looked at the packet capture details, I started to realize how many routes I'd actually need in order to fill up the packet sizes.  After wasting about 30 minutes copying and pasting new Loopbacks over and over, I decided to come up with a better automated solution instead.  I thought, “why not just have the router generate its own random Loopback addresses and then advertise them into BGP?” Well surprisingly I actually got it to work, despite my amateur at best coding skills.
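To get a rough feel for how many routes that takes, the sketch below counts the bytes in a single IPv4 BGP UPDATE. It assumes the RFC 4271 layout (a 19-byte header, two 2-byte length fields, and one length octet plus the minimum number of prefix octets per NLRI entry). The 30-byte path attribute figure and the function names are illustrative assumptions, not values taken from my capture.

```python
# Rough sizing of an IPv4 BGP UPDATE, assuming the RFC 4271 message layout.
# The default attribute size is a placeholder assumption, not a measured value.

BGP_HEADER_BYTES = 19     # marker (16) + length (2) + type (1)
LENGTH_FIELDS_BYTES = 4   # withdrawn routes length (2) + total path attribute length (2)


def nlri_bytes(prefix_len: int) -> int:
    """Bytes used by one NLRI entry: a length octet plus the prefix octets."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("IPv4 prefix length must be between 0 and 32")
    return 1 + (prefix_len + 7) // 8


def update_bytes(prefix_lens, attr_bytes: int = 30) -> int:
    """Size of one UPDATE carrying the given prefixes with one shared attribute set."""
    fixed = BGP_HEADER_BYTES + LENGTH_FIELDS_BYTES + attr_bytes
    return fixed + sum(nlri_bytes(p) for p in prefix_lens)


def prefixes_needed(target_bytes: int, prefix_len: int, attr_bytes: int = 30) -> int:
    """Smallest number of equal-length prefixes whose UPDATE exceeds target_bytes."""
    fixed = BGP_HEADER_BYTES + LENGTH_FIELDS_BYTES + attr_bytes
    remaining = max(target_bytes - fixed, 0)
    return remaining // nlri_bytes(prefix_len) + 1


if __name__ == "__main__":
    # How many /24 routes push one UPDATE past a 1460-byte TCP payload
    print(prefixes_needed(1460, 24))
```

Keep in mind that RFC 4271 caps a single BGP message at 4096 bytes and that BGP rides on a TCP byte stream, so this only estimates message size, not how TCP will segment it on the wire.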

The following TCL script is used to generate a given number of Loopback interfaces with random IPv4 and IPv6 addresses.  To use it simply start the tclsh from the IOS CLI, paste the procedure in, then invoke it with generate_loopbacks X, where “X” is the number of routes you want to generate.  Note that I didn’t add any error checking for overlapping addresses or invalid address and mask combinations.  If someone wants to update the script to account for this, please feel free to do so and I’ll throw 100 rack rental tokens your way for the trouble. Edit: Special thanks to Jason Cook for adding the error checking for me.

A quick demo of the script in action can be found after the jump.

The script:

proc generate_loopbacks {x} {

# random number generator
proc rand_range { min max } { return [expr int(rand() * ($max - $min)) + $min] }

# define subnet mask lengths
set len(1) 128.0.0.0
set len(2) 192.0.0.0
set len(3) 224.0.0.0
set len(4) 240.0.0.0
set len(5) 248.0.0.0
set len(6) 252.0.0.0
set len(7) 254.0.0.0
set len(8) 255.0.0.0
set len(9) 255.128.0.0
set len(10) 255.192.0.0
set len(11) 255.224.0.0
set len(12) 255.240.0.0
set len(13) 255.248.0.0
set len(14) 255.252.0.0
set len(15) 255.254.0.0
set len(16) 255.255.0.0
set len(17) 255.255.128.0
set len(18) 255.255.192.0
set len(19) 255.255.224.0
set len(20) 255.255.240.0
set len(21) 255.255.248.0
set len(22) 255.255.252.0
set len(23) 255.255.254.0
set len(24) 255.255.255.0
set len(25) 255.255.255.128
set len(26) 255.255.255.192
set len(27) 255.255.255.224
set len(28) 255.255.255.240
set len(29) 255.255.255.248
set len(30) 255.255.255.252
set len(31) 255.255.255.254
set len(32) 255.255.255.255

# Iterate the loop $x times

for {set n 1} {$n<=$x} {incr n 1} {

# generate random IPv4 address
set a [rand_range 1 223]
set b [rand_range 1 255]
set c [rand_range 1 255]
set d [rand_range 1 255]

# generate random IPv4 mask
set y [rand_range 1 32]

# generate random IPv6 address
set e [format %x [rand_range 1 65534]]
set f [format %x [rand_range 1 65534]]
set g [format %x [rand_range 1 65534]]
set h [format %x [rand_range 1 65534]]
set i [format %x [rand_range 1 65534]]
set j [format %x [rand_range 1 65534]]
set k [format %x [rand_range 1 65534]]

# generate random IPv6 mask
set z [rand_range 16 64]

# set error check variable
set m 0

# set $LOOPBACK_NUMBER
set LOOPBACK_NUMBER [expr 10000 + $n]

# send IOS configuration commands
set OUTPUT [ ios_config "interface Loopback$LOOPBACK_NUMBER" "ip address $a.$b.$c.$d $len($y)" "ipv6 address 2001:$e:$f:$g:$h:$i:$j:$k/$z" ]

# Split the OUTPUT variable into individual lines, and for each line place it into the variable LINE
foreach LINE [split $OUTPUT "\n"] {

# check if the LINE variable contains an indication that there is a problem with a random address
# and if so, set a variable m to a specific value
if { [regexp "is overlapping with" $LINE] } {
set m 1 } elseif { [regexp "overlaps with" $LINE] } {
set m 1 } elseif { [regexp "Bad mask" $LINE] } {
set m 1 }

}

# once all lines have been checked, if the variable m is 1 decrement the variable n
# used to control the for loop by 1 (a single time), forcing the most recent
# loopback to be re-iterated by the above script
if { $m == 1 } {
incr n -1 }

}

}
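For comparison, here is a minimal off-box sketch of the same idea in Python, assuming the standard `ipaddress` module. It derives the dotted-decimal mask from the prefix length instead of using a lookup table, and it retries when a candidate overlaps a network that was already chosen. The function names, the default prefix-length bounds, and the attempt limit are my own illustrative choices, not part of the TCL script. Note that `rand_range` as written above returns values from min up to max - 1, so `rand_range 1 32` never yields a /32, whereas Python's `randint` includes both ends.

```python
# Sketch only: pick random IPv4 networks and retry on overlap.
# Names, prefix-length bounds, and max_attempts are illustrative assumptions.
import ipaddress
import random


def prefix_len_to_mask(prefix_len: int) -> str:
    """Dotted-decimal mask for a prefix length, replacing a hand-written table."""
    return str(ipaddress.ip_network(f"0.0.0.0/{prefix_len}").netmask)


def random_networks(count, min_len=8, max_len=32, rng=None, max_attempts=10000):
    """Return `count` random IPv4 networks, none of which overlap each other."""
    rng = rng or random.Random()
    chosen = []
    attempts = 0
    while len(chosen) < count:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not find enough non-overlapping networks")
        octets = [rng.randint(1, 222)] + [rng.randint(1, 254) for _ in range(3)]
        length = rng.randint(min_len, max_len)
        address = ".".join(str(octet) for octet in octets)
        # strict=False masks off the host bits instead of raising an error
        candidate = ipaddress.ip_network(f"{address}/{length}", strict=False)
        if any(candidate.overlaps(existing) for existing in chosen):
            continue  # same effect as the script's retry on an overlap message
        chosen.append(candidate)
    return chosen


if __name__ == "__main__":
    for net in random_networks(5):
        print(net, prefix_len_to_mask(net.prefixlen))
```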

Below is a basic demo of the script in action. R1 and R2 are directly connected on the IPv4 network 10.0.0.0/24 and the IPv6 network 2001::/64. They are peering EBGP, and R1 is doing redistribute connected into BGP in both IPv4 Unicast and IPv6 Unicast address families.

R1#show ip int brief
Interface              IP-Address      OK? Method Status                Protocol
GigabitEthernet1 unassigned YES NVRAM up up
GigabitEthernet1.10 10.0.0.1 YES manual up up
GigabitEthernet2 unassigned YES NVRAM administratively down down
GigabitEthernet3 unassigned YES NVRAM administratively down down

R1#show ipv6 int brief
GigabitEthernet1 [up/up]
unassigned
GigabitEthernet1.10 [up/up]
FE80::250:56FF:FE8D:4B00
2001::1
GigabitEthernet2 [administratively down/down]
unassigned
GigabitEthernet3 [administratively down/down]
unassigned

R1#sh run | s bgp
router bgp 1
bgp log-neighbor-changes
neighbor 10.0.0.2 remote-as 2
neighbor 2001::2 remote-as 2
!
address-family ipv4
redistribute connected
neighbor 10.0.0.2 activate
no neighbor 2001::2 activate
exit-address-family
!
address-family ipv6
redistribute connected
neighbor 2001::2 activate
exit-address-family

R2#sh bgp ipv4 unicast summary
BGP router identifier 10.0.0.2, local AS number 2
BGP table version is 144, main routing table version 144
1 network entries using 248 bytes of memory
1 path entries using 120 bytes of memory
1/1 BGP path/bestpath attribute entries using 240 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 632 total bytes of memory
BGP activity 74/72 prefixes, 74/72 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.0.0.1 4 1 22 19 144 0 0 00:13:52 1

R2#sh bgp ipv6 unicast summary
BGP router identifier 10.0.0.2, local AS number 2
BGP table version is 4, main routing table version 4
1 network entries using 272 bytes of memory
1 path entries using 144 bytes of memory
1/1 BGP path/bestpath attribute entries using 240 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 680 total bytes of memory
BGP activity 74/72 prefixes, 74/72 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
2001::1 4 1 10 8 4 0 0 00:04:14 1

R2#show bgp ipv4 unicast
BGP table version is 144, local router ID is 10.0.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
r> 10.0.0.0/24 10.0.0.1 0 0 1 ?

R2#show bgp ipv6 unicast
BGP table version is 4, local router ID is 10.0.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
r> 2001::/64 2001::1 0 0 1 ?

From the above output we can see that R2's only BGP routes right now are the directly connected links that R1 is redistributing. Now R1 invokes the TCL script:

R1#tclsh
R1(tcl)#proc generate_loopbacks {x} {
+>(tcl)#
+>(tcl)# # random number generator
+>(tcl)# proc rand_range { min max } { return [expr int(rand() * ($max - $min)) + $min] }
+>(tcl)#
+>(tcl)# # define subnet mask lengths
+>(tcl)# set len(1) 128.0.0.0
+>(tcl)# set len(2) 192.0.0.0
+>(tcl)# set len(3) 224.0.0.0
+>(tcl)# set len(4) 240.0.0.0
+>(tcl)# set len(5) 248.0.0.0
+>(tcl)# set len(6) 252.0.0.0
+>(tcl)# set len(7) 254.0.0.0
+>(tcl)# set len(8) 255.0.0.0
+>(tcl)# set len(9) 255.128.0.0
+>(tcl)# set len(10) 255.192.0.0
+>(tcl)# set len(11) 255.224.0.0
+>(tcl)# set len(12) 255.240.0.0
+>(tcl)# set len(13) 255.248.0.0
+>(tcl)# set len(14) 255.252.0.0
+>(tcl)# set len(15) 255.254.0.0
+>(tcl)# set len(16) 255.255.0.0
+>(tcl)# set len(17) 255.255.128.0
+>(tcl)# set len(18) 255.255.192.0
+>(tcl)# set len(19) 255.255.224.0
+>(tcl)# set len(20) 255.255.240.0
+>(tcl)# set len(21) 255.255.248.0
+>(tcl)# set len(22) 255.255.252.0
+>(tcl)# set len(23) 255.255.254.0
+>(tcl)# set len(24) 255.255.255.0
+>(tcl)# set len(25) 255.255.255.128
+>(tcl)# set len(26) 255.255.255.192
+>(tcl)# set len(27) 255.255.255.224
+>(tcl)# set len(28) 255.255.255.240
+>(tcl)# set len(29) 255.255.255.248
+>(tcl)# set len(30) 255.255.255.252
+>(tcl)# set len(31) 255.255.255.254
+>(tcl)# set len(32) 255.255.255.255
+>(tcl)#
+>(tcl)## Iterate the loop $x times
+>(tcl)#
+>(tcl)# for {set n 1} {$n<=$x} {incr n 1} {
+>(tcl)#
+>(tcl)# # generate random IPv4 address
+>(tcl)# set a [rand_range 1 223]
+>(tcl)# set b [rand_range 1 255]
+>(tcl)# set c [rand_range 1 255]
+>(tcl)# set d [rand_range 1 255]
+>(tcl)#
+>(tcl)# # generate random IPv4 mask
+>(tcl)# set y [rand_range 1 32]
+>(tcl)#
+>(tcl)# # generate random IPv6 address
+>(tcl)# set e [format %x [rand_range 1 65534]]
+>(tcl)# set f [format %x [rand_range 1 65534]]
+>(tcl)# set g [format %x [rand_range 1 65534]]
+>(tcl)# set h [format %x [rand_range 1 65534]]
+>(tcl)# set i [format %x [rand_range 1 65534]]
+>(tcl)# set j [format %x [rand_range 1 65534]]
+>(tcl)# set k [format %x [rand_range 1 65534]]
+>(tcl)#
+>(tcl)# # generate random IPv6 mask
+>(tcl)# set z [rand_range 16 64]
+>(tcl)#
+>(tcl)# # set error check variable
+>(tcl)# set m 0
+>(tcl)#
+>(tcl)# # set $LOOBACK_NUMBER
+>(tcl)# set LOOPBACK_NUMBER [expr 10000 + $n]
+>(tcl)#
+>(tcl)# # send IOS exec commands
+>(tcl)# set OUTPUT [ ios_config "interface Loopback$LOOPBACK_NUMBER" "ip address $a.$b.$c.$d $len($y)" "ipv6 address 2001:$e:$f:$g:$h:$i:$j:$k/$z" ]
+>(tcl)#
+>(tcl)# # Split the OUTPUT variable into individual lines, and for each line place it into the variable LINE
+>(tcl)# foreach LINE [split $OUTPUT "\n"] {
+>(tcl)#
+>(tcl)# # check if the LINE variable contains an indication that there is a problem with a random address
+>(tcl)# # and if so, set a variable m to a specific value
+>(tcl)# if { [regexp "is overlapping with" $LINE] } {
+>(tcl)# set m 1 } elseif { [regexp "overlaps with" $LINE] } {
+>(tcl)# set m 1 } elseif { [regexp "Bad mask" $LINE] } {
+>(tcl)# set m 1 }
+>(tcl)#
+>(tcl)# # if the variable m is 1 decrement the variable n used to control the for loop by 1
+>(tcl)# # forcing the most recent loopback to be re-iterated by the above script
+>(tcl)# if { [expr $m==1] } {
+>(tcl)# incr n -1 }
+>(tcl)# }
+>(tcl)#
+>(tcl)# }
+>(tcl)#
+>(tcl)#}
R1(tcl)#
R1(tcl)#generate_loopbacks 100

After a few minutes the script should be done and R1 should be advertising the new routes into BGP:

R2#show bgp ipv4 unicast
BGP table version is 210, local router ID is 10.0.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*> 6.95.74.8/31 10.0.0.1 0 0 1 ?
r> 10.0.0.0/24 10.0.0.1 0 0 1 ?
*> 17.48.0.0/12 10.0.0.1 0 0 1 ?
*> 18.96.0.0/12 10.0.0.1 0 0 1 ?
*> 20.149.128.0/19 10.0.0.1 0 0 1 ?
*> 24.128.0.0/12 10.0.0.1 0 0 1 ?
*> 38.223.64.0/23 10.0.0.1 0 0 1 ?
*> 45.250.118.192/26
10.0.0.1 0 0 1 ?
*> 46.43.110.166/31 10.0.0.1 0 0 1 ?
*> 53.29.128.0/17 10.0.0.1 0 0 1 ?
*> 54.192.224.0/20 10.0.0.1 0 0 1 ?
*> 54.198.33.64/26 10.0.0.1 0 0 1 ?
*> 59.137.166.128/27
10.0.0.1 0 0 1 ?
*> 63.218.166.192/26
10.0.0.1 0 0 1 ?
*> 64.0.0.0/4 10.0.0.1 0 0 1 ?
*> 82.10.112.0/20 10.0.0.1 0 0 1 ?
*> 84.223.226.96/27 10.0.0.1 0 0 1 ?
*> 86.129.194.64/27 10.0.0.1 0 0 1 ?
*> 90.91.0.0/18 10.0.0.1 0 0 1 ?
*> 90.176.106.0/23 10.0.0.1 0 0 1 ?
[snip]

R2#show bgp ipv6 unicast
BGP table version is 88, local router ID is 10.0.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
r> 2001::/64 2001::1 0 0 1 ?
*> 2001:434::/30 2001::1 0 0 1 ?
*> 2001:613:93FF:8B40::/60
2001::1 0 0 1 ?
*> 2001:7E2:8398::/46
2001::1 0 0 1 ?
*> 2001:86D:FA5C:4000::/51
2001::1 0 0 1 ?
*> 2001:8AD:9170::/44
2001::1 0 0 1 ?
*> 2001:136F:E000::/37
2001::1 0 0 1 ?
*> 2001:156A:E238::/49
2001::1 0 0 1 ?
*> 2001:169C::/30 2001::1 0 0 1 ?
*> 2001:192D:B2EC:2548::/61
2001::1 0 0 1 ?
*> 2001:1A00::/23 2001::1 0 0 1 ?
*> 2001:1C00::/23 2001::1 0 0 1 ?
[snip]

Feel free to use and modify the script any way you like!

Feb
09

Five hours of new videos have been added to INE's CCIE Service Provider Version 3.0 Advanced Technologies Class which cover Multicast VPN and QoS support for Layer 3 MPLS VPNs.  Specifically these videos cover Multicast VPN & QoS theory, configuration, verification, and troubleshooting on both regular IOS and IOS XR, including PIM Sparse Mode, PIM Source Specific Multicast, VRF Aware PIM, Multicast Distribution Trees, BGP MDT exchange, and SP QoS Classification, Marking, Scheduling, and Admission Control.  For the remainder of the month of February these Multicast videos are free for all users to view.  The links can be found below, or from your INE members site download or streaming links for the class.

Enjoy!

Oct
18

One of our most anticipated products of the year - INE's CCIE Service Provider v3.0 Advanced Technologies Class - is now complete!  The videos from class are in the final stages of post production and will be available for streaming and download access later this week.  Download access can be purchased here for $299.  Streaming access is available for All Access Pass subscribers for as low as $65/month!  AAP members can additionally upgrade to the download version for $149.

At roughly 40 hours, the CCIE SPv3 ATC covers the newly released CCIE Service Provider version 3 blueprint, which includes the addition of IOS XR hardware. This class includes both technology lectures and hands on configuration, verification, and troubleshooting on both regular IOS and IOS XR. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS.

Below you can see a sample video from the class, which covers IS-IS Route Leaking, and its implementation on IOS XR with the Routing Policy Language (RPL)

Oct
12

The BGP MED attribute, commonly referred to as the BGP metric, provides a means to convey to a neighboring Autonomous System (AS) a preferred entry point into the local AS.  BGP MED is a non-transitive optional attribute and thus the receiving AS cannot propagate it across its AS borders.  However, the receiving AS may reset the metric value upon receipt, if it so desires.

Previous versions of BGP (v2 and v3) defined this attribute as the inter-AS metric (INTER_AS_METRIC), but in BGPv4 it is defined as the multi-exit discriminator (MULTI_EXIT_DISC). The MED is an unsigned 32-bit integer, so its value can be anything from 0 to 4,294,967,295 (2^32-1), with a lower value being preferred.  Certain implementations of BGP treat a MED value of 4,294,967,295 as infinite, which would make the path unusable, so they reset the MED value to 4,294,967,294.  This rewriting of the MED value could lead to inconsistencies, unintended path selections or even churn. I’ll do a follow-up article on how BGP MED can cause an endless convergence loop in certain topologies.
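A small sketch of the value range and the rewrite described above may help show where the inconsistency comes from. The function names are hypothetical, and which platforms actually perform the rewrite is not modeled here.

```python
# Sketch of the MED value range and the "infinite MED" rewrite.
# Function names are illustrative; platform behavior is an assumption.

MED_MAX = 2**32 - 1  # 4,294,967,295, the largest unsigned 32-bit value


def validate_med(med: int) -> int:
    """Reject values that do not fit in an unsigned 32-bit field."""
    if not 0 <= med <= MED_MAX:
        raise ValueError("MED must fit in an unsigned 32-bit integer")
    return med


def rewrite_infinite_med(med: int) -> int:
    """Model an implementation that resets the maximum MED to one less."""
    med = validate_med(med)
    return MED_MAX - 1 if med == MED_MAX else med


if __name__ == "__main__":
    # Two MEDs that the sender considered different...
    sent_a, sent_b = MED_MAX, MED_MAX - 1
    # ...become indistinguishable once the receiver rewrites the maximum.
    print(rewrite_infinite_med(sent_a) == rewrite_infinite_med(sent_b))
```

The point of the example is that a sender's intended ordering between the two largest values is lost on a receiver that performs the rewrite.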

Cisco’s BGP implementation automatically assigns the value of the MED attribute based on the IGP metric value for any locally originated prefixes. The reasoning behind this is that when there are multiple peering points with a neighboring AS, the neighboring AS can use this metric to determine the best entry point into the local AS. This works when the originating AS’s network uses a single IGP. When multiple IGPs are used (e.g. OSPF and IS-IS) the metric values automatically copied into BGP will not be comparable. In this situation the metric values should be manually set before sending to the neighboring AS.

The MED value by default will only be used in Cisco’s BGP Best Path selection algorithm when comparing paths from the same AS.  If comparison is desired between different ASes, the bgp always-compare-med router configuration command can be used. Use this command with caution, as different ASes can have different policies regarding the setting of the MED value, or in the case of the MED automatically being set they could be using different IGPs. Additionally, by default MED is not compared between sub-autonomous systems in a BGP confederation. To enable comparison between different sub-ASes within a confederation, use the bgp bestpath med confed router configuration command.

As mentioned, by default the MED values are compared only for paths from the same AS, but this presents a problem because of the way BGP path comparison is done in IOS.   Let’s first examine how the path comparison is done to get a better understanding of the BGP Deterministic MED command and why Cisco recommends that it be enabled.

Here is the topology that we will use for this scenario:

BGP Deterministic MED

We will primarily look at the effects of BGP MED on the BGP best path decision process from R1’s perspective.   In this network AS 400 is advertising the 24.1.1.0/24 network.  R2, R3 and R4 are in AS 200 with R5 being in AS 300.  R2 is setting the MED for this network to 200, R3 to 300, R4 to 400 and R5 to 500 when the 24.1.1.0/24 network is advertised to R1.  R1’s BGP configuration is below:

Rack1R1# show run | sec router bgp 100
router bgp 100
no synchronization
bgp router-id 1.1.1.1
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#

The output of the show ip bgp on R1:

Rack1R1#show ip bgp
BGP table version is 2, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* 24.1.1.0/24 54.1.12.2 200 0 200 400 ?
* 54.1.14.4 400 0 200 400 ?
* 54.1.15.5 500 0 300 400 ?
*> 54.1.13.3 300 0 200 400 ?
Rack1R1#

Now let’s look at the 24.1.1.0/24 network in a little more detail:

Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 2
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external, best
Rack1R1#

As we can see, R1 has selected R3’s (3.3.3.3) advertisement of 24.1.1.0/24 as the best path.  The MED is 300 for this advertisement, which isn’t the lowest of all advertisements from AS 200.  The advertisement from R2 is actually lower, as it has a MED value of 200.  Remember that the lower MED value is preferred, since this value is normally copied from the IGP metric and with IGPs the lower metric value is preferred.  Since the MED attribute is optional, it may not be present in all paths. By default, the BGP process will assume a MED value of zero for such paths, which makes them more preferred during the selection based on metric. If you want to change this behavior, use the bgp bestpath med missing-as-worst router configuration command.
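The effect of a missing MED can be sketched in a few lines. This is a simplified model: the function name is hypothetical, and the exact "worst" value a given platform substitutes is an assumption here (the sketch simply uses the largest 32-bit value).

```python
# Sketch: how a missing MED is treated, with and without missing-as-worst.
# WORST_MED is an assumed stand-in for "a very large MED"; platforms may differ.

WORST_MED = 2**32 - 1


def effective_med(med, missing_as_worst=False):
    """Return the MED used for comparison; None means the attribute is absent."""
    if med is None:
        return WORST_MED if missing_as_worst else 0
    return med


if __name__ == "__main__":
    # Default behavior: the path with no MED beats the path with MED 200
    print(effective_med(None) < effective_med(200))
    # With missing-as-worst: the path with MED 200 is preferred instead
    print(effective_med(None, missing_as_worst=True) > effective_med(200))
```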

Let’s look at how R1 ended up selecting R3 as the best path.  First off, the router will order the paths from the newest to the oldest.  By default, with all other factors in the BGP best path decision process being the same, the oldest path will be selected as best.  BGP does this to reduce the amount of churn in the routing table.  To change this behavior and not use the oldest path as the best, the BGP router-ID can be used to determine the best path.  To enable this, use the bgp bestpath compare-routerid router configuration command.

Below the bgp bestpath compare-routerid command is enabled on R1.  Now R1 has selected R2’s path as the best since it has the lowest BGP router ID.

Rack1R1#show run | sec router bgp
router bgp 100
no synchronization
bgp router-id 1.1.1.1
bgp bestpath compare-routerid
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 3
Paths: (4 available, best #1, table Default-IP-Routing-Table)
Flag: 0x10840
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external, best
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
Rack1R1#

The bgp bestpath compare-routerid command is removed for the remainder of this scenario.  When the command is removed R3 is once again selected as best.

Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 4
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Flag: 0x10840
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external, best
Rack1R1#

Additionally, RFC 4277 (Experience with the BGP-4 Protocol) mentions the following regarding selecting a path based upon the oldest path.

7.1.4.  MEDs and Temporal Route Selection

Some implementations have hooks to apply temporal behavior in MED-based best path selection. That is, all things being equal up to MED consideration, preference would be applied to the "oldest" path, without preference for the lower MED value. The reasoning for this is that "older" paths are presumably more stable, and thus preferable. However, temporal behavior in route selection results in non-deterministic behavior, and as such, may often be undesirable.

 

Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 4
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Flag: 0x820
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external, best
Rack1R1#

First off, it’s important to understand that the paths are compared in pairs, starting with the newest path and comparing it with the second newest.  The winner of that comparison is then compared to the third path, and in our case the winner of that comparison is finally compared with the fourth and final path.   On R1 for the 24.1.1.0/24 network, R2’s and R3’s paths are compared first.  Everything in the BGP best path decision algorithm is the same down to MED (weight, local preference, AS path, etc).  Since the advertisements by R2 and R3 are from the same AS, the MED is compared and R2 wins since it has a MED of 200 as opposed to R3’s MED of 300.  Next, R2 is compared to the third entry in the list, which is R4’s.  R2 and R4 are in the same AS, so R2 wins based upon the lower MED value. Finally, R2 is compared with R5. Everything is equal but the MED, router ID and age of the advertisement.  Since R2 and R5 are in different ASes and bgp always-compare-med isn’t enabled, MED isn’t compared.  Additionally, we do not have bgp bestpath compare-routerid enabled, which leads R1 to select the oldest advertisement.  Since R5 is listed below R2, we know that it is older, so it wins and is installed as the best path to reach the 24.1.1.0/24 network.

As we can see, the MED comparison between the paths advertised by AS 200 did not happen as intended by AS 200.   AS 200 was setting the MED so that AS 100 would use R2 as the ingress point into AS 200. This is only because R5’s advertisement was the second oldest, which in turn broke the MED comparison between the AS 200 routers (R2, R3 and R4).
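The pairwise walk described above can be sketched as a small model. It assumes every attribute ahead of MED is equal, as in this scenario, and that the list is ordered newest to oldest the way the router displays it. The class, function, and option names are hypothetical, not IOS internals.

```python
# Simplified model of the newest-to-oldest pairwise comparison.
# Assumes all attributes ahead of MED are equal; names are illustrative.
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Path:
    name: str
    neighbor_as: int
    med: int
    router_id: str


def pick(newer, older, always_compare_med=False, compare_routerid=False):
    """Return the winner between the current best (newer) and the next path (older)."""
    if always_compare_med or newer.neighbor_as == older.neighbor_as:
        if newer.med != older.med:
            return newer if newer.med < older.med else older
    if compare_routerid:
        if IPv4Address(newer.router_id) < IPv4Address(older.router_id):
            return newer
        return older
    return older  # nothing else decides, so the older path wins


def best_path(paths_newest_first, **options):
    """Walk the list top to bottom, carrying the winner forward."""
    best = paths_newest_first[0]
    for challenger in paths_newest_first[1:]:
        best = pick(best, challenger, **options)
    return best


R2 = Path("R2", 200, 200, "2.2.2.2")
R3 = Path("R3", 200, 300, "3.3.3.3")
R4 = Path("R4", 200, 400, "4.4.4.4")
R5 = Path("R5", 300, 500, "5.5.5.5")

if __name__ == "__main__":
    print(best_path([R2, R3, R4, R5]).name)
    print(best_path([R2, R4, R5, R3]).name)
    print(best_path([R2, R4, R5, R3], compare_routerid=True).name)
```

Feeding the model the path orders shown in the outputs above reproduces the same winners, which is a useful sanity check on the explanation even though the model ignores every earlier step of the real decision process.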

Ideally we want the MED compared between advertisements from the same AS irrespective of their age.  This is where the bgp deterministic-med router configuration command is useful.  When this command is enabled, the router will group all paths from the same AS and compare them together before comparing them to paths from different ASes.  Let’s enable the command on R1.  We should see that R2 is selected as the preferred path between R2, R3 and R4, but this will mean that once R2 is compared to R5, R5 will be installed since it is an older advertisement.

Rack1R1#show run | sec router bgp
router bgp 100
no synchronization
bgp router-id 1.1.1.1
bgp log-neighbor-changes
bgp deterministic-med
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 5
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Flag: 0x820
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external, best
Rack1R1#
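The grouping behavior of bgp deterministic-med can be sketched as a two-step model: first pick the lowest MED inside each neighboring AS, then let age decide between the per-AS winners. This is a simplification that assumes always-compare-med and compare-routerid are both off and that all earlier attributes tie; the names are hypothetical.

```python
# Simplified model of deterministic MED: group by neighbor AS, then compare winners.
# Assumes always-compare-med and compare-routerid are disabled; names are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    neighbor_as: int
    med: int


def best_path_deterministic(paths_newest_first):
    """Return the best path given a list ordered newest to oldest."""
    winners = {}  # neighbor AS -> (age index, path); a higher index means older
    for age, path in enumerate(paths_newest_first):
        current = winners.get(path.neighbor_as)
        # Lowest MED wins inside a group; on a tie the older path replaces the newer
        if current is None or path.med <= current[1].med:
            winners[path.neighbor_as] = (age, path)
    # MED is not compared across ASes here, so the oldest group winner is kept
    return max(winners.values(), key=lambda item: item[0])[1]


R2 = Path("R2", 200, 200)
R3 = Path("R3", 200, 300)
R4 = Path("R4", 200, 400)
R5 = Path("R5", 300, 500)

if __name__ == "__main__":
    print(best_path_deterministic([R2, R3, R4, R5]).name)
    print(best_path_deterministic([R5, R2, R3, R4]).name)
```

In this model R3 and R4 can never win, whatever order the paths arrive in, because R2 always beats them inside the AS 200 group. The only remaining question is whether R2 or R5 is older.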

If we want to have R2 selected as best, we can clear the BGP neighbor relationship with R5, which will in turn cause R5’s paths to be cleared out.  Once the neighbor relationship with R5 comes back up and R5 advertises the 24.1.1.0/24 path, it will be the newest advertisement and in turn be listed at the top.

Rack1R1#clear ip bgp 54.1.15.5
Rack1R1#
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Down User reset
Rack1R1#
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Up
Rack1R1#

Now as expected R2 was finally selected as the best path.

Rack1R1#show ip bgp 24.1.1.0
BGP routing table entry for 24.1.1.0/24, version 6
Paths: (4 available, best #2, table Default-IP-Routing-Table)
Flag: 0x820
Advertised to update-groups:
2
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external, best
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
Rack1R1#

Of course to always ensure R2 is selected in our network as the best path we could also use the bgp always-compare-med command to compare MED between different ASes but this command is normally not used in the real world unless MED policies are standardized between neighboring ASes.

Rack1R1#show run | sec router bgp
router bgp 100
no synchronization
bgp router-id 1.1.1.1
bgp always-compare-med
bgp deterministic-med
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#
Rack1R1#clear ip bgp *
%BGP-5-ADJCHANGE: neighbor 54.1.12.2 Down User reset
%BGP-5-ADJCHANGE: neighbor 54.1.13.3 Down User reset
%BGP-5-ADJCHANGE: neighbor 54.1.14.4 Down User reset
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Down User reset
Rack1R1#
%BGP-5-ADJCHANGE: neighbor 54.1.12.2 Up
%BGP-5-ADJCHANGE: neighbor 54.1.13.3 Up
%BGP-5-ADJCHANGE: neighbor 54.1.14.4 Up
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Up
Rack1R1#show ip bgp 24.1.1.0
BGP routing table entry for 24.1.1.0/24, version 4
Paths: (4 available, best #2, table Default-IP-Routing-Table)
Flag: 0x10860
Advertised to update-groups:
2
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external, best
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
Rack1R1#
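To see why always-compare-med removes the dependence on arrival order in this topology, the sketch below runs the simplified pairwise comparison over every possible ordering of the four paths. It models only MED and age, with the AS numbers and MED values from this scenario; the names are hypothetical.

```python
# Sketch: try every arrival order with and without always-compare-med.
# Models only MED and age; all earlier attributes are assumed equal.
from itertools import permutations

# name -> (neighbor AS, MED), taken from the scenario above
PATHS = {"R2": (200, 200), "R3": (200, 300), "R4": (200, 400), "R5": (300, 500)}


def best_path(order_newest_first, always_compare_med):
    """Pairwise walk from newest to oldest, carrying the winner forward."""
    best = order_newest_first[0]
    for challenger in order_newest_first[1:]:
        best_as, best_med = PATHS[best]
        chal_as, chal_med = PATHS[challenger]
        if (always_compare_med or best_as == chal_as) and best_med != chal_med:
            best = best if best_med < chal_med else challenger
        else:
            best = challenger  # MED not compared, so the older path wins
    return best


def winners(always_compare_med):
    """Set of routers that can end up best across all arrival orders."""
    return {best_path(order, always_compare_med) for order in permutations(PATHS)}


if __name__ == "__main__":
    print(sorted(winners(always_compare_med=True)))
    print(sorted(winners(always_compare_med=False)))
```

With the flag set, the lowest MED wins every pairwise comparison, so R2 is the result for every ordering. Without it, more than one router can come out on top depending on age.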

If BGP Deterministic MED is used, it should be enabled on all BGP speaking devices within an AS to ensure a consistent policy regarding the use of MEDs.

We should now have a better understanding of how MED is used in the BGP route selection process and of the BGP route selection process in general.

My next post will cover the Two Rate Three Color Marker (trTCM) as defined in RFC 2698 and implemented in the Cisco IOS. Also, I hope to see many of you in my new RS Bootcamps.

Sep
02

Abstract

This publication briefly covers the use of 3rd party next-hops in the OSPF, RIP, EIGRP and BGP routing protocols. Common concepts are introduced and protocol-specific implementations are discussed. A basic understanding of how these routing protocols function is required before reading this blog post.

Overview

The third-party next-hop concept applies only to distance-vector protocols, or to the parts of link-state protocols that exhibit distance-vector behavior. The idea is that a distance-vector update carries an explicit next-hop value, which is used by the receiving side, as opposed to the "implicit" next-hop calculated as the sending router's address - the source address in the IP header carrying the routing update. Such an "explicit" next-hop is called a "third-party" next-hop IP address, and it allows the update to point to a next-hop other than the advertising router. Intuitively, this is only possible if the advertising and receiving routers are on a shared segment, but the "shared segment" concept could be generalized and abstracted. Every popular distance-vector protocol supports third-party next-hops - RIPv2, EIGRP, OSPF and BGP all carry an explicit next-hop value. Look at the figure below - it illustrates the situation where two different distance-vector protocols are running on the shared segment, but neither of them runs on all routers attached to the segment. The protocols "overlap" at a "pivotal" router and redistribution is used to provide inter-protocol route exchange.

third-party-nh-generic

Per the default distance-vector protocol behavior, traffic from one routing domain going into another has to cross the "pivotal" router, the router where the two domains overlap (R3 in our case) - as opposed to going directly to the closest next-hop on the shared segment. The reason for this is that there is no direct "native" update exchange between the hops running different routing protocols. In situations like this, it is beneficial to rewrite the next-hop IP address to point toward the "optimum" exit point, using the "pivotal" router's knowledge of both routing protocols.
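The receiver's choice between the implicit and the explicit next-hop can be sketched as follows. This is a generic, simplified model rather than any one protocol's exact rule: the function name is hypothetical, the /24 segment length is an assumption, and falling back to the sender when the advertised address is off-segment is a simplification.

```python
# Sketch: choosing between the implicit (sender) and explicit (third-party) next-hop.
# Generic model; the off-segment fallback and the names are assumptions.
import ipaddress


def resolve_next_hop(update_source, advertised_next_hop, shared_segment):
    """Return the next-hop a receiver would install for a learned route."""
    segment = ipaddress.ip_network(shared_segment)
    source = ipaddress.ip_address(update_source)
    if advertised_next_hop is None:
        return str(source)  # no explicit value, so use the sender's address
    explicit = ipaddress.ip_address(advertised_next_hop)
    # 0.0.0.0 means "use the sender"; an off-segment address is not directly reachable
    if explicit.is_unspecified or explicit not in segment:
        return str(source)
    return str(explicit)


if __name__ == "__main__":
    # Pivotal router 140.1.123.3 advertises a route that really lives behind 140.1.123.2
    print(resolve_next_hop("140.1.123.3", None, "140.1.123.0/24"))
    print(resolve_next_hop("140.1.123.3", "140.1.123.2", "140.1.123.0/24"))
```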

OSPF is somewhat special with respect to the 3rd party next-hop implementation. It supports third-party next-hop in Type-5/7 LSAs (External Routing Information LSA and NSSA External LSA), where it is carried in the Forwarding Address (FA) field. These LSAs are processed in "distance-vector manner" by every receiving router. By default, the LSA is assumed to advertise the external prefix "connected" to the advertising router. However, if the FA is non-zero, the address in this field is used to calculate the forwarding information, as opposed to default forwarding toward the advertising router. Forwarding Address is always present in Type-7 LSAs, for the reason illustrated in the figure below:

third-party-nh-ospf-nssa-fa

Since there could be multiple ABRs in an NSSA area, only one is elected to perform 7-to-5 LSA translation - otherwise the routing information would loop back into the area, unless manual filtering is implemented on the ABRs (which is prone to errors). The translating ABR is elected based on the highest Router-ID, and may not be on the optimum path toward the advertising ASBR. Therefore, the forwarding address lets other routers select the more optimal path toward the ASBR, based on the inter-area routing information.

EIGRP

We start with the scenario where we redistribute RIP into EIGRP.

third-party-nh-rip2eigrp

Notice that EIGRP will not insert the third-party next-hop until you apply the command no ip next-hop-self eigrp on R3's connection to the shared segment. Look at the routing table output prior to applying the no ip next-hop-self eigrp command.

R1#show  ip route eigrp 
140.1.0.0/16 is variably subnetted, 2 subnets, 2 masks
D EX 140.1.2.2/32
[170/2560002816] via 140.1.123.3, 00:00:27, FastEthernet0/0

After the command has been applied to R3’s interface:

R1#show  ip route eigrp
140.1.0.0/16 is variably subnetted, 2 subnets, 2 masks
D EX 140.1.2.2/32
[170/2560002816] via 140.1.123.2, 00:00:04, FastEthernet0/0

The same behavior is observed when redistributing OSPF into EIGRP, but not when redistributing BGP. For some reason, BGP's next-hop is not copied into EIGRP, i.e. EIGRP will NOT insert BGP's next-hop into its updates. Notice that you may enable or disable the third-party next-hop behavior in EIGRP using the interface-level command ip next-hop-self eigrp.

RIP

RIP passes the third-party next-hop from OSPF, BGP or EIGRP. For instance, assume EIGRP is redistributed into RIP. You have to configure no ip split-horizon on R3's Ethernet connection to get this to work:

third-party-nh-eigrp2rip

R2#show ip route rip 
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
R 140.1.1.1/32 [120/1] via 140.1.123.1, 00:00:17, FastEthernet0/0

Notice the following RIP debugging output, which lists the third-party next-hop:

RIP: received v2 update from 140.1.123.3 on FastEthernet0/0
140.1.1.1/32 via 140.1.123.1 in 1 hops
140.1.123.0/24 via 0.0.0.0 in 1 hops

Surprisingly, there is NO need to enable the command no ip split-horizon on the interface when redistributing BGP or OSPF routes into RIP. It seems that only EIGRP to RIP redistribution requires that. Keep in mind, however, that split-horizon is OFF by default on physical frame-relay interfaces. Here is a sample output of redistributing BGP into RIP using the third-party next-hop:

R3#show ip route bgp 
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
B 140.1.2.2/32 [20/0] via 140.1.123.2, 00:22:13
R3#

R1#show ip route rip
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
R 140.1.2.2/32 [120/1] via 140.1.123.2, 00:00:09, FastEthernet0/0

RIP’s third-party next-hop behavior is fully automatic. You cannot disable or enable it, like you do in EIGRP.

OSPF

Similarly to RIP, OSPF has no problems picking up the third-party next-hop from BGP, EIGRP or RIP. Here is what it looks like (guess which protocol is redistributed into OSPF, based solely on the command output):

R1#sh ip route ospf 
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
O E2 140.1.2.2/32 [110/1] via 140.1.123.2, 00:34:59, FastEthernet0/0

R1#show ip ospf database external

OSPF Router with ID (140.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 131
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 140.1.2.2 (External Network Number )
Advertising Router: 140.1.123.3
LS Seq Number: 80000002
Checksum: 0xF749
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
Forward Address: 140.1.123.2
External Route Tag: 200

If you’re still guessing, the external protocol is BGP, as can be seen from the automatic External Route Tag – OSPF sets it to the last AS# found in the AS_PATH.

third-party-nh-bgp2ospf

There are special conditions to be met by OSPF for the FA address to be used. First, the interface where the third-party next-hop resides should be advertised into OSPF using the network command. Secondly, this interface should not be passive in OSPF and should not have network type point-to-point or point-to-multipoint. Violating any of these conditions will stop OSPF from setting the FA in the type-5 LSAs created for external routes.
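
The conditions above can be written down as a small predicate. This is a minimal sketch under the assumption that only the three conditions listed in the text matter; the OspfInterface class and the forwarding_address function are hypothetical names used for illustration.

```python
from dataclasses import dataclass

@dataclass
class OspfInterface:
    in_ospf: bool        # covered by an OSPF network statement
    passive: bool        # configured as passive in OSPF
    network_type: str    # e.g. "broadcast", "point-to-point"

def forwarding_address(next_hop, iface):
    """Return the FA the redistributing router would place in the external LSA."""
    blocked_types = {"point-to-point", "point-to-multipoint"}
    if iface.in_ospf and not iface.passive and iface.network_type not in blocked_types:
        return next_hop
    # Any violated condition leaves the FA at zero (forward via the advertising router).
    return "0.0.0.0"

lan = OspfInterface(in_ospf=True, passive=False, network_type="broadcast")
print(forwarding_address("140.1.123.2", lan))
```

Changing any single field of lan to a disallowed value makes the function return 0.0.0.0 instead.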

OSPF is special in one other respect. Distance-vector protocols such as RIP or EIGRP modify the next-hop as soon as they pass the routing information to other devices. That is, the third-party next-hop is not maintained through the RIP or EIGRP domain. In contrast, OSPF LSAs are flooded within their scope with the FA unmodified. This creates an interesting problem: if the FA address is not reachable in the receiving router’s routing table, the external information found in the type 7/5 LSA is not used. This situation is discussed in the blog post “OSPF Filtering using FA Address”.
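
The receiver-side rule can be sketched as follows. This is a simplification: it treats any covering prefix in the routing table as making the FA reachable, whereas a real OSPF implementation is stricter about which route types qualify. The function name external_route_target is an assumption made for the example.

```python
import ipaddress

def external_route_target(forwarding_address, advertising_router, routing_table):
    """Return the address traffic is forwarded toward, or None if the LSA is ignored."""
    # FA of zero: forward toward the router that advertised the external LSA.
    if forwarding_address == "0.0.0.0":
        return advertising_router
    fa = ipaddress.ip_address(forwarding_address)
    # Non-zero FA: it must be covered by some prefix in the routing table.
    if any(fa in ipaddress.ip_network(prefix) for prefix in routing_table):
        return forwarding_address
    return None

print(external_route_target("140.1.123.2", "140.1.123.3", ["140.1.123.0/24"]))
print(external_route_target("140.1.123.2", "140.1.123.3", ["10.0.0.0/8"]))
```

The second call returns None, which models the external prefix being left out of the routing table when the FA is filtered or otherwise unreachable.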

BGP

When you redistribute any protocol into BGP, the system correctly sets the third-party next-hop in the local BGP table. Look at the diagram below, where EIGRP prefixes are being redistributed into BGP AS 300:

third-party-nh-eigrp2bgp

R3’s BGP process installs R1 Loopback0 prefix into the BGP table with the next-hop value of R1’s address, not “0.0.0.0” like it would be for locally advertised routes. You will observe the same behavior if you inject EIGRP prefixes into BGP using the network command.

R3#sh ip bgp
BGP table version is 9, local router ID is 140.1.123.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 140.1.1.1/32 140.1.123.1 156160 32768 ?

Furthermore, BGP is supposed to change the next-hop to self when advertising prefixes over eBGP peering sessions. However, when all peers share the same segment, the prefixes re-advertised over the shared segment do not have their next-hop changed. See the diagram below:

third-pary-nh-bgp2bgp

Here R1 advertises prefix 140.1.1.1/32 to R3 and R3 re-advertises it to R2 over the same segment. Unless non-physical interfaces are used to form the BGP sessions (e.g. Loopbacks), the next-hop received from R1 is not changed when passing it down to R2. This implements the default third-party next-hop preservation over eBGP sessions. Look at the sample output for the configuration illustrated above: R1 receives R2’s prefix with an unmodified next-hop.

R1#show ip bgp 
BGP table version is 3, local router ID is 140.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 140.1.1.1/32 0.0.0.0 0 32768 i
*> 140.1.2.2/32 140.1.123.2 0 300 200 i
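
The default eBGP behavior shown in this output can be sketched as a simple rule. This is an illustration only, assuming a /24 shared segment as in the examples; the function name ebgp_advertised_next_hop is hypothetical.

```python
import ipaddress

def ebgp_advertised_next_hop(received_next_hop, local_address, peer_address, prefix_len=24):
    """Next hop a router would advertise to an eBGP peer by default."""
    segment = ipaddress.ip_interface(f"{local_address}/{prefix_len}").network
    nh = ipaddress.ip_address(received_next_hop)
    peer = ipaddress.ip_address(peer_address)
    # Keep the third-party next hop when it sits on the segment shared with the peer.
    if nh in segment and peer in segment:
        return received_next_hop
    # Otherwise the usual eBGP rule applies: next-hop-self.
    return local_address

# R3 (140.1.123.3) passes R2's prefix to R1 (140.1.123.1).
print(ebgp_advertised_next_hop("140.1.123.2", "140.1.123.3", "140.1.123.1"))
```

With a received next-hop outside the shared segment, the same function returns R3's own address.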

There is a way to disable this default behavior in BGP. A logical assumption would be that using the command neighbor X.X.X.X next-hop-self would work, and it does indeed, in recent IOS versions. Older IOS versions, such as 12.2T, did not have this command working for eBGP sessions, and your option would have been using a route-map with the set ip next-hop command. The route-map method may still be handy if you want to insert a totally “bogus” IP next-hop from the shared segment – the receiving BGP speaker will accept any IP address that is on the same segment. That is not something you would do in a production environment too often, but it is definitely an interesting idea for lab practice. One good use in production is changing the BGP next-hop to an HSRP virtual IP address, to provide physical BGP speaker redundancy. Here is sample code for setting an explicit next-hop in BGP updates:

router bgp 300
neighbor 140.1.123.1 remote-as 100
neighbor 140.1.123.1 route-map BGP_NEXT_HOP out
!
route-map BGP_NEXT_HOP permit 10
set ip next-hop 140.1.123.100

Summary

All popular distance-vector protocols support third-party next-hop insertion. This mechanism is useful on multi-access segments, in situations when you want to pass optimum path information between routers belonging to different routing protocols. We illustrated that RIP implements this function automatically, and does not allow any tuning. On the other hand, EIGRP supports third-party next-hop passing from any protocol other than BGP, and you may turn this function on/off on a per-interface basis. Furthermore, OSPF’s special feature is propagation of the third-party next-hop within an area/autonomous system, unlike the distance-vector protocols that reset the next-hop at every hop (considering an AS as being a “single-hop” for BGP). Thanks to that feature, OSPF offers an interesting possibility to filter external routing information by blocking the FA prefix from the routing tables. Finally, BGP gives the most flexibility when it comes to IP next-hop manipulation, allowing for changing it to any value.

Further Reading

Common Routing Problem with OSPF Forwarding Address
OSPF Prefix Filtering Using Forwarding Address
BGP Redundancy using HSRP

Aug
21

Our BGP class is coming up! This class is for learners who are pursuing the CCIP track, or simply want to really master BGP. I have been working through the slides, examples and demos that we'll use in class, and it is going to be excellent. :) If you can't make the live event, we are recording it, so it will be available as a class on demand after the live event. More information can be found by clicking here.

One of the common questions that comes up is "Why does the router choose THAT route?"

We all know (or at least after reading the list below, we will know) that BGP uses the following order to determine the "best" path.

bgp bestpath

So now for the question.   Take a look at the partial output of the show command below:

bgp bestpath

Regarding the 2.2.2.0/24 network, why did this router select the 192.168.68.8 next hop route, over the one just below it?

Post your ideas, and we will have a drawing next week, before the BGP class begins. We'll give 1 lucky winner some rack tokens for our preferred rack vendor, Graded Labs. Everyone who comments will be entered into the drawing. I will update the post with the lucky winner.

Thanks for your ideas, and happy learning.

Thank you to all who responded.  eBGP is preferred over iBGP, and that is what it came down to.
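
For readers who like to see the decision as code, here is a rough sketch of the first steps of the best-path comparison, up to the eBGP-over-iBGP step that decided this case. The ordering used is the commonly documented Cisco one and is an assumption here, since the list above is an image. MED is compared unconditionally for simplicity, the later tie-breakers are left out, and the second path's next hop (10.0.0.9) is made up for the example.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"i": 0, "e": 1, "?": 2}  # IGP preferred over EGP over incomplete

@dataclass
class Path:
    next_hop: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path_len: int = 0
    origin: str = "i"
    med: int = 0
    is_ebgp: bool = False

def preference_key(p):
    """Smaller tuple means more preferred path."""
    return (
        -p.weight,                 # highest weight
        -p.local_pref,             # highest local preference
        not p.locally_originated,  # locally originated first
        p.as_path_len,             # shortest AS_PATH
        ORIGIN_RANK[p.origin],     # lowest origin code
        p.med,                     # lowest MED (simplified: always compared)
        not p.is_ebgp,             # eBGP over iBGP
    )

def best_path(paths):
    return min(paths, key=preference_key)

ebgp = Path(next_hop="192.168.68.8", as_path_len=1, is_ebgp=True)
ibgp = Path(next_hop="10.0.0.9", as_path_len=1, is_ebgp=False)
print(best_path([ibgp, ebgp]).next_hop)
```

When every earlier attribute ties, the eBGP-learned path wins, which matches the answer given above.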

The winner of the graded labs tokens is Jon!  Congratulations.

Aug
16

Last week we wrapped up the MPLS bootcamp, and it was a blast!   A big shout out to all the students who attended,  as well as to many of the INE staff who stopped by (you know who you are :)).    Thank you all.

Here is the topology we used for the class, as we built the network, step by step.

MPLS-class blog

The class was organized and delivered in 30 specific lessons. Here is the "overview" slide from class:

MPLS Journey Statement

One of the important items we discussed was troubleshooting.   When we understand all the components of Layer3 VPNs, the troubleshooting is easy.   Here are the steps:

  • Can PE see CE’s routes?
  • Are VPN routes going into MP-BGP?  (The Export)
  • Are remote PEs seeing the VPN routes?
  • Are remote PEs inserting the VPN routes into the correct local VRF? (The Import)
  • Are remote PEs, advertising these to remote CEs?
  • Are the remote CEs seeing the routes?

We had lots of fun, and included Wireshark protocol analysis, so we could see and verify what we were learning. Here is one example of a BGP update from a downstream iBGP neighbor, which includes the VPN label:

VPN Label

If you missed the class, but still want to benefit from it, we have recorded all 30 sessions, and it is available as an on-demand version of the class.

Next week, the BGP bootcamp is running, so if you need to brush up on BGP, we will be covering the following topics, also in 30 easy to digest lessons:

  • Monitoring and Troubleshooting BGP
  • Multi-Homed BGP Networks
  • AS-Path Filters
  • Prefix-List Filters
  • Outbound Route Filtering
  • Route-Maps as BGP Filters
  • BGP Path Attributes
  • BGP Local Preference
  • BGP Multi-Exit-Discriminator (MED)
  • BGP Communities
  • BGP Customer Multi-Homed to a Single Service Provider
  • BGP Customer Multi-Homed to Multiple Service Providers
  • Transit Autonomous System Functions
  • Packet Forwarding in Transit Autonomous Systems
  • Monitoring and Troubleshooting IBGP in Transit AS
  • Network Design with Route Reflectors
  • Limiting the Number of Prefixes Received from a BGP Neighbor
  • AS-Path Prepending
  • BGP Peer Group
  • BGP Route Flap Dampening
  • Troubleshooting Routing Issues
  • Scaling BGP

I look forward to seeing you in class!

Best wishes in all of your learning.

May
03

The purpose of event dampening is to reduce the effect of oscillations on routing systems. In general, periodic processes that affect the routing system as a whole should have a period no shorter than the system convergence time (relaxation time). Otherwise, the system will never stabilize and will be constantly updating its state. In reality, complex systems have multiple periodic processes running at the same time, which results in harmonic process interference and a complex process spectrum. Considering such behavior is outside the scope of this paper. What we want to do is find optimal settings to filter high-frequency events from the routing system. In our particular case, events are interface flaps, occurring periodically. We want to make sure that oscillations with period T or less are not reported to the routing system. Here T is found empirically, based on observed/estimated convergence time as suggested above.

Event dampening uses an exponential back-off algorithm to suppress event reporting to the upper level protocols. Effectively, every time an interface flaps (goes down, to be accurate) a penalty value of P is added to the interface penalty counter. If at some point the accumulated penalty exceeds the "suppress" value of S, the interface is placed in the suppress state and further link events are not reported to the upper protocol modules. At all times, the interface penalty counter follows an exponential decay process based on the formula P(t)=P(0)*2^(-t/H), where H is the half-life time setting for the process. As soon as the accumulated penalty decays to the lower boundary of R - the reuse value - the interface is unsuppressed, and further changes are again reported to the upper level protocols.
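
The penalty bookkeeping described above can be simulated in a few lines. This is a simplified model, not the IOS implementation: the state is only evaluated at flap times, the suppress check uses "greater than or equal", and the function name simulate and its parameters are assumptions for the example.

```python
def simulate(flap_times, half_life, reuse, suppress, penalty=1000.0):
    """Return (time, accumulated penalty, suppressed?) at each flap."""
    level, last, suppressed, out = 0.0, None, False, []
    for t in flap_times:
        if last is not None:
            # Exponential decay since the previous flap: P(t) = P(0) * 2^(-t/H).
            level *= 2 ** (-(t - last) / half_life)
            if suppressed and level <= reuse:
                suppressed = False
        level += penalty  # every flap adds the fixed penalty
        if level >= suppress:
            suppressed = True
        out.append((t, round(level, 1), suppressed))
        last = t
    return out

# Two flaps 10 seconds apart with a 10-second half-life.
print(simulate([0, 10], half_life=10, reuse=375, suppress=1500))
# Two flaps 20 seconds apart: the penalty decays further before the second flap.
print(simulate([0, 20], half_life=10, reuse=375, suppress=1500))
```

In the first run the second flap brings the penalty to 1500 and the interface is suppressed; in the second run the penalty only reaches 1250.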

optimized-dampening

What we want to find out is the lowest value of H that suppresses all harmonic oscillation processes with the period of T or lower, but does not suppress longer-period processes. E.g. we want to block all oscillations happening every 5 seconds or more often, but report interface flaps happening every 6 seconds or less often. Look at the figure above and consider that there have been two flap events, separated by time period T. At the moment of the second flap, the suppress condition is: P*2^(-T/H)+P >= S. Here the left part is the penalty accumulated at the moment of the second flap, assuming the penalty was zero before the first flap. Solving this inequality for H (with P < S < 2P, so that the logarithm is positive), we find that H >= T/log2(P/(S-P)). If we could make P/(S-P)=2, then the formula would be greatly simplified. Per Cisco's implementation, P (penalty) is fixed at 1000, and by setting S=1500 we get 1000/(1500-1000)=2. Therefore, if we select S=1500, P=1000 then our condition becomes H >= T. Since we are looking for the minimal value of H we can set H=T. Seeded with these values, the event dampening filter will reject all oscillating processes with the period of T or shorter. However, there is one more parameter left to find: R - the reuse value.
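
The half-life bound H >= T/log2(P/(S-P)) is easy to check numerically. The helper below is a minimal sketch; the name min_half_life is an assumption, and the guard clauses simply reject values for which the two-flap reasoning does not apply.

```python
import math

def min_half_life(period, penalty=1000, suppress=1500):
    """Smallest half-life H for which two flaps 'period' apart reach the suppress value."""
    if suppress <= penalty:
        raise ValueError("a single flap already reaches the suppress value")
    ratio = penalty / (suppress - penalty)
    if ratio <= 1:
        raise ValueError("two flaps can never reach the suppress value")
    return period / math.log2(ratio)

print(min_half_life(10))                  # S=1500, P=1000 gives H = T
print(min_half_life(10, suppress=1250))   # lower S gives a smaller H
```

With S=1500 and P=1000 the ratio is exactly 2, so the result equals the period, as derived in the text.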

We may apply the following logic here. Observing no further events since the last flap for the duration of 2xT, we may assume that the periodic process has stopped. Therefore, we may unblock the interface after 2xT seconds. The reuse value can be found by taking the penalty accumulated after the second flap, and further decaying it for 2xT more seconds: (P*2^(-T/H)+P)*2^(-2T/H) <= R. Since we set H=T, we quickly find out that R >= 3/8*P = 375. At this point we have all the parameters we need in order to apply optimal event dampening settings based on the cut-off period for oscillating processes. Here is a sample configuration for T=10 seconds. Notice the last parameter, known as the maximum suppress time - the maximum time that the interface can be kept in the suppress state. Since our goal is to hold the interface suppressed for at least 2xT seconds, the maximum suppress time is twice the half-life value.

R3:
interface FastEthernet 0/0
dampening 10 375 1500 20
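
The arithmetic behind these numbers can be reproduced with a short script. It is a sketch that hard-codes the choices made in the text (P=1000, S=1500, H=T, maximum suppress time of 2xT); the function names reuse_value and dampening_command are assumptions for the example.

```python
def reuse_value(period, half_life, penalty=1000):
    """Penalty left 2*period after the second of two flaps 'period' apart."""
    after_second_flap = penalty * 2 ** (-period / half_life) + penalty
    return after_second_flap * 2 ** (-2 * period / half_life)

def dampening_command(period):
    """Build the interface command for cut-off period T, using P=1000 and S=1500."""
    half_life = period                    # H = T follows from P/(S-P) = 2
    reuse = round(reuse_value(period, half_life))
    max_suppress = 2 * half_life          # hold suppressed for at least 2xT
    return f"dampening {half_life} {reuse} 1500 {max_suppress}"

print(reuse_value(10, 10))
print(dampening_command(10))
```

For T=10 this reproduces the reuse value of 375 and the command line shown above.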

Lastly, a few words on figuring out the convergence time for your network. To begin with, we only consider IGP protocols in this discussion. Dampening in BGP is more complicated, due to the scale of the routing system involved. The general consensus nowadays is that using dampening in BGP may result in more harm than good, due to cascading withdraw messages. Next, for the IGPs, you are generally concerned with a single fault domain, which in a properly designed network is bounded by one IGP area (or EIGRP query scope zone). Convergence time for a single area depends on the following factors:

  • Area size - impacts routing database sizes, affects LSA/Query propagation time and SPF runtime.
  • Weakest (in terms of CPU/Memory) router in the area - this is the router to complete SPF computations the last.
  • RIB/FIB sizes: a significant amount of time is spent on updating RIB/FIB tables after IGP re-convergence. Again, this depends on the area size.

To summarize, the main factor is the area size and the number of links in the area (which normally follows the power law based on the number of nodes). However, knowing this fact does not give us a formula for the convergence time. In most cases, you should rely on empirical evidence to obtain this. Starting with one to two seconds could be reasonable, but you should scale this value by a factor of two or three to account for multiple oscillations that may run in the network concurrently. Still, once again, there is no magical formula for this - this is what network engineers and designers are for!

Jan
30

Introduction

In this series of posts, we are going to review some interesting topics illustrating unexpected behavior of the BGP routing protocol. It may seem that BGP is a robust and stable protocol; however, the way it was designed inherently presents some anomalies in optimal route selection. The main reason for this is the fact that BGP is a path-vector protocol, much like a distance-vector protocol, with optimal route selection based on policies rather than simple additive metrics.

The fact that BGP is mainly used for Inter-AS routing results in different routing policies used inside every AS. When those different policies come to interact, the resulting behavior might not be the same as expected by individual policy developers. For example, prepending the AS_PATH attribute may not result in proper global path manipulation if an upstream AS performs additional prepending.

In addition to that, BGP was designed for inter-AS loop detection based on the AS_PATH attribute and therefore cannot detect intra-AS routing loops. Optimally, intra-AS routing loops could be prevented by ensuring a full mesh of BGP peering between all routers in the AS. However, implementing full-mesh is not possible for a large number of BGP routers. Known solutions to this problem - Route Reflectors and BGP Confederations - prevent all BGP speakers from having full information on all potential AS exit points due to the best-path selection process. This unavoidable loss of additional information may result in suboptimal routing or routing loops, as illustrated below.

BGP RRs and Intra-AS Routing Loops

As mentioned above, a full mesh of BGP peering sessions eliminates intra-AS routing loops. However, using Route Reflectors (RRs), a common solution to the full-mesh problem, will not result in the same behavior, as RRs only propagate best-paths to the clients, thus hiding the complete routing information from edge routers. This may result in inconsistent best-path selection by clients and end up in routing loops. A known design rule used to avoid this is to place Route Reflectors along the packet forwarding paths between the RR clients in different clusters. This also translates into the design principle where iBGP peering sessions closely follow the physical (geographical) topology.

Here is an example of what could happen in the situation where this rule is not observed. Look at the topology below, where R5 peers with the RR that is not the one closest to it in terms of IGP metrics. At the same time, R1 and R2 peer with another RR, and R5 is on the forwarding path between R1, R2 and R4. The problem here is that R5 receives external BGP prefixes from a different RR than R1 and R2 use. Thus, the exit point that R1 and R2 consider optimal may not be optimal for R5. Here is what happens:

bgp-anomalies-part1-1

BB3 advertises AS54 prefixes to R4 and BB1 advertises the same set of prefixes to R6. R4 and R6 exchange this information and every route-reflector prefers the directly connected exit point and advertises best path to its route-reflector clients. R4 sends the best paths to R1 and R2 and those clients install best-paths with the next hop of R4:

Rack1R2#show ip bgp  
BGP table version is 22, local router ID is 150.1.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i28.119.16.0/24 150.1.4.4 0 100 0 54 i
*>i28.119.17.0/24 150.1.4.4 0 100 0 54 i
*>i112.0.0.0 150.1.4.4 0 100 0 54 50 60 i
*>i113.0.0.0 150.1.4.4 0 100 0 54 50 60 i
*>i114.0.0.0 150.1.4.4 0 100 0 54 i
*>i115.0.0.0 150.1.4.4 0 100 0 54 i
*>i116.0.0.0 150.1.4.4 0 100 0 54 i
*>i117.0.0.0 150.1.4.4 0 100 0 54 i
*>i118.0.0.0 150.1.4.4 0 100 0 54 i
*>i119.0.0.0 150.1.4.4 0 100 0 54 i

And R5 receives the best paths from R6, which prefers the exit point via BB1. Thus, the best-paths in R5 would point toward R6:

Rack1R5#show ip bgp 
BGP table version is 22, local router ID is 150.1.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i28.119.16.0/24 150.1.6.6 0 100 0 54 i
*>i28.119.17.0/24 150.1.6.6 0 100 0 54 i
*>i112.0.0.0 150.1.6.6 0 100 0 54 50 60 i
*>i113.0.0.0 150.1.6.6 0 100 0 54 50 60 i
*>i114.0.0.0 150.1.6.6 0 100 0 54 i
*>i115.0.0.0 150.1.6.6 0 100 0 54 i
*>i116.0.0.0 150.1.6.6 0 100 0 54 i
*>i117.0.0.0 150.1.6.6 0 100 0 54 i
*>i118.0.0.0 150.1.6.6 0 100 0 54 i
*>i119.0.0.0 150.1.6.6 0 100 0 54 i
*>i139.1.0.0 150.1.6.6 0 100 0 i

And since R5 has to traverse R1 or R2 to reach R6 and R1 and R2 have to traverse R5 to get to R4, we have a routing loop:

Rack1SW3#traceroute 28.119.16.1

Type escape sequence to abort.
Tracing the route to 28.119.16.1

1 139.1.11.1 1004 msec 0 msec 4 msec
2 150.1.2.2 36 msec 32 msec 36 msec
3 139.1.25.5 64 msec 60 msec 64 msec
4 139.1.25.2 56 msec 56 msec 52 msec
5 139.1.25.5 84 msec 80 msec 84 msec
6 139.1.25.2 76 msec 76 msec 72 msec
7 139.1.25.5 104 msec 104 msec 100 msec
8 139.1.25.2 96 msec 96 msec 96 msec
9 139.1.25.5 136 msec 120 msec 124 msec
10 139.1.25.2 116 msec
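
The loop seen in the traceroute can be reproduced with a toy hop-by-hop model. This is a simplified sketch: each router is reduced to a single IGP next hop toward its chosen BGP exit point, and the names trace, next_hop and exit_points are assumptions for the example.

```python
def trace(start, next_hop, exit_points, max_hops=10):
    """Follow per-router next hops; return (path, loop_detected)."""
    path, node = [start], start
    while node not in exit_points and len(path) <= max_hops:
        node = next_hop[node]
        path.append(node)
        if path.count(node) > 1:   # visiting a router twice means a loop
            return path, True
    return path, False

exits = {"R4", "R6"}
# R2 prefers the exit at R4 and reaches it via R5;
# R5 prefers the exit at R6 and reaches it via R2.
print(trace("R2", {"R2": "R5", "R5": "R2"}, exits))
# If R5 also preferred R4, the path would terminate at the exit point.
print(trace("R2", {"R2": "R5", "R5": "R4"}, exits))
```

The first call bounces between R2 and R5, the same pattern as the alternating 139.1.25.5 and 139.1.25.2 hops in the traceroute.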

The best way to avoid these routing loops is to make iBGP sessions closely follow the physical topology, illustrated on the diagram below:

bgp-anomalies-part1-2

Another solution would be to adjust the topology to follow the iBGP peering sessions. For example, we could configure a GRE tunnel between R5 and R6 and exchange BGP routes over it. This will result in suboptimal routing but will prevent routing loops. Of course, this is not the recommended solution. However, the use of tunneling to resolve this issue prompts another idea: using MPLS forwarding and a BGP free core.

We are not going to illustrate this well-known concept here, but simply point to the fact that PE routers label-encapsulate IP packets routed towards BGP prefixes using MPLS labels for BGP next-hops. The actual packet forwarding is based on shortest IGP paths (or MPLS TE paths) and there are no intermediate routers that may steer packets according to BGP routing tables. Effectively, you may place a route reflector anywhere in the topology and peer your PE routers however you prefer – the optimum routing inside the AS is not based on BGP anymore. However, just from the logical perspective, it still makes sense to group RR clusters based on geographical proximity.

To be continued

In the next blog post from this series we will review situations when BGP gets stuck with permanently oscillating routes, resulting in continuous prefix advertisements and withdrawals. We will see how dangerous the BGP MED attribute can be and explain the rationale behind the Cisco IOS commands bgp always-compare-med and bgp deterministic-med.

Jan
17

Abstract:

The Inter-AS Multicast VPN solution introduces some challenges in cases where the peering systems implement a BGP-free core. This post illustrates a known solution to this problem, implemented in Cisco IOS software. The solution involves the use of special MP-BGP and PIM extensions. The reader is assumed to have a basic understanding of Cisco's mVPN implementation, the PIM protocol and Multi-Protocol BGP extensions.

Abbreviations used

mVPN – Multicast VPN
MSDP – Multicast Source Discovery Protocol
PE – Provider Edge
CE – Customer Edge
RPF – Reverse Path Forwarding
MP-BGP – Multi-Protocol BGP
PIM – Protocol Independent Multicast
PIM SM – PIM Sparse Mode
PIM SSM – PIM Source Specific Multicast
LDP – Label Distribution Protocol
MDT – Multicast Distribution Tree
P-PIM – Provider Facing PIM Instance
C-PIM – Customer Facing PIM Instance
NLRI – Network Layer Reachability Information

Inter-AS mVPN Overview

A typical "classic" Inter-AS mVPN solution leverages the following key components:

  • PIM-SM maintaining separate RP for every AS
  • MSDP used to exchange information on active multicast sources
  • (Optionally) MP-BGP multicast extension to propagate information about multicast prefixes for RPF validations

With this solution, different PEs participating in the same MDT discover each other by joining the shared tree towards the local RP and listening to the multicast packets sent by other PEs. Those PEs belong to the same or different Autonomous Systems. In the latter case, the sources are discovered by virtue of MSDP peering. This scenario assumes that every router in the local AS has complete routing information about multicast sources (PE’s loopback addresses) residing in the other system. Such information is necessary for the purpose of the RPF check. In turn, this leads to the requirement of running BGP on all P routers OR redistributing MP-BGP (BGP) multicast prefix information into IGP. The redistribution approach clearly has limited scalability, while the other method requires enabling BGP on the P routers, which defeats the idea of a BGP-free core.

A solution alternative to running PIM in Sparse Mode would be using PIM SSM, which relies on out-of-band information for multicast sources discovery. For such case, Cisco released a draft proposal listing new MP-BGP MDT SAFI that is used to propagate MDT information and associated PE addresses. Let’s make a short detour to MP-BGP to get a better understanding of SAFIs.

MP-BGP Overview

Recall the classic BGP UPDATE message format. It consists of the following sections: [Withdrawn prefixes (Optional)] + [Path Attributes] + [NLRIs]. The Withdrawn Prefixes and NLRIs are IPv4 prefixes, and their structure does not support any other network protocols. The Path Attributes (e.g. AS_PATH, ORIGIN, LOCAL_PREF, NEXT_HOP) are associated with all NLRIs in the message; prefixes with a different set of path attributes should be carried in a separate UPDATE message. Also, notice that NEXT_HOP is an IPv4 address as well.

In order to introduce support for non-IPv4 network protocols into BGP, two new optional non-transitive Path Attributes have been added to BGP. The first attribute, known as MP_REACH_NLRI, has the following structure: [AFI/SAFI] + [NEXT_HOP] + [NLRI]. Both NEXT_HOP and NLRI are formatted according to the protocol encoded via AFI/SAFI, which stand for Address Family Identifier and Subsequent Address Family Identifier respectively. For example, this could be an IPv6 or CLNS prefix. Thus, all information about non-IPv4 prefixes is encoded in a new BGP Path Attribute. A typical BGP UPDATE message that contains the MP_REACH_NLRI attribute would have no “classic” NEXT_HOP attribute and no “Withdrawn Prefixes” or “NLRIs” found in normal UPDATE messages. For the next-hop calculations, a receiving BGP speaker should use the information found in the MP_REACH_NLRI attribute. The multi-protocol UPDATE message may still contain other BGP path attributes such as AS_PATH, ORIGIN, MED, LOCAL_PREF and so on, but this time those attributes are associated with the non-IPv4 prefixes found in the attached MP_REACH_NLRI attribute.
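
To make the [AFI/SAFI] + [NEXT_HOP] + [NLRI] layout more concrete, here is a minimal sketch that encodes an MP_REACH_NLRI attribute for a single IPv6 unicast prefix, following the field order of RFC 4760. It is an illustration rather than a complete encoder: it handles only the short (one-octet) attribute length, and the function names and the 2001:db8:: documentation addresses are assumptions for the example.

```python
import ipaddress
import struct

AFI_IPV6, SAFI_UNICAST = 2, 1
MP_REACH_NLRI = 14       # path attribute type code
FLAG_OPTIONAL = 0x80     # optional bit set, transitive bit clear

def encode_prefix(prefix):
    """NLRI entry: prefix length in bits, then only the significant octets."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]

def mp_reach_nlri(afi, safi, next_hop, prefixes):
    nh = ipaddress.ip_address(next_hop).packed
    # AFI (2 octets), SAFI (1), next-hop length (1), next hop, reserved octet.
    value = struct.pack("!HBB", afi, safi, len(nh)) + nh + b"\x00"
    value += b"".join(encode_prefix(p) for p in prefixes)
    if len(value) > 255:
        raise ValueError("extended-length encoding is not handled in this sketch")
    return bytes([FLAG_OPTIONAL, MP_REACH_NLRI, len(value)]) + value

attr = mp_reach_nlri(AFI_IPV6, SAFI_UNICAST, "2001:db8::1", ["2001:db8:1::/48"])
print(attr.hex())
```

Note how the next hop travels inside the attribute itself, which is why such an UPDATE needs no classic NEXT_HOP attribute.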

The second attribute, MP_UNREACH_NLRI, has a format similar to MP_REACH_NLRI but lists the “multi-protocol” addresses to be removed. No other path attributes need to be associated with this attribute; an UPDATE message may simply contain the list of MP_UNREACH_NLRIs.

The list of supported AFIs may be found in RFC1700 (though it’s obsolete now, it is still very informative) – for example, AFI 1 stands for IPv4, AFI 2 stands for IPv6 etc. The subsequent AFI is needed to clarify the purpose of the information found in MP_REACH_NLRI. For example, SAFI value of 1 means the prefixes should be used for unicast forwarding, SAFI 2 means the prefixes are to be used for multicast RPF checks and SAFI 3 means the prefixes could be used for both purposes. Last, but not least – SAFI of 128 means MPLS labeled VPN address.

Just as a reminder, the BGP process performs a separate best-path election for “classic” IPv4 prefixes and for the prefixes of every AFI/SAFI pair, based on the path attributes. This allows for independent route propagation for the addresses found in different address families. Since a given BGP speaker may not support particular network protocols, the list of supported AFI/SAFI pairs is advertised using the BGP capabilities feature (another BGP extension), and the particular network protocol information is only propagated if both speakers support it.
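
The effect of the capability exchange can be pictured as a set intersection. This is a conceptual sketch only, not the wire format of the capabilities advertisement; the function name negotiated_families is an assumption, and each family is written as an (AFI, SAFI) pair.

```python
def negotiated_families(local, remote):
    """Address families usable on a session: those advertised by both speakers."""
    return sorted(set(local) & set(remote))

local = [(1, 1), (1, 128), (2, 1)]   # IPv4 unicast, VPNv4, IPv6 unicast
remote = [(1, 1), (1, 128)]          # the peer does not support IPv6 unicast
print(negotiated_families(local, remote))
```

Only IPv4 unicast and VPNv4 information would be exchanged over this session.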

MDT SAFI Overview

Cisco drafted a new SAFI to be used with regular AFIs such as IPv4 or IPv6. This SAFI is needed to propagate MDT group address and the associated PE’s Loopback address. The format for AFI 1 (IPv4) is as follows:

MP_NLRI = [RD:PE’s IPv4 Address]:[MDT Group Address],
MP_NEXT_HOP = [BGP Peer IPv4 Address].

Here RD is the RD corresponding to the VRF that has the MDT configured and “IPv4 address” is the respective PE router’s Loopback address. Normally, per Cisco rules this is the same Loopback interface used for VPNv4 peering, but this could be changed using the VRF-level command bgp next-hop.

If all PEs in the local AS exchange this information and pass it to PIM SSM, the P-PIM (Provider PIM, facing the SP core) process will be able to build (S,G) trees for the MDT group address towards the other PEs’ IPv4 addresses. This works because the PEs’ IPv4 addresses are known via IGP, as all PEs are in the same AS. There are no problems using the BGP-free core for intra-AS mVPN with PIM-SSM. Also, it’s worth mentioning that a precursor to the MDT SAFI was a special extended community used along with the VPNv4 address family. MP-BGP would use RD type 2 (not applicable to any unicast VRF) to transport the associated PE’s IPv4 address along with an extended community that contains the MDT group address. This solution only allowed for the “bootstrap” PIM SSM discovery information to be distributed inside a single AS, since the extended community was non-transitive. This temporary solution was replaced by the MDT SAFI draft.

Next, consider the case of Inter-AS VPN where at least one AS uses a BGP-free core. When two peering Autonomous Systems activate the IPv4 MDT SAFI, the ASBRs will advertise all information learned from their PEs to each other. The information will further propagate down to each AS’s PEs. Next, the P-PIM processes will attempt to build (S, G) trees towards the PE IP addresses in the neighboring system. Even though the PEs may know the other PEs’ addresses (e.g. if Inter-AS VPN Option C is being used), the P-routers don’t have this information. If Inter-AS VPN Option B is in use, even the PE routers will have no proper information to build the (S, G) trees.

RPF Proxy Vector

The solution to this problem uses a modification to PIM protocol and RPF check functionality. Known as RPF Proxy Vector, it defines a new PIM TLV that contains the IPv4 address of the “proxy” router used for RPF checks and as an intermediate destination for PIM Joins. Let’s see how it works in a particular scenario.

On the diagram below you can see AS 100 and AS 200 using Inter-AS VPN Option B to exchange VPNv4 routes. PEs and ASBRs peer via BGP and exchange VPNv4 and IPv4 MDT SAFI prefixes. For every external prefix relayed to its own PEs, each ASBR changes the next-hop found in MP_REACH_NLRI to its local Loopback address. For VPNv4 prefixes, this achieves the goal of terminating the LSP on the ASBR. For MDT SAFI prefixes, this procedure sets the IPv4 address to be used as “proxy” in PIM Joins.

mvpn-inter-as-basic

Let’s say that the MDT group used by R1 and R5 is 232.1.1.1. When R1 receives the MDT SAFI update with the MDT value of 200:1:232.1.1.1, PE IPv4 address 20.0.5.5 and the next-hop value of 10.0.3.3 (R3’s Loopback0 interface), it will pass this information down to the PIM process. The PIM process will construct a PIM Join for group 232.1.1.1 towards the IP address 20.0.5.5 (not known in AS 100) and insert a proxy vector value of 10.0.3.3. The PIM process will then use the route to 10.0.3.3 to find the next upstream PIM peer to send the Join message to. Every P-router will process the PIM Join message with the proxy vector, and use the proxy IPv4 address to relay the message upstream. As soon as the message reaches the proxy router (in our case it’s R3), the proxy vector is removed and the PIM Joins propagate further using the regular procedure, as the domain behind the proxy is supposed to have visibility of the actual Join target.
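
The per-hop decision described above can be sketched in a few lines of Python. This is only a toy model: the RIB is a plain dictionary of host routes that maps an address to the name of the upstream neighbor (no longest-match lookup), the function name next_join_hop is hypothetical, R2’s own address 10.0.2.2 is an assumption, and R3 is assumed to reach 20.0.5.5 via R4.

```python
def next_join_hop(local_addrs, rib, source, vector):
    """Return (upstream neighbor, vector to carry) for an (S,G) Join.

    rib maps host addresses to upstream neighbor names and stands in
    for a real routing-table lookup.
    """
    if vector is not None and vector in local_addrs:
        vector = None  # this router is the proxy, so the vector is removed
    lookup = vector if vector is not None else source
    return rib[lookup], vector


SOURCE = "20.0.5.5"   # R5, the remote PE and the source of the tree
VECTOR = "10.0.3.3"   # R3's Loopback0, taken from the MDT SAFI next-hop

# R2 is a P-router: it has no route to 20.0.5.5, only to the proxy.
r2 = next_join_hop({"10.0.2.2"}, {"10.0.3.3": "R3"}, SOURCE, VECTOR)
# R3 is the proxy: it strips the vector and looks up the real source.
r3 = next_join_hop({"10.0.3.3"}, {"20.0.5.5": "R4"}, SOURCE, VECTOR)
print(r2)
print(r3)
```

The P-router never needs a route to the source itself; it only has to reach the address carried in the vector.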

In addition to using the proxy vector to relay the PIM Join upwards, every router creates a special mroute state for the (S,G) pair where S is the PE IPv4 address and G is the MDT group. This mroute state will have the proxy IPv4 address associated with it. When a matching multicast packet going from the external PE towards the MDT address hits the router, the RPF check will be performed based on the upstream interface associated with the proxy IPv4 address, not the actual source IPv4 address found in the packet. For example, in our scenario, R2 would have an mroute state for (20.0.5.5, 232.1.1.1) with the proxy IPv4 address of 10.0.3.3. All packets coming from R5 to 232.1.1.1 will be RPF checked based on the upstream interface towards 10.0.3.3.
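
Here is a small Python sketch of that modified RPF check from R2’s point of view. It is a simplification under stated assumptions: the RIB is a dictionary from host address to outgoing interface, the mroute table maps (S,G) to the stored proxy address, and the interface names are placeholders rather than values taken from the diagram.

```python
def rpf_check(rib, mroutes, source, group, in_iface):
    """Accept a packet only if it arrives on the expected upstream interface."""
    proxy = mroutes.get((source, group))      # None when no proxy is stored
    lookup = proxy if proxy is not None else source
    expected = rib.get(lookup)                # None when there is no route
    return expected is not None and expected == in_iface


# R2 only knows how to reach the proxy, not the remote PE itself.
r2_rib = {"10.0.3.3": "Serial2/0.23"}
r2_mroutes = {("20.0.5.5", "232.1.1.1"): "10.0.3.3"}

# Packet from R5 arriving on the interface toward the proxy: accepted.
print(rpf_check(r2_rib, r2_mroutes, "20.0.5.5", "232.1.1.1", "Serial2/0.23"))
# Same packet arriving on another interface: dropped.
print(rpf_check(r2_rib, r2_mroutes, "20.0.5.5", "232.1.1.1", "Serial2/0.12"))
# Without the proxy state there is no route to the source, so the check fails.
print(rpf_check(r2_rib, {}, "20.0.5.5", "232.1.1.1", "Serial2/0.23"))
```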

Using the above-described “proxy-based” procedure, the P routers may successfully perform RPF checks for packets with source IPv4 addresses not found in the local RIB. The tradeoff is the amount of multicast state information that has to be stored in the P-routers’ memory: in the worst-case scenario, where every PE participates in every mVPN, it is proportional to the number of PEs multiplied by the number of mVPNs. There could be even more multicast route states in situations where Data MDTs are being used in addition to the Default MDT.
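
A back-of-the-envelope estimate of that worst case might look like the Python snippet below. The function name and the data_mdts_per_pe parameter (the number of Data MDTs each PE sources per mVPN) are hypothetical, and the PE and mVPN counts are made-up numbers.

```python
def worst_case_states(num_pes, num_mvpns, data_mdts_per_pe=0):
    """(S,G) entries a P-router may hold if every PE joins every mVPN.

    Each PE sources one Default MDT tree per mVPN, plus an optional
    number of Data MDT trees per mVPN.
    """
    return num_pes * num_mvpns * (1 + data_mdts_per_pe)


# 50 PEs and 20 mVPNs, Default MDT only.
print(worst_case_states(50, 20))
# The same network with 4 Data MDTs per PE in every mVPN.
print(worst_case_states(50, 20, data_mdts_per_pe=4))
```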

BGP Connector Attribute

One more piece of information is needed for “intra-VPN” operations: joining a PIM tree towards a particular IP address inside a VPN and performing an RPF check inside the VPN. Consider the use of Inter-AS VPN Option B, where VPNv4 prefixes have their MP_REACH_NLRI next-hop changed to the local ASBR’s IPv4 address. When a local PE receives a multicast packet on the MDT tunnel interface, it decapsulates it and performs a source IPv4 address lookup inside the VRF’s table. Based on MP-BGP learned routes, the next-hop would point towards the ASBR (Option B), while the packets might be coming across a different inter-AS link running multicast MP-BGP peering. Thus, relying solely on the unicast next-hop may not be sufficient for Inter-AS RPF checks.

For example, look at the figure below, where R3 and R4 run MP-BGP for VPNv4 while R4 and R6 run the multicast MP-BGP extension. R1 peers with both ASBRs and learns VPNv4 prefixes from R3, while it learns MDT SAFI information and the PEs’ IPv4 addresses from R6. PIM is enabled only on the link connecting R6 and R4.

mvpn-inter-as-diverse-paths

In this situation, the RPF lookup would fail, as MDT SAFI information is exchanged across the link running M-BGP, while the VPNv4 prefixes’ next-hops point to R3. Thus, a method is required to preserve the information for the RPF lookup.

Cisco suggested the use of a new, optional transitive attribute named BGP Connector to be exported along with VPNv4 prefixes out of PE hosting an mVPN. This attribute contains the following two components: [AFI/SAFI] + [Connector Information] and in general defines information needed by network protocol identified by AFI/SAFI pair to connect to the routing information found in MP_REACH_NLRI. If AFI=IPv4 and SAFI=MDT, the connector attribute contains the IPv4 address of the router originating the prefixes associated with the VRF that has an MDT configured.

The customer-facing PIM process (C-PIM) in the PE routers will use information found in the BGP Connector attribute to perform the intra-VPN RPF check as well as to find the next-hop to send PIM Joins. Notice the C-PIM Joins do not need to have the RPF proxy vector piggy-backed in the PIM messages, as those are transported inside MDT Tunnel towards the remote PEs.
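
The selection logic can be pictured with the Python sketch below. It assumes a VPNv4 route is represented as a dictionary; the field names and the function c_pim_rpf_neighbor are hypothetical. The Inter-AS values mirror the case study later in this post, while the intra-AS route is a made-up example of a prefix without a connector.

```python
def c_pim_rpf_neighbor(route):
    """Pick the RPF neighbor over the MDT tunnel for a VPNv4-learned route.

    If the route carries a BGP Connector attribute, the originating PE
    listed there is used; otherwise fall back to the BGP next hop.
    """
    return route.get("connector") or route["next_hop"]


# Inter-AS Option B: the next hop is the local ASBR, the connector is the remote PE.
inter_as = {"prefix": "192.168.5.0/24", "next_hop": "10.0.3.3", "connector": "20.0.5.5"}
# Hypothetical intra-AS route: the next hop already is the originating PE.
intra_as = {"prefix": "192.168.7.0/24", "next_hop": "10.0.7.7"}

print(c_pim_rpf_neighbor(inter_as))
print(c_pim_rpf_neighbor(intra_as))
```

In the Inter-AS case the answer is the remote PE, which is exactly the PIM neighbor seen over the MDT tunnel, rather than the ASBR that the unicast next hop points to.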

You may notice that the use of BGP Connector attribute eliminates the need for special “VPNv4 multicast” address family that could be used to transport RPF check information for VPNv4 prefixes. The VPNv4 multicast address family is not really needed as the multicast packets are tunneled through SP cores using MDT Tunnel and using the BGP connector is sufficient for RPF-checks at the provider edge. However, the use of mBGP is still needed in situations where diverse unicast and multicast transport paths are used between the Autonomous Systems.

Case Study

Let’s put all concepts to test in a sample scenario. Here, two autonomous systems, AS 100 and AS 200, peer using Inter-AS VPN Option B to exchange VPNv4 prefixes across the link between R3 and R4. At the same time, multicast prefixes are exchanged across the peering link R6-R4 along with MDT SAFI attributes. AS 100 implements a BGP-free core, so R2 does not peer via BGP with any other routers and only uses OSPF for IGP prefix exchange. R1 peers with both ASBRs, R3 and R6, via MP-BGP for the purpose of VPNv4, multicast prefix and MDT SAFI exchange.

mvpn-case-study

On the diagram, the links highlighted in orange are enabled for PIM SM and may carry the multicast traffic. Notice that MPLS traffic and multicast traffic take different paths between the systems. The following are the main highlights of R1’s configuration:

  • VRF RED configured with MDT of 232.1.1.1. PIM SSM configured for P-PIM instance using the default group range of 232/8.
  • The command ip multicast vrf RED rpf proxy rd vector ensures the use of RPF proxy vector for the MDT built for VRF RED. Thus the MDT tree built using P-PIM would make use of RPF proxy vector. Notice that the command ip multicast rpf proxy vector applies to the Joins received in the global routing table and is typically seen in the P-routers.
  • R1 exchanges VPNv4 prefixes with R3 and multicast prefixes with R6 via BGP. At the same time R1 exchanges IPv4 MDT SAFI with R6 to learn the MDT information from AS 200.
  • Connected routes from VRF RED are redistributed into MP-BGP.

R1:
hostname R1
!
interface Serial 2/0
encapsulation frame-relay
no shutdown
!
ip multicast-routing
ip pim ssm default
!
interface Serial 2/0.12 point-to-point
ip address 10.0.12.1 255.255.255.0
frame-relay interface-dlci 102
mpls ip
ip pim sparse-mode
!
interface Loopback0
ip pim sparse-mode
ip address 10.0.1.1 255.255.255.255
!
router ospf 1
network 10.0.12.1 0.0.0.0 area 0
network 10.0.1.1 0.0.0.0 area 0
!
! VRF RED is defined first so that the commands referencing it are accepted when pasted
ip vrf RED
rd 100:1
route-target both 200:1
route-target both 100:1
mdt default 232.1.1.1
!
ip multicast-routing vrf RED
ip multicast vrf RED rpf proxy rd vector
!
router bgp 100
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback 0
neighbor 10.0.6.6 remote-as 100
neighbor 10.0.6.6 update-source Loopback 0
address-family ipv4 unicast
no neighbor 10.0.3.3 activate
no neighbor 10.0.6.6 activate
address-family vpnv4 unicast
neighbor 10.0.3.3 activate
neighbor 10.0.3.3 send-community both
address-family ipv4 mdt
neighbor 10.0.6.6 activate
address-family ipv4 multicast
neighbor 10.0.6.6 activate
network 10.0.1.1 mask 255.255.255.255
address-family ipv4 vrf RED
redistribute connected
!
no ip domain-lookup
!
interface FastEthernet 0/0
ip vrf forwarding RED
ip address 192.168.1.1 255.255.255.0
ip pim dense-mode
no shutdown

Notice that in the configuration above, the Loopback0 interface is used as the source for the MDT tunnel, and therefore has to have PIM (multicast routing) enabled on it. R2’s configuration, which comes next, is straightforward: OSPF is used as the IGP, with adjacencies to R1, R3 and R6, and labels are exchanged via LDP. Notice that R2 does NOT run PIM on the uplink to R3, and does NOT run LDP with R6. Effectively, the path via R6 is used only for multicast traffic while the path across R3 is used only for MPLS LSPs. R2 is configured for PIM-SSM and RPF proxy vector support for the global routing table.

R2:
hostname R2
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip multicast-routing
!
interface Serial 2/0.12 point-to-point
ip address 10.0.12.2 255.255.255.0
frame-relay interface-dlci 201
mpls ip
ip pim sparse-mode
!
ip pim ssm default
!
interface Serial 2/0.23 point-to-point
ip address 10.0.23.2 255.255.255.0
frame-relay interface-dlci 203
mpls ip
!
interface Serial 2/0.26 point-to-point
ip address 10.0.26.2 255.255.255.0
frame-relay interface-dlci 206
ip pim sparse-mode
!
interface Loopback0
ip address 10.0.2.2 255.255.255.255
!
ip multicast rpf proxy vector
!
router ospf 1
network 10.0.12.2 0.0.0.0 area 0
network 10.0.2.2 0.0.0.0 area 0
network 10.0.23.2 0.0.0.0 area 0
network 10.0.26.2 0.0.0.0 area 0

R3 is the ASBR used for implementing Inter-AS VPN Option B. It peers via BGP with R1 (the PE) and R4 (the other ASBR). Only VPNv4 address family is enabled with the BGP peers. Notice that the next-hop for VPNv4 prefixes is changed to self, in order to terminate the transport LSP from the PE on R3. No multicast or MDT SAFI information is exchanged across R3.

R3:
hostname R3
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
interface Serial 2/0.23 point-to-point
ip address 10.0.23.3 255.255.255.0
frame-relay interface-dlci 302
mpls ip
!
interface Serial 2/0.34 point-to-point
ip address 172.16.34.3 255.255.255.0
frame-relay interface-dlci 304
mpls ip
!
interface Loopback0
ip address 10.0.3.3 255.255.255.255
!
router ospf 1
network 10.0.23.3 0.0.0.0 area 0
network 10.0.3.3 0.0.0.0 area 0
!
router bgp 100
no bgp default route-target filter
neighbor 10.0.1.1 remote-as 100
neighbor 10.0.1.1 update-source Loopback 0
neighbor 172.16.34.4 remote-as 200
address-family ipv4 unicast
no neighbor 10.0.1.1 activate
no neighbor 172.16.34.4 activate
address-family vpnv4 unicast
neighbor 10.0.1.1 activate
neighbor 10.0.1.1 next-hop-self
neighbor 10.0.1.1 send-community both
neighbor 172.16.34.4 activate
neighbor 172.16.34.4 send-community both

The second ASBR in AS 100, R6, could be characterized as the multicast-only ASBR. In fact, this ASBR is only used to exchange prefixes in the multicast and MDT SAFI address families with R1 and R4. MPLS is not enabled on this router, and its sole purpose is multicast forwarding between AS 100 and AS 200. There is no need to run MSDP, as PIM SSM is used for multicast tree construction.

R6:
hostname R6
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip multicast-routing
ip pim ssm default
ip multicast rpf proxy vector
!
interface Serial 2/0.26 point-to-point
ip address 10.0.26.6 255.255.255.0
frame-relay interface-dlci 602
ip pim sparse-mode
!
interface Serial 2/0.46 point-to-point
ip address 172.16.46.6 255.255.255.0
frame-relay interface-dlci 604
ip pim sparse-mode
!
interface Loopback0
ip pim sparse-mode
ip address 10.0.6.6 255.255.255.255
!
router ospf 1
network 10.0.6.6 0.0.0.0 area 0
network 10.0.26.6 0.0.0.0 area 0
!
router bgp 100
neighbor 10.0.1.1 remote-as 100
neighbor 10.0.1.1 update-source Loopback 0
neighbor 172.16.46.4 remote-as 200
address-family ipv4 unicast
no neighbor 10.0.1.1 activate
no neighbor 172.16.46.4 activate
address-family ipv4 mdt
neighbor 172.16.46.4 activate
neighbor 10.0.1.1 activate
neighbor 10.0.1.1 next-hop-self
address-family ipv4 multicast
neighbor 172.16.46.4 activate
neighbor 10.0.1.1 activate
neighbor 10.0.1.1 next-hop-self

Pay attention to the following. Firstly, R6 is set for PIM SSM and RPF proxy vector support. Secondly, R6 sets itself as the BGP next hop in the updates sent under the multicast and MDT SAFI families. This is needed for proper MDT tree construction and correct RPF vector insertion. The next router, R4, is the combined VPNv4 and Multicast ASBR for AS 200. It performs the same functions that R3 and R6 perform separately for AS 100. The VPNv4, MDT SAFI, and Multicast address families are enabled under the BGP process for this router. At the same time, the router supports RPF Proxy Vector and PIM-SSM for proper multicast forwarding. This router is the most configuration-intensive of all routers in both Autonomous Systems, as it also has to support MPLS label propagation via BGP and LDP. Of course, as a classic Option B ASBR, R4 has to change the BGP next-hop to itself for the updates of all address families sent to R5, the PE in AS 200.

R4:
hostname R4
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip pim ssm default
ip multicast rpf proxy vector
ip multicast-routing
!
interface Serial 2/0.34 point-to-point
ip address 172.16.34.4 255.255.255.0
frame-relay interface-dlci 403
mpls ip
!
interface Serial 2/0.45 point-to-point
ip address 20.0.45.4 255.255.255.0
frame-relay interface-dlci 405
mpls ip
ip pim sparse-mode
!
interface Serial 2/0.46 point-to-point
ip address 172.16.46.4 255.255.255.0
frame-relay interface-dlci 406
ip pim sparse-mode
!
interface Loopback0
ip address 20.0.4.4 255.255.255.255
!
router ospf 1
network 20.0.4.4 0.0.0.0 area 0
network 20.0.45.4 0.0.0.0 area 0
!
router bgp 200
no bgp default route-target filter
neighbor 172.16.34.3 remote-as 100
neighbor 172.16.46.6 remote-as 100
neighbor 20.0.5.5 remote-as 200
neighbor 20.0.5.5 update-source Loopback0
address-family ipv4 unicast
no neighbor 172.16.34.3 activate
no neighbor 20.0.5.5 activate
no neighbor 172.16.46.6 activate
address-family vpnv4 unicast
neighbor 172.16.34.3 activate
neighbor 172.16.34.3 send-community both
neighbor 20.0.5.5 activate
neighbor 20.0.5.5 send-community both
neighbor 20.0.5.5 next-hop-self
address-family ipv4 mdt
neighbor 172.16.46.6 activate
neighbor 20.0.5.5 activate
neighbor 20.0.5.5 next-hop-self
address-family ipv4 multicast
neighbor 20.0.5.5 activate
neighbor 20.0.5.5 next-hop-self
neighbor 172.16.46.6 activate

The last router in the diagram is R5. It’s a PE in AS 200 configured symmetrically to R1. It has to support the VPNv4, MDT SAFI and Multicast address families to learn all necessary information from the ASBR. Of course, PIM RPF proxy vector is enabled for VRF RED’s MDT, and PIM-SSM is configured for the default group range in the global routing table. As you may have noticed, there is no router in AS 200 that emulates a BGP-free core.

R5:
hostname R5
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip multicast-routing
!
interface Serial 2/0.45 point-to-point
ip address 20.0.45.5 255.255.255.0
frame-relay interface-dlci 504
mpls ip
ip pim sparse-mode
!
interface Loopback0
ip pim sparse-mode
ip address 20.0.5.5 255.255.255.255
!
router ospf 1
network 20.0.5.5 0.0.0.0 area 0
network 20.0.45.5 0.0.0.0 area 0
!
ip vrf RED
rd 200:1
route-target both 200:1
route-target both 100:1
mdt default 232.1.1.1
!
router bgp 200
neighbor 20.0.4.4 remote-as 200
neighbor 20.0.4.4 update-source Loopback0
address-family ipv4 unicast
no neighbor 20.0.4.4 activate
address-family vpnv4 unicast
neighbor 20.0.4.4 activate
neighbor 20.0.4.4 send-community both
address-family ipv4 mdt
neighbor 20.0.4.4 activate
address-family ipv4 multicast
neighbor 20.0.4.4 activate
network 20.0.5.5 mask 255.255.255.255
address-family ipv4 vrf RED
redistribute connected
!
ip multicast vrf RED rpf proxy rd vector
ip pim ssm default
!
ip multicast-routing vrf RED
!
interface FastEthernet 0/0
ip vrf forwarding RED
ip address 192.168.5.1 255.255.255.0
ip pim dense-mode
no shutdown

Validating Unicast Paths

This is the simplest part. Use the show commands to see if the VPNv4 prefixes have propagated between the PEs and test end-to-end connectivity:

R1#sh ip route vrf RED

Routing Table: RED
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
ia - IS-IS inter area, * - candidate default, U - per-user static route
o - ODR, P - periodic downloaded static route

Gateway of last resort is not set

B 192.168.5.0/24 [200/0] via 10.0.3.3, 00:50:35
C 192.168.1.0/24 is directly connected, FastEthernet0/0

R1#show bgp vpnv4 unicast vrf RED
BGP table version is 5, local router ID is 10.0.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 100:1 (default for vrf RED)
*> 192.168.1.0 0.0.0.0 0 32768 ?
*>i192.168.5.0 10.0.3.3 0 100 0 200 ?

R1#show bgp vpnv4 unicast vrf RED 192.168.5.0
BGP routing table entry for 100:1:192.168.5.0/24, version 5
Paths: (1 available, best #1, table RED)
Not advertised to any peer
200, imported path from 200:1:192.168.5.0/24
10.0.3.3 (metric 129) from 10.0.3.3 (10.0.3.3)
Origin incomplete, metric 0, localpref 100, valid, internal, best
Extended Community: RT:100:1 RT:200:1
Connector Attribute: count=1
type 1 len 12 value 200:1:20.0.5.5
mpls labels in/out nolabel/23

R1#traceroute vrf RED 192.168.5.1

Type escape sequence to abort.
Tracing the route to 192.168.5.1

1 10.0.12.2 [MPLS: Labels 17/23 Exp 0] 432 msec 36 msec 60 msec
2 10.0.23.3 [MPLS: Label 23 Exp 0] 68 msec 8 msec 36 msec
3 172.16.34.4 [MPLS: Label 19 Exp 0] 64 msec 16 msec 48 msec
4 192.168.5.1 12 msec * 8 msec

Notice that in the output above, the prefix 192.168.5.0/24 has the next-hop value of 10.0.3.3 and the BGP Connector attribute value of 200:1:20.0.5.5. This information will be used for RPF checks further when we start feeding multicast traffic.
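
If you want to pull the connector value out of such output in a script, a regular expression is enough. The Python sketch below is an illustration only: the SAMPLE text is copied from the output above, the function name parse_connector is made up, and the pattern assumes the “type N len N value RD:address” layout shown there.

```python
import re

# Excerpt of "show bgp vpnv4 unicast vrf RED 192.168.5.0" taken from R1.
SAMPLE = """\
10.0.3.3 (metric 129) from 10.0.3.3 (10.0.3.3)
Extended Community: RT:100:1 RT:200:1
Connector Attribute: count=1
type 1 len 12 value 200:1:20.0.5.5
"""

CONNECTOR = re.compile(r"type \d+ len \d+ value (\d+:\d+):(\d+\.\d+\.\d+\.\d+)")


def parse_connector(text):
    """Return (RD, originating PE address), or None if no connector is present."""
    match = CONNECTOR.search(text)
    return (match.group(1), match.group(2)) if match else None


print(parse_connector(SAMPLE))
```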

Validating Multicast Paths

Multicast forwarding is a bit more complicated. The first thing we should do is make sure the MDTs have been built from R1 towards R5 and from R5 towards R1. Check the PIM MDT groups on every PE:

R1#show ip pim mdt 
MDT Group Interface Source VRF
* 232.1.1.1 Tunnel0 Loopback0 RED
R1#show ip pim mdt bgp
MDT (Route Distinguisher + IPv4) Router ID Next Hop
MDT group 232.1.1.1
200:1:20.0.5.5 10.0.6.6 10.0.6.6

R5#show ip pim mdt
MDT Group Interface Source VRF
* 232.1.1.1 Tunnel0 Loopback0 RED

R5#show ip pim mdt bgp
MDT (Route Distinguisher + IPv4) Router ID Next Hop
MDT group 232.1.1.1
100:1:10.0.1.1 20.0.4.4 20.0.4.4

In the output above, pay attention to the next-hop values found in the MDT BGP information. In AS 100 it points toward R6 while in AS 200 it points to R4. Those next-hops are to be used as the proxy vectors for PIM Join messages. Check the mroutes for the tree (20.0.5.5, 232.1.1.1) starting from R1 and climbing up across R2, R6, R4 to R5:

R1#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(20.0.5.5, 232.1.1.1), 00:58:49/00:02:59, flags: sTIZV
Incoming interface: Serial2/0.12, RPF nbr 10.0.12.2, vector 10.0.6.6
Outgoing interface list:
MVRF RED, Forward/Sparse, 00:58:49/00:01:17

(10.0.1.1, 232.1.1.1), 00:58:49/00:03:19, flags: sT
Incoming interface: Loopback0, RPF nbr 0.0.0.0
Outgoing interface list:
Serial2/0.12, Forward/Sparse, 00:58:47/00:02:55

R1#show ip mroute 232.1.1.1 proxy
(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/10.0.6.6 0.0.0.0 BGP MDT 00:58:51/stopped

R1 shows the RPF proxy value of 10.0.6.6 for the source 20.0.5.5. Notice that there is also a tree toward 10.0.1.1, which has been originated from R5. This tree has no proxies, as it’s now inside its native AS. Next, check R2 and R6 to find the same information (remember that the actual proxy removes the vector when it sees itself in the PIM Join message proxy field):

R2#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:01:41/00:03:25, flags: sT
Incoming interface: Serial2/0.12, RPF nbr 10.0.12.1
Outgoing interface list:
Serial2/0.26, Forward/Sparse, 01:01:41/00:02:50

(20.0.5.5, 232.1.1.1), 01:01:43/00:03:25, flags: sTV
Incoming interface: Serial2/0.26, RPF nbr 10.0.26.6, vector 10.0.6.6
Outgoing interface list:
Serial2/0.12, Forward/Sparse, 01:01:43/00:02:56

R2#show ip mroute 232.1.1.1 proxy
(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/10.0.6.6 10.0.12.1 PIM 01:01:46/00:02:23

Notice the same proxy vector for (20.0.5.5, 232.1.1.1) set by R1. As expected, there is a “contra-directional” tree built toward R1 from R5 that has no RPF proxy vector. Proceed to the outputs from R6. Notice that R6 knows of two proxy vectors, one of which is R6 itself.

R6#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:05:40/00:03:21, flags: sT
Incoming interface: Serial2/0.26, RPF nbr 10.0.26.2
Outgoing interface list:
Serial2/0.46, Forward/Sparse, 01:05:40/00:02:56

(20.0.5.5, 232.1.1.1), 01:05:42/00:03:21, flags: sTV
Incoming interface: Serial2/0.46, RPF nbr 172.16.46.4, vector 172.16.46.4
Outgoing interface list:
Serial2/0.26, Forward/Sparse, 01:05:42/00:02:51

R6#show ip mroute proxy
(10.0.1.1, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
100:1/local 172.16.46.4 PIM 01:05:44/00:02:21

(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/local 10.0.26.2 PIM 01:05:47/00:02:17

The show command outputs from R4 are similar to R6’s, as it’s the proxy for both multicast trees:

R4#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:08:42/00:03:16, flags: sTV
Incoming interface: Serial2/0.46, RPF nbr 172.16.46.6, vector 172.16.46.6
Outgoing interface list:
Serial2/0.45, Forward/Sparse, 01:08:42/00:02:51

(20.0.5.5, 232.1.1.1), 01:08:44/00:03:16, flags: sT
Incoming interface: Serial2/0.45, RPF nbr 20.0.45.5
Outgoing interface list:
Serial2/0.46, Forward/Sparse, 01:08:44/00:02:46

R4#show ip mroute proxy
(10.0.1.1, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
100:1/local 20.0.45.5 PIM 01:08:46/00:02:17

(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/local 172.16.46.6 PIM 01:08:48/00:02:12

Finally, the outputs on R5 mirror the ones we saw on R1. However, this time the multicast trees have swapped roles: the one toward R1 has the proxy vector set:

R5#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:12:07/00:02:57, flags: sTIZV
Incoming interface: Serial2/0.45, RPF nbr 20.0.45.4, vector 20.0.4.4
Outgoing interface list:
MVRF RED, Forward/Sparse, 01:12:07/00:00:02

(20.0.5.5, 232.1.1.1), 01:13:40/00:03:27, flags: sT
Incoming interface: Loopback0, RPF nbr 0.0.0.0
Outgoing interface list:
Serial2/0.45, Forward/Sparse, 01:12:09/00:03:12

R5#show ip mroute proxy
(10.0.1.1, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
100:1/20.0.4.4 0.0.0.0 BGP MDT 01:12:10/stopped

R5’s BGP table also has the connector attribute information for R1’s Ethernet interface:

R5#show bgp vpnv4 unicast vrf RED 192.168.1.0
BGP routing table entry for 200:1:192.168.1.0/24, version 6
Paths: (1 available, best #1, table RED)
Not advertised to any peer
100, imported path from 100:1:192.168.1.0/24
20.0.4.4 (metric 65) from 20.0.4.4 (20.0.4.4)
Origin incomplete, metric 0, localpref 100, valid, internal, best
Extended Community: RT:100:1 RT:200:1
Connector Attribute: count=1
type 1 len 12 value 100:1:10.0.1.1
mpls labels in/out nolabel/18

In addition to the connector attributes, one last piece of information needed is the multicast source information propagated via IPv4 multicast address family. Both R1 and R5 should have this information in their BGP tables:

R5#show bgp ipv4 multicast
BGP table version is 3, local router ID is 20.0.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i10.0.1.1/32 20.0.4.4 0 100 0 100 i
*> 20.0.5.5/32 0.0.0.0 0 32768 i

R1#show bgp ipv4 multicast
BGP table version is 3, local router ID is 10.0.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 10.0.1.1/32 0.0.0.0 0 32768 i
*>i20.0.5.5/32 10.0.6.6 0 100 0 200 i

Now it’s time to verify the multicast connectivity. Make sure R1 and R5 see each other as PIM neighbors over the MDT and then do a multicast ping toward the group joined by both R1 and R5:

R1#show ip pim vrf RED neighbor 
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
S - State Refresh Capable
Neighbor Interface Uptime/Expires Ver DR
Address Prio/Mode
20.0.5.5 Tunnel0 01:31:38/00:01:15 v2 1 / DR S P

R5#show ip pim vrf RED neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
S - State Refresh Capable
Neighbor Interface Uptime/Expires Ver DR
Address Prio/Mode
10.0.1.1 Tunnel0 01:31:17/00:01:26 v2 1 / S P

R1#ping vrf RED 239.1.1.1 repeat 100

Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:

Reply to request 0 from 192.168.1.1, 12 ms
Reply to request 0 from 20.0.5.5, 64 ms
Reply to request 1 from 192.168.1.1, 8 ms
Reply to request 1 from 20.0.5.5, 56 ms
Reply to request 2 from 192.168.1.1, 8 ms
Reply to request 2 from 20.0.5.5, 100 ms
Reply to request 3 from 192.168.1.1, 16 ms
Reply to request 3 from 20.0.5.5, 56 ms

This final check concludes our testbed verification.
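
When repeating this walk-through on a larger topology, it may help to collect the per-hop vectors from the show ip mroute outputs automatically. The Python sketch below parses the “Incoming interface” lines in the format seen in the outputs above; the function name parse_incoming and the returned dictionary keys are hypothetical.

```python
import re

# Matches lines such as:
#   Incoming interface: Serial2/0.12, RPF nbr 10.0.12.2, vector 10.0.6.6
#   Incoming interface: Loopback0, RPF nbr 0.0.0.0
INCOMING = re.compile(r"Incoming interface: (\S+), RPF nbr (\S+?)(?:, vector (\S+))?$")


def parse_incoming(line):
    """Return interface, RPF neighbor and vector (None if absent) from one line."""
    match = INCOMING.search(line.strip())
    if match is None:
        return None
    iface, rpf_nbr, vector = match.groups()
    return {"iface": iface, "rpf_nbr": rpf_nbr, "vector": vector}


with_vector = parse_incoming(
    "Incoming interface: Serial2/0.12, RPF nbr 10.0.12.2, vector 10.0.6.6"
)
without_vector = parse_incoming("Incoming interface: Loopback0, RPF nbr 0.0.0.0")
print(with_vector)
print(without_vector)
```

A tree that shows no vector on a given hop is either inside its native AS or has already passed the proxy.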

Summary

In this blog post we demonstrated how MP-BGP and PIM extensions can be used to implement Inter-AS multicast VPN between autonomous systems with BGP-free cores. PIM SSM is used to build the inter-AS trees, and the MDT SAFI is used to discover the MDT group addresses along with the PEs associated with them. The PIM RPF proxy vector allows for successful RPF checks in a core that has no routes to the multicast sources, by virtue of the proxy IPv4 address. Finally, the BGP connector attribute allows for successful RPF checks inside a particular VRF.

Further Reading

Multicast VPN (draft-rosen-vpn-mcast)
Multicast Tunnel Discovery (draft-wijnands-mt-discovery)
PIM RPF Vector (draft-ietf-pim-rpf-vector-08)
MDT SAFI (draft-nalawade-idr-mdt-safi)

MDT SAFI Configuration
PIM RPF Proxy Vector Configuration
