May 10

Update: Congrats to Mark, our winner of 100 rack rental tokens for the first correct answer: XR2 is missing a BGP router-id. In regular IOS, the router-id is chosen from the highest IP address on a Loopback interface. If there is no Loopback interface, the highest IP address of all up/up interfaces is chosen. In IOS XR, however, the router-id will never be chosen from a physical link; it is taken only from the highest Loopback interface address, or from the manual router-id command. Per the Cisco documentation:

BGP Router Identifier

For BGP sessions between neighbors to be established, BGP must be assigned a router ID. The router ID is sent to BGP peers in the OPEN message when a BGP session is established.

BGP attempts to obtain a router ID in the following ways (in order of preference):

  • By means of the address configured using the bgp router-id command in router configuration mode.
  • By using the highest IPv4 address on a loopback interface in the system if the router is booted with saved loopback address configuration.
  • By using the primary IPv4 address of the first loopback address that gets configured if there are not any in the saved configuration.

If none of these methods for obtaining a router ID succeeds, BGP does not have a router ID and cannot establish any peering sessions with BGP neighbors. In such an instance, an error message is entered in the system log, and the show bgp summary command displays a router ID of 0.0.0.0.

After BGP has obtained a router ID, it continues to use it even if a better router ID becomes available. This usage avoids unnecessary flapping for all BGP sessions. However, if the router ID currently in use becomes invalid (because the interface goes down or its configuration is changed), BGP selects a new router ID (using the rules described) and all established peering sessions are reset.

Since XR2 in this case does not have a Loopback interface configured, the BGP process cannot obtain a router-id and the peering never establishes. The kicker with this problem is that the documentation states that when this problem occurs "an error message is entered in the system log"; however, in this case no syslog message was generated about the error. At least this is the last time this problem will bite me ;)
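For reference, here is a minimal sketch of the fix on XR2; either option resolves the problem (the 20.20.20.20 address is just a hypothetical example):

RP/0/3/CPU0:XR2(config)# interface Loopback0
RP/0/3/CPU0:XR2(config-if)# ipv4 address 20.20.20.20 255.255.255.255
RP/0/3/CPU0:XR2(config-if)# commit

or, alternatively, set the router-id manually under the BGP process:

RP/0/3/CPU0:XR2(config)# router bgp 65001
RP/0/3/CPU0:XR2(config-bgp)# bgp router-id 20.20.20.20
RP/0/3/CPU0:XR2(config-bgp)# commit

Either way, once BGP has a usable router-id, the OPEN message can be sent and the session with XR1 can establish.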

 


Today while working on additional content for our CCIE Service Provider Version 3.0 Lab Workbook I had one of those epic brain fart moments. What started off as work on (what I thought was) a fairly simple design ended up as a 2-hour troubleshooting rabbit hole of rolling back config snippets one by one, debugging, and basically overall misery that can be perfectly summed up by this GIF of a guy smashing his head against his keyboard. :)

The scenario in question was a BGP peering between two IOS XR routers. One was the PE of an MPLS L3VPN network and one was the CE. As I've done this config literally hundreds of times in the past, I could not for the life of me figure out why the BGP peering would not establish. The relevant snippet of the topology diagram is as follows:

Since this scenario caused me so much pleasure I am offering 100 tokens good for CCIE Service Provider Version 3.0 Rack Rentals - or any of our other Routing & Switching rack rentals & mock labs, Security rack rentals, or Voice rack rentals - to the first person who can tell me why these neighbors did not establish a BGP peering. The relevant outputs needed to troubleshoot the problem can be found below. I still haven't decided whether I'm going to leave this problem in the workbook or not since it's such a mean one :) Good luck!

 

 

RP/0/0/CPU0:XR1#show run
Fri May 11 00:34:38.563 UTC
Building configuration...
!! IOS XR Configuration 3.9.1
!! Last configuration change at Fri May 11 00:32:50 2012 by xr1
!
hostname XR1
username xr1
group root-lr
password 7 13061E010803
!
vrf ABC
address-family ipv4 unicast
import route-target
26:65001
!
export route-target
26:65001
!
!
!
line console
exec-timeout 0 0
!
ipv4 access-list PE_ROUTERS
10 permit ipv4 host 1.1.1.1 any
20 permit ipv4 host 2.2.2.2 any
30 permit ipv4 host 5.5.5.5 any
40 permit ipv4 host 19.19.19.19 any
!
interface Loopback0
ipv4 address 19.19.19.19 255.255.255.255
!
interface GigabitEthernet0/1/0/0
ipv4 address 172.19.10.19 255.255.255.0
!
interface GigabitEthernet0/1/0/1
ipv4 address 26.3.19.19 255.255.255.0
!
interface POS0/6/0/0
vrf ABC
ipv4 address 10.19.20.19 255.255.255.0
!
route-policy PASS
pass
end-policy
!
router isis 1
is-type level-2-only
net 49.0001.0000.0000.0019.00
address-family ipv4 unicast
mpls ldp auto-config
!
interface Loopback0
passive
address-family ipv4 unicast
!
!
interface GigabitEthernet0/1/0/1
point-to-point
hello-password hmac-md5 encrypted 022527722E
address-family ipv4 unicast
!
!
!
router bgp 26
address-family ipv4 unicast
!
! address-family ipv4 unicast
address-family vpnv4 unicast
!
neighbor-group PE_ROUTERS
remote-as 26
update-source Loopback0
address-family vpnv4 unicast
!
!
neighbor 1.1.1.1
use neighbor-group PE_ROUTERS
!
neighbor 2.2.2.2
use neighbor-group PE_ROUTERS
!
neighbor 5.5.5.5
use neighbor-group PE_ROUTERS
!
vrf ABC
rd 26:65001
address-family ipv4 unicast
!
neighbor 10.19.20.20
remote-as 65001
address-family ipv4 unicast
route-policy PASS in
route-policy PASS out
as-override
!
!
!
!
mpls ldp
label
allocate for PE_ROUTERS
!
!
end

RP/0/0/CPU0:XR1#

RP/0/3/CPU0:XR2#show run
Fri May 11 00:35:04.932 UTC
Building configuration...
!! IOS XR Configuration 3.9.1
!! Last configuration change at Fri May 11 00:30:30 2012 by xr2
!
hostname XR2
logging console debugging
username xr2
group root-lr
password 7 00071A150754
!
cdp
line console
exec-timeout 0 0
!
interface GigabitEthernet0/4/0/0
ipv4 address 10.20.20.20 255.255.255.0
ipv6 address 2001:10:20:20::20/64
!
interface POS0/7/0/0
ipv4 address 10.19.20.20 255.255.255.0
ipv6 address 2001:10:19:20::20/64
!
route-policy PASS
pass
end-policy
!
router bgp 65001
address-family ipv4 unicast
!
neighbor 10.19.20.19
remote-as 26
address-family ipv4 unicast
route-policy PASS in
route-policy PASS out
!
!
!
end

RP/0/3/CPU0:XR2#

RP/0/0/CPU0:XR1#show bgp vrf ABC ipv4 unicast summary 
Fri May 11 00:34:29.712 UTC
BGP VRF ABC, state: Active
BGP Route Distinguisher: 26:65001
VRF ID: 0x60000002
BGP router identifier 19.19.19.19, local AS number 26
BGP table state: Active
Table ID: 0xe0000002
BGP main routing table version 1

BGP is operating in STANDALONE mode.

Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1 1 1 1 1 1

Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.19.20.20 0 65001 2 7 0 0 0 00:03:59 Idle

 
RP/0/3/CPU0:XR2#show bgp ipv4 unicast summary
Fri May 11 00:35:02.278 UTC
BGP router identifier 0.0.0.0, local AS number 65001
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0000000
BGP main routing table version 1
BGP scan interval 60 secs

BGP is operating in STANDALONE mode.

Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1 1 1 1 1 1

Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.19.20.19 0 26 2 2 0 0 0 00:04:31 Active

 
RP/0/0/CPU0:XR1#show bgp vrf ABC ipv4 unicast neighbors 
Fri May 11 00:34:18.708 UTC

BGP neighbor is 10.19.20.20, vrf ABC
Remote AS 65001, local AS 26, external link
Remote router ID 0.0.0.0
BGP state = Idle
Last read 00:00:00, Last read before reset 00:04:10
Hold time is 180, keepalive interval is 60 seconds
Configured hold time: 180, keepalive: 60, min acceptable hold time: 3
Last write 00:00:15, attempted 53, written 53
Second last write 00:01:01, attempted 53, written 53
Last write before reset 00:04:10, attempted 72, written 72
Second last write before reset 00:04:15, attempted 53, written 53
Last write pulse rcvd May 11 00:34:02.927 last full not set pulse count 9
Last write pulse rcvd before reset 00:04:10
Socket not armed for io, not armed for read, not armed for write
Last write thread event before reset 00:04:10, second last 00:04:10
Last KA expiry before reset 00:00:00, second last 00:00:00
Last KA error before reset 00:00:00, KA not sent 00:00:00
Last KA start before reset 00:00:00, second last 00:00:00
Precedence: internet
Enforcing first AS is enabled
Received 2 messages, 0 notifications, 0 in queue
Sent 7 messages, 0 notifications, 0 in queue
Minimum time between advertisement runs is 0 secs

For Address Family: IPv4 Unicast
BGP neighbor version 0
Update group: 0.2
Route refresh request: received 0, sent 0
Policy for incoming advertisements is PASS
Policy for outgoing advertisements is PASS
0 accepted prefixes, 0 are bestpaths
Cumulative no. of prefixes denied: 0.
Prefix advertised 0, suppressed 0, withdrawn 0
Maximum prefixes allowed 524288
Threshold for warning message 75%, restart interval 0 min
AS override is set
An EoR was not received during read-only mode
Last ack version 0, Last synced ack version 0
Outstanding version objects: current 0, max 0

Connections established 1; dropped 1
Local host: 10.19.20.19, Local port: 19432
Foreign host: 10.19.20.20, Foreign port: 179
Last reset 00:00:15, due to Peer closing down the session
Peer reset reason: Remote closed the session (Connection timed out)
Time since last notification sent to neighbor: 00:02:11
Error Code: administrative shutdown
Notification data sent:
None

RP/0/3/CPU0:XR2#show bgp ipv4 unicast neighbors
Fri May 11 00:34:58.427 UTC

BGP neighbor is 10.19.20.19
Remote AS 26, local AS 65001, external link
Remote router ID 0.0.0.0
BGP state = Active
Last read 00:00:00, Last read before reset 00:04:50
Hold time is 180, keepalive interval is 60 seconds
Configured hold time: 180, keepalive: 60, min acceptable hold time: 3
Last write 00:04:50, attempted 19, written 19
Second last write 00:04:50, attempted 53, written 53
Last write before reset 00:04:50, attempted 19, written 19
Second last write before reset 00:04:50, attempted 53, written 53
Last write pulse rcvd May 11 00:30:08.305 last full not set pulse count 4
Last write pulse rcvd before reset 00:04:50
Socket not armed for io, not armed for read, not armed for write
Last write thread event before reset 00:04:50, second last 00:04:50
Last KA expiry before reset 00:00:00, second last 00:00:00
Last KA error before reset 00:00:00, KA not sent 00:00:00
Last KA start before reset 00:04:50, second last 00:00:00
Precedence: internet
Enforcing first AS is enabled
Received 2 messages, 0 notifications, 0 in queue
Sent 2 messages, 0 notifications, 0 in queue
Minimum time between advertisement runs is 30 secs

For Address Family: IPv4 Unicast
BGP neighbor version 0
Update group: 0.2
Route refresh request: received 0, sent 0
Policy for incoming advertisements is PASS
Policy for outgoing advertisements is PASS
0 accepted prefixes, 0 are bestpaths
Cumulative no. of prefixes denied: 0.
Prefix advertised 0, suppressed 0, withdrawn 0
Maximum prefixes allowed 524288
Threshold for warning message 75%, restart interval 0 min
An EoR was not received during read-only mode
Last ack version 0, Last synced ack version 0
Outstanding version objects: current 0, max 0

Connections established 1; dropped 1
Local host: 10.19.20.20, Local port: 60056
Foreign host: 10.19.20.19, Foreign port: 179
Last reset 00:02:27, due to Interface flap
Time since last notification sent to neighbor: 00:05:07
Error Code: administrative reset
Notification data sent:
None


                        
Oct 12

The BGP MED attribute, commonly referred to as the BGP metric, provides a means to convey to a neighboring Autonomous System (AS) a preferred entry point into the local AS.  BGP MED is a non-transitive optional attribute and thus the receiving AS cannot propagate it across its AS borders.  However, the receiving AS may reset the metric value upon receipt, if it so desires.

Previous versions of BGP (v2 and v3) defined this attribute as the inter-AS metric (INTER_AS_METRIC), but in BGPv4 it is defined as the multi-exit discriminator (MULTI_EXIT_DISC). The MED is an unsigned 32-bit integer, so the MED value can be anything from 0 to 4,294,967,295 (2^32-1), with a lower value being preferred. Certain implementations of BGP will treat a path with a MED value of 4,294,967,295 as infinite, and hence deem the path unusable, so the MED value will be rewritten to 4,294,967,294. This rewriting of the MED value could lead to inconsistencies, unintended path selections or even churn. I'll do a follow-up article on how BGP MED can possibly cause an endless convergence loop in certain topologies.

Cisco's BGP implementation automatically assigns the value of the MED attribute based on the IGP metric value for any locally originated prefixes. The reasoning behind this is that when there are multiple peering points with a neighboring AS, the neighboring AS can use this metric to determine the best entry point into the local AS. This works when the originating AS's network uses a single IGP. When multiple IGPs are used (i.e. OSPF and IS-IS) the metric values automatically copied into BGP will not be comparable. In this situation the metric values should be manually set before being sent to the neighboring AS.
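As a quick illustration, manually setting the MED is normally done with an outbound route-map using the set metric command, which manipulates the MED attribute in a BGP context. The sketch below is hypothetical - it assumes a router in AS 200 reaches R1 via 54.1.12.1, an address not shown in the outputs in this post:

route-map SET_MED permit 10
 set metric 200
!
router bgp 200
 neighbor 54.1.12.1 route-map SET_MED out

Every prefix advertised to 54.1.12.1 through this route-map then carries a MED of 200.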

The MED value by default will only be used in Cisco's BGP Best Path selection algorithm when comparing paths from the same AS. If comparison between different ASes is desired, the bgp always-compare-med router configuration command can be used. Use this command with caution, as different ASes can have different policies regarding the setting of the MED value, or, in the case of the MED being set automatically, they could be using different IGPs. Additionally, by default MED is not compared between sub-autonomous systems in a BGP confederation. To enable comparison between different sub-ASes within a confederation, use the bgp bestpath med confed router configuration command.
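For reference, both knobs are plain router configuration commands; a minimal sketch (the AS number here is arbitrary):

router bgp 100
 bgp always-compare-med
 bgp bestpath med confed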

As mentioned, by default the MED values are compared for paths from the same AS, but this presents a problem in the way BGP path comparison is done in IOS. Let's first examine how the path comparison is done to get a better understanding of the BGP Deterministic MED command and why Cisco recommends enabling it.

Here is the topology that we will use for this scenario:

BGP Deterministic MED

We will primarily look at the effects of BGP MED on the BGP best path decision process from R1’s perspective.   In this network AS 400 is advertising the 24.1.1.0/24 network.  R2, R3 and R4 are in AS 200 with R5 being in AS 300.  R2 is setting the MED for this network to 200, R3 to 300, R4 to 400 and R5 to 500 when the 24.1.1.0/24 network is advertised to R1.  R1’s BGP configuration is below:

Rack1R1# show run | sec router bgp 100
router bgp 100
no synchronization
bgp router-id 1.1.1.1
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#

The output of show ip bgp on R1:

Rack1R1#show ip bgp
BGP table version is 2, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* 24.1.1.0/24 54.1.12.2 200 0 200 400 ?
* 54.1.14.4 400 0 200 400 ?
* 54.1.15.5 500 0 300 400 ?
*> 54.1.13.3 300 0 200 400 ?
Rack1R1#

Now let's look at the 24.1.1.0/24 network in a little more detail:

Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 2
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external, best
Rack1R1#

As we can see, R1 has selected R3's (3.3.3.3) advertisement of 24.1.1.0/24 as the best path. The MED is 300 for this advertisement, which isn't the lowest of all advertisements from AS 200; the advertisement from R2 is actually lower, with a MED value of 200. Remember that the lower MED value is preferred, since this value is normally copied from the IGP metric, and with IGPs the lower metric value is preferred. Since the MED attribute is optional it may not be present in all paths. By default, the BGP process will assume a MED value of zero for such paths, which will make them more preferred during the selection based on metric. If you want to change this behavior, use the bgp bestpath med missing-as-worst router configuration command.
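A minimal sketch of that knob (again with an arbitrary AS number); with it enabled, a path with a missing MED is treated as carrying the highest possible value instead of zero:

router bgp 100
 bgp bestpath med missing-as-worst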

Let's look at how R1 ended up selecting R3 as the best path. First off, the router will order the paths from the newest to the oldest. By default, all other factors in the BGP best path decision process being equal, the oldest path will be selected as best. BGP does this to reduce the amount of churn in the routing table. To change this behavior and not use the oldest path as the best, the BGP router ID can be used to determine the best path. To enable this, use the bgp bestpath compare-routerid router configuration command.

Below the bgp bestpath compare-routerid command is enabled on R1.  Now R1 has selected R2’s path as the best since it has the lowest BGP router ID.

Rack1R1#show run | sec router bgp
router bgp 100
no synchronization
bgp router-id 1.1.1.1
bgp bestpath compare-routerid
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 3
Paths: (4 available, best #1, table Default-IP-Routing-Table)
Flag: 0x10840
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external, best
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
Rack1R1#

The bgp bestpath compare-routerid command is removed for the remainder of this scenario.  When the command is removed R3 is once again selected as best.

Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 4
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Flag: 0x10840
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external, best
Rack1R1#

Additionally, RFC 4277 (Experience with the BGP-4 Protocol) mentions the following in regards to selecting a path based upon age:

7.1.4.  MEDs and Temporal Route Selection

Some implementations have hooks to apply temporal behavior in MED-based best path selection. That is, all things being equal up to MED consideration, preference would be applied to the "oldest" path, without preference for the lower MED value. The reasoning for this is that "older" paths are presumably more stable, and thus preferable. However, temporal behavior in route selection results in non-deterministic behavior, and as such, may often be undesirable.

 

Below is the current state of the BGP table for 24.1.1.0/24 that we will use to walk through the comparison process; note that the paths are listed from the newest (top) to the oldest (bottom):

Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 4
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Flag: 0x820
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external, best
Rack1R1#

First off, it's important to understand that the paths are compared in pairs, starting with the newest path being compared against the second newest. The winning path between the first and second is then compared to the third, and in our case the winner of that comparison is finally compared with the fourth and final path. On R1, for the 24.1.1.0/24 network, R2's and R3's paths are compared first. Everything in the BGP best path decision algorithm is the same down to MED (weight, local preference, AS path, etc). Since the advertisements by R2 and R3 are from the same AS, the MED is compared and R2 wins since it has a MED of 200 as opposed to R3's MED of 300. Next, R2 is compared to the third newest entry, which is R4's. R2 and R4 are in the same AS, so R2 wins based upon the lower MED value. Finally, R2 is compared with R5. Everything is equal but the MED, router ID and age of the advertisement. Since R2 and R5 are in different ASes and bgp always-compare-med isn't enabled, MED isn't compared. Additionally, we do not have bgp bestpath compare-routerid enabled, which leads R1 to select the oldest advertisement. Since R5 is listed below R2 we know that it is older; it wins out due to being the older advertisement and is installed as the best path to reach the 24.1.1.0/24 network.

As we can see, the MED comparison between the paths advertised by AS 200 did not happen as intended by AS 200. AS 200 was setting the MED so that AS 100 would use R2 as the ingress point into AS 200. This happened only because R5's advertisement interleaved with the AS 200 advertisements in the age ordering, which in turn broke the MED comparison between the AS 200 routers (R2, R3 and R4).

Ideally we want the MED compared between advertisements from the same AS irrespective of their age. This is where the bgp deterministic-med router configuration command is useful. When this command is enabled, the router will group all paths from the same AS and compare them together before comparing them to paths from different ASes. Let's enable the command on R1. We should see that R2 is selected as the preferred path among R2, R3 and R4, but once R2 is compared to R5, R5 will still be installed since it is the older advertisement.

Rack1R1#show run | sec router bgp
router bgp 100
no synchronization
bgp router-id 1.1.1.1
bgp log-neighbor-changes
bgp deterministic-med
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#show ip bgp 24.1.1.0/24
BGP routing table entry for 24.1.1.0/24, version 5
Paths: (4 available, best #4, table Default-IP-Routing-Table)
Flag: 0x820
Advertised to update-groups:
2
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external, best
Rack1R1#

If we want to have R2 selected as best, we can clear the BGP neighbor relationship with R5, which will in turn cause R5's paths to be cleared out. Once the neighbor relationship with R5 comes back up and R5 advertises the 24.1.1.0/24 path again, it will be the newest advertisement and in turn be listed at the top.

Rack1R1#clear ip bgp 54.1.15.5
Rack1R1#
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Down User reset
Rack1R1#
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Up
Rack1R1#

Now as expected R2 was finally selected as the best path.

Rack1R1#show ip bgp 24.1.1.0
BGP routing table entry for 24.1.1.0/24, version 6
Paths: (4 available, best #2, table Default-IP-Routing-Table)
Flag: 0x820
Advertised to update-groups:
2
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external, best
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
Rack1R1#

Of course, to always ensure R2 is selected as the best path in our network, we could also use the bgp always-compare-med command to compare MED between different ASes, but this command is normally not used in the real world unless MED policies are standardized between neighboring ASes.

Rack1R1#show run | sec router bgp
router bgp 100
no synchronization
bgp router-id 1.1.1.1
bgp always-compare-med
bgp deterministic-med
neighbor 54.1.12.2 remote-as 200
neighbor 54.1.13.3 remote-as 200
neighbor 54.1.14.4 remote-as 200
neighbor 54.1.15.5 remote-as 300
no auto-summary
Rack1R1#
Rack1R1#clear ip bgp *
%BGP-5-ADJCHANGE: neighbor 54.1.12.2 Down User reset
%BGP-5-ADJCHANGE: neighbor 54.1.13.3 Down User reset
%BGP-5-ADJCHANGE: neighbor 54.1.14.4 Down User reset
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Down User reset
Rack1R1#
%BGP-5-ADJCHANGE: neighbor 54.1.12.2 Up
%BGP-5-ADJCHANGE: neighbor 54.1.13.3 Up
%BGP-5-ADJCHANGE: neighbor 54.1.14.4 Up
%BGP-5-ADJCHANGE: neighbor 54.1.15.5 Up
Rack1R1#show ip bgp 24.1.1.0
BGP routing table entry for 24.1.1.0/24, version 4
Paths: (4 available, best #2, table Default-IP-Routing-Table)
Flag: 0x10860
Advertised to update-groups:
2
300 400
54.1.15.5 from 54.1.15.5 (5.5.5.5)
Origin incomplete, metric 500, localpref 100, valid, external
200 400
54.1.12.2 from 54.1.12.2 (2.2.2.2)
Origin incomplete, metric 200, localpref 100, valid, external, best
200 400
54.1.13.3 from 54.1.13.3 (3.3.3.3)
Origin incomplete, metric 300, localpref 100, valid, external
200 400
54.1.14.4 from 54.1.14.4 (4.4.4.4)
Origin incomplete, metric 400, localpref 100, valid, external
Rack1R1#

If BGP Deterministic MED is used, it should be enabled on all BGP-speaking devices within an AS to ensure a consistent policy regarding the use of MEDs.

We should now have a better understanding of how MED is used in the BGP route selection process, and of the BGP route selection process in general.

My next post will be regarding the Two Rate Three Color Marker (trTCM) as defined in RFC 2698 and implemented in Cisco IOS. Also, I hope to see many of you in my new RS Bootcamps.

Nov 22

Introduction

BGP (see [0]) is the de-facto protocol used for Inter-AS connectivity nowadays. Even though it is commonly accepted that the BGP protocol design is far from ideal, and there have been attempts to develop a better replacement for BGP, none of them has been successful. Further adding to BGP's widespread adoption, the MP-BGP extension allows BGP to transport almost any kind of control-plane information, e.g. providing auto-discovery functions or control-plane interworking for MPLS/BGP VPNs. However, despite BGP's success, the problems with the protocol design did not disappear. One of them is slow convergence, which is a serious limiting factor for many modern applications. In this publication, we are going to discuss some techniques that could be used to improve BGP convergence for Intra-AS deployments.

BGP-Only Convergence Process
Tuning BGP Transport
BGP Fast Peering Session Deactivation
BGP and IGP Interaction
BGP PIC and Multiple-Path Propagation
Practical Scenario: BGP PIC + BGP NHT
Considerations for Implementing BGP PIC
Summary
Further Reading
Appendix: Practical Scenario Baseline Configuration


BGP-Only Convergence Process

BGP is a path-vector protocol - in other words, a distance-vector protocol featuring a complex metric. In the absence of any policies, BGP operates as if routes had a metric equal to the length of the AS_PATH attribute. BGP routing policies may override this simple monotonic metric and potentially create divergence conditions in non-trivial BGP topologies (see [7],[8],[9]). While this may be a serious problem at a large scale, we are not going to discuss these pathological cases, but rather talk about convergence in general. Like any distance-vector protocol, the BGP routing process accepts multiple incoming routing updates and advertises only the best routes to its peers. BGP does not utilize periodic updates, and thus route invalidation is not based on expiring any sort of soft-state information (e.g. prefix-related timers like in RIP). Instead, BGP uses an explicit withdrawal section in the triggered UPDATE message to signal neighbors of the loss of a particular path. In addition to the explicit withdrawals, BGP also supports implicit signaling, where newer information for the same prefix from the same peer replaces the previously learned information.

Let's have a look at the BGP UPDATE message below. As you can see, the UPDATE message may contain both withdrawn prefixes and new routing information. While withdrawn prefixes are listed simply as a collection of NLRIs, new information is grouped around a set of BGP attributes shared by the group of announced prefixes. In other words, every BGP UPDATE message carries new information pertaining to a single set of path attributes - at minimum, the prefixes in one message must share the same AS_PATH attribute. Therefore, every new collection of attributes requires a separate UPDATE message to be sent. This fact is important, as the BGP process tries to pack as many prefixes per UPDATE message as possible when replicating routing information.

BGP-Convergence-FIG0

Look at the sample topology below. Let's assume that R1's session to R7 just came up, and follow the path that prefix 20.0.0.0/8 takes to propagate through AS 300. In the course of this discussion we skip the complexities associated with BGP policy application, and thus ignore the existence of the BGP Adj-RIB-In space used for processing the prefixes learned from a peer prior to running the best-path selection process.

BGP-Convergence-FIG1

  • Upon session establishment and exchanging the BGP OPEN messages, R1 enters the "BGP Read-Only Mode". This means that R1 will not start the BGP Best-Path selection process until it either receives all prefixes from R7 or reaches the BGP read-only mode timeout. The timeout is defined using the BGP process command bgp update-delay. The reason to hold the BGP best-path selection process is to ensure that the peer has supplied all of its routing information. This minimizes the number of best-path selection process runs, simplifies update generation and ensures better prefix-per-message packing, thus improving transport efficiency.
  • The BGP process determines the end of the UPDATE message flow in either of two ways: receiving a BGP KEEPALIVE message or receiving a BGP End-of-RIB message. The latter message is normally used for BGP graceful restart (see [13]), but could also be used to explicitly signal the end of the BGP UPDATE exchange process. Even if a BGP process does not support the End-of-RIB marker, Cisco's BGP implementation always sends a KEEPALIVE message when it finishes sending updates to a peer. Clearly, the best-path selection delay will be longer in cases where peers have to exchange larger routing tables, or where the underlying TCP transport and router ingress queue settings make the exchange slower. To address this, we'll briefly cover TCP transport optimization later.
  • When R1's BGP process leaves read-only mode, it starts the best-path selection by running the BGP Router process. This process walks over the new information and compares it with the local BGP RIB contents, selecting the best path for every prefix. It takes time proportional to the amount of new information learned. Luckily, the computations are not very CPU-intensive, just like with any distance-vector protocol. As soon as the best-path process is finished, BGP has to upload all routes to the RIB before advertising them to the peers. This is a requirement of distance-vector protocols - having the routing information active in the RIB before propagating it further. The RIB update will in turn trigger FIB information upload to the router's line cards, if the platform supports distributed forwarding. Both RIB and FIB updates are time-consuming and take time proportional to the number of prefixes being updated.
  • After information has been committed to the RIB, R1 needs to replicate the best paths to every peer that should receive them. The replication process may be the most memory- and CPU-intensive part, as the BGP process has to perform a full BGP table walk for every peer and construct the output for the corresponding BGP Adj-RIB-Out. This may require additional transient memory in the course of the update batch calculation. However, the update generation process is highly optimized in Cisco's BGP implementation by means of dynamic update groups. The essence of dynamic update groups is that the BGP process dynamically finds all neighbors sharing the same output policies, then elects the peer with the lowest IP address as the group leader and only generates the update batch for the group leader. All other members of the same group receive the same updates. In our case, R1 has to generate two update sets: one for R5 and another for the pair of RR1 and RR2 route reflectors. BGP update groups become very effective on route reflectors, which often have hundreds of peers sharing the same policies. You may see the update groups using the command show ip bgp replication for IPv4 sessions.
  • R1 starts sending updates to R5, RR1 and RR2. This will take some time, depending on the BGP TCP transport settings and BGP table size. However, before R1 starts sending any updates to a peer/update group, it checks whether the Advertisement Interval timer is running for this peer. A BGP speaker starts this timer on a per-peer basis every time it is done sending the full batch of updates to the peer. If the subsequent batch is prepared to be sent and the timer is still running, the update will be delayed until the timer expires. This is a dampening mechanism to prevent unstable peers from flooding the network with updates. The command to define this timer is neighbor X.X.X.X advertisement-interval XX. The default values are 30 seconds for eBGP and 5 seconds for iBGP/eiBGP sessions (intra-AS). This timer really starts playing its role only for "Down-Up" or "Up-Down" convergence, as any rapid flapping changes are delayed for advertisement-interval seconds. This becomes especially important for inter-AS route propagation, where the default advertisement-interval is 30 seconds.

The process repeats itself on RR1 and RR2, starting with the incoming UPDATE packet reception, best-path selection and update generation. If for some reason the prefix 20.0.0.0/8 were to vanish from AS 100 soon after it has been advertised, the withdrawal may take as long as "Number_of_Hops x Advertisement_Interval" to reach R3 and R4, as every hop may delay the quick subsequent update. As we can see, the main limiting factors of BGP convergence are the BGP table size, transport-level settings and the advertisement delay. The best-path selection time is proportional to the table size, as is the time required for update batching.
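For reference, both of the timers mentioned above are directly configurable. A minimal sketch with hypothetical values and a hypothetical iBGP neighbor address - the right numbers depend entirely on your table size and stability requirements:

router bgp 300
 bgp update-delay 60
 neighbor 10.0.13.3 advertisement-interval 5

Here bgp update-delay caps how long the speaker stays in read-only mode after session establishment, and advertisement-interval sets the minimum spacing between update batches sent to that peer.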

Let's look at a slightly different scenario to demonstrate how BGP multi-path may potentially improve convergence. First, observing the topology presented in FIG 1, we may notice that AS 300 has two connections to AS 100. Thus, it might be expected that every router in AS 300 sees two paths to every route from AS 100. But this is not always possible in situations where any topology other than a BGP full mesh is used inside the AS. In our example, R1 and R2 advertise routing information to the route reflectors RR1 and RR2. Per the distance-vector behavior, the reflectors will only re-advertise the best path to AS 100 prefixes, and since both RRs elect paths consistently, they will advertise the same path to R3, R4 and R2. Both R3 and R4 will receive the prefix 10.0.0.0/24 from each of the RRs and use the path via R1. R2 will receive the best path via R1 as well, but prefer using its eBGP connection. On the contrary, if R1, R2, R3 and R4 were connected in a full mesh, then every router would have seen exits via both R1 and R2 and would be able to use BGP multi-path if configured. Let's review what happens in the topology in FIG 1 when R1 loses its connection to AS 100.

  • Depending on the failure detection mechanism, be it BGP keepalives or BFD, it will take some time for R1 to realize the connection is no longer valid. We'll discuss the options for fast failure detection later in this publication.
  • After realizing that R5 is gone, R1 deletes all paths via R7. Since RR1 and RR2 never advertised back to R1 the path via R2, R1 has no alternate paths to AS 100. Realizing this, R1 prepares a batch of UPDATE messages for RR1, RR2 and R7, containing the withdrawal messages for AS 100 prefixes. As soon as RR1 and RR2 are done receiving and processing the withdrawals, they elect the new best path via R2 and advertise withdrawals/updates to R1, R2, R3, R4.
  • R3 and R4 now have the new path via R2, and R2 loses the "backup" path via R1 it knew about from the RRs. The main workhorses of the re-convergence process in this case are the route reflectors. The convergence time is the sum of the peering session failure detection, update advertisement and BGP best-path recalculation times in the RRs.

If BGP speakers were able to utilize multiple paths at the same time, then it could be possible to alleviate the severity of a network failure. Indeed, if load-balancing is in use, then a failure of an exit point will only affect flows going across this exit point (50% in our case), and only those flows will have to wait for the re-convergence time. Even better, it is theoretically possible to do "fast" re-route in the case where multiple equal-cost (equivalent and thus loop-free) paths are available in BGP. Such a switchover could be performed in the forwarding engine as soon as the failure is signaled. However, there are two major problems with a re-route mechanism of this type:

  1. As we have seen, the use of route reflectors (or confederations) has a significant effect on redundancy by hiding alternate paths. Using a full mesh is not an option, so a mechanism is needed to allow propagation of multiple alternate paths in an RR/Confederation environment. It is interesting to point out that such a mechanism is already available in BGP/MPLS VPN scenarios, where multiple points of attachment for CE sites can utilize different RD values to differentiate the same routes advertised from different connection points. However, a generic solution is required, allowing multiple alternate paths to be advertised with IPv4 or any other address-family.
  2. Failure detection and propagation by means of BGP mechanics is slow, and depends on the number of affected prefixes. Therefore, the more severe the damage, the slower it propagates in BGP. Some other, non-BGP mechanism needs to be used to report network failures and trigger BGP re-convergence.

In the following sections we are going to review various technologies developed to accelerate BGP convergence, enabling far better reaction times compared to "pure BGP-based" failure detection and repair.


Tuning BGP Transport

Tuning the BGP transport mechanism is a very important factor for improving BGP performance in cases where a purely BGP-based re-convergence process is in use. TCP is the underlying transport used for propagating BGP UPDATE messages, and optimizing TCP performance directly benefits BGP. If you take the full Internet routing table, which is above 300k prefixes (Y2010), then simply transporting the prefixes alone will consume over 10 megabytes, not counting the path attributes and other information. Tuning TCP transport performance includes the following:

  1. Enabling TCP Path MTU discovery for every neighbor, to allow TCP to select the optimum MSS size. Notice that this requires that no firewall blocks the ICMP unreachable messages used during the discovery process.
  2. Tuning the router's ingress queue size to allow for successful absorption of large amounts of TCP ACK messages. When a router starts replicating BGP UPDATEs to its peers, every peer responds with a TCP ACK message to, normally, every second segment sent (TCP Delayed ACK). The more peers a router has, the higher the pressure on the ingress queue. A configuration sketch for both items follows this list.
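Here is a minimal sketch of both tweaks in IOS; the neighbor address, interface and queue depth are hypothetical placeholders, so size them for your own environment:

ip tcp path-mtu-discovery
!
router bgp 300
 neighbor 10.0.13.3 transport path-mtu-discovery
!
interface GigabitEthernet0/0
 ! deepen the ingress queue to absorb bursts of TCP ACKs
 hold-queue 1000 in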

Very detailed information on tuning BGP transport could be found in [10] Chapter 3. We, therefore, skip an in-depth discussion of this topic here.


BGP Fast Peering Session Deactivation

When using BGP-only convergence mechanics, detecting a link failure is normally based on the BGP KEEPALIVE timers, which are 60/180 seconds by default. It could be noted that TCP keepalives could be used for the same purpose, but since BGP already has similar mechanics, these are not of much help. It is possible to tune the BGP keepalive timers to be as low as 1/3 seconds, but the risk of peering session flapping becomes significant with such settings. Such instability is dangerous, since there is no built-in session dampening mechanism in the BGP session establishment process. Therefore, some other mechanism should be preferred - either BFD or fast BGP peering session deactivation. The latter option is on by default for eBGP sessions, and tracks the outgoing interface associated with the BGP session. As soon as the interface (or the next-hop for multihop eBGP) is reported as down, the BGP session is deactivated. Interface flapping could be effectively dampened using IP Event Dampening in Cisco IOS (see [14]) and hence is less dangerous than BGP peering session flapping. The command to disable fast peering session deactivation is no bgp fast-external-fallover. Notice that this feature is by default off for iBGP sessions, as those are supposed to be routed around failures and restored using the underlying IGP mechanics.

Using BFD is the best option on multipoint interfaces, such as Ethernet, that do not support fast link-down detection, e.g. by means of Ethernet OAM. BFD is especially attractive on platforms that implement it in hardware. The command to activate BFD fallover is neighbor fall-over bfd. In the following sections, we'll discuss the use of the IGP for fast reporting of link failures.
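A minimal sketch of BFD-based fallover in IOS; the interface, timers and neighbor address are hypothetical, and the aggressive 100 ms interval assumes a platform that can sustain it:

interface GigabitEthernet0/0
 ! BFD transmit interval / minimum receive interval in ms, and detect multiplier
 bfd interval 100 min_rx 100 multiplier 3
!
router bgp 300
 neighbor 10.0.13.3 fall-over bfd

With this in place, the BGP session is torn down within roughly 300 ms (interval x multiplier) of the forwarding path failing, instead of waiting out the BGP hold timer.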


BGP and IGP Interaction

BGP prefixes typically rely on recursive next-hop resolution. That is, next-hops associated with BGP prefixes are normally not directly connected, but rather resolved via the IGP. The core of BGP and IGP interaction used to be implemented in the BGP Scanner process. This process runs periodically and, among other work, performs a full BGP table walk and validates the BGP next-hop values. The validation consists of resolving the next-hop recursively through the router's RIB and possibly changing the forwarding information in response to IGP events. For example, if R1 in FIG 1 crashes, it will take 180 seconds for the RRs to notice the failure based on BGP KEEPALIVE messages. However, the IGP will probably converge faster and report R1's address as unreachable. This event will be detected during the BGP Scanner process run, and all paths via R1 will be invalidated by all BGP speakers in AS 100. The default BGP Scanner run-time is 60 seconds, and it could be changed using the command bgp scan-time. Notice that setting this value too low may result in an extra burden on the router's CPU if you have large BGP tables, since the scanner process has to perform a full table walk every time it executes.

The periodic behavior of the BGP Scanner is still too slow to respond effectively to IGP events. IGP protocols could be tuned to react to a network change within hundreds of milliseconds (see [6]), and it would be desirable to make BGP aware of such changes as quickly as possible. This could be done with the help of the BGP Next-Hop Tracking (NHT) feature. The idea is to make the BGP process register the next-hop values with the RIB "watcher" process and request a "call-back" every time information about the prefix corresponding to the next-hop changes. Typically, the number of registered next-hop values equals the number of exits from the local AS, or the number of PEs in an MPLS/BGP VPN environment, so next-hop tracking does not impose heavy memory/CPU requirements. There are normally two types of events: an IGP prefix becoming unreachable and an IGP prefix metric change. The first event is more important and is reported faster than a metric change. Overall, BGP delays its reaction to an IGP event for the duration of the bgp nexthop trigger delay XX interval, which is 5 seconds by default. This allows more consecutive events to be received from the IGP and processed together, effectively implementing event aggregation. This delay is helpful in various "fate sharing" scenarios where a facility failure affects multiple links in the network, and BGP needs to ensure that all IGP nodes have reported this failure and the IGP has fully converged. Normally, you should set the NHT delay to be slightly above the time it takes the IGP to fully converge upon a change in the network. In a fast-tuned IGP network, you can set this delay as low as 0 seconds, so that every IGP event is reported immediately, though this requires careful underlying IGP tuning to avoid oscillations. See [6] for more information on tuning the IGP protocol settings, but in short, you need to tune the SPF delay value in the IGP to be conservative enough to capture all changes that could be caused by a failure in the network. Setting the SPF delay too low may result in excessive BGP next-hop recalculations and massive best-path process runs.

As a reaction to an IGP next-hop change, the BGP process has to start the BGP Router sub-process to re-calculate the best paths. This will affect every prefix whose next-hop changed as a result of the IGP event, and could take a significant amount of time, depending on the number of prefixes associated with this next-hop. For example, if an AS has two connections to the Internet and receives full BGP tables over both connections, then a single exit failure will force a full-table walk for over 300k prefixes. After this happens, BGP has to upload the new forwarding information to the RIB/FIB, with the overall delay being proportional to the table size. To put it in other words, BGP convergence is non-deterministic in response to an IGP event, i.e. there is no well-defined finite time for the process to complete. However, if the IGP change did not have any effect on the BGP next-hop, e.g. if the IGP was able to repair the path upon link failure and the path has the same cost, then BGP does not need to be informed at all and convergence is handled at the IGP level.

The last, less visible contributor to faster convergence is the hierarchical FIB. Look at the figure below - it shows how a FIB could be organized as either "flat" or "hierarchical". In the "flat" case, BGP prefixes have their forwarding information directly associated - e.g. the outgoing interface, MAC rewrite, MPLS label information and so on. In such a case, any change to a BGP next-hop may require updating many prefixes sharing the same next-hop, which is a time-consuming process. Even if the next-hop value remains the same and only the output interface changes, the FIB update process still needs to walk over all BGP prefixes and reprogram the forwarding information. In the case of a "hierarchical" FIB, any IGP change that does not affect BGP prefixes, e.g. an output interface change, only requires walking over the IGP prefixes, which are not as numerous as the BGP ones. Therefore, hierarchical FIB organization significantly reduces FIB update latency in cases where only IGP information needs to be changed. The use of the hierarchical FIB is automatic and does not require any special commands. All major networking equipment vendors support this feature.

BGP-Convergence-FIG2

The last thing to discuss in relation to BGP NHT is IGP route summarization. Summarization hides detailed information and may conceal changes occurring in the network. In such a case, the BGP process will not be notified of the IGP event and will have to detect the failure and re-converge using BGP-only mechanics. Look at the figure below - because of summarization, R1 will not be notified of R2's failure, and the BGP process on R1 will have to wait until the BGP session times out. Aside from avoiding summarization for the prefixes used for iBGP peering, an alternate solution could be using multi-hop BFD [15]. Additionally, there is some work in progress to allow the separation of routing and reachability information natively in IGP protocols.

BGP-Convergence-FIG3

You can see now how NHT may allow BGP to react quickly to events inside its own AS, provided that the underlying IGP is properly tuned for fast convergence. This fast convergence process effectively covers core link and node failures, as well as edge link and node failures, provided that all of these can be detected by the IGP. You may want to look at [1] for detailed convergence breakdowns. Pay special attention to the fact that an edge link failure requires special handling. If your edge BGP speaker is changing the next-hop value to self for the routes received from another autonomous system, then the IGP will only be able to detect failures of paths going to the BGP speaker's own IP address. If the edge link itself fails, the convergence will follow the BGP path, using BGP withdrawal message propagation through the AS. The best approach in this case is to leave the eBGP next-hop IP address unmodified and advertise the edge link into the IGP using the passive-interface feature or redistribution. This will allow the IGP to respond to the link-down condition by quickly propagating the new LSA and synchronously triggering BGP re-convergence on all BGP speakers in the system by informing them of the failed next-hop. In topologies with large BGP tables this takes significantly less time compared to the BGP-based convergence process. And lastly, despite all the benefits that BGP NHT may provide for recovering from Intra-AS failures, Inter-AS convergence is still purely BGP driven, based on BGP's distance-vector behavior.
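As an illustration, here is a minimal sketch of the passive-interface approach on an edge router. The interface name is hypothetical, and the 20.0.17.0/24 subnet is assumed here to be the eBGP edge link carrying the 20.0.17.7 next-hop seen later in this article:

router ospf 100
 passive-interface GigabitEthernet0/1
 network 20.0.17.0 0.0.0.255 area 0

The edge subnet is advertised into OSPF without forming adjacencies over it, so when the link goes down, the LSA flush quickly invalidates the 20.0.17.7 next-hop on every BGP speaker tracking it via NHT.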


BGP PIC and Multiple-Path Propagation

Even though BGP NHT enables fast reaction to IGP events, the convergence time is still not deterministic, because it depends on the number of prefixes BGP needs to process for best-path selection. Previously, we discussed how having multiple equal-cost BGP paths could be used for redundancy and fast failover at the forwarding engine level, without involving any BGP best-path selection. What if the paths are unequal - is it possible to use them for backup? In fact, since BGP treats the local AS as a single hop, all BGP speakers select the same path consistently, and changing from one path to another synchronously among all speakers should not create any permanent routing loops. Thus, even in scenarios where equal-cost BGP multi-path is not possible, the secondary paths may still be used for fast failover, provided that a signaling mechanism to detect the primary path failure exists. We already know that BGP NHT could be used to detect a failure and propagate this information quickly to all BGP speakers, triggering a local switchover. This switchover does not require any BGP table walks or best-path re-election, but is simply a matter of changing the forwarding information - provided that a hierarchical FIB is in use. Therefore, this process does not depend on the number of BGP prefixes, and is thus known as the Prefix Independent Convergence (PIC) process. You may think of this process as a BGP equivalent of IGP-based Fast Re-Route, though in the IGP failure detection is local to the router, while in BGP failure detection is local to the AS. BGP PIC could be used any time there are multiple paths to the destination prefix, such as on R1 in the example below, where the target prefix is reachable via multiple paths:

We have already stated the problem with multiple paths - only one best path is advertised by BGP speakers, and a BGP speaker will only accept one path for a given prefix from a given peer. If a BGP speaker receives multiple paths for the same prefix within the same session, it simply uses the newest advertisement. A special extension to BGP known as "Add Paths" (see [3] and [16]) allows a BGP speaker to propagate and accept multiple paths for the same prefix. The "Add Paths" capability allows peering BGP speakers to negotiate whether they support advertising/receiving multiple paths per prefix and to actually advertise such paths. A special 4-byte path identifier is added to NLRIs to differentiate multiple paths for the same prefix sent across a peering session. Notice that BGP still considers all paths as comparable from the viewpoint of the best-path selection process - all paths are stored in the BGP RIB and only one is selected as the best path. The additional NLRI identifier is only used when prefixes are sent across a peering session, to prevent implicit withdrawals by the receiving peer. These identifiers are generated locally and independently for every peering session that supports the capability.

BGP-Convergence-FIG4

In addition to propagating backup paths, the "Add Paths" capability could be used for other purposes, e.g. overcoming the BGP divergence problems described in [9]. Alternatively, if backup paths are required but the "Add Paths" feature is not implemented, one of your options could be using a full mesh of BGP speakers, such as on the figure below. In this case, information about multiple exit points is preserved, which allows for implementing BGP PIC functionality.

BGP-Convergence-FIG5

Pay attention to the fact that BGP PIC is possible even without the "Add Paths" capability in RR scenarios, provided that the RRs propagate the alternate paths to the edge nodes. This may require IGP metric manipulation to ensure different exit points are selected by the RRs, or using other techniques, such as different RD values for multi-homed site attachment points.


Practical Scenario: BGP PIC + BGP NHT

In this hands-on scenario we are going to illustrate the use of IGP tuning, BGP NHT configuration and BGP PIC, and demonstrate how they work together. First, look at the topology diagram: R9 is advertising a prefix, and R5 and R6 receive this prefix via the RRs. In a normal BGP environment, provided that the RRs elect the same path, R5 and R6 would have just one path for R9's prefix. However, we tune the scenario by disabling the connections between R1 and R4 and between R2 and R3, so that R3 has a better cost to exit via R1 and R4 has a better cost via R2. This will make the RRs elect different best paths and propagate them to their clients.

BGP-Convergence-FIG6

The following is the key piece of configuration for enabling the fast backup-path failover, to be applied to every router in AS 100. As you can see, the SPF/LSA throttling timers are tuned very aggressively to allow for the fastest reaction to IGP events. The BGP nexthop trigger delay is set to 0 seconds, thus fully relying on the IGP to aggregate the underlying events. In any production environment, you should NOT use these values; pick your own, matching your IGP scale and convergence rate.

router ospf 100
timers throttle spf 1 100 5000
timers throttle lsa all 0 100 5000
timers lsa arrival 50
!
router bgp 100
bgp nexthop trigger delay 0
bgp additional-paths install
no bgp recursion host

The command bgp additional-paths install, when executed in a non-BGP-multipath environment, allows backup paths to be installed in addition to the best one elected by BGP. This, of course, requires that the additional paths have been advertised by the BGP route reflectors. At the moment of writing, Cisco IOS does not support the "Add Paths" capability, so you need to make sure the BGP RRs elect different best paths in order for the edge routers to be able to use additional paths. The command no bgp recursion host requires a special explanation on its own. By default, when a BGP prefix loses its next-hop, the CEF process will attempt to look up the next longest-matching prefix for the next-hop to provide a fallback. When additional repair paths are present, this functionality is not required and will, in fact, slow down convergence. This is why it's automatically disabled when you type the command bgp additional-paths install, and thus typing it with the "no" prefix is not really required.

Now that we have our scenario set up, we are going to demonstrate the fact that, at least in the current implementation, the Cisco IOS BGP process does not exchange/detect the capabilities for the "Add Paths" feature. Here is debugging output from a peering session establishment, which shows that no "Add Path" capability (code 69, per the RFC draft) is being exchanged during session establishment.

R5#debug ip bgp 10.0.3.3
BGP debugging is on for neighbor 10.0.3.3 for address family: IPv4 Unicast
R5#clear ip bgp 10.0.3.3

BGP: 10.0.3.3 active rcv OPEN, version 4, holdtime 180 seconds
BGP: 10.0.3.3 active rcv OPEN w/ OPTION parameter len: 29
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 6
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 1, length 4
BGP: 10.0.3.3 active OPEN has MP_EXT CAP for afi/safi: 1/1
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 2
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 128, length 0
BGP: 10.0.3.3 active OPEN has ROUTE-REFRESH capability(old) for all address-families
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 2
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 2, length 0
BGP: 10.0.3.3 active OPEN has ROUTE-REFRESH capability(new) for all address-families
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 3
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 131, length 1
BGP: 10.0.3.3 active OPEN has MULTISESSION capability, without grouping
BGP: 10.0.3.3 active rcvd OPEN w/ optional parameter type 2 (Capability) len 6
BGP: 10.0.3.3 active OPEN has CAPABILITY code: 65, length 4
BGP: 10.0.3.3 active OPEN has 4-byte ASN CAP for: 100
BGP: nbr global 10.0.3.3 neighbor does not have IPv4 MDT topology activated
BGP: 10.0.3.3 active rcvd OPEN w/ remote AS 100, 4-byte remote AS 100
BGP: 10.0.3.3 active went from OpenSent to OpenConfirm
BGP: 10.0.3.3 active went from OpenConfirm to Established

This means that we need to rely on the BGP RRs to advertise multiple different paths in order for the edge nodes to leverage the backup path capability.

R5#debug ip bgp updates
BGP updates debugging is on for address family: IPv4 Unicast
R5#debug ip bgp addpath
BGP additional-path related events debugging is on
R5#clear ip bgp 10.0.3.3

BGP(0): 10.0.3.3 rcvd UPDATE w/ attr: nexthop 20.0.17.7, origin i, localpref 100, metric 0, originator 10.0.1.1, clusterlist 10.0.3.3, merged path 200, AS_PATH
BGP(0): 10.0.3.3 rcvd 20.0.99.0/24
BGP(0): 10.0.3.3 rcvd NEW PATH UPDATE (bp/be - Deny)w/ prefix: 20.0.99.0/24, label 1048577, bp=N, be=N
BGP(0): 10.0.3.3 rcvd UPDATE w/ prefix: 20.0.99.0/24, - DO BESTPATH
BGP(0): Calculating bestpath for 20.0.99.0/24

Here you can see that the RR with IP address 10.0.3.3 sends us an update that has better information than what we currently know. However, before you enable bgp additional-paths install, there is just one path installed for the prefix:

R5#show ip route repair-paths 20.0.99.0
Routing entry for 20.0.99.0/24
Known via "bgp 100", distance 200, metric 0
Tag 200, type internal
Last update from 20.0.17.7 00:02:31 ago
Routing Descriptor Blocks:
* 20.0.17.7, from 10.0.3.3, 00:02:31 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 200
MPLS label: none

But as soon as the bgp additional-paths install option has been enabled, the output of the same command looks different:

R5#show ip route repair-paths 20.0.99.0
Routing entry for 20.0.99.0/24
Known via "bgp 100", distance 200, metric 0
Tag 200, type internal
Last update from 20.0.17.7 00:00:03 ago
Routing Descriptor Blocks:
* 20.0.17.7, from 10.0.3.3, 00:00:03 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 200
MPLS label: none
[RPR]20.0.28.8, from 10.0.4.4, 00:00:03 ago
Route metric is 0, traffic share count is 1
AS Hops 1
Route tag 200
MPLS label: none

You may also see the second path in the BGP table with the "b" (backup) flag:

R5#show ip bgp 20.0.99.0
BGP routing table entry for 20.0.99.0/24, version 39
Paths: (2 available, best #1, table default)
Additional-path
Not advertised to any peer
200
20.0.17.7 (metric 192) from 10.0.3.3 (10.0.3.3)
Origin IGP, metric 0, localpref 100, valid, internal, best
Originator: 10.0.1.1, Cluster list: 10.0.3.3
200
20.0.28.8 (metric 192) from 10.0.4.4 (10.0.4.4)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair
Originator: 10.0.2.2, Cluster list: 10.0.4.4

And if you check the CEF entry for this prefix, you will notice there are multiple next-hops and output interfaces that could be used for primary/backup paths:

R5#show ip cef 20.0.99.0 detail
20.0.99.0/24, epoch 0, flags rib only nolabel, rib defined all labels
recursive via 20.0.17.7
recursive via 20.0.17.0/24
nexthop 10.0.35.3 Serial1/0
recursive via 20.0.28.8, repair
recursive via 20.0.28.0/24
nexthop 10.0.35.3 Serial1/0
nexthop 10.0.45.4 Serial1/2

Notice that in order to use the PIC functionality, BGP multipath should be turned off - otherwise, equal-cost paths will be used for load-sharing, not for primary/backup behavior. You may opt to use equal-cost multipath if the network topology allows it, as it offers better resource utilization, and the CEF switching layer allows for fast path failover in the case of equal-cost load-balancing.

Now for debugging the fast failover process. We want to shut down R1's connection to R7 and see the fast backup path switchover at R5. There are a few caveats here, because we have a very simplified topology. First, we only have one prefix advertised into BGP on R9. Propagating this prefix through BGP is almost instant, since BGP best-path selection is done quickly and the advertisement delay does not apply to a single event. Thus, if we shut down R1's connection to R7, which is used as the primary path, R1 will detect the link failure and tear down the session. Immediately after this, the BGP process will flood an UPDATE with the prefix removal, and this message would reach R5 and R6 even before OSPF finishes its SPF computations - the reason being, of course, a single prefix propagated via BGP and no advertisement-interval delaying a single event.

It may seem that disabling BGP fast external fallover on R1 could help us take BGP out of the equation. However, we still have BGP NHT enabled on R1 - as soon as we shut down the link, the RIB process will report the next-hop failure to BGP and an UPDATE message will be sent right away. Thus, we also need to disable NHT on R1, using the command no bgp nexthop trigger enable. If we think further, we'll notice that we also need to disable NHT on R3 and R4, so that they do not generate their own UPDATEs to R5 ahead of the OSPF notification. Therefore, prior to running the experiment we disable BGP NHT on R1, R3 and R4, and disable fast external fallover on R1. This allows the event from R1 to propagate via OSPF ahead of the BGP UPDATE message and trigger the fast switchover on R5. Below is the output of the debugging commands enabled on R5 after we shut down R1's connection to R7.
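
For reference, a minimal sketch of the commands used to pre-condition this experiment (applied under the existing BGP processes) is the following:

! R1: disable fast external fallover and next-hop tracking
router bgp 100
no bgp fast-external-fallover
no bgp nexthop trigger enable
!
! R3 and R4: disable next-hop tracking only
router bgp 100
no bgp nexthop trigger enable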

R5#debug ip ospf spf
OSPF spf events debugging is on
OSPF spf intra events debugging is on
OSPF spf inter events debugging is on
OSPF spf external events debugging is on

R5#debug ip bgp addpath
BGP additional-path related events debugging is on

R5 receives the LSA at 26.223 and BGP starts the path switchover at 26.295 - it took 72ms to run SPF, update the RIB, inform BGP of the event, and then change the paths.

14:00:26.223: OSPF: Detect change in topology Base with MTID-0, in LSA type 1, LSID 10.0.1.1 from 10.0.1.1 area 0
14:00:26.223: OSPF: Schedule SPF in area 0, topology Base with MTID 0
Change in LS ID 10.0.1.1, LSA type R, spf-type Full
….
14:00:26.295: BGP(0): Calculating bestpath for 20.0.99.0/24, New bestpath is 20.0.28.8 :path_count:- 2/0, best-path =20.0.28.8, bestpath runtime :- 4 ms(or 3847 usec) for net 20.0.99.0
14:00:26.299: BGP(0): Calculating backuppath::Backup-Path for 20.0.99.0/24:BUMP-VERSION-BACKUP-DELETE:, backup path runtime :- 0 ms (or 193 usec)

14:00:32.439: BGP(0): 10.0.3.3 rcvd UPDATE w/ prefix: 20.0.99.0/24, - DO BESTPATH
14:00:32.443: BGP(0): Calculating bestpath for 20.0.99.0/24, bestpath is 20.0.28.8 :path_count:- 2/0, best-path =20.0.28.8, bestpath runtime :- 0 ms(or 222 usec) for net 20.0.99.0
14:00:32.443: BGP(0): Calculating backuppath::Backup-Path for 20.0.99.0/24, backup path runtime :- 0 ms (or 133 usec)

In the debugging output above, you can see that the BGP process on R5 switched to the backup path even before it received the UPDATE message from R3 signaling the change of the best-path on the RR. Notice that the update does not carry any path identifiers in the NLRI, as the RR has only a single best-path. Let's see how much time was actually spent running SPF, as compared to the overall detection/failover process:

R5#show ip ospf statistics

OSPF Router with ID (10.0.5.5) (Process ID 100)

Area 0: SPF algorithm executed 15 times

Summary OSPF SPF statistic

SPF calculation time
Delta T Intra D-Intra Summ D-Summ Ext D-Ext Total Reason
00:28:00 44 0 0 4 0 4 56 R
…..

As you can see, the total SPF runtime was 56ms. Therefore, the remaining 16ms were spent updating the RIB and triggering the next-hop change event. Of course, all these numbers have only relative meaning, as we are using Dynamips for this simulation, but you may use a similar methodology when validating real-world designs.


Considerations for Implementing BGP Add Paths

Even though the Add Paths feature is not yet implemented, it is worth considering the drawbacks of this approach. One drawback is that the amount of information that needs to be sent and stored is now multiplied by the number of additional paths. Previously, the most stressed routers in a BGP AS were the route reflectors, which had to carry the largest BGP tables. With the Add Paths functionality, every non-RR speaker now receives all the information that an RR stores in its BGP table. This puts an extra requirement on the edge speakers and should be accounted for when planning to use this feature. Furthermore, the additional paths will consume extra memory on the forwarding engines, as PIC-enabled prefixes now have multiple alternate paths. However, since the number of prefixes remains the same, TCAM fast-lookup memory is not wasted, and thus mainly dynamic RAM is affected. You may read more about the scalability/performance trade-offs in [17].


Summary

Achieving fast BGP convergence is not easy, because BGP is a complicated routing protocol running as an overlay on top of an IGP. We found out that tuning purely BGP-based convergence requires the following general steps:

  • Tuning the BGP TCP transport and router ingress queues to achieve faster routing information propagation.
  • Properly organizing outbound policies to achieve optimal update group construction.
  • Tuning the BGP Advertisement Interval, if needed, to respond to fast "Down->Up" conditions.
  • Activating BGP fast external fallover, and possibly BFD, for fast external peering session deactivation.

As we noticed previously, pure BGP-based convergence is the only option for Inter-AS scenarios. However, for the fastest convergence inside a single AS, understanding and tuning the BGP and IGP interaction can make BGP converge almost as fast as the underlying IGP. This allows for fast recovery in response to intra-AS link and node failures, as well as to edge link failures. Optimizing the BGP and IGP interaction requires the following:

  • Tuning the underlying IGP for fast convergence. It is possible to tune the IGP, even for a large network, to converge in under one second.
  • Enabling the BGP Next-Hop Tracking process on all BGP speakers and tuning the BGP NHT delay in accordance with the IGP response time.
  • Applying IGP summarization carefully, to avoid hiding BGP NHT information.
  • Leveraging the IGP for propagation of external peering link failures, in addition to relying on BGP peering session deactivation.
  • Using the Add Paths functionality in critical BGP speakers (e.g. RRs) to allow for propagation of redundant paths, if supported by the implementation.
  • Using BGP PIC or fast backup switchover in environments that allow multiple paths to be propagated - e.g. multihomed MPLS VPN sites using different RD values.

We've also briefly covered some caveats resulting from the future use of the Add Paths functionality, such as excessive usage of memory resources on the route processor and line cards, and the extra toll on the BGP best-path process due to the growth in alternate paths. A few things were left out of the scope of this paper. We did not concentrate on the detailed mechanics of BGP fast peering session deactivation, e.g. for multihop sessions, and we did not cover the MP-BGP specific features. Some MP-BGP extensions, such as the additional import scan interval and edge control-plane interworking, have their effects on end-to-end convergence, but this is a topic for another discussion.


Further Reading

[0] RFC 4271: Border Gateway Protocol
[1] Advanced BGP Convergence Techniques
[2] Graph Overlays on Path Vector: A Possible Next Step in BGP
[3] BGP Add Paths Capability
[4] BGP Convergence in much less than a second
[5] BGP PIC Configuration Guide
[6] OSPF Fast Convergence
[7] An Analysis of BGP Convergence Properties
[8] RFC 4451: BGP MULTI_EXIT_DISC (MED) Considerations
[9] RFC 3345: Border Gateway Protocol (BGP) Persistent Route Oscillation Condition
[10] BGP Design and Implementation by Randy Zhang
[11] RFC 4274: BGP Protocol Analysis
[12] Day in the Life of a BGP Update in Cisco IOS
[13] RFC 4724: Graceful Restart for BGP
[14] Optimizing IP Event Dampening
[15] RFC 5883: Multihop BFD
[16] BGP Add Path Overview
[17] BGP Add Paths Scaling/Performance Tradeoffs


Appendix: Practical Scenario Baseline Configuration

Below are the initial configurations for the Dynamips topology used to validate the BGP PIC behavior.

====R1:====
hostname R1
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
ip address 20.0.17.1 255.255.255.0
no shut
!
interface Serial 1/2
no shut
ip address 10.0.12.1 255.255.255.0
!
interface Serial 1/1
no shut
ip address 10.0.13.1 255.255.255.0
!
interface Serial 1/3
ip address 10.0.14.1 255.255.255.0
!
interface Loopback0
ip address 10.0.1.1 255.255.255.255
!
router ospf 100
router-id 10.0.1.1
network 0.0.0.0 0.0.0.0 area 0
passive-interface Serial 1/0
!
router bgp 100
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback0
neighbor 10.0.4.4 remote-as 100
neighbor 10.0.4.4 update-source Loopback0
neighbor 20.0.17.7 remote-as 200

====R2:====
hostname R2
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
ip address 20.0.28.2 255.255.255.0
no shut
!
interface Serial 1/2
no shut
ip address 10.0.12.2 255.255.255.0
!
interface Serial 1/1
no shut
ip address 10.0.24.2 255.255.255.0
!
interface Serial 1/3
no shut
ip address 10.0.23.2 255.255.255.0
!
interface Loopback0
ip address 10.0.2.2 255.255.255.255
!
router ospf 100
router-id 10.0.2.2
network 0.0.0.0 0.0.0.0 area 0
passive-interface Serial 1/0
!
router bgp 100
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback0
neighbor 10.0.4.4 remote-as 100
neighbor 10.0.4.4 update-source Loopback0
neighbor 20.0.28.8 remote-as 200

====R3:====
hostname R3
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 10.0.13.3 255.255.255.0
!
interface Serial 1/1
no shut
ip address 10.0.35.3 255.255.255.0
!
interface Serial 1/2
no shut
ip address 10.0.34.3 255.255.255.0
!
interface Serial 1/3
no shut
ip address 10.0.23.3 255.255.255.0
!
interface Serial 1/4
no shut
ip address 10.0.36.3 255.255.255.0
!
router ospf 100
router-id 10.0.3.3
network 0.0.0.0 0.0.0.0 area 0
!
interface Loopback0
ip address 10.0.3.3 255.255.255.255
!
router bgp 100
neighbor IBGP peer-group
neighbor IBGP remote-as 100
neighbor IBGP update-source Loopback0
neighbor IBGP route-reflector-client
neighbor 10.0.1.1 peer-group IBGP
neighbor 10.0.2.2 peer-group IBGP
neighbor 10.0.5.5 peer-group IBGP
neighbor 10.0.6.6 peer-group IBGP
neighbor 10.0.4.4 remote-as 100
neighbor 10.0.4.4 update-source Loopback0

====R4:====
hostname R4
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 10.0.24.4 255.255.255.0
!
interface Serial 1/1
no shut
ip address 10.0.46.4 255.255.255.0
!
interface Serial 1/2
no shut
ip address 10.0.34.4 255.255.255.0
!
interface Serial 1/3
no shut
ip address 10.0.14.4 255.255.255.0
!
interface Serial 1/4
no shut
ip address 10.0.45.4 255.255.255.0
!
router ospf 100
router-id 10.0.4.4
network 0.0.0.0 0.0.0.0 area 0
!
interface Loopback0
ip address 10.0.4.4 255.255.255.255
!
router bgp 100
neighbor IBGP peer-group
neighbor IBGP remote-as 100
neighbor IBGP update-source Loopback0
neighbor IBGP route-reflector-client
neighbor 10.0.1.1 peer-group IBGP
neighbor 10.0.2.2 peer-group IBGP
neighbor 10.0.5.5 peer-group IBGP
neighbor 10.0.6.6 peer-group IBGP
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback0

====R5:====
hostname R5
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 10.0.35.5 255.255.255.0
!
interface Serial 1/1
no shut
ip address 10.0.56.5 255.255.255.0
!
interface Serial 1/2
no shut
ip address 10.0.45.5 255.255.255.0
!
router ospf 100
router-id 10.0.5.5
network 0.0.0.0 0.0.0.0 area 0
!
interface Loopback0
ip address 10.0.5.5 255.255.255.0
!
router bgp 100
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback0
neighbor 10.0.4.4 remote-as 100
neighbor 10.0.4.4 update-source Loopback0

====R6:====
hostname R6
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 10.0.46.6 255.255.255.0
!
interface Serial 1/1
no shut
ip address 10.0.56.6 255.255.255.0
!
interface Serial 1/2
no shut
ip address 10.0.36.6 255.255.255.0
!
router ospf 100
router-id 10.0.6.6
network 0.0.0.0 0.0.0.0 area 0
!
interface Loopback0
ip address 10.0.6.6 255.255.255.0
!
router bgp 100
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback0
neighbor 10.0.4.4 remote-as 100
neighbor 10.0.4.4 update-source Loopback0

====R7:====
hostname R7
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 20.0.17.7 255.255.255.0
!
interface Serial 1/1
no shut
ip address 20.0.78.7 255.255.255.0
!
interface Serial 1/2
no shut
ip address 20.0.79.7 255.255.255.0
!
interface Loopback0
ip address 20.0.7.7 255.255.255.0
!
router ospf 1
router-id 20.0.7.7
network 0.0.0.0 0.0.0.0 area 0
passive-interface Serial 1/0
!
router bgp 200
neighbor 20.0.17.1 remote-as 100
neighbor 20.0.9.9 remote-as 200
neighbor 20.0.9.9 update-source Loopback0
neighbor 20.0.8.8 remote-as 200
neighbor 20.0.8.8 update-source Loopback0

====R8:====
hostname R8
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 20.0.28.8 255.255.255.0
!
interface Serial 1/1
no shut
ip address 20.0.78.8 255.255.255.0
!
interface Serial 1/2
no shut
ip address 20.0.89.8 255.255.255.0
!
interface Loopback0
ip address 20.0.8.8 255.255.255.0
!
router ospf 1
router-id 20.0.8.8
network 0.0.0.0 0.0.0.0 area 0
passive-interface Serial 1/0
!
router bgp 200
neighbor 20.0.28.2 remote-as 100
neighbor 20.0.9.9 remote-as 200
neighbor 20.0.9.9 update-source Loopback0
neighbor 20.0.7.7 remote-as 200
neighbor 20.0.7.7 update-source Loopback0

====R9:====
hostname R9
!
ip tcp synwait-time 5
no ip domain-lookup
no service timestamps
!
line con 0
logging synch
exec-timeout 0 0
privilege level 15
!
ip routing
!
interface Serial 1/0
no shut
ip address 20.0.79.9 255.255.255.0
!
interface Serial 1/1
no shut
ip address 20.0.89.9 255.255.255.0
!
interface Loopback0
ip address 20.0.9.9 255.255.255.0
!
interface Loopback100
ip address 20.0.99.99 255.255.255.0
!
router ospf 1
router-id 20.0.9.9
network 0.0.0.0 0.0.0.0 area 0
!
router bgp 200
neighbor 20.0.8.8 remote-as 200
neighbor 20.0.8.8 update-source Loopback0
neighbor 20.0.7.7 remote-as 200
neighbor 20.0.7.7 update-source Loopback0
network 20.0.99.0 mask 255.255.255.0

Aug
21

Our BGP class is coming up!  This class is for learners who are pursuing the CCIP track, or who simply want to really master BGP.  I have been working through the slides, examples and demos that we'll use in class, and it is going to be excellent. :) If you can't make the live event, we are recording it, so it will be available as a class on demand after the live event.  More information can be found by clicking here.

One of the common questions that comes up is "Why does the router choose THAT route?"

We all know (or at least after reading the list below, we will know) that BGP uses the following order to determine the "best" path.

bgp bestpath
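
(In brief, Cisco IOS compares, in order: highest weight; highest local preference; locally originated paths; shortest AS_PATH; lowest origin code (IGP < EGP < incomplete); lowest MED; eBGP over iBGP; lowest IGP metric to the next hop; and finally tie-breakers such as the oldest path and the lowest router ID.)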

So now for the question.   Take a look at the partial output of the show command below:

bgp bestpath

Regarding the 2.2.2.0/24 network, why did this router select the 192.168.68.8 next-hop route over the one just below it?

Post your ideas, and we will have a drawing next week, before the BGP class begins.  We'll give 1 lucky winner some rack tokens for our preferred rack vendor, Graded Labs.  Everyone who comments will be entered into the drawing.  I will update the post with the lucky winner.

Thanks for your ideas, and happy learning.

Thank you to all who responded.  eBGP is preferred over iBGP, and that is what it came down to.

The winner of the graded labs tokens is Jon!  Congratulations.

Aug
16

Last week we wrapped up the MPLS bootcamp, and it was a blast!  A big shout out to all the students who attended, as well as to the many INE staff who stopped by (you know who you are :)).  Thank you all.

Here is the topology we used for the class, as we built the network, step by step.

MPLS-class blog

The class was organized and delivered in 30 specific lessons. Here is the "overview" slide from class:

MPLS Journey Statement

One of the important items we discussed was troubleshooting.  When we understand all the components of Layer 3 VPNs, the troubleshooting is easy.  Here are the steps (a sketch of matching verification commands follows the list):

  • Can PE see CE’s routes?
  • Are VPN routes going into MP-BGP?  (The Export)
  • Are remote PEs seeing the VPN routes?
  • Are remote PEs inserting the VPN routes into the correct local VRF? (The Import)
  • Are the remote PEs advertising these routes to the remote CEs?
  • Are the remote CEs seeing the routes?
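
As a rough mapping from those steps to IOS commands, something like the following could be used (the VRF name VPN_A, the device names, and the CE neighbor address are all hypothetical):

! 1. Can the PE see the CE's routes?
PE1#show ip route vrf VPN_A
! 2. Are the VPN routes going into MP-BGP? (The Export)
PE1#show ip bgp vpnv4 vrf VPN_A
! 3. Are the remote PEs seeing the VPN routes?
PE2#show ip bgp vpnv4 all
! 4. Are the remote PEs importing them into the correct local VRF? (The Import)
PE2#show ip route vrf VPN_A
! 5. Are the remote PEs advertising these to the remote CEs?
PE2#show ip bgp vpnv4 vrf VPN_A neighbors 10.0.0.6 advertised-routes
! 6. Is the remote CE seeing the routes?
CE2#show ip route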

We had lots of fun, and included Wireshark protocol analysis so we could see and verify what we were learning.  Here is one example of a BGP update from a downstream iBGP neighbor which includes the VPN label:

VPN Label

If you missed the class, but still want to benefit from it, we have recorded all 30 sessions, and it is available as an on-demand version of the class.

Next week the BGP bootcamp is running, so if you need to brush up on BGP, we will be covering the following topics, also in 30 easy-to-digest lessons:

  • Monitoring and Troubleshooting BGP
  • Multi-Homed BGP Networks
  • AS-Path Filters
  • Prefix-List Filters
  • Outbound Route Filtering
  • Route-Maps as BGP Filters
  • BGP Path Attributes
  • BGP Local Preference
  • BGP Multi-Exit-Discriminator (MED)
  • BGP Communities
  • BGP Customer Multi-Homed to a Single Service Provider
  • BGP Customer Multi-Homed to Multiple Service Providers
  • Transit Autonomous System Functions
  • Packet Forwarding in Transit Autonomous Systems
  • Monitoring and Troubleshooting IBGP in Transit AS
  • Network Design with Route Reflectors
  • Limiting the Number of Prefixes Received from a BGP Neighbor
  • AS-Path Prepending
  • BGP Peer Group
  • BGP Route Flap Dampening
  • Troubleshooting Routing Issues
  • Scaling BGP

I look forward to seeing you in class!

Best wishes in all of your learning.

May
25

It isn't my fault, they configured it that way before I got here! That was the entry level technician's story Monday morning, and he was sticking to it.  :)

Here is the rest of the story.  Over the weekend, some testing had been done regarding a proposed BGP configuration.  The objective was simple: R1 and R3 needed to ping each other's loopbacks at 1.1.1.1 and 3.3.3.3 respectively, with those two networks being carried by BGP.  R2 is performing NAT.  The topology diagram looks like this:

3 routers in a row-NO-user

The ping between loopbacks didn't work, but R1 and R3 had these console messages:

R1#
%TCP-6-BADAUTH: No MD5 digest from 10.0.0.3(179) to 10.0.0.1(28556) (RST)

R1#
%TCP-6-BADAUTH: No MD5 digest from 10.0.0.3(179) to 10.0.0.1(28556) (RST)
R1#

R3#
%TCP-6-BADAUTH: No MD5 digest from 23.0.0.1(179) to 23.0.0.3(59922) (RST)
R3#
%TCP-6-BADAUTH: No MD5 digest from 23.0.0.1(179) to 23.0.0.3(59922) (RST)
R3#

The senior engineer looked at the configurations for R1, R2 and R3 and found 5 specific items, each of which was independently causing a failure.

Here is the challenge:  Can you find 1 or more of them?

Let us know what your troubleshooting skills can find, and post your comments here on the blog.

Here are the configurations for the 3 routers:

R1#show run
version 12.4
hostname R1
!
interface Loopback0
ip address 1.1.1.1 255.255.255.0
!
interface FastEthernet0/0
ip address 10.0.0.1 255.255.255.0
!
router ospf 1
network 10.0.0.0 0.0.0.255 area 0
!
router bgp 1
no synchronization
bgp log-neighbor-changes
network 1.1.1.1 mask 255.255.255.255
neighbor 10.0.0.3 remote-as 3
neighbor 10.0.0.3 password cisco
no auto-summary
!
end
R1#

R2#show run
version 12.4
hostname R2
!
interface Loopback0
ip address 2.2.2.2 255.255.255.0
!
interface FastEthernet0/0
ip address 10.0.0.2 255.255.255.0
ip nat inside
ip virtual-reassembly
!
interface FastEthernet0/1
ip address 23.0.0.2 255.255.255.0
ip nat outside
ip virtual-reassembly
!
router ospf 1
network 2.2.2.2 0.0.0.0 area 0
network 10.0.0.2 0.0.0.0 area 0
network 23.0.0.2 0.0.0.0 area 0
!
ip nat inside source static 10.0.0.1 23.0.0.1
ip nat outside source static 23.0.0.3 10.0.0.3
!
end

R3#show run
version 12.4
hostname R3
!
interface Loopback0
ip address 3.3.3.3 255.255.255.0
!
interface FastEthernet0/1
ip address 23.0.0.3 255.255.255.0
!
router ospf 1
log-adjacency-changes
network 23.0.0.0 0.0.0.255 area 0
!
router bgp 3
no synchronization
bgp log-neighbor-changes
network 3.3.3.3 mask 255.255.255.255
neighbor 23.0.0.1 remote-as 1
neighbor 23.0.0.1 password cisco123
no auto-summary
!
end
R3#

Let us know what you find!

Best wishes.

 

 

 

UPDATE:   ANSWERS

Your contributions and input are great.  You ROCK!

I have summarized the 5 specific errors/issues with the configuration, and here they are (a sketch of the corresponding fixes follows the list):

  • R2: NAT isn't fully baked. This can be fixed with "ip nat outside source static 23.0.0.3 10.0.0.3 add-route" (or we could manually add the route as well).
  • R1 & R3: The BGP passwords don't match, but it doesn't matter. BGP MD5 authentication doesn't work between NAT'd BGP neighbors, so it would have to be removed. :)
  • R1 & R3: Incorrect network statements for the loopback addresses on both BGP routers (incorrect mask).
  • R1 & R3: ebgp-multihop statements are needed on both neighbors (the eBGP peers are not directly connected).
  • R2: R2 doesn't know how to reach 1.1.1.1 or 3.3.3.3 (a non-BGP routing issue).
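
To make that concrete, here is a hedged sketch of what the fixes could look like on R1 (mirrored on R3 with its own addresses); the exact approach, especially for the last item, depends on the design you prefer:

router bgp 1
! remove MD5 authentication - it cannot survive the address rewrite done by NAT
no neighbor 10.0.0.3 password
! the eBGP peer is really one routed hop away, behind R2
neighbor 10.0.0.3 ebgp-multihop 2
! advertise the loopback network with its actual /24 mask
no network 1.1.1.1 mask 255.255.255.255
network 1.1.1.0 mask 255.255.255.0
!
! R2 still needs routes toward 1.1.1.1 and 3.3.3.3, e.g. via the
! add-route keyword mentioned above or by advertising the loopbacks in OSPF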

Again, thanks for the time and effort invested in this solution, and in learning in general.   I appreciate you!

Best wishes.

Apr
08

One of our students in the INE RS bootcamp today asked about an OSPF sham-link. I thought it would make a beneficial addition to our blog, and here it is.  Thanks for the request, Christian!

Reader's Digest version: MPLS networks aren't free. If a customer is using OSPF to peer between the CE and PE routers, and also has an OSPF CE-to-CE neighborship, the CEs will prefer the Intra-Area CE-to-CE routes (sometimes called the "backdoor" routes in this situation) instead of using the Inter-Area CE-to-PE learned routes that use the MPLS network as a transit path. OSPF sham-links correct this behavior.

This blog post walks through the problem and the solution, including the configuration steps to create and verify a sham-link.

To begin, MPLS is set up in the network as shown with R2 and R4 acting as Provider Edge (PE) routers, and MPLS is enabled throughout R2-R3-R4.

R1 and R5 are Customer Edge (CE) routers, and the Serial0/1.15 interfaces of R1 and R5 are temporarily shut down (this means the backdoor route isn't in place yet, so at the moment there is no problem).

mpls-ospf sham

Currently, R1 and R5 see the routes to each other's local networks through the VPNv4 MPLS network, and the routes show up as Inter-Area OSPF routes with the PE routers as the next hop.

Let's do some testing and verification of what is currently in place. Notice that R1 and R5 can see each other's Fa0/0 and Fa0/1 connected networks. These routes show up as Inter-Area (IA) routes.

R1#show ip route ospf
10.0.0.0/24 is subnetted, 2 subnets
O IA 10.45.0.0 [110/2] via 10.12.0.2, 00:00:58, FastEthernet0/0
O IA 192.168.1.0/24 [110/3] via 10.12.0.2, 00:00:43, FastEthernet0/0

R5#show ip route ospf
172.16.0.0/24 is subnetted, 1 subnets
O IA 172.16.0.0 [110/3] via 10.45.0.4, 00:01:49, FastEthernet0/1
10.0.0.0/24 is subnetted, 2 subnets
O IA 10.12.0.0 [110/2] via 10.45.0.4, 00:01:49, FastEthernet0/1

Next, we will enable the Serial0/1.15 interfaces of R1 and R5. When we enable these interfaces, R1 and R5 will become neighbors and see each other's routes to the Fa0/0 and Fa0/1 networks as Intra-Area routes. Even though the OSPF cost will be worse via the serial interfaces, take a close look at what happens and which routes end up in the routing table.

R1(config)#int ser 0/1.15
R1(config-subif)#no shut

R5(config)#int ser 0/1.15
R5(config-subif)#no shut

We'll wait a few moments to give the network time to converge, then take a look at the OSPF routes on the CE routers R1 and R5, just as we did earlier, and see if the routes are different.

R1#show ip route ospf
10.0.0.0/24 is subnetted, 3 subnets
O 10.45.0.0 [110/65] via 10.15.0.5, 00:02:52, Serial0/1.15
O 192.168.1.0/24 [110/65] via 10.15.0.5, 00:02:52, Serial0/1.15

R5#show ip route ospf
172.16.0.0/24 is subnetted, 1 subnets
O 172.16.0.0 [110/65] via 10.15.0.1, 00:03:19, Serial0/1.15
10.0.0.0/24 is subnetted, 3 subnets
O 10.12.0.0 [110/65] via 10.15.0.1, 00:03:19, Serial0/1.15

Notice that the remote customer networks attached to Fa0/0 and Fa0/1 are now reachable via the Serial0/1.15 interface, and they appear as Intra-Area routes. Even though the metric of 65 is worse than before and uses the slower serial link, the routers prefer these routes instead of the PE-learned routes, because Intra-Area routes are preferred over Inter-Area routes. Now the Service Provider's MPLS network will only be used as a backup in the event the serial connection fails. (I don't think they will be providing a price break either.) ;)

To train the network to use the MPLS network as the primary transit path, we need to make the remote Ethernet customer networks look like Intra-Area routes via the PE routers, with a better metric than the serial interfaces, so they will be used instead of the slower serial link. We are actually going to pull a fast one, or a "sham", on OSPF, because the MPLS network is really acting as a "superbackbone" for OSPF, and therefore routes between the CEs are indeed Inter-Area by default. To create the illusion of the CEs not being separated by a backbone, we will create an OSPF sham-link. We will create a loopback interface in the VRF on each PE and make sure those loopbacks are originated and advertised via BGP. We will use those loopbacks as the source/destination of the OSPF sham-link.

Because the sham-link is seen as an Intra-Area link between PE routers (R2 and R4), an OSPF adjacency is created and database exchange takes place across the sham-link. The two PE routers can then flood LSAs between sites from across the MPLS VPN backbone. As a result, the desired Intra-Area routes are created.

Enough chat, let's create this sham-link!

R2(config)#int loop 100
R2(config-if)#ip vrf forwarding Vrf1
R2(config-if)#ip address 11.11.11.2 255.255.255.255
R2(config-if)#router bgp 24
R2(config-router)#address-family ipv4 vrf Vrf1
R2(config-router-af)#network 11.11.11.2 mask 255.255.255.255
R2(config-router-af)#exit
R2(config-router)#router ospf 1 vrf Vrf1
R2(config-router)#area 1 sham-link 11.11.11.2 11.11.11.4 cost 5

R4(config)#int loop 100
R4(config-if)#ip vrf forwarding Vrf1
R4(config-if)#ip address 11.11.11.4 255.255.255.255
R4(config-if)#router bgp 24
R4(config-router)#address-family ipv4 vrf Vrf1
R4(config-router-af)#network 11.11.11.4 mask 255.255.255.255
R4(config-router-af)#exit
R4(config-router)#router ospf 1 vrf Vrf1
R4(config-router)#area 1 sham-link 11.11.11.4 11.11.11.2 cost 5
%OSPF-5-ADJCHG: Process 1, Nbr 10.12.0.2 on OSPF_SL0 from LOADING to FULL, Loading Done

Looks like the sham-link came up.  Let's take a closer look at the sham-link with a show command made just for that purpose.

R4#show ip ospf sham-links
Sham Link OSPF_SL0 to address 11.11.11.2 is up
Area 1 source address 11.11.11.4
Run as demand circuit
DoNotAge LSA allowed. Cost of using 5 State POINT_TO_POINT,
Timer intervals configured, Hello 10, Dead 40, Wait 40,
Hello due in 00:00:06
Adjacency State FULL (Hello suppressed)
Index 2/2, retransmission queue length 0, number of retransmission 0
First 0x0(0)/0x0(0) Next 0x0(0)/0x0(0)
Last retransmission scan length is 0, maximum is 0
Last retransmission scan time is 0 msec, maximum is 0 msec

Looks like it is in place, but is it creating the desired result of having the CE routers R1 and R5 see the remote Ethernet networks as reachable through the PE routers R2 and R4? Let's go to R1 and see!

R1#show ip route ospf
10.0.0.0/24 is subnetted, 3 subnets
O 10.45.0.0 [110/7] via 10.12.0.2, 00:06:02, FastEthernet0/0
11.0.0.0/32 is subnetted, 2 subnets
O E2 11.11.11.2 [110/1] via 10.12.0.2, 00:06:43, FastEthernet0/0
O E2 11.11.11.4 [110/1] via 10.12.0.2, 00:06:13, FastEthernet0/0
O 192.168.1.0/24 [110/8] via 10.12.0.2, 00:06:02, FastEthernet0/0

That looks perfect! How about R5?

R5#show ip route ospf
172.16.0.0/24 is subnetted, 1 subnets
O 172.16.0.0 [110/8] via 10.45.0.4, 00:06:27, FastEthernet0/1
10.0.0.0/24 is subnetted, 3 subnets
O 10.12.0.0 [110/7] via 10.45.0.4, 00:06:27, FastEthernet0/1
11.0.0.0/32 is subnetted, 2 subnets
O E2 11.11.11.2 [110/1] via 10.45.0.4, 00:07:05, FastEthernet0/1
O E2 11.11.11.4 [110/1] via 10.45.0.4, 00:06:45, FastEthernet0/1

And just to be sure, a ping to verify connectivity. We will ping the remote Fa0/1 interface of CE router R1 from CE router R5.

R5#ping 172.16.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.16.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 120/130/148 ms

That's cool, so we know we have connectivity, and based on the routing table output, we believe it is going through the SP MPLS network. Let's do one more test to prove that as well: a traceroute.

R5#trace 172.16.0.1

Type escape sequence to abort.
Tracing the route to 172.16.0.1

1 10.45.0.4 48 msec 92 msec 12 msec
2 10.34.0.3 [MPLS: Labels 16/24 Exp 0] 136 msec 180 msec 228 msec
3 10.12.0.2 [MPLS: Label 24 Exp 0] 124 msec 80 msec 88 msec
4 10.12.0.1 112 msec * 176 msec

Tags and all!  I still love it when a plan comes together.  Now our transit traffic is moving through the MPLS network, and the Serial0/1.15 interfaces are available as a backup.

More fun times regarding MPLS, OSPF and MPBGP can be found in our workbooks for RS and SP.

Best wishes, and enjoy the journey!

Apr
06

Having a blast in Chicago with the RS bootcamp students.    Thanks for all the hard work you are doing this week!

A student from a past Reno class, named Michal, asked if I would create a blog post regarding BGP proportional load balancing based on the bandwidth of the links to eBGP peers. It has been on my list of things to do, and here it is. Thanks for the request, Michal.

The secret to this trick is to pay attention to the links between the directly connected external BGP neighbors (in this case between R6-R5 and R2-R3) and send the link bandwidth extended community attribute to the iBGP peer R1.  This is enabled by entering the bgp dmzlink-bw command and using extended communities to share the information.  To summarize: routes learned from a directly connected external neighbor are advertised to iBGP peers along with the bandwidth of the external link on which they were learned, and then the iBGP router (R1) can proportionally load balance between the two paths.

Here is the diagram we will use.

BGP Diagram

We'll use loopbacks for our iBGP connections, so let's verify that we have connectivity between the loopbacks in AS 126.

R1#ping 6.6.6.6 source loopback 0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/43/76 ms
R1#
R1#ping 2.2.2.2 source loopback 0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/40/72 ms

Ok, that looks good, so let’s configure R1 to be an IBGP peer with R6 and R2.  The dmzlink-bw feature is implemented as part of the IPv4 address family configuration.

R1(config)#router bgp 126
R1(config-router)#neighbor 6.6.6.6 remote-as 126
R1(config-router)#neighbor 2.2.2.2 remote-as 126
R1(config-router)#neighbor 6.6.6.6 update-source lo0
R1(config-router)#neighbor 2.2.2.2 update-source lo0

R1(config-router)#address-family ipv4
R1(config-router-af)#bgp dmzlink-bw
R1(config-router-af)#neighbor 6.6.6.6 activate
R1(config-router-af)#neighbor 2.2.2.2 activate
R1(config-router-af)#neighbor 6.6.6.6 send-community both
R1(config-router-af)#neighbor 2.2.2.2 send-community both
R1(config-router-af)#maximum-paths ibgp 2
R1(config-router-af)#end

Next, we will configure R6 and R2 to be iBGP neighbors with R1, and eBGP neighbors with R5 and R3 respectively. We are going to manipulate the external interfaces on R6 and R2 to reflect bandwidths of 6000k and 5000k respectively, using the bandwidth command.  BGP can originate the link bandwidth community only for directly connected links to eBGP neighbors; in our example, it will be originated by R6 and R2.

R6(config)#router bgp 126
R6(config-router)#neighbor 1.1.1.1 remote-as 126
R6(config-router)#neighbor 1.1.1.1 update-source lo0
R6(config-router)#neighbor 10.56.0.5 remote-as 345
R6(config-router)#address-family ipv4
R6(config-router-af)#bgp dmzlink-bw
R6(config-router-af)#neighbor 1.1.1.1 activate
R6(config-router-af)#neighbor 1.1.1.1 next-hop-self
R6(config-router-af)#neighbor 1.1.1.1 send-community both
R6(config-router-af)#neighbor 10.56.0.5 activate
R6(config-router-af)#neighbor 10.56.0.5 dmzlink-bw
R6(config-router-af)#int fa 0/0
R6(config-if)#bandwidth 6000

Now, on to R2, with virtually the same configuration.

R2(config)#router bgp 126
R2(config-router)#neighbor 1.1.1.1 remote-as 126
R2(config-router)#neighbor 1.1.1.1 update-source lo0
R2(config-router)#neighbor 10.23.0.3 remote-as 345
R2(config-router)#address-family ipv4
R2(config-router-af)#bgp dmzlink-bw
R2(config-router-af)#neighbor 1.1.1.1 activate
R2(config-router-af)#neighbor 1.1.1.1 next-hop-self
R2(config-router-af)#neighbor 1.1.1.1 send-community both
R2(config-router-af)#neighbor 10.23.0.3 activate
R2(config-router-af)#neighbor 10.23.0.3 dmzlink-bw
R2(config-router-af)#int ser 0/1.23
R2(config-subif)#bandwidth 5000

Now we will configure R5 and R3 as the EBGP neighbors of R6 and R2 respectively.  These EBGP peers don't need any special configuration, other than standard BGP.

R5(config)#router bgp 345
R5(config-router)#neighbor 10.56.0.6 remote-as 126
R5(config-router)#neighbor 4.4.4.4 remote-as 345
R5(config-router)#neighbor 4.4.4.4 update-source lo0
R5(config-router)#neighbor 4.4.4.4 next-hop-self

R3(config)#router bgp 345
R3(config-router)#neighbor 10.23.0.2 remote-as 126
R3(config-router)#neighbor 4.4.4.4 remote-as 345
R3(config-router)#neighbor 4.4.4.4 update-source lo0
R3(config-router)#neighbor 4.4.4.4 next-hop-self

Last but not least, we configure R4 as an iBGP peer to R5 and R3. In addition, we will create a loopback and add it into BGP.  We will use the loopback as a target destination from R1 to verify the load balancing in a later step, so watch for that coming up.

R4(config)#int loop 44
R4(config-if)#ip add 44.44.44.44 255.255.255.0
R4(config-if)#router bgp 345
R4(config-router)#neighbor 5.5.5.5 remote-as 345
R4(config-router)#neighbor 3.3.3.3 remote-as 345
R4(config-router)#network 44.44.44.0 mask 255.255.255.0

Now let’s verify. Because we are on R4, let’s verify the BGP neighborships it has.

R4#show ip bgp summary
BGP router identifier 44.44.44.44, local AS number 345
BGP table version is 2, main routing table version 2
1 network entries using 120 bytes of memory
1 path entries using 52 bytes of memory
2/1 BGP path/bestpath attribute entries using 248 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
Bitfield cache entries: current 1 (at peak 1) using 32 bytes of memory
BGP using 452 total bytes of memory
BGP activity 1/0 prefixes, 1/0 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
3.3.3.3 4 345 4 5 2 0 0 00:00:41 0
5.5.5.5 4 345 4 5 2 0 0 00:00:35 0
! Note: we can easily verify what routes are being advertised out from R4.

R4#show ip bgp neighbors 5.5.5.5 advertised-routes
BGP table version is 2, local router ID is 44.44.44.44
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 44.44.44.0/24 0.0.0.0 0 32768 i

Total number of prefixes 1
R4#show ip bgp neighbors 3.3.3.3 advertised-routes
BGP table version is 2, local router ID is 44.44.44.44
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 44.44.44.0/24 0.0.0.0 0 32768 i

Total number of prefixes 1
R4#

Looks like AS 345 is fine. Let’s jump to R1, in AS 126, and verify from there.

R1#show ip bgp summary
BGP router identifier 1.1.1.1, local AS number 126
BGP table version is 3, main routing table version 3
1 network entries using 120 bytes of memory
2 path entries using 104 bytes of memory
1 multipath network entries and 2 multipath paths
2/1 BGP path/bestpath attribute entries using 248 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 496 total bytes of memory
BGP activity 1/0 prefixes, 2/0 paths, scan interval 60 secs

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
2.2.2.2 4 126 10 9 3 0 0 00:06:39 1
6.6.6.6 4 126 11 10 3 0 0 00:07:14 1
R1#show ip bgp
BGP table version is 3, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* i44.44.44.0/24 6.6.6.6 0 100 0 345 i
*>i 2.2.2.2 0 100 0 345 i

! Note: Looks like we have the neighbors, and the 44.44.44.0/24 prefix.
! To see more detail on the 44.44.44.0 network, we can use a couple additional commands.

R1#show ip bgp 44.44.44.0
BGP routing table entry for 44.44.44.0/24, version 3
Paths: (2 available, best #2, table Default-IP-Routing-Table)
Multipath: iBGP
Flag: 0x820
Not advertised to any peer
345
6.6.6.6 (metric 1) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, multipath
DMZ-Link Bw 750 kbytes
345
2.2.2.2 (metric 1) from 2.2.2.2 (2.2.2.2)
Origin IGP, metric 0, localpref 100, valid, internal, multipath, best
DMZ-Link Bw 625 kbytes

! Note: Let's see what the routing table has to say about this network.

R1#show ip route 44.44.44.0
Routing entry for 44.44.44.0/24
Known via "bgp 126", distance 200, metric 0
Tag 345, type internal
Last update from 2.2.2.2 00:02:56 ago
Routing Descriptor Blocks:
* 6.6.6.6, from 6.6.6.6, 00:02:56 ago
Route metric is 0, traffic share count is 6
AS Hops 1
Route tag 345
2.2.2.2, from 2.2.2.2, 00:02:56 ago
Route metric is 0, traffic share count is 5
AS Hops 1
Route tag 345

! Note: We can also get the information from the CEF table.

R1#show ip cef 44.44.44.0
44.44.44.0/24, version 47, epoch 0, per-destination sharing
0 packets, 0 bytes
via 6.6.6.6, 0 dependencies, recursive
traffic share 6
next hop 10.16.0.6, FastEthernet0/1 via 6.6.6.0/24
valid adjacency
via 2.2.2.2, 0 dependencies, recursive
traffic share 5
next hop 10.12.0.2, FastEthernet0/0 via 2.2.2.0/24
valid adjacency
0 packets, 0 bytes switched through the prefix
tmstats: external 0 packets, 0 bytes
internal 0 packets, 0 bytes

So now that the route is there, how do we test the load balancing? One option is to do an extended ping and record the path. We are expecting a 6-to-5 ratio for outbound traffic, favoring the R6 path over the R2 path. Let's send 30 ping requests, and show the full response for the benefit of verification.

R1#ping
Protocol [ip]:
Target IP address: 44.44.44.44
Repeat count [5]: 30
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface: loopback0
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]: r
Number of hops [ 9 ]: 4
Loose, Strict, Record, Timestamp, Verbose[RV]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 30, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
Packet has IP options: Total option bytes= 19, padded length=20
Record route: <*>
(0.0.0.0)
(0.0.0.0)
(0.0.0.0)
(0.0.0.0)

Reply to request 0 (204 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route: (10.12.0.1) (10.23.0.2) (10.34.0.3) (44.44.44.44)
<*>
End of list

Reply to request 1 (156 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route: (10.12.0.1) (10.23.0.2) (10.34.0.3) (44.44.44.44)
<*>
End of list

! Note: the path changes on the next ping request, and begins to use R6 as the next hop.

Reply to request 2 (160 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route: (10.16.0.1) (10.56.0.6) (10.45.0.5) (44.44.44.44)
<*>
End of list

Reply to request 3 (128 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route: (10.16.0.1) (10.56.0.6) (10.45.0.5) (44.44.44.44)
<*>
End of list

Reply to request 4 (156 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 5 (172 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 6 (108 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 7 (136 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 8 (180 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route: (10.12.0.1) (10.23.0.2) (10.34.0.3) (44.44.44.44)
<*>
End of list

Reply to request 9 (152 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 10 (80 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 11 (308 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 12 (204 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 13 (108 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 14 (160 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 15 (140 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 16 (140 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 17 (104 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 18 (84 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 19 (192 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 20 (232 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 21 (220 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 22 (168 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 23 (140 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.12.0.1)
(10.23.0.2)
(10.34.0.3)
(44.44.44.44)
<*>
End of list

Reply to request 24 (88 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 25 (224 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 26 (484 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 27 (128 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 28 (108 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Reply to request 29 (136 ms). Received packet has options
Total option bytes= 20, padded length=20
Record route:
(10.16.0.1)
(10.56.0.6)
(10.45.0.5)
(44.44.44.44)
<*>
End of list

Success rate is 100 percent (30/30), round-trip min/avg/max = 80/166/484 ms
R1#

The first 2 requests, numbered 0-1, used the R2-R3-R4 path. The next 6 requests, numbered 2-7, used the R6-R5-R4 path. The next 5, numbered 8-12, used the R2-R3-R4 path again, and then the next 6 used the R6-R5-R4 path.

Happy studies.

Jan
30

Introduction

In this series of posts, we are going to review some interesting topics illustrating unexpected behavior of the BGP routing protocol. It may seem that BGP is a robust and stable protocol; however, the way it was designed inherently presents some anomalies in optimal route selection. The main reason for this is the fact that BGP is a path-vector protocol - much like a distance-vector protocol, but with optimal route selection based on policies rather than simple additive metrics.

The fact that BGP is mainly used for Inter-AS routing results in different routing policies being used inside every AS. When those different policies come to interact, the resulting behavior might not be what the individual policy developers expected. For example, prepending the AS_PATH attribute may not result in proper global path manipulation if an upstream AS performs additional prepending.

In addition to that, BGP was designed for inter-AS loop detection based on the AS_PATH attribute and therefore cannot detect intra-AS routing loops. Optimally, intra-AS routing loops could be prevented by ensuring a full mesh of BGP peerings between all routers in the AS. However, implementing a full mesh is not possible for a large number of BGP routers. The known solutions to this problem - Route Reflectors and BGP Confederations - prevent all BGP speakers from having full information on all potential AS exit points, due to the best-path selection process. This unavoidable loss of information may result in suboptimal routing or routing loops, as illustrated below.

BGP RRs and Intra-AS Routing Loops

As mentioned above, a full mesh of BGP peering sessions eliminates intra-AS routing loops. However, using Route Reflectors (RRs) - a common solution to the full-mesh problem - will not result in the same behavior, as RRs only propagate best-paths to their clients, thus hiding the complete routing information from the edge routers. This may result in inconsistent best-path selection by the clients and end up in routing loops. A known design rule used to avoid this is to place Route Reflectors along the packet forwarding paths between the RR clients in different clusters. This also translates into the design principle where iBGP peering sessions closely follow the physical (geographical) topology.

Here is an example of what could happen in the situation where this rule is not observed. Look at the topology below, where R5 peers with the RR that is not the one closest to it in terms of IGP metrics. At the same time, R1 and R2 peer with another RR, and R5 is on the forwarding path between R1, R2 and R4. The problem here is that R5 receives external BGP prefixes from a different RR than R1 and R2 use. Thus, the exit point that R1 and R2 consider optimal may not be optimal for R5. Here is what happens:

bgp-anomalies-part1-1

BB3 advertises the AS54 prefixes to R4, and BB1 advertises the same set of prefixes to R6. R4 and R6 exchange this information, and each route reflector prefers its directly connected exit point and advertises its best path to its route-reflector clients. R4 sends its best paths to R1 and R2, and those clients install best-paths with a next hop of R4:

Rack1R2#show ip bgp  
BGP table version is 22, local router ID is 150.1.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i28.119.16.0/24 150.1.4.4 0 100 0 54 i
*>i28.119.17.0/24 150.1.4.4 0 100 0 54 i
*>i112.0.0.0 150.1.4.4 0 100 0 54 50 60 i
*>i113.0.0.0 150.1.4.4 0 100 0 54 50 60 i
*>i114.0.0.0 150.1.4.4 0 100 0 54 i
*>i115.0.0.0 150.1.4.4 0 100 0 54 i
*>i116.0.0.0 150.1.4.4 0 100 0 54 i
*>i117.0.0.0 150.1.4.4 0 100 0 54 i
*>i118.0.0.0 150.1.4.4 0 100 0 54 i
*>i119.0.0.0 150.1.4.4 0 100 0 54 i

And R5 receives the best paths from R6, which prefers the exit point via BB1. Thus, the best-paths in R5 would point toward R6:

Rack1R5#show ip bgp 
BGP table version is 22, local router ID is 150.1.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i28.119.16.0/24 150.1.6.6 0 100 0 54 i
*>i28.119.17.0/24 150.1.6.6 0 100 0 54 i
*>i112.0.0.0 150.1.6.6 0 100 0 54 50 60 i
*>i113.0.0.0 150.1.6.6 0 100 0 54 50 60 i
*>i114.0.0.0 150.1.6.6 0 100 0 54 i
*>i115.0.0.0 150.1.6.6 0 100 0 54 i
*>i116.0.0.0 150.1.6.6 0 100 0 54 i
*>i117.0.0.0 150.1.6.6 0 100 0 54 i
*>i118.0.0.0 150.1.6.6 0 100 0 54 i
*>i119.0.0.0 150.1.6.6 0 100 0 54 i
*>i139.1.0.0 150.1.6.6 0 100 0 i

And since R5 has to traverse R1 or R2 to reach R6, while R1 and R2 have to traverse R5 to reach R4, we have a routing loop:

Rack1SW3#traceroute 28.119.16.1

Type escape sequence to abort.
Tracing the route to 28.119.16.1

1 139.1.11.1 1004 msec 0 msec 4 msec
2 150.1.2.2 36 msec 32 msec 36 msec
3 139.1.25.5 64 msec 60 msec 64 msec
4 139.1.25.2 56 msec 56 msec 52 msec
5 139.1.25.5 84 msec 80 msec 84 msec
6 139.1.25.2 76 msec 76 msec 72 msec
7 139.1.25.5 104 msec 104 msec 100 msec
8 139.1.25.2 96 msec 96 msec 96 msec
9 139.1.25.5 136 msec 120 msec 124 msec
10 139.1.25.2 116 msec

The best way to avoid these routing loops is to make the iBGP sessions closely follow the physical topology, as illustrated in the diagram below:

[Figure: bgp-anomalies-part1-2]
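
For example, re-homing R5 to the route reflector that lies on its forwarding path (R4) is a plain iBGP change. Here is a minimal sketch, reusing the loopback addresses seen in the outputs above; treat it as an illustration of the design rule rather than a complete RR configuration:

R4:
router bgp 100
 neighbor 150.1.5.5 remote-as 100
 neighbor 150.1.5.5 update-source Loopback0
 neighbor 150.1.5.5 route-reflector-client

R5:
router bgp 100
 neighbor 150.1.4.4 remote-as 100
 neighbor 150.1.4.4 update-source Loopback0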

Another solution would be to adjust the topology to follow the iBGP peering sessions. For example, we could configure a GRE tunnel between R5 and R6 and exchange BGP routes over it. This results in suboptimal routing but prevents routing loops; it is, of course, not the recommended solution.
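
Such a tunnel takes only a few lines of configuration. A minimal sketch, assuming a hypothetical 10.255.56.0/24 tunnel subnet and reusing the loopback addresses from the outputs above (the iBGP session would then run across, and forward over, the tunnel):

R5:
interface Tunnel0
 ip address 10.255.56.5 255.255.255.0
 tunnel source Loopback0
 tunnel destination 150.1.6.6

R6:
interface Tunnel0
 ip address 10.255.56.6 255.255.255.0
 tunnel source Loopback0
 tunnel destination 150.1.5.5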

However, the use of tunneling to resolve this issue prompts another idea: using MPLS forwarding and a BGP-free core. We are not going to illustrate this well-known concept here, but simply point out that PE routers label-encapsulate IP packets routed towards BGP prefixes, using MPLS labels bound to the BGP next-hops. The actual packet forwarding is based on the shortest IGP paths (or MPLS TE paths), and there are no intermediate routers that may steer packets according to BGP routing tables. Effectively, you may place a route reflector anywhere in the topology and peer your PE routers however you prefer - optimal routing inside the AS no longer depends on BGP. Just from the logical perspective, however, it still makes sense to group RR clusters based on geographical proximity.

To be continued

In the next blog post in this series we will review situations where BGP gets stuck with permanently oscillating routes, resulting in continuous prefix advertisements and withdrawals. We will see how dangerous the BGP MED attribute can be and explain the rationale behind the Cisco IOS commands bgp always-compare-med and bgp deterministic-med.

Jan
17

Abstract:

The Inter-AS Multicast VPN solution introduces some challenges in cases where the peering autonomous systems implement a BGP-free core. This post illustrates a known solution to this problem, implemented in Cisco IOS software. The solution involves the use of special MP-BGP and PIM extensions. The reader is assumed to have an understanding of Cisco's basic mVPN implementation, the PIM protocol, and the Multi-Protocol BGP extensions.

Abbreviations used

mVPN – Multicast VPN
MSDP – Multicast Source Discovery Protocol
PE – Provider Edge
CE – Customer Edge
RPF – Reverse Path Forwarding
MP-BGP – Multi-Protocol BGP
PIM – Protocol Independent Multicast
PIM SM – PIM Sparse Mode
PIM SSM – PIM Source Specific Multicast
LDP – Label Distribution Protocol
MDT – Multicast Distribution Tree
P-PIM – Provider Facing PIM Instance
C-PIM – Customer Facing PIM Instance
NLRI – Network Layer Reachability Information

Inter-AS mVPN Overview

A typical "classic" Inter-AS mVPN solution leverages the following key components:

  • PIM-SM maintaining separate RP for every AS
  • MSDP used to exchange information on active multicast sources
  • (Optionally) MP-BGP multicast extension to propagate information about multicast prefixes for RPF validations

With this solution, different PEs participating in the same MDT discover each other by joining the shared tree towards the local RP and listening to the multicast packets sent by other PEs. Those PEs may belong to the same or different Autonomous Systems; in the latter case, the sources are discovered by virtue of MSDP peering. This scenario assumes that every router in the local AS has complete routing information about the multicast sources (the PEs' loopback addresses) residing in the other system. Such information is necessary for the RPF check. In turn, this leads to the requirement of running BGP on all P routers OR redistributing the MP-BGP (BGP) multicast prefix information into the IGP. The redistribution approach clearly has limited scalability, while the other method requires enabling BGP on the P routers, which nullifies the idea of a BGP-free core.
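
For reference, the MSDP component of this classic solution is simply a peering between the RPs of the two systems. A minimal sketch (all addresses here are hypothetical placeholders):

! Local router is the RP for its own AS
ip pim rp-address 192.0.2.1
!
! MSDP peering with the RP of the neighboring AS to exchange active source information
ip msdp peer 198.51.100.1 connect-source Loopback0 remote-as 200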

An alternative to running PIM in Sparse Mode would be using PIM SSM, which relies on out-of-band information for multicast source discovery. For this case, Cisco released a draft proposal defining a new MP-BGP MDT SAFI that is used to propagate the MDT information and the associated PE addresses. Let's make a short detour into MP-BGP to get a better understanding of SAFIs.

MP-BGP Overview

Recall the classic BGP UPDATE message format. It consists of the following sections: [Withdrawn prefixes (Optional)] + [Path Attributes] + [NLRIs]. The Withdrawn Prefixes and NLRIs are IPv4 prefixes, and their structure does not support any other network protocols. The Path Attributes (e.g. AS_PATH, ORIGIN, LOCAL_PREF, NEXT_HOP) are associated with all NLRIs in the message; prefixes with a different set of path attributes must be carried in a separate UPDATE message. Also, notice that NEXT_HOP is an IPv4 address as well.

In order to introduce support for non-IPv4 network protocols into BGP, two new optional (non-transitive) path attributes have been added to BGP. The first attribute, known as MP_REACH_NLRI, has the following structure: [AFI/SAFI] + [NEXT_HOP] + [NLRI]. Both NEXT_HOP and NLRI are formatted according to the protocol encoded via the AFI/SAFI pair, which stands for Address Family Identifier and Subsequent Address Family Identifier respectively. For example, this could be an IPv6 or CLNS prefix. Thus, all information about non-IPv4 prefixes is encoded in a new BGP path attribute. A typical BGP UPDATE message that contains MP_REACH_NLRI attributes has no "classic" NEXT_HOP attribute and none of the "Withdrawn Prefixes" or "NLRIs" found in normal UPDATE messages; for next-hop calculations, a receiving BGP speaker should use the information found in the MP_REACH_NLRI attribute. The multi-protocol UPDATE message may still contain other BGP path attributes such as AS_PATH, ORIGIN, MED, LOCAL_PREF and so on, but this time those attributes are associated with the non-IPv4 prefixes found in all attached MP_REACH_NLRI attributes.

The second attribute, MP_UNREACH_NLRI, has a format similar to MP_REACH_NLRI but lists the "multi-protocol" addresses to be withdrawn. No other path attributes need to be associated with it; an UPDATE message may simply contain a list of MP_UNREACH_NLRI attributes.
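
Schematically, the two attributes carry the following fields (per RFC 4760, shown here purely as a memory aid):

MP_REACH_NLRI:   [AFI (2 octets)][SAFI (1)][Next Hop Length (1)]
                 [Next Hop Address (variable)][Reserved (1)][NLRI (variable)]

MP_UNREACH_NLRI: [AFI (2 octets)][SAFI (1)][Withdrawn Routes (variable)]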

The list of supported AFIs may be found in RFC 1700 (though it's obsolete now, it is still very informative) - for example, AFI 1 stands for IPv4, AFI 2 for IPv6, and so on. The Subsequent AFI clarifies the purpose of the information found in MP_REACH_NLRI. For example, a SAFI value of 1 means the prefixes are to be used for unicast forwarding, SAFI 2 means the prefixes are to be used for multicast RPF checks, and SAFI 3 means the prefixes may be used for both purposes. Last but not least, a SAFI of 128 denotes an MPLS-labeled VPN address.

Just as a reminder, the BGP process performs a separate best-path election for the "classic" IPv4 prefixes and for the prefixes of every AFI/SAFI pair, based on the path attributes. This allows for independent route propagation for the addresses found in different address families. Since a given BGP speaker may not support particular network protocols, the list of supported AFI/SAFI pairs is advertised using the BGP capabilities feature (another BGP extension), and a particular network protocol's information is only propagated if both speakers support it.
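
In Cisco IOS, each negotiated AFI/SAFI pair surfaces as a separate address family under the BGP process. Here is a minimal sketch with a hypothetical neighbor, activating the unicast (SAFI 1), multicast (SAFI 2) and labeled-VPN (SAFI 128) families; the capabilities actually negotiated can then be inspected in the show ip bgp neighbors output:

router bgp 65000
 neighbor 192.0.2.1 remote-as 65000
 neighbor 192.0.2.1 update-source Loopback0
 !
 ! AFI 1 / SAFI 1 - unicast forwarding
 address-family ipv4 unicast
  neighbor 192.0.2.1 activate
 !
 ! AFI 1 / SAFI 2 - multicast RPF information
 address-family ipv4 multicast
  neighbor 192.0.2.1 activate
 !
 ! AFI 1 / SAFI 128 - MPLS-labeled VPN prefixes
 address-family vpnv4 unicast
  neighbor 192.0.2.1 activate
  neighbor 192.0.2.1 send-community extended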

MDT SAFI Overview

Cisco drafted a new SAFI to be used with the regular AFIs, such as IPv4 or IPv6. This SAFI is needed to propagate the MDT group address and the associated PE Loopback address. The format for AFI 1 (IPv4) is as follows:

MP_NLRI = [RD:PE’s IPv4 Address]:[MDT Group Address],
MP_NEXT_HOP = [BGP Peer IPv4 Address].

Here, the RD is the one corresponding to the VRF that has the MDT configured, and the "IPv4 Address" is the respective PE router's Loopback address. Normally, per Cisco's rules, this is the same Loopback interface used for VPNv4 peering, but it can be changed using the VRF-level command bgp next-hop.

If all PEs in the local AS exchange this information and pass it to PIM SSM, the P-PIM (Provider PIM, facing the SP core) process will be able to build (S,G) trees for the MDT group address towards the other PEs' IPv4 addresses. This works by virtue of the fact that the PEs' IPv4 addresses are known via the IGP, as all PEs are in the same AS, so there are no problems using a BGP-free core for intra-AS mVPN with PIM-SSM. It is also worth mentioning that a precursor to the MDT SAFI was a special extended community used along with the VPNv4 address family. MP-BGP would use an RD type value of 2 (not applicable to any unicast VRF) to transport the associated PE's IPv4 address along with an extended community containing the MDT group address. This allowed for "bootstrap" information propagation inside a single AS only, since the extended community was non-transitive. This temporary solution was replaced by the MDT SAFI draft.
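
Within a single AS, the whole mechanism reduces to two configuration touch points on each PE: the MDT group under the VRF and the MDT address family under BGP. A minimal sketch, reusing the group and RD values from the case study below (the neighbor address is a placeholder for the other PE's Loopback):

ip vrf RED
 rd 100:1
 mdt default 232.1.1.1
!
router bgp 100
 neighbor 10.0.1.1 remote-as 100
 neighbor 10.0.1.1 update-source Loopback0
 address-family ipv4 mdt
  neighbor 10.0.1.1 activate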

Next, consider the case of Inter-AS VPN where at least one AS uses a BGP-free core. When two peering Autonomous Systems activate the IPv4 MDT SAFI, the ASBRs will advertise all information learned from the PEs to each other. The information will further propagate down to each AS's PEs. Next, the P-PIM processes will attempt to build (S,G) trees towards the PE IP addresses in the neighboring systems. Even though the PEs may know the other PEs' addresses (e.g. if Inter-AS VPN Option C is being used), the P routers don't have this information. If Inter-AS VPN Option B is in use, even the PE routers have no proper information to build the (S,G) trees.

RPF Proxy Vector

The solution to this problem uses a modification to the PIM protocol and the RPF check functionality. Known as the RPF Proxy Vector, it defines a new PIM TLV that contains the IPv4 address of the "proxy" router to be used for RPF checks and as an intermediate destination for PIM Joins. Let's see how it works in a particular scenario.

On the diagram below you can see AS 100 and AS 200 using Inter-AS VPN Option B to exchange VPNv4 routes. The PEs and ASBRs peer via BGP and exchange VPNv4 and IPv4 MDT SAFI prefixes. For every external prefix relayed to its own PEs, an ASBR changes the next-hop found in the MP_REACH_NLRI attribute to its local Loopback address. For VPNv4 prefixes, this achieves the goal of terminating the LSP on the ASBR. For MDT SAFI prefixes, this procedure sets the IPv4 address to be used as the "proxy" in PIM Joins.

[Figure: mvpn-inter-as-basic]

Let's say that the MDT group used by R1 and R5 is 232.1.1.1. When R1 receives the MDT SAFI update carrying the value 200:1:20.0.5.5 (the RD and PE IPv4 address) for MDT group 232.1.1.1, with the next-hop value of 10.0.3.3 (R3's Loopback0 interface), it passes this information down to the PIM process. The PIM process constructs a PIM Join for the group 232.1.1.1 towards the IP address 20.0.5.5 (not known in AS 100) and inserts a proxy vector value of 10.0.3.3. The PIM process then uses the route to 10.0.3.3 to find the next upstream PIM peer to send the Join message to. Every P router processes the PIM Join message with the proxy vector and uses the proxy IPv4 address to relay the message upstream. As soon as the message reaches the proxy router (in our case R3), the proxy vector is removed and the PIM Joins propagate further using the regular procedure, as the domain behind the proxy is supposed to have visibility of the actual Join target.

In addition to using the proxy vector to relay the PIM Join upstream, every router creates a special mroute state for the (S,G) pair, where S is the PE IPv4 address and G is the MDT group. This mroute state has the proxy IPv4 address associated with it. When a matching multicast packet going from the external PE towards the MDT address hits the router, the RPF check is performed based on the upstream interface associated with the proxy IPv4 address, not the actual source IPv4 address found in the packet. For example, in our scenario, R2 would have an mroute state for (20.0.5.5, 232.1.1.1) with the proxy IPv4 address of 10.0.3.3. All packets coming from R5 to 232.1.1.1 will be RPF-checked based on the upstream interface towards 10.0.3.3.

Using the above-described "proxy-based" procedure, the P routers may successfully perform RPF checks for packets whose source IPv4 addresses are not found in the local RIB. The tradeoff is the amount of multicast state information that has to be stored in the P routers' memory: in the worst case, where every PE participates in every mVPN, it is proportional to the number of PEs multiplied by the number of mVPNs. For example, 100 PEs each participating in 50 mVPNs would yield on the order of 5,000 (S,G) entries. There could be even more multicast route states in situations where Data MDTs are used in addition to the Default MDT.
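
From the configuration standpoint, the feature boils down to two command forms, both of which appear verbatim in the case study below:

! On the PE routers - insert the RD + proxy vector into Joins built for VRF RED's MDT
ip multicast vrf RED rpf proxy rd vector
!
! On the P routers - process the vector found in Joins received in the global table
ip multicast rpf proxy vector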

BGP Connector Attribute

One additional piece of information is needed for "intra-VPN" operations: joining a PIM tree towards a particular IP address inside a VPN and performing an RPF check inside the VPN. Consider the use of Inter-AS VPN Option B, where the VPNv4 prefixes have their MP_REACH_NLRI next-hop changed to the local ASBR's IPv4 address. When a local PE receives a multicast packet on the MDT tunnel interface, it decapsulates it and performs a source IPv4 address lookup inside the VRF's table. Based on the MP-BGP learned routes, the next-hop would point towards the ASBR (Option B), while the packets might be arriving across a different inter-AS link running a multicast MP-BGP peering. Thus, relying solely on the unicast next-hop may not be sufficient for Inter-AS RPF checks.

For example, look at the figure below, where R3 and R4 run MP-BGP for VPNv4 prefixes while R4 and R6 run the multicast MP-BGP extension. R1 peers with both ASBRs and learns VPNv4 prefixes from R3, while it learns the MDT SAFI information and PE IPv4 addresses from R6. PIM is enabled only on the link connecting R6 and R4.

[Figure: mvpn-inter-as-diverse-paths]

In this situation, the RPF lookup would fail, as the MDT SAFI information is exchanged across the link running multicast BGP, while the VPNv4 prefixes' next-hop points to R3. Thus, a method is required to preserve the information needed for the RPF lookup.

Cisco suggested the use of a new optional transitive attribute named BGP Connector, to be exported along with the VPNv4 prefixes out of a PE hosting an mVPN. This attribute contains the following two components: [AFI/SAFI] + [Connector Information], and in general it defines the information needed by the network protocol identified by the AFI/SAFI pair to connect to the routing information found in MP_REACH_NLRI. If AFI=IPv4 and SAFI=MDT, the Connector attribute contains the IPv4 address of the router originating the prefixes associated with the VRF that has an MDT configured.

The customer-facing PIM process (C-PIM) in the PE routers uses the information found in the BGP Connector attribute to perform the intra-VPN RPF check, as well as to find the next-hop to send PIM Joins to. Notice that the C-PIM Joins do not need the RPF proxy vector piggybacked in the PIM messages, as they are transported inside the MDT tunnel towards the remote PEs.

You may notice that the use of the BGP Connector attribute eliminates the need for a special "VPNv4 multicast" address family that could otherwise be used to transport RPF check information for VPNv4 prefixes. Such an address family is not really needed, as the multicast packets are tunneled through the SP cores using the MDT tunnel, and the BGP Connector is sufficient for RPF checks at the provider edge (we will see the attribute in the show bgp vpnv4 unicast output in the case study below). However, the multicast BGP address family is still needed in situations where diverse unicast and multicast transport paths are used between the Autonomous Systems.

Case Study

Let's put all the concepts to the test in a sample scenario. Here, two autonomous systems, AS 100 and AS 200, peer using Inter-AS VPN Option B to exchange VPNv4 prefixes across the link between R3 and R4. At the same time, multicast prefixes are exchanged across the peering link between R6 and R4, along with the MDT SAFI information. AS 100 implements a BGP-free core, so R2 does not peer via BGP with any other routers and only uses OSPF for IGP prefix exchange. R1 peers with both ASBRs, R3 and R6, via MP-BGP for the purposes of VPNv4, multicast prefix, and MDT SAFI exchange.

[Figure: mvpn-case-study]

On the diagram, the links highlighted in orange are enabled for PIM SM and may carry the multicast traffic. Notice that MPLS traffic and multicast traffic take different paths between the systems. The following are the main highlights of R1's configuration:

  • VRF RED configured with MDT of 232.1.1.1. PIM SSM configured for P-PIM instance using the default group range of 232/8.
  • The command ip multicast vrf RED rpf proxy rd vector ensures the use of the RPF proxy vector for the MDT tree that the P-PIM instance builds for VRF RED. Notice that the related command ip multicast rpf proxy vector applies to the Joins received in the global routing table and is typically seen on the P routers.
  • R1 exchanges VPNv4 prefixes with R3 and multicast prefixes with R6 via BGP. At the same time R1 exchanges IPv4 MDT SAFI with R6 to learn the MDT information from AS 200.
  • Connected routes from VRF RED are redistributed into MP-BGP.
R1:
hostname R1
!
interface Serial 2/0
encapsulation frame-relay
no shutdown
!
ip multicast-routing
ip pim ssm default
!
interface Serial 2/0.12 point-to-point
ip address 10.0.12.1 255.255.255.0
frame-relay interface-dlci 102
mpls ip
ip pim sparse-mode
!
interface Loopback0
ip pim sparse-mode
ip address 10.0.1.1 255.255.255.255
!
router ospf 1
network 10.0.12.1 0.0.0.0 area 0
network 10.0.1.1 0.0.0.0 area 0
!
router bgp 100
neighbor 10.0.3.3 remote-as 100
neighbor 10.0.3.3 update-source Loopback 0
neighbor 10.0.6.6 remote-as 100
neighbor 10.0.6.6 update-source Loopback 0
address-family ipv4 unicast
no neighbor 10.0.3.3 activate
no neighbor 10.0.6.6 activate
address-family vpnv4 unicast
neighbor 10.0.3.3 activate
neighbor 10.0.3.3 send-community both
address-family ipv4 mdt
neighbor 10.0.6.6 activate
address-family ipv4 multicast
neighbor 10.0.6.6 activate
network 10.0.1.1 mask 255.255.255.255
address-family ipv4 vrf RED
redistribute connected
!
no ip domain-lookup
!
ip multicast vrf RED rpf proxy rd vector
!
ip vrf RED
rd 100:1
route-target both 200:1
route-target both 100:1
mdt default 232.1.1.1
!
ip multicast-routing vrf RED
!
interface FastEthernet 0/0
ip vrf forwarding RED
ip address 192.168.1.1 255.255.255.0
ip pim dense-mode
no shutdown

Notice that in the configuration above, the Loopback0 interface is used as the source for the MDT tunnel and therefore has to have PIM (multicast routing) enabled on it. Next in turn, R2's configuration is straightforward: OSPF is used as the IGP, with adjacencies to R1, R3 and R6, and label exchange via LDP. Notice that R2 does NOT run PIM on the uplink to R3 and does NOT run LDP with R6. Effectively, the path via R6 is used only for multicast traffic, while the path across R3 is used only for MPLS LSPs. R2 is configured for PIM-SSM and RPF proxy vector support in the global routing table.

R2:
hostname R2
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip multicast-routing
!
interface Serial 2/0.12 point-to-point
ip address 10.0.12.2 255.255.255.0
frame-relay interface-dlci 201
mpls ip
ip pim sparse-mode
!
ip pim ssm default
!
interface Serial 2/0.23 point-to-point
ip address 10.0.23.2 255.255.255.0
frame-relay interface-dlci 203
mpls ip
!
interface Serial 2/0.26 point-to-point
ip address 10.0.26.2 255.255.255.0
frame-relay interface-dlci 206
ip pim sparse-mode
!
interface Loopback0
ip address 10.0.2.2 255.255.255.255
!
ip multicast rpf proxy vector
!
router ospf 1
network 10.0.12.2 0.0.0.0 area 0
network 10.0.2.2 0.0.0.0 area 0
network 10.0.23.2 0.0.0.0 area 0
network 10.0.26.2 0.0.0.0 area 0

R3 is the ASBR implementing Inter-AS VPN Option B. It peers via BGP with R1 (the PE) and R4 (the other ASBR). Only the VPNv4 address family is enabled with the BGP peers. Notice that the next-hop for VPNv4 prefixes is changed to self in order to terminate the transport LSP from the PE on R3. No multicast or MDT SAFI information is exchanged across R3.

R3:
hostname R3
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
interface Serial 2/0.23 point-to-point
ip address 10.0.23.3 255.255.255.0
frame-relay interface-dlci 302
mpls ip
!
interface Serial 2/0.34 point-to-point
ip address 172.16.34.3 255.255.255.0
frame-relay interface-dlci 304
mpls ip
!
interface Loopback0
ip address 10.0.3.3 255.255.255.255
!
router ospf 1
network 10.0.23.3 0.0.0.0 area 0
network 10.0.3.3 0.0.0.0 area 0
!
router bgp 100
no bgp default route-target filter
neighbor 10.0.1.1 remote-as 100
neighbor 10.0.1.1 update-source Loopback 0
neighbor 172.16.34.4 remote-as 200
address-family ipv4 unicast
no neighbor 10.0.1.1 activate
no neighbor 172.16.34.4 activate
address-family vpnv4 unicast
neighbor 10.0.1.1 activate
neighbor 10.0.1.1 next-hop-self
neighbor 10.0.1.1 send-community both
neighbor 172.16.34.4 activate
neighbor 172.16.34.4 send-community both

The second ASBR in AS 100, R6, could be characterized as the multicast-only ASBR. In fact, this ASBR is only used to exchange prefixes in the multicast and MDT SAFI address families with R1 and R4. MPLS is not enabled on this router, and its sole purpose is multicast forwarding between AS 100 and AS 200. There is no need to run MSDP, as PIM SSM is used for multicast tree construction.

R6:
hostname R6
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip multicast-routing
ip pim ssm default
ip multicast rpf proxy vector
!
interface Serial 2/0.26 point-to-point
ip address 10.0.26.6 255.255.255.0
frame-relay interface-dlci 602
ip pim sparse-mode
!
interface Serial 2/0.46 point-to-point
ip address 172.16.46.6 255.255.255.0
frame-relay interface-dlci 604
ip pim sparse-mode
!
interface Loopback0
ip pim sparse-mode
ip address 10.0.6.6 255.255.255.255
!
router ospf 1
network 10.0.6.6 0.0.0.0 area 0
network 10.0.26.6 0.0.0.0 area 0
!
router bgp 100
neighbor 10.0.1.1 remote-as 100
neighbor 10.0.1.1 update-source Loopback 0
neighbor 172.16.46.4 remote-as 200
address-family ipv4 unicast
no neighbor 10.0.1.1 activate
no neighbor 172.16.46.4 activate
address-family ipv4 mdt
neighbor 172.16.46.4 activate
neighbor 10.0.1.1 activate
neighbor 10.0.1.1 next-hop-self
address-family ipv4 multicast
neighbor 172.16.46.4 activate
neighbor 10.0.1.1 activate
neighbor 10.0.1.1 next-hop-self

Pay attention to the following. First, R6 is set up for PIM SSM and RPF proxy vector support. Second, R6 sets itself as the BGP next hop in the updates sent under the multicast and MDT SAFI address families. This is needed for proper MDT tree construction and correct RPF vector insertion. The next router, R4, is the combined VPNv4 and multicast ASBR for AS 200. It performs the same functions that R3 and R6 perform separately in AS 100. The VPNv4, MDT SAFI, and multicast address families are enabled under the BGP process on this router. At the same time, the router supports the RPF Proxy Vector and PIM-SSM for proper multicast forwarding. This router is the most configuration-intensive of all routers in both Autonomous Systems, as it also has to support MPLS label propagation via BGP and LDP. Of course, as a classic Option B ASBR, R4 has to change the BGP next-hop to itself for all address-family updates sent to R5, the PE in AS 200.

R4:
hostname R4
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip pim ssm default
ip multicast rpf proxy vector
ip multicast-routing
!
interface Serial 2/0.34 point-to-point
ip address 172.16.34.4 255.255.255.0
frame-relay interface-dlci 403
mpls ip
!
interface Serial 2/0.45 point-to-point
ip address 20.0.45.4 255.255.255.0
frame-relay interface-dlci 405
mpls ip
ip pim sparse-mode
!
interface Serial 2/0.46 point-to-point
ip address 172.16.46.4 255.255.255.0
frame-relay interface-dlci 406
ip pim sparse-mode
!
interface Loopback0
ip address 20.0.4.4 255.255.255.255
!
router ospf 1
network 20.0.4.4 0.0.0.0 area 0
network 20.0.45.4 0.0.0.0 area 0
!
router bgp 200
no bgp default route-target filter
neighbor 172.16.34.3 remote-as 100
neighbor 172.16.46.6 remote-as 100
neighbor 20.0.5.5 remote-as 200
neighbor 20.0.5.5 update-source Loopback0
address-family ipv4 unicast
no neighbor 172.16.34.3 activate
no neighbor 20.0.5.5 activate
no neighbor 172.16.46.6 activate
address-family vpnv4 unicast
neighbor 172.16.34.3 activate
neighbor 172.16.34.3 send-community both
neighbor 20.0.5.5 activate
neighbor 20.0.5.5 send-community both
neighbor 20.0.5.5 next-hop-self
address-family ipv4 mdt
neighbor 172.16.46.6 activate
neighbor 20.0.5.5 activate
neighbor 20.0.5.5 next-hop-self
address-family ipv4 multicast
neighbor 20.0.5.5 activate
neighbor 20.0.5.5 next-hop-self
neighbor 172.16.46.6 activate

The last router in the diagram is R5. It's a PE in AS 200 configured symmetrically to R1. It has to support the VPNv4, MDT SAFI, and multicast address families to learn all the necessary information from the ASBR. Of course, the PIM RPF proxy vector is enabled for VRF RED's MDT, and PIM-SSM is configured for the default group range in the global routing table. As you may have noticed, there is no router in AS 200 that emulates a BGP-free core.

R5:
hostname R5
!
no ip domain-lookup
!
interface Serial 2/0
encapsulation frame-relay
no shut
!
ip multicast-routing
!
interface Serial 2/0.45 point-to-point
ip address 20.0.45.5 255.255.255.0
frame-relay interface-dlci 504
mpls ip
ip pim sparse-mode
!
interface Loopback0
ip pim sparse-mode
ip address 20.0.5.5 255.255.255.255
!
router ospf 1
network 20.0.5.5 0.0.0.0 area 0
network 20.0.45.5 0.0.0.0 area 0
!
ip vrf RED
rd 200:1
route-target both 200:1
route-target both 100:1
mdt default 232.1.1.1
!
router bgp 200
neighbor 20.0.4.4 remote-as 200
neighbor 20.0.4.4 update-source Loopback0
address-family ipv4 unicast
no neighbor 20.0.4.4 activate
address-family vpnv4 unicast
neighbor 20.0.4.4 activate
neighbor 20.0.4.4 send-community both
address-family ipv4 mdt
neighbor 20.0.4.4 activate
address-family ipv4 multicast
neighbor 20.0.4.4 activate
network 20.0.5.5 mask 255.255.255.255
address-family ipv4 vrf RED
redistribute connected
!
ip multicast vrf RED rpf proxy rd vector
ip pim ssm default
!
ip multicast-routing vrf RED
!
interface FastEthernet 0/0
ip vrf forwarding RED
ip address 192.168.5.1 255.255.255.0
ip pim dense-mode
no shutdown

Validating Unicast Paths

This is the simplest part. Use the show commands to see if the VPNv4 prefixes have propagated between the PEs and test end-to-end connectivity:

R1#sh ip route vrf RED

Routing Table: RED
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
ia - IS-IS inter area, * - candidate default, U - per-user static route
o - ODR, P - periodic downloaded static route

Gateway of last resort is not set

B 192.168.5.0/24 [200/0] via 10.0.3.3, 00:50:35
C 192.168.1.0/24 is directly connected, FastEthernet0/0

R1#show bgp vpnv4 unicast vrf RED
BGP table version is 5, local router ID is 10.0.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 100:1 (default for vrf RED)
*> 192.168.1.0 0.0.0.0 0 32768 ?
*>i192.168.5.0 10.0.3.3 0 100 0 200 ?

R1#show bgp vpnv4 unicast vrf RED 192.168.5.0
BGP routing table entry for 100:1:192.168.5.0/24, version 5
Paths: (1 available, best #1, table RED)
Not advertised to any peer
200, imported path from 200:1:192.168.5.0/24
10.0.3.3 (metric 129) from 10.0.3.3 (10.0.3.3)
Origin incomplete, metric 0, localpref 100, valid, internal, best
Extended Community: RT:100:1 RT:200:1
Connector Attribute: count=1
type 1 len 12 value 200:1:20.0.5.5
mpls labels in/out nolabel/23

R1#traceroute vrf RED 192.168.5.1

Type escape sequence to abort.
Tracing the route to 192.168.5.1

1 10.0.12.2 [MPLS: Labels 17/23 Exp 0] 432 msec 36 msec 60 msec
2 10.0.23.3 [MPLS: Label 23 Exp 0] 68 msec 8 msec 36 msec
3 172.16.34.4 [MPLS: Label 19 Exp 0] 64 msec 16 msec 48 msec
4 192.168.5.1 12 msec * 8 msec

Notice that in the output above, the prefix 192.168.5.0/24 has the next-hop value of 10.0.3.3 and a BGP Connector attribute value of 200:1:20.0.5.5. This information will be used for RPF checks later, when we start sending multicast traffic.

Validating Multicast Paths

Multicast forwarding is a bit more complicated. The first thing we should do is make sure the MDTs have been built from R1 towards R5 and from R5 towards R1. Check the PIM MDT groups on every PE:

R1#show ip pim mdt 
MDT Group Interface Source VRF
* 232.1.1.1 Tunnel0 Loopback0 RED
R1#show ip pim mdt bgp
MDT (Route Distinguisher + IPv4) Router ID Next Hop
MDT group 232.1.1.1
200:1:20.0.5.5 10.0.6.6 10.0.6.6

R5#show ip pim mdt
MDT Group Interface Source VRF
* 232.1.1.1 Tunnel0 Loopback0 RED

R5#show ip pim mdt bgp
MDT (Route Distinguisher + IPv4) Router ID Next Hop
MDT group 232.1.1.1
100:1:10.0.1.1 20.0.4.4 20.0.4.4

In the output above, pay attention to the next-hop values found in the MDT BGP information. In AS 100 it points toward R6, while in AS 200 it points to R4. Those next-hops are to be used as the proxy vectors for the PIM Join messages. Check the mroutes for the tree (20.0.5.5, 232.1.1.1), starting from R1 and climbing up across R2, R6 and R4 to R5:

R1#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(20.0.5.5, 232.1.1.1), 00:58:49/00:02:59, flags: sTIZV
Incoming interface: Serial2/0.12, RPF nbr 10.0.12.2, vector 10.0.6.6
Outgoing interface list:
MVRF RED, Forward/Sparse, 00:58:49/00:01:17

(10.0.1.1, 232.1.1.1), 00:58:49/00:03:19, flags: sT
Incoming interface: Loopback0, RPF nbr 0.0.0.0
Outgoing interface list:
Serial2/0.12, Forward/Sparse, 00:58:47/00:02:55

R1#show ip mroute 232.1.1.1 proxy
(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/10.0.6.6 0.0.0.0 BGP MDT 00:58:51/stopped

R1 shows the RPF proxy value of 10.0.6.6 for the source 20.0.5.5. Notice that there is also a tree for the source 10.0.1.1, for which the Join originated from R5. This tree has no proxy, as it is now inside its native AS. Next in turn, check R2 and R6 to find the same information (remember that the actual proxy removes the vector when it sees itself in the PIM Join message's proxy field):

R2#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:01:41/00:03:25, flags: sT
Incoming interface: Serial2/0.12, RPF nbr 10.0.12.1
Outgoing interface list:
Serial2/0.26, Forward/Sparse, 01:01:41/00:02:50

(20.0.5.5, 232.1.1.1), 01:01:43/00:03:25, flags: sTV
Incoming interface: Serial2/0.26, RPF nbr 10.0.26.6, vector 10.0.6.6
Outgoing interface list:
Serial2/0.12, Forward/Sparse, 01:01:43/00:02:56

R2#show ip mroute 232.1.1.1 proxy
(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/10.0.6.6 10.0.12.1 PIM 01:01:46/00:02:23

Notice the same proxy vector for (20.0.5.5, 232.1.1.1), as set by R1. As expected, there is a "contra-directional" tree built toward R1 on behalf of R5 that carries no RPF proxy vector. Proceed to the outputs from R6, and notice that R6 holds proxy state for both trees; in both cases it is the proxy itself, hence the "local" marking:

R6#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:05:40/00:03:21, flags: sT
Incoming interface: Serial2/0.26, RPF nbr 10.0.26.2
Outgoing interface list:
Serial2/0.46, Forward/Sparse, 01:05:40/00:02:56

(20.0.5.5, 232.1.1.1), 01:05:42/00:03:21, flags: sTV
Incoming interface: Serial2/0.46, RPF nbr 172.16.46.4, vector 172.16.46.4
Outgoing interface list:
Serial2/0.26, Forward/Sparse, 01:05:42/00:02:51

R6#show ip mroute proxy
(10.0.1.1, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
100:1/local 172.16.46.4 PIM 01:05:44/00:02:21

(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/local 10.0.26.2 PIM 01:05:47/00:02:17

The show command outputs from R4 are similar to R6's - it is the proxy for both multicast trees:

R4#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:08:42/00:03:16, flags: sTV
Incoming interface: Serial2/0.46, RPF nbr 172.16.46.6, vector 172.16.46.6
Outgoing interface list:
Serial2/0.45, Forward/Sparse, 01:08:42/00:02:51

(20.0.5.5, 232.1.1.1), 01:08:44/00:03:16, flags: sT
Incoming interface: Serial2/0.45, RPF nbr 20.0.45.5
Outgoing interface list:
Serial2/0.46, Forward/Sparse, 01:08:44/00:02:46

R4#show ip mroute proxy
(10.0.1.1, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
100:1/local 20.0.45.5 PIM 01:08:46/00:02:17

(20.0.5.5, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
200:1/local 172.16.46.6 PIM 01:08:48/00:02:12

Finally, the outputs on R5 mirror the ones we saw on R1, but this time the multicast trees have swapped roles: the one toward R1 has the proxy vector set:

R5#show ip mroute 232.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
L - Local, P - Pruned, R - RP-bit set, F - Register flag,
T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
U - URD, I - Received Source Specific Host Report,
Z - Multicast Tunnel, z - MDT-data group sender,
Y - Joined MDT-data group, y - Sending to MDT-data group,
V - RD & Vector, v - Vector
Outgoing interface flags: H - Hardware switched, A - Assert winner
Timers: Uptime/Expires
Interface state: Interface, Next-Hop or VCD, State/Mode

(10.0.1.1, 232.1.1.1), 01:12:07/00:02:57, flags: sTIZV
Incoming interface: Serial2/0.45, RPF nbr 20.0.45.4, vector 20.0.4.4
Outgoing interface list:
MVRF RED, Forward/Sparse, 01:12:07/00:00:02

(20.0.5.5, 232.1.1.1), 01:13:40/00:03:27, flags: sT
Incoming interface: Loopback0, RPF nbr 0.0.0.0
Outgoing interface list:
Serial2/0.45, Forward/Sparse, 01:12:09/00:03:12

R5#show ip mroute proxy
(10.0.1.1, 232.1.1.1)
Proxy Assigner Origin Uptime/Expire
100:1/20.0.4.4 0.0.0.0 BGP MDT 01:12:10/stopped

R5's BGP table also carries the Connector attribute information for the prefix of R1's Ethernet interface:

R5#show bgp vpnv4 unicast vrf RED 192.168.1.0
BGP routing table entry for 200:1:192.168.1.0/24, version 6
Paths: (1 available, best #1, table RED)
Not advertised to any peer
100, imported path from 100:1:192.168.1.0/24
20.0.4.4 (metric 65) from 20.0.4.4 (20.0.4.4)
Origin incomplete, metric 0, localpref 100, valid, internal, best
Extended Community: RT:100:1 RT:200:1
Connector Attribute: count=1
type 1 len 12 value 100:1:10.0.1.1
mpls labels in/out nolabel/18

In addition to the Connector attributes, the last piece of information needed is the multicast source information propagated via the IPv4 multicast address family. Both R1 and R5 should have this information in their BGP tables:

R5#show bgp ipv4 multicast
BGP table version is 3, local router ID is 20.0.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*>i10.0.1.1/32 20.0.4.4 0 100 0 100 i
*> 20.0.5.5/32 0.0.0.0 0 32768 i

R1#show bgp ipv4 multicast
BGP table version is 3, local router ID is 10.0.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 10.0.1.1/32 0.0.0.0 0 32768 i
*>i20.0.5.5/32 10.0.6.6 0 100 0 200 i

Now it's time to verify the multicast connectivity. Make sure R1 and R5 see each other as PIM neighbors over the MDT and then do a multicast ping toward a group joined by both R1 and R5:

R1#show ip pim vrf RED neighbor 
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
S - State Refresh Capable
Neighbor Interface Uptime/Expires Ver DR
Address Prio/Mode
20.0.5.5 Tunnel0 01:31:38/00:01:15 v2 1 / DR S P

R5#show ip pim vrf RED neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
S - State Refresh Capable
Neighbor Interface Uptime/Expires Ver DR
Address Prio/Mode
10.0.1.1 Tunnel0 01:31:17/00:01:26 v2 1 / S P

R1#ping vrf RED 239.1.1.1 repeat 100

Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:

Reply to request 0 from 192.168.1.1, 12 ms
Reply to request 0 from 20.0.5.5, 64 ms
Reply to request 1 from 192.168.1.1, 8 ms
Reply to request 1 from 20.0.5.5, 56 ms
Reply to request 2 from 192.168.1.1, 8 ms
Reply to request 2 from 20.0.5.5, 100 ms
Reply to request 3 from 192.168.1.1, 16 ms
Reply to request 3 from 20.0.5.5, 56 ms

This concludes our testbed verification.

Summary

In this blog post we demonstrated how the MP-BGP and PIM extensions can be used to implement Inter-AS multicast VPN between autonomous systems with BGP-free cores. PIM SSM is used to build the inter-AS trees, and the MDT SAFI is used to discover the MDT group addresses along with the PEs associated with them. The PIM RPF proxy vector allows for successful RPF checks in a multicast-route-free core by virtue of the proxy IPv4 address. Finally, the BGP Connector attribute allows for successful RPF checks inside a particular VRF.

Further Reading

Multicast VPN (draft-rosen-vpn-mcast)
Multicast Tunnel Discovery (draft-wijnands-mt-discovery)
PIM RPF Vector (draft-ietf-pim-rpf-vector-08)
MDT SAFI (draft-nalawade-idr-mdt-safi)

MDT SAFI Configuration
PIM RPF Proxy Vector Configuration
