Jan 30

Introduction

In this series of posts, we review some interesting cases of unexpected behavior in the BGP routing protocol. BGP may seem robust and stable; however, its design inherently allows anomalies in optimal route selection. The main reason is that BGP is a path-vector protocol: much like a distance-vector protocol, but with optimal route selection based on policies rather than simple additive metrics.

Because BGP is mainly used for inter-AS routing, every AS applies its own routing policies. When those policies interact, the resulting behavior may differ from what the individual policy designers expected. For example, prepending the AS_PATH attribute may not steer traffic globally as intended if an upstream AS performs additional prepending.
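As a rough illustration of the prepending example, the Python sketch below compares AS_PATH lengths only, which is just one step of the real best-path algorithm. The AS numbers (65001, 65010, 65020) and the helper names are hypothetical and are not part of the lab scenario in this post.

```python
# Toy model: compare only AS_PATH length, ignoring every other BGP attribute.
# All AS numbers below are hypothetical private-use values.

def prepend(as_path, asn, times):
    """Return a new AS_PATH with `asn` prepended `times` extra times."""
    return [asn] * times + as_path

def shortest(paths):
    """Pick the name of the path with the fewest AS hops."""
    return min(paths, key=lambda name: len(paths[name]))

ORIGIN = 65001

# The origin AS wants traffic to arrive via upstream 65010, so it prepends
# itself twice on the announcement sent to upstream 65020.
via_a = [ORIGIN]                      # announced to 65010, no prepending
via_b = prepend([ORIGIN], ORIGIN, 2)  # announced to 65020, prepended twice

# Each upstream adds its own AS number when advertising the route onward.
seen_intended = {"via_65010": [65010] + via_a, "via_65020": [65020] + via_b}

# Now suppose 65010 applies its own policy and prepends itself three times.
seen_actual = {
    "via_65010": prepend([65010] + via_a, 65010, 3),
    "via_65020": [65020] + via_b,
}

print(shortest(seen_intended))  # via_65010
print(shortest(seen_actual))    # via_65020
```

In this sketch the origin's prepending works only as long as the upstream does not add prepends of its own; once it does, a remote AS sees the "wrong" path as the shorter one.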

In addition, BGP was designed to detect inter-AS loops based on the AS_PATH attribute and therefore cannot detect intra-AS routing loops. Ideally, intra-AS routing loops are prevented by a full mesh of BGP peering sessions between all routers in the AS. However, a full mesh does not scale to a large number of BGP routers, since the number of sessions grows quadratically. The known solutions to this problem, Route Reflectors and BGP Confederations, leave BGP speakers without full information on all potential AS exit points, because only the result of the best-path selection process is propagated. This unavoidable loss of information may result in suboptimal routing or routing loops, as illustrated below.
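To put numbers on the full-mesh scaling problem, here is a small Python sketch. The route-reflector layout it assumes (two reflectors peering with each other, every other router peering with both) is only one possible design, chosen for illustration.

```python
def full_mesh_sessions(n):
    """iBGP sessions needed when each of n routers peers with every other."""
    return n * (n - 1) // 2

def single_cluster_sessions(n, rrs):
    """Sessions with `rrs` route reflectors fully meshed with each other and
    every remaining router peering with each reflector (an assumed design)."""
    clients = n - rrs
    return rrs * (rrs - 1) // 2 + clients * rrs

for n in (10, 100, 1000):
    print(n, full_mesh_sessions(n), single_cluster_sessions(n, rrs=2))
```

For 100 routers this gives 4950 sessions in a full mesh versus 197 with two reflectors, which is why the full mesh is abandoned in larger networks despite its loop-free behavior.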

BGP RRs and Intra-AS Routing Loops

As mentioned above, a full mesh of iBGP peering sessions eliminates intra-AS routing loops. However, Route Reflectors (RRs), a common solution to the full-mesh scaling problem, do not behave the same way: RRs propagate only best paths to their clients, hiding the complete routing information from edge routers. This may result in inconsistent best-path selection by clients and end in routing loops. A known design rule for avoiding this is to place Route Reflectors along the packet forwarding paths between RR clients in different clusters. This also translates into the design principle that iBGP peering sessions should closely follow the physical (geographical) topology.
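The information hiding can be shown with a minimal Python sketch. It assumes two exit points whose paths are equal in every BGP attribute, so the decision falls to the IGP cost toward the next hop; the router names, exit names and costs are made up for illustration.

```python
# Hypothetical IGP costs from each router to two exit points (BGP next hops).
# Router names, exit names and costs are illustrative only.
IGP_COST = {
    "RR":     {"exit_A": 10, "exit_B": 30},
    "client": {"exit_A": 40, "exit_B": 5},
}

def best_exit(router, candidates):
    """Tie-break on lowest IGP cost to the next hop, all else being equal."""
    return min(candidates, key=lambda nh: IGP_COST[router][nh])

all_exits = ["exit_A", "exit_B"]

# Full mesh: the client hears both paths and chooses for itself.
full_mesh_choice = best_exit("client", all_exits)

# Route reflection: the RR picks its own best path and reflects only that one.
reflected = [best_exit("RR", all_exits)]
rr_client_choice = best_exit("client", reflected)

print(full_mesh_choice)   # exit_B
print(rr_client_choice)   # exit_A
```

The client ends up with the exit that is best from the reflector's position in the network, not from its own, which is why the reflector's placement relative to the forwarding path matters.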

Here is an example of what can happen when this rule is not observed. Look at the topology below, where R5 peers with an RR that is not the one closest to it in terms of IGP metrics. At the same time, R1 and R2 peer with another RR, and R5 is on the forwarding path from R1 and R2 to R4. The problem is that R5 receives external BGP prefixes from a different RR than the one R1 and R2 use. Thus, the exit point that R1 and R2 consider optimal may not be optimal for R5. Here is what happens:

bgp-anomalies-part1-1

BB3 advertises the AS 54 prefixes to R4, and BB1 advertises the same set of prefixes to R6. R4 and R6 exchange this information; each route reflector prefers its directly connected exit point and advertises that best path to its route-reflector clients. R4 sends its best paths to R1 and R2, and those clients install best paths with a next hop of R4:

Rack1R2#show ip bgp  
BGP table version is 22, local router ID is 150.1.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i28.119.16.0/24   150.1.4.4                0    100      0 54 i
*>i28.119.17.0/24   150.1.4.4                0    100      0 54 i
*>i112.0.0.0        150.1.4.4                0    100      0 54 50 60 i
*>i113.0.0.0        150.1.4.4                0    100      0 54 50 60 i
*>i114.0.0.0        150.1.4.4                0    100      0 54 i
*>i115.0.0.0        150.1.4.4                0    100      0 54 i
*>i116.0.0.0        150.1.4.4                0    100      0 54 i
*>i117.0.0.0        150.1.4.4                0    100      0 54 i
*>i118.0.0.0        150.1.4.4                0    100      0 54 i
*>i119.0.0.0        150.1.4.4                0    100      0 54 i

R5 receives its best paths from R6, which prefers the exit point via BB1. Thus, the best paths on R5 point toward R6:

Rack1R5#show ip bgp 
BGP table version is 22, local router ID is 150.1.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i28.119.16.0/24   150.1.6.6                0    100      0 54 i
*>i28.119.17.0/24   150.1.6.6                0    100      0 54 i
*>i112.0.0.0        150.1.6.6                0    100      0 54 50 60 i
*>i113.0.0.0        150.1.6.6                0    100      0 54 50 60 i
*>i114.0.0.0        150.1.6.6                0    100      0 54 i
*>i115.0.0.0        150.1.6.6                0    100      0 54 i
*>i116.0.0.0        150.1.6.6                0    100      0 54 i
*>i117.0.0.0        150.1.6.6                0    100      0 54 i
*>i118.0.0.0        150.1.6.6                0    100      0 54 i
*>i119.0.0.0        150.1.6.6                0    100      0 54 i
*>i139.1.0.0        150.1.6.6                0    100      0 i

Since R5 has to traverse R1 or R2 to reach R6, while R1 and R2 have to traverse R5 to reach R4, we have a routing loop:

Rack1SW3#traceroute 28.119.16.1

Type escape sequence to abort.
Tracing the route to 28.119.16.1

1 139.1.11.1 1004 msec 0 msec 4 msec
2 150.1.2.2 36 msec 32 msec 36 msec
3 139.1.25.5 64 msec 60 msec 64 msec
4 139.1.25.2 56 msec 56 msec 52 msec
5 139.1.25.5 84 msec 80 msec 84 msec
6 139.1.25.2 76 msec 76 msec 72 msec
7 139.1.25.5 104 msec 104 msec 100 msec
8 139.1.25.2 96 msec 96 msec 96 msec
9 139.1.25.5 136 msec 120 msec 124 msec
10 139.1.25.2 116 msec
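The ping-pong between R2 and R5 seen in the traceroute can be reproduced with a toy hop-by-hop forwarding model in Python. The sketch assumes a simplified chain topology, R6 - R2 - R5 - R4, with unit link costs; the real lab topology has more routers and different metrics, so treat this only as an illustration of the mechanism.

```python
from collections import deque

# Assumed, simplified IGP adjacency: a chain R6 - R2 - R5 - R4.
# It keeps only the property the text describes: R5 sits between R2 and R4,
# and R2 sits between R5 and R6.
LINKS = {"R6": ["R2"], "R2": ["R6", "R5"], "R5": ["R2", "R4"], "R4": ["R5"]}

# BGP next hop each router selected for the AS 54 prefixes.
BGP_NEXT_HOP = {"R2": "R4", "R5": "R6", "R4": "BB3", "R6": "BB1"}

def igp_next_hop(src, dst):
    """First hop on a shortest path from src to dst (BFS, unit link costs)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nbr in LINKS[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    hop = dst
    while parent[hop] != src:
        hop = parent[hop]
    return hop

def trace(start, max_hops=8):
    """Hop-by-hop forwarding: every router consults its own BGP next hop."""
    path, node = [start], start
    for _ in range(max_hops):
        target = BGP_NEXT_HOP[node]
        if target not in LINKS:      # external neighbor: packet leaves the AS
            path.append(target)
            return path
        node = igp_next_hop(node, target)
        path.append(node)
    return path

print(trace("R2"))  # alternates between R2 and R5 until max_hops is reached
print(trace("R4"))  # ['R4', 'BB3']
```

R2 hands the packet to R5 on its way to R4, and R5 hands it straight back on its way to R6, so the packet never reaches either exit.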

The best way to avoid these routing loops is to make iBGP sessions closely follow the physical topology, as illustrated in the diagram below:

bgp-anomalies-part1-2

Another solution would be to adjust the topology to follow the iBGP peering sessions. For example, we could configure a GRE tunnel between R5 and R6 and exchange BGP routes over it. This results in suboptimal routing but prevents routing loops. Of course, this is not the recommended solution. However, the use of tunneling to resolve this issue prompts another idea: using MPLS forwarding and a BGP-free core.

We are not going to illustrate this well-known concept here, but simply point out that PE routers label-encapsulate IP packets routed toward BGP prefixes using the MPLS labels for the BGP next hops. Actual packet forwarding follows the shortest IGP paths (or MPLS TE paths), and no intermediate router steers packets according to its own BGP routing table. Effectively, you may place a route reflector anywhere in the topology and peer your PE routers however you prefer: routing inside the AS is no longer based on BGP. However, from a purely logical perspective, it still makes sense to group RR clusters based on geographical proximity.
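A minimal Python sketch of the idea, assuming a simplified chain topology R6 - R2 - R5 - R4 with unit link costs (an illustration, not the exact lab topology): only the ingress router consults BGP, and the label-switched path is modeled as the IGP shortest path to the BGP next hop.

```python
from collections import deque

# Assumed, simplified IGP adjacency: a chain R6 - R2 - R5 - R4.
LINKS = {"R6": ["R2"], "R2": ["R6", "R5"], "R5": ["R2", "R4"], "R4": ["R5"]}

# BGP next hops chosen by the two edge routers; they still disagree.
BGP_NEXT_HOP = {"R2": "R4", "R5": "R6"}

def shortest_path(src, dst):
    """Shortest IGP path from src to dst (BFS, unit link costs)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in LINKS[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return path[::-1]

def label_switched_trace(ingress):
    """Only the ingress router consults BGP; transit routers follow the
    label for the BGP next hop, modeled here as the IGP shortest path."""
    return shortest_path(ingress, BGP_NEXT_HOP[ingress])

print(label_switched_trace("R2"))  # ['R2', 'R5', 'R4']
print(label_switched_trace("R5"))  # ['R5', 'R2', 'R6']
```

Even though R2 and R5 prefer different exits, each packet reaches the exit chosen at ingress, because transit routers never re-evaluate the destination against their own BGP tables.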

To be continued

In the next blog post in this series, we will review situations where BGP gets stuck with permanently oscillating routes, resulting in continuous prefix advertisements and withdrawals. We will see how dangerous the BGP MED attribute can be and explain the rationale behind the Cisco IOS commands bgp always-compare-med and bgp deterministic-med.

Petr Lapukhov, 4xCCIE/CCDE