In this series of posts, we are going to review some interesting topics illustrating unexpected behavior of the BGP routing protocol. BGP may seem to be a robust and stable protocol; however, the way it was designed inherently allows some anomalies in optimal route selection. The main reason for this is that BGP is a path-vector protocol: it behaves much like a distance-vector protocol, but selects optimal routes based on policies rather than simple additive metrics.
The fact that BGP is mainly used for Inter-AS routing results in different routing policies used inside every AS. When those different policies come to interact, the resulting behavior might not be the same as expected by individual policy developers. For example, prepending the AS_PATH attribute may not result in proper global path manipulation if an upstream AS performs additional prepending.
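The prepending example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the AS numbers (100, 200, 300), the prepend counts, and the helper `advertise` are all hypothetical, and the remote AS is assumed to tie on every best-path step that precedes AS_PATH length.

```python
def advertise(as_path, local_as, prepend=0):
    """Return the AS_PATH a neighbor sees after local_as exports the route.

    The exporting AS always adds itself once; `prepend` adds extra copies.
    """
    return [local_as] * (1 + prepend) + list(as_path)


# Hypothetical origin AS 100 with two upstreams: AS 200 (preferred) and AS 300.
# The origin prepends twice toward AS 300 to make that path look longer.
to_as200 = advertise([], 100)             # [100]
to_as300 = advertise([], 100, prepend=2)  # [100, 100, 100]

# Case 1: neither upstream prepends; the remote AS prefers the path via AS 200.
via_200 = advertise(to_as200, 200)
via_300 = advertise(to_as300, 300)
print(len(via_200), len(via_300))  # 2 4

# Case 2: AS 200 applies its own policy and prepends three more times.
# The path via AS 300 is now shorter, defeating the origin's intent.
via_200_prepended = advertise(to_as200, 200, prepend=3)
print(len(via_200_prepended), len(via_300))  # 5 4
```

The origin AS has no visibility into the upstream's export policy, so it cannot predict which of the two cases the remote AS will actually see.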
In addition to that, BGP was designed for inter-AS loop detection based on the AS_PATH attribute and therefore cannot detect intra-AS routing loops. Ideally, intra-AS routing loops could be prevented by ensuring a full mesh of BGP peering between all routers in the AS. However, a full mesh is impractical for a large number of BGP routers, because the number of sessions grows quadratically with the number of routers. Known solutions to this problem – Route Reflectors and BGP Confederations – advertise only their own best paths, which prevents BGP speakers from having full information on all potential AS exit points. This unavoidable loss of information may result in suboptimal routing or routing loops, as illustrated below.
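As a side note on the scaling argument, here is a rough count of iBGP sessions. The function names are made up for this sketch, and the route-reflector case assumes the simplest possible design: a single RR with every other router as its client.

```python
def full_mesh_sessions(n: int) -> int:
    """Number of iBGP sessions when every pair of n routers peers directly."""
    return n * (n - 1) // 2


def single_rr_sessions(n: int) -> int:
    """Number of iBGP sessions with one route reflector and n - 1 clients."""
    return n - 1


for n in (5, 50, 500):
    print(n, full_mesh_sessions(n), single_rr_sessions(n))
# 5 10 4
# 50 1225 49
# 500 124750 499
```

Real designs use redundant RRs and several clusters, so the second number is a lower bound rather than a realistic figure.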
BGP RRs and Intra-AS Routing Loops
As mentioned above, a full mesh of BGP peering sessions eliminates intra-AS routing loops. However, using Route Reflectors (RRs), a common solution to the full-mesh problem, will not result in the same behavior: RRs propagate only their best paths to the clients, thus hiding the complete routing information from edge routers. This may result in inconsistent best-path selection by clients and end up in routing loops. A known design rule used to avoid this is to place Route Reflectors along the packet forwarding paths between the RR clients in different clusters. This also translates into the design principle where iBGP peering sessions closely follow the physical (geographical) topology.
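To make the information-hiding effect concrete, here is a minimal sketch. The router names, exit names, and IGP costs are invented for illustration, and it assumes all BGP attributes tie, so the lowest IGP metric to the next hop decides the best path.

```python
# Hypothetical IGP cost from each router to each AS exit point.
IGP_COST = {
    "RR":     {"ExitA": 1, "ExitB": 5},
    "Client": {"ExitA": 6, "ExitB": 2},
}


def best_exit(router, candidates):
    """Pick the candidate exit with the lowest IGP cost (name breaks ties)."""
    return min(candidates, key=lambda exit_: (IGP_COST[router][exit_], exit_))


all_exits = ["ExitA", "ExitB"]

# Full mesh: the client learns both paths and picks its own closest exit.
full_mesh_choice = best_exit("Client", all_exits)

# Route reflection: the RR picks its own best path and reflects only that one,
# so the client has a single candidate to choose from.
rr_best = best_exit("RR", all_exits)
rr_client_choice = best_exit("Client", [rr_best])

print(full_mesh_choice)   # ExitB
print(rr_client_choice)   # ExitA
```

The client ends up using the exit that is best from the RR's point of view, which is why RR placement relative to the forwarding path matters.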
Here is an example of what could happen in the situation where this rule is not observed. Look at the topology below, where R5 peers with the RR that is not the one closest to it in terms of IGP metrics. At the same time, R1 and R2 peer with another RR, and R5 is on the forwarding path between R1, R2 and R4. The problem here is that R5 receives external BGP prefixes from a different RR than R1 and R2 use. Thus, the exit point that R1 and R2 consider optimal may not be optimal for R5. Here is what happens:
BB3 advertises AS54 prefixes to R4, and BB1 advertises the same set of prefixes to R6. R4 and R6 exchange this information; each route reflector prefers its directly connected exit point and advertises that best path to its route-reflector clients. R4 sends its best paths to R1 and R2, and those clients install best paths with the next hop of R4:
Rack1R2#show ip bgp
BGP table version is 22, local router ID is 184.108.40.206
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network             Next Hop           Metric LocPrf Weight Path
*>i220.127.116.11/24   18.104.22.168           0    100      0 54 i
*>i22.214.171.124/24   126.96.36.199           0    100      0 54 i
*>i188.8.131.52        184.108.40.206          0    100      0 54 50 60 i
*>i220.127.116.11      18.104.22.168           0    100      0 54 50 60 i
*>i22.214.171.124      126.96.36.199           0    100      0 54 i
*>i188.8.131.52        184.108.40.206          0    100      0 54 i
*>i220.127.116.11      18.104.22.168           0    100      0 54 i
*>i22.214.171.124      126.96.36.199           0    100      0 54 i
*>i188.8.131.52        184.108.40.206          0    100      0 54 i
*>i220.127.116.11      18.104.22.168           0    100      0 54 i
And R5 receives the best paths from R6, which prefers the exit point via BB1. Thus, the best-paths in R5 would point toward R6:
Rack1R5#show ip bgp
BGP table version is 22, local router ID is 22.214.171.124
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network             Next Hop           Metric LocPrf Weight Path
*>i126.96.36.199/24    188.8.131.52            0    100      0 54 i
*>i184.108.40.206/24   220.127.116.11          0    100      0 54 i
*>i18.104.22.168       22.214.171.124          0    100      0 54 50 60 i
*>i126.96.36.199       188.8.131.52            0    100      0 54 50 60 i
*>i184.108.40.206      220.127.116.11          0    100      0 54 i
*>i18.104.22.168       22.214.171.124          0    100      0 54 i
*>i126.96.36.199       188.8.131.52            0    100      0 54 i
*>i184.108.40.206      220.127.116.11          0    100      0 54 i
*>i18.104.22.168       22.214.171.124          0    100      0 54 i
*>i126.96.36.199       188.8.131.52            0    100      0 54 i
*>i184.108.40.206      220.127.116.11          0    100      0 i
And since R5 has to traverse R1 or R2 to reach R6 and R1 and R2 have to traverse R5 to get to R4, we have a routing loop:
Rack1SW3#traceroute 18.104.22.168

Type escape sequence to abort.
Tracing the route to 22.214.171.124

  1 126.96.36.199 1004 msec 0 msec 4 msec
  2 188.8.131.52 36 msec 32 msec 36 msec
  3 184.108.40.206 64 msec 60 msec 64 msec
  4 220.127.116.11 56 msec 56 msec 52 msec
  5 18.104.22.168 84 msec 80 msec 84 msec
  6 22.214.171.124 76 msec 76 msec 72 msec
  7 126.96.36.199 104 msec 104 msec 100 msec
  8 188.8.131.52 96 msec 96 msec 96 msec
  9 184.108.40.206 136 msec 120 msec 124 msec
 10 220.127.116.11 116 msec
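The mechanics of this loop can be reproduced with a small hop-by-hop forwarding model. The sketch below is not the lab topology: the links and unit IGP costs are assumptions chosen only to match the description (R5 sits between R1/R2 and R4, and R5 reaches R6 through R1 or R2), and the function names are made up for this example.

```python
import heapq

# Hypothetical links (router, router, IGP cost); not the actual lab topology.
LINKS = [("R6", "R1", 1), ("R6", "R2", 1), ("R1", "R5", 1),
         ("R2", "R5", 1), ("R5", "R4", 1)]


def build_graph(links):
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def igp_next_hop(graph, src, dst):
    """Neighbor of src on a shortest IGP path to dst (Dijkstra rooted at dst)."""
    dist = {dst: 0}
    heap = [(0, dst)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    # Lowest total cost wins; the router name breaks ties deterministically.
    return min(graph[src], key=lambda n: (graph[src][n] + dist[n], n))


def trace(graph, bgp_next_hop, start):
    """Forward hop by hop; every router consults its own BGP next hop.

    Returns (path, looped). A router that is its own BGP next hop is an
    exit point and hands the packet to its eBGP peer.
    """
    path, node = [start], start
    while bgp_next_hop[node] != node:
        node = igp_next_hop(graph, node, bgp_next_hop[node])
        path.append(node)
        if node in path[:-1]:
            return path, True
    return path, False


GRAPH = build_graph(LINKS)

# R1 and R2 are clients of R4; R5 is a client of R6 (as in the post).
mismatched = {"R1": "R4", "R2": "R4", "R5": "R6", "R4": "R4", "R6": "R6"}
print(trace(GRAPH, mismatched, "R1"))  # (['R1', 'R5', 'R1'], True)

# If R5 instead learned its paths from R4, forwarding would be consistent.
consistent = dict(mismatched, R5="R4")
print(trace(GRAPH, consistent, "R1"))  # (['R1', 'R5', 'R4'], False)
```

The loop appears only because each router on the path makes its own BGP lookup and the routers disagree on the exit point.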
The best way to avoid these routing loops is to make iBGP sessions closely follow the physical topology, as illustrated in the diagram below:
Another solution would be to adjust the topology to follow the iBGP peering sessions. For example, we could configure a GRE tunnel between R5 and R6 and exchange BGP routes over it. This will result in suboptimal routing but will prevent routing loops. Of course, this is not the recommended solution. However, the use of tunneling to resolve this issue prompts another idea: using MPLS forwarding and a BGP free core.
We are not going to illustrate this well-known concept here, but simply point to the fact that PE routers label-encapsulate IP packets routed towards BGP prefixes using MPLS labels for BGP next-hops. The actual packet forwarding is based on shortest IGP paths (or MPLS TE paths) and there are no intermediate routers that may steer packets according to BGP routing tables. Effectively, you may place a route reflector anywhere in the topology and peer your PE routers however you prefer – the optimum routing inside the AS is not based on BGP anymore. However, just from the logical perspective, it still makes sense to group RR clusters based on geographical proximity.
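For readers who prefer to see the idea in code, here is a rough model of forwarding over a BGP-free core. The topology, unit costs, and function names are hypothetical, and the label-switched path is simply modeled as the IGP shortest path from the ingress router to its BGP next hop; label allocation and distribution are not modeled at all.

```python
import heapq

# Hypothetical links (router, router, IGP cost), chosen for illustration only.
LINKS = [("R6", "R1", 1), ("R6", "R2", 1), ("R1", "R5", 1),
         ("R2", "R5", 1), ("R5", "R4", 1)]


def build_graph(links):
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def shortest_path(graph, src, dst):
    """Dijkstra from src; returns the list of routers from src to dst."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                prev[nbr] = node
                heapq.heappush(heap, (d + cost, nbr))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def label_switched_path(graph, bgp_next_hop, ingress):
    """Only the ingress router does a BGP lookup.

    Transit routers forward on the label for the egress router, so their
    own BGP view (if any) never influences the packet.
    """
    return shortest_path(graph, ingress, bgp_next_hop[ingress])


GRAPH = build_graph(LINKS)

# Same mismatched RR peering as in the loop example: R1 exits via R4,
# while R5 exits via R6.
BGP_NEXT_HOP = {"R1": "R4", "R2": "R4", "R5": "R6"}

print(label_switched_path(GRAPH, BGP_NEXT_HOP, "R1"))  # ['R1', 'R5', 'R4']
print(label_switched_path(GRAPH, BGP_NEXT_HOP, "R5"))  # ['R5', 'R1', 'R6']
```

In this model, traffic entering at R1 crosses R5 without R5's BGP preference for R6 being consulted, so the packet reaches R4. Traffic entering at R5 still takes a longer path to R6 than it would to R4, so the exit choice can remain suboptimal even though the loop is gone.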
To be continued
In the next blog post from this series we will review situations when BGP gets stuck with permanently oscillating routes, resulting in continuous prefix advertisements and withdrawals. We will see how dangerous the BGP MED attribute can be and explain the rationale behind the Cisco IOS commands bgp always-compare-med and bgp deterministic-med.
About Petr Lapukhov, 4xCCIE/CCDE:
Petr Lapukhov's career in IT began in 1988 with a focus on computer programming, and progressed into networking with his first exposure to Novell NetWare in 1991. Initially involved with Kazan State University's campus network support and UNIX system administration, he went through the path of becoming a networking consultant, taking part in many network deployment projects. Petr currently has over 12 years of experience working in the Cisco networking field, and is the only person in the world to have obtained four CCIEs in under two years, passing each on his first attempt. Petr is an exceptional case in that he has been working with all of the technologies covered in his four CCIE tracks (R&S, Security, SP, and Voice) on a daily basis for many years. When not actively teaching classes, he is developing self-paced products, studying for the CCDE Practical & the CCIE Storage Lab Exam, and completing his PhD in Applied Mathematics.