In this series of posts, we are going to review some interesting topics illustrating unexpected behavior of the BGP routing protocol. BGP may seem to be a robust and stable protocol; however, the way it was designed inherently allows some anomalies in optimal route selection. The main reason is that BGP is a path-vector protocol: much like a distance-vector protocol, but with optimal route selection based on policies rather than simple additive metrics.

The fact that BGP is mainly used for Inter-AS routing results in different routing policies used inside every AS. When those different policies come to interact, the resulting behavior might not be the same as expected by individual policy developers. For example, prepending the AS_PATH attribute may not result in proper global path manipulation if an upstream AS performs additional prepending.
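The prepending example can be made concrete with a small Python sketch. The ASNs and the prepend counts are hypothetical, and the remote AS is assumed to compare AS_PATH length only (all earlier best-path criteria tie).

```python
def prepend(as_path, asn, times=1):
    """Return a new AS_PATH with `asn` added to the front `times` times."""
    return [asn] * times + list(as_path)

OUR_AS, UPSTREAM_A, UPSTREAM_B = 65001, 65010, 65020  # hypothetical ASNs

# Our policy: advertise normally to B, but prepend 3 extra copies toward A
# so that remote networks prefer the path through B.
to_a = prepend([], OUR_AS, 1 + 3)
to_b = prepend([], OUR_AS, 1)

# Upstream policies we do not control: A passes the route on unchanged,
# while B prepends itself 4 extra times for its own reasons.
seen_via_a = prepend(to_a, UPSTREAM_A, 1)
seen_via_b = prepend(to_b, UPSTREAM_B, 1 + 4)

# A remote AS that compares only AS_PATH length picks the shorter path.
best = min((seen_via_a, seen_via_b), key=len)
print(len(seen_via_a), len(seen_via_b), best[0])  # 5 6 65010
```

Despite our prepending toward A, the path through A is still the shorter one, so the intended traffic shift toward B does not happen.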

In addition, BGP was designed for inter-AS loop detection based on the AS_PATH attribute and therefore cannot detect intra-AS routing loops. Ideally, intra-AS routing loops would be prevented by a full mesh of BGP peering sessions between all routers in the AS. However, a full mesh does not scale to a large number of BGP routers. The known solutions to this problem, Route Reflectors and BGP Confederations, keep BGP speakers from having full information on all potential AS exit points, because only the result of the best-path selection process is propagated. This unavoidable loss of information may result in suboptimal routing or routing loops, as illustrated below.
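As a rough illustration of the two points above, here is a minimal Python sketch of the AS_PATH loop check and of how the full-mesh session count grows. The function names and the ASN are illustrative, not taken from any BGP implementation.

```python
def accepts_update(local_as, as_path):
    """AS_PATH loop check: drop the update if our own ASN already appears."""
    return local_as not in as_path

def full_mesh_sessions(routers):
    """Number of iBGP sessions needed to fully mesh `routers` speakers."""
    return routers * (routers - 1) // 2

LOCAL_AS = 65001  # hypothetical ASN

# An update that has already passed through our AS is rejected.
print(accepts_update(LOCAL_AS, [65002, 65001, 65003]))  # False

# iBGP speakers do not add the local ASN when advertising to each other,
# so the same check says nothing about loops formed inside the AS.
print(accepts_update(LOCAL_AS, [65002, 65003]))  # True

# Session count grows quadratically with the number of routers.
for n in (6, 50, 200):
    print(n, full_mesh_sessions(n))
```

For 6, 50 and 200 routers the mesh needs 15, 1225 and 19900 sessions respectively, which is why large networks turn to Route Reflectors or Confederations.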

BGP RRs and Intra-AS Routing Loops

As mentioned above, a full mesh of BGP peering sessions eliminates intra-AS routing loops. However, using Route Reflectors (RRs), a common solution to the full-mesh problem, does not result in the same behavior: RRs propagate only best paths to their clients, thus hiding the complete routing information from edge routers. This may result in inconsistent best-path selection by clients and end in routing loops. A known design rule for avoiding this is to place Route Reflectors along the packet forwarding paths between the RR clients in different clusters. This also translates into the design principle that iBGP peering sessions should closely follow the physical (geographical) topology.
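The information hiding described above can be sketched in a few lines of Python. The exit names and IGP costs are invented, and the sketch assumes every other BGP attribute ties, so the IGP cost to the next hop decides the best path.

```python
def best_by_igp(paths, igp_cost):
    """Pick the exit with the lowest IGP cost from this router's viewpoint."""
    return min(paths, key=lambda exit_point: (igp_cost[exit_point], exit_point))

paths = ["exit_A", "exit_B"]                  # two exits for the same prefix
rr_cost = {"exit_A": 10, "exit_B": 30}        # hypothetical costs seen by the RR
client_cost = {"exit_A": 40, "exit_B": 5}     # hypothetical costs seen by a client

# Full mesh: the client sees both paths and picks its own closest exit.
full_mesh_choice = best_by_igp(paths, client_cost)

# Route reflection: the RR runs best-path selection first and reflects
# only its own winner, so the client has nothing else to choose from.
reflected = [best_by_igp(paths, rr_cost)]
rr_client_choice = best_by_igp(reflected, client_cost)

print(full_mesh_choice, rr_client_choice)  # exit_B exit_A
```

The client ends up using the exit that is best for the RR rather than for itself, which is harmless only when the two share roughly the same view of the topology.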

Here is an example of what could happen in the situation where this rule is not observed. Look at the topology below, where R5 peers with the RR that is not the one closest to it in terms of IGP metrics. At the same time, R1 and R2 peer with another RR, and R5 is on the forwarding path between R1, R2 and R4. The problem here is that R5 receives external BGP prefixes from a different RR than R1 and R2 use. Thus, the exit point that R1 and R2 consider optimal may not be optimal for R5. Here is what happens:


BB3 advertises AS54 prefixes to R4, and BB1 advertises the same set of prefixes to R6. R4 and R6 exchange this information, and each route reflector prefers its directly connected exit point and advertises that best path to its clients. R4 sends its best paths to R1 and R2, and those clients install best paths with the next hop of R4:

Rack1R2#show ip bgp  
BGP table version is 22, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i28.119.16.0/24                0    100      0 54 i
*>i28.119.17.0/24                0    100      0 54 i
*>i112.0.0.0                0    100      0 54 50 60 i
*>i113.0.0.0                0    100      0 54 50 60 i
*>i114.0.0.0                0    100      0 54 i
*>i115.0.0.0                0    100      0 54 i
*>i116.0.0.0                0    100      0 54 i
*>i117.0.0.0                0    100      0 54 i
*>i118.0.0.0                0    100      0 54 i
*>i119.0.0.0                0    100      0 54 i

And R5 receives the best paths from R6, which prefers the exit point via BB1. Thus, the best-paths in R5 would point toward R6:

Rack1R5#show ip bgp 
BGP table version is 22, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i28.119.16.0/24                0    100      0 54 i
*>i28.119.17.0/24                0    100      0 54 i
*>i112.0.0.0                0    100      0 54 50 60 i
*>i113.0.0.0                0    100      0 54 50 60 i
*>i114.0.0.0                0    100      0 54 i
*>i115.0.0.0                0    100      0 54 i
*>i116.0.0.0                0    100      0 54 i
*>i117.0.0.0                0    100      0 54 i
*>i118.0.0.0                0    100      0 54 i
*>i119.0.0.0                0    100      0 54 i
*>i139.1.0.0                0    100      0 i

And since R5 has to traverse R1 or R2 to reach R6 and R1 and R2 have to traverse R5 to get to R4, we have a routing loop:


Type escape sequence to abort.
Tracing the route to

  1 1004 msec 0 msec 4 msec
  2 36 msec 32 msec 36 msec
  3 64 msec 60 msec 64 msec
  4 56 msec 56 msec 52 msec
  5 84 msec 80 msec 84 msec
  6 76 msec 76 msec 72 msec
  7 104 msec 104 msec 100 msec
  8 96 msec 96 msec 96 msec
  9 136 msec 120 msec 124 msec
 10 116 msec
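The ping-pong pattern in the trace above can be reproduced with a toy hop-by-hop forwarding model in Python. It assumes a simplified chain R6 - R1 - R5 - R4 (R2 and R3 are left out), with each router first choosing a BGP next hop and then following its IGP next hop toward it; the tables are hand-written for this sketch, not taken from the lab.

```python
# IGP next hop toward each exit router, for the assumed chain R6 - R1 - R5 - R4.
igp_next_hop = {
    "R1": {"R4": "R5", "R6": "R6"},
    "R5": {"R4": "R4", "R6": "R1"},
}

def trace(start, bgp_next_hop, max_hops=10):
    """Follow a packet hop by hop; return (hops, looped)."""
    hops, node = [], start
    for _ in range(max_hops):
        if node not in bgp_next_hop:      # reached an exit router (R4 or R6)
            return hops, False
        node = igp_next_hop[node][bgp_next_hop[node]]
        hops.append(node)
    return hops, True                     # hop budget exhausted: a loop

# Mismatched reflectors: R1 learned the exit via R4, R5 learned it via R6.
print(trace("R1", {"R1": "R4", "R5": "R6"}))

# Consistent view: both routers use R4 as the exit.
print(trace("R1", {"R1": "R4", "R5": "R4"}))
```

In the first call the packet bounces between R5 and R1 until the hop budget runs out; in the second it reaches R4 in two hops.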

The best way to avoid these routing loops is to make iBGP sessions closely follow the physical topology, as illustrated in the diagram below:


Another solution would be to adjust the topology to follow the iBGP peering sessions. For example, we could configure a GRE tunnel between R5 and R6 and exchange BGP routes over it. This will result in suboptimal routing but will prevent routing loops. Of course, this is not the recommended solution. However, the use of tunneling to resolve this issue prompts another idea: using MPLS forwarding and a BGP free core.

We are not going to illustrate this well-known concept here, but simply point to the fact that PE routers label-encapsulate IP packets routed towards BGP prefixes using MPLS labels for BGP next-hops. The actual packet forwarding is based on shortest IGP paths (or MPLS TE paths) and there are no intermediate routers that may steer packets according to BGP routing tables. Effectively, you may place a route reflector anywhere in the topology and peer your PE routers however you prefer – the optimum routing inside the AS is not based on BGP anymore. However, just from the logical perspective, it still makes sense to group RR clusters based on geographical proximity.
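A minimal Python sketch of why a BGP-free core removes the loop: the ingress router resolves the BGP next hop once, and every router after it forwards toward that egress using IGP information only. The chain R6 - R1 - R5 - R4 is an assumed simplification of the topology in this post, and label allocation and swapping are abstracted away entirely.

```python
# IGP next hop toward each egress router, for the assumed chain R6 - R1 - R5 - R4.
igp_next_hop = {
    "R1": {"R4": "R5", "R6": "R6"},
    "R5": {"R4": "R4", "R6": "R1"},
}

def label_switched_path(ingress, egress, max_hops=10):
    """Forward toward `egress` using IGP next hops only (stand-in for an LSP)."""
    hops, node = [], ingress
    while node != egress and len(hops) < max_hops:
        node = igp_next_hop[node][egress]   # no BGP lookup at transit routers
        hops.append(node)
    return hops

# R1 is the ingress PE and has chosen R4 as the BGP next hop. R5 is only a
# transit router here, so whichever exit R5 itself would prefer is irrelevant.
print(label_switched_path("R1", "R4"))  # ['R5', 'R4']
```

Because the exit decision is made exactly once, at the ingress, inconsistent BGP views on transit routers can no longer bend the packet back.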

To be continued

In the next blog post in this series, we will review situations in which BGP gets stuck with permanently oscillating routes, resulting in continuous prefix advertisements and withdrawals. We will see how dangerous the BGP MED attribute can be and explain the rationale behind the Cisco IOS commands bgp always-compare-med and bgp deterministic-med.

About Petr Lapukhov, 4xCCIE/CCDE:

Petr Lapukhov's career in IT began in 1988 with a focus on computer programming, and progressed into networking with his first exposure to Novell NetWare in 1991. Initially involved with Kazan State University's campus network support and UNIX system administration, he went on to become a networking consultant, taking part in many network deployment projects. Petr currently has over 12 years of experience working in the Cisco networking field, and is the only person in the world to have obtained four CCIEs in under two years, passing each on his first attempt. Petr is an exceptional case in that he has been working with all of the technologies covered in his four CCIE tracks (R&S, Security, SP, and Voice) on a daily basis for many years. When not actively teaching classes, he is developing self-paced products, studying for the CCDE Practical and the CCIE Storage Lab Exam, and completing his PhD in Applied Mathematics.


7 Responses to “Anomalies in BGP: Part I”

  1. Bob says:

    Could you paint a scenario where you get a routing loop even if all RR-clients peer with all RRs? Or is that not going to happen then?

    That is, what if R1,R2,R3,R5 all had BGP peering to R4 and R6?

  2. Kaushik Randeriya says:

    Could you please tell us how you injected routes into BB1 and BB3? Which IGP protocol did you use?

    If possible can you post diagram with IP information?

  3. Marcio says:

    I think the best design for this topology is to move the RRs from R4 and R6 to R5 and R3 and then peer the remaining routers with them; for example, R5 (RR) peers with R4 and R2, and R3 (RR) peers with R6 and R1. Would this be a better design?

  4. ODIN says:

    Maybe it is not a good idea to use 2 different iBGP clusters in one AS. If not MPLS, then a topology-aligned confederation and, if necessary, one cluster with 2 redundant reflectors for each private AS inside the confederation.

  5. > Kaushik

    This problem is independent of the IGP used. BB1 and BB3 are running iBGP with each other and advertising routes into BGP. It doesn’t matter what protocol they use, but you can see their specific configs here: http://www.ine.com/downloads/bb1.txt and http://www.ine.com/downloads/bb3.txt

  6. > Odin,

    You would get this same result if you used confederation. The problem is that the peerings don’t follow the physical topology.

  7. Ulrich says:

    This is the output I got from CCIEv5 RS TSHOOT lab 2 (question 3). I went through a lot of documents explaining BGP path selection and I still cannot explain why BGP chooses the ‘internal’ route over the ‘external’ one, even if I set the weight higher.

    R18#sh bgp vpnv4 u all
    BGP routing table entry for 65000:65000:, version 26
    Paths: (2 available, best #2, table SiteA)
    Advertised to update-groups:
    Refresh Epoch 1
    65004 (via vrf SiteA) from (
    Origin incomplete, metric 0, localpref 100, weight 1000, valid, external
    Extended Community: RT:65000:65000
    rx pathid: 0, tx pathid: 0
    Refresh Epoch 1
    Local (metric 2) (via default) from (
    Origin incomplete, metric 117760, localpref 100, valid, internal, best
    Extended Community: RT:65000:65000 Cost:pre-bestpath:128:117760
    0x8800:32768:0 0x8801:56:5632 0x8802:65283:2560 0x8803:65281:1400
    mpls labels in/out nolabel/16
    rx pathid: 0, tx pathid: 0x0

