May 10

Edit: Thanks for playing! You can find the official answer and explanation here.

I had an interesting question come across my desk today which involved a very common area of confusion in OSPF routing logic, and now I'm posing this question to you as a challenge!

The first person to answer correctly will get free attendance to our upcoming CCIE Routing & Switching Lab Cram Session, which runs the week of June 1st 2015, as well as a free copy of the class in download format after it is complete.  The question is as follows:

Given the below topology, where R4 mutually redistributes between EIGRP and OSPF, which path(s) will R1 choose to reach the network 5.5.5.5/32, and why?

Bonus Questions:

  • What will R2's path selection to 5.5.5.5/32 be, and why?
  • What will R3's path selection to 5.5.5.5/32 be, and why?
  • Assume R3's link to R1 is lost.  Does this affect R1's path selection to 5.5.5.5/32? If so, how?

Tomorrow I'll post the topology and config files for CSR1000v, VIRL, GNS3, etc., so you can try this out yourself. But first, answer the question without seeing the result and see if your expected result matches the actual result!


Good luck everyone!

Oct 28

INE's long awaited CCIE Service Provider Advanced Technologies Class is now available! But first, congratulations to Tedhi Achdiana who just passed the CCIE Service Provider Lab Exam! Here's what Tedhi had to say about his preparation:

Finally I passed my CCIE Service Provider Lab exam in Hong Kong on Oct 17, 2011. I used your CCIE Service Provider Printed Materials Bundle. This product gave me a deep understanding of how Service Provider technology works, so it didn't matter when Cisco changed the SP Blueprint. You just need to practice with IOS XR and find the similar commands on the IOS platform.

Thanks to INE, and keep up the good work!

Tedhi Achdiana
CCIE#30949 - Service Provider

The CCIE Service Provider Advanced Technologies Class covers the newest CCIE SP Version 3.0 Blueprint, including the addition of IOS XR hardware. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS. Understanding the topics covered in this class will ensure that students are ready to tackle the next step in their CCIE preparation, applying the technologies themselves with INE's CCIE Service Provider Lab Workbook, and then finally taking and passing the CCIE Service Provider Lab Exam!

Streaming access is available for All Access Pass subscribers for as low as $65/month! Download access can be purchased here for $299. AAP members can additionally upgrade to the download version for $149.

Sample videos from class can be found after the break:


The detailed outline of class is as follows:

  • Introduction
  • Catalyst ME3400 Switching
  • Frame Relay / HDLC / PPP & PPPoE
  • IS-IS Overview / Level 1 & Level 2 Routing
  • IS-IS Network Types / Path Selection & Route Leaking
  • IS-IS Route Leaking on IOS XR / IOS XR Routing Policy Language (RPL)
  • IS-IS IPv6 Routing / IS-IS Multi Topology
  • MPLS Overview / LDP Overview
  • Basic MPLS Configuration
  • MPLS Tunnels
  • MPLS Layer 3 VPN (L3VPN) Overview / VRF Overview
  • VPNv4 BGP Overview / Route Distinguishers vs. Route Targets
  • Basic MPLS L3VPN Configuration
  • MPLS L3VPN Verification & Troubleshooting
  • VPNv4 Route Reflectors
  • BGP PE-CE Routing / BGP AS Override
  • RIP PE-CE Routing
  • EIGRP PE-CE Routing
  • OSPF PE-CE Routing / OSPF Domain IDs / Domain Tags & Sham Links
  • OSPF Multi VRF CE Routing
  • MPLS Central Services L3VPNs
  • IPv6 over MPLS with 6PE & 6VPE
  • Inter AS MPLS L3VPN Overview
  • Inter AS MPLS L3VPN Option A - Back-to-Back VRF Exchange Part 1
  • Inter AS MPLS L3VPN Option A - Back-to-Back VRF Exchange Part 2
  • Inter AS MPLS L3VPN Option B - ASBRs Peering VPNv4
  • Inter AS MPLS L3VPN Option C - ASBRs Peering BGP+Label Part 1
  • Inter AS MPLS L3VPN Option C - ASBRs Peering BGP+Label Part 2
  • Carrier Supporting Carrier (CSC) MPLS L3VPN
  • MPLS Layer 2 VPN (L2VPN) Overview
  • Ethernet over MPLS L2VPN AToM
  • PPP & Frame Relay over MPLS L2VPN AToM
  • MPLS L2VPN AToM Interworking
  • Virtual Private LAN Services (VPLS)
  • MPLS Traffic Engineering (TE) Overview
  • MPLS TE Configuration
  • MPLS TE on IOS XR / LDP over MPLS TE Tunnels
  • MPLS TE Fast Reroute (FRR) Link and Node Protection
  • Service Provider Multicast
  • Service Provider QoS

Additionally, completely new versions of INE's CCIE Service Provider Lab Workbook Volumes I & II are on their way and should be released before the end of the year. Stay tuned for more information on workbook and rack rental availability!

Oct 18

One of our most anticipated products of the year - INE's CCIE Service Provider v3.0 Advanced Technologies Class - is now complete!  The videos from class are in the final stages of post production and will be available for streaming and download access later this week.  Download access can be purchased here for $299.  Streaming access is available for All Access Pass subscribers for as low as $65/month!  AAP members can additionally upgrade to the download version for $149.

At roughly 40 hours, the CCIE SPv3 ATC covers the newly released CCIE Service Provider version 3 blueprint, which includes the addition of IOS XR hardware. This class includes both technology lectures and hands on configuration, verification, and troubleshooting on both regular IOS and IOS XR. Class topics include Catalyst ME3400 switching, IS-IS, OSPF, BGP, MPLS Layer 3 VPNs (L3VPN), Inter-AS MPLS L3VPNs, IPv6 over MPLS with 6PE and 6VPE, AToM and VPLS based MPLS Layer 2 VPNs (L2VPN), MPLS Traffic Engineering, Service Provider Multicast, and Service Provider QoS.

Below you can see a sample video from the class, which covers IS-IS Route Leaking and its implementation on IOS XR with the Routing Policy Language (RPL).

Apr 04

Hi Brian,

What is the major difference in using an E1 route over an E2 route in OSPF?

From what I’ve observed, if you redistribute a route into OSPF either E1 or E2, the upstream router will still use the shortest path to get to the ASBR regardless of what is shown in the routing table.

The more I read about this, the more confused I get. Am I missing something?

Matt

Hi Matt,

This is actually a very common area of confusion and misunderstanding in OSPF. Part of the problem is that the vast majority of CCNA and CCNP texts teach that, for OSPF path selection between E1 and E2 routes, E1 routes use the redistributed cost plus the cost to the ASBR, while E2 routes use only the redistributed cost. When I checked the most recent CCNP ROUTE text from Cisco Press, it specifically says that "[w]hen flooded, OSPF has little work to do to calculate the metric for an E2 route, because by definition, the E2 route’s metric is simply the metric listed in the Type 5 LSA. In other words, the OSPF routers do not add any internal OSPF cost to the metric for an E2 route." While technically true, this statement is an oversimplification. For the CCNP level this might be fine, but for the CCIE level it is not.

The key point that I'll demonstrate in this post is that while it is true that "OSPF routers do not add any internal OSPF cost to the metric for an E2 route", both the intra-area and inter-area costs are still considered in the OSPF path selection state machine for these routes.

First, let's review the order of the OSPF path selection process. Regardless of a route’s metric or administrative distance, OSPF will choose routes in the following order:

Intra-Area (O)
Inter-Area (O IA)
External Type 1 (E1)
External Type 2 (E2)
NSSA Type 1 (N1)
NSSA Type 2 (N2)
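As an illustrative sketch (not IOS code; the names here are invented for clarity), this ordering can be modeled as a rank lookup that is consulted before any metric comparison:

```python
# Hypothetical sketch of the OSPF route-type preference order.
# A lower rank always wins, regardless of metric or administrative distance.
ROUTE_TYPE_RANK = {
    "O": 0,      # Intra-Area
    "O IA": 1,   # Inter-Area
    "E1": 2,     # External Type 1
    "E2": 3,     # External Type 2
    "N1": 4,     # NSSA Type 1
    "N2": 5,     # NSSA Type 2
}

def preferred_type(type_a: str, type_b: str) -> str:
    """Return the route type OSPF prefers, metrics aside."""
    return min(type_a, type_b, key=ROUTE_TYPE_RANK.get)
```

For example, `preferred_type("E2", "E1")` returns `"E1"` even when the E1 route carries a higher metric, which is exactly the behavior demonstrated later in this post.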

To demonstrate this, take the following topology:

R1 connects to R2 and R3 via area 0. R2 and R3 connect to R4 and R5 via area 1 respectively. R4 and R5 connect to R6 via another routing domain, which is EIGRP in this case. R6 advertises the prefix 10.1.6.0/24 into EIGRP. R4 and R5 perform mutual redistribution between EIGRP and OSPF with the default parameters, as follows:

R4:
router eigrp 10
redistribute ospf 1 metric 100000 100 255 1 1500
!
router ospf 1
redistribute eigrp 10 subnets

R5:
router eigrp 10
redistribute ospf 1 metric 100000 100 255 1 1500
!
router ospf 1
redistribute eigrp 10 subnets

The result of this is that R1 learns the prefix 10.1.6.0/24 as an OSPF E2 route via both R2 and R3, with a default cost of 20. This can be seen in the routing table output below. The other OSPF learned routes are the transit links between the routers in question.

R1#sh ip route ospf
10.0.0.0/24 is subnetted, 8 subnets
O E2 10.1.6.0 [110/20] via 10.1.13.3, 00:09:43, FastEthernet0/0.13
[110/20] via 10.1.12.2, 00:09:43, FastEthernet0/0.12
O IA 10.1.24.0 [110/2] via 10.1.12.2, 00:56:44, FastEthernet0/0.12
O E2 10.1.46.0 [110/20] via 10.1.13.3, 00:09:43, FastEthernet0/0.13
[110/20] via 10.1.12.2, 00:09:43, FastEthernet0/0.12
O IA 10.1.35.0 [110/2] via 10.1.13.3, 00:56:44, FastEthernet0/0.13
O E2 10.1.56.0 [110/20] via 10.1.13.3, 00:09:43, FastEthernet0/0.13
[110/20] via 10.1.12.2, 00:09:43, FastEthernet0/0.12

Note that all the routes redistributed from EIGRP appear on R1 with a default metric of 20. Now let’s examine the details of the route 10.1.6.0/24 on R1.

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:12:03 ago
Routing Descriptor Blocks:
10.1.13.3, from 10.1.5.5, 00:12:03 ago, via FastEthernet0/0.13
Route metric is 20, traffic share count is 1
* 10.1.12.2, from 10.1.4.4, 00:12:03 ago, via FastEthernet0/0.12
Route metric is 20, traffic share count is 1

As expected, both paths, via R2 and R3, have a metric of 20. However, there is an additional field in the route’s output called the “forward metric”. This field denotes the cost to the ASBR(s). In this case, the ASBRs are R4 and R5 for the routes via R2 and R3 respectively. Since all interfaces are FastEthernet, with a default OSPF cost of 1, the cost to both R4 and R5 is 2, essentially 2 hops.

The reason that multiple routes are installed in R1’s routing table is that the route type (E2), the metric (20), and the forward metric (2) are all a tie. If any of these fields were to change, the path selection would change.
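The tie condition above can be sketched as a key comparison (a minimal illustration with hypothetical field names, not the actual IOS implementation):

```python
# Hypothetical sketch: two external OSPF paths are equal-cost candidates
# only when route type, redistributed metric, and forward metric all tie.
def ecmp_tie(path_a: dict, path_b: dict) -> bool:
    key = lambda p: (p["type"], p["metric"], p["forward_metric"])
    return key(path_a) == key(path_b)

# The two paths R1 sees for 10.1.6.0/24 in the output above:
via_r2 = {"type": "E2", "metric": 20, "forward_metric": 2}
via_r3 = {"type": "E2", "metric": 20, "forward_metric": 2}
assert ecmp_tie(via_r2, via_r3)  # full tie, so both are installed
```

Changing any one of the three fields on either path breaks the tie, which is what the rest of this post demonstrates step by step.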

To demonstrate this, let’s change the route type to E1 under R4’s OSPF process. This can be accomplished as follows:

R4#config t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1
R4(config-router)#end
R4#

The result of this change is that R1 now only installs a single route to 10.1.6.0/24, the E1 route learned via R2.

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 22, type extern 1
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:00:35 ago
Routing Descriptor Blocks:
* 10.1.12.2, from 10.1.4.4, 00:00:35 ago, via FastEthernet0/0.12
Route metric is 22, traffic share count is 1

Note that the metric and the forward metric seen in the previous E2 route are now collapsed into the single “metric” field of the E1 route. Although the value is technically the same (a cost of 2 to the ASBR plus the cost of 20 that the ASBR reports), the E1 route is preferred over the E2 route due to the OSPF path selection state machine preference. Even if we were to raise the metric of the E1 route so that its cost is higher than the E2 route's, the E1 route would still be preferred:

R4#config t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 100
R4(config-router)#end
R4#

R1 still installs the E1 route, even though the E1 metric of 102 is higher than the E2 metric of 20 plus a forward metric of 2.

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 102, type extern 1
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:00:15 ago
Routing Descriptor Blocks:
* 10.1.12.2, from 10.1.4.4, 00:00:15 ago, via FastEthernet0/0.12
Route metric is 102, traffic share count is 1

R1 still knows about both the E1 and the E2 route in the Link-State Database, but the E1 route must always be preferred:

R1#show ip ospf database external 10.1.6.0

OSPF Router with ID (10.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 64
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.4.4
LS Seq Number: 80000003
Checksum: 0x1C8E
Length: 36
Network Mask: /24
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 100
Forward Address: 0.0.0.0
External Route Tag: 0

LS age: 1388
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.5.5
LS Seq Number: 80000001
Checksum: 0x7307
Length: 36
Network Mask: /24
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 20
Forward Address: 0.0.0.0
External Route Tag: 0

This is the behavior we would expect, because E1 routes must always be preferred over E2 routes. Now let’s look at some of the commonly misunderstood cases, where the E2 routes use both the metric and the forward metric for their path selection.

First, R4’s redistribution is modified to return the metric-type to E2, but to use a higher metric of 100 than the default of 20:

R4#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 100
R4(config-router)#end
R4#

The result on R1 is that the route via R4 is less preferred, since it now has a metric of 100 (and still a forward metric of 2) vs the metric of 20 (and the forward metric of 2) via R5.

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:00:30 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:00:30 ago, via FastEthernet0/0.13
Route metric is 20, traffic share count is 1

The alternate route via R4 can still be seen in the database.

R1#show ip ospf database external 10.1.6.0

OSPF Router with ID (10.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 34
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.4.4
LS Seq Number: 80000004
Checksum: 0x9D8B
Length: 36
Network Mask: /24
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 100
Forward Address: 0.0.0.0
External Route Tag: 0

Routing Bit Set on this LSA
LS age: 1653
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.5.5
LS Seq Number: 80000001
Checksum: 0x7307
Length: 36
Network Mask: /24
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 20
Forward Address: 0.0.0.0
External Route Tag: 0

This is the path selection that we would ideally want, because the total cost of the path via R4 is 102 (metric of 100 plus a forward metric of 2), while the cost of the path via R5 is 22 (metric of 20 plus a forward metric of 2). The result of this path selection would be the same if we were to change both routes to E1, as seen below.

R4#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 100
R4(config-router)#end
R4#

R5#config t
Enter configuration commands, one per line. End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 1
R5(config-router)#end
R5#

R1 still chooses the route via R5, since this has a cost of 22 vs R4’s cost of 102.

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 22, type extern 1
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:00:41 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:00:41 ago, via FastEthernet0/0.13
Route metric is 22, traffic share count is 1

R1#show ip ospf database external 10.1.6.0

OSPF Router with ID (10.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 56
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.4.4
LS Seq Number: 80000005
Checksum: 0x1890
Length: 36
Network Mask: /24
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 100
Forward Address: 0.0.0.0
External Route Tag: 0

Routing Bit Set on this LSA
LS age: 45
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.5.5
LS Seq Number: 80000003
Checksum: 0xEB0D
Length: 36
Network Mask: /24
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 20
Forward Address: 0.0.0.0
External Route Tag: 0

R1#

Note that the E1 route itself in the database does not include the cost to the ASBR. This must be calculated separately either based on the Type-1 LSA or Type-4 LSA, depending on whether the route to the ASBR is Intra-Area or Inter-Area respectively.

So now this raises the question: why does it matter whether we use E1 or E2? As we saw, E1 is always preferred over E2 due to the OSPF path selection order, but what is the difference between having *all* E1 routes versus having *all* E2 routes? Now let’s look at a case where it *does* matter whether you’re using E1 or E2.

R1’s OSPF cost on the link to R2 is increased as follows:

R1#config t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)#interface Fa0/0.12
R1(config-subif)#ip ospf cost 100
R1(config-subif)#end
R1#

R4 and R5’s redistribution is modified as follows:

R4#config t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 99
R4(config-router)#end
R4#

R5#config t
Enter configuration commands, one per line. End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 198
R5(config-router)#end
R5#

Now R1’s paths to the prefix 10.1.6.0/24 are as follows. Path 1 is via the link to R2 with a cost of 100, plus the link to R4 with a cost of 1, plus the redistributed metric of 99, making this path a total cost of 200. Path 2 is via the link to R3 with a cost of 1, plus the link to R5 with a cost of 1, plus the redistributed metric of 198, making this path a total cost of 200 as well. The result is that R1 installs both paths equally:
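The E1 arithmetic above can be sketched in a few lines (an illustration only; the helper name is invented):

```python
# Hypothetical sketch of E1 total-cost arithmetic: the internal link costs
# along the path to the ASBR are added to the redistributed (external) metric.
def e1_total_cost(link_costs, redistributed_metric):
    return sum(link_costs) + redistributed_metric

path_via_r2 = e1_total_cost([100, 1], 99)   # R1->R2 link + R2->R4 link + metric
path_via_r3 = e1_total_cost([1, 1], 198)    # R1->R3 link + R3->R5 link + metric
assert path_via_r2 == path_via_r3 == 200    # equal total cost, so both install
```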

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 200, type extern 1
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:02:54 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:02:54 ago, via FastEthernet0/0.13
Route metric is 200, traffic share count is 1
10.1.12.2, from 10.1.4.4, 00:02:54 ago, via FastEthernet0/0.12
Route metric is 200, traffic share count is 1

Note that the database lists the costs of the Type-5 External LSAs as different though:

R1#show ip ospf database external 10.1.6.0

OSPF Router with ID (10.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 291
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.4.4
LS Seq Number: 80000006
Checksum: 0xC9C
Length: 36
Network Mask: /24
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 99
Forward Address: 0.0.0.0
External Route Tag: 0

Routing Bit Set on this LSA
LS age: 207
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.5.5
LS Seq Number: 80000004
Checksum: 0xE460
Length: 36
Network Mask: /24
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 198
Forward Address: 0.0.0.0
External Route Tag: 0

What happens if we were to change the metric-type to 2 on both R4 and R5 now? Let’s see:

R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 99
R4(config-router)#end
R4#

R5#config t
Enter configuration commands, one per line. End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 198
R5(config-router)#end
R5#

Even though the end-to-end costs are still the same, R1 should now prefer the path with the lower redistributed metric via R4:

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 99, type extern 2, forward metric 101
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:01:09 ago
Routing Descriptor Blocks:
* 10.1.12.2, from 10.1.4.4, 00:01:09 ago, via FastEthernet0/0.12
Route metric is 99, traffic share count is 1

The forward metric of this route means that the total cost is still 200 (the metric of 99 plus the forward metric of 101). In this case, even though both paths are technically equal, only the path with the lower redistribution metric is installed. Now let’s see what happens if we do set the redistribution metric the same.

R4#config t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 1
R4(config-router)#end
R4#

R5#config t
Enter configuration commands, one per line. End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 1
R5(config-router)#end
R5#

Both routes now have the same metric of 1, so both should be installed in R1’s routing table, right? Let’s check:

R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 1, type extern 2, forward metric 2
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:00:42 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:00:42 ago, via FastEthernet0/0.13
Route metric is 1, traffic share count is 1

This result may not be what we expect. Only the path via R5 is installed, not the path via R4. Let’s look at the database and see why:

R1#show ip ospf database external 10.1.6.0

OSPF Router with ID (10.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 56
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.4.4
LS Seq Number: 80000008
Checksum: 0xB3D4
Length: 36
Network Mask: /24
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
Forward Address: 0.0.0.0
External Route Tag: 0

Routing Bit Set on this LSA
LS age: 47
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 10.1.6.0 (External Network Number )
Advertising Router: 10.1.5.5
LS Seq Number: 80000006
Checksum: 0xAADD
Length: 36
Network Mask: /24
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
Forward Address: 0.0.0.0
External Route Tag: 0

Both of these routes show the same cost, as denoted by the “Metric: 1”, so why is one being chosen over the other? The reason is that in reality, OSPF External Type-2 (E2) routes *do* take the cost to the ASBR into account during route calculation. The problem though is that by looking at just the External LSA’s information, we can’t see why we’re choosing one over the other.

Now let’s go through the entire recursion process in the database to figure out why R1 is choosing the path via R5 over the path via R4.

First, as we saw above, R1 finds both routes to the prefix with a metric of 1. Since this is a tie, the next thing R1 does is determine if the route to the ASBR is via an Intra-Area path. This is done by looking up the Type-1 Router LSA for the Advertising Router field found in the Type-5 External LSA.

R1#show ip ospf database router 10.1.4.4

OSPF Router with ID (10.1.1.1) (Process ID 1)
R1#show ip ospf database router 10.1.5.5

OSPF Router with ID (10.1.1.1) (Process ID 1)
R1#

This output on R1 means that it does not have an Intra-Area path to either of the ASBRs advertising these routes. The next step is to check if there is an Inter-Area path. This is done by examining the Type-4 ASBR Summary LSA.

R1#show ip ospf database asbr-summary 10.1.4.4

OSPF Router with ID (10.1.1.1) (Process ID 1)

Summary ASB Link States (Area 0)

Routing Bit Set on this LSA
LS age: 1889
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(AS Boundary Router)
Link State ID: 10.1.4.4 (AS Boundary Router address)
Advertising Router: 10.1.2.2
LS Seq Number: 80000002
Checksum: 0x24F3
Length: 28
Network Mask: /0
TOS: 0 Metric: 1

R1#show ip ospf database asbr-summary 10.1.5.5

OSPF Router with ID (10.1.1.1) (Process ID 1)

Summary ASB Link States (Area 0)

Routing Bit Set on this LSA
LS age: 1871
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(AS Boundary Router)
Link State ID: 10.1.5.5 (AS Boundary Router address)
Advertising Router: 10.1.3.3
LS Seq Number: 80000002
Checksum: 0x212
Length: 28
Network Mask: /0
TOS: 0 Metric: 1

This output indicates that R1 does have Inter-Area routes to the ASBRs R4 and R5. The Inter-Area metric to reach them is 1 via ABRs R2 (10.1.2.2) and R3 (10.1.3.3) respectively. Now R1 needs to know which ABR is closer, R2 or R3? This is accomplished by looking up the Type-1 Router LSA to the ABRs that are originating the Type-4 ASBR Summary LSAs.

R1#show ip ospf database router 10.1.2.2

OSPF Router with ID (10.1.1.1) (Process ID 1)

Router Link States (Area 0)

Routing Bit Set on this LSA
LS age: 724
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 10.1.2.2
Advertising Router: 10.1.2.2
LS Seq Number: 8000000D
Checksum: 0xA332
Length: 36
Area Border Router
Number of Links: 1

Link connected to: a Transit Network
(Link ID) Designated Router address: 10.1.12.2
(Link Data) Router Interface address: 10.1.12.2
Number of TOS metrics: 0
TOS 0 Metrics: 1

R1#show ip ospf database router 10.1.3.3

OSPF Router with ID (10.1.1.1) (Process ID 1)

Router Link States (Area 0)

Routing Bit Set on this LSA
LS age: 1217
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 10.1.3.3
Advertising Router: 10.1.3.3
LS Seq Number: 80000010
Checksum: 0x9537
Length: 36
Area Border Router
Number of Links: 1

Link connected to: a Transit Network
(Link ID) Designated Router address: 10.1.13.1
(Link Data) Router Interface address: 10.1.13.3
Number of TOS metrics: 0
TOS 0 Metrics: 1

This output indicates that R2 and R3 are adjacent with the Designated Routers 10.1.12.2 and 10.1.13.3 respectively. Since R1 is also adjacent with these DRs, the cost from R1 to the DR is now added to the path.

R1#show ip ospf database router 10.1.1.1

OSPF Router with ID (10.1.1.1) (Process ID 1)

Router Link States (Area 0)

LS age: 948
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 10.1.1.1
Advertising Router: 10.1.1.1
LS Seq Number: 8000000F
Checksum: 0x6FA6
Length: 60
Number of Links: 3

Link connected to: a Stub Network
(Link ID) Network/subnet number: 10.1.1.1
(Link Data) Network Mask: 255.255.255.255
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: a Transit Network
(Link ID) Designated Router address: 10.1.13.1
(Link Data) Router Interface address: 10.1.13.1
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: a Transit Network
(Link ID) Designated Router address: 10.1.12.2
(Link Data) Router Interface address: 10.1.12.1
Number of TOS metrics: 0
TOS 0 Metrics: 100

R1 now knows that its cost to the DR 10.1.12.2 is 100; that DR is adjacent with R2, whose cost to R4 is 1; and R4's redistributed metric is 1. Likewise, R1's cost to the DR 10.1.13.3 is 1; that DR is adjacent with R3, whose cost to R5 is 1; and R5's redistributed metric is 1. This means that the total cost to reach 10.1.6.0 via the R1 -> R2 -> R4 path is 102, while the total cost via the R1 -> R3 -> R5 path is 3.
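The database recursion walked through above boils down to one addition per candidate path (a minimal sketch with invented names, not the actual SPF code): the intra-area cost to the ABR plus the metric the ABR advertises in its Type-4 ASBR Summary LSA.

```python
# Hypothetical sketch of the E2 tie-break recursion shown above:
# cost to each ASBR = cost to reach the advertising ABR (from the Type-1
# Router LSAs) + the metric in that ABR's Type-4 ASBR Summary LSA.
def cost_to_asbr(cost_to_abr: int, type4_metric: int) -> int:
    return cost_to_abr + type4_metric

via_r2 = cost_to_asbr(cost_to_abr=100, type4_metric=1)  # R1 -> R2 -> R4
via_r3 = cost_to_asbr(cost_to_abr=1, type4_metric=1)    # R1 -> R3 -> R5
assert (via_r2, via_r3) == (101, 2)  # R5 is the closer ASBR, so its path wins
```

With the external metrics tied at 1, the 101 vs. 2 result is the tie-breaker that makes R1 install only the path via R3 and R5.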

The final result of this is that R1 chooses the shorter path to the ASBR, which is the R1 -> R3 -> R5 path. Although the other route to the prefix is via an E2 route with the same external cost, one is preferred over another due to the shorter ASBR path.

Based on this we can see that both E1 and E2 routes take both the redistributed cost and the cost to the ASBR into account when making their path selection. The key difference is that E1 is always preferred over E2, followed by the E2 route with the lower redistribution metric. If multiple E2 routes exist with the same redistribution metric, the path with the lower forward metric (metric to the ASBR) is preferred. If there are multiple E2 routes with both the same redistribution metric and forward metric, they can both be installed in the routing table. Why does OSPF do this though? Originally this stems from the design concepts of "hot potato" and "cold potato" routing.
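The full decision procedure just summarized can be sketched end to end (a hypothetical illustration assuming only E1/E2 candidates; field names are invented, and this is not the IOS implementation):

```python
# Hypothetical sketch of external OSPF path selection as summarized above:
# E1 beats E2; among routes of the same type, the lower redistributed metric
# wins, then the lower forward metric; full ties are installed as ECMP.
def select_paths(candidates):
    best_type = min(p["type"] for p in candidates)           # "E1" < "E2"
    pool = [p for p in candidates if p["type"] == best_type]
    best_key = min((p["metric"], p["forward_metric"]) for p in pool)
    return [p for p in pool if (p["metric"], p["forward_metric"]) == best_key]

# The final scenario above: equal E2 metrics, unequal forward metrics.
paths = [
    {"via": "R2", "type": "E2", "metric": 1, "forward_metric": 101},
    {"via": "R3", "type": "E2", "metric": 1, "forward_metric": 2},
]
assert [p["via"] for p in select_paths(paths)] == ["R3"]
```

Running the same function on the earlier scenarios reproduces each result in this post: the E1 route beats the E2 route even with a higher metric, and two fully tied E2 routes both come back for ECMP installation.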

Think of a routing domain learning external routes. Typically those prefixes have some "external" metric associated with them - for example, the E2 external metric or the BGP MED attribute value. If the routers in the local domain select the exit point based on the external metric, they are said to perform "cold potato" routing. This means that the exit point is selected based on the external metric preference, e.g. distances to the prefix in the bordering routing system. This optimizes link utilization in the external system but may lead to suboptimal path selection in the local domain. Conversely, "hot potato" routing is the model where the exit point selection is performed based on the local metric to the exit point associated with the prefix. In other words, the "hot potato" model tries to push packets out of the local system as quickly as possible, optimizing internal link utilization.

Now within the scope of OSPF, think of the E2 route selection process: OSPF chooses the best exit point based on the external metric and uses the internal cost to ASBR as a tie breaker. In other words, OSPF performs "cold potato" routing with respect to E2 prefixes. It is easy to turn this process into "hot potato" by ensuring that every exit point uses the same E2 metric value. It is also possible to perform other sorts of traffic engineering by selectively manipulating the external metric associated with the E2 route, allowing for full flexibility of exit point selection.

Finally, we come to E1. This type of routing is a hybrid of the hot and cold potato models: external metrics are added directly to the internal metrics. This implicitly assumes that the external metrics are "comparable" to the internal metrics. In turn, this means E1 is meant to be used with another OSPF domain that uses a similar metric system. This is commonly found in split/merge scenarios where you have multiple routing processes within the same autonomous system and want to achieve optimal path selection accounting for the metrics in both systems. This is similar to the way EIGRP performs metric computation for external prefixes.

So there we have it. While it is technically true that "OSPF routers do not add any internal OSPF cost to the metric for an E2 route", both the intra-area and inter-area cost can still be considered in the OSPF path selection regardless of whether the route is E1 or E2.

Mar 30

OSPF and MTU Mismatch

Dear Brian,

What is the difference between using the “system mtu routing 1500” and the “ip ospf mtu-ignore” commands when running OSPF between a router and a switch?

Thanks,

Paul

Hi Paul,

Within the scope of the CCIE Lab Exam, it may be acceptable to issue either of these commands to solve a specific lab task. However, it is key to note that there is a difference between ignoring the MTU for the purpose of OSPF adjacency and matching the MTU within a real production network.

By design, OSPF automatically detects an MTU mismatch between two devices when they exchange the Database Description (DBD) packets during the formation of an adjacency. This is per the standard OSPF specification defined in RFC 2328, “OSPF Version 2”. Specifically, the RFC states the following:

10.6.  Receiving Database Description Packets
This section explains the detailed processing of a received
Database Description Packet.
[snip]
If the Interface MTU field in the Database Description packet
indicates an IP datagram size that is larger than the router can
accept on the receiving interface without fragmentation, the
Database Description packet is rejected.
[/snip]

Basically this means that if a router tries to negotiate an adjacency on an interface where the remote neighbor has a larger MTU, the adjacency is denied. The idea behind this check is two-fold. The first is to alleviate a problem in the data plane, in which a sending host transmits packets too large for the receiver to accept. Typically, Path MTU Discovery (PMTUD) should be implemented on the sender to prevent this case; however, this process relies on ICMP messages that could be filtered out in the transit path due to a security policy. The second, and more important, issue is to alleviate a problem in the control plane in which OSPF packets are exchanged.

Specifically this problem stems from the issue that the OSPF Hello, Database Description (DBD), Link-State Request (LSR), and Link-State Acknowledgement (LSAck) packets are generally small, but the Link-State Update (LSU) packets are generally not.

When establishing a new OSPF adjacency, the DBD packet is used to tell new neighbors what LSAs are in the database, but not to give the details about them. Specifically the DBD contains the LSA Header information, but not the actual LSA payload. The idea behind this is to optimize flooding in the case that the receiving router already received the LSA from another neighbor, in which case flooding does not need to occur during adjacency establishment.

For example, suppose that you and I, routers A and B, both have neighbors C and D, and the database is synchronized. If you and I form a new adjacency, my DBD exchange to you will say that I have LSAs A, B, C, and D in my database. Since you are already adjacent with C and D, and I am adjacent with them, you already have all of my LSAs, possibly with the exception of the new link that connects us. This means that even though I describe LSAs A and B to you with my DBD packet, you don't send an LSR to me for them, which means I don't send you an LSU about them. This is the normal optimization of how the database is exchanged so that excessive flooding doesn't occur.

Suppose next that you, router A, know about LSAs A1 through An in your database, and I, router B, know about LSAs B1 through Bn. When we establish an adjacency your DBD to me will describe LSAs A1-An, while mine will describe LSAs B1-Bn. Since I don't have LSAs A1-An, I will send you an LSR about them, and likewise since you don't have B1-Bn, you will send an LSR about those to me. When you reply back to me with the LSUs about A1-An, it is likely that the LSU packet itself will contain more than one LSA in the payload, or that if the LSA is large, that it will span multiple IP fragments. The idea behind this is that since you need to send me more than one LSA, it's more efficient to send them in as few LSUs as possible, instead of sending one LSA per LSU. The problem that can occur in this procedure however is when the router that is flooding has a larger MTU than the router that is receiving.

For example, suppose that the flooding router has a Gigabit Ethernet interface that supports Jumbo frames, which exceed the normal Ethernet MTU of 1500 bytes; however, the receiving router has not enabled Jumbo frame support, which implies that frames over 1500 bytes (excluding layer 2 overhead) will be dropped. If the flooding router sends multiple LSAs in an LSU forcing the packet size to exceed 1500 bytes, or if a single LSA sent by the flooding router is large enough to exceed 1500 bytes, such as a Router LSA (LSA Type 1) with many links, the results can be non-deterministic.
To demonstrate this, take the following topology.

 

R1 and R2 connect with GigabitEthernet, while R2 and R3 connect with FastEthernet. R1 has the default MTU of 1500 bytes configured on its link to R2, while R2 has Jumbo frame support configured up to 2000 bytes. R2 and R3’s link uses the default MTU of 1500 bytes. Per the RFC’s defined behavior, R1 should reject an OSPF adjacency with R2. This default behavior can be seen as follows:

R1:
interface GigabitEthernet1/0
ip address 12.0.0.1 255.255.255.0
!
router ospf 1
network 0.0.0.0 255.255.255.255 area 0

R2:
interface GigabitEthernet1/0
mtu 2000
ip address 12.0.0.2 255.255.255.0
!
router ospf 1
network 0.0.0.0 255.255.255.255 area 0

R1#debug ip packet detail
IP packet debugging is on (detailed)
R1#debug ip ospf adj
OSPF adjacency events debugging is on

01:07:18: OSPF: Rcv DBD from 2.2.2.2 on GigabitEthernet1/0 seq 0x172A opt 0x52 flag 0x7 len 32 mtu 2000 state EXSTART
01:07:18: OSPF: Nbr 2.2.2.2 has larger interface MTU
01:07:18: OSPF: Retransmitting DBD to 2.2.2.2 on GigabitEthernet1/0
01:07:18: OSPF: Up DBD Retransmit cnt to 5 for 2.2.2.2 on GigabitEthernet1/0
01:07:18: OSPF: Send DBD to 2.2.2.2 on GigabitEthernet1/0 seq 0x1813 opt 0x52 flag 0x7 len 32

In this case we can see that R1 rejects R2's DBD packet, since the MTU is larger. Although the obvious solution to this problem is to simply match the MTU of the links to avoid this problem in the first place, IOS also offers the "ip ospf mtu-ignore" command at the interface level to skip over this check in the OSPF adjacency state machine. Once applied, as seen below, R1 and R2 form an adjacency.

R1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)#interface Gig1/0
R1(config-if)#ip ospf mtu-ignore
R1(config-if)#end
R1#
%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on GigabitEthernet1/0 from LOADING to FULL, Loading Done
R1#show ip ospf neighbor

Neighbor ID Pri State Dead Time Address Interface
2.2.2.2 1 FULL/DR 00:00:36 12.0.0.2 GigabitEthernet1/0

At this point, both R1 and R2 learn the routes to each other's Loopback0 interfaces, as seen below.

R1#show ip route ospf
2.0.0.0/32 is subnetted, 1 subnets
O 2.2.2.2 [110/2] via 12.0.0.2, 00:00:05, GigabitEthernet1/0

R2#show ip route ospf
1.0.0.0/32 is subnetted, 1 subnets
O 1.1.1.1 [110/2] via 12.0.0.1, 00:00:46, GigabitEthernet1/0

As expected however, since there is an MTU mismatch, R1 is unable to receive packets from R2 that exceed an MTU of 1500 bytes.

R2#ping 1.1.1.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/16/20 ms

R2#ping
Protocol [ip]:
Target IP address: 1.1.1.1
Repeat count [5]:
Datagram size [100]: 2000
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]:
Set DF bit in IP header? [no]: yes
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 5, 2000-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Theoretically this MTU mismatch should not matter, since end hosts that send traffic should ideally implement Path MTU Discovery. However, let's now see a case where R2 is unable to flood LSAs to R1 for which the IP packet size exceeds 1500 bytes.

R3, which connects to R2, has been configured with a large number of Loopback interfaces in order to generate a large Router LSA (LSA Type 1). R3's configuration is as follows, where Loopbacks 3.3.3.2 - 3.3.3.253 have been omitted:

R3:
interface FastEthernet0/0
ip address 23.0.0.3 255.255.255.0
shutdown
!
interface Loopback3330
ip address 3.3.3.0 255.255.255.255
!
[snip]
!
interface Loopback333254
ip address 3.3.3.254 255.255.255.255
!
router ospf 1
network 0.0.0.0 255.255.255.255 area 0

The number of resulting local links can be seen in R3's database as follows:

R3#show ip ospf database

OSPF Router with ID (23.0.0.3) (Process ID 1)

Router Link States (Area 0)

Link ID ADV Router Age Seq# Checksum Link count
23.0.0.3 23.0.0.3 299 0x80000007 0x0050D2 254

Now let's activate the link between R2 and R3, which will cause R3 to flood a large Router LSA to R2, which in turn causes R2 to flood this to R1.

R3#config t
Enter configuration commands, one per line. End with CNTL/Z.
R3(config)#int Fa0/0
R3(config-if)#no shutdown
R3(config-if)#end
R3#

R2#debug ip packet detail
IP packet debugging is on (detailed)
R2#debug ip ospf packet
OSPF packet debugging is on

R2#config t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)#interface Fa2/0
R2(config-if)#no shutdown
R2(config-if)#end
R2#
%SYS-5-CONFIG_I: Configured from console by console
IP: s=23.0.0.3 (FastEthernet2/0), d=224.0.0.5, len 76, rcvd 0, proto=89
OSPF: rcv. v:2 t:1 l:44 rid:23.0.0.3
aid:0.0.0.0 chk:D59B aut:0 auk: from FastEthernet2/0
IP: s=23.0.0.2 (local), d=23.0.0.3 (FastEthernet2/0), len 80, sending, proto=89
[snip]

R2 and R3 form an adjacency, and R3's LSA is flooded to R2. Since the LSU exceeds a single 1500 byte packet, it is fragmented into multiple packets, the largest matching the shared MTU of 1500 bytes between them.

IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 1500, rcvd 0
IP Fragment, Ident = 497, fragment offset = 0, proto=89
IP: recv fragment from 23.0.0.3 offset 0 bytes
IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 1500, rcvd 0
IP Fragment, Ident = 497, fragment offset = 1480
IP: recv fragment from 23.0.0.3 offset 1480 bytes
IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 172, rcvd 0
IP Fragment, Ident = 497, fragment offset = 2960
IP: recv fragment from 23.0.0.3 offset 2960 bytes
OSPF: rcv. v:2 t:4 l:3112 rid:23.0.0.3
aid:0.0.0.0 chk:297C aut:0 auk: from FastEthernet2/0
%OSPF-5-ADJCHG: Process 1, Nbr 23.0.0.3 on FastEthernet2/0 from LOADING to FULL, Loading Done

Once the adjacency is full, R2 installs R3's routes, and begins to flood to R1:

R2#show ip route ospf
1.0.0.0/32 is subnetted, 1 subnets
O 1.1.1.1 [110/2] via 12.0.0.1, 00:00:10, GigabitEthernet1/0
3.0.0.0/32 is subnetted, 254 subnets
O 3.3.3.1 [110/2] via 23.0.0.3, 00:00:10, FastEthernet2/0
[snip]
O 3.3.3.254 [110/2] via 23.0.0.3, 00:00:10, FastEthernet2/0

R2#
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 3132, sending broad/multicast, proto=89
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 1996, sending fragment
IP Fragment, Ident = 854, fragment offset = 0, proto=89
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 1156, sending last fragment
IP Fragment, Ident = 854, fragment offset = 1976

Note that since the 3132 byte LSU exceeds R2's interface MTU of 2000 bytes, it is fragmented into multiple packets. Since R1 cannot accept packets that exceed its MTU of 1500 bytes, the LSUs are never received. This means that R1 cannot synchronize its database with R2, as seen below.

R1#show ip ospf database

OSPF Router with ID (1.1.1.1) (Process ID 1)

Router Link States (Area 0)

Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 62 0x80000005 0x6592 2
2.2.2.2 2.2.2.2 35 0x8000000D 0x613E 3

Net Link States (Area 0)

Link ID ADV Router Age Seq# Checksum
12.0.0.1 1.1.1.1 62 0x80000001 0x61BB
23.0.0.3 23.0.0.3 36 0x80000001 0x974C

R2#show ip ospf database

OSPF Router with ID (2.2.2.2) (Process ID 1)

Router Link States (Area 0)

Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 67 0x80000005 0x6592 2
2.2.2.2 2.2.2.2 38 0x8000000D 0x613E 3
23.0.0.3 23.0.0.3 39 0x80000005 0x2AAD 255

Net Link States (Area 0)

Link ID ADV Router Age Seq# Checksum
12.0.0.1 1.1.1.1 67 0x80000001 0x61BB
23.0.0.3 23.0.0.3 39 0x80000001 0x974C

R3#show ip ospf database

OSPF Router with ID (23.0.0.3) (Process ID 1)

Router Link States (Area 0)

Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 69 0x80000005 0x006592 2
2.2.2.2 2.2.2.2 40 0x8000000D 0x00613E 3
23.0.0.3 23.0.0.3 39 0x80000005 0x002AAD 255

Net Link States (Area 0)

Link ID ADV Router Age Seq# Checksum
12.0.0.1 1.1.1.1 69 0x80000001 0x0061BB
23.0.0.3 23.0.0.3 39 0x80000001 0x00974C

This also implies that R1 cannot install routes towards R3:

R1#show ip route ospf
2.0.0.0/32 is subnetted, 1 subnets
O 2.2.2.2 [110/2] via 12.0.0.2, 00:00:02, GigabitEthernet1/0
23.0.0.0/24 is subnetted, 1 subnets
O 23.0.0.0 [110/2] via 12.0.0.2, 00:00:02, GigabitEthernet1/0

Eventually the adjacency between R1 and R2 is lost, due to the lack of LSAcks sent in response to R2's LSUs. This can be seen in R1's "debug ip ospf packet" output, and in the "show ip ospf neighbor" output on both devices:

R1#
OSPF: rcv. v:2 t:1 l:44 rid:2.2.2.2
aid:0.0.0.0 chk:DC98 aut:0 auk: from GigabitEthernet1/0
OSPF: Cannot see ourself in hello from 2.2.2.2 on GigabitEthernet1/0, state INIT

R1#show ip ospf neighbor

Neighbor ID Pri State Dead Time Address Interface
2.2.2.2 1 LOADING/DR 00:00:34 12.0.0.2 GigabitEthernet1/0

R2#show ip ospf neighbor

Neighbor ID Pri State Dead Time Address Interface
23.0.0.3 1 FULL/DR 00:00:35 23.0.0.3 FastEthernet2/0
1.1.1.1 1 FULL/BDR 00:00:39 12.0.0.1 GigabitEthernet1/0

The key with this example is that although the "ip ospf mtu-ignore" command allows the initial adjacency to form between R1 and R2, we can see that synchronization fails between them when an LSA replication event causes packet sizes generated by R2 to exceed R1's MTU.

Based on this we can see that the "ip ospf mtu-ignore" command is not a fix for the underlying problem; it is simply an exception to the OSPF adjacency state machine. The real fix is to ensure that the MTU values match between neighbors, which prevents both the routing exchange failure in the control plane and packet drops due to unsupported sizes in the data plane.
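Applied to the earlier topology, the real fix would simply be reverting R2's interface MTU to match R1 (a sketch based on the configurations above):

R2:
interface GigabitEthernet1/0
 no mtu
 ! or, equivalently, explicitly: mtu 1500

On many Catalyst switch platforms the interface MTU is not configured per-port; instead the global "system mtu routing 1500" command from Paul's original question controls the MTU used for routed traffic, which is the value OSPF advertises in its DBD packets.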

Jan
08

One of the most important technical protocols on the planet is Open Shortest Path First (OSPF). This highly tunable and very scalable Interior Gateway Protocol (IGP) was designed as the replacement technology for the very problematic Routing Information Protocol (RIP). As such, it has become the IGP chosen by many corporate enterprises.

OSPF’s design, operation, implementation and maintenance can be extremely complex. The 3-Day INE bootcamp dedicated to this protocol will be the most in-depth coverage in the history of INE videos.

This course will be developed by Brian McGahan and Petr Lapukhov. It will be delivered online in a Self-Paced format. The course will be available for purchase soon for $295.

Here is a preliminary outline:

Day 1 OSPF Operations

●      Dijkstra Algorithm

●      Neighbors and Adjacencies

○   OSPF Packet Formats

○   OSPF Authentication

○   Link-State information Flooding

●      Concept of Areas

○   Notion of ABR

○   Notion of ASBR

●      Network Types

○   Flooding on P2P Links

○   Flooding with DR

○   Topological Representation

●      The Link State Database

○   LSA Format (Checksum, Seq#, etc)

○   LSA Types

○   LSA Purging

●      The Routing Table

○   How is RIB computed from LSDB

●      Flooding Reduction

○   DNA bit

○   DC Circuits

○   Database Filter

Day 2 Configuring OSPF

●      Basic Configurations

○   Setting Router IDs

○   OSPF and Secondary Addresses

●      NBMA Networks

○   Selecting Network Type

○   Ensuring peer reachability

●      Special Areas

○   Stub Area Types

○   Routing in NSSA Areas

●      OSPF Summarization

○   Internal vs External

●      Virtual Links

○   Transit Capability

○   Summarization and Virtual Links

Day 3 Advanced Topics and Troubleshooting

●      OSPF Fast Convergence

○   L3 and L2 interaction

○   SPF and LSA Throttling

●      OSPF Tuning

○   LSA Pacing

○   Hello Timer Tuning

○   Max-Metric LSA

●      OSPF in MPLS Layer 3 VPNs

○   Superbackbone

○   MP-BGP extensions for OSPF

○   Loop-Prevention Concepts

○   Sham-Link

●      Inter-Area Loop Prevention Caveats

●      Key OSPF Verifications

●      OSPF Troubleshooting Process

○   Adjacency Problems (e.g. MTU issues)

○   Intra-area reachability (e.g. network types mismatch)

○   Inter-area reachability (e.g. summary LSA blocking)

○   Troubleshooting VLs and SLs

Jan
01

This document is presented as a series of questions and answers discussing the various OSPF mechanisms designed to prevent inter-area routing loops and related issues. The discussion covers ABR functions, Virtual-Links, the OSPF Super-backbone, OSPF Sham-Links, and the BGP Cost Community. The reader is assumed to know these concepts already, as this publication focuses on the complex feature interactions arising in MPLS/BGP VPN scenarios. The discussion culminates in an analysis of a number of issues arising in a complex multi-area, multi-homed OSPF site deployed in an MPLS VPN environment. Please download the following document to read the publication: Loop Prevention in OSPF

Sep
03

I enjoyed Petr's article regarding explicit next hop.  It reminded me of a scenario where a redistributed route, going into OSPF conditionally worked, depending on which reachable next hop was used.

Here is the topology for the scenario:

3 routers ospf fa blogpost

Here is the relevant (and working :)) information for R1.

R1 screenshot

When we replace the static route with a new reachable next hop, we lose the ability to ping 100.100.100.3

R1 screenshot 2

When we change the next hop for the static route (which is being redistributed into OSPF), the route to 100.100.100.0/24 no longer works, even though we have verified the ability to ping the new next hop.

Can you solve this puzzle?  Please post your ideas!

For more troubleshooting scenarios, please see our CCIE Route-Switch workbooks, volume 2, for more than 100 challenging troubleshooting scenarios.

We will post the results right here, in a few days, after you have had a chance to post your comments and ideas.

Best wishes.

 

Follow-up-

Thank you for all the great answers, (below in the comments).

R1, using a next hop of 172.16.33.33 in its static route, will include that same address in the LSA as the forwarding address.  Among the requirements that make this possible, the one we are going to focus on here is that this next hop is in the same IP subnet as an OSPF interface (Lo0) on R1.  172.16.1.1/16 (R1) and 172.16.33.33 (next hop address, owned by R3).

R1 sends 172.16.33.33 as fa

If we use a next hop that isn't in the same IP subnet as an OSPF interface on R1, the LSA will not include the next hop forwarding address, which will then cause R2 to believe that R1 is the next hop and the route will fail to work.   We could also cause the 0.0.0.0 to show up by changing the ospf network type for R1 Loop 0 to point-to-point, not including Loop 0 in the network statement for OSPF, or by setting Loop 0 as a passive interface for OSPF. (take your pick) :)

R1 sends 0.0.0.0 fa
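Putting the pieces together, the working case can be sketched as follows (addresses taken from the discussion above; the OSPF process ID and wildcard masks are assumptions):

R1:
interface Loopback0
 ip address 172.16.1.1 255.255.0.0
!
! next hop 172.16.33.33 falls inside Lo0's OSPF-enabled subnet, and Lo0 is
! neither passive nor point-to-point, so OSPF advertises the next hop as
! the forwarding address in the Type-5 LSA
ip route 100.100.100.0 255.255.255.0 172.16.33.33
!
router ospf 1
 network 172.16.0.0 0.0.255.255 area 0
 redistribute static subnets

Change the static route's next hop to an address outside 172.16.0.0/16 (or make Lo0 passive or point-to-point) and the forwarding address reverts to 0.0.0.0.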

Again, thanks to all for the EXCELLENT answers and insights.

You rock!

Sep
02

Abstract

This publication briefly covers the use of 3rd party next-hops in OSPF, RIP, EIGRP and BGP routing protocols. Common concepts are introduced and protocol-specific implementations are discussed. Basic understanding of the routing protocol function is required before reading this blog post.

Overview

The third-party next-hop concept applies only to distance-vector protocols, or to the parts of link-state protocols that exhibit distance-vector behavior. The idea is that a distance-vector update carries an explicit next-hop value, which is used by the receiving side, as opposed to the "implicit" next-hop calculated as the sending router's address - the source address in the IP header carrying the routing update. Such an "explicit" next-hop is called a "third-party" next-hop, allowing the update to point at a next-hop other than the advertising router. Intuitively, this is only possible if the advertising and receiving routers are on a shared segment, though the "shared segment" concept could be generalized and abstracted. All popular distance-vector protocols support the third-party next-hop - RIPv2, EIGRP, OSPF and BGP all carry an explicit next-hop value. Look at the figure below - it illustrates a situation where two different distance-vector protocols run on a shared segment, but neither runs on all routers attached to the segment. The protocols "overlap" at a "pivotal" router, and redistribution is used to provide inter-protocol route exchange.

third-party-nh-generic

Per the default distance-vector protocol behavior, traffic from one routing domain going into another has to cross the "pivotal" router, the router where the two domains overlap (R3 in our case) - as opposed to going directly to the closest next-hop on the shared segment. The reason is that there is no direct "native" update exchange between the hops running different routing protocols. In situations like this, it is beneficial to rewrite the next-hop IP address to point toward the "optimum" exit point, using the "pivotal" router's knowledge of both routing protocols.

OSPF is somewhat special with respect to the third-party next-hop implementation. It supports the third-party next-hop in Type-5/7 LSAs (External Routing Information LSA and NSSA External LSA) via the Forwarding Address (FA) field. These LSAs are processed in a "distance-vector manner" by every receiving router. By default, the LSA is assumed to advertise an external prefix "connected" to the advertising router. However, if the FA is non-zero, the address in this field is used to calculate the forwarding information, as opposed to the default forwarding toward the advertising router. The Forwarding Address is always present in Type-7 LSAs, for the reason illustrated in the figure below:

third-party-nh-ospf-nssa-fa

Since there could be multiple ABRs in an NSSA area, only one is elected to perform the 7-to-5 LSA translation - otherwise the routing information would loop back into the area, unless manual filtering were implemented on the ABRs (which is prone to errors). The translating ABR is elected based on the highest Router-ID, and may not be on the optimum path toward the advertising ASBR. Therefore, the forwarding address allows the other routers to pick a more optimal path, based on the inter-area routing information.

EIGRP

We start with the scenario where we redistribute RIP into EIGRP.

third-party-nh-rip2eigrp

Notice that EIGRP will not insert the third-party next-hop until you apply the command no ip next-hop-self eigrp on R3's connection to the shared segment. Look at the routing table output prior to applying the no ip next-hop-self eigrp command.

R1#show  ip route eigrp 
140.1.0.0/16 is variably subnetted, 2 subnets, 2 masks
D EX 140.1.2.2/32
[170/2560002816] via 140.1.123.3, 00:00:27, FastEthernet0/0

After the command has been applied to R3’s interface:

R1#show  ip route eigrp
140.1.0.0/16 is variably subnetted, 2 subnets, 2 masks
D EX 140.1.2.2/32
[170/2560002816] via 140.1.123.2, 00:00:04, FastEthernet0/0

The same behavior is observed when redistributing OSPF into EIGRP, but not when redistributing BGP: for some reason, BGP's next-hop is not copied into EIGRP updates. Notice that you may enable or disable the third-party next-hop behavior in EIGRP using the interface-level command ip next-hop-self eigrp.
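The toggle is applied on the "pivotal" router's shared-segment interface; a sketch, assuming the interface name and EIGRP AS number from the figure:

R3:
interface FastEthernet0/0
 ! preserve the RIP-learned next-hop in outgoing EIGRP updates
 no ip next-hop-self eigrp 100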

RIP

RIP passes the third-party next-hop from OSPF, BGP or EIGRP. For instance, assume EIGRP redistribution into RIP. You have to configure no ip split-horizon on R3's Ethernet connection to get this to work:

third-party-nh-eigrp2rip

R2#show ip route rip 
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
R 140.1.1.1/32 [120/1] via 140.1.123.1, 00:00:17, FastEthernet0/0

Notice the following RIP debugging output, which lists the third-party next-hop:

RIP: received v2 update from 140.1.123.3 on FastEthernet0/0
140.1.1.1/32 via 140.1.123.1 in 1 hops
140.1.123.0/24 via 0.0.0.0 in 1 hops

Surprisingly, there is NO need to enable the command no ip split-horizon on the interface when redistributing BGP or OSPF routes into RIP. It seems that only EIGRP-to-RIP redistribution requires this. Keep in mind, however, that split-horizon is OFF by default on physical frame-relay interfaces. Here is sample output of redistributing BGP into RIP using the third-party next-hop:

R3#show ip route bgp 
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
B 140.1.2.2/32 [20/0] via 140.1.123.2, 00:22:13
R3#

R1#show ip route rip
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
R 140.1.2.2/32 [120/1] via 140.1.123.2, 00:00:09, FastEthernet0/0

RIP’s third-party next-hop behavior is fully automatic. You cannot disable or enable it, like you do in EIGRP.

OSPF

Similarly to RIP, OSPF has no problem picking up the third-party next-hop from BGP, EIGRP or RIP. Here is how it looks (guess which protocol is redistributed into OSPF, based solely on the command output):

R1#sh ip route ospf 
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
O E2 140.1.2.2/32 [110/1] via 140.1.123.2, 00:34:59, FastEthernet0/0

R1#show ip ospf database external

OSPF Router with ID (140.1.1.1) (Process ID 1)

Type-5 AS External Link States

Routing Bit Set on this LSA
LS age: 131
Options: (No TOS-capability, DC)
LS Type: AS External Link
Link State ID: 140.1.2.2 (External Network Number )
Advertising Router: 140.1.123.3
LS Seq Number: 80000002
Checksum: 0xF749
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
Forward Address: 140.1.123.2
External Route Tag: 200

If you’re still guessing, the external protocol is BGP, as can be seen from the automatic External Route Tag – OSPF sets it to the last AS number found in the AS_PATH.

third-party-nh-bgp2ospf

There are special conditions that OSPF must meet for the FA address to be used. First, the interface where the third-party next-hop resides must be advertised into OSPF using the network command. Second, this interface must not be passive in OSPF and must not have the network type point-to-point or point-to-multipoint. Violating any of these conditions stops OSPF from installing the third-party next-hop as the FA in the Type-5 LSAs created for external routes.
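A redistributing router that satisfies all of these conditions might look like the following sketch (interface name, addressing and process/AS numbers are assumptions based on the figure):

R3:
interface FastEthernet0/0
 ip address 140.1.123.3 255.255.255.0
 ! the Ethernet default network type "broadcast" is acceptable;
 ! point-to-point or point-to-multipoint would suppress the FA
!
router ospf 1
 ! the segment holding the third-party next-hop is covered by a
 ! network statement and is not passive
 network 140.1.123.0 0.0.0.255 area 0
 redistribute bgp 300 subnets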

OSPF is special in one other respect. Distance-vector protocols such as RIP or EIGRP modify the next-hop as soon as they pass the routing information to other devices; that is, the third-party next-hop is not maintained through the RIP or EIGRP domain. In contrast, OSPF LSAs are flooded within their scope with the FA unmodified. This creates an interesting problem: if the FA address is not reachable in the receiving router’s routing table, the external information found in the Type 7/5 LSA is not used. This situation is discussed in the blog post “OSPF Filtering using FA Address”.

BGP

When you redistribute any protocol into BGP, the system correctly sets the third-party next-hop in the local BGP table. Look at the diagram below, where EIGRP prefixes are being redistributed into BGP AS 300:

third-party-nh-eigrp2bgp

R3’s BGP process installs R1’s Loopback0 prefix into the BGP table with the next-hop value of R1’s address, not “0.0.0.0” as it would be for locally advertised routes. You will observe the same behavior if you inject EIGRP prefixes into BGP using the network command.

R3#sh ip bgp
BGP table version is 9, local router ID is 140.1.123.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 140.1.1.1/32 140.1.123.1 156160 32768 ?
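The redistribution producing this table entry might look like the following sketch (AS numbers assumed from the diagram); note that the “?” origin code above confirms the route entered BGP via redistribution:

R3:
router bgp 300
 ! EIGRP's next-hop (R1) and composite metric are copied into the BGP table
 redistribute eigrp 100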

Furthermore, BGP is supposed to change the next-hop to self when advertising prefixes over eBGP peering sessions. However, when all peers share the same segment, the prefixes re-advertised over the shared segment do not have their next-hop changed. See the diagram below:

third-pary-nh-bgp2bgp

Here R1 advertises prefix 140.1.1.1/24 to R3 and R3 re-advertises it back to R2 over the same segment. Unless non-physical interfaces are used to form the BGP sessions (e.g. Loopbacks), the next-hop received from R1 is not changed when passing it down to R2. This implements the default third-party next-hop preservation over eBGP sessions. Look at the sample output for the configuration illustrated above: R1 receives R2’s prefix with unmodified next-hop.

R1#show ip bgp 
BGP table version is 3, local router ID is 140.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
*> 140.1.1.1/32 0.0.0.0 0 32768 i
*> 140.1.2.2/32 140.1.123.2 0 300 200 i

There is a way to disable this default behavior in BGP. A logical assumption would be that the command neighbor X.X.X.X next-hop-self would work, and it does indeed, in recent IOS versions. Older IOS, such as 12.2T, did not honor this command for eBGP sessions, and your option would have been a route-map with the set ip next-hop command. The route-map method may still be handy if you want to insert a totally “bogus” IP next-hop from the shared segment – the receiving BGP speaker will accept any IP address that is on the same segment. That is not something you would do in a production environment too often, but it is definitely an interesting idea for lab practice. One good use in production is changing the BGP next-hop to an HSRP virtual IP address, to provide physical BGP speaker redundancy. Here is sample code for setting an explicit next-hop in a BGP update:

router bgp 300
neighbor 140.1.123.1 remote-as 100
neighbor 140.1.123.1 route-map BGP_NEXT_HOP out
!
route-map BGP_NEXT_HOP permit 10
set ip next-hop 140.1.123.100

Summary

All popular distance-vector protocols support third-party next-hop insertion. This mechanism is useful on multi-access segments, in situations where you want to pass optimum path information between routers belonging to different routing protocols. We illustrated that RIP implements this function automatically and does not allow any tuning. EIGRP, on the other hand, supports third-party next-hop passing from any protocol other than BGP, and you may turn this function on or off on a per-interface basis. Furthermore, OSPF’s special feature is propagation of the third-party next-hop within an area/autonomous system, unlike the distance-vector protocols that reset the next-hop at every hop (considering an AS as a “single hop” for BGP). Thanks to this feature, OSPF offers the interesting possibility of filtering external routing information by blocking the FA prefix from the routing tables. Finally, BGP gives the most flexibility when it comes to IP next-hop manipulation, allowing it to be changed to any value.

Further Reading

Common Routing Problem with OSPF Forwarding Address
OSPF Prefix Filtering Using Forwarding Address
BGP Redundancy using HSRP

Jun
02

The goal of this post is a brief discussion of the main factors controlling fast convergence in OSPF-based networks. Network convergence is a term used under various interpretations. Before we discuss the optimization procedures for OSPF, we define network convergence as the process of synchronizing network forwarding tables after a topology change. A network is said to be converged when none of the forwarding tables have changed for some "reasonable" amount of time, which could be defined as an interval based on the expected maximum time to stabilize after a single topology change. Network convergence based on native IGP mechanisms is also known as network restoration, since it heals lost connections. Traffic protection mechanisms such as ECMP, MPLS FRR or IP FRR, which offer a different approach to failure handling, are outside the scope of this article. We are also taking multicast routing fast recovery out of scope, even though that process is tied to IGP re-convergence.

It is interesting to notice that IGP-based "restoration" techniques have one (more or less) important problem. During re-convergence, temporary micro-loops may exist in the topology due to inconsistency between the FIB (forwarding) tables of different routers. This behavior is fundamental to link-state algorithms, as routers closer to the failure tend to update their forwarding databases before the other routers. The only popular routing protocol that lacks this property is EIGRP, which is loop-free at any moment during re-convergence, thanks to the explicit termination of its diffusing computations. For the link-state protocols, there are enhancements to the FIB update procedure that allow such micro-loops to be avoided, described in the document [ORDERED-FIB].

Even though we are mainly concerned with OSPF, IS-IS will be mentioned in the discussion as well. It should be noted that, compared to IS-IS, OSPF provides fewer "knobs" for convergence optimization. The main reason is probably the fact that IS-IS is developed and supported by a separate team of developers, more geared towards the ISPs, where fast convergence is a critical competitive factor. The common optimization principles, however, are the same for both protocols, and during the discussion we will point out the tuning features that IS-IS has but OSPF lacks. Finally, we start our discussion with a formula, which is further explained in the text:

Convergence = Failure_Detection_Time + Event_Propagation_Time + SPF_Run_Time + RIB_FIB_Update_Time

The formula reflects the fact that the convergence time for a link-state protocol is the sum of the following components:

  • Time to detect the network failure, e.g. interface down condition.
  • Time to propagate the event, i.e. flood the LSA across the topology.
  • Time to perform SPF calculations on all routers upon reception of the new information.
  • Time to update the forwarding tables for all routers in the area.
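The formula is a trivial budget calculation; the sketch below just makes that explicit. The function name is mine, and the component values are illustrative placeholders in the spirit of the numbers used later in this post, not measurements.

```python
def convergence_time_ms(detection, propagation, spf_run, rib_fib_update):
    """Sum the four components of link-state convergence (all values in ms)."""
    return detection + propagation + spf_run + rib_fib_update

# Illustrative budget: 10 ms failure detection, 20 ms LSA flooding,
# 32 ms SPF run, 20 ms RIB/FIB update.
total = convergence_time_ms(10, 20, 32, 20)  # 82 ms
```

The point of writing it out is that every component is additive: shaving any single term shaves the total.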

Part I: Fast Failure Detection

Detecting link and node failures quickly is the number one priority for fast convergence. For maximum speed, relying on IGP keepalive timers should be avoided where possible, and physical failure detection mechanisms should be used instead. This implies the use of physical point-to-point links where possible. As for the link technology, it should be able to detect loss of link within the shortest interval possible. For example, a point-to-point Gigabit Ethernet link may report failure almost instantly (by detecting loss of network pulses) if there is no Ethernet switch connecting the two nodes. However, there could be hardware-dependent timers that delay reporting the physical-layer event, such as debounce timers. With the GigE example, there is the carrier-delay timer, which is set per interface using the command carrier-delay (ms). Aiming at fast convergence, you would like to set this timer to zero, unless you have a special subnetwork technology, such as SONET, which is able to provide protection within a short interval, e.g. under 50ms. In that case, you may want to consider setting the technology-specific delay timer to a value higher than the SONET recovery time, so that a non-critical physical failure is never noticed and is healed below the network layer. In most cases it makes sense to rely on subnetwork recovery mechanics if they are available and provide timely repair within your target convergence time. More often, however, you have to deal with "cheaper" technology, such as GigE running over DWDM lambdas, and if that's the case, minimizing the detection/indication timers is your primary goal. Notice that another positive result of using point-to-point links is that OSPF becomes adjacent faster, since DR elections are no longer needed. Additionally, type 2 LSAs are not generated for point-to-point links, which slightly reduces OSPF LSDB size and topology complexity.

What would you do if your connection is not physical point-to-point or does not allow translating loss-of-signal information in a timely fashion? Good examples are switched Ethernet or a Frame-Relay PVC link. Sometimes there are solutions such as Ethernet port failure translation, which may detect an upstream switch port failure and reflect it to the downstream ports reasonably fast. For another example, Frame-Relay may signal PVC loss via asynchronous LMI updates or the A-bit (active bit) in LMI status reports. However, such mechanisms, especially the ones relying on Layer 2 features, may not report failure fast enough. In such cases, it could be a good idea to rely on fast IGP keepalive timers. Both OSPF and IS-IS support fast hellos with a dead/hold interval of one second and sub-second hello intervals ([OSPF-FASTHELLO]). Using this medium-agnostic mechanism could reduce fault detection on non-point-to-point links to one second, which could be better than relying on Layer 2-specific signaling. However, fast hello timers have one significant drawback: since all hello packets are processed by the router's main CPU, having hundreds or more OSPF/IS-IS neighbors may have significant impact on the router's control-plane performance. An alternative is BFD (Bi-directional Forwarding Detection, see [BFD]), which provides a protocol-agnostic failure detection mechanism that can be reused by multiple routing protocols (e.g. OSPF/IS-IS/BGP and so on). BFD is based on the same idea of sub-second keepalive timers, but it can be implemented in distributed router interface line-cards, therefore saving the control plane and central CPU from over-utilization.

Part II: Event Propagation

In OSPF and IS-IS, topology changes (events) are advertised by means of LSA/LSP flooding. For the network to completely converge, an LSA/LSP needs to reach every router within its flooding scope. Normally, in a properly designed network, the flooding scope is one area (flooding domain), unless the information is flooded as external, i.e. by means of Type-5 LSAs in OSPF. In general, LSA/LSP propagation time is determined by the following factors:

  1. LSA generation delay. IGP implementations normally throttle LSA generation to prevent excessive flooding in the case of oscillating (constantly flapping) links. The original OSPF specification required every LSA generation to be delayed by a fixed interval that defaulted to one second. To optimize this behavior, Cisco's OSPF and IS-IS implementations use an exponential backoff algorithm to dynamically calculate the delay for generating the SAME LSA (same LSA ID, LSA type and originating Router ID). You may find more information about truncated exponential backoff in [TUNING-OSPF], but in short the process works as follows.

    Three parameters control the throttling process: initial interval, hold, and max_wait time, specified using the command timers throttle lsa initial hold max_wait. Suppose the network was stable for a relatively long time, and then a router link goes down. As a result, the router needs to generate a new router LSA listing the new connection status. The router delays LSA generation by initial milliseconds and sets the next interval to hold milliseconds. This ensures that two consecutive events (e.g. a link going down and then back up) will be separated by at least the hold interval. Any event occurring after the initial window expires is held for processing until the hold-millisecond window expires; thus, all events occurring after the initial delay are accumulated and processed when the hold time expires. This means the next router LSA will be generated no earlier than hold milliseconds later. At the same time, the next hold time is doubled, i.e. set to 2*hold. Effectively, every time an event occurs during the current wait window, processing is delayed until the current hold time expires, and the next hold interval is doubled. The hold time grows exponentially as 2^t*hold until it reaches the max_wait value. After this, every event received during the current hold-time window results in the next interval being equal to the constant max_wait. This ensures that the exponential growth is limited, or in other words, the process is truncated. If there are no events for the duration of 2*max_wait milliseconds, the hold-time window is reset back to the initial value, assuming that the flapping link has returned to normal condition.

    The initial LSA generation delay has significant impact on network convergence time, so it is important to tune it appropriately. The initial delay should be kept to a minimum, such as 5-10 milliseconds - setting it to zero is still not recommended, as multiple link failures may occur simultaneously (e.g. an SRLG failure) and it could be beneficial to reflect them all in a single LSA/LSP. The hold interval should be tuned so that the next LSA is only sent after the network has converged in response to the first event. This means the LSA hold time should be based on the convergence time per the formula above; more accurately, it should be at least above LSA_Initial_Delay + LSA_Propagation_Delay + SPF_Initial_Delay. You may then set the maximum hold time to at least twice the hold interval to enhance flooding protection against at least two concurrent oscillating processes (having more parallel oscillations is not very probable). Notice that a single link failure normally results in at least two LSAs being generated, one by each attached router.

  2. LSA reception delay. This delay is a sum of the ingress queueing delay and the LSA arrival delay. When a router receives an LSA, it may be subject to ingress queueing, though this effect is not significant unless massive BGP re-convergence is occurring at the same time. Even under a heavy BGP TCP ACK storm, the Cisco IOS input queue discipline known as Selective Packet Discard (see [SPD]) provides enough room for IGP traffic and handles it with the highest priority. The received packets are then rate-limited based on the LSA arrival interval. OSPF rate-limits only the reception of the SAME LSAs (see the definition above): there is a fixed delay between receptions of the same LSA originated by a peer. This delay should not exceed the hold time used for LSA generation - otherwise the receiving router may drop the second LSA generated by the peer, say upon link recovery. Notice that every router on the LSA flooding path adds cumulative delay to this component, but the good news is that the initial LSA/LSP will not be rate-limited - the arrival delay applies only to consecutive copies of the same LSA. As such, you may mostly ignore this component for the purpose of fast reaction to a change, thanks to fast ingress queueing and expedited reception. Keep in mind that if you tune the arrival delay, you need to adjust the OSPF retransmission timer to be slightly above it. Otherwise, the side that just sent an LSA and has not received an acknowledgment may end up re-sending it, only for it to be dropped by the receiving side. The command to control the retransmission interval for the same LSA is timers pacing retransmission.
  3. Processing delay is the amount of time it takes the router to put the LSA on the outgoing flood lists. This delay could be significant if the SPF process starts before the LSA is flooded. SPF runtime is not the only contributor to the processing delay, but it's the one you have control over. If you configure SPF throttling to be fast enough (see the next section) - the exact time varies, but mainly with initial delays below 40ms - it may happen that the SPF run occurs before the triggering LSA is flooded to the neighbors. This results in a slower flooding process. For faster convergence, LSAs should always be flooded prior to the SPF run. The IS-IS process in Cisco IOS supports the command fast-flood, which ensures the LSPs are flooded ahead of running SPF, irrespective of the initial SPF delay. In contrast, OSPF does not support this feature, and your only option (at the moment) is properly tuning the SPF runtime delays (see below).

    The other components that may affect processing delay are interface LSA/LSP flood pacing and egress queueing. Interface flood pacing is the OSPF feature that mandates a minimum interval between flooding consecutive LSAs out of an interface. This timer runs per interface and only triggers when there is an LSA that needs to be sent out right after the previous LSA. The process-level command to control this interval is timers pacing flood (ms), with a default value of 33ms. Note that if there is just one LSA being flooded through the network, this timer will have no effect on its propagation delay; only the next consecutive LSA could be rate-limited. Therefore, just like with the arrival timer, we can mostly ignore the impact of this delay on the fast convergence process. Still, it is worth tuning the interface flood pacing timer to the smallest value possible (e.g. 5-10ms) to account for the event where multiple LSAs have to be flooded through the topology, since a link failure normally generates at least two LSAs/LSPs, one from each attached router (as discussed earlier). It is interesting to note that the reception of a single LSA signaling loss of link from one router is enough to properly rebuild the topology, since the SPF algorithm automatically verifies that a link is bidirectional before accounting for it in shortest-path computations. Additionally, reducing the interface flood pacing timer helps a newly attached router load the OSPF database significantly faster, at the expense of some extra CPU usage. This applies mainly to large OSPF databases and/or flapping link conditions. To protect against frequent massive database reloads on point-to-point links, you may additionally use the IP Event Dampening feature to suppress interface status changes, or properly design the network for redundancy to avoid full database reloads upon single-link restoration. See [OPT-DAMPENING] for information on tuning the IP Event Dampening parameters.

    Lastly, egress queueing may result in significant delay on over-utilized links. In short, a router's average egress queue depth can be approximated as Q_Depth = Utilization/(1 - Utilization), meaning that links with 50% or higher sustained utilization always experience some queueing delay (on average). Proper QoS configuration, such as reserving enough bandwidth for control-plane packets, should neutralize this component, coupled with the fact that routing update packets normally have higher priority for handling by router processes.

  4. Packet propagation delay. This delay is a sum of two major contributors: the serialization delay at every hop and the cumulative signal propagation delay across the topology. The serialization delay is almost negligible on modern "fast" links (e.g. 12usec for a 1500-byte packet over a 1Gbps link), though it could be more significant on slow WAN links such as a series of T1s. Therefore, signal propagation delay is the main contributor, due to physical limitations. This value mainly depends on the distance the signal has to travel to cross the whole OSPF/IS-IS area. The propagation delay can be roughly approximated as 0.82 ms per 100 miles and has significant impact only for inter-continental deployments or satellite links. For example, it would take at least 41ms to cross a 5000-mile-wide topology. However, since most OSPF/IS-IS areas do not exceed a single continent, this value rarely has a serious impact on total convergence time.
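A quick way to internalize the truncated backoff described in item 1 is to simulate the hold-time sequence. This is a simplified model, not IOS internals: the function name is mine, it assumes one event per wait window, and it ignores the 2*max_wait quiet-period reset.

```python
def backoff_delays(initial, hold, max_wait, events):
    """Model the delay applied to each of `events` consecutive LSA-generation
    triggers under truncated exponential backoff (simplified: one event per
    window, no quiet-period reset back to `initial`)."""
    delays = []
    next_hold = hold
    for i in range(events):
        if i == 0:
            delays.append(initial)  # first event after a quiet period: initial delay only
        else:
            delays.append(min(next_hold, max_wait))
            next_hold = min(next_hold * 2, max_wait)  # double, capped at max_wait
    return delays

# For "timers throttle lsa 10 100 1000" under constant flapping:
# backoff_delays(10, 100, 1000, 6) -> [10, 100, 200, 400, 800, 1000]
```

Note how quickly the delay climbs to max_wait: the first reaction is fast, and only a persistently flapping link is penalized.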
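The queue-depth approximation from item 3 is easy to sanity-check numerically; the helper below is just the U/(1-U) formula wrapped in a hypothetical function.

```python
def mean_queue_depth(utilization):
    """M/M/1-style mean queue depth approximation: U / (1 - U).
    `utilization` is a fraction in [0, 1)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)

# At 50% utilization, one packet is queued on average; at 80%, four.
```

The non-linearity is the takeaway: a link running at 90% utilization holds nine packets on average, so the queueing delay seen by a flooded LSA grows sharply past the 50% mark.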
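The delay figures quoted in item 4 can be reproduced with simple arithmetic; the helper names below are illustrative only, and the 0.82 ms / 100 miles figure is the rough fiber approximation from the text.

```python
def serialization_delay_us(packet_bytes, link_bps):
    """Time to clock a packet onto the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6

def propagation_delay_ms(distance_miles, ms_per_100_miles=0.82):
    """Rough signal propagation delay across a topology of the given width."""
    return distance_miles / 100 * ms_per_100_miles

# A 1500-byte packet on a 1 Gbps link serializes in ~12 us,
# while a 5000-mile-wide area adds ~41 ms of propagation delay.
```

This is why the text treats serialization as negligible on fast links: the geographic term dominates by three orders of magnitude.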

Part III: SPF Calculations

The SPF algorithm's complexity can be bounded as O(L + N*log(N)), where N is the number of nodes and L is the number of links in the topology under consideration. This estimate holds true provided that the implementation is optimal (see [DIJKSTRA-SPF]). Worst-case complexity for dense topologies could be as high as O(N^2), but this is rarely seen in real-world topologies. SPF runtime used to be a major limiting factor in the routers of the 1980s (link-state routing was invented in ARPANET) and 1990s (initial OSPF/IS-IS deployments), which used slow CPUs where SPF computations could take seconds to complete. However, progress in modern hardware (Moore's Law) has significantly reduced the impact of this factor on network convergence, though it is still one of the major contributors to convergence time. The use of incremental SPF (iSPF) further minimizes the amount of calculation needed when partial changes occur in the network (see [TUNING-OSPF]). For example, an OSPF Type-1 LSA flooded for a leaf connection no longer causes a complete SPF re-calculation, as it would with classic SPF. An important benefit is that the farther away a router is from the failed link, the less time it needs to recompute SPF. This compensates for the longer propagation delay in delivering the LSA to a distant corner of the network. Notice that OSPF also supports PRC (partial route computation), which takes only a few milliseconds upon reception of Type 3, 4 and 5 LSAs, which are treated as distance-vector updates. The PRC process is not delayed, and you cannot tune its exponential backoff timers like you can in IS-IS.
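For reference, the O(L + N*log(N)) bound comes from running Dijkstra's algorithm with a binary heap. The sketch below is the textbook algorithm on a toy topology, not Cisco's implementation; node names and the graph are made up for illustration.

```python
import heapq

def dijkstra(graph, source):
    """Binary-heap Dijkstra: `graph` maps node -> list of (neighbor, cost).
    Returns the shortest-path cost from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, already relaxed via a shorter path
        for neighbor, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Tiny four-router topology (bidirectional links with costs):
topo = {
    "R1": [("R2", 1), ("R3", 4)],
    "R2": [("R1", 1), ("R3", 1), ("R4", 5)],
    "R3": [("R1", 4), ("R2", 1), ("R4", 1)],
    "R4": [("R2", 5), ("R3", 1)],
}
# dijkstra(topo, "R1") -> {"R1": 0, "R2": 1, "R3": 2, "R4": 3}
```

Each node enters the heap at most once per relaxation, which is where the N*log(N) term comes from; iSPF and PRC avoid re-running this computation over the whole graph when only a leaf or an external prefix changes.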

You may find typical SPF runtimes for your network (to estimate the total convergence time) by using the command show ip ospf statistics:

show ip ospf statistics

OSPF Router with ID (10.4.1.1) (Process ID 1)

Area 10: SPF algorithm executed 18 times

Summary OSPF SPF statistic

SPF calculation time
Delta T Intra D-Intra Summ D-Summ Ext D-Ext Total Reason
1w3d 8 0 0 0 0 0 8 R, X
1w3d 12 0 0 0 4 0 16 R, X
1w3d 16 0 0 0 4 0 20 R, X
1w3d 8 0 0 0 0 0 8 R,
1w3d 20 0 0 0 0 0 20 R, X
1w2d 24 0 0 0 8 0 32 R, X
1w2d 8 4 0 0 0 0 12 R,
6d16h 4 0 0 0 0 4 8 R, X
6d16h 4 0 0 0 0 0 4 R,
6d16h 12 0 0 0 8 0 20 R, X

RIB manipulation time during SPF (in msec):
Delta T RIB Update RIB Delete
1w3d 4 0
1w3d 8 0
1w3d 10 0
1w3d 5 0
1w3d 8 0
1w2d 10 0
1w2d 3 0
6d16h 2 0
6d16h 1 0
6d16h 9 0

The above output is divided into two sections: SPF calculation times and RIB manipulation times. For now, we are interested in the values under the "Total" column, which represent the total time it took the OSPF process to run SPF. You may see how these values vary depending on the "Reason" field. You may want to find the maximum value and use it as an upper limit for SPF computation in your network; in our case, it's 32ms. The other section of the output will be discussed later.
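If you collect this output regularly, extracting the worst-case "Total" value can be scripted. The parser below is a hypothetical sketch that assumes the column layout shown above (Delta T, Intra, D-Intra, Summ, D-Summ, Ext, D-Ext, Total, Reason); real output may vary by platform and IOS version.

```python
def max_spf_total_ms(lines):
    """Scan 'SPF calculation time' rows and return the largest Total value (ms),
    or None if no data rows are found. Assumes the Total is the 8th column."""
    totals = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 8 and fields[7].isdigit():
            totals.append(int(fields[7]))
    return max(totals) if totals else None

rows = [
    "1w3d 8 0 0 0 0 0 8 R, X",
    "1w2d 24 0 0 0 8 0 32 R, X",
    "6d16h 4 0 0 0 0 0 4 R,",
]
# max_spf_total_ms(rows) -> 32
```

Feeding it the ten rows above would return 32, the value we use as the SPF upper limit in the rest of this post.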

The next "problem" is known as SPF throttling. Recent Cisco IOS OSPF implementations use an exponential backoff algorithm when scheduling SPF runs. The goal, as usual, is to avoid excessive calculations during times of high network instability, but keep SPF reaction fast in stable networks. The exponential process is identical to the one used for LSA throttling, with the same timer semantics.

So how would one pick optimal SPF throttling values? As mentioned before, the initial delay should be kept as short as possible to allow for instant reaction to a change, but long enough not to trigger SPF before the LSA is flooded out. It's hard to determine the exact delay to flood an LSA, but at a minimum the initial timer should stay above the per-interface LSA flood pacing timer, so that it does not delay the two consecutive LSAs flooded through the topology (as you remember, a typical transit link failure results in the generation of at least two LSAs). Setting the interface flood pacing timer to 5ms and the initial SPF delay to 10ms should be a good starting point. After the initial run, the SPF algorithm should be further held down for at least the amount of time it takes the network to converge after the initial event. This means the SPF hold time should be strictly higher than SPF_Initial_Delay + SPF_Runtime + RIB_FIB_Update_Time. There is an alternate, more pragmatic approach to tuning this timer as well. Let's say we want to make sure SPF computations take no more than 50% of the router's CPU time. For this to happen, the hold time should be at least the same as a typical SPF runtime. This value can be found from the router statistics and tuned individually on every router. Based on our example, we may set the hold interval to 32ms + 20% (an error margin; set it higher for more safety), which is about 38ms, and the maximum interval could be set to twice the hold time, which translates into 33% CPU usage under the worst condition of non-stop LSA storms. Notice that the SPF hold and maximum timers can be tuned per router, to account for different CPU power, if this applies to your scenario. Total network convergence time should be estimated based on the "slowest" router in the area.
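Following this rule of thumb, the hold and maximum values can be derived mechanically from the observed worst-case runtime. The helper below is a sketch of that calculation; the function name and the 20% default margin are mine.

```python
def spf_throttle_from_stats(max_spf_runtime_ms, margin=0.2):
    """Derive SPF hold and maximum throttle timers from the worst observed
    SPF runtime, using the margin-based approach described above."""
    hold = max_spf_runtime_ms * (1 + margin)  # e.g. 32 ms + 20% margin
    maximum = 2 * hold                        # twice the hold time
    return hold, maximum

# For the 32 ms worst case recorded above this yields roughly 38 ms / 77 ms.
```

Rounding up to clean values (and then checking CPU duty cycle, as in the text) gives you per-router hold/maximum timers that scale with each router's actual SPF cost.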

Part IV: RIB/FIB Update

After completing the SPF computation, OSPF performs a sequential RIB update to reflect the changed topology. The RIB updates are further propagated to the FIB table; depending on the platform architecture, this could be either a centralized or a distributed process. The RIB/FIB update process may contribute the most to convergence time in topologies with large numbers of prefixes, e.g. thousands or tens of thousands. In such networks, updating the RIB and the distributed FIB databases on line-cards may take a considerable amount of time, on the order of tens if not hundreds of milliseconds (varying by platform). There are two major ways to minimize the update delay: advertise fewer prefixes, and sequence FIB updates so that important paths are updated before the others.

If you think of all the prefixes that need to be in a typical network core, you will realize that you don't need any core "transit" link prefixes in there. In fact, all you need are normally the stub links at the edge of your network, e.g. PE router loopbacks or the summary prefixes injected from the lower layers of your network hierarchy. Therefore, it makes sense to suppress the prefix information advertised for the transit links. One option would be configuring all transit links as IP unnumbered, using the IP addresses of the routers' Loopback interfaces. However, both IS-IS and OSPF have a special protocol capability to implement suppression automatically. In OSPF it is known as "prefix-suppression" and prevents OSPF from including link type 3 (stub network address) in the router LSA (see [OSPF-PREFIX-SUPPRESS]). As you remember, OSPF represents a point-to-point connection between two routers via two link types in the router LSA: type 1, declaring the connection to another router based on its Router-ID, and type 3, describing the stub connection/prefix of the point-to-point link (the SNMP ifIndex is used if the link is unnumbered). The prefix-suppression feature drops the second link type and leaves only the topological information in the router LSA. As a result, you will not be able to reach the transit link subnet addresses, but you still have full connectivity within the topology. To enable suppression globally, enter prefix-suppression under the OSPF routing process; per interface, use the syntax ip ospf prefix-suppression [disable]. Notice that by default OSPF does not suppress stub-link advertisements for the router loopback interfaces, unless you have explicitly configured them for suppression.

As soon as you're done suppressing all transit link subnets, you are normally left with the router loopback interfaces (typically /32 prefixes) and routing information external to the area, such as summary addresses or external prefixes. Depending on your network configuration, the number of summary addresses could be significant. The best solution to this problem is optimal summarization and filtering of unnecessary prefixes, e.g. by means of summary-address filters and stub area features. Obviously, this requires a hierarchical addressing plan, which is not always readily available. If re-designing your network's IP addressing is not an option, you may still rely on Cisco IOS priority prefix sequencing, which is supported in IS-IS. Unfortunately, there is no support for this feature in OSPF for IOS yet, though there is support in IOS XR. You may read more about IS-IS support for priority-driven RIB prefix installation in [ISIS-PRIODRIVEN]. The general idea is to expedite the insertion of some prefixes into the forwarding table, starting with the most important ones, such as PE /32 prefixes. It is worth noting that priority sequencing may extend the duration of routing micro-loops during the re-convergence process. In general, the procedure described in [ORDERED-FIB] works against fast convergence, trading it for a loop-free process.

Is there a way to estimate the RIB/FIB manipulation times? As we have seen before, the show ip ospf statistics command provides information on RIB update time, though this output is not provided on every platform, nor is there a clear interpretation of the values in Cisco's documentation; e.g. it's unclear whether there is a checkpoint mechanism to inform OSPF of FIB entry updates. Special measurements should be taken to estimate these values, as done in [BLACKBOX-OSPF], and more importantly, these values will heavily depend on the platform used. Still, the OSPF RIB manipulation statistics could be useful to estimate the lower bound of network convergence time (though we are mostly interested in an accurate upper bound).

Sample Fast Convergence Profile

Putting the above information together, let's try to find an optimum convergence profile, based on the show ip ospf statistics output taken from the "weakest" router in the area.

Failure Detection Delay: about 5-10ms worst case to detect/report loss of network pulses.
Maximum SPF runtime: 32ms, doubling for safety makes it 64ms
Maximum RIB update: 10ms, doubling for safety makes it 20ms
OSPF interface flood pacing timer: 5ms (does not apply to the initial LSA flooded)

LSA Generation Initial Delay: 10ms (enough to detect multiple link failures resulting from SRLG failure)
SPF Initial Delay: 10ms (enough to hold SPF to allow two consecutive LSAs to be flooded)
Network geographical size: 100 miles (signal propagation is negligible)
Network physical media: 1 Gbps links (serialization delay is negligible)

Estimated network convergence time in response to the initial event: 32*2 + 10*2 + 10 + 10 = 64 + 20 + 20 = 104ms, or roughly 100ms. This estimate does not precisely account for FIB update time, but we assume it would be approximately the same as the RIB update. We need to make sure our maximum backoff timers exceed this convergence time, to ensure processing is delayed beyond the convergence interval in the worst-case scenario.
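Re-checking the arithmetic of this budget explicitly (all values in milliseconds; this is just the sum from the profile above, not a measurement):

```python
# Convergence budget after the initial event, in ms
spf_runtime = 32 * 2   # worst observed SPF run, doubled for safety
rib_update  = 10 * 2   # worst observed RIB update, doubled for safety
lsa_initial = 10       # initial LSA generation delay
spf_initial = 10       # initial SPF delay

total = spf_runtime + rib_update + lsa_initial + spf_initial  # 104 ms
```

The 104ms result is what justifies using ~100ms as the hold-time floor in the timer values that follow.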

LSA Generation Hold Time: 100ms (approximately the convergence time)
LSA Generation Maximum Time: 1s (way above the 100ms)
OSPF Arrival Time: 50ms (way below the LSA Generation hold time)
SPF Hold Time: 100ms
SPF Maximum Hold Time: 1s (the maximum SPF runtime is 32ms, so in the worst condition we run SPF at most once per second; this results in SPF consuming no more than ~3% of CPU time under a non-stop LSA storm).

Now estimate the worst-case convergence time: LSA_Maximum_Delay (1s) + SPF_Maximum_Delay (1s) + RIB_Update (20ms), which is slightly above 2 seconds.

router ospf 10
!
! Suppress transit link prefixes
!
prefix-suppression
!
! Wait at least 50ms between accepting the same LSA
!
timers lsa arrival 50
!
! Throttle LSA generation
!
timers throttle lsa all 10 100 1000
!
! Throttle SPF runs
!
timers throttle spf 10 100 1000
!
! Pace interface-level flooding
!
timers pacing flood 5
!
! Make retransmission timer > than arrival
!
timers pacing retransmission 60
!
! Enable incremental SPF
!
ispf
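As a final sanity check, the timer relationships recommended throughout this post can be verified programmatically before committing a profile. The checker below is a sketch: the function, dictionary keys and the specific checks are mine, encoding the rules discussed above.

```python
def validate_profile(p):
    """Sanity-check the timer relationships recommended in this post.
    `p` is a dict of millisecond values; the key names are made up for this sketch."""
    checks = [
        p["lsa_arrival"] < p["lsa_hold"],        # don't drop a peer's refreshed LSA
        p["retransmission"] > p["lsa_arrival"],  # avoid needless retransmits
        p["flood_pacing"] < p["spf_initial"],    # flood LSAs before the first SPF run
        p["lsa_max"] >= 2 * p["lsa_hold"],       # truncation well above the hold time
        p["spf_max"] >= 2 * p["spf_hold"],
    ]
    return all(checks)

# The profile configured above:
profile = {
    "lsa_arrival": 50, "lsa_hold": 100, "lsa_max": 1000,
    "retransmission": 60, "flood_pacing": 5,
    "spf_initial": 10, "spf_hold": 100, "spf_max": 1000,
}
# validate_profile(profile) -> True
```

Running the same checks against a candidate profile for each router catches the classic mistakes, such as an arrival timer above the LSA hold time or a retransmission interval below the arrival timer.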

Conclusions

It is well known that link-state IGPs can be tuned for sub-second convergence under almost any practical scenario, while maintaining network stability by virtue of adaptive backoff timers. In this post we tried to provide a practical approach to calculating optimum throttling timer values based on your recorded network performance. It is worth noting that the three most important timers for tuning a network for sub-second convergence are the failure detection delay, the initial LSA generation delay and the initial SPF delay. All other timers, such as the hold and maximum times, serve the purpose of stabilizing the network, and affect convergence only in "worst-case" unstable network scenarios. Cisco's recommended values for the initial/hold/maximum timers are 10/100/5000 ms (see [ROUTED-CAMPUS]), but those may look a bit conservative, as they result in a worst-case convergence time above 10 seconds. Additionally, it is important to notice that in large topologies, a significant amount of time is spent on RIB/FIB updates after re-convergence. Therefore, in addition to tuning the throttling timers, you may want to implement other measures such as prefix suppression, better summarization (e.g. totally stubby areas) and minimization of external routing information. If your platform supports the feature, you may also implement a priority-driven RIB prefix installation process.

We omitted other fast-convergence elements such as resilient network design (e.g. redundancy resulting in equal-cost multipathing and faster OSPF adjacency restoration) and the NSF feature, which is very helpful for avoiding re-convergence during planned downtime. We also skipped some other features related to OSPF stability, such as flooding reduction and LSA group pacing, that could yield performance benefits in networks with large LSDBs. It is not possible to cover all relevant technologies in a single blog post, but you may refer to the further reading documents for more information. And finally, if you are planning to tune your IGP for fast convergence, make sure you understand all the consequences. Modern routing platforms are capable of handling almost any "stormy" network condition without losing overall network stability, but pushing a network to its limits is always dangerous. Make sure you monitor your OSPF statistics for potentially high or unusual values after tuning, or set the maximum timers to more conservative values (e.g. 3-5 seconds) for additional safety.

Further Reading

The following is the minimum list of the publications suggested to read on the topic of fast IGP convergence.

[ORDERED-FIB] "Loop-free convergence using oFIB"
[BLACKBOX-OSPF] "Experience in Black-box OSPF Measurement”
[SUBSEC-CONV] “Achieving Sub-second IGP Convergence in Large IP Networks”
[OSPF-FASTHELLO] "OSPF Fast Hello Enhancement"
[SPD] "Understanding Selective Packet Discard"
[TUNING-OSPF] "Tuning OSPF Performance"
[BFD] "Bi-Directional Forwarding Detection"
[OSPF-PREFIX-SUPPRESS] "OSPF Prefix Suppression Feature"
[ROUTED-CAMPUS] "Cisco fully routed campus design guidelines"
[OPT-DAMPENING] "Optimized IP Event Dampening"
[DIJKSTRA-SPF] "Dijkstra SPF algorithm"
