Edit: Thanks for playing! You can find the official answer and explanation here.

I had an interesting question come across my desk today which involved a very common area of confusion in OSPF routing logic, and now I'm posing this question to you as a challenge!

The first person to answer correctly will get free attendance to our upcoming CCIE Routing & Switching Lab Cram Session, which runs the week of June 1st 2015, as well as a free copy of the class in download format after it is complete.  The question is as follows:

Given the below topology, where R4 mutually redistributes between EIGRP and OSPF, which path(s) will R1 choose to reach the network 5.5.5.5/32, and why?

Bonus Questions:

• What will R2's path selection to 5.5.5.5/32 be, and why?
• What will R3's path selection to 5.5.5.5/32 be, and why?
• Assume R3's link to R1 is lost.  Does this affect R1's path selection to 5.5.5.5/32? If so, how?

Tomorrow I'll be posting the topology and config files for CSR1000v, VIRL, GNS3, etc. so you can try this out yourself, but first answer the question without seeing the result and see if your expected result matches the actual one!

Good luck everyone!

Hi Brian,

What is the major difference in using an E1 route over an E2 route in OSPF?

From what I’ve observed, if you redistribute a route into OSPF either E1 or E2, the upstream router will still use the shortest path to get to the ASBR regardless of what is shown in the routing table.

Matt

Hi Matt,

This is actually a very common area of confusion and misunderstanding in OSPF. Part of the problem is that the vast majority of CCNA and CCNP texts teach that for OSPF path selection of E1 vs E2 routes, E1 routes use the redistributed cost plus the cost to the ASBR, while E2 routes use only the redistributed cost. When I just checked the most recent CCNP ROUTE text from Cisco Press, it specifically says that "[w]hen flooded, OSPF has little work to do to calculate the metric for an E2 route, because by definition, the E2 route’s metric is simply the metric listed in the Type 5 LSA. In other words, the OSPF routers do not add any internal OSPF cost to the metric for an E2 route." While technically true, this statement is an oversimplification. For the CCNP level this might be fine, but for the CCIE level it is not.

The key point that I'll demonstrate in this post is that while it is true that "OSPF routers do not add any internal OSPF cost to the metric for an E2 route", both the intra-area and inter-area costs are still considered in the OSPF path selection state machine for these routes.

First, let's review the order of the OSPF path selection process. Regardless of a route’s metric or administrative distance, OSPF will choose routes in the following order:

Intra-Area (O)
Inter-Area (O IA)
External Type 1 (E1)
External Type 2 (E2)
NSSA Type 1 (N1)
NSSA Type 2 (N2)
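
To make the precedence concrete, here is a minimal Python sketch of the selection order as a comparator. This is purely illustrative (the dictionary and function names are mine, not anything in IOS); the ranking mirrors the list above.

```python
# Illustrative sketch only: OSPF route-type precedence as a sort key.
# The ranking mirrors the selection order listed above.
ROUTE_TYPE_RANK = {"O": 0, "O IA": 1, "E1": 2, "E2": 3, "N1": 4, "N2": 5}

def preferred(route_a, route_b):
    """Return whichever route has the more-preferred type, regardless of metric."""
    return min(route_a, route_b, key=lambda r: ROUTE_TYPE_RANK[r["type"]])

# An E1 route beats an E2 route even with a far worse metric:
winner = preferred({"type": "E1", "metric": 102}, {"type": "E2", "metric": 20})
print(winner["type"])  # E1
```

Note that the metric never enters the comparison: route type is checked first, and metrics only break ties between routes of the same type.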

To demonstrate this, take the following topology:

R1 connects to R2 and R3 via area 0. R2 and R3 connect to R4 and R5 via area 1 respectively. R4 and R5 connect to R6 via another routing domain, which is EIGRP in this case. R6 advertises the prefix 10.1.6.0/24 into EIGRP. R4 and R5 perform mutual redistribution between EIGRP and OSPF with the default parameters, as follows:

```R4:
router eigrp 10
redistribute ospf 1 metric 100000 100 255 1 1500
!
router ospf 1
redistribute eigrp 10 subnets
R5:
router eigrp 10
redistribute ospf 1 metric 100000 100 255 1 1500
!
router ospf 1
redistribute eigrp 10 subnets```

The result of this is that R1 learns the prefix 10.1.6.0/24 as an OSPF E2 route via both R2 and R3, with a default cost of 20. This can be seen in the routing table output below. The other OSPF learned routes are the transit links between the routers in question.

```R1#sh ip route ospf
10.0.0.0/24 is subnetted, 8 subnets
O E2    10.1.6.0 [110/20] via 10.1.13.3, 00:09:43, FastEthernet0/0.13
[110/20] via 10.1.12.2, 00:09:43, FastEthernet0/0.12
O IA    10.1.24.0 [110/2] via 10.1.12.2, 00:56:44, FastEthernet0/0.12
O E2    10.1.46.0 [110/20] via 10.1.13.3, 00:09:43, FastEthernet0/0.13
[110/20] via 10.1.12.2, 00:09:43, FastEthernet0/0.12
O IA    10.1.35.0 [110/2] via 10.1.13.3, 00:56:44, FastEthernet0/0.13
O E2    10.1.56.0 [110/20] via 10.1.13.3, 00:09:43, FastEthernet0/0.13
[110/20] via 10.1.12.2, 00:09:43, FastEthernet0/0.12```

Note that all the routes redistributed from EIGRP appear on R1 with a default metric of 20. Now let’s examine the details of the route 10.1.6.0/24 on R1.

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:12:03 ago
Routing Descriptor Blocks:
10.1.13.3, from 10.1.5.5, 00:12:03 ago, via FastEthernet0/0.13
Route metric is 20, traffic share count is 1
* 10.1.12.2, from 10.1.4.4, 00:12:03 ago, via FastEthernet0/0.12
Route metric is 20, traffic share count is 1```

As expected, both paths, via R2 and R3, have a metric of 20. However, there is an additional field in the route’s output called the “forward metric”. This field denotes the cost to the ASBR(s). In this case, the ASBRs are R4 and R5 for the routes via R2 and R3 respectively. Since all interfaces are FastEthernet, with a default OSPF cost of 1, the cost to both R4 and R5 is 2, essentially 2 hops.

The reason that multiple routes are installed in R1’s routing table is that the route type (E2), the metric (20), and the forward metric (2) are all a tie. If any of these fields were to change, the path selection would change.
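
One way to picture this tie-break, sketched in Python (the field names and helper are hypothetical, not the actual IOS implementation): among E2 candidates, compare the redistributed metric first, then the forward metric, and install every path that ties on both.

```python
# Hypothetical sketch, not the actual IOS code: E2 candidates are ranked
# by (redistributed metric, forward metric); full ties mean ECMP.
def best_e2_paths(candidates):
    best = min((c["metric"], c["fwd_metric"]) for c in candidates)
    return [c for c in candidates if (c["metric"], c["fwd_metric"]) == best]

via_r2 = {"nexthop": "10.1.12.2", "metric": 20, "fwd_metric": 2}
via_r3 = {"nexthop": "10.1.13.3", "metric": 20, "fwd_metric": 2}
print(len(best_e2_paths([via_r2, via_r3])))  # 2: both paths installed
```

Change either field on one candidate, and only a single path survives the comparison.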

To demonstrate this, let’s change the route type to E1 under R4’s OSPF process. This can be accomplished as follows:

```R4#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1
R4(config-router)#end
R4#```

The result of this change is that R1 now only installs a single route to 10.1.6.0/24, the E1 route learned via R2.

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 22, type extern 1
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:00:35 ago
Routing Descriptor Blocks:
* 10.1.12.2, from 10.1.4.4, 00:00:35 ago, via FastEthernet0/0.12
Route metric is 22, traffic share count is 1```

Note that the metric and the forward metric seen in the previous E2 route are now collapsed into the single “metric” field of the E1 route. Although the value is technically the same (a cost of 2 to the ASBR plus the cost of 20 that the ASBR reports), the E1 route is preferred over the E2 route due to the OSPF path selection state machine preference. Even if we were to raise the metric of the E1 route so that its cost is higher than the E2 route's, the E1 route would still be preferred:

```R4#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 100
R4(config-router)#end
R4#```

R1 still installs the E1 route, even though the E1 metric of 102 is higher than the E2 metric of 20 plus a forward metric of 2.

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 102, type extern 1
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:00:15 ago
Routing Descriptor Blocks:
* 10.1.12.2, from 10.1.4.4, 00:00:15 ago, via FastEthernet0/0.12
Route metric is 102, traffic share count is 1```

R1 still knows about both the E1 and the E2 route in the Link-State Database, but the E1 route must always be preferred:

```R1#show ip ospf database external 10.1.6.0
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 64
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000003
Checksum: 0x1C8E
Length: 36
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 100
External Route Tag: 0
LS age: 1388
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000001
Checksum: 0x7307
Length: 36
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 20
External Route Tag: 0```

This is the behavior we would expect, because E1 routes must always be preferred over E2 routes. Now let’s look at some of the commonly misunderstood cases, where the E2 routes use both the metric and the forward metric for their path selection.

First, R4’s redistribution is modified to return the metric-type to E2, but with a higher metric of 100 instead of the default of 20:

```R4#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 100
R4(config-router)#end
R4#```

The result on R1 is that the route via R4 is less preferred, since it now has a metric of 100 (and still a forward metric of 2) vs the metric of 20 (and the forward metric of 2) via R5.

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:00:30 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:00:30 ago, via FastEthernet0/0.13
Route metric is 20, traffic share count is 1```

The alternate route via R4 can still be seen in the database.

```R1#show ip ospf database external 10.1.6.0
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 34
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000004
Checksum: 0x9D8B
Length: 36
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 100
External Route Tag: 0
Routing Bit Set on this LSA
LS age: 1653
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000001
Checksum: 0x7307
Length: 36
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 20
External Route Tag: 0```

This is the path selection that we would ideally want, because the total cost of the path via R4 is 102 (metric of 100 plus a forward metric of 2), while the cost of the path via R5 is 22 (metric of 20 plus a forward metric of 2). The result of this path selection would be the same if we were to change both routes to E1, as seen below.

```R4#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 100
R4(config-router)#end
R4#
R5#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 1
R5(config-router)#end
R5#```

R1 still chooses the route via R5, since this has a cost of 22 vs R4’s cost of 102.

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 22, type extern 1
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:00:41 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:00:41 ago, via FastEthernet0/0.13
Route metric is 22, traffic share count is 1
R1#show ip ospf database external 10.1.6.0
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 56
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000005
Checksum: 0x1890
Length: 36
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 100
External Route Tag: 0
Routing Bit Set on this LSA
LS age: 45
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000003
Checksum: 0xEB0D
Length: 36
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 20
External Route Tag: 0
R1#```

Note that the E1 route itself in the database does not include the cost to the ASBR. This must be calculated separately either based on the Type-1 LSA or Type-4 LSA, depending on whether the route to the ASBR is Intra-Area or Inter-Area respectively.

So now this begs the question: why does it matter if we use E1 vs E2? Of course, as we saw, E1 is always preferred over E2 due to the OSPF path selection order, but what is the difference between having *all* E1 routes vs having *all* E2 routes? Now let’s look at a case where it *does* matter whether you’re using E1 or E2.

R1’s OSPF cost on the link to R2 is increased as follows:

```R1#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#interface Fa0/0.12
R1(config-subif)#ip ospf cost 100
R1(config-subif)#end
R1#```

R4 and R5’s redistribution is modified as follows:

```R4#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 99
R4(config-router)#end
R4#
R5#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 1 metric 198
R5(config-router)#end
R5#```

Now R1’s routes to the prefix 10.1.6.0/24 are as follows: Path 1 is via the link to R2 with a cost of 100, plus the link to R4 with a cost of 1, plus the redistributed metric of 99, making the total path cost 200. Path 2 is via the link to R3 with a cost of 1, plus the link to R5 with a cost of 1, plus the redistributed metric of 198, making the total path cost 200 as well. The result is that R1 installs both paths equally:

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 200, type extern 1
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:02:54 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:02:54 ago, via FastEthernet0/0.13
Route metric is 200, traffic share count is 1
10.1.12.2, from 10.1.4.4, 00:02:54 ago, via FastEthernet0/0.12
Route metric is 200, traffic share count is 1```
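
The arithmetic above can be double-checked with a quick sketch. The link costs and redistributed metrics are taken from the scenario; the helper name is illustrative.

```python
# Illustrative sketch: an E1 route's cost is the sum of the internal link
# costs to the ASBR plus the metric the ASBR redistributed the route with.
def e1_cost(link_costs, redistributed_metric):
    return sum(link_costs) + redistributed_metric

# Path 1: R1 -> R2 (cost 100) -> R4 (cost 1), redistributed metric 99
# Path 2: R1 -> R3 (cost 1) -> R5 (cost 1), redistributed metric 198
print(e1_cost([100, 1], 99), e1_cost([1, 1], 198))  # 200 200: a tie, so ECMP
```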

Note that the database lists the costs of the Type-5 External LSAs as different though:

```R1#show ip ospf database external 10.1.6.0
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 291
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000006
Checksum: 0xC9C
Length: 36
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 99
External Route Tag: 0
Routing Bit Set on this LSA
LS age: 207
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000004
Checksum: 0xE460
Length: 36
Metric Type: 1 (Comparable directly to link state metric)
TOS: 0
Metric: 198
External Route Tag: 0```

What happens if we were to change the metric-type to 2 on both R4 and R5 now? Let’s see:

```R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 99
R4(config-router)#end
R4#
R5#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 198
R5(config-router)#end
R5#```

Even though the end-to-end costs are still the same, R1 should now prefer the path with the lower redistributed metric via R4:

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 99, type extern 2, forward metric 101
Last update from 10.1.12.2 on FastEthernet0/0.12, 00:01:09 ago
Routing Descriptor Blocks:
* 10.1.12.2, from 10.1.4.4, 00:01:09 ago, via FastEthernet0/0.12
Route metric is 99, traffic share count is 1```

The forward metric of this route means that the total cost is still 200 (the metric of 99 plus the forward metric of 101). In this case, even though both paths are technically equal, only the path with the lower redistribution metric is installed. Now let’s see what happens if we do set the redistribution metric the same.

```R4#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#router ospf 1
R4(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 1
R4(config-router)#end
R4#
R5#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R5(config)#router ospf 1
R5(config-router)#redistribute eigrp 10 subnets metric-type 2 metric 1
R5(config-router)#end
R5#```

Both routes now have the same metric of 1, so both should be installed in R1’s routing table, right? Let’s check:

```R1#show ip route 10.1.6.0
Routing entry for 10.1.6.0/24
Known via "ospf 1", distance 110, metric 1, type extern 2, forward metric 2
Last update from 10.1.13.3 on FastEthernet0/0.13, 00:00:42 ago
Routing Descriptor Blocks:
* 10.1.13.3, from 10.1.5.5, 00:00:42 ago, via FastEthernet0/0.13
Route metric is 1, traffic share count is 1```

This may not be the result we expect. Only the path via R5 is installed, not the path via R4. Let’s look at the database to see why:

```R1#show ip ospf database external 10.1.6.0
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 56
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000008
Checksum: 0xB3D4
Length: 36
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
External Route Tag: 0
Routing Bit Set on this LSA
LS age: 47
Options: (No TOS-capability, DC)
Link State ID: 10.1.6.0 (External Network Number )
LS Seq Number: 80000006
Length: 36
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
External Route Tag: 0```

Both of these routes show the same cost, as denoted by the “Metric: 1”, so why is one being chosen over the other? The reason is that in reality, OSPF External Type-2 (E2) routes *do* take the cost to the ASBR into account during route calculation. The problem though is that by looking at just the External LSA’s information, we can’t see why we’re choosing one over the other.

Now let’s go through the entire recursion process in the database to figure out why R1 is choosing the path via R5 over the path to R4.

First, as we saw above, R1 finds both routes to the prefix with a metric of 1. Since this is a tie, the next thing R1 does is determine if the route to the ASBR is via an Intra-Area path. This is done by looking up the Type-1 Router LSA for the Advertising Router field found in the Type-5 External LSA.

```R1#show ip ospf database router 10.1.4.4
OSPF Router with ID (10.1.1.1) (Process ID 1)
R1#show ip ospf database router 10.1.5.5
OSPF Router with ID (10.1.1.1) (Process ID 1)
R1#```

This output on R1 means that it does not have an Intra-Area path to either of the ASBRs advertising these routes. The next step is to check if there is an Inter-Area path. This is done by examining the Type-4 ASBR Summary LSA.

```R1#show ip ospf database asbr-summary 10.1.4.4
OSPF Router with ID (10.1.1.1) (Process ID 1)
Summary ASB Link States (Area 0)
Routing Bit Set on this LSA
LS age: 1889
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(AS Boundary Router)
LS Seq Number: 80000002
Checksum: 0x24F3
Length: 28
TOS: 0  Metric: 1
R1#show ip ospf database asbr-summary 10.1.5.5
OSPF Router with ID (10.1.1.1) (Process ID 1)
Summary ASB Link States (Area 0)
Routing Bit Set on this LSA
LS age: 1871
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(AS Boundary Router)
LS Seq Number: 80000002
Checksum: 0x212
Length: 28
TOS: 0  Metric: 1```

This output indicates that R1 does have Inter-Area routes to the ASBRs R4 and R5. The Inter-Area metric to reach them is 1 via the ABRs R2 (10.1.2.2) and R3 (10.1.3.3) respectively. Now R1 needs to determine which ABR is closer: R2 or R3. This is accomplished by looking up the Type-1 Router LSAs of the ABRs that originate the Type-4 ASBR Summary LSAs.

```R1#show ip ospf database router 10.1.2.2
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 724
Options: (No TOS-capability, DC)
LS Seq Number: 8000000D
Checksum: 0xA332
Length: 36
Area Border Router
Link connected to: a Transit Network
Number of TOS metrics: 0
TOS 0 Metrics: 1
R1#show ip ospf database router 10.1.3.3
OSPF Router with ID (10.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 1217
Options: (No TOS-capability, DC)
LS Seq Number: 80000010
Checksum: 0x9537
Length: 36
Area Border Router
Link connected to: a Transit Network
Number of TOS metrics: 0
TOS 0 Metrics: 1```

This output indicates that R2 and R3 are adjacent with the Designated Routers 10.1.12.2 and 10.1.13.3 respectively. Since R1 is also adjacent with these DRs, the cost from R1 to the DR is now added to the path.

```R1#show ip ospf database router 10.1.1.1
OSPF Router with ID (10.1.1.1) (Process ID 1)
LS age: 948
Options: (No TOS-capability, DC)
LS Seq Number: 8000000F
Checksum: 0x6FA6
Length: 60
Link connected to: a Stub Network
Number of TOS metrics: 0
TOS 0 Metrics: 1
Link connected to: a Transit Network
Number of TOS metrics: 0
TOS 0 Metrics: 1
Link connected to: a Transit Network
Number of TOS metrics: 0
TOS 0 Metrics: 100```

R1 now knows that its cost to the DR 10.1.12.2, to which R2 is adjacent, is 100; R2's cost to R4 is 1, and R4's redistributed metric is 1. Likewise, R1's cost to the DR 10.1.13.3, to which R3 is adjacent, is 1; R3's cost to R5 is 1, and R5's redistributed metric is 1. This means that the total cost to reach 10.1.6.0 via the R1 -> R2 -> R4 path is 102, while the total cost via the R1 -> R3 -> R5 path is 3.
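
The whole recursion can be summarized in a small sketch, with the values taken from the scenario above. The function is an illustration of the computation, not the actual SPF code.

```python
# Illustrative sketch of the recursion, not actual SPF code: the forward
# metric to an ASBR is the cost to the ABR plus the metric in the ABR's
# Type-4 ASBR Summary LSA; the redistributed metric is then added on top.
def total_cost(cost_to_abr, type4_metric, redistributed_metric):
    forward_metric = cost_to_abr + type4_metric
    return forward_metric + redistributed_metric

# R1 -> R2 (cost 100) -> R4 (Type-4 metric 1), redistributed metric 1
via_r2_r4 = total_cost(100, 1, 1)
# R1 -> R3 (cost 1) -> R5 (Type-4 metric 1), redistributed metric 1
via_r3_r5 = total_cost(1, 1, 1)
print(via_r2_r4, via_r3_r5)  # 102 vs 3: the path via R3 and R5 wins
```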

The final result is that R1 chooses the shorter path to the ASBR, which is the R1 -> R3 -> R5 path. Although the other route to the prefix is an E2 route with the same external cost, one is preferred over the other due to the shorter path to the ASBR.

Based on this we can see that both E1 and E2 routes take both the redistributed cost and the cost to the ASBR into account when making their path selection. The key difference is that E1 is always preferred over E2, followed by the E2 route with the lower redistribution metric. If multiple E2 routes exist with the same redistribution metric, the path with the lower forward metric (metric to the ASBR) is preferred. If there are multiple E2 routes with both the same redistribution metric and forward metric, they can both be installed in the routing table. Why does OSPF do this though? Originally this stems from the design concepts of "hot potato" and "cold potato" routing.

Think of a routing domain learning external routes. Typically those prefixes have some "external" metric associated with them, for example the E2 external metric or the BGP MED attribute value. If the routers in the local domain select the exit point based on the external metric, they are said to perform "cold potato" routing. This means that the exit point is selected based on the external metric preference, e.g. the distance to the prefix in the bordering routing system. This optimizes link utilization in the external system but may lead to suboptimal path selection in the local domain. Conversely, "hot potato" routing is the model where the exit point is selected based on the local metric to the exit point associated with the prefix. In other words, the "hot potato" model tries to push packets out of the local system as quickly as possible, optimizing internal link utilization.

Now within the scope of OSPF, think of the E2 route selection process: OSPF chooses the best exit point based on the external metric and uses the internal cost to ASBR as a tie breaker. In other words, OSPF performs "cold potato" routing with respect to E2 prefixes. It is easy to turn this process into "hot potato" by ensuring that every exit point uses the same E2 metric value. It is also possible to perform other sorts of traffic engineering by selectively manipulating the external metric associated with the E2 route, allowing for full flexibility of exit point selection.

Finally, we approach E1. This type of routing is a hybrid of the hot and cold potato models: external metrics are added directly to the internal metrics. This implicitly assumes that the external metrics are "comparable" to the internal metrics. In turn, this means E1 is meant to be used with another OSPF domain that uses a similar metric system. This is commonly found in split/merge scenarios where you have multiple routing processes within the same autonomous system and want to achieve optimal path selection accounting for the metrics in both systems. This is similar to the way EIGRP performs metric computation for external prefixes.

So there we have it. While it is technically true that "OSPF routers do not add any internal OSPF cost to the metric for an E2 route", both the intra-area and inter-area cost can still be considered in the OSPF path selection regardless of whether the route is E1 or E2.

OSPF and MTU Mismatch

Dear Brian,

What is the difference between using the “system mtu routing 1500” and the “ip ospf mtu-ignore” commands when running OSPF between a router and a switch?

Thanks,

Paul

Hi Paul,

Within the scope of the CCIE Lab Exam, it may be acceptable to issue either of these commands to solve a specific lab task. However, it is key to note that there is a difference between ignoring the MTU for the purpose of OSPF adjacency and matching the MTU within a real production network.

By design, OSPF will automatically detect an MTU mismatch between two devices when they exchange Database Description (DBD) packets during adjacency formation. This is per the standard OSPF specification defined in RFC 2328, “OSPF Version 2”. Specifically, the RFC states the following:

```10.6.  Receiving Database Description Packets
This section explains the detailed processing of a received
Database Description Packet.
[snip]
If the Interface MTU field in the Database Description packet
indicates an IP datagram size that is larger than the router can
accept on the receiving interface without fragmentation, the
Database Description packet is rejected.
[/snip]
```

Basically this means that if a router tries to negotiate an adjacency on an interface on which the remote neighbor has a larger MTU, the adjacency will be denied. The idea behind this check is two-fold. The first reason is to alleviate a problem in the data plane, in which a sending host transmits packets that are too large for the receiver to accept. Typically, Path MTU Discovery (PMTUD) should be implemented on the sender to prevent this case; however, this process relies on ICMP messages that could be filtered out in the transit path due to a security policy. The second, and more important, reason is to alleviate a problem in the control plane, in which OSPF packets are exchanged.
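
The RFC 2328 section 10.6 rule quoted above boils down to a single comparison; a minimal sketch (the function name is hypothetical):

```python
# Sketch of the RFC 2328 section 10.6 rule (function name is hypothetical):
# a DBD advertising an Interface MTU larger than what the receiving
# interface can accept without fragmentation is rejected.
def accept_dbd(local_mtu, neighbor_dbd_mtu):
    return neighbor_dbd_mtu <= local_mtu

print(accept_dbd(1500, 2000))  # False: adjacency stalls in ExStart
print(accept_dbd(1500, 1500))  # True
```

Note the check is one-sided: each router only rejects a neighbor advertising a *larger* MTU than its own.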

Specifically this problem stems from the issue that the OSPF Hello, Database Description (DBD), Link-State Request (LSR), and Link-State Acknowledgement (LSAck) packets are generally small, but the Link-State Update (LSU) packets are generally not.

When establishing a new OSPF adjacency, the DBD packet is used to tell new neighbors what LSAs are in the database, but not to give the details about them. Specifically the DBD contains the LSA Header information, but not the actual LSA payload. The idea behind this is to optimize flooding in the case that the receiving router already received the LSA from another neighbor, in which case flooding does not need to occur during adjacency establishment.

For example, suppose that you and I, routers A and B, both have neighbors C and D, and the database is synchronized. If you and I form a new adjacency, my DBD exchange to you will say that I have LSAs A, B, C, and D in my database. Since you are already adjacent with C and D, and I am adjacent with them, you already have all of my LSAs, possibly with the exception of the new link that connects us. This means that even though I describe LSAs A and B to you with my DBD packet, you don't send an LSR to me for them, which means I don't send you an LSU about them. This is the normal optimization of how the database is exchanged so that excessive flooding doesn't occur.

For example, suppose that the flooding router has a Gigabit Ethernet interface that supports Jumbo frames, which exceed the normal Ethernet MTU of 1500 bytes, while the receiving router has not enabled Jumbo frame support, meaning that frames over 1500 bytes (excluding layer 2 overhead) will be dropped. If the flooding router sends multiple LSAs in an LSU, forcing the packet size to exceed 1500 bytes, or if a single LSA is large enough to exceed 1500 bytes on its own, such as a Router LSA (LSA Type 1) with many links, the results can be non-deterministic.

To demonstrate this, take the following topology.

R1 and R2 connect with GigabitEthernet, while R2 and R3 connect with FastEthernet. R1 has the default MTU of 1500 bytes configured on its link to R2, while R2 has Jumbo frame support configured up to 2000 bytes. R2 and R3’s link uses the default MTU of 1500 bytes. Per the RFC’s defined behavior, R1 should reject an OSPF adjacency with R2. This default behavior can be seen as follows:

```R1:
interface GigabitEthernet1/0
!
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
R2:
interface GigabitEthernet1/0
mtu 2000
!
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
R1#debug ip packet detail
IP packet debugging is on (detailed)
OSPF adjacency events debugging is on
01:07:18: OSPF: Rcv DBD from 2.2.2.2 on GigabitEthernet1/0 seq 0x172A opt 0x52 flag 0x7 len 32  mtu 2000 state EXSTART
01:07:18: OSPF: Nbr 2.2.2.2 has larger interface MTU
01:07:18: OSPF: Retransmitting DBD to 2.2.2.2 on GigabitEthernet1/0
01:07:18: OSPF: Up DBD Retransmit cnt to 5 for 2.2.2.2 on GigabitEthernet1/0
01:07:18: OSPF: Send DBD to 2.2.2.2 on GigabitEthernet1/0 seq 0x1813 opt 0x52 flag 0x7 len 32```

In this case we can see that R1 rejects R2's DBD packet, since the MTU is larger. Although the obvious solution is simply to match the MTU of the links in the first place, IOS also offers the "ip ospf mtu-ignore" command at the interface level to skip this check in the OSPF adjacency state machine. Once applied, as seen below, R1 and R2 form an adjacency.

```R1#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#interface Gig1/0
R1(config-if)#ip ospf mtu-ignore
R1(config-if)#end
R1#
R1#show ip ospf neighbor
2.2.2.2           1   FULL/DR         00:00:36    12.0.0.2        GigabitEthernet1/0```

At this point, both R1 and R2 learn the routes to each other's Loopback0 interfaces, as seen below.

```R1#show ip route ospf
2.0.0.0/32 is subnetted, 1 subnets
O       2.2.2.2 [110/2] via 12.0.0.2, 00:00:05, GigabitEthernet1/0
R2#show ip route ospf
1.0.0.0/32 is subnetted, 1 subnets
O       1.1.1.1 [110/2] via 12.0.0.1, 00:00:46, GigabitEthernet1/0```

As expected however, since there is an MTU mismatch, R1 is unable to receive packets from R2 that exceed an MTU of 1500 bytes.

```R2#ping 1.1.1.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/16/20 ms
R2#ping
Protocol [ip]:
Repeat count [5]:
Datagram size [100]: 2000
Timeout in seconds [2]:
Extended commands [n]: y
Type of service [0]:
Set DF bit in IP header? [no]: yes
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 5, 2000-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)```

Theoretically this MTU mismatch should not matter, since end hosts that send traffic should ideally implement Path MTU Discovery. However, let's now see a case where R2 is unable to flood LSAs to R1 for which the IP packet size exceeds 1500 bytes.

R3, which connects to R2, has been configured with a large number of Loopback interfaces in order to generate a large Router LSA (LSA Type 1). R3's configuration is as follows, where Loopbacks 3.3.3.2 - 3.3.3.253 have been omitted:

```R3:
interface FastEthernet0/0
shutdown
!
interface Loopback3330
!
[snip]
!
interface Loopback333254
!
router ospf 1
network 0.0.0.0 255.255.255.255 area 0```

The number of resulting local links can be seen in R3's database as follows:

```R3#show ip ospf database
OSPF Router with ID (23.0.0.3) (Process ID 1)
23.0.0.3        23.0.0.3        299         0x80000007 0x0050D2 254```

Now let's activate the link between R2 and R3, which will cause R3 to flood its large Router LSA to R2, and R2 in turn to flood it to R1.

```R3#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R3(config)#int Fa0/0
R3(config-if)#no shutdown
R3(config-if)#end
R3#
R2#debug ip packet detail
IP packet debugging is on (detailed)
R2#debug ip ospf packet
OSPF packet debugging is on
R2#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#interface Fa2/0
R2(config-if)#no shutdown
R2(config-if)#end
R2#
%SYS-5-CONFIG_I: Configured from console by console
IP: s=23.0.0.3 (FastEthernet2/0), d=224.0.0.5, len 76, rcvd 0, proto=89
OSPF: rcv. v:2 t:1 l:44 rid:23.0.0.3
aid:0.0.0.0 chk:D59B aut:0 auk: from FastEthernet2/0
IP: s=23.0.0.2 (local), d=23.0.0.3 (FastEthernet2/0), len 80, sending, proto=89
[snip]```

R2 and R3 form an adjacency, and R3's LSA is flooded to R2. Since the LSA does not fit in a single 1500-byte packet, the LSU carrying it is fragmented at the IP layer into multiple packets, with the largest fragment matching the shared 1500-byte MTU between them.

```IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 1500, rcvd 0
IP Fragment, Ident = 497, fragment offset = 0, proto=89
IP: recv fragment from 23.0.0.3 offset 0 bytes
IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 1500, rcvd 0
IP Fragment, Ident = 497, fragment offset = 1480
IP: recv fragment from 23.0.0.3 offset 1480 bytes
IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 172, rcvd 0
IP Fragment, Ident = 497, fragment offset = 2960
IP: recv fragment from 23.0.0.3 offset 2960 bytes
OSPF: rcv. v:2 t:4 l:3112 rid:23.0.0.3
aid:0.0.0.0 chk:297C aut:0 auk: from FastEthernet2/0```

Once the adjacency is full, R2 installs R3's routes and begins to flood them to R1:

```R2#show ip route ospf
1.0.0.0/32 is subnetted, 1 subnets
O       1.1.1.1 [110/2] via 12.0.0.1, 00:00:10, GigabitEthernet1/0
3.0.0.0/32 is subnetted, 254 subnets
O       3.3.3.1 [110/2] via 23.0.0.3, 00:00:10, FastEthernet2/0
[snip]
O       3.3.3.254 [110/2] via 23.0.0.3, 00:00:10, FastEthernet2/0
R2#
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 3132, sending broad/multicast, proto=89
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 1996, sending fragment
IP Fragment, Ident = 854, fragment offset = 0, proto=89
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 1156, sending last fragment
IP Fragment, Ident = 854, fragment offset = 1976```
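The fragment sizes and offsets in the debug output follow directly from IP fragmentation arithmetic. A minimal sketch, assuming a 20-byte IP header and the rule that each non-final fragment's payload must be a multiple of 8 bytes:

```python
def ip_fragments(total_payload: int, mtu: int, header: int = 20):
    """Split an IP payload into (fragment offset, IP packet length) pairs
    for a given link MTU. Non-final fragment payloads are 8-byte aligned."""
    per_frag = (mtu - header) // 8 * 8  # largest 8-byte-aligned payload per fragment
    frags, offset = [], 0
    while offset < total_payload:
        chunk = min(per_frag, total_payload - offset)
        frags.append((offset, header + chunk))
        offset += chunk
    return frags

# R3 -> R2 link (MTU 1500), carrying the 3112-byte OSPF LSU: three fragments
print(ip_fragments(3112, 1500))  # [(0, 1500), (1480, 1500), (2960, 172)]
# R2 -> R1 direction (R2's MTU of 2000): two fragments, as in the debug above
print(ip_fragments(3112, 2000))  # [(0, 1996), (1976, 1156)]
```

The computed fragments match the debug output exactly: offsets 0/1480/2960 on the 1500-byte link, and lengths 1996/1156 at offsets 0/1976 on the 2000-byte link.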

Note that since the 3132-byte LSU exceeds R2's interface MTU of 2000 bytes, it is fragmented into multiple packets. Since R1 cannot accept packets that exceed its own MTU of 1500 bytes, the LSUs are never received. This means that R1 cannot synchronize its database with R2, as seen as follows.

```R1#show ip ospf database
OSPF Router with ID (1.1.1.1) (Process ID 1)
1.1.1.1         1.1.1.1         62          0x80000005 0x6592   2
2.2.2.2         2.2.2.2         35          0x8000000D 0x613E   3
12.0.0.1        1.1.1.1         62          0x80000001 0x61BB
23.0.0.3        23.0.0.3        36          0x80000001 0x974C
R2#show ip ospf database
OSPF Router with ID (2.2.2.2) (Process ID 1)
1.1.1.1         1.1.1.1         67          0x80000005 0x6592   2
2.2.2.2         2.2.2.2         38          0x8000000D 0x613E   3
23.0.0.3        23.0.0.3        39          0x80000005 0x2AAD   255
12.0.0.1        1.1.1.1         67          0x80000001 0x61BB
23.0.0.3        23.0.0.3        39          0x80000001 0x974C
R3#show ip ospf database
OSPF Router with ID (23.0.0.3) (Process ID 1)
1.1.1.1         1.1.1.1         69          0x80000005 0x006592 2
2.2.2.2         2.2.2.2         40          0x8000000D 0x00613E 3
23.0.0.3        23.0.0.3        39          0x80000005 0x002AAD 255
12.0.0.1        1.1.1.1         69          0x80000001 0x0061BB
23.0.0.3        23.0.0.3        39          0x80000001 0x00974C```

This also implies that R1 cannot install routes towards R3:

```R1#show ip route ospf
2.0.0.0/32 is subnetted, 1 subnets
O       2.2.2.2 [110/2] via 12.0.0.2, 00:00:02, GigabitEthernet1/0
23.0.0.0/24 is subnetted, 1 subnets
O       23.0.0.0 [110/2] via 12.0.0.2, 00:00:02, GigabitEthernet1/0```

Eventually the adjacency between R1 and R2 is lost, due to the lack of LSAcks sent in response to R2's LSUs. This can be seen in R1's "debug ip ospf packet" output, and in "show ip ospf neighbor" on both devices:

```R1#
OSPF: rcv. v:2 t:1 l:44 rid:2.2.2.2
aid:0.0.0.0 chk:DC98 aut:0 auk: from GigabitEthernet1/0
OSPF: Cannot see ourself in hello from 2.2.2.2 on GigabitEthernet1/0, state INIT
R1#show ip ospf neighbor
R2#show ip ospf neighbor
23.0.0.3          1   FULL/DR         00:00:35    23.0.0.3        FastEthernet2/0
1.1.1.1           1   FULL/BDR        00:00:39    12.0.0.1        GigabitEthernet1/0```

The key point of this example is that although the "ip ospf mtu-ignore" command allows the initial adjacency to form between R1 and R2, synchronization between them fails as soon as an LSA flooding event causes R2 to generate packets that exceed R1's MTU.

Based on this we can see that the "ip ospf mtu-ignore" command is not a fix for the underlying problem; it is simply an exception to the OSPF adjacency state machine. The real fix is to ensure that the MTU values match between neighbors, which prevents both failed routing exchange in the control plane and packet drops due to unsupported sizes in the data plane.

One of the most important technical protocols on the planet is Open Shortest Path First (OSPF). This highly tunable and very scalable Interior Gateway Protocol (IGP) was designed as the replacement technology for the very problematic Routing Information Protocol (RIP). As such, it has become the IGP chosen by many corporate enterprises.

OSPF’s design, operation, implementation and maintenance can be extremely complex. The 3-Day INE bootcamp dedicated to this protocol will be the most in-depth coverage in the history of INE videos.

This course will be developed by Brian McGahan and Petr Lapukhov. It will be delivered online in a Self-Paced format. The course will be available for purchase soon for \$295.

Here is a preliminary outline:

Day 1 OSPF Operations

●      Dijkstra Algorithm

○   OSPF Packet Formats

○   OSPF Authentication

●      Concept of Areas

○   Notion of ABR

○   Notion of ASBR

●      Network Types

○   Flooding with DR

○   Topological Representation

○   LSA Format (Checksum, Seq#, etc)

○   LSA Types

○   LSA Purging

●      The Routing Table

○   How is RIB computed from LSDB

●      Flooding Reduction

○   DNA bit

○   DC Circuits

○   Database Filter

Day 2 Configuring OSPF

●      Basic Configurations

○   Setting Router IDs

●      NBMA Networks

○   Selecting Network Type

○   Ensuring peer reachability

●      Special Areas

○   Stub Area Types

○   Routing in NSSA Areas

●      OSPF Summarization

○   Internal vs External

○   Transit Capability

Day 3 Advanced Topics and Troubleshooting

●      OSPF Fast Convergence

○   L3 and L2 interaction

○   SPF and LSA Throttling

●      OSPF Tuning

○   LSA Pacing

○   Hello Timer Tuning

○   Max-Metric LSA

●      OSPF in MPLS Layer 3 VPNs

○   Superbackbone

○   MP-BGP extensions for OSPF

○   Loop-Prevention Concepts

●      Inter-Area Loop Prevention Caveats

●      Key OSPF Verifications

●      OSPF Troubleshooting Process

○   Adjacency Problems (e.g. MTU issues)

○   Intra-area reachability (e.g. network types mismatch)

○   Inter-area reachability (e.g. summary LSA blocking)

○   Troubleshooting VLs and SLs

Continuing my review of titles from Petr’s excellent CCDE reading list for his upcoming LIVE and ONLINE CCDE Bootcamps, here are further notes to keep in mind regarding EIGRP.

• The algorithm used for this advanced Distance Vector protocol is the Diffusing Update Algorithm.
• As we discussed at length in this post, the metric is based upon Bandwidth and Delay values.
• For updates, EIGRP uses Update and Query packets that are sent to a multicast address.
• Split horizon and DUAL form the basis of loop prevention for EIGRP.
• EIGRP is a classless routing protocol that is capable of Variable Length Subnet Masking.
• Automatic summarization is on by default, but summarization and filtering can be accomplished anywhere inside the network.

EIGRP forms "neighbor relationships" as a key part of its operation, and Hello packets are used to maintain the relationship. A hold timer dictates when a neighbor is assumed to be no longer reachable, which causes the removal of topology information learned from that neighbor. This hold timer is reset when any packet is received from the neighbor, not just a Hello packet.

EIGRP uses the network type in order to dictate default Hello and Hold Time values:

• For all point-to-point link types, the default Hello is 5 seconds and the default Hold is 15 seconds
• For all links with a bandwidth over 1 Mbps, the defaults are also 5 and 15 seconds respectively
• For multipoint links with a bandwidth less than 1 Mbps, the default Hello is 60 seconds and the default Hold is 180 seconds

Interestingly, these values are carried in the Hello packets themselves and do not need to match in order for an adjacency to form (unlike OSPF).
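The default-timer rules above can be condensed into a small lookup. This is a simplification for illustration (real IOS keys the decision off the interface type as well; the 1 Mbps threshold is taken from the text):

```python
def eigrp_default_timers(point_to_point: bool, bandwidth_kbps: int) -> tuple:
    """Return the default (hello, hold) in seconds per the rules above."""
    if point_to_point or bandwidth_kbps > 1000:
        return (5, 15)       # fast or point-to-point links
    return (60, 180)         # slow multipoint links

print(eigrp_default_timers(True, 1544))   # (5, 15)
print(eigrp_default_timers(False, 512))   # (60, 180)
```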

### Reliable Transport

By default, EIGRP sends updates and other information to multicast 224.0.0.10 and the associated multicast MAC address of 01-00-5E-00-00-0A.

For multicast packets that need to be reliably delivered, EIGRP waits until an RTO (retransmission timeout) expires before beginning recovery action. This RTO value is based on the SRTT (smooth round-trip time) for the neighbor. Both values can be seen in the show ip eigrp neighbors command.

If the router sends a reliable packet and does not receive an acknowledgement from a neighbor, it tells that neighbor to stop listening to multicast until told otherwise, and begins unicasting the update information. Once unicasting, it will retry 16 times or until the hold timer expires, whichever is greater. It will then reset the neighbor and declare a Retransmission Limit Exceeded error.
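One way to read the "16 times or the hold timer, whichever is greater" rule is as a retry budget. The sketch below is an interpretation for illustration, not IOS source behavior:

```python
def retransmit_budget_ms(rto_ms: int, hold_ms: int) -> int:
    """Approximate time before the neighbor is reset with a
    Retransmission Limit Exceeded error: 16 retransmissions' worth of
    RTO, or the hold timer, whichever lasts longer."""
    return max(16 * rto_ms, hold_ms)

print(retransmit_budget_ms(200, 15000))   # hold timer dominates: 15000
print(retransmit_budget_ms(2000, 15000))  # 16 retries dominate: 32000
```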

Note that not all EIGRP packets follow this reliable routine - just Updates and Queries. Hellos and acknowledgements are examples of packets that are not sent reliably.

This document is presented as a series of questions and answers discussing the aspects of the OSPF protocol designed to prevent inter-area routing loops, and related issues. The discussion covers ABR functions, Virtual-Links, the OSPF Super-backbone, OSPF Sham-Links, and the BGP Cost Community. The reader is assumed to know these concepts already, as this publication focuses on the complex feature interactions arising in MPLS/BGP VPN scenarios. The discussion culminates in an analysis of issues arising in a complex multi-area, multi-homed OSPF site deployed in an MPLS VPN environment. Please download the following document to read the publication: Loop Prevention in OSPF

To start my reading from Petr's excellent CCDE reading list for his upcoming LIVE and ONLINE CCDE Bootcamps, I decided to start with:
EIGRP for IP: Basic Operation and Configuration by Russ White and Alvaro Retana
I was able to grab an Amazon Kindle version for about \$9, and EIGRP has always been one of my favorite protocols.
The text dives right in to none other than the composite metric of EIGRP and it brought a smile to my face as I thought about all of the misconceptions I had regarding this topic from early on in my Cisco studies. Let us review some key points regarding this metric and hopefully put some of your own misconceptions to rest.

• While we are taught since CCNA days that the EIGRP metric consists of 5 possible components - Bandwidth, Delay, Load, Reliability, and MTU - when we look at the actual formula for the metric computation, we realize that MTU is not actually part of the metric. Why have we been taught this then? Cisco indicates that MTU is used as a tie-breaker in situations that might require it. To review the actual formula used to compute the metric, click here.
• Notice from the formula that the K (constant values) impact which components of the metric are actually considered. By default K1 is set to 1 and K3 is set to 1 to ensure that Bandwidth and Delay are utilized in the calculation. If you wanted to make Bandwidth twice as significant in the calculation, you could set K1 to 2, as an example. The metric weights command is used for this manipulation. Note that it starts with a TOS parameter that should always be set to 0. Cisco never did fully implement this functionality.
• The Bandwidth that affects the metric is taken from the bandwidth command used in interface configuration mode. If you do not set this value, the Cisco router selects a default based on the interface type.
• The Delay value that affects the metric is taken from the delay command used in interface configuration mode. Its default depends on the interface hardware type, e.g. it is lower for Ethernet and higher for Serial interfaces. Note how the Delay parameter allows you to influence EIGRP path decisions without manipulating the Bandwidth value. This is nice since other mechanisms may rely heavily on the bandwidth setting, e.g. EIGRP bandwidth pacing or absolute QoS reservation values for CBWFQ.
• The actual metric value for a prefix is derived from the SUM of the delay values in the path, and the LOWEST bandwidth value along the path. This is yet another reason to use more predictive Delay manipulations to change EIGRP path preference.
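The points above can be sketched as code. This is a minimal rendering of the classic EIGRP composite formula with default K values (K1=K3=1; the K2/K4/K5 load and reliability terms are omitted since they default to 0):

```python
def eigrp_metric(bandwidths_kbps, delays_usec, k1=1, k3=1):
    """Classic EIGRP composite metric with default K values.

    By default only two inputs matter: the LOWEST bandwidth along the
    path and the SUM of the interface delays, as described above.
    """
    inv_bw = 10**7 // min(bandwidths_kbps)  # scaled inverse of lowest bandwidth
    delay = sum(delays_usec) // 10          # delay counted in tens of microseconds
    return 256 * (k1 * inv_bw + k3 * delay)

# Two FastEthernet hops (100,000 kbps, 100 usec delay each):
print(eigrp_metric([100000, 100000], [100, 100]))  # 30720
```

Doubling K1 via the metric weights command, as mentioned above, would simply double the bandwidth term in this calculation.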

In the next post on the EIGRP metric, we will examine this at the actual command line, and discuss EIGRP load balancing options. Thanks for reading!

I enjoyed Petr's article regarding explicit next hop.  It reminded me of a scenario where a redistributed route, going into OSPF conditionally worked, depending on which reachable next hop was used.

Here is the topology for the scenario:

Here is the relevant (and working :)) information for R1.

When we replace the static route with a new reachable next hop, we lose the ability to ping 100.100.100.3.

When we change the next hop for the static route (which is being redistributed into OSPF), the route to 100.100.100.0/24 no longer works, even though we have verified the ability to ping the new next hop.

For more troubleshooting scenarios, please see our CCIE Route-Switch Workbooks, Volume 2, which contains more than 100 challenging troubleshooting scenarios.

We will post the results right here, in a few days, after you have had a chance to post your comments and ideas.

Best wishes.

Follow-up-

R1, using a next hop of 172.16.33.33 in its static route, will include that same address in the LSA as the forwarding address. Among the requirements that make this possible, the one we will focus on here is that this next hop is in the same IP subnet as an OSPF-enabled interface (Lo0) on R1: 172.16.1.1/16 (R1) and 172.16.33.33 (next-hop address, owned by R3).

If we use a next hop that isn't in the same IP subnet as an OSPF interface on R1, the LSA will not include the next-hop forwarding address, which causes R2 to believe that R1 is the next hop, and the route fails to work. We could also cause the 0.0.0.0 forwarding address to show up by changing the OSPF network type of R1's Loopback0 to point-to-point, by not including Loopback0 in the OSPF network statement, or by making Loopback0 a passive interface for OSPF. (Take your pick!) :)

Again, thanks to all for the EXCELLENT answers and insights.

You rock!

#### Abstract

This publication briefly covers the use of 3rd party next-hops in OSPF, RIP, EIGRP and BGP routing protocols. Common concepts are introduced and protocol-specific implementations are discussed. Basic understanding of the routing protocol function is required before reading this blog post.

#### Overview

The third-party next-hop concept applies only to distance-vector protocols, or to the parts of link-state protocols that exhibit distance-vector behavior. The idea is that a distance-vector update carries an explicit next-hop value used by the receiving side, as opposed to the "implicit" next hop calculated as the sending router's address - the source address in the IP header carrying the routing update. Such an "explicit" next hop is called a "third-party" next-hop IP address, since it allows pointing to a next hop other than the advertising router. Intuitively, this is only possible if the advertising and receiving routers are on a shared segment, though the "shared segment" concept could be generalized and abstracted. Every popular distance-vector protocol supports third-party next-hops - RIPv2, EIGRP, OSPF and BGP all carry an explicit next-hop value. Look at the figure below - it illustrates a situation where two different distance-vector protocols run on a shared segment, but neither runs on all routers attached to the segment. The protocols "overlap" at a "pivotal" router and redistribution is used to provide inter-protocol route exchange.

Per the default distance-vector protocol behavior, traffic going from one routing domain into another has to cross the "pivotal" router where the two domains overlap (R3 in our case), as opposed to going directly to the closest next hop on the shared segment. The reason is that there is no direct "native" update exchange between the routers running different routing protocols. In situations like this, it is beneficial to rewrite the next-hop IP address to point toward the "optimum" exit point, using the "pivotal" router's knowledge of both routing protocols.

OSPF is somewhat special with respect to the third-party next-hop implementation. It supports a third-party next hop in Type-5/7 LSAs (External Routing Information LSA and NSSA External LSA). These LSAs are processed in a "distance-vector manner" by every receiving router. By default, the LSA is assumed to advertise an external prefix "connected" to the advertising router. However, if the Forwarding Address (FA) is non-zero, the address in this field is used to calculate the forwarding information, as opposed to the default of forwarding toward the advertising router. The Forwarding Address is always present in Type-7 LSAs, for the reason illustrated in the figure below:

Since there could be multiple ABRs in an NSSA area, only one is elected to perform the 7-to-5 LSA translation - otherwise the routing information would loop back into the area, unless manual filtering is implemented on the ABRs (which is prone to errors). The translating ABR is elected based on the highest Router-ID, and may not be on the optimum path toward the advertising ASBR. Therefore, the forwarding address allows other routers to select a more optimal path, based on the inter-area routing information.

#### EIGRP

Notice that EIGRP will not insert the third-party next-hop until you apply the command no ip next-hop-self eigrp on R3's connection to the shared segment. Look at the routing table output prior to applying the no ip next-hop-self eigrp command.

```R1#show  ip route eigrp
140.1.0.0/16 is variably subnetted, 2 subnets, 2 masks
D EX    140.1.2.2/32
[170/2560002816] via 140.1.123.3, 00:00:27, FastEthernet0/0
```

After the command has been applied to R3’s interface:

```R1#show  ip route eigrp
140.1.0.0/16 is variably subnetted, 2 subnets, 2 masks
D EX    140.1.2.2/32
[170/2560002816] via 140.1.123.2, 00:00:04, FastEthernet0/0
```

The same behavior is observed when redistributing OSPF into EIGRP, but not when redistributing BGP. For some reason, BGP's next hop is not copied into EIGRP; in the example below, EIGRP will NOT insert BGP's next hop into its updates. Notice that you may enable or disable the third-party next-hop behavior in EIGRP using the interface-level command ip next-hop-self eigrp.

#### RIP

RIP passes along the third-party next hop from OSPF, BGP or EIGRP. For instance, assume EIGRP redistribution into RIP. You have to turn off split horizon with no ip split-horizon on R3's Ethernet connection to get this to work:

```R2#show ip route rip
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
R       140.1.1.1/32 [120/1] via 140.1.123.1, 00:00:17, FastEthernet0/0
```

Notice the following RIP debugging output, which lists the third-party next-hop:

```RIP: received v2 update from 140.1.123.3 on FastEthernet0/0
140.1.1.1/32 via 140.1.123.1 in 1 hops
140.1.123.0/24 via 0.0.0.0 in 1 hops
```

Surprisingly, there is NO need to enable no ip split-horizon on the interface when redistributing BGP or OSPF routes into RIP; it seems that only EIGRP-to-RIP redistribution requires it. Keep in mind, however, that split horizon is OFF by default on physical Frame Relay interfaces. Here is sample output of redistributing BGP into RIP using the third-party next hop:

```R3#show ip route bgp
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
B       140.1.2.2/32 [20/0] via 140.1.123.2, 00:22:13
R3#
R1#show ip route rip
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
R       140.1.2.2/32 [120/1] via 140.1.123.2, 00:00:09, FastEthernet0/0
```

RIP’s third-party next-hop behavior is fully automatic. You cannot disable or enable it, as you can in EIGRP.

#### OSPF

Similarly to RIP, OSPF has no problem picking up the third-party next hop from BGP, EIGRP or RIP. Here is how it looks (guess which protocol is redistributed into OSPF, based solely on the command output):

```R1#sh ip route ospf
140.1.0.0/16 is variably subnetted, 3 subnets, 2 masks
O E2    140.1.2.2/32 [110/1] via 140.1.123.2, 00:34:59, FastEthernet0/0
R1#show ip ospf database external
OSPF Router with ID (140.1.1.1) (Process ID 1)
Routing Bit Set on this LSA
LS age: 131
Options: (No TOS-capability, DC)
Link State ID: 140.1.2.2 (External Network Number )
LS Seq Number: 80000002
Checksum: 0xF749
Length: 36
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 1
External Route Tag: 200
```

If you’re still guessing, the external protocol is BGP, as could be seen from the automatically set External Route Tag - OSPF sets it to the last AS number found in the AS_PATH.

There are special conditions to be met for OSPF to use the FA address. First, the interface where the third-party next hop resides must be advertised into OSPF using the network command. Second, this interface must not be passive in OSPF and must not have network type point-to-point or point-to-multipoint. Violating any of these conditions prevents third-party next-hop installation in the external LSAs.
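The conditions above reduce to a simple predicate. A minimal sketch (the argument names are illustrative, not IOS terminology):

```python
def ospf_sets_forwarding_address(advertised_into_ospf: bool,
                                 passive: bool,
                                 network_type: str) -> bool:
    """Per the conditions above: OSPF sets a non-zero Forwarding Address
    in the Type-5 LSA only if the next-hop interface is advertised into
    OSPF, is not passive, and is not p2p or p2mp."""
    return (advertised_into_ospf
            and not passive
            and network_type not in ("point-to-point", "point-to-multipoint"))

print(ospf_sets_forwarding_address(True, False, "broadcast"))       # True
print(ospf_sets_forwarding_address(True, False, "point-to-point"))  # False
```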

OSPF is special in one other respect. Distance-vector protocols such as RIP or EIGRP modify the next hop as soon as they pass the routing information on to other devices; that is, the third-party next hop is not maintained through the RIP or EIGRP domain. In contrast, OSPF LSAs are flooded within their scope with the FA unmodified. This creates an interesting problem: if the FA address is not reachable in the receiving router’s routing table, the external information found in the Type-7/5 LSA is not used. This situation is discussed in the blog post “OSPF Filtering using FA Address”.

#### BGP

When you redistribute any protocol into BGP, the system correctly sets the third-party next-hop in the local BGP table. Look at the diagram below, where EIGRP prefixes are being redistributed into BGP AS 300:

R3’s BGP process installs R1’s Loopback0 prefix into the BGP table with a next-hop value of R1’s address, not “0.0.0.0” as it would be for locally advertised routes. You will observe the same behavior if you inject EIGRP prefixes into BGP using the network command.

```R3#sh ip bgp
BGP table version is 9, local router ID is 140.1.123.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network          Next Hop            Metric LocPrf Weight Path
*> 140.1.1.1/32     140.1.123.1         156160         32768 ?
```

Furthermore, BGP is supposed to change the next hop to itself when advertising prefixes over eBGP peering sessions. However, when all peers share the same segment, the prefixes re-advertised over the shared segment do not have their next hop changed. See the diagram below:

Here R1 advertises prefix 140.1.1.1/32 to R3 and R3 re-advertises it back to R2 over the same segment. Unless non-physical interfaces (e.g. Loopbacks) are used to form the BGP sessions, the next hop received from R1 is not changed when passed down to R2. This implements the default third-party next-hop preservation over eBGP sessions. Look at the sample output for the configuration illustrated above: R1 receives R2’s prefix with the next hop unmodified.

```R1#show ip bgp
BGP table version is 3, local router ID is 140.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network          Next Hop            Metric LocPrf Weight Path
*> 140.1.1.1/32     0.0.0.0                  0         32768 i
*> 140.1.2.2/32     140.1.123.2                            0 300 200 i
```

There is a way to disable this default behavior in BGP. A logical assumption would be that the command neighbor X.X.X.X next-hop-self would work, and indeed it does in recent IOS versions. Older IOS, such as 12.2T, did not honor this command for eBGP sessions, and your option would have been a route-map with the set ip next-hop command. The route-map method may still be handy if you want to insert a totally “bogus” next hop from the shared segment - the receiving BGP speaker will accept any IP address that is on the same segment. That is not something you would do in a production environment too often, but it is definitely an interesting idea for lab practice. One good production use is changing the BGP next hop to an HSRP virtual IP address, to provide physical BGP speaker redundancy. Here is sample code for setting an explicit next hop in a BGP update:

```router bgp 300
neighbor 140.1.123.1 remote-as 100
neighbor 140.1.123.1 route-map BGP_NEXT_HOP out
!
route-map BGP_NEXT_HOP permit 10
set ip next-hop 140.1.123.100
```

#### Summary

All popular distance-vector protocols support third-party next-hop insertion. This mechanism is useful on multi-access segments, in situations where you want to pass optimal path information between routers belonging to different routing protocols. We illustrated that RIP implements this function automatically and does not allow any tuning. EIGRP, on the other hand, supports third-party next-hop passing from any protocol other than BGP, and you may turn this function on or off on a per-interface basis. Furthermore, OSPF’s special feature is propagation of the third-party next hop within an area/autonomous system, unlike the distance-vector protocols, which reset the next hop at every hop (considering an AS to be a “single hop” for BGP). Thanks to that feature, OSPF offers the interesting possibility of filtering external routing information by blocking the FA prefix from the routing tables. Finally, BGP gives the most flexibility when it comes to next-hop manipulation, allowing you to change it to any value.

The goal of this post is a brief discussion of the main factors controlling fast convergence in OSPF-based networks. Network convergence is a term that is subject to various interpretations. Before we discuss the optimization procedures for OSPF, we define network convergence as the process of synchronizing network forwarding tables after a topology change. The network is said to be converged when no forwarding table has changed for "some reasonable" amount of time, where "some" could be defined as an interval based on the expected maximum time to stabilize after a single topology change. Network convergence based on native IGP mechanisms is also known as network restoration, since it heals lost connections. Traffic protection mechanisms such as ECMP, MPLS FRR or IP FRR, which offer a different approach to failure handling, are outside the scope of this article, as is multicast routing fast recovery, even though that process is tied to IGP re-convergence.

It is interesting to note that IGP-based "restoration" techniques have one (more or less) important problem: during re-convergence, temporary micro-loops may exist in the topology due to inconsistency between the FIB (forwarding) tables of different routers. This behavior is fundamental to link-state algorithms, as routers closer to the failure tend to update their forwarding databases before the other routers. The only popular routing protocol that lacks this property is EIGRP, which is loop-free at every moment during re-convergence, thanks to the explicit termination of its diffusing computations. For link-state protocols, there are enhancements to the FIB update procedures that allow avoiding such micro-loops, described in [ORDERED-FIB].

Even though we are mainly concerned with OSPF, IS-IS will be mentioned in the discussion as well. It should be noted that, compared to IS-IS, OSPF provides fewer "knobs" for convergence optimization. The main reason is probably that IS-IS is developed and supported by a separate team of developers, more geared toward the ISPs for whom fast convergence is a critical competitive factor. The common optimization principles, however, are the same for both protocols, and along the way we will point out tuning features that IS-IS has but OSPF lacks. Finally, we start our discussion with a formula, which is explained further in the text:

Convergence = Failure_Detection_Time + Event_Propagation_Time + SPF_Run_Time + RIB_FIB_Update_Time

The formula reflects the fact that the convergence time for a link-state protocol is the sum of the following components:

• Time to detect the network failure, e.g. interface down condition.
• Time to propagate the event, i.e. flood the LSA across the topology.
• Time to perform SPF calculations on all routers upon reception of the new information.
• Time to update the forwarding tables for all routers in the area.

#### Part I: Fast Failure Detection

What would you do if your connection is not physical point-to-point, or does not allow translating loss-of-signal information in a timely fashion? Good examples are switched Ethernet or a Frame Relay PVC. Sometimes there are solutions such as Ethernet port failure translation, which may detect an upstream switch port failure and reflect it to the downstream ports reasonably fast. As another example, Frame Relay may signal PVC loss via asynchronous LMI updates or the A-bit (active bit) in LMI status reports. However, such mechanisms, especially those relying on Layer 2 features, may not report failures quickly enough. In such cases, it could be a good idea to rely on fast IGP keepalive timers. Both OSPF and IS-IS support fast hellos with a dead/hold interval of one second and sub-second hello intervals ([OSPF-FASTHELLO]). Using this medium-agnostic mechanism could reduce fault detection on non-point-to-point links to one second, which may be better than relying on Layer 2-specific signaling. However, fast hello timers have one significant drawback: since all hello packets are processed by the router's main CPU, having hundreds or more OSPF/IS-IS neighbors may significantly impact the router's control plane performance. An alternative is BFD (Bidirectional Forwarding Detection, see [BFD]), which provides a protocol-agnostic failure detection mechanism that can be reused by multiple routing protocols (OSPF, IS-IS, BGP and so on). BFD is based on the same idea of sub-second keepalive timers, but it can be implemented on distributed router interface line cards, saving the control plane and central CPU from over-utilization.

#### Part II: Event Propagation

In OSPF and IS-IS, topology changes (events) are advertised by means of LSA/LSP flooding. For the network to completely converge, an LSA/LSP needs to reach every router within its flooding scope. Normally, in a properly designed network, the flooding scope is one area (flooding domain), unless the information is flooded as external, i.e. by means of Type-5 LSAs in OSPF. In general, LSA/LSP propagation time is determined by the following factors:

1. LSA generation delay. IGP implementations normally throttle LSA generation to prevent excessive flooding in case of oscillating (constantly flapping) links. The original OSPF specification required every LSA generation to be delayed by a fixed interval that defaulted to one second. To optimize this behavior, Cisco's OSPF and ISIS implementations use an exponential backoff algorithm to dynamically calculate the delay for generating the SAME LSA (same LSA ID, LSA type and originating Router ID). You may find more information about truncated exponential backoff in [TUNING-OSPF], but in short the process works as follows.

Three parameters control the throttling process: the initial, hold, and max_wait times, specified using the command timers throttle lsa initial hold max_wait. Suppose the network has been stable for a relatively long time, and then a router link goes down. As a result, the router needs to generate a new router LSA listing the new connection status. The router delays LSA generation by initial milliseconds and sets the next interval to hold milliseconds. This ensures that two consecutive events (e.g. a link going down and then coming back up) are separated by at least the hold interval. Any events occurring after the initial delay are accumulated and processed only when the hold window expires, so the next router LSA is generated no earlier than hold milliseconds later. At the same time, the next hold-time is doubled, i.e. set to 2*hold. Effectively, every time an event occurs during the current wait window, its processing is delayed until the current hold-time expires, and the next hold-time interval is doubled. The hold-time grows exponentially as 2^t*hold until it reaches the max_wait value; after that, every event received during the current window results in the next interval being the constant max_wait. This ensures that exponential growth is limited, or in other words that the process is truncated. If there are no events for 2*max_wait milliseconds, the hold-time window is reset back to the initial value, on the assumption that the flapping link has returned to normal condition.

The initial LSA generation delay has significant impact on network convergence time, so it is important to tune it appropriately. The initial delay should be kept to a minimum, such as 5-10 milliseconds. Setting it to zero is still not recommended, as multiple link failures may occur simultaneously (e.g. an SRLG failure), and it is beneficial to reflect them all in a single LSA/LSP. The hold interval should be tuned so that the next LSA is only sent after the network has converged in response to the first event. This means the LSA hold time should be based on the convergence time, or more precisely it should be at least above LSA_Initial_Delay + LSA_Propagation_Delay + SPF_Initial_Delay. You may then set the maximum hold time to at least twice the hold interval, to protect flooding against at least two concurrent oscillating processes (more parallel oscillations are not very probable). Notice that a single link failure normally results in at least two LSAs being generated, one by each attached router.
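To make the truncated backoff concrete, here is a short Python sketch - an illustrative model with made-up helper names, not Cisco's actual implementation - showing the sequence of wait intervals applied to consecutive generations of the same LSA under constant flapping:

```python
# Truncated exponential backoff: delay before the Nth consecutive
# generation of the same LSA, assuming events keep arriving fast enough
# that every hold window is "hit". Illustrative model only.

def backoff_intervals(n, initial=10, hold=100, max_wait=1000):
    """Return the first n wait intervals (ms): initial, hold, 2*hold,
    4*hold, ... capped at max_wait."""
    delays = [initial]
    current = hold
    while len(delays) < n:
        delays.append(current)
        current = min(current * 2, max_wait)  # truncate the growth
    return delays

# With 'timers throttle lsa all 10 100 1000':
print(backoff_intervals(7))  # [10, 100, 200, 400, 800, 1000, 1000]
```

Once the link stops flapping for 2*max_wait milliseconds, the sequence starts over from the initial value.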

2. LSA reception delay. This delay is the sum of the ingress queueing delay and the LSA arrival delay. When a router receives an LSA, it may be subject to ingress queueing, though this effect is not significant unless massive BGP re-convergence is occurring at the same time. Even under a heavy BGP TCP ACK storm, the Cisco IOS input queue discipline known as Selective Packet Discard (see [SPD]) provides enough room for IGP traffic and handles it at the highest priority. The received packets are then rate-limited based on the LSA arrival interval. OSPF rate-limits only reception of the SAME LSAs (see the definition above): there is a fixed delay between receptions of the same LSA originated by a peer. This delay should not exceed the hold-time used for LSA generation - otherwise the receiving router may drop the second LSA generated by the peer, say upon link recovery. Notice that every router on the LSA flooding path adds cumulative delay to this component, but the good news is that the initial LSA/LSP will not be rate-limited - the arrival delay applies only to consecutive copies of the same LSA. As such, you may largely ignore this component for the purpose of fast reaction to a change, thanks to fast ingress queueing and expedited reception. Keep in mind that if you tune the arrival delay, you need to adjust the OSPF retransmission timer to be slightly above it. Otherwise, the side that just sent an LSA and has not received an acknowledgment may end up re-sending it, only for the copy to be dropped by the receiving side. The command to control the retransmission interval for the same LSA is timers pacing retransmission.
3. Processing delay. This is the amount of time it takes the router to put the LSA on the outgoing flood lists. This delay could be significant if the SPF process starts before flooding the LSA. SPF runtime is not the only contributor to the processing delay, but it's the one you have control over. If you configure SPF throttling to be fast enough (see the next section) - the exact threshold varies, but mainly initial delays below 40ms - the SPF run may occur before the triggering LSA is flooded to the neighbors, resulting in a slower flooding process. For faster convergence, LSAs should always be flooded prior to the SPF run. The ISIS process in Cisco IOS supports the command fast-flood, which ensures that LSPs are flooded ahead of running SPF, irrespective of the initial SPF delay. On the contrary, OSPF does not support this feature and your only option (at the moment) is properly tuning the SPF delays (see below).

Lastly, egress queueing may add significant delay on over-utilized links. In short, a router's average egress queue depth could be approximated as Q_Depth = Utilization/(1 - Utilization), meaning that links with constant utilization of 50% or above always exhibit some queueing delay (on average). Proper QoS configuration, such as reserving enough bandwidth for control-plane packets, should neutralize this component, coupled with the fact that routing update packets normally have higher priority for handling by router processes.

4. Packet propagation delay. This value is the sum of two major contributors: the serialization delay at every hop and the cumulative signal propagation delay across the topology. The serialization delay is almost negligible on modern "fast" links (e.g. 12usec for a 1500 byte packet over a 1Gbps link), though it could be more significant on slow WAN links such as a series of T1s. Therefore, signal propagation delay is the main contributor, due to physical limitations. This value mainly depends on the distance the signal has to travel to cross the whole OSPF/ISIS area. The propagation delay could be roughly approximated as 0.82 ms per 100 miles and has significant impact only for inter-continental deployments or satellite links. For example, it would take at least 41ms to cross a 5000 mile wide topology. However, since most OSPF/ISIS areas do not exceed a single continent, this value does not seriously impact total convergence time.
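The delay formulas quoted above are easy to check numerically. The Python sketch below (illustrative helper names of my choosing; the 0.82 ms per 100 miles constant comes from the text) evaluates the queue-depth approximation and the two propagation components:

```python
# Average egress queue depth as a function of link utilization U,
# per the Q = U / (1 - U) approximation: grows without bound as U -> 1.
def avg_queue_depth(utilization):
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)

# Serialization delay: time to clock the packet onto the wire.
def serialization_delay_us(packet_bytes, link_bps):
    return packet_bytes * 8 / link_bps * 1e6

# Fiber propagation delay, using the ~0.82 ms per 100 miles rule of thumb.
def propagation_delay_ms(miles):
    return miles / 100.0 * 0.82

print(avg_queue_depth(0.5))               # 50% utilized: ~1 packet queued
print(serialization_delay_us(1500, 1e9))  # ~12 us: negligible at 1 Gbps
print(propagation_delay_ms(5000))         # ~41 ms across a 5000 mile area
```

Note how steeply the queue depth grows: at 90% utilization the same formula yields roughly nine packets queued on average.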

#### Part III: SPF Calculations

The SPF algorithm complexity could be bounded as O(L + N*log(N)), where N is the number of nodes and L is the number of links in the topology under consideration. This estimation holds true provided that the implementation is optimal (see [DIJKSTRA-SPF]). The worst-case complexity for dense topologies could be as high as O(N^2), but this is rarely seen in real-world topologies. SPF runtime used to be a major limiting factor in the routers of the 80s (link-state routing was invented in ARPANET) and 90s (initial OSPF/ISIS deployments), whose slow CPUs could take seconds to complete SPF computations. Progress in modern hardware (Moore's Law) has significantly reduced the impact of this factor on network convergence, though it is still one of the major contributors to convergence time. The use of Incremental SPF (iSPF) further minimizes the amount of calculation needed when partial changes occur in the network (see [TUNING-OSPF]). For example, OSPF Type-1 LSA flooding for a leaf connection no longer causes a complete SPF re-calculation, as it would with classic SPF. An important benefit is that the farther away a router is from the failed link, the less time it needs to recompute SPF. This compensates for the longer propagation delay to deliver the LSA from a distant corner of the network. Notice that OSPF also supports PRC (partial route computation), which takes only a few milliseconds upon reception of Type 3, 4 and 5 LSAs, which are treated as distance-vector updates. The PRC process is not delayed, and you cannot tune its exponential backoff timers like you can in IS-IS.
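For illustration, here is a minimal SPF (Dijkstra) implementation in Python using a binary heap. The O(L + N*log(N)) bound assumes an optimal priority queue (e.g. a Fibonacci heap); a binary heap gives O(L*log(N)), which is close for the sparse topologies typical of real networks. The topology and router names below are made up:

```python
import heapq

def spf(graph, root):
    """Dijkstra SPF: graph maps node -> [(neighbor, cost), ...];
    returns the shortest distance from root to every reachable node."""
    dist = {root: 0}
    pq = [(0, root)]               # (distance, node) candidates
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue               # stale entry, a shorter path was found
        for nbr, cost in graph.get(node, ()):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return dist

topo = {
    "R1": [("R2", 10), ("R3", 5)],
    "R2": [("R1", 10), ("R4", 1)],
    "R3": [("R1", 5), ("R4", 10)],
    "R4": [("R2", 1), ("R3", 10)],
}
print(spf(topo, "R1"))  # R4 is reached via R2 at cost 11, not via R3 at 15
```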

You may find the typical SPF runtimes for your network (to estimate the total convergence time) by using the command show ip ospf statistics:

```
show ip ospf statistics
OSPF Router with ID (10.4.1.1) (Process ID 1)
Area 10: SPF algorithm executed 18 times
Summary OSPF SPF statistic
SPF calculation time
Delta T	Intra	D-Intra	Summ	D-Summ	Ext	D-Ext	Total	Reason
1w3d	8	0	0	0	0	0	8	R, X
1w3d	12	0	0	0	4	0	16	R, X
1w3d	16	0	0	0	4	0	20	R, X
1w3d	8	0	0	0	0	0	8	R,
1w3d	20	0	0	0	0	0	20	R, X
1w2d	24	0	0	0	8	0	32	R, X
1w2d	8	4	0	0	0	0	12	R,
6d16h	4	0	0	0	0	4	8	R, X
6d16h	4	0	0	0	0	0	4	R,
6d16h	12	0	0	0	8	0	20	R, X
RIB manipulation time during SPF (in msec):
Delta T	RIB Update	RIB Delete
1w3d	4	0
1w3d	8	0
1w3d	10	0
1w3d	5	0
1w3d	8	0
1w2d	10	0
1w2d	3	0
6d16h	2	0
6d16h	1	0
6d16h	9	0
```

The above output is divided into two sections: SPF calculation times and RIB manipulation times. For now, we are interested in the values under the "Total" column, which represent the total time it took the OSPF process to run SPF. You may see how these values vary depending on the "Reason" field. You may want to find the maximum value and use it as an upper limit for SPF computation in your network - in our case, it's 32ms. The other section of the output will be discussed later.

The next "problem" is known as SPF throttling. Recent Cisco IOS OSPF implementations use an exponential backoff algorithm when scheduling SPF runs. The goal, as usual, is to avoid excessive calculations during periods of high network instability, while keeping SPF reaction fast in stable networks. The exponential process is identical to the one used for LSA throttling, with the same timer semantics.

So how would one pick optimal SPF throttling values? As mentioned before, the initial delay should be kept as short as possible to allow for instant reaction to a change, but long enough not to trigger SPF before the LSA is flooded out. It's hard to determine the delay to flood the LSA, but at a minimum the initial timer should stay above the per-interface LSA flood pacing timer, so that SPF does not fire between two consecutive LSAs flooded through the topology (as you remember, a typical transit link failure results in the generation of at least two LSAs). Setting the interface flood pacing timer to 5ms and the initial SPF delay to 10ms should be a good starting point. After the initial run, the SPF algorithm should be further held down for at least the amount of time it takes the network to converge after the initial event. This means the SPF hold-time should be strictly higher than SPF_Initial_Delay + SPF_Runtime + RIB_FIB_Update_Time. There is an alternative, more pragmatic approach to tuning this timer. Let's say we want to make sure SPF computations take no more than 50% of the router's CPU time. For this to happen, the hold time should be at least as long as a typical SPF run time. This value could be found from the router statistics and tuned individually on every router. Based on our example, we may set the hold interval to 32ms + 20% (an error margin; set it higher for more safety), which is about 38ms, and the maximum interval to twice the hold time, which translates into roughly 33% CPU usage under the worst condition of non-stop LSA storms. Notice that the SPF hold and maximum timers could be tuned per-router to account for different CPU power, if this applies to your scenario. The total network convergence time should be estimated based on the "slowest" router in the area.
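The pragmatic rule above is easy to automate. The hypothetical Python helper below (the function name is mine, not an IOS feature) derives the throttle triplet from the worst-case SPF runtime observed in show ip ospf statistics:

```python
# Derive 'timers throttle spf initial hold max_wait' from the worst-case
# SPF runtime: hold = runtime plus a 20% margin, max_wait = 2 * hold.
# With hold >= runtime, SPF is bounded at well under 50% of CPU time
# even under constant churn.

def spf_throttle(max_runtime_ms, margin=0.2, initial_ms=10):
    hold = round(max_runtime_ms * (1 + margin))
    return initial_ms, hold, 2 * hold

initial, hold, max_wait = spf_throttle(32)   # 32ms from our sample output
print(f"timers throttle spf {initial} {hold} {max_wait}")
# -> timers throttle spf 10 38 76
```

Running the helper on each router's own worst-case runtime gives per-router timers, while the slowest router's result bounds the area-wide convergence estimate.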

#### Part IV: RIB/FIB Update

After completing the SPF computation, OSPF performs a sequential RIB update to reflect the changed topology. The RIB updates are further propagated to the FIB table - depending on the platform architecture, this could be either a centralized or a distributed process. The RIB/FIB update process may contribute the most to convergence time in topologies with large numbers of prefixes, e.g. thousands or tens of thousands. In such networks, updating the RIB and the distributed FIB databases on line-cards may take a considerable amount of time, on the order of tens if not hundreds of milliseconds (varying by platform). There are two major ways to minimize the update delay: advertise fewer prefixes, and sequence the FIB updates so that important paths are updated before any others.

Is there a way to estimate the RIB/FIB manipulation times? As we have seen before, the show ip ospf statistics command provides information on RIB update time, though this output is not available on every platform, nor is there a clear interpretation of the values in Cisco's documentation - e.g. it's unclear whether there is a checkpoint mechanism to inform OSPF of the FIB entry updates. Special measurements should be taken to estimate these values, as done in [BLACKBOX-OSPF], and more importantly, these values heavily depend on the platform used. Still, the OSPF RIB manipulation statistics could be useful for estimating the lower bound of network convergence time (though we are mostly interested in an accurate upper bound).

#### Sample Fast Convergence Profile

Putting the above information together, let's try to find an optimum convergence profile, assuming the "show ip ospf statistics" output shown earlier was taken from the "weakest" router in the area.

Failure Detection Delay: about 5-10ms worst case to detect/report loss of link pulses.
Maximum SPF runtime: 32ms, doubling for safety makes it 64ms
Maximum RIB update: 10ms, doubling for safety makes it 20ms
OSPF interface flood pacing timer: 5ms (does not apply to the initial LSA flooded)

LSA Generation Initial Delay: 10ms (enough to detect multiple link failures resulting from SRLG failure)
SPF Initial Delay: 10ms (enough to hold SPF to allow two consecutive LSAs to be flooded)
Network geographical size: 100 miles (signal propagation is negligible)
Network physical media: 1 Gbps links (serialization delay is negligible)

Estimated network convergence time in response to the initial event: 32*2 + 10*2 + 10 + 10 = 64 + 40 = 104ms, or roughly 100ms. This estimate does not precisely account for FIB update time, but we assume it to be approximately the same as the RIB update time. We need to make sure our maximum backoff timers exceed this convergence time, to ensure processing is delayed beyond the convergence interval in the worst-case scenario.
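Summing the components explicitly (a quick Python check; all values in milliseconds, taken from the profile above):

```python
# Estimated convergence after a single event. SPF runtime and RIB update
# are doubled for safety, as in the profile above; failure detection and
# FIB update are not included in this particular sum.
components = {
    "LSA generation initial delay": 10,
    "SPF initial delay": 10,
    "worst-case SPF runtime (doubled)": 32 * 2,
    "worst-case RIB update (doubled)": 10 * 2,
}
total_ms = sum(components.values())
print(total_ms)  # 104 -> roughly 100ms
```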

LSA Generation Hold Time: 100ms (approximately the convergence time)
LSA Generation Maximum Time: 1s (way above the 100ms)
OSPF Arrival Time: 50ms (way below the LSA Generation hold time)
SPF Hold Time: 100ms
SPF Maximum Hold Time: 1s (with a worst-case SPF runtime of 32ms, roughly 30 SPF runtimes fit into one hold interval, so SPF consumes no more than ~3% of CPU time even under the worst-case scenario of non-stop LSA storms).

Now estimate the worst-case convergence time: LSA_Maximum_Delay (1s) + SPF_Maximum_Delay (1s) + RIB_Update (20ms), which is slightly above 2 seconds. The resulting OSPF configuration is as follows:

```
router ospf 10
!
! Reduce the number of advertised prefixes
!
prefix-suppression
!
! Wait at least 50ms between accepting the same LSA
!
timers lsa arrival 50
!
! Throttle LSA generation
!
timers throttle lsa all 10 100 1000
!
! Throttle SPF runs
!
timers throttle spf 10 100 1000
!
! Pace interface-level flooding
!
timers pacing flood 5
!
! Make retransmission timer > than arrival
!
timers pacing retransmission 60
!
! Enable incremental SPF
!
ispf
```

#### Conclusions

It is well known that link-state IGPs can be tuned for sub-second convergence under almost any practical scenario, while maintaining network stability by virtue of adaptive backoff timers. In this post we tried to provide a practical approach to calculating the optimum throttling timer values based on your recorded network performance. It is worth noting that the three most important timers for sub-second convergence are the failure detection delay, the initial LSA generation delay and the initial SPF delay. All other timers, such as the hold and maximum times, serve the purpose of stabilizing the network, and affect convergence only in "worst-case" unstable network scenarios. Cisco's recommended values for the initial/hold/maximum timers are 10/100/5000 ms (see [ROUTED-CAMPUS]), but those may look a bit conservative, as they result in a worst-case convergence time above 10 seconds. Additionally, it is important to notice that in large topologies, a significant amount of time is spent on RIB/FIB updates after reconvergence. Therefore, in addition to tuning the throttling timers, you may want to implement other measures such as prefix suppression, better summarization (e.g. totally stubby areas) and minimization of external routing information. If your platform supports the feature, you may also implement a priority-driven RIB prefix installation process.

We omitted other fast-convergence elements such as resilient network design, e.g. redundancy resulting in equal-cost multipathing, faster OSPF adjacency restoration, and the NSF feature, which is very helpful for avoiding re-convergence during planned downtime. We also skipped some other features related to OSPF stability, such as flooding reduction and LSA group pacing, that could yield performance benefits in networks with large LSDBs. It is not possible to cover all relevant technologies in a single blog post, but you may refer to the further reading documents for more information. Finally, if you are planning to tune your IGP for fast convergence, make sure you understand all the consequences. Modern routing platforms are capable of handling almost any "stormy" network condition without losing overall network stability, but pushing a network to its limits could always be dangerous. Make sure you monitor your OSPF statistics for potentially high or unusual values after you perform the tuning, or set the maximum timers to more conservative values (e.g. 3-5 seconds) for additional safety.

The following is the minimum list of the publications suggested to read on the topic of fast IGP convergence.

[ORDERED-FIB] "Loop-free convergence using oFIB"
[BLACKBOX-OSPF] "Experience in Black-box OSPF Measurement"
[SUBSEC-CONV] "Achieving Sub-second IGP Convergence in Large IP Networks"
[OSPF-FASTHELLO] "OSPF Fast Hello Enhancement"