Mar
30

OSPF and MTU Mismatch

Dear Brian,

What is the difference between using the “system mtu routing 1500” and the “ip ospf mtu-ignore” commands when running OSPF between a router and a switch?

Thanks,

Paul

Hi Paul,

Within the scope of the CCIE Lab Exam, it may be acceptable to issue either of these commands to solve a specific lab task. However, it is key to note that there is a difference between ignoring the MTU for the purpose of OSPF adjacency and matching the MTU within a real production network.

By design, OSPF will automatically detect an MTU mismatch between two devices when they exchange Database Description (DBD) packets during the formation of an adjacency. This is per the standard OSPF specification defined in RFC 2328, “OSPF Version 2”. Specifically, the RFC states the following:

10.6.  Receiving Database Description Packets
        This section explains the detailed processing of a received
        Database Description Packet.
[snip]
        If the Interface MTU field in the Database Description packet
        indicates an IP datagram size that is larger than the router can
        accept on the receiving interface without fragmentation, the
        Database Description packet is rejected.
[/snip]
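
Read literally, the check quoted above amounts to a single comparison. The following is a minimal Python sketch of it, not IOS's actual implementation; the function name and the mtu_ignore parameter (modeled on the “ip ospf mtu-ignore” command from the question) are assumptions made for illustration.

```python
def accept_dbd(neighbor_mtu: int, local_mtu: int, mtu_ignore: bool = False) -> bool:
    """Return True if a received DBD passes the Interface MTU check.

    neighbor_mtu: Interface MTU field carried in the received DBD.
    local_mtu:    largest IP datagram the receiving interface accepts unfragmented.
    mtu_ignore:   hypothetical flag that skips the check entirely.
    """
    if mtu_ignore:
        return True
    # Reject only when the neighbor advertises a larger MTU than we can accept
    return neighbor_mtu <= local_mtu


# A 1500-byte interface receiving a DBD that advertises 2000 bytes
print(accept_dbd(neighbor_mtu=2000, local_mtu=1500))                   # False
# The reverse direction passes the check
print(accept_dbd(neighbor_mtu=1500, local_mtu=2000))                   # True
# With the check skipped, the DBD is accepted regardless
print(accept_dbd(neighbor_mtu=2000, local_mtu=1500, mtu_ignore=True))  # True
```

Note that the comparison is one-directional: only the side with the smaller MTU rejects the DBD, which is enough to stop the adjacency from progressing.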

Basically this means that if a router tries to negotiate an adjacency on an interface on which the remote neighbor has a larger MTU, the adjacency will be denied. The idea behind this check is two-fold. The first is to alleviate a problem in the data plane, in which a sending host transmits packets that are too large for the receiver to accept. Typically, Path MTU Discovery (PMTUD) should be implemented on the sender to prevent this case; however, this process relies on ICMP messages that could be filtered out in the transit path due to a security policy. The second, and more important, is to alleviate a problem in the control plane, in which the OSPF packets themselves are exchanged.

Specifically this problem stems from the issue that the OSPF Hello, Database Description (DBD), Link-State Request (LSR), and Link-State Acknowledgement (LSAck) packets are generally small, but the Link-State Update (LSU) packets are generally not.

When establishing a new OSPF adjacency, the DBD packet is used to tell new neighbors what LSAs are in the database, but not to give the details about them. Specifically the DBD contains the LSA Header information, but not the actual LSA payload. The idea behind this is to optimize flooding in the case that the receiving router already received the LSA from another neighbor, in which case flooding does not need to occur during adjacency establishment.

For example, suppose that you and I, routers A and B, both have neighbors C and D, and the database is synchronized. If you and I form a new adjacency, my DBD exchange to you will say that I have LSAs A, B, C, and D in my database. Since you are already adjacent with C and D, and I am adjacent with them, you already have all of my LSAs, possibly with the exception of the new link that connects us. This means that even though I describe LSAs A and B to you with my DBD packet, you don’t send an LSR to me for them, which means I don’t send you an LSU about them. This is the normal optimization of how the database is exchanged so that excessive flooding doesn’t occur.
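
The decision described above, which LSAs to request after comparing a neighbor's DBD against the local database, can be sketched in a few lines of Python. This is a simplification under stated assumptions: each LSA is identified by a made-up string key, and “newer” is judged by sequence number alone, whereas real OSPF also considers checksum and age. The function name and the sample sequence numbers are hypothetical.

```python
def lsas_to_request(local_db: dict[str, int], dbd_headers: dict[str, int]) -> list[str]:
    """Return the LSA keys to ask for in a Link-State Request.

    Both arguments map a hypothetical LSA key to its sequence number.
    An LSA is requested only if it is missing locally or the neighbor's copy is newer.
    """
    return sorted(
        key
        for key, seq in dbd_headers.items()
        if key not in local_db or seq > local_db[key]
    )


# Router A is already synchronized through C and D before peering with B
router_a_db = {"A": 5, "B": 3, "C": 7, "D": 2}

# B describes the same LSAs at the same sequence numbers: nothing to request
print(lsas_to_request(router_a_db, {"A": 5, "B": 3, "C": 7, "D": 2}))  # []

# B has re-originated its own LSA to include the new link: only that one is requested
print(lsas_to_request(router_a_db, {"A": 5, "B": 4, "C": 7, "D": 2}))  # ['B']
```

Because the DBD carries only headers, this comparison is cheap; the full LSA bodies cross the link only for the keys that end up in the request.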

Suppose next that you, router A, know about LSAs A1 through An in your database, and I, router B, know about LSAs B1 through Bn. When we establish an adjacency your DBD to me will describe LSAs A1-An, while mine will describe LSAs B1-Bn. Since I don’t have LSAs A1-An, I will send you an LSR about them, and likewise since you don’t have B1-Bn, you will send an LSR about those to me. When you reply back to me with the LSUs about A1-An, it is likely that the LSU packet itself will contain more than one LSA in the payload, or that if the LSA is large, that it will span multiple IP fragments. The idea behind this is that since you need to send me more than one LSA, it’s more efficient to send them in as few LSUs as possible, instead of sending one LSA per LSU. The problem that can occur in this procedure however is when the router that is flooding has a larger MTU than the router that is receiving.

For example, suppose that the flooding router has a Gigabit Ethernet interface that supports Jumbo frames, which exceed the normal Ethernet MTU of 1500 bytes; however, the receiving router has not enabled Jumbo frame support, which implies that frames over 1500 bytes (excluding layer 2 overhead) will be dropped. If the flooding router sends multiple LSAs in an LSU forcing the packet size to exceed 1500 bytes, or if a single LSA sent by the flooding router is large enough to exceed 1500 bytes, such as a Router LSA (LSA Type 1) with many links, the results can be non-deterministic.

To demonstrate this, take a topology in which R1 connects to R2, and R2 connects to R3.

R1 and R2 connect with GigabitEthernet, while R2 and R3 connect with FastEthernet. R1 has the default MTU of 1500 bytes configured on its link to R2, while R2 has Jumbo frame support configured up to 2000 bytes. R2 and R3’s link uses the default MTU of 1500 bytes. Per the RFC’s defined behavior, R1 should reject an OSPF adjacency with R2. This default behavior can be seen as follows:

R1:
interface GigabitEthernet1/0
 ip address 12.0.0.1 255.255.255.0
!
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0

R2:
interface GigabitEthernet1/0
 mtu 2000
 ip address 12.0.0.2 255.255.255.0
!
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0

R1#debug ip packet detail
IP packet debugging is on (detailed)
R1#debug ip ospf adj
OSPF adjacency events debugging is on

01:07:18: OSPF: Rcv DBD from 2.2.2.2 on GigabitEthernet1/0 seq 0x172A opt 0x52 flag 0x7 len 32  mtu 2000 state EXSTART
01:07:18: OSPF: Nbr 2.2.2.2 has larger interface MTU
01:07:18: OSPF: Retransmitting DBD to 2.2.2.2 on GigabitEthernet1/0
01:07:18: OSPF: Up DBD Retransmit cnt to 5 for 2.2.2.2 on GigabitEthernet1/0
01:07:18: OSPF: Send DBD to 2.2.2.2 on GigabitEthernet1/0 seq 0x1813 opt 0x52 flag 0x7 len 32

In this case we can see that R1 rejects R2’s DBD packet, since the MTU it advertises is larger than R1’s own. Although the obvious solution is to simply match the MTU of the links and avoid the problem in the first place, IOS also offers the “ip ospf mtu-ignore” command at the interface level to skip this check in the OSPF adjacency state machine. Once applied, as seen below, R1 and R2 form an adjacency.

R1#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#interface Gig1/0
R1(config-if)#ip ospf mtu-ignore
R1(config-if)#end
R1#
%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on GigabitEthernet1/0 from LOADING to FULL, Loading Done
R1#show ip ospf neighbor 

Neighbor ID     Pri   State           Dead Time   Address         Interface
2.2.2.2           1   FULL/DR         00:00:36    12.0.0.2        GigabitEthernet1/0

At this point, both R1 and R2 learn the routes to each other’s Loopback0 interfaces, as seen below.

R1#show ip route ospf
     2.0.0.0/32 is subnetted, 1 subnets
O       2.2.2.2 [110/2] via 12.0.0.2, 00:00:05, GigabitEthernet1/0

R2#show ip route ospf
     1.0.0.0/32 is subnetted, 1 subnets
O       1.1.1.1 [110/2] via 12.0.0.1, 00:00:46, GigabitEthernet1/0

As expected, however, since there is an MTU mismatch, R1 is unable to receive packets from R2 that exceed its MTU of 1500 bytes.

R2#ping 1.1.1.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/16/20 ms

R2#ping
Protocol [ip]:
Target IP address: 1.1.1.1
Repeat count [5]:
Datagram size [100]: 2000
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]:
Set DF bit in IP header? [no]: yes
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 5, 2000-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Theoretically this MTU mismatch should not matter, since end hosts that send traffic should ideally implement Path MTU Discovery. However, let’s now see a case where R2 is unable to flood LSAs to R1 for which the IP packet size exceeds 1500 bytes.

R3, which connects to R2, has been configured with a large number of Loopback interfaces in order to generate a large Router LSA (LSA Type 1). R3’s configuration is as follows, where Loopbacks 3.3.3.2 – 3.3.3.253 have been omitted:

R3:
interface FastEthernet0/0
 ip address 23.0.0.3 255.255.255.0
 shutdown
!
interface Loopback3330
 ip address 3.3.3.0 255.255.255.255
!
[snip]
!
interface Loopback333254
 ip address 3.3.3.254 255.255.255.255
!
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0

The number of resulting local links can be seen in R3’s database as follows:

R3#show ip ospf database

            OSPF Router with ID (23.0.0.3) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
23.0.0.3        23.0.0.3        299         0x80000007 0x0050D2 254

Now let’s activate the link between R2 and R3, which will cause R3 to flood a large Router LSA to R2, which in turn causes R2 to flood this to R1.

R3#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R3(config)#int Fa0/0
R3(config-if)#no shutdown
R3(config-if)#end
R3#

R2#debug ip packet detail
IP packet debugging is on (detailed)
R2#debug ip ospf packet
OSPF packet debugging is on

R2#config t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#interface Fa2/0
R2(config-if)#no shutdown
R2(config-if)#end
R2#
%SYS-5-CONFIG_I: Configured from console by console
IP: s=23.0.0.3 (FastEthernet2/0), d=224.0.0.5, len 76, rcvd 0, proto=89
OSPF: rcv. v:2 t:1 l:44 rid:23.0.0.3
      aid:0.0.0.0 chk:D59B aut:0 auk: from FastEthernet2/0
IP: s=23.0.0.2 (local), d=23.0.0.3 (FastEthernet2/0), len 80, sending, proto=89
[snip]

R2 and R3 form an adjacency, and R3’s LSA is flooded to R2. Since the LSU carrying the LSA does not fit in a single 1500-byte packet, it is fragmented into multiple packets, with the largest being the shared MTU of 1500 bytes between them.

IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 1500, rcvd 0
    IP Fragment, Ident = 497, fragment offset = 0, proto=89
IP: recv fragment from 23.0.0.3 offset 0 bytes
IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 1500, rcvd 0
    IP Fragment, Ident = 497, fragment offset = 1480
IP: recv fragment from 23.0.0.3 offset 1480 bytes
IP: s=23.0.0.3 (FastEthernet2/0), d=23.0.0.2, len 172, rcvd 0
    IP Fragment, Ident = 497, fragment offset = 2960
IP: recv fragment from 23.0.0.3 offset 2960 bytes
OSPF: rcv. v:2 t:4 l:3112 rid:23.0.0.3
      aid:0.0.0.0 chk:297C aut:0 auk: from FastEthernet2/0
%OSPF-5-ADJCHG: Process 1, Nbr 23.0.0.3 on FastEthernet2/0 from LOADING to FULL, Loading Done
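
The l:3112 length reported for the LSU above can be cross-checked against the packet formats in RFC 2328. The following Python sketch does that arithmetic, assuming the LSU carries a single Router LSA with 255 links and no TOS metrics (255 is the link count shown for R3’s LSA in the database output later in this post). The constant and function names are made up for illustration.

```python
# Fixed sizes in bytes, taken from the RFC 2328 packet and LSA formats
IP_HEADER = 20            # IPv4 header without options
OSPF_HEADER = 24          # common OSPF packet header
LSU_LSA_COUNT_FIELD = 4   # "# LSAs" field at the start of a Link-State Update
LSA_HEADER = 20           # common LSA header
ROUTER_LSA_FIXED = 4      # flags plus "# links" field
ROUTER_LSA_PER_LINK = 12  # one link description with no TOS metrics


def router_lsa_length(link_count: int) -> int:
    """Length of a Router LSA that describes link_count links."""
    return LSA_HEADER + ROUTER_LSA_FIXED + ROUTER_LSA_PER_LINK * link_count


def lsu_length(lsa_lengths: list[int]) -> int:
    """Length of the OSPF packet (header included) for an LSU carrying these LSAs."""
    return OSPF_HEADER + LSU_LSA_COUNT_FIELD + sum(lsa_lengths)


lsa = router_lsa_length(255)
lsu = lsu_length([lsa])
print(lsa)              # 3084
print(lsu)              # 3112, matching l:3112 in the debug output
print(lsu + IP_HEADER)  # 3132, the size of the unfragmented IP packet
```

The three fragments received from R3 carry 1480 + 1480 + 152 = 3112 bytes of payload, which is the same OSPF packet length.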

Once the adjacency is full, R2 installs R3’s routes and begins to flood to R1:

R2#show ip route ospf
     1.0.0.0/32 is subnetted, 1 subnets
O       1.1.1.1 [110/2] via 12.0.0.1, 00:00:10, GigabitEthernet1/0
     3.0.0.0/32 is subnetted, 254 subnets
O       3.3.3.1 [110/2] via 23.0.0.3, 00:00:10, FastEthernet2/0
[snip]
O       3.3.3.254 [110/2] via 23.0.0.3, 00:00:10, FastEthernet2/0

R2#
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 3132, sending broad/multicast, proto=89
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 1996, sending fragment
    IP Fragment, Ident = 854, fragment offset = 0, proto=89
IP: s=12.0.0.2 (local), d=224.0.0.5 (GigabitEthernet1/0), len 1156, sending last fragment
    IP Fragment, Ident = 854, fragment offset = 1976
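
The fragment sizes in the debug output follow from ordinary IPv4 fragmentation arithmetic: every fragment except the last must carry a multiple of 8 payload bytes. The following Python sketch reproduces the numbers for the 3132-byte packet at both MTUs, assuming a 20-byte IP header with no options; the function name is hypothetical and this is not how IOS implements fragmentation internally.

```python
def fragment(total_length: int, mtu: int, header: int = 20) -> list[tuple[int, int]]:
    """Return (fragment_length, payload_offset) for each fragment of an IP packet."""
    if total_length <= mtu:
        return [(total_length, 0)]
    payload = total_length - header
    # Largest payload per fragment that fits the MTU and is a multiple of 8
    chunk = (mtu - header) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload:
        size = min(chunk, payload - offset)
        fragments.append((size + header, offset))
        offset += size
    return fragments


# R3 to R2 over the 1500-byte FastEthernet link
print(fragment(3132, 1500))  # [(1500, 0), (1500, 1480), (172, 2960)]

# R2 to R1 over the GigabitEthernet link, where R2 believes the MTU is 2000
to_r1 = fragment(3132, 2000)
print(to_r1)                 # [(1996, 0), (1156, 1976)]

# Fragments larger than R1's 1500-byte MTU
print([f for f in to_r1 if f[0] > 1500])  # [(1996, 0)]
```

Only the first fragment exceeds R1’s MTU, but losing one fragment is enough: the datagram can never be reassembled, so the whole LSU is lost.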

Note that since the LSU carrying the LSA exceeds R2’s MTU of 2000 bytes, it is fragmented into multiple packets. Since R1 cannot accept packets that exceed its MTU of 1500 bytes, the LSUs are never received. This means that R1 cannot synchronize its database with R2, as seen below.

R1#show ip ospf database

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         62          0x80000005 0x6592   2
2.2.2.2         2.2.2.2         35          0x8000000D 0x613E   3

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
12.0.0.1        1.1.1.1         62          0x80000001 0x61BB
23.0.0.3        23.0.0.3        36          0x80000001 0x974C  

R2#show ip ospf database

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         67          0x80000005 0x6592   2
2.2.2.2         2.2.2.2         38          0x8000000D 0x613E   3
23.0.0.3        23.0.0.3        39          0x80000005 0x2AAD   255

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
12.0.0.1        1.1.1.1         67          0x80000001 0x61BB
23.0.0.3        23.0.0.3        39          0x80000001 0x974C  

R3#show ip ospf database

            OSPF Router with ID (23.0.0.3) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         69          0x80000005 0x006592 2
2.2.2.2         2.2.2.2         40          0x8000000D 0x00613E 3
23.0.0.3        23.0.0.3        39          0x80000005 0x002AAD 255

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
12.0.0.1        1.1.1.1         69          0x80000001 0x0061BB
23.0.0.3        23.0.0.3        39          0x80000001 0x00974C

This also implies that R1 cannot install routes towards R3:

R1#show ip route ospf
     2.0.0.0/32 is subnetted, 1 subnets
O       2.2.2.2 [110/2] via 12.0.0.2, 00:00:02, GigabitEthernet1/0
     23.0.0.0/24 is subnetted, 1 subnets
O       23.0.0.0 [110/2] via 12.0.0.2, 00:00:02, GigabitEthernet1/0

Eventually the adjacency between R1 and R2 is lost, due to the lack of LSAcks sent in response to R2’s LSUs. This can be seen in R1’s “debug ip ospf packet” output as follows, and in the “show ip ospf neighbor” output on both devices:

R1#
OSPF: rcv. v:2 t:1 l:44 rid:2.2.2.2
      aid:0.0.0.0 chk:DC98 aut:0 auk: from GigabitEthernet1/0
OSPF: Cannot see ourself in hello from 2.2.2.2 on GigabitEthernet1/0, state INIT

R1#show ip ospf neighbor 

Neighbor ID     Pri   State           Dead Time   Address         Interface
2.2.2.2           1   LOADING/DR      00:00:34    12.0.0.2        GigabitEthernet1/0

R2#show ip ospf neighbor 

Neighbor ID     Pri   State           Dead Time   Address         Interface
23.0.0.3          1   FULL/DR         00:00:35    23.0.0.3        FastEthernet2/0
1.1.1.1           1   FULL/BDR        00:00:39    12.0.0.1        GigabitEthernet1/0

The key with this example is that although the “ip ospf mtu-ignore” command allows the initial adjacency to form between R1 and R2, we can see that synchronization fails between them when an LSA replication event causes packet sizes generated by R2 to exceed R1’s MTU.

Based on this we can see that the “ip ospf mtu-ignore” command is not a fix to the underlying problem. Instead it is simply an exception to the OSPF adjacency state machine. The real fix to this problem is to ensure that the MTU values match between neighbors, which prevents both failures of routing exchange in the control plane and packet drops due to unsupported sizes in the data plane.

About Brian McGahan, CCIE #8593, CCDE #2013::13:

Brian McGahan was one of the youngest engineers in the world to obtain the CCIE, having achieved his first CCIE in Routing & Switching at the age of 20 in 2002. Brian has been teaching and developing CCIE training courses for over 10 years, and has assisted thousands of engineers in obtaining their CCIE certification. When not teaching or developing new products Brian consults with large ISPs and enterprise customers in the midwest region of the United States.


25 Responses to “OSPF and MTU Mismatch”

 
  1. Patrick says:

    Very nice write-up.

  2. Edson Soares says:

    Very helpful. tks.

  3. Titus says:

    awesome..thats i always check with INE each time i need some depth on a piece of technology. thx.

  4. Darren says:

    Great article.

    I must admit, I’ve probably used OSPF mtu-ignore too many times in the past, but this is a real eye opener.

  5. Bob Mars says:

    Where is the routing loop though? The other post leading to this talked about a routing loop.

    Don’t get me wrong, it’s a great post, but I’m still confused.

    • @Bob Any time routers in the network do not agree on the topology, both traffic loops and traffic black holes can occur. Typically these temporary conditions are deemed a “transient” loop. Essentially it is a failure in convergence that results in packet loss. One case in which this can occur is the MTU issue described above. This is not the same as a routing loop, such as one due to redistribution.

  6. Moses Sokabi says:

    I agree with Darren…same here!

  7. chrismarget says:

    “Theoretically this MTU mismatch should not matter, since end hosts that send traffic should ideally implement Path MTU Discovery.”

    I’m struggling with this assertion.

    Suppose a router is trying to deliver a large IP packet to a host (or to another router) with a too-small MTU configured. The router will format the large frame, and put it onto the wire. The receiving host/router will log an error.

    How will the sending (large MTU) router know that it’s formatting un-receivable frames, so that it can generate an ICMP “too big” message in order to effect PMTUD?

    I’m under the impression that these rules apply to MTU sizing…
    1) All L3 systems (hosts/routers) sharing an IP subnet must agree on the MTU.
    2) All L2 gear must support an MTU at least as large as the L3 system MTU.

    Do I need to adjust my thinking on these rules?

    • If all links in the transit path use an MTU of 1500 bytes, but one segment or just one router supports giants, let’s say 4000 bytes, PMTUD should automatically negotiate the MTU down to 1500. Anything larger than this would assume that all devices in the transit path support larger than 1500, which includes the end host.

      • chrismarget says:

        Hi Brian,

        I’m down with the “one segment supports” an oddball MTU scenario.

        But I remain confused about the scenario where only one /interface/ in a broadcast domain supports a large MTU. It seems to me that this one device has no way to know about his neighbor’s limitation.

        If he doesn’t know about his neighbor’s small MTU, then he’ll happily forward an un-receivable frame onto the wire, rather than kicking back a “too big” message to the originator.

        Mismatched MTU within a broadcast domain seems utterly unsupportable. What am I missing here?

        • Right, the issue is that the routers don’t know that there is an MTU mismatch, hence the packet drops. This is what the OSPF problem demonstrates. Theoretically end hosts *could* prevent this by agreeing on the lowest MTU bidirectionally, but they don’t. PMTUD doesn’t normally account for this because mismatching the MTU between devices on the same segment isn’t a design issue, it’s just a misconfiguration.

          • chrismarget says:

            Okay, we’re on the same page then.

            I guess I interpreted your PMTUD comment to mean that PMTUD is able to catch/fix/workaround mismatched router MTU on a transit link.

            Thanks for clarifying!

  8. Ahmed says:

    I’m a bit confused here,

    Do you mean PMTUD is working when trying to send a traffic ( Data Plane ) only.

    why PMTUD didn’t work when R2 trying sending the LSU to R1 ( is sending LSU to OSPF neighbor not considered a Data Plane ).

    or did you build the scenario with PMTUD disabled on one side between R1 & R2?

    Thanks,

    • @Ahmed PMTUD does not fix the problem outlined in this post. The problem is in the control plane (OSPF) not the data plane (end host traffic). PMTUD works in the data plane not the control plane.

      • Pan says:

        So I am hopeful that I have found a solution in your post here, but I don’t get exactly what I am supposed to do. My setup is simply a cable modem in bridge mode wired into my Apple Extreme and wirelessly connected to the Xbox, no static IP. I get the MTU error. Any help would be greatly appreciated. By the way, I too am a Nole fan (Class of ’97, Marching Chief).

  9. Ricardo Simba says:

    Thanks Brian,

    Now I understand that the simple “ip ospf mtu-ignore” isn’t enough and why.

  10. Mark says:

    Still very helpful almost a year on.

    Going to have to rethink some of our mtu-ignores on our routers.

    Thank you for this post.

  11. Patil says:

    Thanks a lot Brian… its very helpful.

  12. raja says:

    Hi Brian,

    We having the same issue here. We have two 4500 routers which is R1 and R2. The R1 and R2 has two interconnect which are primary and secondary link for redundant. For the primary interconnect, we use bandwidth shaper in between R1 and R2. For secondary link its just a direct connection between R1 and R2.

    R1 interface configured with 1500 MTU size and R2 with 1520 MTU.
    OSPF is running in between these two routers.
    R1 interface configured “ip ospf mtu-ignore” but R2 nothing.

    The situation is that the OSPF adjacency has no issue on the primary link, but when shifted to the secondary link, the issue started whereby it built the adjacency to FULL and after 1 minute it shows DOWN. After 30 minutes, it’s back to FULL and then DOWN.

    Can i know what was the reason behind this?

  13. Pradeep says:

    HI Brian,

    Good document i have one question ,

    When R2 sent a packet with 2000 bytes , R1 rejects it as he cannot accept it , R1 should send a ICMP message saying fragmentation needed , why does R1 not send a ICMP message ?

    Thanks
    -BG

  14. Noel says:

    Great, well put.

  15. Asgar says:

    Brain Guru,

    My topology includes one 65k and 2 Nexus 7k. Configuring OSPF between them, I am getting a “too many retransmissions” message on the 65k, and neighbor 7k2 is stuck in EXCHANGE state.
    - Already configured “ip ospf mtu-ignore” on both the interfaces, still the same problem. Any help would be great.

 
