Jan 14

I recently received an email from a student with a question about an example I did in our multicast bootcamp. About an hour into testing and drafting my email response, I realized this commonly misunderstood multicast design would make a great blog writeup! The original question is as follows:

Dear Brian

I am a customer of INE and bought the multicast bootcamp. Maybe I missed some important note, but I am confused related to the issue mentioned below. I am following the test bed you have shown in the presentation while describing the theory of sparse mode (Day 1 – Part 6) in which you have explained the RP Register, Join and SPT-Join.

Suppose the two trees are established, and traffic is flowing from the source to RP, and then RP to receiver. Also suppose that SPT-Join is disabled (e.g. threshold is infinity), and traffic always follows the shared trees.

Suppose that the multicast traffic flow is initiated from the source to the RP as follows:

R2 –> R4 –> R3 (RP)

Then traffic flows from the RP to receiver:

R3 –> R4 –>R5

When multicast traffic is coming from the RP on R4, will RPF check fail? I assume so, since multicast traffic is entering the interface in which RPF will be failed. Is there any other rule to follow if traffic is coming from RP?

Best Regards

Karim

Hi Karim,

First let’s do a quick recap on the different trees that are formed in PIM sparse mode. When a client wants to receive traffic for a particular group, it sends a (*,G) IGMP Report (a join) onto its LAN segment. The multicast router on this LAN segment turns the IGMP join message into a PIM Join message, and sends it upstream towards the RP. The result of this is a (*,G) shared tree rooted at the RP, which points down towards the receiver. This is considered the RPT, or simply the “shared tree”.

Next, the source of traffic begins sending multicast packets to this particular group onto its LAN segment. The multicast router on this LAN, the PIM DR, receives the feed and now knows about the source and group pair (S,G). Subsequently, the PIM DR sends a PIM Register message to the RP, informing it about the new (S,G) pair. Assuming that the rest of the PIM design is correct, e.g. no RPF failure or missing routes, the RP acknowledges the DR by replying with a PIM Register Stop message. Now the RP knows about the source via the (S,G) pair that was registered, and it knows about the receiver via the (*,G) join that came from downstream. At this point, the RP now needs to join the (S,G) tree to actually receive the traffic from the source. This is done via an (S,G) PIM Join up the reverse path towards the source. The result is a tree rooted at the source destined for the RP. This is considered the SPT, or simply the “shortest path tree”.

At this point traffic flows from the sender to the RP via the shortest path tree, and then from the RP down to the receiver via the shared tree. In many designs it is not an optimal path to have traffic flow through the RP, so the router attached to the receiver can optimize the traffic flow by joining a new shortest path tree, and then leaving the shared tree. This is accomplished by sending an SPT join, basically an (S,G) join, back towards the source, and then leaving the shared tree by sending a (*,G) prune up towards the RP. Once this process is complete, traffic can flow from the sender to the receiver without having to first go through the RP, assuming the RP isn’t already in the shortest path. Also note that this is the default behavior of PIM sparse mode.

For the purpose of this example we’ll ignore this last point about the SPT join, because as you requested “Suppose that SPT-Join is disabled (e.g. threshold is infinity), and traffic always follows the shared trees.” Now let’s take a look at the specific topology that will be used for this example, as seen below. R3 is the RP, SW2 is the sender, and SW1 is the receiver. Assume that an IGP is enabled everywhere, PIM sparse mode is enabled on all transit links, all devices agree that R3’s address 150.1.3.3 is the RP address, and R5 has the ip pim spt-threshold infinity command configured so it won’t initiate an SPT join.
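To make the effect of the threshold concrete, here is a minimal Python sketch of the switchover decision on the last-hop router. The function name and the simple rate comparison are assumptions made for illustration and are not IOS internals; the facts taken from the text are only that joining the SPT is the default behavior and that a threshold of infinity keeps the router on the shared tree.

```python
import math


def should_join_spt(rate_kbps: float, threshold_kbps: float = 0.0) -> bool:
    """Decide whether a last-hop router would send an SPT join.

    Hypothetical model: a threshold of 0 kbps (assumed default) means
    "switch as soon as traffic is seen", while math.inf models
    "ip pim spt-threshold infinity" and never switches.
    """
    if math.isinf(threshold_kbps):
        return False  # stay on the (*,G) shared tree forever
    return rate_kbps >= threshold_kbps


# Default behavior: any traffic triggers the SPT join.
print(should_join_spt(64.0))
# R5 in this example: threshold is infinity, so no SPT join is sent.
print(should_join_spt(64.0, math.inf))
```

With the infinity setting the decision is independent of the traffic rate, which is why R5 only ever holds (*,G) state in this example.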

Now following the rules of how trees should be built, let’s think about how the traffic flow should look when a feed starts at SW2 and is received by SW1. Assuming SW1 joins the group first, a (*,G) IGMP join message will be sent up to R5. R5 will see that R3 is the RP for this group, and send a (*,G) PIM join message towards R4 which is destined for R3. R4 in turn forwards the (*,G) join to R3. At this point R3, R4, and R5 have the (*,G) state pointing down towards SW1. Next, traffic from the sender, SW2, begins to flow.
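The hop-by-hop join propagation described above can be sketched in a few lines of Python. This is only an illustrative model, not how a router implements PIM: the data layout and the function name are assumptions, and the interface names are taken from the show ip mroute output that appears later in this post. Each router that receives the join adds the arrival interface to its outgoing list and records the interface towards the RP as the incoming interface.

```python
# Path of the (*,G) join, from the last-hop router up to the RP.
# Tuple layout (assumed for this sketch):
#   (router, interface towards the RP, interface the join arrived on)
# For R5 the "join" is really the IGMP report from SW1.
JOIN_PATH = [
    ("R5", "FastEthernet0/0.45", "FastEthernet0/0.57"),
    ("R4", "FastEthernet0/0.34", "FastEthernet0/0.45"),
    ("R3", None, "FastEthernet0/0.34"),  # R3 is the RP, so no upstream
]


def build_shared_tree(join_path):
    """Return the (*,G) state each router installs as the join moves upstream."""
    state = {}
    for router, toward_rp, arrived_on in join_path:
        state[router] = {
            "incoming": toward_rp,      # points up towards the RP
            "outgoing": [arrived_on],   # points down towards the receiver
        }
    return state


shared_tree = build_shared_tree(JOIN_PATH)
for router in ("R3", "R4", "R5"):
    entry = shared_tree[router]
    print(router, "incoming:", entry["incoming"], "outgoing:", entry["outgoing"])
```

The RP ends up with a Null incoming interface because it is the root of the shared tree, while every other router's incoming interface points at its RPF neighbor for the RP address.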

R2 receives the multicast feed from SW2, and sends an (S,G) PIM Register message to R3. R3 replies with a PIM Register Stop. At this point both R2 and R3 have the (S,G) entry installed, but no one else in the network does. R3 now realizes that it knows both the sender and the receiver, so it sends a new (S,G) PIM Join to R4 that is destined for R2, the PIM DR on the sender’s LAN. Once this is successful, R2, R4, and R3 have the (S,G) state installed in the multicast routing table. The actual packet flow should then look as follows:

• SW2 sends the feed
• R2 uses the (S,G) tree to forward the packets to R4
• R4 uses the (S,G) tree to forward the packets to R3
• R3 uses the (*,G) tree to forward the packets to R4
• R4 uses the (*,G) tree to forward the packets to R5
• R5 uses the (*,G) tree to forward the packets to SW1

If this were to be the case, there would technically be a traffic loop in the topology. This is because R4 sends the feed to R3 using the (S,G) tree, but R3 sends it right back to R4 using the (*,G) tree. Even if the packets don’t loop infinitely, shouldn’t we at least see an RPF failure on R4 for the packets that are received back from R3? The answer is… yes and no :) When traffic follows this tree, RPF failure will occur on R4 and the packets will be dropped. However, PIM SM fixes this problem by automatically optimizing the merging of the SPT and the RPT on the device where both of the trees converge. In reality the traffic flow will look like this:

• SW2 sends the feed
• R2 uses the (S,G) tree to forward the packets to R4
• R4 uses the (S,G) tree to forward the packets to R5
• R5 uses the (*,G) tree to forward the packets to SW1
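The "yes and no" answer comes down to which entry R4 uses for its RPF check. The following Python sketch models that lookup under stated assumptions: the table layout, the function name, and the returned strings are made up for illustration, while the interface names and addresses come from the show and debug output later in this post. A more specific (S,G) entry is preferred over the (*,G) entry, and a packet is only forwarded if it arrives on that entry's incoming interface.

```python
from typing import NamedTuple


class MrouteEntry(NamedTuple):
    incoming: str     # RPF interface for this entry
    outgoing: tuple   # outgoing interface list


# R4's two entries for group 224.1.1.1 (values from its show ip mroute output).
R4_MROUTES = {
    ("*", "224.1.1.1"): MrouteEntry("FastEthernet0/0.34", ("FastEthernet0/0.45",)),
    ("150.1.28.8", "224.1.1.1"): MrouteEntry("FastEthernet0/0.24", ("FastEthernet0/0.45",)),
}


def forward(mroutes, source, group, arrived_on):
    """Return (verdict, outgoing interfaces) for one multicast packet."""
    entry = mroutes.get((source, group))      # prefer the (S,G) entry
    if entry is None:
        entry = mroutes.get(("*", group))     # fall back to the shared tree
    if entry is None:
        return "no state", ()
    if arrived_on != entry.incoming:
        return "not RPF interface", ()        # RPF failure, packet dropped
    return "mforward", entry.outgoing


# Copy arriving directly from R2 passes RPF and goes straight down to R5.
print(forward(R4_MROUTES, "150.1.28.8", "224.1.1.1", "FastEthernet0/0.24"))
# Copy arriving back from the RP fails RPF against the (S,G) entry.
print(forward(R4_MROUTES, "150.1.28.8", "224.1.1.1", "FastEthernet0/0.34"))

# With only (*,G) state, the copy from the RP would pass RPF instead.
shared_only = {key: val for key, val in R4_MROUTES.items() if key[0] == "*"}
print(forward(shared_only, "150.1.28.8", "224.1.1.1", "FastEthernet0/0.34"))
```

The last call shows why the RPF failure only appears on the router where the two trees merge: a router holding just the (*,G) entry checks against the RP, not the source.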

While this result seems to break the logic of forming the SPT and the RPT, it’s actually the normal behavior per the RFC. I tried finding the specific portion of the RFC that mentions this, but the section that defines Join/Prune state management is basically written in programming logic. If you’re feeling adventurous, you can find the details in RFC 4601 – Protocol Independent Multicast – Sparse Mode (PIM-SM). Specifically this logic should be part of section 4.5, PIM Join/Prune Messages, but good luck finding it :) Instead, let’s use the IOS CLI to demonstrate the final result of this design.

Refer to the diagram above for topology information. Our first step is to configure the receiver to join a group:

SW1#config t
Enter configuration commands, one per line.  End with CNTL/Z.
SW1(config)#interface Vlan57
SW1(config-if)#ip igmp join-group 224.1.1.1
SW1(config-if)#end
SW1#

At this point the (*,G) join for the RPT should be propagated up to the RP, R3. This can be seen in the below debug output.

R3#debug ip pim
PIM debugging is on

PIM(0): Received v2 Join/Prune on FastEthernet0/0.34 from 150.1.34.4, to us
PIM(0): Join-list: (*, 224.1.1.1), RPT-bit set, WC-bit set, S-bit set
PIM(0): Check RP 150.1.3.3 into the (*, 224.1.1.1) entry
PIM(0): Add FastEthernet0/0.34/150.1.34.4 to (*, 224.1.1.1), Forward state, by PIM *G Join

R3, R4, and R5 should now have the (*,G) entry in the multicast routing table. The incoming interfaces point up towards the RP, while the outgoing interfaces point down towards SW1, the receiver.

R3#show ip mroute 224.1.1.1 | begin ^\(
(*, 224.1.1.1), 00:01:14/00:03:13, RP 150.1.3.3, flags: S
  Incoming interface: Null, RPF nbr 0.0.0.0
  Outgoing interface list:
    FastEthernet0/0.34, Forward/Sparse, 00:01:14/00:03:13

R4#show ip mroute 224.1.1.1 | begin ^\(
(*, 224.1.1.1), 00:01:23/00:03:06, RP 150.1.3.3, flags: S
  Incoming interface: FastEthernet0/0.34, RPF nbr 150.1.34.3
  Outgoing interface list:
    FastEthernet0/0.45, Forward/Sparse, 00:01:23/00:03:06

R5#show ip mroute 224.1.1.1 | begin ^\(
(*, 224.1.1.1), 00:01:25/00:02:02, RP 150.1.3.3, flags: SC
  Incoming interface: FastEthernet0/0.45, RPF nbr 150.1.45.4
  Outgoing interface list:
    FastEthernet0/0.57, Forward/Sparse, 00:01:25/00:02:02

Now the source, SW2, begins generating traffic:

SW2#ping 224.1.1.1 repeat 100

Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 224.1.1.1, timeout is 2 seconds:
..
Reply to request 2 from 150.1.57.7, 8 ms
Reply to request 3 from 150.1.57.7, 8 ms

R3 receives the register message from R2 about the (S,G) pair, decapsulates the register, forwards the traffic down the (*,G) tree, and then joins the SPT back to 150.1.28.8.

PIM(0): Received v2 Register on FastEthernet0/0.34 from 150.1.24.2
     for 150.1.28.8, group 224.1.1.1
PIM(0): Insert (150.1.28.8,224.1.1.1) join in nbr 150.1.34.4's queue
PIM(0): Forward decapsulated data packet for 224.1.1.1 on FastEthernet0/0.34
PIM(0): Building Join/Prune packet for nbr 150.1.34.4
PIM(0): Adding v2 (150.1.28.8/32, 224.1.1.1), S-bit Join
PIM(0): Send v2 join/prune to 150.1.34.4 (FastEthernet0/0.34)

R3 is now joined to the SPT, and is the root of the RPT. The result is that R4 receives redundant packets, as seen from the below debug.

R4#debug ip mpacket
IP multicast packets debugging is on

IP(0): s=150.1.28.8 (FastEthernet0/0.24) d=224.1.1.1 (FastEthernet0/0.45) id=353, ttl=253, prot=1, len=100(100), mforward
IP(0): s=150.1.28.8 (FastEthernet0/0.34) d=224.1.1.1 id=353, ttl=253, prot=1, len=114(100), not RPF interface

R4 now tells R3 to leave the SPT, because the traffic flow is redundant. It does this by sending an (S,G) prune with the RPT-bit set:

R4#
PIM(0): Insert (150.1.28.8,224.1.1.1) sgr prune in nbr 150.1.34.3's queue
PIM(0): Building Join/Prune packet for nbr 150.1.34.3
PIM(0):  Adding v2 (150.1.28.8/32, 224.1.1.1), RPT-bit, S-bit Prune
PIM(0): Send v2 join/prune to 150.1.34.3 (FastEthernet0/0.34)

R3 gets this message from R4:

R3#
PIM(0): Received v2 Join/Prune on FastEthernet0/0.34 from 150.1.34.4, to us
PIM(0): Prune-list: (150.1.28.8/32, 224.1.1.1) RPT-bit set

R3 has both the (*,G) and the (S,G), but the (S,G) is now pruned:

R3#show ip mroute 224.1.1.1 | begin ^\(
(*, 224.1.1.1), 00:05:59/stopped, RP 150.1.3.3, flags: S
  Incoming interface: Null, RPF nbr 0.0.0.0
  Outgoing interface list:
    FastEthernet0/0.34, Forward/Sparse, 00:05:59/00:03:26

(150.1.28.8, 224.1.1.1), 00:00:03/00:02:58, flags: PTX
  Incoming interface: FastEthernet0/0.34, RPF nbr 150.1.34.4
  Outgoing interface list: Null

R4 is on the SPT for the (S,G), but does not forward the traffic towards the RP:

R4#show ip mroute 224.1.1.1 | begin ^\(
(*, 224.1.1.1), 00:06:27/stopped, RP 150.1.3.3, flags: S
  Incoming interface: FastEthernet0/0.34, RPF nbr 150.1.34.3
  Outgoing interface list:
    FastEthernet0/0.45, Forward/Sparse, 00:06:27/00:02:56

(150.1.28.8, 224.1.1.1), 00:00:31/00:02:52, flags: T
  Incoming interface: FastEthernet0/0.24, RPF nbr 150.1.24.2
  Outgoing interface list:
    FastEthernet0/0.45, Forward/Sparse, 00:00:31/00:02:28

R5 is only on the (*,G) RPT, since the SPT threshold was set to infinity:

R5#show ip mroute 224.1.1.1 | begin ^\(
(*, 224.1.1.1), 00:06:36/00:02:55, RP 150.1.3.3, flags: SC
  Incoming interface: FastEthernet0/0.45, RPF nbr 150.1.45.4
  Outgoing interface list:
    FastEthernet0/0.57, Forward/Sparse, 00:06:36/00:02:55

Note that R5 is actually forwarding the packets:

R5#show ip mroute count
IP Multicast Statistics
2 routes using 1396 bytes of memory
2 groups, 0.00 average sources per group
Forwarding Counts: Pkt Count/Pkts(neg(-) = Drops) per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)

Group: 224.1.1.1, Source count: 0, Packets forwarded: 70, Packets received: 70
  RP-tree: Forwarding: 70/1/100/0, Other: 70/0/0
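If you want to check these counters from a script rather than by eye, the RP-tree line can be split according to the legend printed in the output above. This is a small Python sketch; the regular expression, the function name, and the dictionary keys are assumptions chosen for readability, and it has only been matched against the single line shown in this post.

```python
import re

# The RP-tree line from R5's "show ip mroute count" output above.
COUNT_LINE = "  RP-tree: Forwarding: 70/1/100/0, Other: 70/0/0"

# Field order follows the legend in the output:
#   Forwarding: Pkt Count / Pkts per second / Avg Pkt Size / Kilobits per second
#   Other:      Total / RPF failed / Other drops
PATTERN = re.compile(
    r"Forwarding:\s*(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+),\s*Other:\s*(\d+)/(\d+)/(\d+)"
)
KEYS = (
    "pkt_count", "pkts_per_sec", "avg_pkt_size", "kbps",
    "total", "rpf_failed", "other_drops",
)


def parse_count_line(line):
    """Parse one Forwarding/Other counter line into a dict of integers."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a forwarding counter line: %r" % line)
    return dict(zip(KEYS, (int(value) for value in match.groups())))


counters = parse_count_line(COUNT_LINE)
print("forwarded:", counters["pkt_count"], "RPF failed:", counters["rpf_failed"])
```

For the line above this reports 70 packets forwarded and no RPF failures on R5, which matches the point being made: R5 forwards everything it receives on the shared tree.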

So there we have it! Although R3 is initially part of the tree for this (S,G) pair, R4 quickly realizes that this path is redundant, and prunes R3 from the tree. This SPT Join and Prune process between R3 and R4 will continue as long as the source is sending traffic, because the DR will periodically refresh the Register message to the RP. Every time the register is re-sent, R3 will join the SPT, then R4 will prune R3 from the SPT.

What’s interesting about this example is that it’s commonly understood that with shared trees, all traffic must pass through the RP. In this case we demonstrated that this is not always true; the exception is that when the links from the source to the RP in the (S,G) tree are the same links used by the (*,G) tree from the RP down to the receiver, these links will automatically be pruned.

Be sure to check out our multicast bootcamp in which I show lots of these types of examples live on the command line.

Thanks for reading!

About Brian McGahan, CCIE #8593, CCDE #2013::13:

Brian McGahan was one of the youngest engineers in the world to obtain the CCIE, having achieved his first CCIE in Routing & Switching at the age of 20 in 2002. Brian has been teaching and developing CCIE training courses for over 10 years, and has assisted thousands of engineers in obtaining their CCIE certification. When not teaching or developing new products Brian consults with large ISPs and enterprise customers in the midwest region of the United States.


17 Responses to “Understanding Advanced PIM Shared Tree Designs”

 
  1. Karim Asif Sattar says:

    Dear Brian
    Thanks for the example. It really helps.

    Best Regards
    Karim Asif

  2. Yasir Ashfaque says:

    Excellent Article Brian, i really enjoyed :)

  3. jdr says:

    Good one. Thank you.

  4. Josh says:

    Hi Brian,

    Great article! I just had this exact issue come up while I was teaching a class. One of the routers was on the SPT and RPT, and the traffic was not flowing through the RP.

    One thing though, in the second paragraph after the diagram, you said “R3 now realizes that it knows both the sender and the receiver, so it sends a new (S,G) PIM Join to R4 that is destined for R5”. Isn’t this message destined for R3, the PIM DR on the Sender’s LAN? R3 is trying to join the SPT rooted at R3 correct?

    Josh

  5. Josh says:

    Sorry, those last two sentences should have said “Isn’t this message destined for R2, the PIM DR on the Sender’s LAN? R3 is trying to join the SPT rooted at R2 correct?”

  6. vishwa says:

    Hi Josh ,

    R4 is on the SPT for the (S,G), but does not forward the traffic towards the RP

    R4 now tells R3 to leave the SPT, because the traffic flow is redundant

    I have a question related to the above two comments by you.
    what will happen to the receiver who are also connected to RP ?
    If R4 is telling R3 to leave the SPT.

  7. sreekanth says:

    Good One , Made me understand the basics.

  8. Angela says:

    Thanks for teaching the basic.

    I have a question here:

    Will R4 still receiving some of the multicast packet from R3 at the very beginning of the prune ? Which will cause R4 treat those traffic as RPF failure?

    If the design is a triangle among R2 — R3 — R4 — R2, how does R4 avoid seeing the RPF check failure? Is it the same logic as the topology you taught in this article?

    Thanks

    • Yes, R4 will get the packets from R3 at first, which causes it to generate the prune. The same would occur in your triangle type topology. Your best bet is to actually try it out on the CLI and look at the debug outputs to fully understand what happens.

  9. driss says:

    thank you ,it’s a good article

  10. Aslam Usmani says:

    Excellent explanation…… as always ;)

  11. Muhammad Adeel says:

    Thanks, it’s really excellent topic

  12. sandeep verma says:

    Amazing scenario to cover and great explanation. I am TAC engineer and one of our customer faced the same design issue recently.

  13. JK says:

    In response to Vishwa’s question if there was a source on RP, you are better off with PIM BiDir.

  14. Srikkanth says:

    How does RPF failure happens on R4, as R4 receives the traffic from R3, which is the RP for that group?

    Correct me if I’m wrong but according to PIM-SM, RPF check is performed on the IP address of RP rather than the IP address of the source of the packet.

    Hence RPF failure shouldn’t occur here on R4, as it is receiving the traffic from R3.

  15. Richekooh says:

    I rather enjoyed this post. You explained things very well and your efforts are much appreciated!

  16. N1x ( Rati Jokhadze ) says:

    great post! thanks

 
