Carrier-of-Carriers Inter-provider L3-VPN on Junos vMX

Introduction

In this post I’ll show you an implementation of a Carrier-of-Carriers Inter-provider Layer-3 VPN on Junos vMX. I studied this as the last topic explained in the Juniper Networks JNCIS-IS MPLS Study Guide, which I suggest you read if you want to understand a lot of interesting features of the Juniper platform.

This is the topology of the lab (click on this link for the full-size image):

post6_fig2_carrier-of-carriers

 

 

As you can see in the image above, I’ve used a convention to number the Loopback and P-t-P interfaces of the routers: it seems complex, but you’ll get used to it after a while 🙂

Some examples:

  • Loopback of CE1-SP-PE7: router Rx is within group 1 (the number within the golden square) and x is equal to 7, so its loopback is 192.168.1.7/32
  • Interface of CE-A-11 toward CE1-SP-PE7: it is an inter-group P-t-P link, between R11 and R7, so it is within the 172.16.*.* network. Given that one of the two routers’ numbers is greater than 10, we sum 11 and 7, so we have 18 as the third octet. The IP address is 172.16.18.11/24. The interface is blue, so it is lt-0/0/10.117 with peer unit lt-0/0/10.711: the unit is 117 because we are on the router with ID 11 and the interface points toward the router with ID 7 (so, by the same logic, the router with ID 7 has unit 711 paired with unit 117)
  • Interface of SP-P1 toward SP-PE2: it is an intra-group P-t-P link, between R1 and R2, so it is within the 10.*.*.* network. The group is 0, so the second octet is 0. The third octet is given by XY, where X is the lower-numbered router, R1, and Y is the higher-numbered router, R2. So the third octet is 12. The last octet is given by the router number of SP-P1, so the IP address is 10.0.12.1/24. The interface is green, so it is ge-0/0/0.12 with vlan-id 12: the router with ID 1, facing the link toward the router with ID 2, uses ge-0/0/0; the unit is 12 and the vlan-id is the same, the concatenation of the two IDs, lowest first, since both are less than 10. The corresponding interface on the router with ID 2 is ge-0/0/1.12 with vlan-id 12.

As you can see, we’ll use /24 networks even if they are P-t-Ps, just to have the possibility to implement our numbering scheme.
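The numbering convention above can be sketched in a few lines of Python. This is only an illustration of the rules as I read them from the examples: the function names are made up for this sketch, and applying the "sum" rule whenever one of the two IDs is 10 or more is an assumption (concatenating 9 and 10 would not fit in an octet).

```python
from typing import Optional


def loopback(group: int, router: int) -> str:
    """Loopback of router R<router> in a group: 192.168.<group>.<router>/32."""
    return f"192.168.{group}.{router}/32"


def link_octet(a: int, b: int) -> int:
    """Third octet of the P-t-P link between routers a and b.

    Both IDs below 10: concatenate them, lowest first (1 and 2 -> 12).
    Otherwise: sum them (11 and 7 -> 18). Using the sum rule for ID 10
    as well is an assumption of this sketch.
    """
    low, high = sorted((a, b))
    if high < 10:
        return low * 10 + high
    return low + high


def p2p_address(local: int, peer: int, group: Optional[int] = None) -> str:
    """Address of router `local` on its link toward router `peer`.

    group=None means an inter-group link (172.16.x.y/24);
    otherwise the link is intra-group (10.<group>.x.y/24).
    """
    prefix = "172.16" if group is None else f"10.{group}"
    return f"{prefix}.{link_octet(local, peer)}.{local}/24"


def lt_units(local: int, peer: int) -> tuple:
    """Logical tunnel units: own ID followed by peer ID, and the reverse."""
    return int(f"{local}{peer}"), int(f"{peer}{local}")


print(loopback(1, 7))              # 192.168.1.7/32
print(p2p_address(11, 7))          # 172.16.18.11/24
print(p2p_address(1, 2, group=0))  # 10.0.12.1/24
print(lt_units(11, 7))             # (117, 711)
```

The same functions reproduce other addresses that appear in the outputs later in the post, for example 172.16.25.5 (R5 toward R2) and 10.1.67.6 (R6 toward R7 in group 1).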

On each link I’ve put one or more letters M, L or R to quickly show which protocols among MPLS, LDP and RSVP are enabled and running on the links.

Route-Distinguishers and Route-Targets used for the L3-VPNs are shown in the graphic.

The whole topology has been built on a single Juniper vMX virtual router, running within EVE-NG. Blue links are built on the lt-0/0/10 logical tunnel interface, while green links are built on the ge-0/0/0 and ge-0/0/1 physical interfaces, which are connected to each other. A single pair of ge interfaces is enough to build as many P-t-P links as we want: we just use a different vlan-id for each link. I’ve inserted some ge links because EVE-NG allows me to capture packets on those links, while that wouldn’t be possible on the lt-0/0/10 interface. I’ve used only one pair of ge interfaces because sniffing on ge-0/0/0 lets me see the traffic traversing the 4 segments in a row, with different vlan tags.

This is the lab on EVE-NG (em0 is the management interface connected to my LAN):

post6_fig1

The switch cloud on the right is a simple way to connect ge-0/0/0 with ge-0/0/1 using EVE-NG linux bridging facility, without the need to configure a virtual switch.

Objective: Service Providers 1 and 2 want to offer an L3-VPN called vpn-a between CE-A-11 and CE-A-12. The two service providers have different AS numbers and are interconnected by Service Provider 0, which offers them an inter-provider VPN called inter-vpn.

Problem: the vpn-a L3-VPN must be established with an MP-eBGP session between the two ASBRs of SP1 and SP2. No other router in SP1, SP2 and SP0 will know anything about vpn-a. We must build a label-switched path between CE1-SP-PE7 and CE2-SP-PE10. Usually IPv4 (family inet) iBGP/eBGP sessions do not attach a label to the routes sent to the neighbors. This would imply that CE1-SP-PE7 would receive a route to reach the CE2-SP-PE10 loopback without a label attached to it: a packet would then be sent toward CE1-SP-PE5 labeled only with the vpn-a VPN label, and CE1-SP-PE5, which as I’ve said doesn’t know anything about the vpn-a L3-VPN, would discard it. The same goes for the other routers in the path toward the destination.

Carrier-of-Carriers Service Provider L3-VPN

Service Provider 0 is configured with OSPF Area 0 as IGP. I’ve chosen to use RSVP to build the following LSPs, instead of enabling LDP:

  • Bidirectional LSPs between the SP-PE2 and SP-PE3 routers: these LSPs are needed to resolve the next-hop of inter-vpn L3-VPN routes in inet.3.
  • Unidirectional LSPs from the SP-RR4 Route-Reflector toward its RR-clients: these LSPs are needed because routes received by an RR that cannot resolve the corresponding next-hop in inet.3 are hidden and thus not “reflected” to other clients.

SP-PE2 and SP-PE3 each have an MP-iBGP session with SP-RR4, with only the inet-vpn family enabled, since we only need to carry labeled L3-VPN routes. SP-P1 is a P-router and doesn’t need to know anything about VPN routes, so it only needs to run OSPF, MPLS and RSVP on its interfaces, to allow the building of the RSVP LSPs mentioned above.

The only routes that we need within the inter-vpn.inet.0 table on SP-PE2 and SP-PE3 are the loopback addresses of CE1-SP-PE7 and CE2-SP-PE10, which need to reach each other to build the outermost MP-eBGP session for vpn-a. P-t-P links in the inter-vpn routing-instance are not sent as labeled VPN routes over the MP-iBGP sessions between SP-PE2 and SP-PE3 (through SP-RR4), because they are considered multi-access links and are only advertised if there is a route with a next-hop on that link or if we’re using vrf-table-label within the routing-instance (which is not the case for inter-vpn).

Here is the output of some show commands executed for the routers within SP0:

-- SP-PE2 --
admin> show route logical-system SP-PE2

inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

10.0.12.0/24 *[Direct/0] 01:14:57
 > via ge-0/0/1.12
10.0.12.2/32 *[Local/0] 01:14:59
 Local via ge-0/0/1.12
10.0.13.0/24 *[OSPF/10] 01:14:43, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
10.0.14.0/24 *[OSPF/10] 01:14:43, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
192.168.0.1/32 *[OSPF/10] 01:14:43, metric 1
 > to 10.0.12.1 via ge-0/0/1.12
192.168.0.2/32 *[Direct/0] 01:15:43
 > via lo0.2
192.168.0.3/32 *[OSPF/10] 01:14:33, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
192.168.0.4/32 *[OSPF/10] 01:14:38, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
224.0.0.5/32 *[OSPF/10] 01:15:46, metric 1
 MultiRecv

inet.3: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.0.3/32 *[RSVP/7/1] 01:14:16, metric 2
 > to 10.0.12.1 via ge-0/0/1.12, label-switched-path from-SP-PE2-to-SP-PE3

inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

172.16.25.0/24 *[Direct/0] 01:14:58
 > via ge-0/0/0.25
172.16.25.2/32 *[Local/0] 01:14:59
 Local via ge-0/0/0.25
192.168.1.7/32 *[BGP/170] 01:14:35, localpref 100
 AS path: 65100 I, validation-state: unverified
 > to 172.16.25.5 via ge-0/0/0.25, Push 299824
192.168.2.10/32 *[BGP/170] 01:14:16, localpref 100, from 192.168.0.4
 AS path: 65200 I, validation-state: unverified
 > to 10.0.12.1 via ge-0/0/1.12, label-switched-path from-SP-PE2-to-SP-PE3
[...]

admin> show mpls lsp logical-system SP-PE2 ingress detail
Ingress LSP: 1 sessions

192.168.0.3
 From: 192.168.0.2, State: Up, ActiveRoute: 0, LSPname: from-SP-PE2-to-SP-PE3
 ActivePath: (primary)
 LSPtype: Static Configured, Penultimate hop popping
 LoadBalance: Random
 Encoding type: Packet, Switching type: Packet, GPID: IPv4
 *Primary State: Up
 Priorities: 7 0
 SmartOptimizeTimer: 180
 Computed ERO (S [L] denotes strict [loose] hops): (CSPF metric: 2)
 10.0.12.1 S 10.0.13.3 S
 Received RRO (ProtectionFlag 1=Available 2=InUse 4=B/W 8=Node 10=SoftPreempt 20=Node-ID):
 10.0.12.1 10.0.13.3
Total 1 displayed, Up 1, Down 0

-- SP-RR4 --
admin> show route logical-system SP-RR4 table inet.3

inet.3: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.0.2/32 *[RSVP/7/1] 01:18:23, metric 2
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE2
192.168.0.3/32 *[RSVP/7/1] 01:18:22, metric 2
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE3

admin> show route logical-system SP-RR4 table bgp.l3vpn.0

bgp.l3vpn.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

65000:2:192.168.1.7/32
 *[BGP/170] 01:18:47, localpref 100, from 192.168.0.2
 AS path: 65100 I, validation-state: unverified
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE2
65000:3:192.168.2.10/32
 *[BGP/170] 01:18:43, localpref 100, from 192.168.0.3
 AS path: 65200 I, validation-state: unverified
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE3

-- SP-P1 (only inet.0 and mpls.0 routes) --
admin> show route logical-system SP-P1 terse | match routes

inet.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)

mpls.0: 12 destinations, 12 routes (12 active, 0 holddown, 0 hidden)

Service Provider 1

Now let’s have a look at SP1 (SP2 is configured in the same way): again, we have OSPF running on the three nodes and LSPs between CE1-SP-PE7 and CE1-SP-PE5, this time built by LDP instead of RSVP. As we’ve previously said, we need a Label-Switched-Path that spans different Autonomous Systems, so we must find a way for the 192.168.2.10/32 route of the CE2-SP-PE10 loopback to flow through SP2, the inter-vpn on SP0 and then SP1 toward CE1-SP-PE7 with a label attached to it. This is accomplished through two BGP sessions, an MP-eBGP session between SP0 and SP1 and an MP-iBGP session within SP1, as shown on the topology, both with the labeled-unicast feature enabled within the inet address family. This feature tells the router to attach a label to the IPv4 routes that it sends to the BGP neighbor. I’ll show you the route to 192.168.2.10/32 on different routers, with the label(s) attached to it:

-- SP-PE3 inter-vpn.inet.0: route with the label attached by CE2-SP-PE8 due to labeled-unicast --
admin> show route logical-system SP-PE3 table inter-vpn.inet.0 192.168.2.10/32

inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:32:39, localpref 100
 AS path: 65200 I, validation-state: unverified
 > to 172.16.38.8 via lt-0/0/10.38, Push 299824

-- SP-PE2 inter-vpn.inet.0: route with the L3-VPN label attached to the route by SP-PE3 and the RSVP LSP label on top of it to reach SP-PE3 (some lines are omitted for brevity) --
admin> show route logical-system SP-PE2 table inter-vpn.inet.0 192.168.2.10/32 detail

inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
192.168.2.10/32 (1 entry, 1 announced)
 *BGP Preference: 170/-101
 Source: 192.168.0.4
 Next hop: 10.0.12.1 via ge-0/0/1.12, selected
 Label-switched-path from-SP-PE2-to-SP-PE3
 Label operation: Push 299776, Push 299824(top)
 Protocol next hop: 192.168.0.3
 VPN Label: 299776

-- CE1-SP-PE5 inet.0: route with the label attached by SP-PE2 due to labeled-unicast --

admin> show route logical-system CE1-SP-PE5 table inet.0 192.168.2.10/32

inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:32:44, localpref 100
 AS path: 65000 65200 I, validation-state: unverified
 > to 172.16.25.2 via ge-0/0/1.25, Push 299808

-- CE1-SP-PE7 inet.0: route with the label attached by CE1-SP-PE5 due to labeled-unicast and the top label of the LSP toward CE1-SP-PE5 --

admin> show route logical-system CE1-SP-PE7 table inet.0 192.168.2.10/32

inet.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:32:54, localpref 100, from 192.168.1.5
 AS path: 65000 65200 I, validation-state: unverified
 > to 10.1.67.6 via ge-0/0/1.67, Push 299840, Push 299792(top)

I’ve omitted the route as it is seen on CE2-SP-PE8 so far: there we receive a labeled IPv4 BGP route (preference 170) for 192.168.2.10/32 from CE2-SP-PE10, but the same prefix is also learned through OSPF (preference 10), so the installed route has no label. This would cause a problem when CE1-SP-PE7 sends a packet with a vpn-a MPLS label, because when it reaches CE2-SP-PE8 it would be forwarded to CE2-SP-P9 toward CE2-SP-PE10 with only the vpn-a label, which is unknown to CE2-SP-P9. We must force the iBGP labeled route to be installed in the CE2-SP-PE8 inet.0 table. The BGP label itself is an implicit null, as we will see, but having a BGP route instead of an OSPF one forces the use of the LSP toward CE2-SP-PE10 to deliver the packet, thus adding an additional label that will be popped by CE2-SP-P9. So I’ve raised the OSPF preference to 200:

-- CE2-SP-PE8 --

admin> show route logical-system CE2-SP-PE8 table inet.0 192.168.2.10/32

inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:47:34, localpref 100, from 192.168.2.10
 AS path: I, validation-state: unverified
 > to 10.2.89.9 via lt-0/0/10.89, Push 299776
 [OSPF/200] 01:47:37, metric 2
 > to 10.2.89.9 via lt-0/0/10.89

Label 299776 is not the label attached to the route by CE2-SP-PE10, but it is the label used to reach the next-hop 192.168.2.10, resolved via inet.3 table:

admin> show route logical-system CE2-SP-PE8 table inet.3

inet.3: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.9/32 *[LDP/9] 10:12:23, metric 1
 > to 10.2.89.9 via lt-0/0/10.89
192.168.2.10/32 *[LDP/9] 10:12:23, metric 2
 > to 10.2.89.9 via lt-0/0/10.89, Push 299776

CE2-SP-PE10 in fact sends a labeled route to CE2-SP-PE8, but the label is an implicit null (reserved label 3), which means “you do not have to use a label to reach that route, just reach me”:

admin> show route logical-system CE2-SP-PE8 receive-protocol bgp 192.168.2.10 extensive

inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
* 192.168.2.10/32 (2 entries, 2 announced)
 Accepted
 Route Label: 3
 Nexthop: 192.168.2.10
 Localpref: 100
 AS path: I

In fact, a packet received by CE2-SP-PE8 and directed to vpn-a on CE2-SP-PE10 needs only the vpn-a label and, on top of it, a label to reach CE2-SP-PE10 via an LDP LSP. The outer label will then be removed by CE2-SP-P9 due to Penultimate-Hop-Popping, and a packet with only the VPN label will be delivered to CE2-SP-PE10.

VPN-A customer’s L3VPN between SP1 and SP2

As a final step, once we have a bidirectional LSP between CE1-SP-PE7 and CE2-SP-PE10, we can build the vpn-a L3VPN with an MP-eBGP session between those two routers. In order to make vpn-a work, as we’ve explained for inter-vpn, the eBGP next-hop must be reachable in the inet.3 routing table. This is accomplished by adding the resolve-vpn keyword to labeled-unicast, which forces labeled IPv4 routes to be installed also in inet.3. Below are the BGP configuration (internal and external) on CE1-SP-PE7 and the configuration of the routing-instance for vpn-a, where I’ve used vrf-table-label (which forces a lookup in the vpn-a.inet.0 routing table instead of directly sending VPN packets destined to CE-A-11 on the P-t-P interface) and static routing to reach the loopbacks of the connected CEs of Customer A (the CEs in turn have a default route toward their Service Provider):

-- CE1-SP-PE7 --
admin> show configuration logical-systems CE1-SP-PE7 protocols bgp
group SP1-Internal {
 type internal;
 local-address 192.168.1.7;
 family inet {
  labeled-unicast {
   resolve-vpn;
  }
 }
 export export-loopback;
 neighbor 192.168.1.5;
}
group SP-1-2-external {
 type external;
 multihop;
 local-address 192.168.1.7;
 family inet-vpn {
  unicast;
 }
 peer-as 65200;
 neighbor 192.168.2.10;
}

admin> show configuration logical-systems CE1-SP-PE7 routing-instances vpn-a
instance-type vrf;
interface lt-0/0/10.711;
route-distinguisher 65012:100;
vrf-target target:65012:0;
vrf-table-label;
routing-options {
 static {
  route 192.168.4.11/32 next-hop 172.16.18.11;
 }
}

Here is the output of some show commands on the same router, showing the use of 3 labels: label 16 is the VPN label (it is so low due to the vrf-table-label statement), 299840 is the label attached to the 192.168.2.10/32 route sent by CE1-SP-PE5 to CE1-SP-PE7, and 299792 is the LDP label associated with the LSP toward CE1-SP-PE5:

-- CE1-SP-PE7 --
admin> show route logical-system CE1-SP-PE7 table vpn-a.inet.0

vpn-a.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

172.16.18.0/24 *[Direct/0] 00:01:32
 > via lt-0/0/10.711
172.16.18.7/32 *[Local/0] 02:18:07
 Local via lt-0/0/10.711
172.16.22.0/24 *[BGP/170] 00:01:32, localpref 100, from 192.168.2.10
 AS path: 65200 I, validation-state: unverified
 > to 10.1.67.6 via ge-0/0/1.67, Push 16, Push 299840, Push 299792(top)
192.168.4.11/32 *[Static/5] 00:01:32
 > to 172.16.18.11 via lt-0/0/10.711
192.168.4.12/32 *[BGP/170] 00:01:32, localpref 100, from 192.168.2.10
 AS path: 65200 I, validation-state: unverified
 > to 10.1.67.6 via ge-0/0/1.67, Push 16, Push 299840, Push 299792(top)

As you can see, the reception of label 16 forces a second lookup within vpn-a.inet.0 routing table, instead of directly sending the packet toward CE-A-11 on the lt-0/0/10.711 interface of CE1-SP-PE7:

admin> show route logical-system CE1-SP-PE7 table mpls.0 label 16

mpls.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

16 *[VPN/0] 21:14:18
 to table vpn-a.inet.0, Pop

Verifying the connection between CE-A-11 and CE-A-12

Now it is time to verify the connectivity between the two Customer A routers with a traceroute. In order to have information for every hop within the network, each MPLS-enabled router must have icmp-tunneling enabled within the protocols mpls stanza. Otherwise a packet whose Time-To-Live expires would trigger a reply toward the source of the packet, 192.168.4.11 for example, which is completely unknown to all the routers except the SP1 and SP2 ASBRs, i.e. CE1-SP-PE7 and CE2-SP-PE10. Enabling icmp-tunneling forces the router where the packet expired to build an ICMP response that is sent toward the destination, instead of the source, with the original MPLS labels. When it reaches, in our example, CE2-SP-PE10, the router sees within the vpn-a.inet.0 table that the destination is 192.168.4.11 and sends it back toward the source of the traceroute’s UDP packet. I’ve added some info about each label on every hop.

admin> traceroute logical-system CE-A-11 192.168.4.12 source 192.168.4.11
traceroute to 192.168.4.12 (192.168.4.12) from 192.168.4.11, 30 hops max, 40 byte packets

 1 172.16.18.7 (172.16.18.7) 1.237 ms 0.668 ms 0.545 ms

 2 10.1.67.6 (10.1.67.6) 3.833 ms 4.254 ms 3.706 ms
 MPLS Label=299792 CoS=0 TTL=1 S=0 => LDP label to reach CE1-SP-PE5
 MPLS Label=299840 CoS=0 TTL=1 S=0 => iBGP labeled-unicast label for 192.168.2.10 received from CE1-SP-PE5
 MPLS Label=16 CoS=0 TTL=1 S=1 => vpn-a label for 192.168.4.12 received from 192.168.2.10

 3 10.1.56.5 (10.1.56.5) 3.704 ms 4.209 ms 3.687 ms => PHP removes the outermost label
 MPLS Label=299840 CoS=0 TTL=1 S=0
 MPLS Label=16 CoS=0 TTL=2 S=1

 4 172.16.25.2 (172.16.25.2) 3.770 ms 4.368 ms 3.605 ms
 MPLS Label=299792 CoS=0 TTL=1 S=0 => eBGP labeled-unicast label for 192.168.2.10 received from SP-PE2. It replaced label 299840
 MPLS Label=16 CoS=0 TTL=3 S=1

 5 10.0.12.1 (10.0.12.1) 5.300 ms 5.557 ms 3.600 ms
 MPLS Label=299824 CoS=0 TTL=1 S=0 => RSVP label for from-SP-PE2-to-SP-PE3 LSP toward 192.168.0.3
 MPLS Label=299776 CoS=0 TTL=1 S=0 => inter-vpn label for 192.168.2.10 received from 192.168.0.3 through Route-Reflector 192.168.0.4. This replaces label 299792
 MPLS Label=16 CoS=0 TTL=4 S=1

 6 10.0.13.3 (10.0.13.3) 4.809 ms 3.749 ms 4.161 ms => PHP removes the outermost label
 MPLS Label=299776 CoS=0 TTL=1 S=0
 MPLS Label=16 CoS=0 TTL=5 S=1

 7 172.16.38.8 (172.16.38.8) 3.610 ms 4.518 ms 3.661 ms
 MPLS Label=299824 CoS=0 TTL=1 S=0 => eBGP labeled-unicast label for 192.168.2.10 received from CE2-SP-PE8. It replaced label 299776
 MPLS Label=16 CoS=0 TTL=6 S=1

 8 10.2.89.9 (10.2.89.9) 3.633 ms 4.246 ms 3.619 ms => third label not added due to the implicit-null label (3) that CE2-SP-PE10 advertises for its own loopback 192.168.2.10 through the iBGP labeled-unicast session
 MPLS Label=299776 CoS=0 TTL=1 S=0 => LDP label to reach CE2-SP-PE10. It replaced label 299824
 MPLS Label=16 CoS=0 TTL=7 S=1

 9 172.16.22.10 (172.16.22.10) 4.002 ms 4.404 ms 3.705 ms => PHP removed the outermost label. CE2-SP-PE10 received the packet with only label 16, which has been removed for a second lookup within vpn-a.inet.0 due to vrf-table-label

10 192.168.4.12 (192.168.4.12) 4.000 ms 3.831 ms 4.493 ms
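To make the label operations in the traceroute easier to follow, here is a small Python sketch that replays them hop by hop. It is only a toy model, not something that runs on the routers: the label values are transcribed from the traceroute above, the push/swap/pop split per router is my reading of that output, and the name CE1-SP-P6 for the router with ID 6 is assumed by analogy with CE2-SP-P9.

```python
# Label operations per router, transcribed from the traceroute above.
# The stack is a list whose LAST element is the top (outermost) label.
PATH = [
    ("CE1-SP-PE7", [("push", 16), ("push", 299840), ("push", 299792)]),
    ("CE1-SP-P6", [("pop",)]),              # PHP of the LDP label (name assumed)
    ("CE1-SP-PE5", [("swap", 299792)]),     # eBGP labeled-unicast label from SP-PE2
    ("SP-PE2", [("swap", 299776), ("push", 299824)]),  # inter-vpn label + RSVP label
    ("SP-P1", [("pop",)]),                  # PHP of the RSVP label
    ("SP-PE3", [("swap", 299824)]),         # eBGP labeled-unicast label from CE2-SP-PE8
    ("CE2-SP-PE8", [("swap", 299776)]),     # BGP label is implicit null, LDP label used
    ("CE2-SP-P9", [("pop",)]),              # PHP of the LDP label
    ("CE2-SP-PE10", [("pop",)]),            # label 16: lookup in vpn-a.inet.0
]


def apply_op(stack, op):
    """Return a new stack after applying one push/swap/pop operation."""
    kind = op[0]
    if kind == "push":
        return stack + [op[1]]
    if kind == "swap":
        return stack[:-1] + [op[1]]
    if kind == "pop":
        return stack[:-1]
    raise ValueError(f"unknown operation: {kind}")


def walk(path):
    """Return (router, stack sent by that router) for every router in the path."""
    stack = []
    result = []
    for router, ops in path:
        for op in ops:
            stack = apply_op(stack, op)
        result.append((router, list(stack)))
    return result


for router, stack in walk(PATH):
    labels = " / ".join(str(label) for label in reversed(stack)) or "(plain IP)"
    print(f"{router:12} sends: {labels}")
```

The stack never grows beyond three labels, and it reaches that depth twice: when the packet leaves CE1-SP-PE7 and when it leaves SP-PE2.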

Then, I started a packet capture on the ge-0/0/0 interface within EVE-NG and ran the following command on vMX:

admin> ping logical-system CE-A-12 192.168.4.11 source 192.168.4.12 count 1

As you can see from the image below, I’ve captured the same echo request and echo reply packets 4 times in a row:

post6_fig3_captured_packets

Looking at the echo request packets, you can see all the labels we’ve seen in the traceroute output before (look at the vlan-id, it tells you which green link the packet is traversing):

post6_fig4_capture_detail

These are the replies captured on the same interface (I’m showing the replies too because they show the triple-label stack when the packet leaves the L3VPN routing instance in CE1-SP-PE7 and SP-PE2):

post6_fig5_capture_detail_2

vMX Configuration

Below you can find the full configuration of the lab, which you can load on your vMX router. Adjust the IP address of the em0 interface, which is the one connected to the Net cloud attached to the real LAN, if you want to manage vMX via SSH. You can upload the configuration file via SSH (copy and paste via console can give you buffer problems) and then load it with the following commands:

[edit]
admin# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes

admin# load merge carrier-of-carriers_vMX_topology.cfg
load complete

admin# commit

commit complete

User is admin with password admin1, root has password root123.

carrier-of-carriers_vMX_topology.cfg

Conclusion

I hope you’ve read the whole thing: it has been a long post, but I hope you’ve found the topic as interesting as I did. I suggest you read the JNCIS-SP Study Guide available from Juniper Networks to understand a lot of interesting stuff about MPLS on Junos, which can be successfully tested on a vMX platform.

As I usually say, post comments or questions, or even tell me if I’ve made some mistakes: I’ve gone through this stuff for just a few days and there is surely room to improve my skills 🙂


Enabling compression on base-images in Unetlab/EVE-NG Alpha

Introduction

In this small tutorial we’ll see how to compress base images in Unetlab/EVE-NG Alpha. For some detail about where files are stored, have a look at my previous post Modifying base-images with snapshots on Unetlab/EVE-NG Alpha.

Update 2017-01-25: after I asked for some info about QEMU compression on the QEMU users’ mailing list, Alberto Garcia clarified some aspects of how compression works that were not clear to me:

I think there’s some misunderstanding here about compressed images in
QEMU. I’ll try to clarify:

* You create a compressed image with ‘qemu-img convert -c’. That is a
copy of the original image with all the clusters compressed.

* The compression is read-only: QEMU will read the compressed clusters,
  but everything that it writes will be uncompressed (also if you
  rewrite compressed clusters).

* Therefore, there’s no such thing as an image with compression
enabled. In QEMU you don’t compress an image, you compress
individual clusters of data. An image can have a mix of compressed
and uncompressed clusters.

Compressing base images

Suppose we have an already working base image, i.e. an hda.qcow2 virtual hard disk ready to be used. I’ll use the TinyCore Linux image I used in the previous post. Let’s create a new folder and clone the /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 base image, compressing it. This time we must take a full clone, we cannot use snapshots:

root@eve-ng:/# mkdir /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed

root@eve-ng:/# cd /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed# /opt/qemu/bin/qemu-img convert -c -f qcow2 -O qcow2 /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 hda.qcow2

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed# cd ..

root@eve-ng:/opt/unetlab/addons/qemu# ls -l linux-tiny-core-7.2/hda.qcow2
-rw-r--r-- 1 root root 76414976 Nov 11 10:21 linux-tiny-core-7.2/hda.qcow2

root@eve-ng:/opt/unetlab/addons/qemu# ls -l linux-tiny-core-7.2-compressed/hda.qcow2
-rw-r--r-- 1 root root 72351744 Jan 6 14:49 linux-tiny-core-7.2-compressed/hda.qcow2

As you can see, with TinyCore Linux I won’t save much space, but with images such as Radware Alteon you can easily gain 1 GByte or more with compression.
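To put a number on the saving, here is a tiny Python sketch that compares the two file sizes reported by ls above. The function name is just for illustration.

```python
def savings(original_bytes, compressed_bytes):
    """Return (bytes saved, percentage saved) for a compressed copy."""
    saved = original_bytes - compressed_bytes
    return saved, 100.0 * saved / original_bytes


# Sizes of the original and compressed hda.qcow2, taken from the ls output above
saved, percent = savings(76414976, 72351744)
print(f"saved {saved} bytes ({percent:.1f}%)")  # saved 4063232 bytes (5.3%)
```

About 4 MBytes out of 76, which confirms that for TinyCore the gain is marginal.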

Unfortunately, I’ve not found a way to explicitly show that an image has compression enabled. As I wrote in the update at the beginning of the post, the very concept of “enabling compression” with QEMU is faulty: you compress the existing blocks of data of the qcow2 image, but there is no “compression enabled” state, and new data is written uncompressed.

Testing the compressed base image

I’ve added a new node to the lab we’ve used in a previous post, node 4, based on the new compressed image. Everything works as expected, EVE makes a snapshot of the compressed image:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/4# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed/hda.qcow2
Format specific information:
 compat: 1.1
 lazy refcounts: false
 refcount bits: 16
 corrupt: false

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 68M
cluster_size: 65536
Format specific information:
 compat: 1.1
 lazy refcounts: false
 refcount bits: 16
 corrupt: false

Conclusions

This small tutorial shows how you can save some space by compressing base images with QEMU tools. I’ve not done extensive tests on this, but everything seems to work. Any suggestions are welcome, just write a comment below!


Modifying base-images with snapshots on Unetlab/EVE-NG Alpha

Introduction

Unetlab/EVE-NG (Alpha) is a great tool you can use for learning about networking with different platforms (Dynamips routers, IOL, QEMU images). I won’t go into details about how to prepare the environment, you can find a lot of useful information on their site http://www.unetlab.com, but I’ll focus on how the images of a lab’s devices are managed by EVE (let’s use EVE instead of Unetlab/EVE-NG for the rest of the post).

Browsing the temporary files for a lab with a QEMU node

Let’s create a new lab with a single QEMU node, which in my case is a TinyCore Linux image:

post5_fig1_add_tinycore

The base image I’ve chosen is linux-tiny-core-7.2, which you can find on the EVE VM under /opt/unetlab/addons/qemu/linux-tiny-core-7.2:

root@eve-ng:~# ls -l /opt/unetlab/addons/qemu/linux-tiny-core-7.2/
total 74624
-rw-r--r-- 1 root root 76414976 Nov 11 10:21 hda.qcow2

hda.qcow2 is disk1 of the linux box.

What happens when you instantiate this image within your lab and run it or modify its contents? As you can imagine, EVE does not modify the source image, otherwise it would be impossible to manage multiple instances of the same base image. Instead, it creates a QEMU snapshot within the lab’s temporary files. Let’s see it.

First, we must get some info about the lab: press Lab Details on the left menu bar and get the lab ID, which in my case is da2b48f4-d910-4e7d-9645-f952457cbf6d.

Go to the temporary lab folder on EVE (the “0” after /tmp is my pod number, zero as I’m working as admin):

cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/

In this folder you can find a subfolder for each device of the lab. In this example we have only device 1, so enter the 1 subfolder (if multiple nodes are running, you can get the number of a node by right-clicking on it and looking at the number between the brackets after its name) and look at its contents:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d# cd 1

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1# ls -l
total 1284
-rw-r--r-- 1 root unl 1376256 Jan 6 11:43 hda.qcow2
-rw-rw-r-- 1 root unl 0 Jan 6 11:42 wrapper.txt

You can see hda.qcow2 again. Is it a copy? No, that would be a waste of space: it is a snapshot of the base image. Let’s see it:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
[...]

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 73M
cluster_size: 65536
[...]
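If you want to inspect the chain from a script instead of reading the text output, qemu-img can also print JSON (--output=json). The sketch below is just an illustration under some assumptions: the helper names are mine, the qemu-img path is the one used on my EVE VM, and the sample data is a hand-written reduction of the output above that keeps only the fields the code reads.

```python
import json
import subprocess

QEMU_IMG = "/opt/qemu/bin/qemu-img"  # path used on the EVE VM in this post


def read_chain(image_path):
    """Run qemu-img and return the parsed JSON list (top image first)."""
    out = subprocess.run(
        [QEMU_IMG, "info", "--backing-chain", "--output=json", image_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def describe_chain(chain):
    """Return one line per image: a snapshot names its backing file."""
    lines = []
    for depth, image in enumerate(chain):
        backing = image.get("backing-filename")
        role = f"snapshot of {backing}" if backing else "base image"
        lines.append(f"{'  ' * depth}{image['filename']}: {role}")
    return lines


# Hand-written sample mirroring the output above (illustrative, reduced fields)
BASE = "/opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2"
sample = [
    {"filename": "hda.qcow2", "format": "qcow2",
     "virtual-size": 643825664, "backing-filename": BASE},
    {"filename": BASE, "format": "qcow2", "virtual-size": 643825664},
]
print("\n".join(describe_chain(sample)))
```

On the EVE VM you would call describe_chain(read_chain("hda.qcow2")) from within the node folder instead of using the sample.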

So every modification you make to the disk of the instantiated image is confined to the snapshot, and the base image won’t be modified.

Modifying the base image

Some time ago, I felt the need to modify my base Tiny Core installation in order to add the tcpdump package to it, which is useful for troubleshooting. Instead of installing it on every instance of TinyCore within my labs, I wanted to modify the base image, but you must not modify a base image after you’ve used it in at least one lab, unless you want to corrupt all the labs that instantiated it. The quick and dirty solution could be creating a new directory, /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2, and copying the original hda.qcow2 file into it, then modifying the new base image before using it in any lab. With TinyCore it is just a matter of a few MBytes, so it is perfectly fine to make a simple copy. But what if the base image is 2-3 GBytes, such as Radware Alteon or Juniper vMX images? It would be a waste of space, and on my laptop I don’t want to waste it, so let’s try a different approach.

Making a snapshot of the base image

We can use qemu-img to create a new hda.qcow2 which is a snapshot of the original base image. Go into the /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2 folder we’ve used in the previous section and delete the copy of the base image we’ve put there. Let’s create a snapshot of the original base image with qemu-img create and the -b flag, which specifies a backing file for the new image, thus creating a snapshot:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img create -f qcow2 -b /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 hda.qcow2
Formatting 'hda.qcow2', fmt=qcow2 size=643825664 backing_file='/opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 196K
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
[...]

Preparing the new base image – method 1: running QEMU from CLI

Now you have a new base image and you want to prepare it before using it in future labs. Let’s run it from the command line. I won’t go into QEMU details: you can find useful docs on the web, or simply run ps aux on the EVE VM while some nodes are running to see how it starts QEMU nodes. TinyCore can be managed via VNC, so let’s pass -vnc with display number 100 (written as 0.0.0.0:100 in the command below) to make QEMU listen for VNC connections on port 6000 on every interface (6000 is the 5900 base VNC port + 100) for the node we’re running:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-system-x86_64 -m 2048 -hda hda.qcow2 -serial telnet:0.0.0.0:44444,server,nowait -monitor tcp:127.0.0.1:42379,server,nowait -nographic -enable-kvm -vnc 0.0.0.0:100
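The display-to-port arithmetic is easy to get wrong when you run several nodes by hand, so here is a tiny sketch of it. The helper names are mine, and the parser only handles the plain host:display form used in this post (treating an empty host as "all interfaces" is my assumption for illustration); other forms of the -vnc value are not covered.

```python
VNC_BASE_PORT = 5900  # VNC display N is served on TCP port 5900 + N


def vnc_port(display):
    """TCP port for a given VNC display number."""
    return VNC_BASE_PORT + display


def parse_vnc_option(value):
    """Split a value such as '0.0.0.0:100' or ':100' into (host, tcp_port)."""
    host, _, display = value.rpartition(":")
    return (host or "0.0.0.0", vnc_port(int(display)))


print(parse_vnc_option("0.0.0.0:100"))
```

So with display 100 you point your VNC client at port 6000 of the EVE VM.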

In the command above you can also see how to enable a serial connection through telnet, in case your image has console access enabled. If you need to reach the image over IP, for example to transfer some content onto it with SCP, you can proceed this way: suppose the EVE VM has pnet0 with eth0 connected in bridge mode to your own network. Configure eth0 on TinyCore (or your own image) with an IP address compatible with your network through the console or VNC connection, or let it get an IP from your LAN DHCP, and then shut down the image (if you’re using TinyCore, remember to make the changes persistent, otherwise you’ll lose them after reboot). Now prepare a virtual interface to put into the pnet0 bridge, to which we will connect TinyCore. The commands below assume that tmp_iface already exists as a tap device; if it does not, one way to create it should be ip tuntap add dev tmp_iface mode tap:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# ifconfig tmp_iface up

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# brctl addif pnet0 tmp_iface

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# brctl show pnet0
bridge name   bridge id           STP enabled   interfaces
pnet0         8000.000c29baeb65   no            eth0
                                                tmp_iface

Start QEMU again, mapping TinyCore eth0 onto the tmp_iface we’ve just prepared:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-system-x86_64 -m 2048 -hda hda.qcow2 -serial telnet:0.0.0.0:44444,server,nowait -monitor tcp:127.0.0.1:42379,server,nowait -nographic -enable-kvm -vnc 0.0.0.0:100 -device virtio-net-pci,netdev=net0,mac=50:01:00:04:00:00 -netdev tap,id=net0,ifname=tmp_iface,script=no

Now you can reach your QEMU node via SSH/SCP and you can do whatever you want in order to prepare the new base image.

After some modifications, let’s see the status of the -v2 base image:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img info hda.qcow2 | grep "disk size"
disk size: 1.4M

Before my modifications the disk size of the image was 196K; now it has grown to 1.4M. The original base image has not been modified.

Preparing the new base image – method 2: running a new node from GUI

If you don’t want to go through the steps from CLI I’ve explained above, you can instantiate a new node based on -v2 image and run it from the GUI:

post5_fig2_add_tinycore-v2.png

Create a network object mapped on pnet0 and connect it to eth0:

post5_fig3_add_network_pnet0

post5_fig4_connect_v2-to-pnet0

Start the node based on the new -v2 base image and modify it as you want. As we’ve already said, these modifications won’t go onto the -v2 image we’ve prepared within /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2. Since I’ve added this new node to the same lab as the node based on the original base image, it is node 2 of that lab, so let’s move into its temporary folder on the EVE VM and have a look at the snapshot image:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# cd /

root@eve-ng:/# cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# ls -l
total 1288
-rw-r--r-- 1 root unl 1376256 Jan 6 13:52 hda.qcow2
-rw-rw-r-- 1 root unl 118 Jan 6 13:52 wrapper.txt

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]

What can we do now to put the changes of this snapshot back into the -v2 base image? Let’s use the QEMU tools to accomplish this task, but first shut down the node from the GUI (don’t delete the node now!):

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img commit hda.qcow2
Image committed.

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 260K
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]

The commit command tells QEMU to commit the changes we’ve made to the instantiated node’s disk, /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2, into its base image, /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2.

The new node we’ve added in the EVE GUI is still valid, since the operation above did not corrupt its snapshot, but we can now delete it: if you add a new node based on the -v2 base image, you will find the modifications you’ve merged into the base image with the commit command.

A Snapshot Chain: what happened behind the scenes

As you’ve probably already understood, when you instantiate a new node within the GUI based on -v2 base image, you’re creating a snapshot of a snapshot. This is our snapshots tree:

/opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
|
 \==> /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
|     |
|     \==> /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2
|
 \==> /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1/hda.qcow2

Let’s see the chain of snapshots for node 2 disk:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.4M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
[...]

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 73M
cluster_size: 65536
[...]

What you must remember is that modifying one of the images in the snapshot tree invalidates all the snapshots below it in the tree. For example, committing the changes made on the /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2 base image forces a modification of the original base image, which invalidates the node 1 disk /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1/hda.qcow2 but does not damage the node 2 disk /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2. If you have doubts before making such important changes, take a snapshot of your EVE VM with the tools offered by your virtualization environment (such as VMware Fusion or VirtualBox), so that you have a quick rollback in case your labs get damaged.
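To make the rule easier to apply before you type qemu-img commit, here is a small sketch that models the tree of this post as a dictionary and lists which images are put at risk by a commit. It is only a bookkeeping aid built on the rule as I stated it above; the function names are mine and nothing here touches the real files.

```python
BASE = "/opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2"
V2 = "/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2"
NODE1 = "/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1/hda.qcow2"
NODE2 = "/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2"

# overlay image -> its backing file
BACKING_FILE = {V2: BASE, NODE1: BASE, NODE2: V2}


def descendants(image, backing_file):
    """All images that depend, directly or indirectly, on `image`."""
    found = set()
    frontier = [image]
    while frontier:
        current = frontier.pop()
        for overlay, backing in backing_file.items():
            if backing == current and overlay not in found:
                found.add(overlay)
                frontier.append(overlay)
    return found


def at_risk_after_commit(overlay, backing_file):
    """Images invalidated when `overlay` is committed into its backing file.

    The commit rewrites the backing file, so everything built on it is at
    risk, except the committed overlay itself and what sits on top of it.
    """
    target = backing_file[overlay]
    return (descendants(target, backing_file)
            - {overlay}
            - descendants(overlay, backing_file))


print(at_risk_after_commit(V2, BACKING_FILE))     # only the node 1 disk
print(at_risk_after_commit(NODE2, BACKING_FILE))  # empty set
```

Committing node 2 into -v2, as we did above, puts nothing at risk only because no other node was built on -v2 at that point.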

Conclusions

In this small tutorial you’ve seen how you can use the QEMU tools to manage snapshots of base images in Unetlab/EVE-NG, in order to set up new base images on top of older ones without wasting space on full clones. As EVE-NG is being developed (it is now in Alpha stage), a lot of this could be integrated within the GUI, but I hope it can be useful to someone experimenting with it now 🙂

In the next tutorial I’ll show you how to enable compression on base images, so let’s move on with Enabling compression on base-images in Unetlab/EVE-NG Alpha!


Loop-Free Alternate on JunOS

Introduction: The Topology

A lot of time has passed since my last post on this blog, and I’m back with something about an interesting feature, LFA, i.e. Loop-Free Alternate, this time implemented not on Cisco routers but on Juniper ones. I’ve used the following topology, implemented with logical systems, i.e. virtual routers inside a JunOS vMX instance.

post4_fig1

We have 4 main core routers, R3-R4-R5-R6, forming a square, in OSPF area 0.0.0.0, along with two other routers, R7 and R9. OSPF link costs are shown in the image.

Routers R1 and R8 are connected to R3 and R6, respectively, in a routing-instance of type vrf with route-target 1000:1, thus forming a L3-VPN.

I’ve not put IPs on the topology diagram but IP assignment is simple and follows these rules:

  • Loopback Addresses: they take the form 100.0.0.x, with x the number of the router. So R5 has 100.0.0.5 as loopback address
  • P-t-P link Subnet: even if a /30 would be sufficient, we use a /24 network, which allows us to use the scheme 10.0.xy.0/24, where x is the number of the lower-numbered router connected to the P-t-P link and y is the number of the higher-numbered router connected to the P-t-P link. So the link between R3 and R5 has the network 10.0.35.0/24. OSPF configuration forces P-t-P link type.
  • P-t-P link interface address: on a P-t-P link 10.0.xy.0/24 router Rx has address 10.0.xy.x and router Ry has address 10.0.xy.y

Since these routers are logical-systems, they are connected to each other through logical units under the logical-tunnel interface lt-0/0/10 with the following rule:

  • Rx is connected to Ry with lt-0/0/10.xy paired with lt-0/0/10.yx, which is the other end of the tunnel that connects Ry with Rx

So, for example, R3 is connected with R1 with the following interface and IP address:

interfaces {
  lt-0/0/10 {
    unit 31 {
      encapsulation ethernet;
      peer-unit 13;
      family inet {
        address 10.0.13.3/24;
      }
      family mpls;
    }
  }
}
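If you prefer to generate these values instead of computing them by hand, the numbering rules above can be sketched in a few lines. This is just my restatement of the conventions of this lab (the function names are hypothetical), and it only covers single-digit router numbers, as used here.

```python
def loopback(router):
    """Loopback address of router Rx: 100.0.0.x."""
    return f"100.0.0.{router}"


def link_end(local, remote):
    """Interface data on router `local` for its P-t-P link toward `remote`."""
    if local == remote or not (1 <= local <= 9 and 1 <= remote <= 9):
        raise ValueError("scheme only covers distinct single-digit router numbers")
    low, high = sorted((local, remote))
    return {
        # unit is <local><remote>, the peer unit is the mirror image
        "interface": f"lt-0/0/10.{local}{remote}",
        "peer_unit": int(f"{remote}{local}"),
        # the third octet is always <lower><higher>, whichever end we are on
        "subnet": f"10.0.{low}{high}.0/24",
        "address": f"10.0.{low}{high}.{local}/24",
    }


print(loopback(5))
print(link_end(3, 1))
```

Calling link_end(3, 1) gives unit 31, peer unit 13 and address 10.0.13.3/24, which is exactly the R3 configuration shown above.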

MPLS, RSVP and LDP are enabled on all the interfaces of the routers within OSPF area 0.0.0.0.

Finally, core routers have load-balance per-packet (which is per-flow, despite the name) applied on the forwarding table, so equal-cost paths are load-balanced when taking forwarding decisions.

The Feature: Loop-Free Alternate

In this post we focus on router R3 and think about what we can do to improve convergence time in the square topology of core routers when R3’s WAN link fails and R3 needs to reach the routers on the other side. Standard OSPF converges quite fast after a link failure, but there is still an interruption in traffic flow across the WAN for R3 if R3-R5 fails. We will see why R3 cannot immediately start forwarding to R4, and how we can make it immediately forward through that neighbor the traffic destined to R5 or R7 without causing loops. LFA is the feature we are interested in.

Simple OSPF deployment

Let’s have a look at the routing table of R3, focusing on loopback addresses with a simple OSPF deployment:

admin> show route logical-system R3 table inet.0 100.0.0.0/24

inet.0: 15 destinations, 15 routes (15 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.0.0.3/32 *[Direct/0] 01:55:14
             > via lo0.3
100.0.0.4/32 *[OSPF/10] 01:54:16, metric 1
             > to 10.0.34.4 via lt-0/0/10.34
100.0.0.5/32 *[OSPF/10] 01:24:38, metric 10
             > to 10.0.35.5 via lt-0/0/10.35
100.0.0.6/32 *[OSPF/10] 01:24:38, metric 11
             > to 10.0.34.4 via lt-0/0/10.34
               to 10.0.35.5 via lt-0/0/10.35
100.0.0.7/32 *[OSPF/10] 01:24:38, metric 11
             > to 10.0.35.5 via lt-0/0/10.35
100.0.0.9/32 *[OSPF/10] 01:24:38, metric 12
               to 10.0.34.4 via lt-0/0/10.34
             > to 10.0.35.5 via lt-0/0/10.35

As you can see, R3 can reach R6 (and, obviously, R9) through two equal-cost paths, via its neighbors R4 and R5. So if R3-R4 goes down, R3 can immediately redirect to neighbor R5 the traffic for R6 (and R9) that it was previously forwarding to R4, because, being attached to that link, it knows at once that R3-R4 is down. If the link that breaks is the WAN link, i.e. R3-R5, R3 cannot immediately forward traffic destined to R5 (and R7) to neighbor R4, because until SPF runs and the network converges that traffic could loop. In fact, R4 has two equal-cost paths to R5, through R3 and through R6, so if it receives traffic destined to R5 (and R7) from R3, it could send it back to R3 until it learns that R3-R5 is down and that its only path to R5 is through R6. So, if the WAN link between R3 and R5 goes down, we must wait for the network to converge before traffic from R3 to R5 starts flowing again toward the destination.
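The equal-cost reasoning above can be checked with a small SPF sketch. The link costs in it are not copied from the diagram: I have inferred them from the metrics in the command outputs of this post (for example R3-R4 = 1, R3-R5 = 10), so treat them as an assumption. The code is a plain Dijkstra that also keeps track of every equal-cost first hop; it is not how Junos implements SPF.

```python
import heapq

# (router, router) -> OSPF cost, inferred from the metrics shown in this post
LINKS = {
    ("R3", "R4"): 1, ("R3", "R5"): 10, ("R4", "R6"): 10,
    ("R5", "R6"): 1, ("R5", "R7"): 1, ("R6", "R9"): 1,
}


def build_graph(links):
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def spf(graph, source):
    """Return {node: (distance, set of equal-cost first-hop neighbors)}."""
    dist = {source: 0}
    first_hops = {source: set()}
    heap = [(0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for neigh, cost in graph[node].items():
            nd = d + cost
            # Leaving the source, the first hop is the neighbor itself;
            # further away we inherit the first hops of the current node.
            hops = {neigh} if node == source else first_hops[node]
            if neigh not in dist or nd < dist[neigh]:
                dist[neigh] = nd
                first_hops[neigh] = set(hops)
                heapq.heappush(heap, (nd, neigh))
            elif nd == dist[neigh]:
                first_hops[neigh] |= hops
    return {n: (dist[n], first_hops[n]) for n in dist}


GRAPH = build_graph(LINKS)
print("R3 -> R6:", spf(GRAPH, "R3")["R6"])
print("R4 -> R5:", spf(GRAPH, "R4")["R5"])
```

The first line shows the two first hops R4 and R5 at metric 11, matching the routing table; the second shows why R4 is dangerous as an alternate toward R5: one of its own equal-cost first hops is R3 itself.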

OSPF Link Protection

We can enable link-protection on OSPF links to make OSPF compute alternate backup paths that it can use as soon as an OSPF link fails. This is the explanation of the feature taken from the juniper.net site:

You can configure link protection for any interface for which OSPF is enabled. When you enable link protection, Junos OS creates an alternate path to the primary next hop for all destination routes that traverse a protected interface. Use link protection when you assume that only a single link might become unavailable but that the neighboring node would still be available through another interface.

We can query the OSPF SPF data to see what R3 knows about destination R6:

admin> show ospf backup spf detail logical-system R3 100.0.0.6
Topology default results:

Area 0.0.0.0 results:

100.0.0.6
 Self to Destination Metric: 11
 Parent Node: 100.0.0.5
 Parent Node: 10.0.46.6
 Primary next-hop: lt-0/0/10.34 via 10.0.34.4
 Primary next-hop: lt-0/0/10.35 via 10.0.35.5
 Backup Neighbor: 100.0.0.5
  Neighbor to Destination Metric: 1, Neighbor to Self Metric: 10
  Self to Neighbor Metric: 10, Backup preference: 0x0
  Not evaluated, Reason: Primary next-hop multipath
 Backup Neighbor: 100.0.0.4
  Neighbor to Destination Metric: 10, Neighbor to Self Metric: 1
  Self to Neighbor Metric: 1, Backup preference: 0x0
  Not evaluated, Reason: Primary next-hop multipath

The output says that both paths are already primary paths and so link-protection does not add anything, as I’ve already said, to network performance in case of link failure.

Let’s look at destination R7:

admin> show ospf backup spf detail logical-system R3 100.0.0.7
Topology default results:

Area 0.0.0.0 results:

100.0.0.7
 Self to Destination Metric: 11
 Parent Node: 100.0.0.5
 Primary next-hop: lt-0/0/10.35 via 10.0.35.5
 Backup Neighbor: 100.0.0.5
  Neighbor to Destination Metric: 1, Neighbor to Self Metric: 10
  Self to Neighbor Metric: 10, Backup preference: 0x0
  Not eligible, Reason: Primary next-hop link fate sharing
 Backup Neighbor: 100.0.0.4
  Neighbor to Destination Metric: 12, Neighbor to Self Metric: 1
  Self to Neighbor Metric: 1, Backup preference: 0x0
  Track Item: 100.0.0.3
  Track Item: 100.0.0.5
  Not eligible, Reason: Path loops

In this case, the first backup neighbor is the same neighbor through which we reach 100.0.0.7, i.e. 100.0.0.5 (Primary next-hop link fate sharing, which means that if primary next-hop fails, also that backup neighbor will not be usable). The second backup neighbor, R4, could be an alternate path, but as we discussed before, it would cause a traffic loop (Path loops).
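The three metrics printed for each backup neighbor are exactly what the basic loop-free condition of RFC 5286 needs: a neighbor N is a loop-free alternate for destination D when distance(N, D) < distance(N, S) + distance(S, D), where S is the router doing the computation. I am assuming that the "Path loops" verdict corresponds to this inequality failing; the numbers in the output above agree with that. A minimal sketch (the function name is mine):

```python
def is_loop_free(neighbor_to_dest, neighbor_to_self, self_to_dest):
    """Basic loop-free inequality: the neighbor's best path must not come back through us."""
    return neighbor_to_dest < neighbor_to_self + self_to_dest


SELF_TO_R7 = 11  # "Self to Destination Metric" for 100.0.0.7

# R4: Neighbor to Destination 12, Neighbor to Self 1 -> 12 < 1 + 11 is false
print("R4 loop-free for R7:", is_loop_free(12, 1, SELF_TO_R7))

# R5: Neighbor to Destination 1, Neighbor to Self 10 -> 1 < 10 + 11 is true
print("R5 loop-free for R7:", is_loop_free(1, 10, SELF_TO_R7))
```

R5 passes the inequality, but it is discarded for a different reason: it sits behind the very link we are trying to protect (link fate sharing). R4 fails by a tie, 12 against 12, which is the equal-cost situation described earlier.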

So, with link-protection OSPF can work out in advance what EIGRP terminology calls feasible successors, i.e. alternate next hops to which we can immediately send the traffic that was flowing on a failed link that was the best path toward a destination. Again, because of our topology this adds almost nothing to our network, so let’s move to the next step to decrease traffic disruption in case of a failure of the R3-R5 link.

RSVP-TE tunnel from R3 to R6

How can we help R3 forward traffic destined to R5 and R7 through the R4->R6->R5 path without causing a loop? We must use something that prevents R4 from doing a lookup on the traffic destination, otherwise it could route the traffic back to R3 until it learns about the link failure. We can deploy an RSVP-TE tunnel from R3 to R6 that goes through R4: in this way, R3 encapsulates traffic behind an MPLS label, which R4 handles (it pops it, due to PHP, i.e. Penultimate-Hop Popping) without looking at the contents of the forwarded packets; the packets reach R6, the end of the tunnel, and are forwarded to R5 without causing loops.

Let’s define the RSVP-TE tunnel on R3:

protocols {
  mpls {
    icmp-tunneling;
    label-switched-path to-R6 {
      backup;
      to 100.0.0.6;
      primary loose-R4;
    }
    path loose-R4 {
      100.0.0.4 loose;
    }
    interface all;
  }
}

This is the defined path:

admin> show mpls lsp logical-system R3 detail
Ingress LSP: 1 sessions

100.0.0.6
 From: 100.0.0.3, State: Up, ActiveRoute: 0, LSPname: to-R6
 ActivePath: loose-R4 (primary)
 LSPtype: Static Configured, Penultimate hop popping
 LoadBalance: Random
 Encoding type: Packet, Switching type: Packet, GPID: IPv4
 *Primary loose-R4 State: Up
 Priorities: 7 0
 SmartOptimizeTimer: 180
 Computed ERO (S [L] denotes strict [loose] hops): (CSPF metric: 11)
 10.0.34.4 S 10.0.46.6 S
 Received RRO (ProtectionFlag 1=Available 2=InUse 4=B/W 8=Node 10=SoftPreempt 20=Node-ID):
 10.0.34.4 10.0.46.6
Total 1 displayed, Up 1, Down 0

As the ERO (Explicit Route Object) says, the path follows the R3-R4 and R4-R6 links. As you can see in the LSP definition, we added the keyword backup to let OSPF use that RSVP tunnel for alternate path computations. Let’s look again at SPF backup coverage for destination 100.0.0.7:

admin> show ospf backup spf detail logical-system R3 100.0.0.7
Topology default results:

Area 0.0.0.0 results:

100.0.0.7
 Self to Destination Metric: 11
 Parent Node: 100.0.0.5
 Primary next-hop: lt-0/0/10.35 via 10.0.35.5
 Backup next-hop: to-R6
 Backup Neighbor: 100.0.0.6 (LSP endpoint)
  Neighbor to Destination Metric: 2, Neighbor to Self Metric: 11
  Self to Neighbor Metric: 11, Backup preference: 0x0
  Track Item: 100.0.0.5
  Eligible, Reason: Contributes backup next-hop
 Backup Neighbor: 100.0.0.5
  Neighbor to Destination Metric: 1, Neighbor to Self Metric: 10
  Self to Neighbor Metric: 10, Backup preference: 0x0
  Not evaluated, Reason: Interface is already covered
 Backup Neighbor: 100.0.0.4
  Neighbor to Destination Metric: 12, Neighbor to Self Metric: 1
  Self to Neighbor Metric: 1, Backup preference: 0x0
  Track Item: 100.0.0.3
  Track Item: 100.0.0.5
  Not evaluated, Reason: Interface is already covered

As the output above shows, OSPF now knows that if the primary next-hop on lt-0/0/10.35 fails, due to a failure of the R3-R5 link, it can immediately encapsulate traffic within the RSVP tunnel and forward it on the R3-R4 link. The R3 routing table now also lists alternate next-hops for 100.0.0.5 and 100.0.0.7:

admin> show route logical-system R3 table inet.0 100.0.0.0/24

inet.0: 15 destinations, 15 routes (15 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.0.0.3/32 *[Direct/0] 04:00:44
             > via lo0.3
100.0.0.4/32 *[OSPF/10] 03:59:46, metric 1
             > to 10.0.34.4 via lt-0/0/10.34
100.0.0.5/32 *[OSPF/10] 00:08:29, metric 10
             > to 10.0.35.5 via lt-0/0/10.35
               to 10.0.34.4 via lt-0/0/10.34, label-switched-path to-R6
100.0.0.6/32 *[OSPF/10] 00:47:34, metric 11
               to 10.0.34.4 via lt-0/0/10.34
             > to 10.0.35.5 via lt-0/0/10.35
100.0.0.7/32 *[OSPF/10] 00:08:29, metric 11
             > to 10.0.35.5 via lt-0/0/10.35
               to 10.0.34.4 via lt-0/0/10.34, label-switched-path to-R6
100.0.0.9/32 *[OSPF/10] 00:47:34, metric 12
             > to 10.0.34.4 via lt-0/0/10.34
               to 10.0.35.5 via lt-0/0/10.35

We focused on R3, but the same approach should be taken also on R4-R5-R6, defining link-protection and an RSVP-TE tunnel to the not-directly connected router on the opposite vertex of the square on each core router.

Now you could be asking yourself “why did he also implement an L3-VPN in the topology?” We will see it in the next section.

RSVP-TE tunnel and load-balancing vpn-a traffic

Let’s disable the RSVP-TE tunnel from R3 to R6 for a moment and have a look at the vpn-a.inet.0 routing table on R3 for the 100.0.0.8 destination, i.e. CE router R8, connected to R6 (some lines are omitted for brevity):

admin> show route logical-system R3 table vpn-a.inet.0 100.0.0.8/32 detail

vpn-a.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
100.0.0.8/32 (1 entry, 1 announced)
 *BGP Preference: 170/-101
 Route Distinguisher: 1000:1000
 Next hop type: Indirect
 Next hop type: Router, Next hop index: 1048592
 Next hop: 10.0.34.4 via lt-0/0/10.34 weight 0x1
 Label operation: Push 299808, Push 299872(top)
 Session Id: 0x100002
 Next hop: 10.0.35.5 via lt-0/0/10.35 weight 0x1, selected
 Label operation: Push 299808, Push 299776(top)
 Session Id: 0x100003
 Protocol next hop: 100.0.0.6
 Label operation: Push 299808
 AS path: 65080 I
 Communities: target:1000:1
 VPN Label: 299808
 Localpref: 100
 Router ID: 100.0.0.6
 Primary Routing Table bgp.l3vpn.0

If you remember from the beginning of the post, R6 can be reached through two equal-cost paths, and the same holds for the vpn-a prefixes it announces with MultiProtocol-BGP (MP-BGP). In this case, R3 can forward traffic from R1 to R8 on the R3-R4 link or on the R3-R5 link, by pushing one of two different MPLS labels (299872 or 299776, respectively) on top of the label stack, above VPN label 299808, which is used to reach the vrf vpn-a on R6.

Let’s examine how MP-BGP works: R3 receives an announcement from R6 that says “you can reach the 100.0.0.8 prefix through me”, so the protocol next-hop for that prefix is 100.0.0.6. L3-VPN protocol next-hops are resolved by looking at the inet.3 table, which lists the prefixes learned through LDP neighbors:

admin> show route logical-system R3 table inet.3 100.0.0.6/32

inet.3: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.0.0.6/32 *[LDP/9] 00:28:58, metric 11
             > to 10.0.34.4 via lt-0/0/10.34, Push 299872
               to 10.0.35.5 via lt-0/0/10.35, Push 299776

As you can see, when R3 resolves 100.0.0.6 in inet.3, it finds two next hops, with the two MPLS labels we’ve already seen in the MP-BGP route to 100.0.0.8.
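To see what "Push 299808, Push 299872(top)" means on the wire, here is a small sketch that encodes and decodes a two-label stack using the 32-bit entry layout of RFC 3032 (20-bit label, 3-bit traffic class, 1 bottom-of-stack bit, 8-bit TTL). The function names and the TTL value are mine, chosen for illustration; this is not taken from any Junos code.

```python
import struct


def encode_entry(label, tc, bottom, ttl):
    """Pack one MPLS label stack entry into 4 bytes (network byte order)."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    word = (label << 12) | ((tc & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def build_stack(vpn_label, transport_label, ttl=64):
    """Transport label on top (first on the wire), VPN label at the bottom."""
    return (encode_entry(transport_label, 0, False, ttl)
            + encode_entry(vpn_label, 0, True, ttl))


def decode_stack(data):
    entries = []
    for offset in range(0, len(data), 4):
        (word,) = struct.unpack_from("!I", data, offset)
        entries.append({
            "label": word >> 12,
            "tc": (word >> 9) & 0x7,
            "s": (word >> 8) & 0x1,
            "ttl": word & 0xFF,
        })
    return entries


# Stack R3 pushes toward R4: LDP label 299872 above VPN label 299808
for entry in decode_stack(build_stack(299808, 299872, ttl=1)):
    print(entry)
```

Only the VPN label carries S=1 (bottom of stack); the transport label on top of it has S=0, and it is the only one the core routers look at.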

What happens if we enable the RSVP-TE tunnel again?

admin> show route logical-system R3 table inet.3 100.0.0.6/32

inet.3: 5 destinations, 6 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.0.0.6/32 *[RSVP/7/1] 00:00:01, metric 11
             > to 10.0.34.4 via lt-0/0/10.34, label-switched-path to-R6
             [LDP/9] 00:32:55, metric 11
             > to 10.0.34.4 via lt-0/0/10.34, Push 299872
               to 10.0.35.5 via lt-0/0/10.35, Push 299776

admin> show route logical-system R3 table vpn-a.inet.0 100.0.0.8/32 detail

vpn-a.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
100.0.0.8/32 (1 entry, 1 announced)
 *BGP Preference: 170/-101
 Route Distinguisher: 1000:1000
 Next hop type: Indirect
 Next hop type: Router, Next hop index: 980
 Next hop: 10.0.34.4 via lt-0/0/10.34, selected
 Label-switched-path to-R6
 Label operation: Push 299808, Push 299968(top)
 Session Id: 0x100002
 Protocol next hop: 100.0.0.6
 Label operation: Push 299808
 AS path: 65080 I
 Communities: target:1000:1
 VPN Label: 299808
 Localpref: 100
 Router ID: 100.0.0.6
 Primary Routing Table bgp.l3vpn.0

As you can see, the 100.0.0.6/32 prefix in inet.3 can now also be reached through the RSVP-TE tunnel, and because RSVP has a lower (i.e. better) preference, the RSVP route is preferred over the LDP one. This causes vpn-a traffic from R1 to R8 to flow only on the R3-R4-R6 path. Top label 299968 is the label RSVP negotiated for the to-R6 path, as we can see on R4, the transit node for that path:

admin> show mpls lsp logical-system R4 transit
Transit LSP: 1 sessions
To        From      State Rt Style Labelin Labelout LSPname
100.0.0.6 100.0.0.3 Up    0   1 FF  299968        3 to-R6
Total 1 displayed, Up 1, Down 0

(Labelout is 3, i.e. implicit null, due to penultimate-hop popping)

We can simply re-enable load-balancing by raising RSVP preference from 7 to 10 (with preference 10 statement within label-switched-path to-R6 definition), which is above LDP preference of 9:

admin> show route logical-system R3 table inet.3 100.0.0.6/32

inet.3: 5 destinations, 6 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.0.0.6/32 *[LDP/9] 00:47:34, metric 11
              > to 10.0.34.4 via lt-0/0/10.34, Push 299872
                to 10.0.35.5 via lt-0/0/10.35, Push 299776
              [RSVP/10/1] 00:00:23, metric 11
              > to 10.0.34.4 via lt-0/0/10.34, label-switched-path to-R6

admin> show route logical-system R3 table vpn-a.inet.0 100.0.0.8/32

vpn-a.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.0.0.8/32 *[BGP/170] 00:00:03, localpref 100, from 100.0.0.6
             AS path: 65080 I, validation-state: unverified
               to 10.0.34.4 via lt-0/0/10.34, Push 299808, Push 299872(top)
             > to 10.0.35.5 via lt-0/0/10.35, Push 299808, Push 299776(top)
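The selection logic at work here can be reduced to "the route with the lowest preference value becomes active". The sketch below models only that first comparison, with the preference values seen in the outputs above; the real Junos route selection has further tie-breakers that I am deliberately leaving out, and the function name is mine.

```python
def active_route(candidates):
    """Return the (protocol, preference) pair with the lowest preference value."""
    return min(candidates, key=lambda candidate: candidate[1])


default_prefs = [("RSVP", 7), ("LDP", 9)]   # values shown as [RSVP/7/1] and [LDP/9]
raised_prefs = [("RSVP", 10), ("LDP", 9)]   # after "preference 10" on the LSP

print(active_route(default_prefs)[0])  # RSVP
print(active_route(raised_prefs)[0])   # LDP
```

Once LDP is the active route again, its two equal-cost next hops come back into play for vpn-a traffic, while the RSVP tunnel stays available as the OSPF backup path.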

Statistics about OSPF backup coverage

OSPF backup coverage can be queried on R3 in order to see how well our network is being protected:

admin> show ospf backup coverage logical-system R3
Topology default coverage:

Node Coverage:

 Area    Covered  Total  Percent
           Nodes  Nodes  Covered
0.0.0.0        4      5   80.00%

Route Coverage:

Path Type Covered  Total Percent
           Routes Routes Covered
    Intra       9     12  75.00%
    Inter       0      0 100.00%
     Ext1       0      0 100.00%
     Ext2       0      0 100.00%
      All       9     12  75.00%

admin> show ospf route no-backup-coverage logical-system R3
Topology default Route Table:

Prefix       Path  Route   NH   Metric  NextHop       Nexthop
             Type  Type    Type         Interface     Address/LSP
100.0.0.4    Intra Router  IP        1  lt-0/0/10.34  10.0.34.4
10.0.34.0/24 Intra Network IP        1  lt-0/0/10.34
100.0.0.3/32 Intra Network IP        0  lo0.3
100.0.0.4/32 Intra Network IP        1  lt-0/0/10.34  10.0.34.4

admin> show ospf backup spf no-coverage logical-system R3
Topology default results:

Area 0.0.0.0 results:

100.0.0.4
 Self to Destination Metric: 1
 Parent Node: 100.0.0.3
 Primary next-hop: lt-0/0/10.34 via 10.0.34.4
 Backup Neighbor: 100.0.0.6 (LSP endpoint)
  Neighbor to Destination Metric: 10, Neighbor to Self Metric: 11
  Self to Neighbor Metric: 11, Backup preference: 0x0
  Not eligible, Reason: Missing primary next-hop
 Backup Neighbor: 100.0.0.4
  Neighbor to Destination Metric: 0, Neighbor to Self Metric: 1
  Self to Neighbor Metric: 1, Backup preference: 0x0
  Not eligible, Reason: Primary next-hop link fate sharing
 Backup Neighbor: 100.0.0.5
  Neighbor to Destination Metric: 11, Neighbor to Self Metric: 10
  Self to Neighbor Metric: 10, Backup preference: 0x0
  Not eligible, Reason: Path loops

As you can see, the connection between R3 and R4 does not have an alternate backup next-hop, so it’s considered uncovered. We could add an RSVP-TE tunnel R3-R5-R6 to protect that link too and get coverage for the 100.0.0.4/32 prefix, but in our topology we consider the WAN link more prone to failures (the R3-R4 link could be a robust LAG, or we may not want traffic from R3 to R4 to flow over the two WAN links if R3-R4 fails).
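The percentages in the coverage tables are simple ratios. A quick sketch to reproduce them, including the 0-of-0 rows that the output reports as 100.00% (treating an empty category as fully covered is my reading of that output, not something I have found documented):

```python
def coverage_percent(covered, total):
    """Percentage of covered items; an empty category counts as fully covered."""
    if total == 0:
        return 100.0
    return 100.0 * covered / total


print(f"nodes: {coverage_percent(4, 5):.2f}%")
print(f"intra: {coverage_percent(9, 12):.2f}%")
print(f"inter: {coverage_percent(0, 0):.2f}%")
```

The one uncovered node out of five is R4, reachable only over the unprotected R3-R4 link.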

OT about icmp-tunneling

Let’s do a small Off-Topic journey about icmp-tunneling feature of MPLS in JunOS, in case you’ve noticed it in one of the code snippets above.

We enabled icmp-tunneling within mpls stanza of our logical-systems in order to let you traceroute traffic over the mpls network. Without icmp-tunneling the following output could not be seen:

admin> traceroute logical-system R1 source 100.0.0.1 100.0.0.8
traceroute to 100.0.0.8 (100.0.0.8) from 100.0.0.1, 30 hops max, 40 byte packets
 1 10.0.13.3 (10.0.13.3) 6.606 ms 1.935 ms 0.944 ms
 2 10.0.35.5 (10.0.35.5) 5.271 ms 10.0.34.4 (10.0.34.4) 1.535 ms 1.423 ms
    MPLS Label=299872 CoS=0 TTL=1 S=0
    MPLS Label=299808 CoS=0 TTL=1 S=1
 3 10.0.56.6 (10.0.56.6) 1.361 ms 2.400 ms 1.334 ms
    MPLS Label=299808 CoS=0 TTL=1 S=1
 4 100.0.0.8 (100.0.0.8) 1.813 ms 1.463 ms 1.544 ms

This is a traceroute from R1 to R8, two CE routers within vpn-a, whose addresses are not known to the R4 and R5 backbone routers traversed by the probe packets.

Traceroute works by sending out packets with an increasing TTL, starting from 1; each intermediate router where the TTL expires sends an ICMP message to the source to inform it that the TTL has expired. In this way, the source learns the intermediate routers toward a destination. The problem with an MPLS core is that a packet from R1 to R8 that expires on R4, for example, cannot make R4 send a TTL-expired message to R1, because R4 doesn’t know R1. MPLS icmp-tunneling forces the following behavior: if the TTL expires on R4, R4 generates an ICMP TTL-expired packet, but does not route it back to the unknown source; instead, R4 copies the MPLS label stack from the expired packet onto the new ICMP packet and forwards it toward the original destination of the expired packet (not toward the source, as normally happens). When it reaches the destination, it is sent back to the source of the expired packet (which is known by R8), which thus gets the necessary information about what happened, along with the MPLS label information. This can be very useful when you’re learning on your JunOS labs! 😉

Conclusions

I hope you’ve enjoyed this small lab about this interesting feature, LFA. I started working on JunOS vMX a few months ago, so I can’t consider myself an expert, just someone curious about this interesting and powerful technology. So if you find errors or have questions, feel free to leave a comment below! 🙂


MPLS for a BGP-free core

Here we are with my second post on my new blog. This time I’ll talk about one of the benefits of implementing MPLS within the backbone: it allows you to have a BGP-free core. We’ll start with the following scenario:

post2_fig1_starting_config

We implement OSPF as IGP and every router (the Provider Edge and the inner Provider routers) has full knowledge of the point-to-point links and the loopbacks in blue in the scenario above. Let’s have a look, for example, at the routing table of PE 1 and at its OSPF configuration (the rest of configuration is trivial, just assign IP addresses to the interfaces):

PE1#show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is not set

 1.0.0.0/32 is subnetted, 1 subnets
O 1.1.1.1 [110/11] via 172.16.0.2, 00:13:58, FastEthernet0/0
 2.0.0.0/32 is subnetted, 1 subnets
O 2.2.2.2 [110/21] via 172.16.0.2, 00:12:31, FastEthernet0/0
 172.16.0.0/30 is subnetted, 3 subnets
O 172.16.0.8 [110/30] via 172.16.0.2, 00:12:31, FastEthernet0/0
O 172.16.0.4 [110/20] via 172.16.0.2, 00:13:48, FastEthernet0/0
C 172.16.0.0 is directly connected, FastEthernet0/0
 22.0.0.0/32 is subnetted, 1 subnets
O 22.22.22.22 [110/31] via 172.16.0.2, 00:10:27, FastEthernet0/0
 10.0.0.0/24 is subnetted, 1 subnets
C 10.1.0.0 is directly connected, FastEthernet0/1
 11.0.0.0/32 is subnetted, 1 subnets
C 11.11.11.11 is directly connected, Loopback0

PE1#sh run | sec router
router ospf 1
 router-id 11.11.11.11
 log-adjacency-changes
 network 11.11.11.11 0.0.0.0 area 0
 network 172.16.0.1 0.0.0.0 area 0

Requirement: enable communication between the two LAN segments in red, distributing the corresponding networks with iBGP (internal BGP).

1. iBGP connection between the PE routers: good starting point, but it doesn’t work

Let’s configure an iBGP connection between PE 1 and PE 2:

post2_fig1_iBGP_between_PE

PE1#sh run | sec router bgp
router bgp 65000
 no synchronization
 bgp log-neighbor-changes
 network 10.1.0.0 mask 255.255.255.0
 neighbor 22.22.22.22 remote-as 65000
 neighbor 22.22.22.22 update-source Loopback0
 no auto-summary

PE2#sh run | sec router bgp
router bgp 65000
 no synchronization
 bgp log-neighbor-changes
 network 10.2.0.0 mask 255.255.255.0
 neighbor 11.11.11.11 remote-as 65000
 neighbor 11.11.11.11 update-source Loopback0
 no auto-summary

This is sufficient to let PE 1 know about the LAN connected to PE 2 (and vice versa):

PE1#sh ip route 10.2.0.0 255.255.255.0
Routing entry for 10.2.0.0/24
 Known via "bgp 65000", distance 200, metric 0, type internal
 Last update from 22.22.22.22 00:04:10 ago
 Routing Descriptor Blocks:
 * 22.22.22.22, from 22.22.22.22, 00:04:10 ago
 Route metric is 0, traffic share count is 1
 AS Hops 0

So let’s try to ping 10.2.0.1 from 10.1.0.1:

PE1#ping 10.2.0.1 source 10.1.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.2.0.1, timeout is 2 seconds:
Packet sent with a source address of 10.1.0.1
.....
Success rate is 0 percent (0/5)

It doesn’t work… why? Let’s investigate: the PE 1 routing table says that 10.2.0.0/24 is reachable through 22.22.22.22, so a recursive lookup for 22.22.22.22 is needed; it succeeds and tells PE 1 that it must forward the packet to 172.16.0.2. This recursive lookup is done automatically and the result is programmed into CEF:

PE1#sh ip cef exact-route 10.1.0.1 10.2.0.1
10.1.0.1 -> 10.2.0.1 : FastEthernet0/0 (next hop 172.16.0.2)
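The recursive lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `ROUTES` table is a hand-copied subset of the PE 1 routing table, and the function names are hypothetical, not something IOS exposes.

```python
import ipaddress

# Hand-copied subset of the PE 1 routing table shown earlier (illustrative only).
ROUTES = {
    "10.2.0.0/24": {"next_hop": "22.22.22.22"},         # learned via iBGP
    "22.22.22.22/32": {"next_hop": "172.16.0.2"},       # learned via OSPF
    "172.16.0.0/30": {"interface": "FastEthernet0/0"},  # directly connected
}


def longest_match(address):
    """Return the most specific route entry covering the address, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for prefix, entry in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, entry)
    return best[1] if best else None


def resolve(destination):
    """Follow next hops until a connected interface is found."""
    next_hop = destination
    entry = longest_match(destination)
    while entry is not None and "interface" not in entry:
        next_hop = entry["next_hop"]
        entry = longest_match(next_hop)
    if entry is None:
        return None
    return entry["interface"], next_hop


if __name__ == "__main__":
    print(resolve("10.2.0.1"))
```

For 10.2.0.1 the sketch ends on FastEthernet0/0 with next hop 172.16.0.2, the same result reported by `show ip cef exact-route`.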

So, we can assume that the ping is forwarded out of FastEthernet 0/0 of PE 1 and should be received by P 1. We use the follow-the-path troubleshooting technique and look at what happens on P 1 when the echo request is received on FastEthernet 0/0. We’ll use debug ip packet and disable ip route-cache on the FastEthernet 0/0 interface of P 1, because otherwise we would see only locally generated or locally processed IP packets. This is the modified configuration and what we see as soon as we send out an ICMP echo request:

P1#sh run | sec access-list
access-list 100 permit ip host 10.1.0.1 host 10.2.0.1
P1#sh run int fa 0/0
!
interface FastEthernet0/0
 ip address 172.16.0.2 255.255.255.252
 no ip route-cache cef
 no ip route-cache
end

P1#debug ip packet 100 detail
IP packet debugging is on (detailed) for access list 100
P1#
*Mar 1 00:20:30.083: IP: s=10.1.0.1 (FastEthernet0/0), d=10.2.0.1, len 100, unroutable
*Mar 1 00:20:30.083: ICMP type=8, code=0

It says unroutable, so let’s have a look at the P 1 routing table:

P1#sh ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is not set

 1.0.0.0/32 is subnetted, 1 subnets
C 1.1.1.1 is directly connected, Loopback0
 2.0.0.0/32 is subnetted, 1 subnets
O 2.2.2.2 [110/11] via 172.16.0.6, 00:22:47, FastEthernet0/1
 172.16.0.0/30 is subnetted, 3 subnets
O 172.16.0.8 [110/20] via 172.16.0.6, 00:22:47, FastEthernet0/1
C 172.16.0.4 is directly connected, FastEthernet0/1
C 172.16.0.0 is directly connected, FastEthernet0/0
 22.0.0.0/32 is subnetted, 1 subnets
O 22.22.22.22 [110/21] via 172.16.0.6, 00:22:48, FastEthernet0/1
 11.0.0.0/32 is subnetted, 1 subnets
O 11.11.11.11 [110/11] via 172.16.0.1, 00:22:58, FastEthernet0/0

The packet is discarded, because P 1 doesn’t have a route toward 10.2.0.1!

2. Implement a full mesh of iBGP connections

We can solve the issue of the previous section by implementing iBGP also on the two P routers. It is not sufficient, however, to implement iBGP between PE 1 and P 1 and between PE 2 and P 2: one of the rules of BGP is that a route received from an iBGP peer is not propagated to another iBGP peer, so we must implement a full mesh of iBGP sessions. This means we must add 5 iBGP peerings (two between PE 1 and the two P routers, one between P 1 and P 2, and two between PE 2 and the two P routers). This would solve our problem, but we want to avoid this solution, which becomes too expensive in terms of configuration as the number of P routers grows (we could also implement BGP Route Reflectors or other mechanisms, but these are out of the scope of this post).
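To see where the number 5 comes from, and how quickly a full mesh grows, here is a small Python sketch. The router names are the ones of this lab; the helper name `full_mesh_sessions` is just an illustrative choice.

```python
from itertools import combinations

ROUTERS = ["PE1", "P1", "P2", "PE2"]
# The only iBGP session configured so far.
EXISTING = {frozenset(("PE1", "PE2"))}


def full_mesh_sessions(n):
    """Number of iBGP sessions needed for a full mesh of n routers."""
    return n * (n - 1) // 2


# Every unordered pair of routers needs one session in a full mesh.
full_mesh = [frozenset(pair) for pair in combinations(ROUTERS, 2)]
missing = [sorted(pair) for pair in full_mesh if pair not in EXISTING]

if __name__ == "__main__":
    print(f"sessions in a full mesh: {len(full_mesh)}")
    print(f"sessions still to add:   {len(missing)}")
    for pair in missing:
        print("  ", " <-> ".join(pair))
    print(f"with 10 routers: {full_mesh_sessions(10)} sessions")
```

With four routers the full mesh needs 6 sessions, one of which already exists; with ten routers it would already need 45.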

3. Enable MPLS in the backbone

We can easily solve our problem by enabling MPLS (Multi Protocol Label Switching) on the routers. We can define MPLS as a Layer 2.5 protocol: when a packet must be sent out of an MPLS-enabled interface, a label is inserted between the Ethernet header and the IP header, and the packet is sent to the usual next hop. The router that receives the packet can forward it without looking at the destination IP address, simply by looking at the label, swapping it with the proper outgoing label and forwarding the packet out of another interface. Using LDP (Label Distribution Protocol), each router tells its neighbors which labels to use to reach, through itself, the networks it knows via the IGP.
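To make the "label between Ethernet and IP" idea more concrete, here is a minimal Python sketch of the 4-byte MPLS label stack entry as laid out in RFC 3032 (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL). The function names and the TTL value are illustrative assumptions, not taken from the lab.

```python
import struct


def encode_label_entry(label, tc=0, bottom=True, ttl=255):
    """Pack one MPLS label stack entry into 4 bytes (network byte order)."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= tc < 8 or not 0 <= ttl < 256:
        raise ValueError("tc must fit in 3 bits and ttl in 8 bits")
    word = (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


def decode_label_entry(data):
    """Unpack 4 bytes into the fields of a label stack entry."""
    (word,) = struct.unpack("!I", data)
    return {
        "label": word >> 12,
        "tc": (word >> 9) & 0x7,
        "bottom": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


if __name__ == "__main__":
    entry = encode_label_entry(18, ttl=254)
    print(entry.hex())
    print(decode_label_entry(entry))
```

A single entry like this, carrying a label value such as 18, is all a transit router needs to read in order to forward the packet.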

We enable MPLS at the global level and under the point-to-point interfaces on all the routers. I’ll show the P 1 configuration as an example:

! MPLS needs ip CEF to be enabled in order to work
ip cef 
!
mpls label protocol ldp
mpls ldp router-id Loopback0
!
interface FastEthernet 0/0
 mpls ip
interface FastEthernet 0/1
 mpls ip

Now let’s ping again:

PE1#ping 10.2.0.1 source 10.1.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.2.0.1, timeout is 2 seconds:
Packet sent with a source address of 10.1.0.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 24/41/56 ms

It works! P 1 router ip packet debug doesn’t show anything, because the IP packet is not even processed by P 1: it is received on Fa 0/0, the MPLS label is swapped, and it is forwarded on Fa 0/1.

Now we’ll have a look at how the label for 22.22.22.22, i.e. the next hop on PE 1 to 10.2.0.0/24, is received by PE 1 and how the label is swapped when it goes through the network.

Label propagation for 22.22.22.22/32 from PE 2 to PE 1

PE 2

Here we see the Penultimate Hop Popping (PHP) feature of MPLS. If a router is the final destination for a prefix, it tells its neighbors not to use an MPLS label when sending an IP packet for that prefix to it, because it is useless:

PE2#show mpls ldp bindings 22.22.22.22 32
 tib entry: 22.22.22.22/32, rev 12
 local binding: tag: imp-null
 remote binding: tsr: 2.2.2.2:0, tag: 18

So, PE 2 tells P 2 (the only LDP neighbor it has) not to use an MPLS label, and receives the instruction to use label 18 to reach 22.22.22.22/32 through P 2 (2.2.2.2).

P 2

Let’s have a look at the ldp binding on P 2:

P2#show mpls ldp bindings 22.22.22.22 32
 tib entry: 22.22.22.22/32, rev 12
 local binding: tag: 18
 remote binding: tsr: 1.1.1.1:0, tag: 18
 remote binding: tsr: 22.22.22.22:0, tag: imp-null

P 2 tells its neighbors that its label for 22.22.22.22/32 is 18. We can see that LDP neighbor 22.22.22.22 (PE 2) says to use no label to reach 22.22.22.22/32 through it, while neighbor 1.1.1.1 (P 1) says to use label 18 (the same value as P 2’s, but this is just a coincidence, since labels are locally significant).

P 1

Now it’s time to look at P 1:

P1#show mpls ldp bindings 22.22.22.22 32
 tib entry: 22.22.22.22/32, rev 12
 local binding: tag: 18
 remote binding: tsr: 11.11.11.11:0, tag: 20
 remote binding: tsr: 2.2.2.2:0, tag: 18

The local label for 22.22.22.22/32 is 18, the same as the one received from P 2. Then there is the label received from PE 1, which is 20.

PE 1

Finally, we arrived on PE 1:

PE1#show mpls ldp bindings 22.22.22.22 32
 tib entry: 22.22.22.22/32, rev 12
 local binding: tag: 20
 remote binding: tsr: 1.1.1.1:0, tag: 18

It receives the instruction to use label 18 from neighbor 1.1.1.1 (P 1), and its local label for 22.22.22.22/32 is 20.
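Each router receives bindings from all of its LDP neighbors, but it only uses the one advertised by the neighbor that is its IGP next hop toward the prefix. The following Python sketch illustrates that selection with the bindings shown in the outputs above; the data structures and the function name are illustrative assumptions.

```python
# Remote bindings for 22.22.22.22/32, copied from the outputs above:
# {router: {ldp_neighbor_id: advertised_label}}
REMOTE_BINDINGS = {
    "PE1": {"1.1.1.1": 18},
    "P1": {"11.11.11.11": 20, "2.2.2.2": 18},
    "P2": {"1.1.1.1": 18, "22.22.22.22": "imp-null"},
}

# LDP router ID of the IGP next hop toward 22.22.22.22 on each router.
IGP_NEXT_HOP = {"PE1": "1.1.1.1", "P1": "2.2.2.2", "P2": "22.22.22.22"}


def outgoing_action(router):
    """Pick the binding advertised by the IGP next hop and describe its use."""
    label = REMOTE_BINDINGS[router][IGP_NEXT_HOP[router]]
    if label == "imp-null":
        # Implicit null: the neighbor asked for Penultimate Hop Popping.
        return "pop"
    return f"label {label}"


if __name__ == "__main__":
    for router in ("PE1", "P1", "P2"):
        print(f"{router}: {outgoing_action(router)}")
```

The binding that P 1 receives from PE 1 (label 20) is stored but not used for this prefix, because PE 1 is not on P 1’s path toward 22.22.22.22.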

Label-switched forwarding of a packet from 10.1.0.1 to 10.2.0.1

We now go in the opposite direction, following the journey of an ICMP Echo Request from 10.1.0.1 to 10.2.0.1.

PE 1

As we’ve seen at the beginning, a packet to 10.2.0.1 has 22.22.22.22 as next hop on PE 1, and a recursive lookup tells PE 1 to forward it to 172.16.0.2, i.e. P 1, but this time the packet is sent out with an MPLS label:

PE1#show mpls forwarding-table 10.2.0.1
Local Outgoing   Prefix        Bytes tag  Outgoing   Next Hop
tag   tag or VC  or Tunnel Id  switched   interface
20    18         10.2.0.0/24   0          Fa0/0      172.16.0.2

PE1#show ip cef 10.2.0.1
10.2.0.0/24, version 18, epoch 0, cached adjacency 172.16.0.2
0 packets, 0 bytes
 tag information from 22.22.22.22/32, shared
 local tag: 20
 fast tag rewrite with Fa0/0, 172.16.0.2, tags imposed: {18}
 via 22.22.22.22, 0 dependencies, recursive
 next hop 172.16.0.2, FastEthernet0/0 via 22.22.22.22/32
 valid cached adjacency
 tag rewrite with Fa0/0, 172.16.0.2, tags imposed: {18}

PE 1 does not have a label for 10.2.0.0/24, since it is neither received through the IGP (OSPF) nor a locally known network, so show mpls forwarding-table 10.2.0.1 shows the label that is applied to reach 22.22.22.22, which is the next hop for 10.2.0.0/24.

P 1

P1#show mpls forwarding-table
Local Outgoing   Prefix         Bytes tag Outgoing  Next Hop
tag   tag or VC  or Tunnel Id   switched  interface
16    Pop tag    2.2.2.2/32     0         Fa0/1     172.16.0.6
17    Pop tag    172.16.0.8/30  0         Fa0/1     172.16.0.6
18    18         22.22.22.22/32 12944     Fa0/1     172.16.0.6
19    Pop tag    11.11.11.11/32 7489      Fa0/0     172.16.0.1

P 1 swaps tag 18 with… tag 18 and sends the packet out onto FastEthernet 0/1 toward P 2.

This is the packet forwarded from P 1 to P 2, with the MPLS encapsulation header with label 18:

MPLS_packet

P 2

P2#show mpls forwarding-table
Local Outgoing   Prefix         Bytes tag Outgoing  Next Hop
tag   tag or VC  or Tunnel Id   switched  interface
16    Pop tag    1.1.1.1/32     0         Fa0/1     172.16.0.5
17    Pop tag    172.16.0.0/30  0         Fa0/1     172.16.0.5
18    Pop tag    22.22.22.22/32 12911     Fa0/0     172.16.0.10
19    19         11.11.11.11/32 8494      Fa0/1     172.16.0.5

When P 2 receives the packet with label 18, it removes the label (due to the PHP feature) and sends it out on FastEthernet 0/0.

PE 2

PE2#show ip cef 10.2.0.1
10.2.0.1/32, version 4, epoch 0, receive

PE 2 receives an IP packet with 10.2.0.1 as destination and it can receive, i.e. process, it.
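The whole walk can be condensed into a small Python simulation. It is only a sketch: the `LFIB` dictionary contains just the entries relevant to 22.22.22.22/32, copied from the forwarding tables above, and the function name is hypothetical.

```python
# Label operations for traffic whose BGP next hop is 22.22.22.22.
# {router: {incoming_label: (action, outgoing_label, next_router)}}
LFIB = {
    "P1": {18: ("swap", 18, "P2")},
    "P2": {18: ("pop", None, "PE2")},
}


def follow_lsp(ingress, first_hop, pushed_label):
    """Return (from, to, action, label) for every link the packet crosses."""
    hops = [(ingress, first_hop, "push", pushed_label)]
    router, label = first_hop, pushed_label
    while label is not None:
        action, out_label, next_router = LFIB[router][label]
        hops.append((router, next_router, action, out_label))
        router, label = next_router, out_label
    return hops


if __name__ == "__main__":
    for src, dst, action, label in follow_lsp("PE1", "P1", 18):
        shown = "plain IP packet" if label is None else f"label {label}"
        print(f"{src} -> {dst}: {action}, {shown}")
```

Note that neither P router ever needs an entry for 10.2.0.0/24: the label alone drives the forwarding until the packet reaches PE 2 as a plain IP packet.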

Conclusion

In this post, we’ve seen how MPLS, which is easily and quickly configured on the backbone, saves us from implementing an iBGP full mesh. It works between Layer 2 and Layer 3, spares the internal routers from examining the IP packets destined to the PE routers, and lets them forward IP packets destined to IP addresses they don’t even know.

I hope you enjoyed this post, feel free to post comments or contact me.


Different approaches to IPSec Communication between Cisco Routers

This is my first post on my new blog, and I’ll start by examining some different approaches to enable IPSec communication between Cisco Routers, starting from the oldest one to the most recent ones.

Scenario

Schermata 2014-11-05 alle 21.38.37

R1 and R2 are connected through their FastEthernet 0/0 interfaces. We simulate a LAN on each router with the FastEthernet 0/1 interface (which is connected to nothing, but has the no keepalive statement configured in order to keep it up as if it were connected). The routers also have a Loopback 0 interface, used as router-id for the OSPF routing protocol, which in turn is used between the two routers to provide full IP reachability of the aforementioned interfaces.

Requirement: enable secure communication over the link between the two routers between LAN 10.1.0.0/24 and LAN 10.2.0.0/24.

Note about ESP encryption: I’ve used null encryption in order to allow me to do packet traces with Wireshark and show you the encapsulated headers.

1. Using CRYPTO MAP on Fa0/0 interface for policy-based IPSec Encryption

The first method consists in applying a crypto map on Fa0/0, instructing the router to encrypt the traffic that matches the PROXIES ACL, which sets the local proxy to 10.1.0.0/24 and the remote proxy to 10.2.0.0/24 on R1, and vice versa on R2. The two encryption peers are 172.16.0.1 and 172.16.0.2.

R1 configuration 

crypto isakmp policy 1
 encr aes
 hash sha
 authentication pre-share
 group 2
crypto isakmp key cisco address 172.16.0.2
!
crypto ipsec transform-set TSET esp-null esp-sha-hmac
!
crypto map CMAP 1 ipsec-isakmp
 set peer 172.16.0.2
 set transform-set TSET
 match address PROXIES
!
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
!
interface FastEthernet0/0
 ip address 172.16.0.1 255.255.255.252
 crypto map CMAP
!
interface FastEthernet0/1
 ip address 10.1.0.1 255.255.255.0
 no keepalive
!
router ospf 1
 network 1.1.1.1 0.0.0.0 area 0
 network 10.1.0.1 0.0.0.0 area 0
 network 172.16.0.1 0.0.0.0 area 0
!
ip access-list extended PROXIES
 permit ip 10.1.0.0 0.0.0.255 10.2.0.0 0.0.0.255

R2 configuration

crypto isakmp policy 1
 encr aes
 hash sha
 authentication pre-share
 group 2
crypto isakmp key cisco address 172.16.0.1
!
crypto ipsec transform-set TSET esp-null esp-sha-hmac
!
crypto map CMAP 1 ipsec-isakmp
 set peer 172.16.0.1
 set transform-set TSET
 match address PROXIES
!
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
!
interface FastEthernet0/0
 ip address 172.16.0.2 255.255.255.252
 crypto map CMAP
!
interface FastEthernet0/1
 ip address 10.2.0.1 255.255.255.0
 no keepalive
!
router ospf 1
 network 2.2.2.2 0.0.0.0 area 0
 network 10.2.0.1 0.0.0.0 area 0
 network 172.16.0.2 0.0.0.0 area 0
!
ip access-list extended PROXIES
 permit ip 10.2.0.0 0.0.0.255 10.1.0.0 0.0.0.255

Now a ping from 10.1.0.1 to 10.2.0.1 works and is protected by IPSec in tunnel mode, which adds a new IP header with the Fa0/0 addresses in front of the ESP packet that contains the original IP packet with the Fa0/1 addresses:

Schermata 2014-11-05 alle 21.40.12

2. Using a GRE IP Tunnel for route-based IPSec Encryption

This second approach involves the creation of a GRE IP tunnel between the two routers, using the Fa0/0 IPs as tunnel source and destination. In this case we change the PROXIES ACLs in order to match only GRE traffic between 172.16.0.1 and 172.16.0.2. This is a route-based approach, since routing determines which traffic is encrypted: we send the traffic for the two Fa0/1 LANs over Tunnel 0 with a static route, which wins over the same route known through OSPF. I’ll post only the relevant configuration:

R1 configuration 

! New Tunnel interface
!
interface Tunnel0
 ip address 172.31.0.1 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 172.16.0.2
end
! 
! Modified PROXIES ACL
!
ip access-list extended PROXIES
 permit gre host 172.16.0.1 host 172.16.0.2
!
! Route traffic for 10.2.0.0/24 over Tunnel 0
!
ip route 10.2.0.0 255.255.255.0 Tunnel0

R2 configuration

! New Tunnel interface
!
interface Tunnel0
 ip address 172.31.0.2 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 172.16.0.1
end
! 
! Modified PROXIES ACL
!
ip access-list extended PROXIES
 permit gre host 172.16.0.2 host 172.16.0.1
!
! Route traffic for 10.1.0.0/24 over Tunnel 0
!
ip route 10.1.0.0 255.255.255.0 Tunnel0
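The static routes above win over OSPF because of the administrative distance. Here is a small Python sketch of that selection on R1, assuming the Cisco default distances (connected 0, static 1, OSPF 110); the dictionary layout and the function name are illustrative.

```python
# Cisco IOS default administrative distances (lower wins).
ADMIN_DISTANCE = {"connected": 0, "static": 1, "ospf": 110}

# The two candidate routes for 10.2.0.0/24 on R1 in this section.
CANDIDATES = [
    {"prefix": "10.2.0.0/24", "source": "ospf", "via": "172.16.0.2"},
    {"prefix": "10.2.0.0/24", "source": "static", "via": "Tunnel0"},
]


def best_route(routes):
    """Among routes for the same prefix, pick the lowest administrative distance."""
    return min(routes, key=lambda route: ADMIN_DISTANCE[route["source"]])


if __name__ == "__main__":
    winner = best_route(CANDIDATES)
    print(f"{winner['prefix']} via {winner['via']} ({winner['source']})")
```

Since the winner points to Tunnel0, the LAN-to-LAN traffic is GRE-encapsulated first and therefore matches the modified PROXIES ACL.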

The transform set TSET is using default IPSec mode, i.e. Tunnel mode:

crypto ipsec transform-set TSET esp-null esp-sha-hmac

This produces a redundant encapsulation, since GRE and IPSec tunnel endpoints are the same:

Schermata 2014-11-05 alle 21.57.31

We can avoid this double encapsulation by using IPSec transport mode (if you change it, the quickest way to reset the security associations is to shut down and re-enable the Tu 0 interfaces):

crypto ipsec transform-set TSET esp-null esp-sha-hmac
 mode transport

This removes the redundant header:

Schermata 2014-11-05 alle 23.03.12

3. Using GRE Tunnel with tunnel protection for route-based IPSec Encryption

In this section we will obtain the same encapsulation that we got in the previous section, but without using a crypto map applied to the FastEthernet 0/0 interface. This time we use the newer approach, which applies IPSec tunnel protection to the GRE tunnel: we’ll define an IPSec profile to be applied to the tunnel, which in turn uses an ISAKMP profile with the ISAKMP secret specified within a keyring.

Assuming interfaces and routing are configured as in the previous examples, I’ll post only the Fa0/0 and Tu0 interface configurations and all the crypto-related configuration statements, so you can assume that everything about cryptography is in the following code snippet:

R1 configuration

interface FastEthernet0/0
 ip address 172.16.0.1 255.255.255.252
 duplex auto
 speed auto
end
!
interface Tunnel0
 ip address 172.31.0.1 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 172.16.0.2
 tunnel protection ipsec profile IPSEC_PROFILE
end
! 
crypto keyring KEYRING
 pre-shared-key address 172.16.0.2 key cisco
crypto isakmp policy 1
 encr aes
 authentication pre-share
 group 2
crypto isakmp profile ISAKMP_PROFILE
 keyring KEYRING
 match identity address 172.16.0.2 255.255.255.255
crypto ipsec transform-set TSET esp-null esp-sha-hmac
crypto ipsec profile IPSEC_PROFILE
 set transform-set TSET
 set isakmp-profile ISAKMP_PROFILE

R2 configuration

interface FastEthernet0/0
 ip address 172.16.0.2 255.255.255.252
 duplex auto
 speed auto
end
!
interface Tunnel0
 ip address 172.31.0.2 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 172.16.0.1
 tunnel protection ipsec profile IPSEC_PROFILE
end
!
crypto keyring KEYRING
 pre-shared-key address 172.16.0.1 key cisco
crypto isakmp policy 1
 encr aes
 authentication pre-share
 group 2
crypto isakmp profile ISAKMP_PROFILE
 keyring KEYRING
 match identity address 172.16.0.1 255.255.255.255
crypto ipsec transform-set TSET esp-null esp-sha-hmac
crypto ipsec profile IPSEC_PROFILE
 set transform-set TSET
 set isakmp-profile ISAKMP_PROFILE

As we’ve already seen in the previous section, this produces a redundant IP encapsulation, because GRE and ESP each add a new IP header whose source and destination are the Fa0/0 interfaces’ IPs:

Schermata 2014-11-06 alle 20.30.32

To remove this redundant encapsulation and save some bytes, just use IPSEC in transport mode:

crypto ipsec transform-set TSET esp-null esp-sha-hmac
 mode transport

This removes the redundant header:

Schermata 2014-11-06 alle 20.33.14
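To quantify the bytes saved, here is a rough Python sketch that sums the fixed header sizes of the two GRE encapsulations. It assumes IPv4 headers without options (20 bytes), a basic GRE header without key or sequence number (4 bytes) and the 8-byte ESP header (SPI plus sequence number); the ESP trailer and the integrity check value are essentially the same in both modes and are left out.

```python
# Fixed header sizes in bytes (assumptions stated in the text above).
IPV4 = 20
GRE = 4
ESP_HEADER = 8

# Headers placed in front of the original payload, outermost first.
GRE_OVER_IPSEC_TUNNEL_MODE = [
    ("outer IP added by ESP tunnel mode", IPV4),
    ("ESP header", ESP_HEADER),
    ("IP header added by GRE", IPV4),
    ("GRE header", GRE),
    ("original IP header", IPV4),
]

GRE_OVER_IPSEC_TRANSPORT_MODE = [
    ("IP header added by GRE", IPV4),
    ("ESP header", ESP_HEADER),
    ("GRE header", GRE),
    ("original IP header", IPV4),
]


def header_bytes(stack):
    """Total size of all headers in the stack."""
    return sum(size for _, size in stack)


if __name__ == "__main__":
    tunnel = header_bytes(GRE_OVER_IPSEC_TUNNEL_MODE)
    transport = header_bytes(GRE_OVER_IPSEC_TRANSPORT_MODE)
    print(f"tunnel mode:    {tunnel} bytes of headers")
    print(f"transport mode: {transport} bytes of headers")
    print(f"saved:          {tunnel - transport} bytes per packet")
```

Under these assumptions, transport mode saves exactly one IPv4 header per packet.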

4. Using IPSEC IPv4 Tunnel with tunnel protection for route-based IPSec Encryption

In the last approach, we use IPSec encryption without GRE tunneling, as in section 1, but we keep a route-based approach (as opposed to the policy-based encryption we examined at the beginning of this post). This can be accomplished by changing the tunnel mode to ipsec ipv4 (I’ll show only the Tunnel 0 configuration since everything else is the same as in section 3):

R1 configuration

interface Tunnel0
 ip address 172.31.0.1 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 172.16.0.2
 tunnel mode ipsec ipv4
 tunnel protection ipsec profile IPSEC_PROFILE
end

R2 configuration

interface Tunnel0
 ip address 172.31.0.2 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 172.16.0.1
 tunnel mode ipsec ipv4
 tunnel protection ipsec profile IPSEC_PROFILE
end

This time, specifying mode transport in the transform set does not force IPSec to use transport mode, since tunnel mode is needed to encapsulate the inner IP packet which carries the ping from R1 to R2 and vice versa. I’ve left the mode transport statement in the transform set, but the IPSec security association is negotiated in tunnel mode:

R1#show crypto ipsec sa
interface: Tunnel0
 Crypto map tag: Tunnel0-head-0, local addr 172.16.0.1
 protected vrf: (none)
 local ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)
 remote ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)
 current_peer 172.16.0.2 port 500
 PERMIT, flags={origin_is_acl,}
 [...]
 local crypto endpt.: 172.16.0.1, remote crypto endpt.: 172.16.0.2
 [...]
 inbound esp sas:
 spi: 0xF99762A5(4187447973)
 transform: esp-null esp-sha-hmac ,
 in use settings ={Tunnel, }
 [...]
 outbound esp sas:
 spi: 0x89CCFDC2(2311912898)
 transform: esp-null esp-sha-hmac ,
 in use settings ={Tunnel, }
 [...]

In the output above, you can also see that the local and remote identities (the proxies specified in section 1) are 0.0.0.0/0, since we encrypt everything that we route over the Tunnel 0 interface. In contrast, when you’re using policy-based IPSec encryption, the local and remote identities are the networks specified within the PROXIES ACL:

R1#show crypto ipsec sa

interface: FastEthernet0/0
    Crypto map tag: CMAP, local addr 172.16.0.1

   protected vrf: (none)
   local  ident (addr/mask/prot/port): (10.1.0.0/255.255.255.0/0/0)
   remote ident (addr/mask/prot/port): (10.2.0.0/255.255.255.0/0/0)

This produces the following encapsulation:

Schermata 2014-11-06 alle 21.13.45
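The difference between the two sets of identities can be sketched as a simple traffic selector check in Python. The selector tuples mirror the identities shown in the two `show crypto ipsec sa` outputs of this section; the function name and the test addresses are illustrative assumptions.

```python
import ipaddress

# (local identity, remote identity) as seen from R1.
POLICY_BASED = ("10.1.0.0/24", "10.2.0.0/24")  # section 1, PROXIES ACL
ROUTE_BASED = ("0.0.0.0/0", "0.0.0.0/0")       # section 4, ipsec ipv4 tunnel


def selector_matches(selector, source, destination):
    """True if the packet's addresses fall within the negotiated identities."""
    local_net, remote_net = (ipaddress.ip_network(net) for net in selector)
    return (
        ipaddress.ip_address(source) in local_net
        and ipaddress.ip_address(destination) in remote_net
    )


if __name__ == "__main__":
    packets = [("10.1.0.1", "10.2.0.1"), ("1.1.1.1", "2.2.2.2")]
    for source, destination in packets:
        print(
            f"{source} -> {destination}: "
            f"policy-based={selector_matches(POLICY_BASED, source, destination)}, "
            f"route-based={selector_matches(ROUTE_BASED, source, destination)}"
        )
```

With the policy-based selectors, only LAN-to-LAN traffic qualifies for encryption; with 0.0.0.0/0 any packet qualifies, and it is the routing table that decides what actually enters the tunnel.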

Conclusion

In my first post we’ve gone through different approaches to a simple task: encrypting all the communications between two LANs. We started from the policy-based solution, with the crypto map applied to the Fa0/0 interfaces, and then went through three route-based solutions. The first route-based solution still uses a crypto map and negotiates the tunnel endpoints as local and remote identities: the GRE tunnel simply encapsulates all the traffic that is sent over it, but it is the crypto map that does the encryption, so it must know which traffic to encrypt. In the last two solutions it is the tunnel interface itself that encrypts all the traffic sent over it, and in the IPSec IPv4 tunnel of section 4 the negotiated local and remote identities are 0.0.0.0/0.

This post has been an exercise for me to experiment with IPSec tunnels, and I hope someone will find it useful. Feel free to post comments or to contact me; I don’t know if I’ll have time to reply, but you can try if you want 🙂
