Implementing SNAT/DNAT on Fortinet firewalls has never been as straightforward as on other platforms like Checkpoint, in my opinion, at least before the introduction of Central NAT. Let's see how to implement a subnet-to-subnet 1-to-1 translation with deterministic mappings, using Virtual IPs (DNAT) and Fixed Port Range IP Pools (SNAT).
Introduction
I've recently had to migrate some firewall security and NAT rules from Checkpoint to Fortinet firewalls and faced some challenges due to the different behavior of the two technologies. Checkpoint applies security policies first (at least in the version I'm using) and then checks the NAT policies, applying both source and destination NAT. Fortinet has a different order of operations, more like Linux with iptables: the packet arrives on the incoming interface; Destination NAT (DNAT from now on) happens in a pre-routing step; a routing decision follows (now that the final destination is known) and determines the outgoing interface; security policies (IPv4 policies) are evaluated; and finally Source NAT (SNAT from now on) takes place and the packet is sent out of the outgoing interface (more detail can be found here; I've omitted many intermediate steps).
During the rules' migration, I needed to implement SNAT and DNAT rules mapping a /22 network onto another /22 network with deterministic mappings. Let's take the following as an example:
On the right there is a partner's network, 10.0.100.0/22 (i.e. with IPs ranging from 10.0.100.0 to 10.0.103.255), for which our firewall has a known route. Some rules require communication from our network on the left toward the partner network, but via an alias network 10.0.200.0/22, which is routed in our network to our border firewall. What is important is that the mapping, in both directions, must be deterministic: 10.0.101.55 must be SNATted to 10.0.201.55 and vice versa, or more generically, 10.0.10x.y must be mapped to 10.0.20x.y (and vice versa). To make things more complex, SNAT from the partner network to us must happen only for some IPs within 10.0.100.0/22.
NOTE: examples and images shown are taken on FortiManager, the Interface/VM used to manage FortiGate firewalls. You can apply the same concepts directly on FortiGate firewalls (maybe there is a slightly different terminology).
Central NAT
As I've mentioned before, implementing DNAT and SNAT on Fortinet FortiGate firewalls has never been as simple as on other platforms, but Fortinet made a big step forward with Central NAT, which adds two policy tables alongside the standard IPv4 table of security rules we already had on FortiManager. Something still reminds you of the old way of configuring NAT (Central DNAT rules create VirtualIP objects, which were previously applied as destinations in standard security rules, though you no longer have to do that), but it is easier to use and to read (and less prone to errors).
Security Rules
As I've explained before, DNAT happens before security policies are evaluated, so when you write the security policies you must take into account that DNAT has already translated the destination addresses. SNAT, on the contrary, has yet to happen, so the original source addresses must be used. To implement the bidirectional communication shown in the example figure above, we need two rules:
IPv4 Security Policies
As you can see, rule #1 has 10.0.100.0/22 as destination, because DNAT from 10.0.200.0/22 to 10.0.100.0/22 has already been applied.
DNAT
DNAT is quite easy to implement: it requires a single rule in the Central DNAT table, which in turn creates a single VIP object:
DNAT 10.0.200.0/22 to 10.0.100.0/22
As you can see, you set the range of IP addresses of the /22 network that we "know" on our side and then you specify only the first address of the real Mapped (or Internal) network of the partner. The GUI computes the final IP of that network in order to have a 1-to-1, static, deterministic mapping.
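For reference, the same object can be sketched in FortiGate CLI terms; this is my own illustration (the object name is made up, and exact syntax may vary slightly between FortiOS versions):

```
config firewall vip
    edit "partner-dnat-10.0.200.0"
        # the range as known on our side
        set extip 10.0.200.0-10.0.203.255
        # the real partner range it maps onto
        set mappedip "10.0.100.0-10.0.103.255"
    next
end
```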
SNAT and the Fixed Port Range IP-Pools
I faced some challenges implementing the opposite direction, since to migrate the rules from the old firewall we had to implement a deterministic SNAT of 10.0.10x.y to 10.0.20x.y, but only for specific sources. Source NAT is implemented in the Central SNAT table, where you write a policy, like the security ones, specifying source/destination addresses and ports of the traffic to match (with post-DNAT destinations), and then you choose to SNAT the traffic onto the IP of the outgoing interface or onto an IP pool. The IP pool can be of different kinds: Overload, One-to-One, Fixed Port Range and Port Block Allocation. One-to-One seemed right to me, since we want to implement a 1-to-1 mapping between two subnets... but that object only lets you specify a single range of IP addresses, so the FortiGate has no way of knowing which source IP (that is something you specify in the Central SNAT policy, a separate object) will be mapped onto which IP of the IP-Pool object. So in my first test I implemented a Central SNAT policy that said "traffic from test-outside with source IP in 10.0.100.0/22 has to be SNATted onto an IP-Pool One-to-One object that specifies 10.0.200.0/22 as range". Then I ran some tests: a first flow from 10.0.100.10 was SNATted onto 10.0.200.1, a second flow from 10.0.100.x onto 10.0.200.2 and so on, so I could not achieve the deterministic 1-to-1 mapping I needed.
I searched a bit through the documentation and read how the Fixed Port Range IP-Pool object works: it allows you to map, for example, a /22 network onto a /24 network, i.e. with a 4:1 ratio of IP addresses, in a deterministic way, reserving 1/4 of the ports for each IP. You can find more details in the Official Fortinet Documentation. What you can see is that if you have a 1:1 ratio of IP addresses between the real network and the mapped one, you can achieve a deterministic 1-to-1 mapping of 10.0.10x.y port w to 10.0.20x.y port w with Fixed Port Range SNAT.
The following formulas are found in section The Equations on official docs:
Let’s try with our particular case, with a partner’s source IP of 10.0.102.55:
factor = (10.0.103.255 − 10.0.100.0 + 10.0.203.255 − 10.0.200.0 + 1) / (10.0.203.255 − 10.0.200.0 + 1)
       = (1023 + 1023 + 1) / (1023 + 1)
       = 2047 / 1024
       = 1.999023
       = 1 (an integer conversion cuts the decimal part)
new IP = 10.0.200.0 + (10.0.102.55 − 10.0.100.0) / 1
       = 10.0.200.0 + (256 + 256 + 55)
       = 10.0.202.55
Here you are: 10.0.102.55 is translated to 10.0.202.55! Bingo!
Math with subnets and IP addresses may not seem intuitive at first, but it should be quite clear from the example I've shown.
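To make the arithmetic concrete, here is a tiny bash sketch of my own (not something the FortiGate runs) that reproduces the 1:1 case: convert the addresses to integers, add the offset of the source within the external range to the start of the mapped range, and convert back:

```shell
#!/bin/bash

# Convert a dotted-quad IP to an integer
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Convert an integer back to a dotted-quad IP
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# map_ip SRC EXT_START MAPPED_START
# With a 1:1 ratio the factor is 1, so:
#   new IP = mapped_start + (src - ext_start)
map_ip() {
  local src ext mapped
  src=$(ip_to_int "$1"); ext=$(ip_to_int "$2"); mapped=$(ip_to_int "$3")
  int_to_ip $(( mapped + (src - ext) ))
}

map_ip 10.0.102.55 10.0.100.0 10.0.200.0   # prints 10.0.202.55
```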
As you can see in the following image, with Fixed Port Range you specify the exact mapping, by telling FortiManager what is the External IP Range (the IP addresses as known on our side) and the Internal (or mapped) IP addresses (the real IP addresses on the partner’s network):
Fixed Port range IP-Pool
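In CLI terms, the IP-pool object above would look roughly like this sketch (the pool name is mine, and option names may differ slightly across FortiOS versions):

```
config firewall ippool
    edit "partner-snat-fpr"
        set type fixed-port-range
        # the addresses as known on our side
        set startip 10.0.200.0
        set endip 10.0.203.255
        # the real source addresses on the partner's network
        set source-startip 10.0.100.0
        set source-endip 10.0.103.255
    next
end
```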
Now we can use this IP-pool object in a Central SNAT rule where we determine for which IP addresses of the partner network we want the deterministic NAT to happen:
Central SNAT rule details
Central SNAT rule
Obviously, you could implement the above SNAT with 3 rules, each with its own Overload/One-to-One IP pool translating a single IP onto the corresponding one, but in my example you have a deterministic 1-to-1 mapping of the original /22 network onto the target /22 network and you then decide with SNAT policies when it has to take place.
With the above rule, traffic from 10.0.100.5 arrives into our network with source 10.0.200.5, as expected, while traffic from 10.0.101.100 for example arrives with its original IP address since it is not matched by the Central SNAT policy.
In a simpler setup, you can make the SNAT happen for the whole 10.0.100.0/22 by setting it as source address in the Central SNAT rule:
SNAT for the whole partner’s network
Conclusions
I wanted to share this bit of knowledge because I had some headaches before finding the solution to this deterministic subnet-to-subnet SNAT problem. Maybe it can be useful for someone who, like me, would never have thought of using a Fixed Port Range IP-pool object to implement it before reading the documentation and understanding how it works in our corner case with a 1:1 ratio of IP addresses.
Installing Pi-hole on a Synology NAS with Docker is quite trivial; disabling caching is not, so let's see how to do it. You will also learn how to build your own Docker image that overrides the default cache settings. The key info is generic, so it is valuable even if you're not running Docker on a Synology box.
Introduction
In the previous post, Recursive DNS Resolver with AD-Blocking Features, I explained how to implement on a Raspberry Pi a DNS resolver that blocks ads and malicious sites (Pi-hole) and resolves names recursively (Unbound) without relying on official DNS servers like Google's. As I said in that post, I have deployed two Pi-holes and two Unbound servers in my home network, to have a bit of redundancy when I'm doing maintenance and to have a bit of fun 🙂 The first Pi-hole+Unbound stack was deployed on an RPi3, so I had to choose another home device that is active 24x7x365 for the second stack: my Synology DS218+ NAS with Docker was the perfect solution. This time we will focus on the Pi-hole installation, leaving Unbound for another post.
This article is about Docker on Synology, but the info can be perfectly applied to any device on which you're running Docker. If you're starting to use Docker on a device without a GUI like the one on Synology, take a look at the portainer/portainer container, which provides a web GUI to manage Docker images, containers, volumes, etc.
I’ll assume you’ve already installed and configured Docker on Synology via Package Manager. Screenshots are in Italian, but I think it will be easy to understand the equivalent in your language by looking at the positioning of the elements on the interface.
Installing Pi-hole
First of all open Docker app, go to Registry and search for pihole. Select the pihole/pihole image, press Download and select the latest tag.
Pi-hole Image Download
Then, before starting a new container with this image, prepare the following folder structure (you can create it via File Station app):
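A sketch of the layout I'm assuming here (the first two folder names follow the dashes-instead-of-slashes convention used later in this post; the third folder is used in the cache workaround further down):

```shell
mkdir -p /volume1/docker/pihole/etc-pihole
mkdir -p /volume1/docker/pihole/etc-dnsmasq.d
mkdir -p /volume1/docker/pihole/etc_.pihole_advanced
```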
Note: /volume1 is a folder you can see via the SSH CLI and it's the folder containing the shared folders on your NAS. I don't remember if the docker shared folder is created by the Docker app installation procedure; otherwise you can create it via the Synology GUI and then proceed to create the pihole subfolders.
As you can see on the Docker Hub page of pihole/pihole, the first two folders need to be mounted within the container to grant data persistence to your configuration when you need to re-create the container to update the image. The third one is a workaround, which I will explain later, to set CACHE_SIZE to ZERO (read the previous article to understand why we want it to be zero).
First Pi-hole run
Now you can go to the Images section of the Docker app on Synology, select pihole/pihole:latest and press Launch. The app will ask you about the initial configuration of the container. Set the name as you want (the default is pihole-pihole1) and press Advanced to configure the advanced options. Enable automatic restart, if you want, and then move to the Volumes tab: here we will mount the first two folders (ignore the third one) by pressing Add Folder, selecting the folders you've created before and mounting them on the correct paths with Read/Write permissions (I usually name the folder on Synology after the path on which it will be mounted, with dashes instead of slashes):
Volumes’ Bind Mounts
Then move to the Port Settings tab and expose the ports you want to be reachable from the outside. In particular, I want to expose:
DNS Service on 53/UDP and 53/TCP ports
HTTP Service (GUI) on 8080/TCP port
I'll expose the DNS ports on the original ones (home devices can't use non-standard ports for DNS resolution) while I move the HTTP port to 8080/TCP (80/TCP is used by Synology):
Note: the Synology GUI shows you the services that are declared as exposed in the image's Dockerfile and by default it exposes them on automatically allocated ports. You can remove such mappings if you don't want to expose some of the container's services, or you can expose them on statically known ports as above.
Go to the Environment tab (the last one) and set a variable called WEBPASSWORD to the admin password you will use to access the HTTP GUI (if you don't set it, it will be randomly generated and you will be able to see it in the container logs by double-clicking on the container and then on the Log tab).
Configuration is finished, apply the setup and let Synology Docker launch the newly created container.
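If you prefer the plain Docker CLI over the Synology GUI, the whole setup above corresponds roughly to a run command like this sketch (folder names are the ones I assumed earlier; WEBPASSWORD is a placeholder):

```shell
docker run -d --name pihole-pihole1 \
  --restart unless-stopped \
  -p 53:53/tcp -p 53:53/udp \
  -p 8080:80/tcp \
  -v /volume1/docker/pihole/etc-pihole:/etc/pihole \
  -v /volume1/docker/pihole/etc-dnsmasq.d:/etc/dnsmasq.d \
  -e WEBPASSWORD='your-admin-password' \
  pihole/pihole:latest
```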
Setting CACHE_SIZE to zero
As we've seen in the previous article, we need to disable caching in order to have Unbound's prefetching and caching work as expected. I tried to set CACHE_SIZE=0 via an environment variable and by modifying the /etc/pihole/setupVars.conf that I have on my Synology shared folder (as I did on the RPi3) and that is mounted within the container, but it didn't work. After a bit of study I understood that the initialization script /root/ph_install.sh that is automatically executed by the container configures dnsmasq (the process responsible for Pi-hole DNS resolution and caching) by replacing the placeholders within /etc/.pihole/advanced/01-pihole.conf with the values contained in setupVars.conf, and then it copies that file to /etc/dnsmasq.d/01-pihole.conf. The strange thing is that it does not take CACHE_SIZE from setupVars.conf: instead it has CACHE_SIZE=10000 hardcoded within the script.
This is the content of /etc/.pihole/advanced/01-pihole.conf:
# Pi-hole: A black hole for Internet advertisements
# (c) 2017 Pi-hole, LLC (https://pi-hole.net)
# Network-wide ad blocking via your own hardware.
#
# Dnsmasq config for Pi-hole's FTLDNS
#
# This file is copyright under the latest version of the EUPL.
# Please see LICENSE file for your rights under this license.
###############################################################################
# FILE AUTOMATICALLY POPULATED BY PI-HOLE INSTALL/UPDATE PROCEDURE. #
# ANY CHANGES MADE TO THIS FILE AFTER INSTALL WILL BE LOST ON THE NEXT UPDATE #
# #
# IF YOU WISH TO CHANGE THE UPSTREAM SERVERS, CHANGE THEM IN: #
# /etc/pihole/setupVars.conf #
# #
# ANY OTHER CHANGES SHOULD BE MADE IN A SEPARATE CONFIG FILE #
# WITHIN /etc/dnsmasq.d/yourname.conf #
###############################################################################
You can see this file by double-clicking on the running container, then on the Terminal tab, and creating a new bash session, as in this screenshot, where you can type the highlighted cat command:
Executing “cat /etc/.pihole/advanced/01-pihole.conf” within the container
Alternatively, you can run the same command via CLI by connecting via SSH to the Synology NAS, after becoming root:
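Something like this (the container name is the one chosen earlier, pihole-pihole1):

```shell
docker exec -it pihole-pihole1 cat /etc/.pihole/advanced/01-pihole.conf
```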
You can save this file on your NAS, creating /volume1/docker/pihole/etc_.pihole_advanced/01-pihole.conf and modifying cache-size=@CACHE_SIZE@ to cache-size=0.
Then stop the pihole-pihole1 container, select it, click on Modify and add a new Read-Only file mapping (not a folder mapping this time) to mount /volume1/docker/pihole/etc_.pihole_advanced/01-pihole.conf on /etc/.pihole/advanced/01-pihole.conf within the container:
Mounting 01-pihole.conf with cache-size modified
You can now restart the container, connect to the Pi-hole GUI running on http://your-nas-ip:8080/admin, enter the admin password and check on the Settings page that the DNS cache is set to zero:
Check DNS cache size, it must be ZERO if everything works as expected
Now you can configure Pi-hole as you've seen in the previous article and point it to the Unbound server running on the Raspberry, as in the other Pi-hole setup, as an Upstream DNS. In this way we've doubled the Pi-hole servers, a first step toward redundancy (we still have a single Unbound server, so if the RPi goes down DNS resolution won't work, but let's focus on Pi-hole now and double Unbound in the next article).
You can now try the procedure you will do to update a container on Synology and verify that your settings are persisted and cache size is still zero:
Shutdown the container
Click on Action -> Erase (not Delete). In Italian we have Cancella and Elimina: the first one deletes the container but keeps its configuration (volumes' and ports' mappings, etc.) within the Docker GUI, in order to let you Launch the container again, thus re-creating it from the latest image you've downloaded. Elimina instead deletes the container with all of its settings. You must choose the first action, because we do not want to reconfigure everything again. I don't know the exact words used in English, but I think they could be Erase (Cancella) and Delete (Elimina), so try to identify the correct one (the one to avoid should be the lowest one in the menu).
Go back to the Registry and search for pihole again and re-download the latest version.
When the download is finished, launch the container again: it will be restarted with the newly downloaded image and with all of your settings in place.
Erase the container in order to re-create it with an updated image
Note: if you open the Pi-hole GUI, you will be notified about new versions by looking at the footer of the page. If version numbers are blinking in red, it means that there is an official update available. By the way, it could be necessary to wait some days for an updated container image (you can go to the pihole/pihole Docker Hub page and check whether the latest tag has been recently updated). Automatic container updates can be done via the containrrr/watchtower container running on Synology, but this is beyond the scope of this article.
What I don’t like about overwriting 01-pihole.conf
As you've seen above, we've overwritten the content of the /etc/.pihole/advanced/01-pihole.conf file with our modified one in order to force cache-size to zero. If a new version of pihole/pihole with some changes in that file comes out, we won't use it, because we mount our version of the file over the updated one in the container. So I wanted to modify the CACHE_SIZE=10000 setting in the ph_install.sh setup script of the container, in order to set it to CACHE_SIZE=0.
Building my own pihole-nocache image
The first thing you could think of is “let’s copy ph_install.sh script, modify the variable and mount that file over ph_install.sh script of the container base image”, but this would be exactly the same as overwriting 01-pihole.conf file.
So I thought I could simply replace that instruction within ph_install.sh without replacing the whole file with my own copy.
DISCLAIMER: the following instructions require SSH access to the Synology with the ability to become root, so you must be the NAS administrator. Understand the commands before trying them in a production environment. There should be no risk, but I'm not responsible for bricking your Synology Docker environment.
First of all I’ve created the following folder:
mkdir /volume1/docker/_IMAGES/pihole-nocache
Then I created a Dockerfile within that folder that instructs docker build to create a new image based on the official pihole/pihole:latest with my modification applied via the sed command:
# cat /volume1/docker/_IMAGES/pihole-nocache/Dockerfile
FROM pihole/pihole:latest
RUN sed -i -e "s:CACHE_SIZE=[0-9]\+:CACHE_SIZE=0:g" /root/ph_install.sh
So, we replace whatever value CACHE_SIZE has been set to in the ph_install.sh script with ZERO.
Then you can build your image with the following command:
# docker build --pull /volume1/docker/_IMAGES/pihole-nocache/ -t pihole-nocache:latest
Sending build context to Docker daemon 2.048kB
Step 1/2 : FROM pihole/pihole:latest
latest: Pulling from pihole/pihole
Digest: sha256:3a39992f3e0879a4705d87d0b059513af0749e6ea2579744653fe54ceae360a0
Status: Image is up to date for pihole/pihole:latest
---> eb777ee00e0c
Step 2/2 : RUN sed -i -e "s:CACHE_SIZE=[0-9]\+:CACHE_SIZE=0:g" /root/ph_install.sh
---> Running in 4305759e9f70
Removing intermediate container 4305759e9f70
---> ffb77613b225
Successfully built ffb77613b225
Successfully tagged pihole-nocache:latest
After doing this, you will find the pihole-nocache image in the Images section of the Docker app on Synology and you will be able to create a new container based on it by following the steps you've done before for the pihole/pihole image. If you've deleted the pihole-pihole1 container, you can re-create a new one with the same volumes and port mappings, except for the /etc/.pihole/advanced/01-pihole.conf file, which is no longer necessary. Let's start the new container and check that the cache is set to zero without mounting the 01-pihole.conf file, as expected.
Updating the container requires you to re-launch the docker build command (the --pull option forces the build process to search for an updated version of the base image) instead of performing the download described in step 3 of the update procedure described before.
Uploading the image to your private registry
As I’ve said before, I have a Watchtower container running on Synology NAS that regularly updates my containers by pulling new images and re-creating them automatically. In order to allow it to update my pihole-nocache image, I’ve created a private registry and each night I rebuild the pihole-nocache image with the command explained above and upload it to my registry.
First of all I created a registry container with the following commands:
# Create registry folder with required subfolders
mkdir /volume1/docker/registry
mkdir /volume1/docker/registry/certs
mkdir /volume1/docker/registry/var_lib_registry
# Create symbolic links to my Synology NAS cert/key, renewed
# automatically via Let's Encrypt (when the certificate is renewed,
# restart the registry container in order to refresh it)
ln -s /usr/syno/etc/certificate/system/default/fullchain.pem /volume1/docker/registry/certs/domain.crt
ln -s /usr/syno/etc/certificate/system/default/privkey.pem /volume1/docker/registry/certs/domain.key
# Start the registry on port 55000/TCP
docker run -d \
-p 55000:5000 \
--restart=always \
--name registry \
-v /volume1/docker/registry/certs/domain.key:/certs/domain.key \
-v /volume1/docker/registry/certs/domain.crt:/certs/domain.crt \
-e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/domain.crt \
-e REGISTRY_HTTP_TLS_KEY=/certs/domain.key \
-v /volume1/docker/registry/var_lib_registry:/var/lib/registry \
registry:2
Note: in my home network I have a local resolution for the public name my-nas-fqdn.synology.me of my NAS, in order to resolve it to the private IP address of my Synology box. In this way the Synology certificate and key I'm mounting in the registry container will work fine, because they match the FQDN I'm using to reach the container. More detail about deploying a private registry in plain HTTP can be found in the official documentation at this link.
You can then test whether the registry is listening on port 55000/TCP via the following command (CTRL^C to terminate execution), which should show you the public certificate of your NAS:
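One way to do it (my suggestion; any TLS client works) is with openssl s_client, which prints the certificate presented by the registry:

```shell
openssl s_client -connect my-nas-fqdn.synology.me:55000 -showcerts
```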
You can add this registry in the Images section of the Docker app and make it active with the my-nas-fqdn name, for example, by clicking on Add, then Add from URL, and entering https://my-nas-fqdn.synology.me:55000 as the URL.
This allows you to search images in your private registry. Just remember to re-activate the default one when you need to search for official images.
Now we must build the image, tagging it in order to let Docker know it is an image that will be found on our private registry and not on Docker Hub, by changing the -t option in the following way:
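Based on the build command shown earlier and the registry URL above, the tagged build and the push look like this (these are the same commands used in the scheduled script later in this post):

```shell
docker build --pull /volume1/docker/_IMAGES/pihole-nocache/ \
  -t my-nas-fqdn.synology.me:55000/pihole-nocache:latest
docker push my-nas-fqdn.synology.me:55000/pihole-nocache:latest
```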
Now, if you do a search on your private registry from the Docker app on Synology, you will find the pihole-nocache image, and you will be able to download its latest tag.
Enabling automatic re-build and upload of pihole-nocache on my private registry
Finally, in order to let Watchtower find an updated version of pihole-nocache on my private registry, I've scheduled the following script to run every day on my NAS. It re-creates and re-uploads with the latest tag all the images described by a Dockerfile within each subfolder of the /volume1/docker/_IMAGES folder:
#!/bin/bash
LS="/bin/ls"
RM="/bin/rm"
TEE="/bin/tee"
DATE="/bin/date"
DOCKER="/usr/local/bin/docker"
IMAGESDIR="/volume1/docker/_IMAGES/"
LOG_DIR="/volume1/homes/gianni/scripts/logs"
LOG_FILE="$LOG_DIR/docker_images_rebuild.log"
$RM -f $LOG_FILE
$DATE | $TEE -a $LOG_FILE
for image in $($LS $IMAGESDIR); do
$DOCKER build --pull $IMAGESDIR/$image/ -t my-nas-fqdn.synology.me:55000/$image:latest | $TEE -a $LOG_FILE
$DOCKER push my-nas-fqdn.synology.me:55000/$image | $TEE -a $LOG_FILE
done
When you create a new image with the latest tag, the previous one's tag becomes <none>. It can't be removed until the container is updated, because it is still in use.
# docker image ls | grep "none\|TAG"
REPOSITORY TAG IMAGE ID CREATED SIZE
my-nas-fqdn.synology.me:55000/pihole-nocache <none> fb155ea738a9 15 hours ago 333MB
When it is no longer used, you can remove it via the following command:
docker rmi IMAGE_ID
replacing IMAGE_ID with the ID shown by docker image ls.
I’ve scheduled a weekly job that removes dangling images in order to save space (it also shows you how to send notifications to a user within Synology web GUI via synodsmnotify tool):
#!/bin/bash
DOCKER="/usr/local/bin/docker"
TEE="/bin/tee"
RM="/bin/rm"
ECHO="/bin/echo"
DATE="/bin/date"
NOTIFY="/usr/syno/bin/synodsmnotify"
LOG_DIR="/volume1/homes/gianni/scripts/logs"
LOG_FILE="$LOG_DIR/docker_images_cleanup.log"
$RM -f $LOG_FILE
$DATE | $TEE -a $LOG_FILE
IDS=$($DOCKER images | grep "<none>" | awk "{print \$3}")
if [ "x$IDS" == "x" ] ; then
$ECHO "No docker images with tag <none> found :)" | $TEE -a $LOG_FILE
$NOTIFY gianni "Docker images OK" "No docker images with tag <none> found :)"
exit
fi
$DOCKER images --all | $TEE -a $LOG_FILE
$DOCKER rmi -f $IDS | $TEE -a $LOG_FILE
$NOTIFY gianni "Docker images cleaned-up" "Cleaned up docker images with tag <none>"
Today I updated the image and the daily run of Watchtower upgraded it and deleted the dangling old image (I use the soulassassin85/docker-telegram-notifier container to send notifications to a private Telegram channel):
Telegram notification sent by Watchtower
Conclusions
I hope you've enjoyed reading this post; maybe it can save you a bit of time if you want to solve problems like disabling the cache on a dockerized Pi-hole, or if you are in the process of learning Docker and you want to experiment a bit like me. If you have comments, feel free to write me below; I started learning Docker two weeks ago, so there may be something that could be done better or in a more efficient way 🙂
Wouldn't it be amazing to have a home DNS server which filters advertisements, malicious sites or other bad sites' categories and recursively resolves names without using official DNS servers like the ones provided by Google, Cloudflare or OpenDNS, for example?
Provider DNSs
The majority of people use the internet at home without messing with the configuration of the modem/router provided by the telco, and they use the provider's DNS servers, which are returned automatically by the provider when the modem brings up the PPPoE connection at power-up.
Free public (and faster) DNS servers
The first step toward a better surfing experience at home is to connect to the router and change the DNS servers returned on the local LAN by the DHCP server, replacing them with some public DNS servers like Cloudflare's 1.1.1.1, which is very fast, or one of its variants like 1.1.1.2 (no malware domains) or 1.1.1.3 (no malware and no adult content).
The advantage of using these DNS servers is that they are really fast compared to the providers' ones. The disadvantage of using DNS servers like Google's is that you give Google a lot of information about the domains you browse, because each time you visit www.mysecretsite.org you're asking Google "what's the IP address of www.mysecretsite.org?", thus providing it with the whole list of domains you visit.
Recursive DNS Resolver@Home
The next step is to configure a Recursive DNS resolver in your home network.
A recursive DNS resolver does not ask well-known public DNS servers to resolve Fully Qualified Domain Names (FQDNs); instead, it queries the root servers (which it must know in advance), asking them which DNS servers are authoritative for the top-level domain of the FQDN. Then it goes on, asking one of the DNS servers it got about the next part of the domain, and so on: it acts recursively.
Here is how we can resolve the FQDN www.medium.com recursively:
# Ask the root server M for the list of DNS servers that can
# resolve the com TLD, i.e. which Name Servers (NS) we should use.
# Here I'm using @DNS-FQDN, which implies a resolution of DNS-FQDN
# into an IP address.
dig com NS @M.ROOT-SERVERS.NET
[...]
com.              172800  IN  NS  f.gtld-servers.net.
[...]
# Ask for the NS to use for medium.com
dig medium.com NS @f.gtld-servers.net
[...]
medium.com.       172800  IN  NS  kip.ns.cloudflare.com.
[...]
# Finally, ask for the A (IPv4 Address) record of www.medium.com
dig www.medium.com A @kip.ns.cloudflare.com
[...]
www.medium.com.   300     IN  A   162.159.152.4
[...]
The problem of recursive DNS resolution
If you simply enable a recursive DNS server that recursively resolves every FQDN, you will take a big performance hit in DNS resolution, because recursion requires a lot of time (you could go from 20-30 ms to 200-300 ms for each query, or even more).
Caching and prefetching + serving expired 0-TTL resolutions is the way
If you enable an home recursive DNS, you definitely want to enable caching and prefetching:
Caching: the resolver recursively resolves an FQDN and then stores it in its local cache for the amount of time specified by the TTL (5 minutes for the www.medium.com resolution above)
Prefetching: the resolver can proactively prefetch FQDNs in cache when you ask for them and the remaining life is under a certain percentage of the returned TTL
Serving expired records while resolving them: if devices in your LAN ask for www.medium.com every 10 minutes, there won't be a resolution in the resolver cache, since it has a TTL of 5 minutes. What the resolver can do is return an expired resolution with a TTL of zero (so that the operating system of the device does not cache it) while starting an immediate recursive resolution for that domain. Recursive resolvers like Unbound support serving expired records, and you can specify a maximum amount of time during which an expired record is returned if the resolver is not able to refresh it recursively (usually you set it to 1 hour, to avoid returning resolutions for removed domains for too long)
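In Unbound terms, the three behaviors above map onto a handful of unbound.conf options; a sketch (the values are examples, not recommendations):

```
server:
    # keep resolutions in cache and refresh popular ones before they expire
    prefetch: yes
    # hand out expired records (with TTL 0) while re-resolving them
    serve-expired: yes
    # but never serve a record that expired more than 1 hour ago
    serve-expired-ttl: 3600
```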
Avoiding the Advertisements.. and thousands of other bad sites
As I've already said above, there are some public DNS resolvers, like Cloudflare for Families, that can block some undesired DNS resolutions, for example the ones related to adult content, if you want a safer browsing experience for your kids.
What should you do if you want much greater flexibility and you also want to implement recursive resolution on your own at home? In this case Pi-hole is the solution: it acts like a black hole for bad domains, hence the -hole in the name. I think the Pi is due to the fact that a Raspberry Pi is a perfect candidate to run it in your home network, but I'm just guessing 🙂
Implementing the Solution on a Raspberry-Pi (or equivalent)
Pi-hole Configuration
You can buy a Raspberry Pi for a few bucks; you don't need a new RPi4, an RPi3 like the one I have (and maybe also one of the previous versions) is perfectly fine.
I won’t cover every aspect of the configuration; I’ll just focus on the adlists, the DNS setup and caching.
Once installed, you can set an admin password via the following command:
pihole -a -p
Then you can connect to the admin interface on http://RPI_ADDRESS/admin and choose Login on the left menu to log in with the password you just set.
First of all, you can change the DNS to which PiHole will forward queries by going into Settings -> DNS
Pi-hole DNS Forwarders Setup
As you can see, you can easily choose a primary and secondary IPv4 server among Google, OpenDNS, etc., or you can set your own Upstream DNS servers on the right. Keep this in mind, since we will use the custom setup to forward queries to the Unbound recursive resolver.
Then there is the cache: if you just want to add the Pi-hole layer of filtering on top of the standard resolution via Google or OpenDNS, you should leave the cache as it is. If you are going to use Unbound, with its own cache, DNSSEC and prefetching mechanism (which is triggered by client DNS queries), you must disable DNSSEC and caching on Pi-hole.
In the same page where you’ve just set up the Upstream DNS servers, you must disable DNSSEC if it is enabled (since it will be done by Unbound). I’ve also enabled the forwarding of reverse lookups for private IP addresses, because they will be handled by Unbound; otherwise, if you use public DNS as upstream servers, leave enabled the option that blocks reverse lookups for private IP addresses, since they can leak information about your network to the outside.
Private IP Reverse Lookup and DNSSEC settings
Caching cannot be disabled via the GUI; it is necessary to do it via the CLI. So, edit /etc/pihole/setupVars.conf and set the following variable:
CACHE_SIZE=0
then restart pihole with:
systemctl restart pihole-FTL
Reconnect to the GUI, go to Settings -> System and have a look on the right side: you should see the DNS cache entries with all zeros (some caching still happens, but almost nothing, and it is related to local resolutions).
By default, Pi-hole blocks about 60k domains with the pre-configured adlist (at the time of writing; it may change in the future), but you can increase the number of blocked domains by adding other lists. You can find more information in this useful blog post:
Otherwise you can pick your own lists from FilterLists by filtering for lists which are compatible with Pi-hole.
Note: you can no longer add the lists you find in the blog post above by editing the adlists.list file via the CLI; now you must go via the GUI into the Group Management menu on the left. I wanted to add the lists used in the blog post above, so I’ve created some groups with different categories in Group Management -> Groups:
Groups Management
Groups are useful because with a single click you can enable or disable an entire group of adlists, so I suggest using them.
Then go to Groups -> Adlists and add some of the lists that you’ve found:
Adlists Management
As it is written in gray within the Address input box, you can paste more than one URL at once; they say space-separated URLs, but newline-separated URLs like the ones you find in the blog post I’ve linked work too. After you’ve imported the lists, you can change the group assignment by removing the default group and assigning the category you want. Then go to Tools -> Update Gravity (yeah, it is a black hole, you are going to increase its gravitational force 🙂) and after a while (it can take several seconds) the number of blocked domains shown in the upper right corner of your dashboard will increase from the original 60k domains to some hundreds of thousands, as in my case:
Unbound configuration
Finally, we must configure the recursive resolver, to complete our DNS AD-Free & Recursive stack.
I’ve installed it on the same RPi3 as Pi-hole, so I need to change the port on which it listens for DNS queries, since the standard 53/UDP port is used by Pi-hole (and we want Pi-hole to stay on the standard port, since our devices can’t be configured to use non-standard ports for DNS queries). So, we will use 50053/UDP.
Then you can customize the configuration by editing /etc/unbound/unbound.conf
I’ll post here the result of uncommented lines in my configuration and add some comments inline to explain it better:
# include additional configuration files (query data minimization for privacy and DNSSEC for security purposes)
# qname-minimisation.conf adds to server section
# qname-minimisation: yes
# root-auto-trust-anchor-file.conf adds
# auto-trust-anchor-file: "/var/lib/unbound/root.key"
include: "/etc/unbound/unbound.conf.d/*.conf"
server:
    verbosity: 0
    statistics-cumulative: yes
    # RPi3 has a quad-core CPU, so let's enable 4 threads
    num-threads: 4
    # Listen on port 50053 on every interface
    interface: 0.0.0.0@50053
    # Can enable faster resolutions in multithreaded configuration
    so-reuseport: yes
    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    access-control: ::0/0 refuse
    access-control: ::1 allow
    access-control: ::ffff:127.0.0.1 allow
    # Allow queries from the local LAN (useful for testing, but can be
    # omitted if queries will come only from Pi-hole running on the
    # same host)
    access-control: 192.168.0.0/24 allow
    root-hints: "root.hints"
    private-address: 192.168.0.0/24
    private-domain: "your_local_domain"
    # Enable prefetching of cache elements that are queried when the
    # remaining life is less than 10% of the original TTL
    prefetch: yes
    domain-insecure: "your_local_domain"
    # Enable serving expired cache entries for at most 1 hour if
    # it is impossible to refresh them
    serve-expired: yes
    serve-expired-ttl: 3600
    # Enable reverse resolution of local DNS names
    local-zone: "0.168.192.in-addr.arpa." nodefault
    unblock-lan-zones: no
    insecure-lan-zones: no
python:
remote-control:
    # This can be used for querying Unbound about its internal statistics
    control-enable: yes
    control-interface: 0.0.0.0
    control-port: 8953
# In my case, I have forward and reverse resolution for my home domain
# set up on my router 192.168.0.254, so I want to forward local resolutions'
# queries to it
forward-zone:
    name: "your_local_domain"
    forward-addr: 192.168.0.254
forward-zone:
    name: "0.168.192.in-addr.arpa"
    forward-addr: 192.168.0.254
Restart unbound with the new configuration:
systemctl restart unbound
You can test that Unbound is working by querying it directly on the device where it is running (for example with dig pointed at 127.0.0.1 with the -p 50053 option): the first resolution of a domain takes a while, but if you repeat the query you get an immediate reply, because now it is in cache.
Pointing Pi-hole to Unbound
Now that we have a working recursive DNS resolver, go back into the Pi-hole GUI, Settings -> DNS, and configure Unbound as a resolver. In this case it is running on the same system (127.0.0.1 means the local system) but on a non-standard port, so let’s go with the following:
127.0.0.1#50053
Now, repeat the DNS resolution test with the commands above (remove the -p 50053 option, because Pi-hole is running on the standard 53/UDP port) and check that everything works fine.
If everything is fine, go back to the Provider’s router DHCP configuration and replace the Google/Cloudflare/OpenDNS or other DNSs with the IP address of your Raspberry. Then you can reboot the modem to force your devices to reconnect to the local LAN and get the updated DNS from DHCP.
Monitoring Pi-hole
Pi-hole has its beautiful GUI that lets you monitor its performance, so go to the Dashboard: on the top bar you will see the amount of ads your Pi-hole has blocked today. We always talk about ads, but as you have seen it can also block malware and other bad stuff listed in the adlists you’ve configured.
You can also query Pi-hole’s HTTP API to get a JSON output of its internal stats. I’ve used it to send data to my InfluxDB2 server and to monitor Pi-hole performance via Grafana:
You can get some info about how Unbound is performing by using unbound-control. Just launch unbound-control to see all the info you can ask for. We’re interested in statistics, so let’s run unbound-control stats_noreset (by default, if you invoke it with stats, the statistics are reset after being read, unless you specify statistics-cumulative: yes as I’ve done in my unbound.conf).
You’ll get something like this (I’ll omit the threadX rows that give per-thread statistics, in my case from thread0 to thread3 since I’ve specified to run with 4 threads):
I also send this info, after parsing it with Python, to my InfluxDB2 server.
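The parsing step is straightforward, since the stats output is a flat list of key=value lines. Here is a minimal sketch of how it can be done (the sample lines and numbers are illustrative, not my real data):

```python
def parse_unbound_stats(text):
    """Parse `unbound-control stats_noreset` output (key=value lines)
    into a dict of floats, skipping the per-thread rows."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        if not value or key.startswith("thread"):
            continue
        stats[key] = float(value)
    return stats

sample = """\
total.num.queries=117087
total.num.cachehits=112804
total.num.cachemiss=4283
total.num.prefetch=21777
thread0.num.queries=29012
"""
stats = parse_unbound_stats(sample)
hit_ratio = stats["total.num.cachehits"] / stats["total.num.queries"]
print(f"cache hit ratio: {hit_ratio:.1%}")  # 96.3%
```

From here, pushing the resulting dict to InfluxDB is just a matter of formatting it as line protocol or using an InfluxDB client library.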
Visualizing the performance of the stack
I won’t go into the details of my monitoring panes in Grafana, which could be an interesting topic for a future post, but I can show you some data from my Pi-holes and Unbound servers (yes, I have two Pi-holes and each of them points to the two Unbound servers, with one stack running on an RPi3 and the other running in Docker containers on my Synology NAS, stuff for another post too 🙂).
As you can see, we’ve almost no caching on Pi-holes:
Almost no caching on Pi-holes
Unbound gives us a lot of info:
Cache hits are over 96% of the number of queries received by both Unbound servers
We have a lot of prefetches, which is good because they increase the cache hits
We also have a good amount of expired entries, i.e. queries served via expired records (which are counted as cache hits)
What can’t be answered with the cache (even with expired entries) requires a recursive query/reply which is counted by the last entry below
We have no IP-ratelimited queries, and this is normal for a home network with few devices
A lot of Cache Hits and prefetch on Unbound, really good!
Note: you may ask yourself why the number of forwarded queries on the Pi-holes is lower than the number of queries received by the Unbound servers, but this is simply due to the fact that Pi-hole numbers are reset every day and sometimes I restart services (updates, device reboots etc.)
Grafana is like a drug: once you have the data, you start plotting everything. Here you can see some other stuff I plot about my redundant DNS stack 🙂
Note: the utilization of the Pi-holes and Unbound servers is not balanced since 1) the operating system tends to use the first DNS, which in my case is the Pi-hole running on the Raspberry, and 2) I think (but I’m not sure) that Pi-hole also tends to forward queries to the first DNS, or maybe it has some mechanism that detects which one answers faster and tends to use it the most
DNS Resolvers Dashboard #1
DNS Resolvers Dashboard #2
Conclusions
This is my first new article here on Medium; I hope someone will find it inspiring and maybe useful. You can reach me on the social pages linked in my profile if you have questions, or you can comment here 🙂
A lot of people that have a server or NAS at home need to publish some services on the Internet, through the so-called port forwarding technique on their home routers. Exposing services makes them detectable by port scanners, which can figure out what machines or servers are running in your home network, which versions of software they run and what vulnerabilities may affect them. This analysis can be the first step for a malicious user that wants to penetrate your intranet.
Wouldn’t it be amazing to have no services exposed to simple scanners that continuously scan the public network’s IP addresses?
The linux knockd daemon solution
Some years ago my home router was a simple low-power alix-1c mini-computer, and I accomplished the task of having no services exposed to port scanners by using the port knock server knockd: as the linked man page explains, this service listens to every packet received on a network interface and can execute specific commands upon the reception of a single packet or, more usefully, a sequence of packets. As the examples on the man page show, you can tell knockd to insert an iptables rule when it sees a specific sequence of packets within a specified amount of time, then wait some seconds and execute another command when the timer expires.
Let’s suppose you have a server with IP address 192.168.0.100 listening for SSH connections on port 22/TCP, which you want to expose on port 1022/TCP on the public network, but only when the router sees a sequence of packets like 2222/TCP, 3333/UDP, 4444/TCP received within a 15-second time frame, and only for 10 seconds.
To accomplish what I’ve said above, I would configure a permanent port forwarding rule that translates packets with destination port 1022/TCP received on the internet-facing interface, such as ppp0, of the Linux router into packets with destination port 22/TCP and destination IP address 192.168.0.100. This is accomplished in the PREROUTING chain of iptables’ NAT table. Then I would put a jump to a special KNOCKD_FWD_RULES chain (I’ve named it FWD_RULES because I could also have a special KNOCKD_INPUT_RULES chain for rules related to traffic destined to the Linux router itself) just after the rule that allows ESTABLISHED,RELATED traffic (which has already been authorized) in the FORWARD chain of the FILTER table, and configure knockd to add/remove rules in that special chain to allow traffic destined to 192.168.0.100 on port 22/TCP.
Linux implementation example
Iptables configuration (it must be adapted to your setup; here I’m considering a fresh firewall with a DROP policy on the chains, i.e. drop everything not explicitly allowed):
iptables -t nat -A PREROUTING -i ppp0 -p tcp -m tcp --dport 1022 -j DNAT --to-destination 192.168.0.100:22
iptables -N KNOCKD_FWD_RULES
# Filter table is implicit
iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -j KNOCKD_FWD_RULES
Now, when you are outside your home network and need to connect to your home server, you can do the following:
Send the three magic packets to the public IP address of your home router, by using an FQDN such as myhomenetwork.no-ip.org (you can register 3 hostnames for free on No-IP, for example)
Establish an SSH connection within 10 seconds from the sending of the magic sequence of packets
So, let’s try to connect from a client that exits on the Internet with IP x.x.x.x
ssh -p 1022 myhomenetwork.no-ip.org
[timeout - firewall closed]
# Send the magic sequence of packets from client x.x.x.x
nmap -Pn -sS -p 2222 myhomenetwork.no-ip.org
nmap -Pn -sU -p 3333 myhomenetwork.no-ip.org
nmap -Pn -sS -p 4444 myhomenetwork.no-ip.org
[firewall opens the door, new connections allowed from x.x.x.x]
sleep 2
ssh -p 1022 myhomenetwork.no-ip.org
[connection established]
[firewall closes the door, no NEW connections are allowed from x.x.x.x]
[established connection from x.x.x.x keeps going on]
What I find beautiful about this approach is that:
Your firewall is opened only for your client public ip address and only for 10 seconds, so the service is exposed only for the IP address that generated the magic sequence of packets.
The service stops being exposed after 10 seconds, and if you established a connection within that small time frame, it will not be stopped after the special rule is removed from KNOCKD_FWD_RULES, because your traffic matches the established connections’ accept rule at the beginning of the FORWARD chain. This implies that if your client is NATted on a public IP address with tens of other clients, your service will be opened to the other clients behind the same public address only for 10 seconds, then it will “disappear”.
As you can see on the knockd man page I’ve linked above, you can implement even more complex behaviors, for example by using a file with a list of one-time sequences that can trigger a knockd event: each time a sequence is used it is invalidated, in order to avoid replay attacks if someone sniffing the network where your client is connected learns the magic sequence you use.
Knocking on a MikroTik door
Three years ago I moved from a Linux router to a MikroTik hAP^2 router in my home network, and I wanted to implement a sort of knockd by using the tools provided by RouterOS. We don’t have a knockd implementation for RouterOS, but we can implement the example of the previous section using dynamic address-lists, which are a useful way to group IP addresses that can be used as a matching condition in firewall rules and that allow you to add an IP address with an expiration time.
We can do what follows:
Check if a packet with destination port 2222/TCP and the SYN flag set is received on the ppp0 interface, and add the source IP address to the KNOCK_FIRST address list with an expiration time of 15 seconds.
Check if a packet with destination port 3333/UDP is received on the ppp0 interface from an IP address within the KNOCK_FIRST address list, and add the source IP address to the KNOCK_SECOND address list with an expiration time of 10 seconds (I decrease the timeout since fewer packets are needed to complete the magic sequence).
Check if a packet with destination port 4444/TCP and the SYN flag set is received on the ppp0 interface from an IP address within the KNOCK_SECOND address list, and add the source IP address to the TRUSTED_SOURCES address list with an expiration time of 10 seconds, which is the time frame in which the client must start the connection to the temporarily exposed service.
Implement a forwarding rule that allows new sessions from TRUSTED_SOURCES to the intranet service on port 22/TCP of IP 192.168.0.100 (we still allow established/related sessions to go on, without checking the source).
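The steps above are essentially a small state machine keyed on the source IP address. Here is a Python sketch that models the behavior (ports and timeouts as in the steps above; this is just a model for reasoning about it, not RouterOS code):

```python
class KnockStateMachine:
    """Model of the three-stage knock: 2222/TCP -> 3333/UDP -> 4444/TCP."""
    STAGES = [("tcp", 2222, "KNOCK_FIRST", 15),
              ("udp", 3333, "KNOCK_SECOND", 10),
              ("tcp", 4444, "TRUSTED_SOURCES", 10)]

    def __init__(self):
        self.lists = {}   # (list_name, src_ip) -> expiry time

    def _in_list(self, name, src, now):
        return self.lists.get((name, src), -1) > now

    def packet(self, src, proto, dport, now):
        """Process one new-connection packet; return True if src is trusted."""
        for i, (p, port, add_to, timeout) in enumerate(self.STAGES):
            if (p, port) == (proto, dport):
                # a stage only fires if the previous list still holds the source
                prev_ok = i == 0 or self._in_list(self.STAGES[i - 1][2], src, now)
                if prev_ok:
                    self.lists[(add_to, src)] = now + timeout
        return self._in_list("TRUSTED_SOURCES", src, now)

fw = KnockStateMachine()
fw.packet("x.x.x.x", "tcp", 2222, now=0)
fw.packet("x.x.x.x", "udp", 3333, now=1)
fw.packet("x.x.x.x", "tcp", 4444, now=2)
print(fw._in_list("TRUSTED_SOURCES", "x.x.x.x", now=5))    # True: door open
print(fw._in_list("TRUSTED_SOURCES", "x.x.x.x", now=20))   # False: expired
```

Note how sending the knocks out of order, or too slowly, never populates TRUSTED_SOURCES, which is exactly what the chained src-address-list conditions enforce on the router.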
MikroTik implementation example
This is the MikroTik implementation:
# NAT TABLE
/ip firewall nat
# PREROUTING (dstnat) CHAIN
add action=dst-nat chain=dstnat in-interface=ppp0 protocol=tcp dst-port=1022 \
    to-addresses=192.168.0.100 to-ports=22 \
    comment="Port forward 1022/TCP to 192.168.0.100:22"
# FILTER TABLE
/ip firewall filter
# FORWARD CHAIN
# action fast-track allows the connection to be processed via fast-path,
# but an identical rule with accept action is required if the connection
# follows the slow-path.
# More details on https://wiki.mikrotik.com/wiki/Manual:IP/Fasttrack
add action=fasttrack-connection chain=forward comment="FastTrack Established/Related" \
    connection-state=established,related
add action=accept chain=forward comment="Allow Established/Related" \
    connection-state=established,related
# Allow new connections from Trusted sources
add action=accept chain=forward protocol=tcp connection-state=new \
    src-address-list=TRUSTED_SOURCES dst-port=22 dst-address=192.168.0.100 \
    comment="Allow SSH from Trusted to 192.168.0.100 on port 22/TCP"
# Drop everything not explicitly allowed
add action=drop chain=forward
# INPUT CHAIN
# Port knocking rules
add action=add-src-to-address-list chain=input connection-state=new protocol=tcp \
    dst-port=2222 address-list=KNOCK_FIRST address-list-timeout=15s
add action=add-src-to-address-list chain=input connection-state=new protocol=udp \
    dst-port=3333 src-address-list=KNOCK_FIRST address-list=KNOCK_SECOND address-list-timeout=10s
add action=add-src-to-address-list chain=input connection-state=new protocol=tcp \
    dst-port=4444 src-address-list=KNOCK_SECOND address-list=TRUSTED_SOURCES address-list-timeout=10s
add action=drop chain=input
Here you can see the configuration in a more readable syntax-highlighted format (I’m using Sublime Text with MikroTik syntax highlighting):
You can monitor the insertion of the public IP address in the address-lists with the following command:
[admin@MikroTik] > /ip firewall address-list print
Flags: X - disabled, D - dynamic
# LIST ADDRESS CREATION-TIME TIMEOUT
0 D KNOCK_FIRST x.x.x.x jun/15/2020 15:17:30 12s
1 D KNOCK_SECOND x.x.x.x jun/15/2020 15:17:30 8s
2 D TRUSTED_SOURCES x.x.x.x jun/15/2020 15:17:30 9s
Knocking from a mobile client
In the previous sections I’ve shown how you can knock from a computer by using nmap, but what if you need to connect to your home server via smartphone or tablet? You can easily find port-knocking applications such as Port Knock on iOS, which lets you specify the sequence of TCP/UDP packets to send and a port to probe after sending the magic sequence, and which can even establish connections via protocols such as SSH (I prefer to use dedicated clients such as Prompt on iOS). The 10-second timeout we’ve used may be too short if you plan to connect via mobile using different apps for port knocking and for the SSH connection; I think you can safely increase it to 20-30 seconds.
Here follows an example of the Port Knock iOS app configuration that knocks on myhomenetwork.no-ip.org on ports 2222/TCP, 3333/UDP and 4444/TCP, and after that probes port 1022/TCP to check if it is open, providing you immediate feedback about the reception of the magic sequence by your home router (it also launches an SSH connection to port 1022 with user myuser after the knock, but this can be omitted if you want to use some other client for the SSH connection).
Conclusions
I hope that this article will be useful for anyone who wants to make their home network a little bit less exposed to the threats of the Internet. Again, you can try the configurations I’ve shown on EVE-NG with the MikroTik cloud router before implementing them in your network, or before convincing yourself to buy one of their amazing routers.
Some network devices and PCs can listen for special incoming packets on their ethernet interfaces even when shut down, and this allows them to be powered up with a special magic packet, which is what Wake-On-LAN (WOL from now on) uses.
WOL is usually done by generating a packet whose destination IP address is the broadcast address of the network (in a common 192.168.0.0/24 network, it is directed to 192.168.0.255 or 255.255.255.255), which produces an ethernet frame with FF:FF:FF:FF:FF:FF as the destination MAC address. This broadcast frame is processed by all the hosts on the LAN segment. What makes this packet magic? The fact that its payload must contain six bytes of 0xFF followed by the MAC address of the device to be woken up, repeated 16 times. When the powered-off device’s ethernet card detects this special frame, it powers up the device.
Usually the magic packet is a UDP packet with destination port 0, 7 or 9, but this is not mandatory. BTW, I will use UDP port 9 in the examples.
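The magic packet structure is simple enough to build by hand: six 0xFF bytes followed by the target MAC address repeated 16 times, wrapped in a UDP datagram. Here is a minimal Python sketch (the send part is commented out; the MAC is the placeholder used later in this article):

```python
import socket

def build_magic_packet(mac: str) -> bytes:
    """Magic packet payload: 6 bytes of 0xFF + MAC address repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16

payload = build_magic_packet("aa:bb:cc:dd:ee:ff")
print(len(payload))  # 102 bytes: 6 + 16 * 6

# To actually send it as a broadcast UDP datagram on port 9:
# s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
# s.sendto(payload, ("192.168.0.255", 9))
```

Tools like wakeonlan (used later in this article) do exactly this for you, but seeing the raw payload makes it clear why any UDP port works: the NIC only inspects the frame contents, not the port.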
Suppose you have a NAS in your home network that you would like to power on only when needed, to get some documents you have stored on it, and that you don’t have other always-on devices in the home network to which you can connect in order to use WOL: wouldn’t it be useful to be able to use WOL from the Internet? How can we produce a broadcast frame on the internal LAN from the public network?
I’ve built this simple setup in the EVE-NG network simulator, with a virtual MikroTik router that simulates our home router, with eth1 as the WAN interface (I’m using private IP addressing in 192.168.60.0/24, but consider it a public address exposed on the Internet) and eth2-3-4 grouped in a bridge called lan_bridge with IP address 192.168.1.1/24 and a DHCP server enabled with a 192.168.1.10-192.168.1.50 pool of addresses available for clients on the internal LAN.
We could generate a magic packet directed to the public IP address of our home router, but then how can we force the router to turn it into a broadcast packet? The simplest solution that came to my mind was to use destination NAT to change the magic packet destined to 192.168.60.141 to 192.168.1.255, but on MikroTik or Linux-based routers this doesn’t work (I think directed broadcast forwarding is not supported) and the packet is discarded.
So, how can we generate the magic packet on the 192.168.1.0/24 lan to power up our devices? We can implement the following trick:
Allocate an unused IP address in 192.168.1.0/24, such as 192.168.1.100
Define a static ARP resolution on MikroTik router, setting FF:FF:FF:FF:FF:FF as 192.168.1.100 mac address
Implement a pre-routing destination-NAT rule on the MikroTik router that matches incoming traffic directed to its Internet-facing interface on UDP port X (let’s choose 9999), changing the destination address to 192.168.1.100 and the destination port to 9
Et voilà: now when you send a magic packet to 192.168.60.141 with destination port 9999/UDP, MikroTik’s pre-routing NAT processing will change the destination address to 192.168.1.100. MikroTik will then route the packet toward lan_bridge, which is on that subnet, and when it prepares the ethernet frame that contains the forwarded packet it will put FF:FF:FF:FF:FF:FF as the destination MAC address, thus producing a broadcast frame on the internal LAN, even if the destination IP address is a unicast IP.
Security warning: a packet sent to the 9999/UDP port on the public address of your router will generate a broadcast packet on your internal network, so it is highly recommended to rate-limit the number of packets that are forwarded.
As suggested before, we will implement a forwarding rule that allows the traffic directed to 192.168.1.100, but with a rate-limiting check that allows 1 packet every 10 seconds from a specific public source address, with a burst of 3 (this effectively allows 4 packets to be forwarded; this could be due to how MikroTik’s dst-limit works, and I did not dig into it very much since the practical effect is the same for the purposes of this how-to).
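The observed behavior (4 packets through a quiet bucket with rate 1/10s and burst 3) matches a classic token bucket whose capacity is the burst on top of one base token. Here is a Python sketch of that model (my interpretation of MikroTik’s dst-limit behavior, not its actual implementation):

```python
class TokenBucket:
    """Token bucket: `rate` tokens per second, plus `burst` extra capacity."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst + 1      # burst tokens on top of the base one
        self.tokens = self.capacity    # bucket starts full
        self.last = 0.0

    def allow(self, now):
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1 / 10, burst=3)   # 1 packet / 10 s, burst 3
results = [bucket.allow(now=i * 0.1) for i in range(10)]
print(results.count(True))   # 4 packets pass, 6 are dropped
```

This reproduces exactly what we will see in the MikroTik logs below: 10 magic packets in a row, 4 forwarded, 6 dropped.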
Here you can see the whole configuration in a more readable format with the syntax highlighting in Sublime Text:
Testing MikroTik setup
In order to test what I’ve implemented, I’ve downloaded the wakeonlan software package on my MacBook Pro (from MacPorts) and I’ve generated 10 magic packets in a row with the following command (sudo asks for the password only the first time, so the command is repeated 10 times very quickly):
% for i in $(seq 1 10) ; do sudo wakeonlan -i 192.168.60.141 -p 9999 aa:bb:cc:dd:ee:ff ; done
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
I’ve chosen aa:bb:cc:dd:ee:ff as the MAC address of the device to be woken up. Let’s have a look at the MikroTik logs to see what happens:
As you can see, the pre-routing NAT (dstnat chain) rule is triggered 10 times by the 10 packets above, but the forwarding rule that allows the traffic to pass is triggered only 4 times; the other 6 times we have a drop.
I’ve also started a packet capture on the PC interface Fa0/0 (it’s a virtual router, that’s why the interface has such a name). The Fa0/0 interface is connected to the MikroTik lan_bridge through the eth2 MikroTik interface; in fact it gets an IP address via DHCP:
PC#sh dhcp lease
Temp IP addr: 192.168.1.49 for peer on Interface: FastEthernet0/0
Temp sub net mask: 255.255.255.0
DHCP Lease server: 192.168.1.1, state: 5 Bound
DHCP transaction id: 18AE
Lease: 600 secs, Renewal: 300 secs, Rebind: 525 secs
Temp default-gateway addr: 192.168.1.1
Next timer fires after: 00:04:53
Retry count: 0 Client-ID: cisco-c202.0c94.0000-Fa0/0
Client-ID hex dump: 636973636F2D633230322E306339342E
303030302D4661302F30
Hostname: PC
The Fa0/0 interface receives the magic packet due to its FF:FF:FF:FF:FF:FF destination MAC address, but then it ignores the packet because it does not contain traffic destined to its IP address. The only purpose of the PC is to make the LAN segment active in the lab and to show the reception of the magic packet on the internal LAN. In the following image you can see the 4 magic packets containing the aa:bb:cc:dd:ee:ff MAC address 16 times:
You can also generate the magic packet via phone with apps like WOL on iOS:
You can configure your home router to register its public IP address on a service like No-IP and then configure your WOL app with the FQDN you registered (such as myhomenetwork.no-ip.org).
Conclusions
Credits for this idea go to my boss G.D., who pointed me in the right direction when I was thinking about how to trigger the broadcast packet on my internal LAN from the public network. I hope this will be the first of several quick MikroTik how-tos that show you the flexibility of these incredible low-cost but powerful routers. I’ve spent about 70 euros for a MikroTik hAP^2 router that is able to manage my 1 Gbps internet connection, with 400-500 Mbps wireless data rate peaks on 5 GHz, with tens of firewall rules, two PPPoE connections (one in a specialized VRF, maybe the topic of the next article) and scripts running in the background to manage dynamic access control lists (I check the IP addresses of some well-known FQDNs and add them to a trusted sources list with an expiration time, in order to allow only some public IP addresses to access services in my home network). If you like to experiment, have a look at the MikroTik site, and if you want to experiment without spending a euro, just download the virtual image of the MikroTik Cloud Router and launch it in the EVE-NG network simulator to have some fun! 😉
FreeRADIUS is one of the tens of software tools and devices that a network engineer like me must manage at work, and since I’m a human and I haven’t spent the last 10 years using FreeRADIUS every day, sometimes I have a problem or I simply don’t know how to implement something. Yesterday I wrote to the mailing list about a problem in using the statement Exec-Program-Wait to call a script before accepting or rejecting specific users. I noticed that after moving from v2.x to 3.0.13 the statement was accepted but the script was not invoked. So, after having googled a bit (try to find some meaningful documentation or help about FreeRADIUS: it is quite difficult, the modules are well documented, but they help you only if you already know a lot about FreeRADIUS, otherwise they’re not always so clear) without finding any help online, I wrote an email to the FreeRADIUS Users mailing list:
Hi,
I was using the following syntax on Freeradius 2.x to determine if a user
could connect to a particular IP address, even if the authentication
succeeds, based on some parameters passed to a script:
XXX747 Auth-Type = System, Realm == imp
Service-Type := Login-User,
cisco-avpair = "shell:priv-lvl=2",
Exec-Program-Wait =
"/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address}
%{User-Name} %{Realm}"
It worked on my old 2.x installation, now I'm on the last version available on Red Hat Enterprise 7, which is 3.0.13-10.el7_6. The syntax gives no error, but the script is not invoked (it contains an invocation of the logger system command to put an entry in /var/log/messages and I can't see it), even if the above entry in the users (authorize) file is matched. What could be the problem? If this is the wrong way to implement this check, can you give me a hint on how I should do it on a 3.x FreeRADIUS installation?
Wed Jun 19 17:01:52 2019 : Debug: (12) files: users: Matched entry XXX747
at line 497
Wed Jun 19 17:01:52 2019 : Debug:
/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address}
%{User-Name} %{Realm}
Wed Jun 19 17:01:52 2019 : Debug: Parsed xlat tree:
Wed Jun 19 17:01:52 2019 : Debug: literal -->
/opt/script/radius/bin/check_operator_access.sh
Wed Jun 19 17:01:52 2019 : Debug: attribute --> NAS-IP-Address
Wed Jun 19 17:01:52 2019 : Debug: literal -->
Wed Jun 19 17:01:52 2019 : Debug: attribute --> User-Name
Wed Jun 19 17:01:52 2019 : Debug: literal -->
Wed Jun 19 17:01:52 2019 : Debug: attribute --> Realm
Wed Jun 19 17:01:52 2019 : Debug: (12) files: EXPAND
/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address}
%{User-Name} %{Realm}
Wed Jun 19 17:01:52 2019 : Debug: (12) files: -->
/opt/script/radius/bin/check_operator_access.sh 172.16.120.218 XXX747 at imp
imp
Wed Jun 19 17:01:52 2019 : Debug: (12) modsingle[authorize]: returned
from files (rlm_files)
Wed Jun 19 17:01:52 2019 : Debug: (12) [files] = ok
Thank you in advance for any help.
Best regards,
Gianni Costanzi
After my email, Alan DeKok (Network RADIUS CEO) replied with this hint:
> I was using the following syntax on Freeradius 2.x to determine if a user
> could connect to a particular IP address, even if the authentication
> succeeds, based on some parameters passed to a script:
>> XXX747 Auth-Type = System, Realm == imp
> Service-Type := Login-User,
> cisco-avpair = "shell:priv-lvl=2",
> Exec-Program-Wait =
> "/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address}
> %{User-Name} %{Realm}"
Exec-Program-Wait goes in the first line. It's a check attribute, and isn't a reply attribute.
Alan DeKok.
I immediately tried what he suggested, but it still didn’t work. You can read the remaining messages here:
He gave me some suggestions, then I explained how I was processing requests for some users: I was sending the access request to an authentication proxy called imp, and then I modified the post-auth behavior of FreeRADIUS in order to force the analysis of the users (authorize) file even after a successful authentication on realm imp, to avoid accepting all the users that have a valid account on imp. So, a simple files.authorize directive in the post-auth section, with a check on realm equal to imp, forced the analysis of this user even after a successful authentication on imp:
(I also tried moving Exec-Program-Wait to the first line, as a check item, as DeKok suggested, but then I moved it back to the previous position, where it had always worked on v2.x)
The discussion continued, with Alan suggesting other ways of processing the authentication, and me replying that what he suggested did not implement the authentication flow I wanted. Very soon he started replying in his usual style: as if he has a gun pointed at his head and must answer you, the poor user who doesn’t have in mind the whole documentation and maybe the source code of FreeRADIUS, as he, the God on Earth, does. Then he told me I was childish because I didn’t follow his advice, and at a certain point I summed everything up with a simple question:
To avoid you being so acid (I don’t really understand why, I was quite polite I think): you’ve told me how to call Exec-Program-Wait, with := and on the first line as a check item. I’ve told you that it is not invoked even when I do that. Can you explain why? Where should I check if there is an error? Is there some different requirement compared to previous versions of FreeRADIUS server? I think you can answer me even without further details on this point.
We’re in an open source community where everyone should help the others if the others are polite and correct. I think I’ve been both, so I don’t really understand why you answered in such a bad way.
This is part of his reply:
> you’ve told me how to call Exec-Program-Wait, with := and
> on the first line as a check item. I’ve told you that it is not invoked
> even when I do that. Can you explain me why?
<shrug> You're probably using an old version, or something else is happening.
> Where should I check if there is an error?
It's Open Source. You have the source. Track it down, and supply a patch to fix the bug.
[...] I can't explain your failure to understand. I've explained myself repeatedly. My "bad way" of answering is honest frustration at *you*, who is making it as difficult as possible for me to help you.
This is Open Source. You're not paying for support. Don't complain about the answers you get for free. If it's a bug, you have access to the source. Track it down and fix it. If you're not willing to do that, then the "community" aspect you talked about is bullshit. For you, there is no community. Only others helping you for free, while you refuse to do anything yourself.
I've seen this attitude a lot over the past 20+ years in the open source community. The people complaining the loudest about others are the ones who (a) refuse to follow instructions, and (b) refuse to contribute.
I'll make it simple: follow instructions, read the docs, and you will be able to fix the problem. Keep complaining about how mean we are for helping you, and you will be banned from the list.
As you can see from the highlighted part of his reply, you cannot be part of the community if you don’t read the whole source code, or if you tell the God on Earth Alan that what he suggested doesn’t fit your needs. Read again what I asked more than once, i.e. “give me some hints about where I should look in order to solve Exec-Program-Wait not working”.
Then I wrote two emails in a row, quite upset about Alan’s usual poor behavior:
Some answers:
- I will go through the docs and the help about the exec module and the other things you’ve suggested; I’m not one that doesn’t follow your advice, you’ve said that without any reason
- I’ve asked for hints about why Exec-Program-Wait does not work, and the only answer is “probably it is a bug related to an old version”; I just hoped you’d give me some further points to check. I don’t know if you’ve ever worked in a GDPR/PCI-DSS/etc. compliant company: we are using the latest version available on our Red Hat Enterprise servers, we cannot install the latest development release
- I’m not childish, maybe you are, looking at your answers
- you keep asking why I repeat myself; I did it simply because the workaround you’ve suggested simply does not implement the authentication flow I’ve explained
- have you ever had a look at the other open source communities? At how experts help and answer problems?
Have a nice day/night
Gianni
Alan, you don't have a gun pointed at your head: if you don't have time to explain in detail, or you don't have the will to explain something, just don't answer. If you suggest I implement things that do not really implement the behavior I've described to you, expect me to ask you again. How can you reply "maybe you're not running the latest version, if it is a bug check the source code"? I'm not running FreeRADIUS 1, I'm not on a years-old FreeRADIUS version.
You almost always suggest having a look at the documentation within the configuration files, and it is something I almost always do before asking you anything, because I know how you answer 90% of the time, but that documentation is clear only for someone like you who already knows how it works. This is why there is an ML: to help people like me understand how to implement something, or solve problems that we can't solve by looking at the "not-so-clear" documentation.
You quite simplistically said that I don't follow your advice etc., which is definitely not true. I will dig into the modules' docs and re-read your answers tomorrow. I just hoped you could give me a hint about how to check why Exec-Program-Wait was not working, because it was the simplest way of implementing what I needed and it worked like a charm on previous versions. I'm not saying that I don't want to change it at all because I want to stick to the old configuration; I'm just trying to understand if there is something I can do to understand why it does not work on my configuration, before spending hours understanding how exec or other modules work and whether they can help me fulfill my requirements.
Best regards,
Gianni Costanzi
While I was writing these two emails, this was the reply I received from Alan:
If you're paying RedHat for support, then ask them for help.
> - I’m not childish, maybe you are, looking at your answers
And you're gone.
Alan DeKok.
Then he banned me from the mailing list. After having banned me, he wrote another reply, with a false statement about what I had said and without the possibility for me to post a reply:
For everyone else reading, this is the key problem.
Alan: Use the "exec" module, it will do what you want
Gianni: I don't want to spend hours reading documentation on exec or *other* modules. I'm not sure *if* they will do what I want
It really can't be stated any better than that. After being given a solution, his response is "no, I don't believe you".
That's not an acceptable answer.
Alan DeKok.
At this point I emailed two other people who seem to manage the Users ML with Alan, to try to be unbanned, but I’m still waiting for a reply:
Hi, I'm writing to you to ask to be unbanned from the FreeRADIUS ML; you can have a look at the conversation I had with Alan DeKok and I don't think I deserve a ban. He told me I was childish without any reason (simply because I repeated my requirements more than once, because I thought he didn't understand them, and because I asked him why he is so acid when he answers) and he banned me when I replied that I'm not childish, maybe he is.
Furthermore, is there anyone else who has a bit of patience to reply to questions from people who maybe don't understand the documentation of the project, given the fact that it is not so clear for someone who does not use FreeRADIUS every day?
I wrote to you because I've seen that you run the ML along with Alan, and to ask whether the documentation is up-to-date.
Best regards,
Gianni Costanzi
There was another user’s reply on the thread, this is what Thor Spruyt wrote:
This has already been going on (and getting worse) for years...
People seem to be less and less interested in understanding something. They also think reading documentation (not to mention source code) is not needed at all; just ask on a forum/mailing list whatever you want.
I think it's OK for people to not want to understand the details, but expecting others to just deliver whatever they want "on a plate" is indeed a problem.
I replied privately to him, because I couldn’t reply on the ML anymore:
Hi Thor, if you read the whole thread, I did not say that I don't believe what Alan says or that I don't want to dig into the documentation, but the facts are the following:
1) FreeRADIUS documentation is poor, compared to a lot of other open-source projects
2) The documentation within FreeRADIUS modules is fine for people who use FreeRADIUS every day and probably already know how it works
3) I was quite polite; I kept repeating myself because Alan's proposed solutions were ignoring some of my requirements
4) Alan told me I was using Exec-Program-Wait in the wrong position; I followed his advice and it still kept not working. I asked him for some hints about where I could look to find out why it was not working, and he told me that maybe I was on an old version
I can go on for hours; if you read the whole thread I think you can see that there is no reason to ban me from the ML. Alan is acid 90% of the time when he answers people, and I don't really understand why. I have replied on other MLs and on sites like Stack Overflow to a lot of people, and even when they did not understand what I was saying I always tried to explain it better. It always seems that he must answer even if he doesn't want to; the main problem is that he is the only expert who replies on that ML, this is the point.
Have a nice day, and excuse me if I replied to your private address, but Alan banned me.
Best regards,
Gianni Costanzi
I got really upset about these facts, since I’m one who reads tons of documentation, especially if it is well done and written for users, not for people who already know how the piece of software works, as the FreeRADIUS docs are. I’ve also spent hours of my life trying to reply to all kinds of people, experts and non-experts, and I’ve also worked on open-source projects like the Gentoo Documentation Project.
Please don’t judge me or Alan from the snippets above; you can read the whole thread on this page, looking for “Exec-Program-Wait not working” in the subject:
Today I went to work, told my boss that I had been banned from the ML (he knows how Alan behaves, so he was not really surprised) and then started working on a solution, which should have been quite simple for God-DeKok. I had explained that I forced a passage through the users file even after a successful authentication on the external realm imp, and this is done, as I explained, by putting a files.authorize in the post-auth section of the default site. Do you know why the exec module did not invoke Exec-Program-Wait when examining the users file after an Access-Accept from realm imp? Quite simple: exec needed to be called after files.authorize (or, equivalently, the if block that calls files.authorize needed to come before the exec statement in the post-auth section):
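The resulting post-auth section looks roughly like this (a minimal sketch reconstructed from the description above; the exact realm check and section layout are assumptions, not my verbatim configuration):

```
# sites-enabled/default -- post-auth section (sketch)
post-auth {
    # Force a second pass through the users (authorize) file
    # after a successful authentication on realm imp...
    if (Realm == "imp") {
        files.authorize
    }
    # ...and only then run the exec module, so that an
    # Exec-Program-Wait set by the users file is actually invoked.
    exec
}
```

The key point is simply the ordering: if exec runs before files.authorize, the Exec-Program-Wait attribute does not exist yet when exec looks for it.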
Now that I’ve understood what the problem was, I think that God-DeKok could have suggested that the issue might be related to the order in which the exec module and the users file were processed, but maybe this is not true. It would have been great to share how I solved the issue and to discuss with Alan why Exec-Program-Wait works when put as a reply item and not as a check item, but I can’t, since he irrevocably banned my email address.
That solved my problem. Maybe it is not the most exact or elegant way of implementing what I needed, and if there were a real community with more than a single expert replying on the ML, it would be great to discuss how I could implement the expected authentication flow. But with Alan it is simply impossible, because he gives answers that are too quick, which usually say “there is this module, read the docs within it”, and when you go and read the docs you realize that they do not really clarify the exact usage of each statement.
I think that the main problem of this project, apart from the poor documentation, is that Alan is the only expert who replies to your questions on the mailing list, and that he accomplishes this task without the passion that I and thousands of other people have when we want to help someone: it always seems that he answers without the time or the patience required to explain.
I hope that someone will read this long post. I surely made some mistakes interacting with Alan, but I think that his poor behavior has no excuse.
Here you can find the pages of the thread, in case they are removed (I don’t know if that is possible… just in case…)
I hope that if you use FreeRADIUS, you’ll never need to interact with Alan DeKok.
I started reading Chapter 2 of the Juniper MX Series book a few days ago, where it talks about bridging, VLAN mapping and IRB interfaces. It describes two ways of configuring bridging: the simpler Enterprise style and the more complex but more flexible Service Provider style. In this small lab I just wanted to try out some configurations I learned from the book.
This is the setup of my lab in EVE-NG:
In this lab we have a Cisco router RT whose interface e0/0 is configured with two sub-interfaces, .100 and .200, with dot1Q encapsulation and tags 1100 and 200, respectively. I’ve put e0/0.100 in VRF100 in order to be able to ping between the two sub-interfaces through the 3 vMX switches (otherwise the ping would be handled locally by the router).
Tagging and bridging configurations are quite strange in this lab (not very smart, you could say) but the purpose is to experiment with configurations 🙂
Packets’ flow
The flow of a packet from RT to vMX3 is the following:
The packet exits RT e0/0.100 tagged with 1100 and arrives into ge-0/0/0 on vMX1
vMX1 ge-0/0/0 is configured as a trunk that allows vlan tags 100 and 200 and translates incoming tag 1100 to 100, and vice versa when packets leave the interface. Due to tag 100, the packet goes into Bridge-Domain BD100 and then exits via interface ge-0/0/1.100, arriving on interface ge-0/0/1.100 of vMX2.
vMX2 ge-0/0/1.100 receives the packet which goes again in BD100 and then it exits untagged via interface ge-0/0/2.100 which is configured as an Access interface. Packets are received by ge-0/0/2.100 on vMX3.
vMX3 ge-0/0/2.100 is again an Access interface and receives the untagged packets that are placed in BD100.
BD100 is always configured as a Bridge-Domain with single tag 100, so packets received in this Bridge-Domain have this single tag applied to them.
vMX3 has two Integrated Routing and Bridging (IRB) interfaces, one in BD100 and one in BD200, whose purpose is to route packets between the two vlans.
The flow of a packet from vMX3 to RT is the following:
The packet starts in vMX3 BD200 where irb.200 interface is placed and exits via interface ge-0/0/3.200 tagged with vlan tag 200. It is received on vMX2 on ge-0/0/3.200.
vMX2 ge-0/0/3.200 receives the packet and places it in BD200. This time BD200 has two tags: 2200 outside and 1200 inside. This means that the incoming packet’s tag 200 is swapped with tag 1200 and then an outer tag 2200 is pushed onto the packet. The packet is then delivered to interface ge-0/0/1.200, which is configured to send packets with outer tag 20 and inner tag 200: this means that tag 1200 is swapped with 200 and 2200 is swapped with 20. The packet is then received by ge-0/0/1.200 on vMX1.
vMX1 ge-0/0/1.200 receives the packet with outer tag 20 and inner tag 200 and places it within BD200, which is configured with single tag 200. This means that the outer tag 20 must be popped. The packet is then sent out via interface ge-0/0/0 tagged with 200 and it is received by RT on its e0/0.200 sub-interface.
Enterprise-style configuration
Enterprise-style configuration is the simplest way of configuring access and trunk interfaces: you define an interface with family bridge and interface-mode trunk or access, specify the allowed vlan(s), and the MX automatically places the traffic in the corresponding bridge. If an interface is an access interface with tag 100, its untagged traffic is placed in a bridge with tag 100; if an interface is a trunk with allowed vlans 100 and 200, traffic tagged with 100 is placed in a bridge with tag 100 and traffic tagged with 200 is placed in a bridge with tag 200. This configuration style is the fastest and easiest, but it allows for less flexibility in interface configuration than Service Provider style.
Let’s have a look at some interfaces within the lab:
vMX1 ge-0/0/0 is configured with a single unit 0 (the only unit allowed with Enterprise Style configuration) and it is a simple trunk with allowed vlans 100 and 200, but it also has a vlan-rewriting configuration that is used to translate tag 1100 to 100 when received from RT:
# show interfaces
ge-0/0/0 {
    unit 0 {
        family bridge {
            interface-mode trunk;
            vlan-id-list [ 100 200 ];
            vlan-rewrite {
                translate 1100 100;
            }
        }
    }
}
This interface is automatically placed within BD100:
# show bridge domain BD100 extensive
Routing instance: default-switch
Bridge domain: BD100 State: Active
Bridge VLAN ID: 100
Interfaces:
ge-0/0/0.0
ge-0/0/1.100
Total MAC count: 3
vMX3 ge-0/0/2 is configured with unit 0 in access mode on vlan 100:
# show interfaces
ge-0/0/2 {
    unit 0 {
        family bridge {
            interface-mode access;
            vlan-id 100;
        }
    }
}
Service Provider-style Configuration
SP-style configuration allows for more flexibility and automatic vlan-rewriting, but it requires a bit more configuration. Interfaces support more than the single unit 0, and interface units must be manually placed within the corresponding bridges. This manual placement of interfaces within BDs is what enables automatic vlan-rewriting, as we will see in some of the following examples.
Let’s have a look at some interfaces within the lab:
vMX1 ge-0/0/1 has two units, 100 and 200:
unit 100 is configured with single tag 100 and it is manually placed within BD100:
# show interfaces
ge-0/0/1 {
    flexible-vlan-tagging;
    encapsulation extended-vlan-bridge;
    unit 100 {
        vlan-id 100;
    }
}
# show bridge-domains
BD100 {
    vlan-id 100;
    interface ge-0/0/1.100;
}
Interface unit and BD are both configured with a single tag 100, so there are no vlan-rewriting operations going on:
unit 200 is configured with two tags, outer 20 and inner 200, and is manually placed within BD200:
# show interfaces
ge-0/0/1 {
    flexible-vlan-tagging;
    encapsulation extended-vlan-bridge;
    unit 200 {
        vlan-tags outer 20 inner 200;
    }
}
# show bridge-domains
BD200 {
    vlan-id 200;
    interface ge-0/0/1.200;
}
This time packets leaving BD200 through ge-0/0/1.200, or arriving from ge-0/0/1.200 into BD200, must go through a vlan-rewriting step, which happens automatically on the MX: you only need to specify the tags for the BD and the interface unit and link those two elements by putting the interface within the bridge-domain, and then the MX implements the proper vlan-rewriting operations:
As you can see from the output above, interface traffic has an outer tag 20 and inner tag 200, so incoming traffic must go through a pop operation, which removes tag 20, while outgoing traffic must go through a push 20 operation, which adds tag 20.
vMX2 ge-0/0/1 has two units, 100 and 200 configured as on vMX1:
unit 100 is configured with single tag 100 and it is manually placed within BD100
unit 200 is configured with two tags, outer 20 and inner 200, and is manually placed within BD200. This time we have a double tagged bridge-domain:
As you can see, the MX must do a double swap of inner and outer tags to translate incoming and outgoing traffic.
Vlan-tagging and Encapsulation type
When configuring bridging interfaces you can enable the use of single or dual tags with different keywords:
vlan-tagging: enables the use of a single tag
stacked-vlan-tagging: enables the use of double tags
flexible-vlan-tagging: enables you to mix single- and double-tagged units within the same physical interface and gives you the opportunity to specify native vlans for both inner and outer tags (have a look at the Juniper MX Series book for further details)
Another flexibility you have is to specify the kind of encapsulation to be used:
encapsulation extended-vlan-bridge: applies the vlan-bridge encapsulation to all the units, so it can be used when you have single- or double-tagged units
encapsulation flexible-ethernet-services: allows you to specify a different per-unit encapsulation. In this way you can have some units which do bridging with vlan-bridge encapsulation, some layer-3 units, and some units that do VPLS bridging with vlan-vpls encapsulation.
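As a sketch of how these options combine, the following hypothetical interface (not one from the lab; unit numbers, tags and the address are made up for illustration) mixes a single-tagged bridged unit, a double-tagged bridged unit and a layer-3 unit, which is only possible with flexible-vlan-tagging plus flexible-ethernet-services:

```
ge-0/0/5 {
    flexible-vlan-tagging;
    encapsulation flexible-ethernet-services;
    unit 100 {
        encapsulation vlan-bridge;
        vlan-id 100;
    }
    unit 200 {
        encapsulation vlan-bridge;
        vlan-tags outer 20 inner 200;
    }
    unit 300 {
        vlan-id 300;
        family inet {
            address 192.0.2.1/24;
        }
    }
}
```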
As you can see in the full configurations attached to the post, I’ve mixed some of the encapsulation and tagging modes in the lab, just for testing purposes.
Cisco & vMX Configuration
You can find the full configuration of the lab’s routers, which you can load onto your vMX and Cisco routers. Adjust the IP address of the em0 interface on each vMX, which is the one that can be connected to the Net cloud attached to the real LAN if you want to manage the vMX via SSH. You can upload the configuration file via SCP and then load it with the following commands:
[edit]
admin# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes
admin# load merge filename.conf
load complete
admin# commit
commit complete
vMXs’ user is admin with password admin1, root has password root123.
Note: I’ve added some IRB interfaces also on vMX1 and vMX2 that allowed me to debug communication problems while I was experimenting with the configuration.
Conclusions
I perfectly understand that I did not explain much about the underlying details of MX bridging, but I just wanted to show some of the flexibility of this wonderful machine and maybe increase your curiosity about what you can do with it. I suggest you read the chapter about bridging in the Juniper MX Series book, which goes into deep detail and shows some other more complex configurations.
In this post I’ll show you an implementation of a Carrier-of-Carriers Inter-provider Layer-3 VPN on Junos vMX. I studied this topic as the last one covered in the Juniper Networks JNCIS-IS MPLS Study Guide, which I suggest you read if you want to understand a lot of interesting features of the Juniper platform.
As you can see in the image above, I used a convention to number the Loopback and P-t-P interfaces of the routers: it seems complex, but you’ll get used to it after a while 🙂
Some examples:
Loopback of CE1-SP-PE7: router Rx is within group 1 (the number within the golden square) and x is equal to 7, so its loopback is 192.168.1.7/32
Interface of CE-A-11 toward CE1-SP-PE7: it is an inter-group P-t-P link, between R11 and R7, so it is within the 172.16.*.* network. Given that one of the two routers’ numbers is greater than 10, we sum 11 and 7, so we have 18 as the third octet. The IP address is 172.16.18.11/24. The interface is blue, so it is lt-0/0/10.117 with peer unit lt-0/0/10.711: the unit is 117 because we are on the router with ID 11 and the interface points toward the router with ID 7 (so the router with ID 7 will have unit 711 paired with unit 117, by the same logic)
Interface of SP-P1 toward SP-PE2: it is an intra-group P-t-P link, between R1 and R2, so it is within the 10.*.*.* network. The group is 0, so the second octet is 0. The third octet is given by XY, where X is the lower-numbered router, R1, and Y is the higher-numbered router, R2, so the third octet is 12. The last octet is given by the router number of SP-P1, so the IP address is 10.0.12.1/24. The interface is green, so it is ge-0/0/0.12 with vlan-id 12: ge-0/0/0 because we are on the router with ID 1 facing the link toward the router with ID 2; the unit is 12 and the vlan-id is the same, the concatenation of the two IDs, lowest first, since both are less than 10. The corresponding interface on the router with ID 2 is ge-0/0/1.12 with vlan-id 12.
As you can see, we’ll use /24 networks even if they are P-t-Ps, just to have the possibility to implement our numbering scheme.
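The numbering scheme can be sketched in a few lines of code (the helper names are hypothetical, and the sum-vs-concatenation rule for the third octet is my reading of the examples above):

```python
def loopback(group: int, rid: int) -> str:
    """Loopback: 192.168.<group>.<router-id>/32."""
    return f"192.168.{group}.{rid}/32"

def third_octet(a: int, b: int) -> int:
    """Concatenate the two router IDs (lowest first) when both are
    below 10, otherwise sum them (as in the R11-R7 example)."""
    lo, hi = sorted((a, b))
    return int(f"{lo}{hi}") if hi < 10 else lo + hi

def intra_group_ip(group: int, a: int, b: int, local: int) -> str:
    """Intra-group P-t-P link: 10.<group>.<XY>.<local-id>/24."""
    return f"10.{group}.{third_octet(a, b)}.{local}/24"

def inter_group_ip(a: int, b: int, local: int) -> str:
    """Inter-group P-t-P link: 172.16.<XY>.<local-id>/24."""
    return f"172.16.{third_octet(a, b)}.{local}/24"

print(loopback(1, 7))              # CE1-SP-PE7 loopback  -> 192.168.1.7/32
print(inter_group_ip(11, 7, 11))   # CE-A-11 toward CE1-SP-PE7 -> 172.16.18.11/24
print(intra_group_ip(0, 1, 2, 1))  # SP-P1 toward SP-PE2  -> 10.0.12.1/24
```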
On each link I’ve put one or more letters M, L or R to quickly show which protocols among MPLS, LDP and RSVP are enabled and running on the links.
Route-Distinguishers and Route-Targets used for the L3-VPNs are shown in the graphic.
The whole topology has been built on a single Juniper vMX virtual router, running within EVE-NG. Blue links are built on the lt-0/0/10 logical tunnel interface, while green links are built on the ge-0/0/0 and ge-0/0/1 physical interfaces, which are connected to each other. A pair of ge interfaces is sufficient to build as many P-t-P links as we want: it is enough to use a different vlan-id for each link. The reason why I inserted some ge links is that EVE-NG allows me to capture packets on those links, while it wouldn’t be possible on the lt-0/0/10 interface. The reason why I used only one pair of ge interfaces instead of more is that sniffing on the ge-0/0/0 interface allows me to see the traffic traversing the 4 segments in a row, with different vlan tags.
This is the lab on EVE-NG (em0 is the management interface connected to my LAN):
The switch cloud on the right is a simple way to connect ge-0/0/0 with ge-0/0/1 using EVE-NG linux bridging facility, without the need to configure a virtual switch.
Objective: Service Providers 1 and 2 want to offer an L3-VPN called vpn-a between CE-A-11 and CE-A-12. The two service providers have different AS numbers and are interconnected by Service Provider 0, which will offer them an Inter-Provider VPN called inter-vpn.
Problem: the vpn-a L3-VPN must be established with an MP-eBGP session between the two ASBRs of SP1 and SP2. No other router in SP1, SP2 or SP0 will know anything about vpn-a. We must build a label-switched path from CE1-SP-PE7 to CE2-SP-PE10. Usually IPv4 (family inet) i/eBGP sessions do not attach a label to the routes sent to the neighbors. This would imply that CE1-SP-PE7 would receive a route to reach the CE2-SP-PE10 loopback without a label attached to it: this would cause the sending of a packet labeled only with a vpn-a VPN label to CE1-SP-PE5, which, as I previously said, doesn’t know anything about the vpn-a L3-VPN and would discard the packet. The same holds for the other routers in the path toward the destination.
Carrier-of-Carriers Service Provider L3-VPN
Service Provider 0 is configured with OSPF Area 0 as IGP. I’ve chosen to use RSVP to build the following LSPs, instead of enabling LDP:
Bidirectional LSPs between SP-PE2 and SP-PE3 routers: these LSPs are needed to resolve the next-hop of inter-vpn L3-VPN routes in inet.3.
Unidirectional LSPs from the SP-RR4 Route-Reflector toward its RR clients: these LSPs are needed because routes received by an RR that cannot resolve the corresponding next-hop in inet.3 are hidden and thus not “reflected” to other clients.
SP-PE2 and SP-PE3 each have an MP-iBGP session with SP-RR4, which has only the inet-vpn family enabled, since we only need to carry labeled L3-VPN routes. SP-P1 is a P-router and doesn’t need to know anything about VPN routes, so it only requires OSPF, MPLS and RSVP to run on its interfaces, to allow the building of the required RSVP LSPs mentioned above.
The only routes that we need to have within the inter-vpn.inet.0 table on SP-PE2 and SP-PE3 are the loopback addresses of CE1-SP-PE7 and CE2-SP-PE10, which will need to reach each other to build the outermost MP-eBGP session for vpn-a. P-t-P links in the inter-vpn routing-instance are not sent as labeled VPN routes over the MP-iBGP session between SP-PE2 and SP-PE3, because they are considered multi-access links and are only advertised if there is a route with a next-hop on that link or if we’re using vrf-table-label within the routing-instance (which is not the case for inter-vpn).
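On SP-RR4, the relevant configuration could look roughly like this (a sketch reconstructed from the show outputs below; the group name and the use of interface all are assumptions, not the exact lab configuration):

```
protocols {
    rsvp {
        interface all;
    }
    mpls {
        label-switched-path from-SP-RR4-to-SP-PE2 {
            to 192.168.0.2;
        }
        label-switched-path from-SP-RR4-to-SP-PE3 {
            to 192.168.0.3;
        }
        interface all;
    }
    bgp {
        group rr-clients {
            type internal;
            local-address 192.168.0.4;
            family inet-vpn {
                unicast;
            }
            cluster 192.168.0.4;
            neighbor 192.168.0.2;
            neighbor 192.168.0.3;
        }
    }
}
```

Enabling only family inet-vpn on the RR sessions is what restricts them to carrying labeled L3-VPN routes, and the two RSVP LSPs give the RR a resolvable next-hop in inet.3 for the routes it reflects.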
Here is the output of some show commands executed on the routers within SP0:
-- SP-PE2 --
admin> show route logical-system SP-PE2
inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
10.0.12.0/24 *[Direct/0] 01:14:57
> via ge-0/0/1.12
10.0.12.2/32 *[Local/0] 01:14:59
Local via ge-0/0/1.12
10.0.13.0/24 *[OSPF/10] 01:14:43, metric 2
> to 10.0.12.1 via ge-0/0/1.12
10.0.14.0/24 *[OSPF/10] 01:14:43, metric 2
> to 10.0.12.1 via ge-0/0/1.12
192.168.0.1/32 *[OSPF/10] 01:14:43, metric 1
> to 10.0.12.1 via ge-0/0/1.12
192.168.0.2/32 *[Direct/0] 01:15:43
> via lo0.2
192.168.0.3/32 *[OSPF/10] 01:14:33, metric 2
> to 10.0.12.1 via ge-0/0/1.12
192.168.0.4/32 *[OSPF/10] 01:14:38, metric 2
> to 10.0.12.1 via ge-0/0/1.12
224.0.0.5/32 *[OSPF/10] 01:15:46, metric 1
MultiRecv
inet.3: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.0.3/32 *[RSVP/7/1] 01:14:16, metric 2
> to 10.0.12.1 via ge-0/0/1.12, label-switched-path from-SP-PE2-to-SP-PE3
inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
172.16.25.0/24 *[Direct/0] 01:14:58
> via ge-0/0/0.25
172.16.25.2/32 *[Local/0] 01:14:59
Local via ge-0/0/0.25
192.168.1.7/32 *[BGP/170] 01:14:35, localpref 100
AS path: 65100 I, validation-state: unverified
> to 172.16.25.5 via ge-0/0/0.25, Push 299824
192.168.2.10/32 *[BGP/170] 01:14:16, localpref 100, from 192.168.0.4
AS path: 65200 I, validation-state: unverified
> to 10.0.12.1 via ge-0/0/1.12, label-switched-path from-SP-PE2-to-SP-PE3
[...]
admin> show mpls lsp logical-system SP-PE2 ingress detail
Ingress LSP: 1 sessions
192.168.0.3
From: 192.168.0.2, State: Up, ActiveRoute: 0, LSPname: from-SP-PE2-to-SP-PE3
ActivePath: (primary)
LSPtype: Static Configured, Penultimate hop popping
LoadBalance: Random
Encoding type: Packet, Switching type: Packet, GPID: IPv4
*Primary State: Up
Priorities: 7 0
SmartOptimizeTimer: 180
Computed ERO (S [L] denotes strict [loose] hops): (CSPF metric: 2)
10.0.12.1 S 10.0.13.3 S
Received RRO (ProtectionFlag 1=Available 2=InUse 4=B/W 8=Node 10=SoftPreempt 20=Node-ID):
10.0.12.1 10.0.13.3
Total 1 displayed, Up 1, Down 0
-- SP-RR4 --
admin> show route logical-system SP-RR4 table inet.3
inet.3: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.0.2/32 *[RSVP/7/1] 01:18:23, metric 2
> to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE2
192.168.0.3/32 *[RSVP/7/1] 01:18:22, metric 2
> to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE3
admin> show route logical-system SP-RR4 table bgp.l3vpn.0
bgp.l3vpn.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
65000:2:192.168.1.7/32
*[BGP/170] 01:18:47, localpref 100, from 192.168.0.2
AS path: 65100 I, validation-state: unverified
> to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE2
65000:3:192.168.2.10/32
*[BGP/170] 01:18:43, localpref 100, from 192.168.0.3
AS path: 65200 I, validation-state: unverified
> to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE3
-- SP-P1 (only inet.0 and mpls.0 routes) --
admin> show route logical-system SP-P1 terse | match routes
inet.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
mpls.0: 12 destinations, 12 routes (12 active, 0 holddown, 0 hidden)
Service Provider 1
Now let’s have a look at SP1 (which is configured like SP2): again, we have OSPF running on the three nodes and LSPs between CE1-SP-PE7 and CE1-SP-PE5, this time built by LDP instead of RSVP. As we’ve previously said, we need a Label-Switched-Path that spans different Autonomous Systems, so we must find a way for the 192.168.2.10/32 route of CE2-SP-PE10’s loopback to flow through SP2, the inter-vpn instance on SP0 and then SP1 toward CE1-SP-PE7 with a label attached to it. This is accomplished through two BGP sessions, an MP-eBGP session between SP0 and SP1 and an MP-iBGP session within SP1, as shown in the topology, both with the labeled-unicast feature enabled within the inet address family. This feature tells the router to attach a label to the IPv4 routes it sends to the BGP neighbor. I’ll show you the route to 192.168.2.10/32 on different routers, with the label(s) attached to it:
-- SP-PE3 inter-vpn.inet.0: route with the label attached by CE2-SP-PE8 due to labeled-unicast --
admin> show route logical-system SP-PE3 table inter-vpn.inet.0 192.168.2.10/32
inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.2.10/32 *[BGP/170] 01:32:39, localpref 100
AS path: 65200 I, validation-state: unverified
> to 172.16.38.8 via lt-0/0/10.38, Push 299824
-- SP-PE2 inter-vpn.inet.0: route with the L3-VPN label attached to the route by SP-PE3 and the RSVP LSP label on top of it to reach SP-PE3 (some lines are omitted for brevity) --
admin> show route logical-system SP-PE2 table inter-vpn.inet.0 192.168.2.10/32 detail
inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
192.168.2.10/32 (1 entry, 1 announced)
*BGP Preference: 170/-101
Source: 192.168.0.4
Next hop: 10.0.12.1 via ge-0/0/1.12, selected
Label-switched-path from-SP-PE2-to-SP-PE3
Label operation: Push 299776, Push 299824(top)
Protocol next hop: 192.168.0.3
VPN Label: 299776
-- CE1-SP-PE5 inet.0: route with the label attached by SP-PE2 due to labeled-unicast --
admin> show route logical-system CE1-SP-PE5 table inet.0 192.168.2.10/32
inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.2.10/32 *[BGP/170] 01:32:44, localpref 100
AS path: 65000 65200 I, validation-state: unverified
> to 172.16.25.2 via ge-0/0/1.25, Push 299808
-- CE1-SP-PE7 inet.0: route with the label attached by CE1-SP-PE5 due to labeled-unicast and the top label of the LSP toward CE1-SP-PE5 --
admin> show route logical-system CE1-SP-PE7 table inet.0 192.168.2.10/32
inet.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.2.10/32 *[BGP/170] 01:32:54, localpref 100, from 192.168.1.5
AS path: 65000 65200 I, validation-state: unverified
> to 10.1.67.6 via ge-0/0/1.67, Push 299840, Push 299792(top)
I’ve omitted the route as it is seen on CE2-SP-PE8: in this case we receive a labeled IPv4 BGP route (preference 170) for 192.168.2.10/32 from CE2-SP-PE10, but the same route is also received via OSPF (preference 10), so the installed route has no label. This would cause a problem when CE1-SP-PE7 sends a packet with a vpn-a MPLS label: once it reached CE2-SP-PE8, it would be forwarded to CE2-SP-P9 toward CE2-SP-PE10 with only the vpn-a label, which is unknown to CE2-SP-P9. We must force the iBGP labeled route to be installed in CE2-SP-PE8’s inet.0 table (even if the label is an implicit null, as we will see; having a BGP route instead of an OSPF one forces the use of the LSP toward CE2-SP-PE10 to deliver the packet, thus adding an additional label that will be popped by CE2-SP-P9), so I’ve raised the OSPF preference to 200:
-- CE2-SP-PE8 --
admin> show route logical-system CE2-SP-PE8 table inet.0 192.168.2.10/32
inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.2.10/32 *[BGP/170] 01:47:34, localpref 100, from 192.168.2.10
AS path: I, validation-state: unverified
> to 10.2.89.9 via lt-0/0/10.89, Push 299776
[OSPF/200] 01:47:37, metric 2
> to 10.2.89.9 via lt-0/0/10.89
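The preference change behind the output above is a single statement; a minimal sketch of the configuration on CE2-SP-PE8:

```
protocols {
    ospf {
        preference 200;    /* let the iBGP labeled route (preference 170) win over OSPF */
    }
}
```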
Label 299776 is not the label attached to the route by CE2-SP-PE10, but it is the label used to reach the next-hop 192.168.2.10, resolved via inet.3 table:
admin> show route logical-system CE2-SP-PE8 table inet.3
inet.3: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
192.168.2.9/32 *[LDP/9] 10:12:23, metric 1
> to 10.2.89.9 via lt-0/0/10.89
192.168.2.10/32 *[LDP/9] 10:12:23, metric 2
> to 10.2.89.9 via lt-0/0/10.89, Push 299776
CE2-SP-PE10 in fact sends a labeled route to CE2-SP-PE8, but the label is an implicit null (reserved label 3), which means “you do not need a label to reach that route, just reach me”.
Thus, a packet received by CE2-SP-PE8 directed to CE2-SP-PE10’s vpn-a needs only the vpn-a label and, on top of it, a label to reach CE2-SP-PE10 via an LDP LSP. The outer label will then be removed by CE2-SP-P9 due to Penultimate-Hop-Popping and a packet with only the VPN label will be delivered to CE2-SP-PE10.
VPN-A customer’s L3VPN between SP1 and SP2
As a final step, once we have a bidirectional LSP between CE1-SP-PE7 and CE2-SP-PE10, we can build the vpn-a L3VPN with an MP-eBGP session between those two routers. In order to make vpn-a work, as we’ve explained with inter-vpn, the eBGP next-hop must be reachable in the inet.3 routing table. This is accomplished by adding the resolve-vpn keyword to labeled-unicast, which forces labeled IPv4 routes to be installed also in inet.3. Below is the configuration of BGP (internal and external) on CE1-SP-PE7 and the configuration of the routing-instance for vpn-a, where I’ve used vrf-table-label (which forces a lookup in the vpn-a.inet.0 routing table instead of directly sending VPN packets destined to CE-A-11 out of the P-t-P interface) and static routing to reach the loopbacks of the connected CEs of Customer A (the CEs in turn have a default route toward their Service Provider):
-- CE1-SP-PE7 --
admin> show configuration logical-systems CE1-SP-PE7 protocols bgp
group SP1-Internal {
type internal;
local-address 192.168.1.7;
family inet { labeled-unicast { resolve-vpn; } }
export export-loopback;
neighbor 192.168.1.5;
}
group SP-1-2-external {
type external;
multihop;
local-address 192.168.1.7;
family inet-vpn {
unicast;
}
peer-as 65200;
neighbor 192.168.2.10;
}
admin> show configuration logical-systems CE1-SP-PE7 routing-instances vpn-a
instance-type vrf;
interface lt-0/0/10.711;
route-distinguisher 65012:100;
vrf-target target:65012:0;
vrf-table-label;
routing-options {
static {
route 192.168.4.11/32 next-hop 172.16.18.11;
}
}
Below is the output of some show commands on the same router, showing the use of 3 labels: label 16 is the VPN label (it is so low due to the vrf-table-label statement), 299840 is the label attached to the 192.168.2.10/32 route sent by CE1-SP-PE5 to CE1-SP-PE7 and 299792 is the LDP label associated with the LSP toward CE1-SP-PE5:
-- CE1-SP-PE7 --
admin> show route logical-system CE1-SP-PE7 table vpn-a.inet.0
vpn-a.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
172.16.18.0/24 *[Direct/0] 00:01:32
> via lt-0/0/10.711
172.16.18.7/32 *[Local/0] 02:18:07
Local via lt-0/0/10.711
172.16.22.0/24 *[BGP/170] 00:01:32, localpref 100, from 192.168.2.10
AS path: 65200 I, validation-state: unverified
> to 10.1.67.6 via ge-0/0/1.67, Push 16, Push 299840, Push 299792(top)
192.168.4.11/32 *[Static/5] 00:01:32
> to 172.16.18.11 via lt-0/0/10.711
192.168.4.12/32 *[BGP/170] 00:01:32, localpref 100, from 192.168.2.10
AS path: 65200 I, validation-state: unverified
> to 10.1.67.6 via ge-0/0/1.67, Push 16, Push 299840, Push 299792(top)
As you can see, the reception of label 16 forces a second lookup within vpn-a.inet.0 routing table, instead of directly sending the packet toward CE-A-11 on the lt-0/0/10.711 interface of CE1-SP-PE7:
admin> show route logical-system CE1-SP-PE7 table mpls.0 label 16
mpls.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
16 *[VPN/0] 21:14:18
to table vpn-a.inet.0, Pop
Verifying the connection between CE-A-11 and CE-A-12
Now it is time to verify the connectivity between the two Customer-A’s routers with a traceroute. In order to have information for every hop within the network, each MPLS-enabled router must have icmp-tunneling enabled within protocol mpls stanza, otherwise a packet with an expired Time-To-Live value within IP header would produce a reply toward the source of the packet, 192.168.4.11 for example, which is completely unknown to all the routers except SP1 and SP2 ASBRs, i.e. CE1-SP-PE7 and CE2-SP-10. Enabling icmp-tunneling forces the router where the packet is expired to build an ICMP response that is sent toward the destination instead of the source with the original MPLS tags. When it reaches, in our example, CE2-SP-PE10 within vpn-a.inet.0 table the router sees that the destination is 192.168.4.11 and it is sent back toward the source of the traceroute’s UDP packet. I’ll add some info about each label on every hop.
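Enabling it is a one-line statement under the mpls stanza (a sketch; in this lab it must be repeated in every MPLS-enabled logical system):

```
protocols {
    mpls {
        icmp-tunneling;
    }
}
```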
admin> traceroute logical-system CE-A-11 192.168.4.12 source 192.168.4.11
traceroute to 192.168.4.12 (192.168.4.12) from 192.168.4.11, 30 hops max, 40 byte packets
1 172.16.18.7 (172.16.18.7) 1.237 ms 0.668 ms 0.545 ms
2 10.1.67.6 (10.1.67.6) 3.833 ms 4.254 ms 3.706 ms
MPLS Label=299792 CoS=0 TTL=1 S=0 => LDP label to reach CE1-SP-PE5
MPLS Label=299840 CoS=0 TTL=1 S=0 => iBGP labeled-unicast label for 192.168.2.10/32 received from CE1-SP-PE5
MPLS Label=16 CoS=0 TTL=1 S=1 => vpn-a label for 192.168.4.12, advertised by CE2-SP-PE10
3 10.1.56.5 (10.1.56.5) 3.704 ms 4.209 ms 3.687 ms => PHP removes the outermost label
MPLS Label=299840 CoS=0 TTL=1 S=0
MPLS Label=16 CoS=0 TTL=2 S=1
4 172.16.25.2 (172.16.25.2) 3.770 ms 4.368 ms 3.605 ms
MPLS Label=299792 CoS=0 TTL=1 S=0 => eBGP labeled-unicast label for 192.168.2.10/32 received from SP-PE2. It replaced label 299840
MPLS Label=16 CoS=0 TTL=3 S=1
5 10.0.12.1 (10.0.12.1) 5.300 ms 5.557 ms 3.600 ms
MPLS Label=299824 CoS=0 TTL=1 S=0 => RSVP label for from-SP-PE2-to-SP-PE3 LSP toward 192.168.0.3
MPLS Label=299776 CoS=0 TTL=1 S=0 => inter-vpn label for 192.168.2.10/32 received from 192.168.0.3 through Route-Reflector 192.168.0.4. It replaced label 299792
MPLS Label=16 CoS=0 TTL=4 S=1
6 10.0.13.3 (10.0.13.3) 4.809 ms 3.749 ms 4.161 ms => PHP removes the outermost label
MPLS Label=299776 CoS=0 TTL=1 S=0
MPLS Label=16 CoS=0 TTL=5 S=1
7 172.16.38.8 (172.16.38.8) 3.610 ms 4.518 ms 3.661 ms
MPLS Label=299824 CoS=0 TTL=1 S=0 => eBGP labeled-unicast label for 192.168.2.10/32 received from CE2-SP-PE8. It replaced label 299776
MPLS Label=16 CoS=0 TTL=6 S=1
8 10.2.89.9 (10.2.89.9) 3.633 ms 4.246 ms 3.619 ms => third label not added due to the implicit-null label received for 192.168.2.10/32 from 192.168.2.10 through the iBGP labeled-unicast route advertisement
MPLS Label=299776 CoS=0 TTL=1 S=0 => LDP label to reach CE2-SP-PE10. It replaced label 299824
MPLS Label=16 CoS=0 TTL=7 S=1
9 172.16.22.10 (172.16.22.10) 4.002 ms 4.404 ms 3.705 ms => PHP removed the outermost label. CE2-SP-PE10 received the packet with only label 16, which is popped for a second lookup within vpn-a.inet.0 due to vrf-table-label
10 192.168.4.12 (192.168.4.12) 4.000 ms 3.831 ms 4.493 ms
Then, I started a packet capture on the ge-0/0/0 interface within EVE-NG and re-ran the traffic from vMX.
As you can see from the image below, I’ve captured the same echo request and echo reply packets 4 times in a row:
Looking at the echo request packets, you can see all the labels we’ve seen from the traceroute output before (look at the vlan-id, it tells you the green link the packet is traversing):
These are the replies captured on the same interface (I’ll show you also the replies because they show the triple-label stack when the packet is leaving the L3VPN routing instance in CE1-SP-PE7 and SP-PE2):
vMX Configuration
You can find the full configuration of the lab, which you can load into your vMX router. Adjust the IP address of the em0 interface, which is the one connected to the Net cloud bridged to the real LAN, if you want to manage vMX via SSH. You can upload the configuration file via SSH (copy and paste via console can give you buffer problems) and then load it with the following commands:
[edit]
admin# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes
admin# load merge carrier-of-carriers_vMX_topology.cfg
load complete
admin# commit
commit complete
User is admin with password admin1, root has password root123.
I hope you’ve read the whole post: it has been a long one, but I hope you’ve found the topic as interesting as I did. I suggest reading the JNCIS-SP Study Guide available from Juniper Networks to understand a lot of interesting things about MPLS on Junos, which can be successfully tested on a vMX platform.
As I usually say, post comments or questions, or even tell me if I’ve made some mistakes: I’ve gone through this stuff for just a few days and there is surely room to improve my skills 🙂
In this small tutorial we’ll see how to use QEMU image compression to compress base images in Unetlab/EVE-NG Alpha. For some detail about where files are stored, have a look at my previous post Modifying base-images with snapshots on Unetlab/EVE-NG Alpha.
Update 2017-01-25: after asking for some info about QEMU compression on the QEMU users’ mailing list, Alberto Garcia clarified some aspects of how compression works that were not clear to me:
I think there’s some misunderstanding here about compressed images in
QEMU. I’ll try to clarify:
* You create a compressed image with ‘qemu-img convert -c’. That is a
copy of the original image with all the clusters compressed.
* The compression is read-only: QEMU will read the compressed clusters, but everything that it writes will be uncompressed (also if you rewrite compressed clusters).
* Therefore, there’s no such thing as an image with compression
enabled. In QEMU you don’t compress an image, you compress
individual clusters of data. An image can have a mix of compressed
and uncompressed clusters.
Compressing base images
Suppose we have an already working base image, or an hda.qcow2 virtual hard disk prepared to be used. I’ll use the TinyCore Linux image from the previous post. Let’s create a new folder and clone the /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 base image, enabling compression. This time we must make a full clone; we cannot use snapshots:
root@eve-ng:/# mkdir /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed
root@eve-ng:/# cd /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed
root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed# /opt/qemu/bin/qemu-img convert -c -f qcow2 -O qcow2 /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 hda.qcow2
root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed# cd ..
root@eve-ng:/opt/unetlab/addons/qemu# ls -l linux-tiny-core-7.2/hda.qcow2
-rw-r--r-- 1 root root 76414976 Nov 11 10:21 linux-tiny-core-7.2/hda.qcow2
root@eve-ng:/opt/unetlab/addons/qemu# ls -l linux-tiny-core-7.2-compressed/hda.qcow2
-rw-r--r-- 1 root root 72351744 Jan 6 14:49 linux-tiny-core-7.2-compressed/hda.qcow2
As you can see, with TinyCore Linux I won’t save much space, but with images such as Radware Alteon you can easily gain 1 GByte or more by enabling compression.
Unfortunately, I’ve not found a way to explicitly show that an image has compression enabled; if you know one, let me know! => As I wrote in the update at the beginning of the post, the concept of “enabling compression” with QEMU is faulty: you compress blocks of data of the qcow2 image, but there is no “compression enabled” state; new data is written uncompressed.
Testing the compressed base image
I’ve added a new node, node 4, based on the new compressed image, to the lab we used in a previous post. Everything works as expected, and EVE makes a snapshot of the compressed image:
root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/4# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed/hda.qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 68M
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
Conclusions
This small tutorial showed how you can save some space by compressing base images with the QEMU tools. I’ve not done extensive tests on this, but everything seems to work. Any suggestions are welcome, just write a comment below!
Unetlab/EVE-NG (Alpha) is a great tool for learning about networking with different platforms (Dynamips routers, IOL, QEMU images). I won’t get into details about how to prepare the environment, you can find a lot of useful information on their site http://www.unetlab.com; I’ll focus instead on how the labs’ device images are managed by EVE (let’s use EVE instead of Unetlab/EVE-NG for the rest of the post).
Browsing the temporary files for a lab with a QEMU node
Let’s create a new lab with a single QEMU node, which in my case is a TinyCore Linux image:
The base image I’ve chosen is linux-tiny-core-7.2, which you can find on EVE VM under /opt/unetlab/addons/qemu/linux-tiny-core-7.2:
root@eve-ng:~# ls -l /opt/unetlab/addons/qemu/linux-tiny-core-7.2/
total 74624
-rw-r--r-- 1 root root 76414976 Nov 11 10:21 hda.qcow2
hda.qcow2 is disk 1 of the Linux box.
What happens when you instantiate this image within your lab and run/modify its contents? As you can imagine, EVE does not modify the source image, otherwise it would be impossible to manage multiple instantiations of the same base image. Instead, it creates a QEMU snapshot within the lab’s temporary files. Let’s see it.
First, we must get some info about the lab: press Lab Details on the left menu bar and get the lab ID, which in my case is da2b48f4-d910-4e7d-9645-f952457cbf6d.
Go into the temporary lab folder on EVE (the “0” after /tmp is my pod number, zero as I’m working as admin):
cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/
In this folder you can find a subfolder for each device of the lab. In this example we have only device 1, so enter the 1 subfolder (if multiple nodes are running, you can get the number of a node by right-clicking on it and looking at the number between brackets after its name) and look at its contents:
root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d# cd 1
root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1# ls -l
total 1284
-rw-r--r-- 1 root unl 1376256 Jan 6 11:43 hda.qcow2
-rw-rw-r-- 1 root unl 0 Jan 6 11:42 wrapper.txt
You can see hda.qcow2 again. Is it a copy? No, that would be a waste of space: it is a snapshot of the base image, let’s see it:
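The original post showed this with a screenshot; a quick sketch of the check with qemu-img (the pod number and lab ID are the ones from this example):

```shell
# Inspect the per-node disk: a "backing file" line reveals it is a
# snapshot of the base image, not a copy.
cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1
/opt/qemu/bin/qemu-img info hda.qcow2
# The output includes a line like:
#   backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
```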
So, every modification you make to the disk of the instantiated image is confined to the snapshot, and the base image won’t be modified.
Modifying the base image
Some time ago, I needed to modify my base TinyCore installation to add the tcpdump package to it, which is useful for troubleshooting. Instead of installing it on every instantiation of TinyCore within my labs, I wanted to modify the base image; but you must not modify a base image after you’ve used it in at least one lab, unless you want to corrupt all the labs that instantiated it. The quick and dirty solution could be creating a new subdirectory /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2, copying the original hda.qcow2 file into it, and then modifying the new base image before using it in any lab. With TinyCore it is just a matter of a few MBytes, so it is perfectly fine to make a simple copy. But what if the base image is 2-3 GBytes, such as Radware Alteon or Juniper vMX images? It would be a waste of space, and on my laptop I don’t want to waste it, so let’s try a different approach.
Making a snapshot of the base image
We can use qemu-img to create a new hda.qcow2 which is a snapshot of the original base image. Go into the /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2 folder we used in the previous section and delete the copy of the base image we put there. Then create a snapshot of the original base image with qemu-img create and the -b flag, which specifies a backing file for the new image, thus creating a snapshot:
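The original command appeared as a screenshot; here is a sketch of it (note: recent qemu-img versions also require -F qcow2 to state the backing file’s format):

```shell
cd /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2
# Create a qcow2 image backed by the original base image (a snapshot);
# on newer qemu-img add: -F qcow2
/opt/qemu/bin/qemu-img create -f qcow2 \
    -b /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 hda.qcow2
```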
Preparing the new base image – method 1: running QEMU from CLI
Now you have a new base image and you want to prepare it before using it in future labs. Let’s run it from the command line. I won’t go into QEMU details, you can find useful docs on the web, or simply run ps aux on the EVE VM with some nodes running to see how it runs QEMU nodes. TinyCore can be managed via VNC, so let’s pass -vnc :100 in order to make QEMU listen for VNC connections on port 6000 on every interface (6000 is the result of the 5900 base VNC port + 100) for the node we’re running:
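The invocation was shown as a screenshot in the original post; a hypothetical sketch with the flags the text describes (memory size and telnet port are illustrative, not the post’s exact values):

```shell
cd /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2
# -vnc :100 listens on TCP port 5900 + 100 = 6000 on every interface;
# -serial exposes the console over telnet (port 2100 is illustrative)
/opt/qemu/bin/qemu-system-x86_64 \
    -m 256 -hda hda.qcow2 \
    -vnc :100 \
    -serial telnet:0.0.0.0:2100,server,nowait
```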
In the command above you can also see how to enable a serial connection through telnet, in case your image has console access enabled. If you need to reach the image via an IP connection, for example to transfer some content onto it using SCP, you can manage it this way. Suppose the EVE VM has pnet0 with eth0 connected in bridge mode to your own network: configure eth0 on TinyCore (or your own image) with an IP address compatible with your network through the console or VNC connection, or let it get an IP via your LAN DHCP, and then shut down the image (if you’re using TinyCore, remember to make the changes persistent, otherwise you’ll lose them after reboot). Now, prepare a virtual interface to put within the pnet0 bridge, to which we will connect TinyCore:
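The interface-creation command was a screenshot in the original; a sketch of the missing step (a tap interface named tmp_iface, matching the commands that follow — the original may have used tunctl, which is equivalent):

```shell
# Create a tap interface named tmp_iface to be enslaved to pnet0
ip tuntap add dev tmp_iface mode tap
```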
root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# ifconfig tmp_iface up
root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# brctl addif pnet0 tmp_iface
root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# brctl show pnet0
bridge name     bridge id               STP enabled     interfaces
pnet0           8000.000c29baeb65       no              eth0
                                                        tmp_iface
Start QEMU again, mapping TinyCore’s eth0 onto the tmp_iface we’ve just created:
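Again the command was a screenshot; a hypothetical sketch using the -netdev tap syntax (the original post may have used the older -net tap form, which behaves the same here):

```shell
cd /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2
# Attach the guest NIC to the tmp_iface tap device we enslaved to pnet0;
# script=no/downscript=no prevent qemu from reconfiguring the bridge
/opt/qemu/bin/qemu-system-x86_64 \
    -m 256 -hda hda.qcow2 \
    -vnc :100 \
    -netdev tap,id=net0,ifname=tmp_iface,script=no,downscript=no \
    -device e1000,netdev=net0
```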
Now you can reach your QEMU node via SSH/SCP and you can do whatever you want in order to prepare the new base image.
After some modifications, let’s see the status of the -v2 base image:
root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img info hda.qcow2 | grep "disk size"
disk size: 1.4M
Before my modifications to the image, disk size was 196K, now it is grown to 1.4M. The original base image has not been modified.
Preparing the new base image – method 2: running a new node from GUI
If you don’t want to go through the CLI steps explained above, you can instantiate a new node based on the -v2 image and run it from the GUI:
Create a network object mapped on pnet0 and connect it to eth0:
Start the node based on the new base image -v2 and modify it as you want. As we’ve already said before, these modifications won’t go onto the original -v2 image we’ve prepared within /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2. Since I’ve added this new node on the same lab with the node based on the original base image, it will be node 2 of the same lab, so let’s move into its temporary folder on EVE VM and have a look at the snapshot image:
root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# cd /
root@eve-ng:/# cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2
root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# ls -l
total 1288
-rw-r--r-- 1 root unl 1376256 Jan 6 13:52 hda.qcow2
-rw-rw-r-- 1 root unl 118 Jan 6 13:52 wrapper.txt
root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]
What can we do now to put the changes of this snapshot back into the -v2 base image? Let’s use the QEMU tools to accomplish this task, but first shut down the node within the GUI (don’t delete the node now!):
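The commit command itself appeared as a screenshot in the original; a sketch of it (the paths are this lab’s):

```shell
cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2
# Merge the snapshot's changes down into its backing file
# (/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2)
/opt/qemu/bin/qemu-img commit hda.qcow2
```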
The commit command tells QEMU to commit changes we’ve made to the instantiated node’s /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2 disk to the base image /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2.
The new node we added in EVE’s GUI is still valid, the operation above did not corrupt its snapshot; but now we can delete it, and if you add a new node based on the -v2 base image, you will find the modifications you merged into the base image with the commit command.
A Snapshot Chain: what happened behind the scenes
As you’ve probably already understood, when you instantiate a new node within the GUI based on -v2 base image, you’re creating a snapshot of a snapshot. This is our snapshots tree:
What you must remember is that modifications to one of the images in the snapshot tree invalidate all the snapshots under the modified image. For example, committing the changes made on the /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2 base image forces a modification of the original base image, which invalidates the node 1 disk /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1/hda.qcow2 but does not damage the node 2 disk /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2. If you have doubts before making such important changes, take a snapshot of your EVE VM with the tools offered by your virtualization environment (such as VMware Fusion or VirtualBox) to have a quick rollback solution in case of damage to your labs.
Conclusions
In this small tutorial you’ve seen how to use the QEMU tools to manage snapshots of base images in Unetlab/EVE-NG, in order to set up new base images based on older ones, without wasting space on full clones. As EVE-NG is still being developed (it is now in Alpha stage), a lot of this could be integrated into the GUI, but I hope it can be useful to someone experimenting with it now 🙂