Subnet-to-Subnet SNAT/DNAT on Fortinet Firewalls with Central NAT

Article originally published by me on Medium on 26 April 2021: 

https://medium.com/nerd-for-tech/subnet-to-subnet-snat-dnat-on-fortinet-firewalls-with-central-nat-604a6102b21f

Implementing SNAT/DNAT on Fortinet firewalls has never been as straightforward as on other platforms like Checkpoint, in my opinion, at least before the introduction of Central NAT. Let’s see how to implement a subnet-to-subnet 1-to-1 translation with deterministic mappings, using VirtualIPs (DNAT) and Fixed Port Range IP-Pools (SNAT).

Introduction

I’ve recently had to migrate some firewall security and NAT rules from Checkpoint to Fortinet firewalls and faced some challenges due to the different behavior of the two technologies. Checkpoint applies security policies first (at least in the version I’m using) and then checks the NAT policies, applying both source and destination NAT. Fortinet has a different order of operations, more like Linux with iptables: the packet arrives on the incoming interface; a pre-routing step applies Destination NAT (DNAT from now on); a routing decision follows (now that the final destination is known) and determines the outgoing interface; security policies (IPv4 policies) are evaluated; and finally Source NAT (SNAT from now on) takes place before the packet is sent out of the outgoing interface (more detail can be found here; I’ve omitted a lot of other intermediate steps).

During the rule migration I needed to implement SNAT and DNAT rules that map a /22 network onto another /22 network with deterministic mappings. Let’s take the following as an example:

On the right there is a partner’s network, 10.0.100.0/22 (i.e. with IPs ranging from 10.0.100.0 to 10.0.103.255), for which our firewall has a known route. We have some rules that require communication from our network on the left toward the partner network, but via an alias network 10.0.200.0/22, which is routed in our network to our border firewall. What is important is that the mapping, in both directions, must be deterministic: 10.0.101.55 must be SNATted to 10.0.201.55 and vice versa, or more generically, 10.0.10x.y must be mapped to 10.0.20x.y (and vice versa). To make things more complex, SNAT from the partner network toward us must happen only for some IPs within 10.0.100.0/22.

NOTE: the examples and images shown are taken from FortiManager, the interface/VM used to manage FortiGate firewalls. You can apply the same concepts directly on FortiGate firewalls (the terminology may differ slightly).

Central NAT

As I’ve mentioned before, implementing DNAT and SNAT on Fortinet FortiGate firewalls has never been as simple as on other platforms, but they’ve made a big step forward with Central NAT, which adds two policy tables alongside the standard IPv4 table of security rules that we usually had on FortiManager. There is still something reminiscent of the old way of configuring NAT (Central DNAT rules create VirtualIP objects, which previously had to be applied as destinations in standard security rules, though now you no longer have to do that), but it is easier to use and to read (and less prone to errors).

Security Rules

As I’ve explained before, DNAT happens before security policies are evaluated, so when you write the security policies you must take into account that DNAT has already translated the destination addresses. SNAT, on the contrary, has yet to happen, so the original source addresses must be used. In order to implement the bidirectional communications shown in the example figure above, we need two rules:

IPv4 Security Policies

As you can see, rule #1 has 10.0.100.0/22 as destination, because DNAT from 10.0.200.0/22 to 10.0.100.0/22 has already been applied.
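To give a rough idea, here is a hedged sketch of what rule #1 could look like in the FortiGate CLI; the interface and address object names are hypothetical, so adapt them to your environment:

config firewall policy
    edit 1
        set name "to-partner"
        set srcintf "inside"
        set dstintf "outside"
        set srcaddr "our-network"
        set dstaddr "partner-net-10.0.100.0-22"
        set action accept
        set schedule "always"
        set service "ALL"
    next
end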

DNAT

DNAT is quite easy to implement: it requires a single rule in the Central DNAT table, which in turn creates a single VIP object:

DNAT 10.0.200.0/22 to 10.0.100.0/22

As you can see, you set the range of IP addresses of the /22 network that we “know” on our side and then you specify only the first address of the real Mapped (or Internal) network of the partner. The GUI computes the final IP of that network in order to obtain a 1-to-1, static and deterministic mapping.
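For reference, the resulting VIP object would look roughly like this in the FortiGate CLI; this is a sketch (the object name is mine), so double-check it against your FortiOS version:

config firewall vip
    edit "dnat-partner-alias"
        set extintf "any"
        set extip 10.0.200.0-10.0.203.255
        set mappedip "10.0.100.0-10.0.103.255"
    next
end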

SNAT and the Fixed Port Range IP-Pools

I faced some challenges implementing the opposite direction, since migrating the rules from the old firewall required a deterministic SNAT of 10.0.10x.y to 10.0.20x.y IP addresses, but only for specific sources. Source NAT is implemented in the Central SNAT table, where you write a policy, like the security ones, specifying the source/destination addresses and ports of the traffic to match (with post-DNAT destinations), and then you choose whether to SNAT the traffic onto the IP of the outgoing interface or onto an IP pool. The IP pool can be of different kinds: Overload, One-to-One, Fixed Port Range and Port Block Allocation. One-to-One seemed right at first, since we want to implement a 1-to-1 mapping between two subnets... but that object lets you specify only a single range of IP addresses, so the FortiGate has no way of knowing which source IP (something you specify in the Central SNAT policy, a separate object) will be mapped onto which IP specified in the IP-Pool object. So in my first test I implemented a Central SNAT policy that said “traffic from test-outside with source IP in 10.0.100.0/22 has to be SNATted onto an IP-Pool One-to-One object that specifies 10.0.200.0/22 as its range”. In my tests, a flow from 10.0.100.10 was SNATted onto 10.0.200.1, a second flow from 10.0.100.x onto 10.0.200.2 and so on, so I could not achieve the deterministic 1-to-1 mapping I needed.

I searched a bit through the documentation and read how the Fixed Port Range IP-Pool object works: it allows you to map, for example, a /22 network onto a /24 network (a 4:1 ratio of IP addresses) in a deterministic way, reserving 1/4 of the ports for each IP. You can find more details in the official Fortinet documentation. The key observation is that with a 1:1 ratio of IP addresses between the real network and the mapped one, you can achieve a deterministic 1-to-1 mapping of 10.0.10x.y port w to 10.0.20x.y port w with Fixed Port Range SNAT.

The following formulas can be found in the section “The Equations” of the official docs:

Image taken from https://docs.fortinet.com/document/fortigate/5.4.0/cookbook/414467/fixed-port-range-ip-pools-algorithm

Let’s try with our particular case, with a partner’s source IP of 10.0.102.55:

factor = (10.0.103.255 - 10.0.100.0 + 10.0.203.255 - 10.0.200.0 + 1) / (10.0.203.255 - 10.0.200.0 + 1)
       = (1023 + 1023 + 1) / (1023 + 1)
       = 2047 / 1024
       = 1.999023
       = 1 (an integer conversion cuts the decimal part)
new IP = 10.0.200.0 + (10.0.102.55 - 10.0.100.0) / 1
       = 10.0.200.0 + (2 × 256 + 55)
       = 10.0.200.0 + 567
       = 10.0.202.55

Here you are: 10.0.102.55 is translated to 10.0.202.55! Bingo!

Math with subnets and IP addresses may not seem intuitive at first, but it should be quite clear in the example I’ve just shown.
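To make the arithmetic concrete, here is a minimal Python sketch of the two equations as transcribed above (an illustration of the algorithm, not Fortinet code):

import ipaddress

def fixed_port_range_map(src_ip, int_start, int_end, ext_start, ext_end):
    """Map src_ip from the internal range onto the external range,
    following the Fixed Port Range equations shown above."""
    src = int(ipaddress.IPv4Address(src_ip))
    i_start = int(ipaddress.IPv4Address(int_start))
    i_end = int(ipaddress.IPv4Address(int_end))
    e_start = int(ipaddress.IPv4Address(ext_start))
    e_end = int(ipaddress.IPv4Address(ext_end))

    # Integer division replicates the "integer conversion" step;
    # with a 1:1 ratio of IP addresses the factor collapses to 1
    factor = (i_end - i_start + e_end - e_start + 1) // (e_end - e_start + 1)
    new_ip = e_start + (src - i_start) // factor
    return str(ipaddress.IPv4Address(new_ip))

print(fixed_port_range_map("10.0.102.55",
                           "10.0.100.0", "10.0.103.255",
                           "10.0.200.0", "10.0.203.255"))
# prints 10.0.202.55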

As you can see in the following image, with Fixed Port Range you specify the exact mapping by telling FortiManager what the External IP Range is (the IP addresses as known on our side) and what the Internal (or mapped) IP addresses are (the real IP addresses on the partner’s network):

Fixed Port range IP-Pool
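The CLI equivalent of this object would look more or less like the following sketch (the pool name is hypothetical; verify the options on your FortiOS version):

config firewall ippool
    edit "partner-1to1-fpr"
        set type fixed-port-range
        set startip 10.0.200.0
        set endip 10.0.203.255
        set source-startip 10.0.100.0
        set source-endip 10.0.103.255
    next
end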

Now we can use this IP-Pool object in a Central SNAT rule, where we determine for which IP addresses of the partner network we want the deterministic NAT to happen:

Central SNAT rule details
Central SNAT rule
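Again as a hedged CLI sketch (interface and address object names are hypothetical; the pool name matches the object sketched above):

config firewall central-snat-map
    edit 1
        set srcintf "test-outside"
        set dstintf "inside"
        set orig-addr "partner-selected-ips"
        set dst-addr "all"
        set nat-ippool "partner-1to1-fpr"
    next
end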

Obviously, you could implement the above SNAT with three rules, each with its own Overload/One-to-One IP-Pool translating a single IP onto the corresponding one, but in my example you have a single deterministic 1-to-1 mapping of the original /22 network onto the target /22 network, and you then determine with SNAT policies when it has to take place.

With the above rule, traffic from 10.0.100.5 arrives in our network with source 10.0.200.5, as expected, while traffic from 10.0.101.100, for example, arrives with its original IP address, since it is not matched by the Central SNAT policy.

In a simpler setup, you can make the SNAT happen for the whole 10.0.100.0/22 network by setting it as the source address in the Central SNAT rule:

SNAT for the whole partner’s network

Conclusions

I wanted to share this bit of knowledge because I had some headaches before finding the solution to my problem of deterministic SNAT between subnets. Maybe it will be useful to someone who, like me, would never have thought of using a Fixed Port Range IP-Pool object to implement it, until reading the documentation and understanding how it works in our corner case with a 1:1 ratio of IP addresses.


Recursive DNS + AD-Blocker Part 2: installing Pi-hole without caching on Synology NAS with Docker

Article originally published by me on Medium:

https://medium.com/nerd-for-tech/recursive-dns-ad-blocker-part-2-installing-pi-hole-without-caching-on-synology-nas-with-docker-5363bc7258f4

Installing Pi-hole on a Synology NAS with Docker is quite trivial; disabling caching is not, so let’s see how to do it. You will also learn how to build your own Docker image that overrides the default cache settings. The key info is generic, so it is valuable for other Docker installations too, even if you’re not running Docker on a Synology box.

Introduction

In the previous post, Recursive DNS Resolver with AD-Blocking Features, I explained how to implement, on a Raspberry Pi device, a DNS resolver that blocks ads and malicious sites (Pi-hole) and resolves names recursively (Unbound) without relying on public DNS servers like Google’s. As I said in that post, I have deployed two Pi-holes and two Unbound servers in my home network, to have a bit of redundancy when I’m doing maintenance and to have a bit of fun 🙂 The first Pi-hole+Unbound stack was deployed on an RPi3, so I had to choose another home device that is active 24x7x365 for the second stack: my Synology DS218+ NAS with Docker was the perfect solution. This time we will focus on the Pi-hole installation, leaving Unbound for another post.

This article is about Synology Docker, but the info you’ll find here can be applied to any device on which you’re running Docker. If you’re starting to use Docker on a device without a GUI, unlike the Synology, take a look at the portainer/portainer container, which provides a web GUI to manage Docker images, containers, volumes, etc.

I’ll assume you’ve already installed and configured Docker on the Synology via Package Manager. Screenshots are in Italian, but I think it will be easy to find the equivalent in your language by looking at the positioning of the elements on the interface.

Installing Pi-hole

First of all, open the Docker app, go to Registry and search for pihole. Select the pihole/pihole image, press Download and select the latest tag.

Pi-hole Image Download

Then, before starting a new container with this image, prepare the following folder structure (you can create it via File Station app):

/volume1/docker/pihole/etc-dnsmasq.d
/volume1/docker/pihole/etc-pihole
/volume1/docker/pihole/etc_.pihole_advanced

Note: /volume1 is a folder you can see via the SSH CLI and it’s the folder containing the shared folders on your NAS. I don’t remember whether the docker shared folder is created by the Docker app installation procedure; if not, you can create it via the Synology GUI and then proceed to create the pihole subfolders.

As you can see on the Docker Hub page of pihole/pihole, the first two folders need to be mounted within the container to grant data persistence to your configuration when you need to re-create the container to update the image. The third one is for a workaround, which I will explain later, to set CACHE_SIZE to ZERO (read the previous article to understand why we want it to be zero).

First Pi-hole run

Now you can go to the Images section of the Docker app on the Synology, select pihole/pihole:latest and press Launch. The app will ask you for the initial configuration of the container. Set the name as you want (default is pihole-pihole1) and press Advanced in order to configure the advanced options. Enable automatic restart, if you want, and then move to the Volumes tab: here we will mount the first two folders (ignore the third one) by pressing Add Folder, selecting the folders you’ve created before and mounting them on the correct paths with Read/Write permissions (I usually name the folder on the Synology after the path on which it will be mounted, with dashes instead of slashes):

Volumes’ Bind Mounts

Then move on to the Port Settings tab and expose the ports you want to be reachable from outside. In particular, I want to expose:

  • DNS Service on 53/UDP and 53/TCP ports
  • HTTP Service (GUI) on 8080/TCP port

I’ll expose the DNS ports on the original ones (home devices can’t be pointed to non-standard ports for DNS resolution) while I move the HTTP port to 8080/TCP (80/TCP is used by Synology):

Note: the Synology GUI shows you the services that are declared in the image’s Dockerfile as being exposed by the container, and by default it exposes them on automatically allocated ports. You can remove such mappings if you don’t want to expose some of the container’s services, or you can expose them on statically known ports as above.

Go to the Environment tab (the last one) and set a variable called WEBPASSWORD to the admin password you will use to access the HTTP GUI (if you don’t set it, it will be randomly generated and you will be able to see it in the container logs, by double-clicking on the container and then on the Log tab).

The configuration is finished: apply the setup and let Synology Docker launch the newly created container.
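For reference, outside of the Synology GUI the same container could be started from the shell with something like the following sketch (adjust paths and the password to your setup; /etc/pihole and /etc/dnsmasq.d are the mount points documented on the pihole/pihole Docker Hub page):

docker run -d --name pihole-pihole1 \
  --restart unless-stopped \
  -p 53:53/udp -p 53:53/tcp \
  -p 8080:80/tcp \
  -e WEBPASSWORD='your-admin-password' \
  -v /volume1/docker/pihole/etc-pihole:/etc/pihole \
  -v /volume1/docker/pihole/etc-dnsmasq.d:/etc/dnsmasq.d \
  pihole/pihole:latest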

Setting CACHE_SIZE to zero

As we saw in the previous article, we need to disable caching in order for Unbound’s prefetching and caching to work as expected. I tried to set CACHE_SIZE=0 via an environment variable, and by modifying the /etc/pihole/setupVars.conf that I have on my Synology shared folder (as I did on the RPi3) and that is mounted within the container, but it didn’t work. After a bit of study I understood that the initialization script /root/ph_install.sh, which is automatically executed by the container, configures dnsmasq (the process responsible for Pi-hole DNS resolution and caching) by replacing the placeholders contained within /etc/.pihole/advanced/01-pihole.conf with the values contained in setupVars.conf, and then copies that file to /etc/dnsmasq.d/01-pihole.conf. The strange thing is that it does not take CACHE_SIZE from setupVars.conf; instead, it has CACHE_SIZE=10000 hardcoded within the script.

This is the content of /etc/.pihole/advanced/01-pihole.conf:

# Pi-hole: A black hole for Internet advertisements
# (c) 2017 Pi-hole, LLC (https://pi-hole.net)
# Network-wide ad blocking via your own hardware.
#
# Dnsmasq config for Pi-hole's FTLDNS
#
# This file is copyright under the latest version of the EUPL.
# Please see LICENSE file for your rights under this license.
###############################################################################
#      FILE AUTOMATICALLY POPULATED BY PI-HOLE INSTALL/UPDATE PROCEDURE.      #
# ANY CHANGES MADE TO THIS FILE AFTER INSTALL WILL BE LOST ON THE NEXT UPDATE #
#                                                                             #
#        IF YOU WISH TO CHANGE THE UPSTREAM SERVERS, CHANGE THEM IN:          #
#                      /etc/pihole/setupVars.conf                             #
#                                                                             #
#        ANY OTHER CHANGES SHOULD BE MADE IN A SEPARATE CONFIG FILE           #
#                    WITHIN /etc/dnsmasq.d/yourname.conf                      #
###############################################################################
addn-hosts=/etc/pihole/local.list
addn-hosts=/etc/pihole/custom.list
domain-needed
localise-queries
bogus-priv
no-resolv
server=@DNS1@
server=@DNS2@
interface=@INT@
cache-size=@CACHE_SIZE@
log-queries
log-facility=/var/log/pihole.log
local-ttl=2
log-async

You can see this file by double-clicking on the running container, then on the Terminal tab, and creating a new bash session, as in this screenshot, where you can type the highlighted cat command:

Executing “cat /etc/.pihole/advanced/01-pihole.conf” within the container

Alternatively, you can run the same command via the CLI by connecting to the Synology NAS via SSH, after becoming root:

docker exec -it pihole-pihole1 cat /etc/.pihole/advanced/01-pihole.conf

You can save this file on your NAS by creating /volume1/docker/pihole/etc_.pihole_advanced/01-pihole.conf and modifying cache-size=@CACHE_SIZE@ to cache-size=0.

Then stop the pihole-pihole1 container, select it, click on Modify and add a new read-only file mapping (not a folder mapping this time) in order to mount /volume1/docker/pihole/etc_.pihole_advanced/01-pihole.conf on /etc/.pihole/advanced/01-pihole.conf within the container:

Mounting 01-pihole.conf with cache-size modified
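If you manage Docker from the shell instead of the Synology GUI, the same bind mount is just one more -v option on the docker run command sketched earlier (note the :ro suffix for read-only):

  -v /volume1/docker/pihole/etc_.pihole_advanced/01-pihole.conf:/etc/.pihole/advanced/01-pihole.conf:ro \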

You can now restart the container, connect to the Pi-hole GUI running on http://your-nas-ip:8080/admin, enter the admin password and check on the Settings page that the DNS cache is set to zero:

Check DNS cache size, it must be ZERO if everything works as expected

Now you can configure Pi-hole as you’ve seen in the previous article and point it, as in the other Pi-hole setup, to the Unbound server running on the Raspberry as its upstream DNS. In this way we’ve doubled the Pi-hole servers, a first step toward redundancy (we still have a single Unbound server, so if the RPi goes down, DNS resolution won’t work, but let’s focus on Pi-hole now and double Unbound in the next article).

You can now try the procedure you will use to update a container on Synology and verify that your settings persist and the cache size is still zero:

  1. Shut down the container.
  2. Click on Action -> Erase (not Delete). In Italian the two actions are Cancella and Elimina: the first one deletes the container but keeps its configuration (volume and port mappings, etc.) within the Docker GUI, in order to let you Launch the container again, thus re-creating it from the latest image you’ve downloaded; Elimina instead deletes the container with all of its settings. You must choose the first action, because we do not want to reconfigure everything again. I don’t know the exact words used in English, but I think they should be Erase (Cancella) and Delete (Elimina), so make sure you pick the correct one (the one to avoid should be the lowest one in the menu).
  3. Go back to the Registry, search for pihole again and re-download the latest version.
  4. When the download is finished, launch the container again: it will be re-created with the newly downloaded image and with all of your settings in place.
Erase the container in order to re-create it with an updated image

Note: if you open the Pi-hole GUI, you will be notified about new versions in the footer of the page. If the version numbers are blinking in red, it means that an official update is available. By the way, it may be necessary to wait a few days for an updated container image (you can go to the pihole/pihole Docker Hub page and check whether the latest tag has been recently updated). Automatic container updates can be done via the containrrr/watchtower container running on the Synology, but this is beyond the scope of this article.

What I don’t like about overwriting 01-pihole.conf

As you’ve seen above, we’ve overwritten the content of the /etc/.pihole/advanced/01-pihole.conf file with our modified one in order to force cache-size to zero. If a new version of pihole/pihole with some changes in that file comes out, we won’t use them, because we mount our version of the file over the container’s updated file. So, I wanted to modify the CACHE_SIZE=10000 setting in the ph_install.sh setup script of the container, in order to set it to CACHE_SIZE=0.

Building my own pihole-nocache image

The first thing you might think of is “let’s copy the ph_install.sh script, modify the variable and mount that file over the ph_install.sh script of the container base image”, but this would be exactly the same as overwriting the 01-pihole.conf file.

So I thought I could simply replace that single instruction within ph_install.sh, without replacing the whole file with my own copy.

DISCLAIMER: the following instructions require SSH access to the Synology with the ability to become root, so you must be the NAS administrator. Understand the commands before trying them in a production environment. There should be no risk, but I’m not responsible for bricking your Synology Docker environment.

First of all I’ve created the following folder:

mkdir /volume1/docker/_IMAGES/pihole-nocache

Then I created a Dockerfile within that folder that instructs docker build to create a new image, based on the official pihole/pihole:latest, with my modification applied via the sed command:

# cat /volume1/docker/_IMAGES/pihole-nocache/Dockerfile
FROM pihole/pihole:latest
RUN sed -i -e "s:CACHE_SIZE=[0-9]\+:CACHE_SIZE=0:g" /root/ph_install.sh

So, we replace whatever value CACHE_SIZE has been set to in the ph_install.sh script with ZERO.
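Once you’ve built the image with the command below, you can double-check the substitution by grepping the script inside a throwaway container (a quick sanity check; it assumes grep is available in the image):

docker run --rm --entrypoint grep pihole-nocache:latest CACHE_SIZE= /root/ph_install.sh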

Then you can build your image with the following command:

# docker build --pull /volume1/docker/_IMAGES/pihole-nocache/ -t pihole-nocache:latest
Sending build context to Docker daemon  2.048kB
Step 1/2 : FROM pihole/pihole:latest
latest: Pulling from pihole/pihole
Digest: sha256:3a39992f3e0879a4705d87d0b059513af0749e6ea2579744653fe54ceae360a0
Status: Image is up to date for pihole/pihole:latest
 ---> eb777ee00e0c
Step 2/2 : RUN sed -i -e "s:CACHE_SIZE=[0-9]\+:CACHE_SIZE=0:g" /root/ph_install.sh
 ---> Running in 4305759e9f70
Removing intermediate container 4305759e9f70
 ---> ffb77613b225
Successfully built ffb77613b225
Successfully tagged pihole-nocache:latest

After doing this, you will find the pihole-nocache image in the Images section of the Docker app on the Synology and you will be able to create a new container based on it, following the steps you performed before for the pihole/pihole image. If you’ve deleted the pihole-pihole1 container, you can re-create a new one with the same volume and port mappings, except for the /etc/.pihole/advanced/01-pihole.conf file mount, which is no longer necessary. Start the new container and check that the cache is set to zero without mounting the 01-pihole.conf file, as expected.

Updating the container now requires you to re-launch the docker build command (the --pull option forces the build process to check for an updated version of the base image) instead of performing the download described in step 3 of the update procedure above.

Uploading the image to your private registry

As I’ve said before, I have a Watchtower container running on my Synology NAS that regularly updates my containers by pulling new images and re-creating them automatically. In order to allow it to update my pihole-nocache image, I’ve created a private registry: each night I rebuild the pihole-nocache image with the command explained above and upload it to my registry.

First of all, I created the registry container with the following commands:

# Create registry folder with required subfolders 
mkdir /volume1/docker/registry
mkdir /volume1/docker/registry/certs
mkdir /volume1/docker/registry/var_lib_registry

# Create symbolic links to my Synology NAS cert/key, renewed
# automatically via Letsencrypt (when the certificate is renewed,
# restart the registry container in order to refresh the certificate)
ln -s /usr/syno/etc/certificate/system/default/fullchain.pem /volume1/docker/registry/certs/domain.crt
ln -s /usr/syno/etc/certificate/system/default/privkey.pem /volume1/docker/registry/certs/domain.key
    
# Start the registry on port 55000/TCP
docker run -d \
  -p 55000:5000 \
  --restart=always \
  --name registry \
  -v /volume1/docker/registry/certs/domain.key:/certs/domain.key \
  -v /volume1/docker/registry/certs/domain.crt:/certs/domain.crt \
  -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/domain.crt \
  -e REGISTRY_HTTP_TLS_KEY=/certs/domain.key \
  -v /volume1/docker/registry/var_lib_registry:/var/lib/registry \
  registry:2

Note: in my home network I have a local resolution for the public name my-nas-fqdn.synology.me of my NAS, in order to resolve it to the private IP address of my Synology box. In this way the Synology certificate and key I’m mounting in the registry container will work fine, because they match the FQDN I’m using to reach the container. More details about deploying a private registry (including running it in plain HTTP) can be found in the official documentation at this link.

You can then test whether the registry is listening on port 55000/TCP via the following command (Ctrl+C to terminate execution), which should show you the public certificate of your NAS:

openssl s_client -connect my-nas-fqdn.synology.me:55000

You can add this registry in the Images section of the Docker app and make it active with the my-nas-fqdn name, for example, by clicking on Add, then Add from URL, and entering https://my-nas-fqdn.synology.me:55000 as the URL.

This allows you to search for images in your private registry. Just remember to re-activate the default one when you need to search for official images.

Now we must build the image, tagging it so that docker knows it is an image that will be found on our private registry and not on Docker Hub, by changing the -t option in the following way:

docker build --pull /volume1/docker/_IMAGES/pihole-nocache/ -t my-nas-fqdn.synology.me:55000/pihole-nocache:latest

If you go to the Images section of the Docker app, you will see that the registry of the newly created image is my-nas-fqdn instead of Docker Hub.

We can then upload the image to our registry (here my image was already up-to-date on the registry):

# docker push my-nas-fqdn.synology.me:55000/pihole-nocache:latest
The push refers to repository [my-nas-fqdn:55000/pihole-nocache]
b6d5ed12d029: Layer already exists
e996f48635d0: Layer already exists
010e0146144d: Layer already exists
ef63252a9627: Layer already exists
8eeede195441: Layer already exists
d3088e548a33: Layer already exists
8ac01c752962: Layer already exists
4e3984d0f56b: Layer already exists
3d64ad524bb4: Layer already exists
56e1f33806ae: Layer already exists
0990c4f0eff1: Layer already exists
b546331c0c35: Layer already exists
ea63252b317e: Layer already exists
9eb82f04c782: Layer already exists
latest: digest: sha256:b36cd790060f9f3fc20a5909b722a2f31a1c9c24ed235f03e533884dda3a244e size: 3244

Now, if you do a search on your private registry from the Docker app on the Synology, you will find the pihole-nocache image, and you will be able to download its latest tag.

Enabling automatic re-build and upload of pihole-nocache on my private registry

Finally, in order to let Watchtower find an updated version of pihole-nocache on my private registry, I’ve scheduled the following script to run every day on my NAS; it re-creates and re-uploads, with the latest tag, all the images described by a Dockerfile within each subfolder of the /volume1/docker/_IMAGES folder:

#!/bin/bash

LS="/bin/ls"
RM="/bin/rm"
TEE="/bin/tee"
DATE="/bin/date"
DOCKER="/usr/local/bin/docker"
IMAGESDIR="/volume1/docker/_IMAGES/"
LOG_DIR="/volume1/homes/gianni/scripts/logs"
LOG_FILE="$LOG_DIR/docker_images_rebuild.log"

$RM -f $LOG_FILE

$DATE | $TEE -a $LOG_FILE

for image in $($LS $IMAGESDIR); do
	$DOCKER build --pull $IMAGESDIR/$image/ -t my-nas-fqdn.synology.me:55000/$image:latest | $TEE -a $LOG_FILE
	$DOCKER push my-nas-fqdn.synology.me:55000/$image | $TEE -a $LOG_FILE
done

When you create a new image with the latest tag, the previous image’s tag becomes <none>. It can’t be removed until the container is updated, because it is still in use.

# docker image ls | grep "none\|TAG"
REPOSITORY                                   TAG    IMAGE ID     CREATED      SIZE
my-nas-fqdn.synology.me:55000/pihole-nocache <none> fb155ea738a9 15 hours ago 333MB

When it is no longer in use, you can remove it via the following command:

docker rmi IMAGE_ID

replacing IMAGE_ID with the ID shown by docker image ls.

I’ve scheduled a weekly job that removes dangling images in order to save space (it also shows how to send notifications to a user within the Synology web GUI via the synodsmnotify tool):

#!/bin/bash

DOCKER="/usr/local/bin/docker"
TEE="/bin/tee"
RM="/bin/rm"
ECHO="/bin/echo"
DATE="/bin/date"
NOTIFY="/usr/syno/bin/synodsmnotify"
LOG_DIR="/volume1/homes/gianni/scripts/logs"
LOG_FILE="$LOG_DIR/docker_images_cleanup.log"

$RM -f $LOG_FILE

$DATE | $TEE -a $LOG_FILE

IDS=$($DOCKER images | grep "<none>" | awk '{print $3}')

if [ "x$IDS" == "x" ] ; then
	$ECHO "No docker images with tag <none> found :)" | $TEE -a $LOG_FILE
	$NOTIFY gianni "Docker images OK" "No docker images with tag <none> found :)"
	exit
fi

$DOCKER images --all | $TEE -a $LOG_FILE
$DOCKER rmi -f $IDS | $TEE -a $LOG_FILE


$NOTIFY gianni "Docker images cleaned-up" "Cleaned up docker images with tag <none>"
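As an aside, if you don’t need the per-image logging and the DSM notifications, Docker has a built-in shortcut that removes all dangling images in one go:

docker image prune -f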

Today I updated the image, and the daily run of Watchtower upgraded it and deleted the dangling old image (I use the soulassassin85/docker-telegram-notifier container to send notifications to a private Telegram channel):

Telegram notification sent by Watchtower

Conclusions

I hope you’ve enjoyed reading this post; maybe it can save you a bit of time if you want to solve problems like disabling the cache on a dockerized Pi-hole, or if you are in the process of learning Docker and want to experiment a bit like me. If you have comments, feel free to write below: I started learning Docker two weeks ago, so there may be things that could be done better or in a more efficient way 🙂


Recursive DNS Resolver with AD-Blocking Features

Article originally published by me on Medium:

https://medium.com/nerd-for-tech/recursive-dns-resolver-with-ad-blocking-features-dea766d4f703 

Wouldn’t it be amazing to have a home DNS server which filters advertisements, malicious sites or other bad site categories and recursively resolves names without using public DNS servers like the ones provided by Google, Cloudflare or OpenDNS, for example?

Provider DNSs

The majority of people use the internet at home without touching the configuration of the modem/router provided by the telco, so they use the provider’s DNS servers, which are returned automatically by the provider when the modem brings up the PPPoE connection at power-up.

Free public (and faster) DNS servers

The first step toward a better surfing experience at home is to connect to the router and change the DNS servers returned on the local LAN by the DHCP server, replacing them with some public DNS servers like Cloudflare’s 1.1.1.1, which is very fast, or one of its variants like 1.1.1.2 (no malware domains) or 1.1.1.3 (no malware and no adult content).

The advantage of using these DNS servers is that they are really fast compared to the providers’ ones. The disadvantage of using DNS servers like Google’s is that you give Google a lot of information about the domains you browse: each time you visit www.mysecretsite.org you’re asking Google “what’s the IP address of www.mysecretsite.org?”, thus providing it with the whole list of domains you visit.

Recursive DNS Resolver@Home

The next step is to configure a Recursive DNS resolver in your home network.

A recursive DNS resolver does not ask well-known public DNS servers to resolve Fully Qualified Domain Names (FQDNs); instead, it queries the root servers (which it must know in advance), asking them which DNS servers to ask about the top-level domain of the FQDN. Then it goes on, asking one of the DNS servers it obtained about the next part of the domain, and so on: it acts recursively.

Let’s assume, for example, that we want to get the IP address of www.medium.com and do the recursive resolution “by hand”. I’ve got a list of root DNS servers from wget https://www.internic.net/domain/named.root

It contains for example the following server:

.                        3600000      NS    M.ROOT-SERVERS.NET.
M.ROOT-SERVERS.NET.      3600000      A     202.12.27.33
M.ROOT-SERVERS.NET.      3600000      AAAA  2001:dc3::35

Here is how we can resolve the FQDN www.medium.com recursively:

# Ask root server M which Name Servers (NS) we should use for the
# com TLD. Note: the @DNS-FQDN syntax implies a resolution of
# DNS-FQDN into an IP address first.
dig com NS @M.ROOT-SERVERS.NET
[...]
com.   172800 IN NS f.gtld-servers.net.
[...]
# Ask for the NS to use for medium.com 
dig medium.com NS @f.gtld-servers.net
[...]
medium.com.  172800 IN NS kip.ns.cloudflare.com.
[...]
# Finally, ask for the A (IPv4 Address) record of www.medium.com
dig www.medium.com A @kip.ns.cloudflare.com
[...]
www.medium.com.  300 IN A 162.159.152.4
[...]
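By the way, dig can automate this whole walk for you with a single flag, starting from the root servers and following the delegations down:

dig +trace www.medium.com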

The problem of recursive DNS resolution

If you simply enable a recursive DNS server that recursively resolves every FQDN, you will take a big performance hit in DNS resolution, because recursion requires a lot of time (you could go from 20-30 ms to 200-300 ms per query, or even more).

Caching and prefetching + serving expired 0-TTL resolutions is the way

If you enable a home recursive DNS, you definitely want to enable caching and prefetching:

  • Caching: the resolver recursively resolves an FQDN and then stores it in its local cache for the amount of time specified by the TTL (5 minutes for the www.medium.com resolution above)
  • Prefetching: the resolver can proactively refresh cached FQDNs when you ask for them and their remaining life is under a certain percentage of the returned TTL
  • Serving expired records while refreshing them: if devices in your LAN ask for www.medium.com every 10 minutes, there won’t be a valid entry in the resolver cache, since it has a TTL of 5 minutes. What the resolver can do is return the expired record with a TTL of zero (so that the operating system of the device does not cache it) while starting an immediate recursive resolution for that domain. Recursive resolvers like Unbound support serving expired records, and you can specify a maximum amount of time during which an expired record is returned if it cannot be refreshed recursively (usually you set it to 1 hour, to avoid returning resolutions for removed domains for too long)

Avoiding the Advertisements... and thousands of other bad sites

As I’ve already said above, there are some public DNS resolvers, like Cloudflare for Families, that can filter out part of the undesired DNS resolutions, for example the ones related to adult content, if you want a safer browsing experience for your kids.

What should you do if you want much greater flexibility and you also want to implement recursive resolution on your own at home? In this case Pi-hole is the solution: it acts like a black hole for bad domains, thus the -hole in the name. I think the Pi is due to the fact that a Raspberry Pi is a perfect candidate to run it in your home network, but I’m just guessing 🙂

Implementing the Solution on a Raspberry-Pi (or equivalent)

Pi-hole Configuration

You can buy a Raspberry Pi for a few bucks; you don’t need a new RPi4, an RPi3 like the one I have (and maybe also one of the previous versions) is perfectly fine.

You can install it following the guide here:

https://github.com/pi-hole/pi-hole/#one-step-automated-install

I won’t focus on every aspect of the configuration; I’ll just cover the ad-lists, the DNS setup and caching.

Once installed, you can set an admin password via the following command:

pihole -a -p

Then you can connect to the admin interface on http://RPI_ADDRESS/admin and choose Login in the left menu to log in with the password you’ve just set.

First of all, you can change the DNS servers to which Pi-hole will forward queries by going to Settings -> DNS:

Pi-hole DNS Forwarders Setup

As you can see, you can easily choose primary and secondary IPv4 servers among Google, OpenDNS, etc., or you can set your own Upstream DNS Servers on the right. Keep this in mind, since we will use the custom setup in order to forward queries to the Unbound recursive resolver.

Then there is the cache: if you just want to add the Pi-hole layer of filtering on top of the standard resolution via Google or OpenDNS, you should leave the cache as it is. If you are going to use Unbound, with its own cache, DNSSEC and prefetching mechanism (which is triggered by client DNS queries), you must disable DNSSEC and caching on Pi-hole.

On the same page where you’ve just set up the upstream DNS servers, you must disable DNSSEC if it is enabled (since it can be done by Unbound). I’ve also enabled the forwarding of reverse lookups of private IP addresses, because they will be handled by Unbound; otherwise, if you use public DNS servers as upstreams, leave the option that blocks reverse lookups for private IP addresses enabled, since they can leak information about your network to the outside.

Private IP Reverse Lookup and DNSSEC settings

Caching cannot be disabled via the GUI; it is necessary to disable it via the CLI. So, edit /etc/pihole/setupVars.conf and set the following variable:

CACHE_SIZE=0

then restart pihole with:

systemctl restart pihole-FTL

Reconnect to the GUI, go to Settings -> System and look at the right side: you should see the DNS cache entries all at zero (some caching still happens, but almost none, and it is related to local resolutions).

By default, Pi-hole blocks about 60k domains with the configured ad-list (at the time of writing; it may change in the future), but you can increase the number of blocked domains by adding other lists. You can find more information in this useful blog post:

https://github.com/pi-hole/pi-hole/#one-step-automated-install

Otherwise you can pick your own lists from FilterLists by filtering for lists which are compatible with Pi-hole.

Note: you can no longer add the lists you find in the blog post above by editing the adlists.list file via the CLI; now you must use the GUI, via the Group Management menu on the left. I wanted to add the lists used in the blog post above, so I’ve created some groups with different categories in Group Management -> Groups:

Groups Management

Groups are useful because with a single click you can enable or disable an entire group of adlists, so I suggest using them.

Then go to Groups -> Adlists and add some of the lists that you’ve found:

Adlists Management

As written in gray within the Address input box, you can paste more than one URL at once; they say space-separated URLs, but URLs separated by a newline, like the ones you find in the blog post I’ve linked, work too. After you’ve imported the lists, you can change the group assignment by removing the default group and assigning the category you want. Then go to Tools -> Update Gravity (yeah, it is a black hole, you are going to increase its gravitational pull 🙂 ) and after a while (it can take several seconds) the count of blocked domains in the upper right corner of your dashboard will increase from the original 60k domains to some hundreds of thousands, as in my case:

Unbound configuration

Finally, we must configure the recursive resolver, to complete our DNS AD-Free & Recursive stack.

I’ve installed it on the same RPi3 as Pi-hole, so I will need to change the port on which it listens for DNS queries, since the standard 53/UDP port is used by Pi-hole (and we want Pi-hole on the standard port, since our devices can’t be configured to use non-standard ports for DNS queries). So, we will use 50053/UDP.

You can easily install unbound via apt:

apt install unbound

Then get the root.hints file:

wget https://www.internic.net/domain/named.root -O /etc/unbound/root.hints

Then you can customize the configuration by editing /etc/unbound/unbound.conf

I’ll post here the uncommented lines of my configuration, with some inline comments added to explain it better:

# include additional configuration files (query data minimization for privacy and DNSSEC for security purposes)
# qname-minimisation.conf adds to server section
#   qname-minimisation: yes
# root-auto-trust-anchor-file.conf adds
#   auto-trust-anchor-file: "/var/lib/unbound/root.key"
include: "/etc/unbound/unbound.conf.d/*.conf"
server:
	verbosity: 0
	statistics-cumulative: yes
	# RPi3 has a quad-core CPU, so let's enable 4 threads
	num-threads: 4
	# Listen on port 50053 on every interface
	interface: 0.0.0.0@50053
	# Can enable faster resolutions in multithreaded configuration
	so-reuseport: yes
	access-control: 0.0.0.0/0 refuse
	access-control: 127.0.0.0/8 allow
	access-control: ::0/0 refuse
	access-control: ::1 allow
	access-control: ::ffff:127.0.0.1 allow
	# Allow queries from local lan (useful for testing, but can be 
	# omitted if queries will come only from Pi-hole running on the
	# same host)
	access-control: 192.168.0.0/24 allow
	root-hints: "root.hints"
	private-address: 192.168.0.0/24
	private-domain: "your_local_domain"
	# Enable prefetching of cache elements that are queried when the
	# remaining life is less than 10% of the original TTL
	prefetch: yes

	domain-insecure: "your_local_domain"
	# Enable serving expired cache entries for at most 1-hour if 
	# it is impossible to refresh them
	serve-expired: yes
	serve-expired-ttl: 3600
	# Enable reverse resolution of local DNS names 
	local-zone: "0.168.192.in-addr.arpa." nodefault
	unblock-lan-zones: no
	insecure-lan-zones: no

python:
remote-control:
	# This can be used for querying Unbound about its internal statistics
	control-enable: yes
	control-interface: 0.0.0.0
	control-port: 8953
# In my case, I have forward and reverse resolution for my home domain
# set up on my router 192.168.0.254, so I want to forward local
# resolution queries to it
forward-zone:
	name: "your_local_domain"
	forward-addr: 192.168.0.254
forward-zone:
	name: "0.168.192.in-addr.arpa"
	forward-addr: 192.168.0.254

Restart unbound with the new configuration:

systemctl restart unbound

You can test that Unbound is working via the following command on the device where it is running:

$ dig www.microsoft.com A @127.0.0.1 -p 50053
; <<>> DiG 9.11.5-P4-5.1+deb10u3-Raspbian <<>> www.microsoft.com A @127.0.0.1 -p 50053
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21721
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.microsoft.com.  IN A
;; ANSWER SECTION:
www.microsoft.com. 3600 IN CNAME www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net. 900 IN CNAME www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net. 900 IN CNAME e13678.dscb.akamaiedge.net.
e13678.dscb.akamaiedge.net. 20 IN A 104.113.246.4
;; Query time: 903 msec
;; SERVER: 127.0.0.1#50053(127.0.0.1)
;; WHEN: Sat Apr 03 00:14:29 CEST 2021
;; MSG SIZE  rcvd: 213

As you can see, performance was very bad for the first recursive query. Let’s repeat it:

$ dig www.microsoft.com A @127.0.0.1 -p 50053
; <<>> DiG 9.11.5-P4-5.1+deb10u3-Raspbian <<>> www.microsoft.com A @127.0.0.1 -p 50053
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14653
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.microsoft.com. IN A
;; ANSWER SECTION:
www.microsoft.com. 3527 IN CNAME www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net. 827 IN CNAME www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net. 827 IN CNAME e13678.dscb.akamaiedge.net.
e13678.dscb.akamaiedge.net. 17 IN A 104.113.246.4
;; Query time: 0 msec
;; SERVER: 127.0.0.1#50053(127.0.0.1)
;; WHEN: Sat Apr 03 00:15:42 CEST 2021
;; MSG SIZE rcvd: 213

Wow, now it is in the cache, and we get an immediate reply.

Pointing Pi-hole to Unbound

Now that we have a working recursive DNS resolver, go back to the Pi-hole GUI, Settings -> DNS, and configure Unbound as the upstream resolver. In this case it is running on the same system (127.0.0.1 means the local system) but on a non-standard port, so let’s go with the following:

127.0.0.1#50053

Now, repeat the DNS resolution test with the commands above (remove the -p 50053 option, because Pi-hole is running on the standard 53/UDP port) and check that everything works fine.

If everything is fine, go back to the provider router’s DHCP configuration and replace the Google/Cloudflare/OpenDNS or other DNS servers with the IP address of your Raspberry. Then you can reboot the modem to force your devices to reconnect to the local LAN and get the updated DNS from DHCP.

Monitoring Pi-hole

Pi-hole has its beautiful GUI that lets you monitor its performance, so go to the Dashboard: in the top bar you will see the number of ads your Pi-hole has blocked today. We always talk about ads, but as you have seen it can also block malware and other bad stuff listed in the adlists you’ve configured.

You can also query the following link to get a JSON output of the internal stats. I’ve used it to send data to my InfluxDB2 server and to monitor Pi-hole performance via Grafana:

http://RASPBERRY_IP/admin/api.php
{
  "domains_being_blocked": 367112,
  "dns_queries_today": 160892,
  "ads_blocked_today": 3095,
  "ads_percentage_today": 1.923651,
  "unique_domains": 3030,
  "queries_forwarded": 156650,
  "queries_cached": 213,
  "clients_ever_seen": 10,
  "unique_clients": 8,
  "dns_queries_all_types": 160892,
  "reply_NODATA": 28264,
  "reply_NXDOMAIN": 100,
  "reply_CNAME": 4901,
  "reply_IP": 89307,
  "privacy_level": 0,
  "status": "enabled",
  "gravity_last_updated": {
    "file_exists": true,
    "absolute": 1616893862,
    "relative": {
      "days": 5,
      "hours": 21,
      "minutes": 16
    }
  }
}
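As a starting point for your own monitoring, here is a minimal Python sketch (standard library only) that polls this endpoint and prints a few counters; the InfluxDB upload is left out, and RASPBERRY_IP is a placeholder:

import json
import urllib.request

# Placeholder: replace with your Pi-hole's address
URL = "http://RASPBERRY_IP/admin/api.php"

with urllib.request.urlopen(URL, timeout=5) as resp:
    stats = json.load(resp)

# Print a few interesting counters
for key in ("domains_being_blocked", "dns_queries_today",
            "ads_blocked_today", "queries_forwarded"):
    print(f"{key} = {stats[key]}")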

Monitoring Unbound

You can get some info about how Unbound is performing by using unbound-control. Launch unbound-control without arguments to see all the info you can ask for. We’re interested in statistics now, so let’s run unbound-control stats_noreset (by default, if you invoke it with stats, the statistics are reset after reading, unless you specify statistics-cumulative: yes as I’ve done in my unbound.conf).

You’ll get something like this (I’ll omit the threadX rows that give per-thread statistics, in my case from thread0 to thread3, since I’ve specified to run with 4 threads):

[...]
total.num.queries=458295
total.num.queries_ip_ratelimited=0
total.num.cachehits=442964
total.num.cachemiss=15331
total.num.prefetch=85808
total.num.zero_ttl=20472
total.num.recursivereplies=15331
total.requestlist.avg=0.405877
total.requestlist.max=42
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.069605
total.recursion.time.median=0.0766845
total.tcpusage=0
time.now=1617402872.580515
time.up=261229.371628
time.elapsed=29.142016

I also send this info to my InfluxDB2 server, after parsing it with Python.
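The output is just key=value lines, so the parsing is straightforward; here is a hedged sketch of the kind of parsing I do (the InfluxDB upload is omitted):

import subprocess

# Run unbound-control and parse the key=value output into a dict
out = subprocess.run(["unbound-control", "stats_noreset"],
                     capture_output=True, text=True, check=True).stdout
stats = dict(line.split("=", 1) for line in out.splitlines() if "=" in line)

hits = float(stats["total.num.cachehits"])
queries = float(stats["total.num.queries"])
print(f"cache hit ratio: {hits / queries:.1%}")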

Visualizing the performance of the stack

I won’t go into the details of my monitoring panels in Grafana (that could be an interesting topic for a future post), but I can show you how my Pi-holes and Unbound servers are doing (yes, I have two Pi-holes, and each of them points to the two Unbound servers, with one stack running on an RPi3 and the other running in Docker containers on my Synology NAS, stuff for another post too 🙂).

As you can see, we have almost no caching on the Pi-holes:

Almost no caching on Pi-holes

Unbound gives us a lot of info:

  • Cache hits are over 96% of the number of queries received by both Unbound servers
  • We have a lot of prefetches, which is good because they increase the cache hits
  • We also have a good amount of expired entries, which is the number of queries served via expired records (these are counted as cache hits)
  • What can’t be answered from the cache (even with expired entries) requires a recursive query/reply, which is counted by the last entry below
  • We have no IP-rate-limited queries, and this is normal for a home network with few devices
A lot of Cache Hits and prefetch on Unbound, really good!

Note: you may ask yourself why the number of forwarded queries on the Pi-holes is lower than the number of queries received by the Unbound servers; this is simply due to the fact that Pi-hole numbers are reset every day and sometimes I restart services (updates, device reboots, etc.).

Since Grafana is like a drug, once you have the data you start plotting everything; here you can see some other stuff I plot about my redundant DNS stack 🙂

Note: the utilization of the Pi-holes and Unbound servers is not balanced since 1) the operating system tends to use the first DNS server, which in my case is the Pi-hole running on the Raspberry, and 2) I think (but I’m not sure) that Pi-hole also tends to forward queries to the first DNS server, or maybe it has some mechanism that detects which one answers faster and tends to use it the most.

DNS Resolvers Dashboard #1
DNS Resolvers Dashboard #2

Conclusions

This is my first new article here on Medium; I hope someone will find it inspiring and maybe useful. You can reach me on the social pages linked in my profile if you have questions, or you can comment here 🙂

Update: I’ve added a follow-up story about installing Pi-hole on a Synology NAS with Docker, with tips about how to disable caching in a dockerized environment (which is not as straightforward as on a standard Linux installation). You can find it here: Recursive DNS + AD-Blocker Part 2: Installing Pi-hole without caching on Synology NAS with Docker


Knock Knock, can you open the Firewall? (Linux & MikroTik practical examples)

Expose no services

A lot of people who have a server or NAS at home need to publish some services on the Internet through the so-called port forwarding technique on their home routers. Exposing services makes them detectable by port scanners, which can work out what machines or servers are running in your home network, which software versions they run and what vulnerabilities may affect them. This analysis can be the first step for a malicious user who wants to penetrate your intranet.

Wouldn’t it be amazing to have no services exposed to the simple scanners that continuously scan the public network’s IP addresses?

The Linux knockd daemon solution

Some years ago my home router was a simple low-power alix-1c mini-computer, and I accomplished the goal of having no services exposed to port scanners by using the port-knock server knockd: as the linked man page explains, this service listens to every packet received on a network interface and can execute specific commands upon the reception of a single packet or, more usefully, a sequence of packets. As the examples on the man page show, you can tell knockd to insert an iptables rule when it sees a specific sequence of packets within a specified amount of time, then wait some seconds and execute another command when the timer expires.

Let’s suppose you have a server with IP address 192.168.0.100 listening for SSH connections on port 22/TCP, which you want to expose on port 1022/TCP on the public network, but only when the router sees a sequence of packets like 2222/TCP, 3333/UDP, 4444/TCP received within a 15-second time frame, and only for 10 seconds.

To accomplish this, I would configure a permanent port forwarding rule that translates packets with destination port 1022/TCP received on the internet-facing interface (such as ppp0) of the Linux router into packets with destination port 22/TCP and destination IP address 192.168.0.100; this is done in the PREROUTING chain of iptables’ NAT table. Then I would create a dedicated KNOCKD_FWD_RULES chain (I’ve named it FWD_RULES because I could also have a KNOCKD_INPUT_RULES chain for rules related to traffic destined to the Linux router itself), jump to it from the FORWARD chain of the FILTER table just after the rule that allows ESTABLISHED,RELATED traffic (which has already been authorized), and configure knockd to add/remove rules in that dedicated chain to allow traffic destined to 192.168.0.100 port 22/TCP.

Linux implementation example

iptables configuration (it must be adapted to your setup; here I’m considering a fresh firewall with a DROP policy on the chains, i.e. drop everything not explicitly allowed):

iptables -t nat -A PREROUTING -i ppp0 -p tcp -m tcp --dport 1022 -j DNAT --to-destination 192.168.0.100:22

iptables -N KNOCKD_FWD_RULES

# Filter table is implicit
iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -j KNOCKD_FWD_RULES

Knockd configuration:

[options]
    logfile = /var/log/knockd.log
[opencloseSSH]
    sequence      = 2222:tcp,3333:udp,4444:tcp
    seq_timeout   = 15
    tcpflags      = syn
    start_command = /usr/sbin/iptables -A KNOCKD_FWD_RULES -s %IP% -p tcp -m tcp --dport 22 -d 192.168.0.100 --syn -j ACCEPT
    cmd_timeout   = 10
    stop_command  = /usr/sbin/iptables -D KNOCKD_FWD_RULES -s %IP% -p tcp -m tcp --dport 22 -d 192.168.0.100 --syn -j ACCEPT

Now, when you are outside your home network and need to connect to your home server, you can do the following:

  1. Send the three magic packets to the public IP address of your home router, by using an FQDN such as myhomenetwork.no-ip.org (you can register 3 hostnames for free on NO-IP, for example)
  2. Establish an SSH connection within 10 seconds from the sending of the magic sequence of packets

So, let’s try to connect from a client that goes out to the Internet with IP x.x.x.x:

ssh -p 1022 myhomenetwork.no-ip.org
[timeout - firewall closed]
# Send the magic sequence of packets from client x.x.x.x
nmap -Pn -sS -p 2222 myhomenetwork.no-ip.org
nmap -Pn -sU -p 3333 myhomenetwork.no-ip.org
nmap -Pn -sS -p 4444 myhomenetwork.no-ip.org
[firewall opens the door, new connections allowed from x.x.x.x]
sleep 2s
ssh -p 1022 myhomenetwork.no-ip.org
[connection established]
[firewall closes the door, no NEW connections are allowed from x.x.x.x]
[established connection from x.x.x.x keeps going on]

What I find beautiful about this approach is that:

  • Your firewall is opened only for your client public ip address and only for 10 seconds, so the service is exposed only for the IP address that generated the magic sequence of packets.
  • The service stops being exposed after 10 seconds, and if you established a connection within that small time frame, it will not be cut off when the special rule is removed from KNOCKD_FWD_RULES, because your traffic matches the established connections’ accept rule at the beginning of the FORWARD chain. This also implies that if your client is NATted on a public IP address together with tens of other clients, your service will be open to the other clients behind the same public address only for 10 seconds; then it will “disappear”.

As you can see on the knockd man page I’ve linked above, you can implement even more complex behaviors, for example by using a file with a list of one-time sequences that can trigger a knockd event: each time a sequence is used, it is invalidated, in order to avoid replay attacks if someone sniffing the network where your client is connected learns the magic sequence you use.
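A sketch of how that could look with knockd’s one_time_sequences option, which replaces the fixed sequence of the earlier example (the file path is mine; the file holds one sequence per line, and knockd invalidates each line after use):

[opencloseSSH]
    one_time_sequences = /etc/knockd/ssh_sequences
    seq_timeout        = 15
    tcpflags           = syn
    start_command      = /usr/sbin/iptables -A KNOCKD_FWD_RULES -s %IP% -p tcp -m tcp --dport 22 -d 192.168.0.100 --syn -j ACCEPT
    cmd_timeout        = 10
    stop_command       = /usr/sbin/iptables -D KNOCKD_FWD_RULES -s %IP% -p tcp -m tcp --dport 22 -d 192.168.0.100 --syn -j ACCEPT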

Knocking on a MikroTik door

Three years ago I moved from a Linux router to a MikroTik hAP² router in my home network, and I wanted to implement a sort of knockd by using the tools provided by RouterOS. We don’t have a knockd implementation for RouterOS, but we can implement the example of the previous section using dynamic address lists, which are a useful way to group IP addresses that can be used as a matching condition in firewall rules, and which allow you to add an IP address with an expiration time.

We can do what follows:

  1. Check if a packet with destination port 2222/TCP and the SYN flag set is received on the ppp0 interface, and add the source IP address to the KNOCK_FIRST address list with an expiration time of 15 seconds.
  2. Check if a packet with destination port 3333/UDP is received on the ppp0 interface from an IP address within the KNOCK_FIRST address list, and add the source IP address to the KNOCK_SECOND address list with an expiration time of 10 seconds (I decrease the timeout since fewer packets are needed to complete the magic sequence).
  3. Check if a packet with destination port 4444/TCP and the SYN flag set is received on the ppp0 interface from an IP address within the KNOCK_SECOND address list, and add the source IP address to the TRUSTED_SOURCES address list with an expiration time of 10 seconds, which is the time frame in which the client must start the connection to the temporarily exposed service.
  4. Implement a forwarding rule that allows new sessions from TRUSTED_SOURCES to the intranet service on port 22/TCP of IP 192.168.0.100 (we still allow established/related sessions to go on, without checking the source).

MikroTik implementation example

This is the MikroTik implementation:

# NAT TABLE
/ip firewall nat
# PREROUTING CHAIN
add action=dst-nat chain=dstnat in-interface=ppp0 protocol=tcp dst-port=1022 \
    to-addresses=192.168.0.100 to-ports=22 \
    comment="Port forward 1022/TCP to 192.168.0.100:22"

# FILTER TABLE
/ip firewall filter
# FORWARD CHAIN
# action fast-track allows the connection to be processed via fast-path, 
# but an identical rule with allow action is required if the connection
# follows the slow-path.
# More details on https://wiki.mikrotik.com/wiki/Manual:IP/Fasttrack
add action=fasttrack-connection chain=forward comment="FastTrack Established/Related" \
    connection-state=established,related
add action=accept chain=forward comment="Allow Established/Related" \
    connection-state=established,related
#Allow new connections from Trusted sources
add action=accept chain=forward protocol=tcp connection-state=new \
    src-address-list=TRUSTED_SOURCES dst-port=22 dst-address=192.168.0.100 \
    comment="Allow SSH from Trusted to 192.168.0.100 on port 22/TCP"
#Drop everything not explicitly allowed
add action=drop chain=forward

# INPUT CHAIN
# Port knocking rules
add action=add-src-to-address-list chain=input connection-state=new protocol=tcp \
    dst-port=2222 address-list=KNOCK_FIRST address-list-timeout=15s
add action=add-src-to-address-list chain=input connection-state=new protocol=udp \
    dst-port=3333 src-address-list=KNOCK_FIRST address-list=KNOCK_SECOND address-list-timeout=10s
add action=add-src-to-address-list chain=input connection-state=new protocol=tcp \
    dst-port=4444 src-address-list=KNOCK_SECOND address-list=TRUSTED_SOURCES address-list-timeout=10s
add action=drop chain=input

Here you can see the configuration in a more readable syntax-highlighted format (I’m using Sublime Text with MikroTik syntax highlighting):

[post10_fig4: the MikroTik configuration above, syntax-highlighted in Sublime Text]

You can monitor the insertion of the public IP address in the address-lists with the following command:

[admin@MikroTik] > /ip firewall address-list print
Flags: X - disabled, D - dynamic
 #   LIST                           ADDRESS                                             CREATION-TIME        TIMEOUT
 0 D KNOCK_FIRST                    x.x.x.x                                             jun/15/2020 15:17:30 12s
 1 D KNOCK_SECOND                   x.x.x.x                                             jun/15/2020 15:17:30 8s
 2 D TRUSTED_SOURCES                x.x.x.x                                             jun/15/2020 15:17:30 9s

Knocking from a mobile client

In the previous sections I’ve shown you how to knock from a computer using nmap, but what if you need to connect to your home server via smartphone or tablet? You can easily find port-knocking applications such as Port Knock on iOS, which allows you to specify the sequence of TCP/UDP packets to send and a port to probe after sending the magic sequence, and which can even establish connections via protocols such as SSH (I prefer dedicated clients such as Prompt on iOS). The 10-second timeout we’ve used may be too short if you plan to connect via mobile using different apps for port knocking and for the SSH connection; I think you can safely increase it to 20-30 seconds.

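Assuming the rules shown in the previous section, enlarging that window is a one-liner on the last knocking rule (a sketch; adjust the find filter to your ruleset):

/ip firewall filter
set [find address-list=TRUSTED_SOURCES] address-list-timeout=30s
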
Below is an example of a Port Knock iOS app configuration that knocks on myhomenetwork.no-ip.org ports 2222/TCP, 3333/UDP and 4444/TCP, and after that probes port 1022/TCP to check whether it is open, providing immediate feedback about the reception of the magic sequence by your home router (it also launches an SSH connection to port 1022 with user myuser after the knock, but this can be omitted if you want to use some other client for the SSH connection):

[post10_fig1, post10_fig2, post10_fig3: Port Knock iOS app configuration screenshots]

Conclusions

I hope this article will be useful for anyone who wants to make their home network a little less exposed to the threats of the Internet. Again, you can try the configurations I’ve shown on EVE-NG with the MikroTik cloud router before implementing them in your network, or before convincing yourself to buy one of their amazing routers.

Wake-On-Lan from Public Network (MikroTik practical example)

Wake-On-Lan purpose

Some network devices and PCs can listen for special incoming packets on their Ethernet interfaces even when shut down: this allows them to be powered up by a special magic packet, which is what Wake-On-Lan (WOL from now on) uses.

WOL is usually done by generating a packet whose destination IP address is the broadcast address of the network (in a common 192.168.0.0/24 network it is directed to 192.168.0.255 or 255.255.255.255), which produces an Ethernet frame with FF:FF:FF:FF:FF:FF as destination MAC address. This broadcast frame is processed by all the hosts on the LAN segment. What makes this packet magic? The fact that it must contain the MAC address of the device to be woken up, repeated 16 times. When the powered-off device’s Ethernet card detects this special frame, it powers up the device.

Usually the magic packet is a UDP packet with destination port 0, 7 or 9, but this is not mandatory. BTW, I will use UDP port 9 in the examples.

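If you are curious about the packet itself, building and sending a magic packet takes only a few lines; a minimal sketch in Python (the MAC address and broadcast address are just placeholders):

import socket

def send_magic_packet(mac: str, broadcast_ip: str, port: int = 9) -> None:
    # magic payload: 6 bytes of 0xFF followed by the target MAC repeated 16 times
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    payload = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # required in order to be allowed to send to a broadcast address
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(payload, (broadcast_ip, port))

send_magic_packet("aa:bb:cc:dd:ee:ff", "192.168.0.255")
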
You can find more info on WOL on Wikipedia.

Using WOL from the Internet

Suppose you have a NAS in your home network that you would like to power on only when needed, to get some documents you have stored on it, and that you don’t have other always-on devices on the home network to which you can connect in order to use WOL. Wouldn’t it be useful to be able to use WOL from the Internet? How can we produce a broadcast frame on the internal LAN from the public network?

I’ve built this simple setup in the EVE-NG network simulator, with a virtual MikroTik router that simulates our home router: eth1 is the WAN interface (I’m using private addressing in 192.168.60.0/24, but consider it a public address exposed on the Internet) and eth2-3-4 are grouped in a bridge called lan_bridge with IP address 192.168.1.1/24 and a DHCP server enabled with the 192.168.1.10-192.168.1.50 pool of addresses available for clients on the internal LAN.

[post9_fig1: EVE-NG lab topology]

We could generate a magic packet directed to the public IP address of our home router, but then how can we force the router to turn it into a broadcast packet? The simplest solution that came to my mind was to use destination NAT to change the magic packet destined to 192.168.60.141 to 192.168.1.255, but on MikroTik or Linux-based routers this doesn’t work (I think directed-broadcast forwarding is not supported) and the packet is discarded.

So, how can we generate the magic packet on the 192.168.1.0/24 lan to power up our devices? We can implement the following trick:

  1. Allocate an unused IP address in 192.168.1.0/24, such as 192.168.1.100
  2. Define a static ARP resolution on MikroTik router, setting FF:FF:FF:FF:FF:FF as 192.168.1.100 mac address
  3. Implement a pre-routing destination-NAT rule on the MikroTik router that matches incoming traffic directed to its Internet-facing interface on UDP port X (let’s choose 9999) and changes the destination address to 192.168.1.100 and the destination port to 9

Et voilà: now when you send a magic packet to 192.168.60.141 with destination port 9999/UDP, MikroTik pre-routing NAT processing will change the destination address to 192.168.1.100. MikroTik will then route the packet toward lan_bridge, which is on that subnet, and when it prepares the Ethernet frame that will contain the forwarded packet it will put FF:FF:FF:FF:FF:FF as destination MAC address, thus producing a broadcast frame on the internal LAN, even though the destination IP address is a unicast IP.

Security Warning: a packet sent to the 9999/UDP port of your router’s public address will generate a broadcast packet on your internal network, so it is highly recommended to rate-limit the number of packets that are forwarded.

MikroTik configuration

Basic configuration

# Bridge eth2-3-4
/interface bridge
add name=lan_bridge
/interface bridge vlan
add bridge=lan_bridge vlan-ids=1
/interface bridge port
add bridge=lan_bridge interface=ether2
add bridge=lan_bridge interface=ether3
add bridge=lan_bridge interface=ether4

# Enable DHCP client on eth1 (WAN interface)
/ip dhcp-client
add disabled=no interface=ether1

# Setup internal network and enable DHCP Server
/ip address
add address=192.168.1.1/24 interface=lan_bridge network=192.168.1.0
/ip pool
add name=dhcp_pool ranges=192.168.1.10-192.168.1.50
/ip dhcp-server
add address-pool=dhcp_pool disabled=no interface=lan_bridge name=dhcp_server
/ip dhcp-server network
add address=192.168.1.0/24 gateway=192.168.1.1 netmask=24

ARP resolution “trick”

/ip arp
add address=192.168.1.100 interface=lan_bridge mac-address=FF:FF:FF:FF:FF:FF

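You can check that the static entry is in place with the following command:

/ip arp print where address=192.168.1.100
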
Firewall NAT and Forwarding rules

As suggested before, we will implement a forwarding rule that allows the traffic directed to 192.168.1.100, with a rate-limiting check that allows 1 packet every 10 seconds from a specific public source address, with a burst of 3 (this effectively allows 4 packets to be forwarded; this may be due to how MikroTik dst-limit works, I did not dig into it very much since the practical effect is the same for the purposes of this HowTo).

/ip firewall nat
add action=dst-nat chain=dstnat dst-port=9999 in-interface=ether1 log=yes log-prefix=PRE-RT: protocol=udp \
    to-addresses=192.168.1.100 to-ports=9

/ip firewall filter
add action=accept chain=forward dst-address=192.168.1.100 dst-limit=1/10s,3,src-address in-interface=ether1 log=yes \
    log-prefix=FWD: out-interface=lan_bridge
add action=drop chain=forward log=yes log-prefix=DROP:

Here you can see the whole configuration in a more readable format with the syntax highlighting in Sublime Text:

[post9_fig6: the whole MikroTik configuration, syntax-highlighted in Sublime Text]

Testing MikroTik setup

In order to test what I’ve implemented, I’ve downloaded the wakeonlan software package on my MacBook Pro (from MacPorts) and generated 10 magic packets in a row with the following command (sudo asks for the password only the first time, so the command is repeated 10 times very quickly):

% for i in $(seq 1 10) ; do sudo wakeonlan -i 192.168.60.141 -p 9999 aa:bb:cc:dd:ee:ff ; done
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff
Sending magic packet to 192.168.60.141:9999 with aa:bb:cc:dd:ee:ff

I’ve chosen aa:bb:cc:dd:ee:ff as the MAC address of the device to be woken up. Let’s have a look at the MikroTik logs to see what happens:

[post9_fig2: MikroTik logs]

As you can see, the pre-routing NAT (dstnat chain) rule is triggered 10 times by the 10 packets above, but the forwarding rule that allows the traffic to pass is triggered only 4 times; the other 6 times we have a drop.

I’ve also started a packet capture on the PC interface Fa0/0 (it’s a virtual router, which is why the interface has such a name). The Fa0/0 interface is connected to the MikroTik lan_bridge through the eth2 MikroTik interface; in fact, it gets an IP address via DHCP:

PC#sh dhcp lease
Temp IP addr: 192.168.1.49 for peer on Interface: FastEthernet0/0
Temp sub net mask: 255.255.255.0
DHCP Lease server: 192.168.1.1, state: 5 Bound
DHCP transaction id: 18AE
Lease: 600 secs, Renewal: 300 secs, Rebind: 525 secs
Temp default-gateway addr: 192.168.1.1
Next timer fires after: 00:04:53
Retry count: 0 Client-ID: cisco-c202.0c94.0000-Fa0/0
Client-ID hex dump: 636973636F2D633230322E306339342E
303030302D4661302F30
Hostname: PC

The Fa0/0 interface receives the magic packet due to its FF:FF:FF:FF:FF:FF destination MAC address, but then it ignores the packet because it does not contain traffic destined to its IP address. The only purpose of the PC is to make the LAN segment active in the lab and to show the magic packet reception on the internal LAN. In the following image you can see the 4 magic packets containing the aa:bb:cc:dd:ee:ff MAC address 16 times:

[post9_fig3, post9_fig4: packet capture of the 4 magic packets]

You can also generate the magic packet via phone with apps like WOL on iOS:

[post9_fig5: WOL iOS app]

You can configure your home router to register its public IP address on a service like No-IP and then configure your WOL app with the FQDN you registered (such as myhomenetwork.no-ip.org).

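As an alternative to No-IP, if I remember correctly recent RouterOS versions also ship a built-in DDNS service that registers a name under sn.mynetname.net for your router; a minimal sketch:

/ip cloud set ddns-enabled=yes
# the registered name is shown in the dns-name field
/ip cloud print
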
Conclusions

Credits for this idea go to my boss G.D., who pointed me in the right direction when I was thinking about how to trigger the broadcast packet on my internal LAN from the public network. I hope this will be the first of several quick MikroTik How-Tos showing you the flexibility of these incredible low-cost but powerful routers. I’ve spent about 70 euros for a MikroTik Hap^2 router that is able to manage my 1Gbps internet connection, with 400-500 Mbps wireless data rate peaks on 5GHz, with tens of firewall rules, two PPPoE connections (one in a specialized VRF, maybe this will be the topic of the next article) and scripts running in the background to manage dynamic access control lists (I check the IP addresses of some well-known FQDNs and add them to a trusted-sources list with an expiration time, in order to allow only some public IP addresses to access services in my home network). If you like to experiment, have a look at the MikroTik site, and if you want to experiment without spending a euro, just download the virtual image of the MikroTik Cloud Router and launch it in the EVE-NG Network Simulator to have some fun! 😉

How I’ve got banned from Freeradius Users Mailing List

Freeradius is one of the tens of software tools and devices that a Network Engineer like me must manage at work, and since I’m human and I haven’t spent the last 10 years using Freeradius every day, sometimes I have a problem or I simply don’t know how to implement something. Yesterday I wrote on the Mailing List about a problem in using the statement Exec-Program-Wait to call a script before accepting or rejecting specific users: I noticed that after moving from v2.x to 3.0.13 the statement was accepted but the script was not invoked. So, after having googled a bit without finding any help online (try to find some meaningful documentation or help about Freeradius, it is quite difficult: the modules are well documented, but the docs only help you if you already know a lot about Freeradius, otherwise they’re not always so clear), I wrote an email on the Freeradius Users mailing list:

Hi,
I was using the following syntax on Freeradius 2.x to determine if a user
could connect to a particular IP address, even if the authentication
succeeds, based on some parameters passed to a script:

XXX747 Auth-Type = System, Realm == imp
        Service-Type := Login-User,
        cisco-avpair = "shell:priv-lvl=2",
        Exec-Program-Wait =
"/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address}
%{User-Name} %{Realm}"

It worked on my old 2.x installation, now I'm on the last version available on Red Hat Enterprise 7, which is 3.0.13-10.el7_6. 
The syntax gives no error, but the script is not invoked (it contains
an invocation of the logger system command to put an entry in
/var/log/messages and I can't see it), even if the above entry in the
users (authorize) file is matched.
What could be the problem? If this is the wrong way to implement
this check, can you give me a hint on how I should do it on a 3.x
Freeradius installation?
Wed Jun 19 17:01:52 2019 : Debug: (12) files: users: Matched entry XXX747 at line 497
Wed Jun 19 17:01:52 2019 : Debug: /opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address} %{User-Name} %{Realm}
Wed Jun 19 17:01:52 2019 : Debug: Parsed xlat tree:
Wed Jun 19 17:01:52 2019 : Debug: literal --> /opt/script/radius/bin/check_operator_access.sh
Wed Jun 19 17:01:52 2019 : Debug: attribute --> NAS-IP-Address
Wed Jun 19 17:01:52 2019 : Debug: literal -->
Wed Jun 19 17:01:52 2019 : Debug: attribute --> User-Name
Wed Jun 19 17:01:52 2019 : Debug: literal -->
Wed Jun 19 17:01:52 2019 : Debug: attribute --> Realm
Wed Jun 19 17:01:52 2019 : Debug: (12) files: EXPAND /opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address} %{User-Name} %{Realm}
Wed Jun 19 17:01:52 2019 : Debug: (12) files: --> /opt/script/radius/bin/check_operator_access.sh 172.16.120.218 XXX747 at imp imp
Wed Jun 19 17:01:52 2019 : Debug: (12) modsingle[authorize]: returned from files (rlm_files)
Wed Jun 19 17:01:52 2019 : Debug: (12) [files] = ok

Thank you in advance for any help.

Best regards,
Gianni Costanzi

After my email Alan Dekok (Network Radius CEO) replied with this hint:

> I was using the following syntax on Freeradius 2.x to determine if a user
> could connect to a particular IP address, even if the authentication
> succeeds, based on some parameters passed to a script:
> 
> XXX747 Auth-Type = System, Realm == imp
>        Service-Type := Login-User,
>        cisco-avpair = "shell:priv-lvl=2",
>        Exec-Program-Wait =
> "/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address}
> %{User-Name} %{Realm}"

  Exec-Program-Wait goes in the first line.  It's a check attribute, and isn't a reply attribute.

  Alan DeKok.

I immediately tried what he suggested but still it didn’t work. You can read the remaining messages here:

He gave me some suggestions, then I explained how I was processing requests for some users: I was sending the access request to an authentication proxy called imp, and then I modified the post-auth behavior of Freeradius to force the analysis of the users (authorize) file even after a successful authentication on realm imp, to avoid accepting all the users that have a valid account on imp. So, a simple files.authorize directive in the post-auth section, with a check on the realm being equal to imp, forced the analysis of this user entry even after a successful authentication on imp:

XXX747 Auth-Type = System, Realm == imp 
    Service-Type := Login-User, 
    cisco-avpair = "shell:priv-lvl=2", 
    Exec-Program-Wait = "/opt/script/radius/bin/check_operator_access.sh %{NAS-IP-Address} %{User-Name} %{Realm}"

(I also tried moving Exec-Program-Wait to the first line, as a check item, as DeKok suggested, but then I moved it back to the previous position, where it had always worked on v2.x)

The discussion continued, with Alan suggesting other ways of processing the authentication and me replying that what he suggested did not implement the authentication flow I wanted. Very soon he started answering in his usual style: it seems he has a gun pointed at his head and that he is forced to answer you, poor user who doesn’t have in mind the whole documentation and maybe the source code of Freeradius, as he, the God on earth, does. Then he told me I was childish because I didn’t follow his advice, and at a certain point I summed up everything with a simple question:

To avoid you being so acid (I don’t really understand why, I was quite polite I think), you’ve told me how to call Exec-Program-Wait, with := and on the first line as a check item. I’ve told you that it is not invoked even when I do that. Can you explain me why? Where should I check if there is an error? Is there some different requirement compared to previous versions of freeradius server? I think you can answer me even without further details on this point. 

We’re in an open source community where everyone should help the others if the others are polite and correct. I think I’ve been both, so I don’t really understand why you answered in such a bad way.

This is part of his reply:

> you’ve told me how to call Exec-Program-Wait, with := and
> on the first line as a check item. I’ve told you that it is not invoked
> even when I do that. Can you explain me why?

  <shrug>  You're probably using an old version, or something else is happening.

> Where should I check if there is an error?

  It's Open Source.  You have to source.  Track it down, and supply a patch to fix the bug.

[...]

I can't explain your failure to understand. I've explained myself repeatedly.

My "bad way" of answering is honest frustration at *you*, who is making it as difficult as possible for me to help you.

This is Open Source. You're not paying for support. Don't complain about the answers you get for free. If it's a bug, you have access to the source. Track it down and fix it.

If you're not willing to do that, then the "community" aspect you talked about is bullshit. For you, there is no community. Only others helping you for free, while you refuse to do anything yourself.

I've seen this attitude a lot over the past 20+ years in the open source community. The people complaining the loudest about others are the ones who (a) refuse to follow instructions, and (b) refuse to contribute.

I'll make it simple: follow instructions, read the docs, and you will be able to fix the problem. Keep complaining about how mean we are for helping you, and you will be banned from the list.

As you can see from his reply, you can not be part of the community if you don’t read the whole source code, or if you tell the God on Earth Alan that what he suggested doesn’t fit your needs. Read again what I asked more than once, i.e. “give me some hints about where I should look in order to solve Exec-Program-Wait not working”.

Then I wrote two emails in a row, quite upset about Alan’s usual poor behavior:

Some answers:

- I will go through the docs and the help about the exec module and
the other things you've suggested; I'm not one that doesn't follow
your advice, you've said that without any reason
- I've asked for hints about why Exec-Program-Wait does not work and
the only answer is "probably it is a bug related to an old version";
I just hoped you would give me some further points to check. I
don't know if you've ever worked in a GDPR/PCI-DSS/etc. compliant
company: we are using the latest version available on our Red Hat
Enterprise servers, we can not install the latest development release
- I'm not childish, maybe you are, looking at your answers
- you keep asking why I repeat myself: I did it simply because
the workaround you suggested simply does not implement the
authentication flow I've explained
- have you ever had a look at the other open source communities?
At how experts help and answer problems?

Have a nice day/night
Gianni
Alan, you don't have a gun pointed to your head, if you don't have 
time to explain in detail or you don't have the will to explain
something, just don't answer. If you suggest me to implement things
that do not really implement the behavior I've described you, expect
me to ask you again. How can you reply "maybe you're not running the
latest version, if it is a bug check the source code".. I'm not
running Freeradius 1, I'm not on a years-old Freeradius version. You almost always suggest to have a look at the documentation within
the configuration files, and it is something I almost always do
before asking you something, because I know how you answer 90% of
the time, but that documentation is clear for someone like you that
already knows how it works. This is why there is an ML, to help
people like me understand how to implement something or solve some
problems that we can't solve by looking at the "not-so-clear"
documentation. You quite simplistically said that I don't follow your advice etc,
which is definitely not true. I will dig in the modules' docs and
re-read your answers tomorrow. I just hoped you could give me an
hint about how to check why Exec-Program-Wait was not working,
because it was the simplest way of implementing what I needed and
it worked like a charm on previous versions. I'm not saying that I
don't want to change it at all because I want to stick to the old
configuration, I'm just trying to understand if there is something I can do to understand
why it does not work on my configuration, before spending hours in
understanding how exec or other modules work and if they can help
me fulfill my requirements. Best regards, Gianni Costanzi

While I was writing these two emails, this is the reply I received from Alan:

If you're paying RedHat for support, then ask them for help.

> - I’m not childish, maybe you are, looking at your answers

  And you're gone.

  Alan DeKok.

Then he banned me from the Mailing List. After banning me, he wrote another reply, with a false statement about what I had said and without any possibility for me to post a reply:

 For everyone else reading, this is the key problem.

Alan: Use the "exec" module, it will do what you want

Gianni: I don't want to spend hours reading documentation on exec or *other* modules.  I'm not sure *if* they will do what I want

  It really can't be stated any better than that.  After being given a solution, his response is "no, I don't believe you".

  That's not an acceptable answer.

  Alan DeKok.

At this point I’ve emailed two other people that seem to manage the Users ML with Alan to try to be unbanned, but I’m still waiting for a reply:

Hi,
I'm writing to you to ask to be unbanned from the Freeradius ML; you
can have a look at the conversation I had with Alan DeKok and I
don't think I deserve a ban. He told me I was childish without any
reason (simply because I repeated my requirements more than once,
because I thought he didn't understand them, and because I asked
him why he is so acid when he answers) and he banned me when I
replied that I'm not childish, maybe he is.

Furthermore, is there anyone else that has a bit of patience to
reply to questions from people that maybe don't understand the
documentation of the project, given the fact that it is not so
clear for someone that does not use freeradius everyday?

I wrote to you because I've seen that you run the ML along with
Alan (if the documentation is up-to-date).

Best regards,

    Gianni Costanzi
 

There was another user’s reply on the thread, this is what Thor Spruyt wrote:

This has already been going on (and getting worse) for years...
People seem to be less and less interested in understanding
something. They also think reading documentation (not to mention
source code) is not needed at all, just ask on a forum/maillist
whatever you want.
I think it's ok for people to not want to understand the details,
but expecting others to just deliver whatever they want "on a plate"
is indeed a problem.

I replied privately to him, because I couldn’t reply on the ML anymore:

Hi Thor,
if you read the whole thread, I did not say that I don't believe in
what Alan says or I don't want to dig into the documentation, but
the facts are the following:

1) Freeradius documentation is poor, compared to a lot of other
opensource projects

2) Documentation within freeradius modules is fine for people that
use Freeradius everyday and probably already know how it works

3) I was quite polite, I kept repeating myself because Alan's
proposed solutions were ignoring some of my requirements

4) Alan told me I was using Exec-Program-Wait in the wrong position,
I've followed his advice and then it kept not working. I've asked
him to give me some hints about where I could look to find why it
was not working and he told me that maybe I was on an old version

I can go on for hours; if you read the whole thread I think you can
see that there is no reason to ban me from the ML. Alan 90% of the
time is acid when he answers people, and I don't really understand
why. I have replied on other MLs and sites like Stack Overflow to a
lot of people, and even if they were not understanding what I was
saying I've always tried to explain it better. It always seems that
he must answer even if he doesn't want to; the main problem is that
he is the only expert that replies on that ML, this is the point.

Have a nice day, excuse me if I replied to your private address but
Alan banned me.

Best regards,

   Gianni Costanzi
 
I got really upset about these facts, since I’m someone who reads tons of documentation, especially when it is well done and written for users rather than for people that already know how the piece of software works, such as the Freeradius docs. I’ve also spent hours of my life trying to reply to all kinds of people, experts and non-experts, and I’ve also worked on open-source projects like the Gentoo Documentation Project.
 
Please, don’t judge me or Alan from the snippets above: you can read the whole thread on this page, looking for “Exec-Program-Wait not working” in the subject:
 
 
 
Today I went to work, told my boss that I had been banned from the ML (he knows how Alan behaves, so he was not really surprised) and then started working on a solution, which should have been quite simple for God-Dekok. As I explained, I had forced a passage through the users file even after a successful authentication on the external realm imp, by putting a files.authorize in the post-auth section of the default site. Do you know why the exec module did not invoke Exec-Program-Wait when examining the users file after an Access-Accept from realm imp? Quite simple: exec needed to be called after files.authorize, i.e., the following if block that calls files.authorize must come before the exec statement in the Post-Auth section:
 
post-auth {
    if (&request:Realm && (&request:Realm == "imp")) {
        files.authorize
        exec
    }
}

Now that I’ve understood what the problem was, I think that God-Dekok could have suggested that the issue might be related to the order in which the exec module and the users file are processed, but maybe this is not true. It would have been great to share how I solved the issue and to discuss with Alan why Exec-Program-Wait works when put as a reply item and not as a check item, but I can’t, since he irrevocably banned my email address.
 
That solved my problem. Maybe it is not the exact or most elegant way of implementing what I needed, and if there were a real community with more than a single expert replying on the ML it would be great to discuss how I could implement the expected authentication flow, but with Alan it is simply impossible, because he gives too-quick answers, which usually say “there is this module, read the docs within it”, and when you go and read the docs you realize that they do not really clarify the exact usage of each statement.
 
I think that the main problem of this project, apart from the poor documentation, is that Alan is the only expert that replies to questions on the mailing list, and that he does not do it with the passion that I and thousands of other people have when we want to help someone: it always seems that he answers without the time or the patience required to explain.
 
I hope that someone will read this long post. I surely made some mistakes interacting with Alan, but I think that his poor behavior has no excuse.
 
Here you can find the pages of the thread, in case they are ever removed (I don’t know if that is possible… just in case…)
 
I hope that if you use Freeradius, you’ll never need to interact with Alan Dekok.

Enterprise- & Service Provider-Style Bridging on Juniper MX

Introduction

I started reading Chapter 2 of the Juniper MX Series book a few days ago, which talks about bridging, VLAN mapping and IRB interfaces. It describes two ways of configuring bridging: the simpler Enterprise-style and the more complex but more flexible Service Provider-style. In this small lab I just wanted to try out some configurations I learnt from the book.

This is the setup of my lab in EVE-NG:

[post7_fig1_topology: lab topology]

In this lab we have a Cisco router RT whose interface e0/0 is configured with two sub-interfaces, .100 and .200, with dot1Q encapsulation and tags 1100 and 200, respectively. I’ve put e0/0.100 in VRF100 in order to be able to ping between the two sub-interfaces through the 3 vMX switches (otherwise the ping would be handled locally by the router).

Tagging and bridging configurations are quite strange in this lab (not very smart, you could say) but the purpose is to experiment with configurations 🙂

Packets’ flow

The flow of a packet from RT to vMX3 is the following:

  • The packet exits RT e0/0.100 tagged with 1100 and arrives into ge-0/0/0 on vMX1
  • vMX1 ge-0/0/0 is configured as a Trunk that allows vlan tags 100 and 200 and translates incoming tag 1100 to 100 (and vice versa when packets leave the interface). Due to tag 100, the packet goes into Bridge-Domain BD100 and then exits via interface ge-0/0/1.100, arriving on interface ge-0/0/1.100 of vMX2.
  • vMX2 ge-0/0/1.100 receives the packet, which goes again into BD100 and then exits untagged via interface ge-0/0/2.100, which is configured as an Access interface. The packets are received by ge-0/0/2 on vMX3.
  • vMX3 ge-0/0/2 is again an Access interface (Enterprise-style, so the single unit 0, as shown below) and receives the untagged packets, which are placed in BD100.

BD100 is always configured as a Bridge-Domain with single tag 100, so packets received in this Bridge-Domain have this single tag applied to them.

vMX3 has two Integrated Routing and Bridging (IRB) interfaces, one in BD100 and one in BD200, whose purpose is to route packets between the two vlans.

The flow of a packet from vMX3 to RT is the following:

  • The packet starts in vMX3 BD200 where irb.200 interface is placed and exits via interface ge-0/0/3.200 tagged with vlan tag 200. It is received on vMX2 on ge-0/0/3.200.
  • vMX2 ge-0/0/3.200 receives the packet and places it in BD200. This time BD200 has two tags: 2200 outside and 1200 inside. This means that the incoming packet’s tag 200 is swapped with tag 1200 and then an outer tag 2200 is pushed onto the packet. The packet is then delivered to interface ge-0/0/1.200 which is configured to send packets with outer tag 20 and inner tag 200: this means that tag 1200 is swapped with 200 and 2200 is swapped with 20. The packet is then received by ge-0/0/1.200 on vMX1.
  • vMX1 ge-0/0/1.200 receives the packet with outer tag 20 and inner tag 200 and places it within BD200, which is configured with single tag 200. This means that the outer tag 20 must be popped. The packet is then sent out via interface ge-0/0/0 tagged with 200 and it is received by RT on its e0/0.200 sub-interface.

Enterprise-style configuration

ENT-style configuration is the simplest way of configuring access and trunk interfaces: you define an interface with family bridge and interface-mode trunk or access, specify the allowed vlan(s), and the MX automatically places the traffic in the corresponding bridge. If an interface is an access interface with tag 100, its untagged traffic is placed in a bridge with tag 100; if an interface is a trunk with allowed vlans 100 and 200, traffic tagged with 100 is placed in a bridge with tag 100 and traffic tagged with 200 is placed in a bridge with tag 200. This configuration style is the fastest and easiest, but it allows for less flexibility in interface configuration than Service Provider-style.

Let’s have a look at some interfaces within the lab:

  • vMX1 ge-0/0/0 is configured with a single unit 0 (the only unit allowed with Enterprise Style configuration) and it is a simple trunk with allowed vlans 100 and 200, but it also has a vlan-rewriting configuration that is used to translate tag 1100 to 100 when received from RT:
    # show interfaces
    ge-0/0/0 {
      unit 0 {
        family bridge {
          interface-mode trunk;
          vlan-id-list [ 100 200 ];
          vlan-rewrite {
            translate 1100 100;
          }
        }
      }
    }

    This interface is automatically placed within BD100:

    # show bridge domain BD100 extensive
    
    Routing instance: default-switch
    Bridge domain: BD100 State: Active
    Bridge VLAN ID: 100
    Interfaces:
     ge-0/0/0.0
     ge-0/0/1.100
    Total MAC count: 3
  • vMX3 ge-0/0/2 is configured with unit 0 configured in Access mode on vlan 100:
    # show interfaces
    ge-0/0/2 {
      unit 0 {
        family bridge {
          interface-mode access;
          vlan-id 100;
        }
      }
    }

Service Provider-style Configuration

SP-style configuration allows for more flexibility and automatic vlan-rewriting, but it requires a bit more configuration. Interfaces support more than the single unit 0, and interface units must be manually placed within the corresponding bridges. This manual placement of interfaces within BDs is what enables automatic vlan-rewriting, as we will see in some of the following examples.

Let’s have a look at some interfaces within the lab:

  • vMX1 ge-0/0/1 has two units, 100 and 200:
    • unit 100 is configured with single tag 100 and it is manually placed within BD100:
      # show interfaces
      ge-0/0/1 {
        flexible-vlan-tagging;
        encapsulation extended-vlan-bridge;
        unit 100 {
          vlan-id 100;
        }
      }
      
      # show bridge-domains
      BD100 {
        vlan-id 100;
        interface ge-0/0/1.100;
      }

      Interface unit and BD are both configured with a single tag 100, so there are no vlan-rewriting operations going on:

      # show interfaces ge-0/0/1.100
       Logical interface ge-0/0/1.100 (Index 397) (SNMP ifIndex 577)
       Flags: Up SNMP-Traps 0x20004000 VLAN-Tag [ 0x8100.100 ]
       Encapsulation: Extended-VLAN-Bridge
    • unit 200 is configured with two tags, outer 20 and inner 200, and is manually placed within BD200:
      # show interfaces
      ge-0/0/1 {
        flexible-vlan-tagging;
        encapsulation extended-vlan-bridge;
        unit 200 {
          vlan-tags outer 20 inner 200;
        }
      }
      
      # show bridge-domains
      BD200 {
        vlan-id 200;
        interface ge-0/0/1.200;
      }

      This time packets leaving BD200 through ge-0/0/1.200, or arriving from ge-0/0/1.200 into BD200, must go through a vlan-rewriting step, which happens automatically on the MX: you only need to specify tags for the BD and the interface unit and link those two elements by putting the interface within the bridge-domain, and then the MX implements the proper vlan-rewriting operations:

      # show interfaces ge-0/0/1.200
       Logical interface ge-0/0/1.200 (Index 398) (SNMP ifIndex 579)
       Flags: Up SNMP-Traps 0x20004000
       VLAN-Tag [ 0x8100.20 0x8100.200 ] In(pop) Out(push 0x8100.20)
       Encapsulation: Extended-VLAN-Bridge

      As you can see from the output above, interface traffic has an outer tag 20 and inner tag 200, so incoming traffic must go through a pop operation, which removes tag 20, while outgoing traffic must go through a push 20 operation, which adds tag 20.

  • vMX2 ge-0/0/1 has two units, 100 and 200 configured as on vMX1:
    • unit 100 is configured with single tag 100 and it is manually placed within BD100
    • unit 200 is configured with two tags, outer 20 and inner 200, and is manually placed within BD200. This time we have a double tagged bridge-domain:
      # show bridge-domains
      BD200 {
        vlan-tags outer 2200 inner 1200;
        interface ge-0/0/1.200;
        interface ge-0/0/3.200;
        routing-interface irb.200;
      }

      Again, MX automatically knows what to do with the tags in order to exchange traffic between BD200 and ge-0/0/1.200:

      # show interfaces ge-0/0/1.200 
      Logical interface ge-0/0/1.200 (Index 397) (SNMP ifIndex 583)
      Flags: Up SNMP-Traps 0x20004000
       VLAN-Tag [ 0x8100.20 0x8100.200 ] In(swap-swap .2200 .1200) Out(swap-swap .20 .200)
       Encapsulation: Extended-VLAN-Bridge

      As you can see, the MX must do a double swap of the inner and outer tags to translate incoming and outgoing traffic.

Vlan-tagging and Encapsulation type

When configuring bridging interfaces you can enable the use of single or dual tags with different keywords:

  • vlan-tagging: enables the use of a single tag
  • stacked-vlan-tagging: enables the use of double tags
  • flexible-vlan-tagging: lets you mix single- and double-tagged units within the same physical interface and gives you the opportunity to specify native vlans for both inner and outer tags (have a look at the Juniper MX Series book for further details)

Another flexibility you have is to specify the kind of encapsulation to be used:

  • encapsulation extended-vlan-bridge: applies the vlan-bridge encapsulation to all the units, so it can be used when you have single- or double-tagged units
  • encapsulation flexible-ethernet-services: allows you to specify a different encapsulation per unit. In this way you can have some units which do bridging with vlan-bridge encapsulation, some layer-3 units and some units that do VPLS bridging with vlan-vpls encapsulation (see the sketch below).

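As a minimal sketch of that last option (interface, unit numbers and address are hypothetical, not part of this lab), a single physical port could mix a bridged unit and a routed unit like this:

ge-0/0/5 {
  flexible-vlan-tagging;
  encapsulation flexible-ethernet-services;
  unit 100 {
    /* bridged unit: needs its own per-unit encapsulation */
    encapsulation vlan-bridge;
    vlan-id 100;
  }
  unit 300 {
    /* plain layer-3 unit on the same physical port */
    vlan-id 300;
    family inet {
      address 10.10.30.1/24;
    }
  }
}
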
As you can see in the full configurations attached to the post, I’ve mixed some of the encapsulation and tagging modes in the lab, just for testing purposes.

Cisco & vMX Configuration

Below you can find the full configuration of the lab’s routers, which you can load into your vMX and Cisco routers. Adjust the IP address of the em0 interface on each vMX, which is the one connected to the Netcloud attached to the real LAN, if you want to manage the vMXs via SSH. You can upload a configuration file via SCP and then load it with the following commands:

[edit]
admin# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes

admin# load merge filename.conf
load complete

admin# commit

commit complete

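Alternatively, if I’m not mistaken, you can replace the whole candidate configuration in a single step with load override, without deleting first:

[edit]
admin# load override filename.conf
load complete

admin# commit
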
The vMXs’ user is admin with password admin1; root has password root123.

RT.conf    vMX1.conf    vMX2.conf    vMX3.conf

Note: I’ve also added some IRB interfaces on vMX1 and vMX2, which allowed me to debug communication problems while I was experimenting with the configuration.

Conclusions

I perfectly understand that I did not explain much about the underlying details of MX bridging, but I just wanted to show some of the flexibility of this wonderful machine and maybe increase your curiosity about what you can do with it. I suggest you read the chapter about bridging in the Juniper MX Series book, which goes into deep detail and shows some other, more complex configurations.

Carrier-of-Carriers Inter-provider L3-VPN on Junos vMX

Introduction

In this post I’ll show you an implementation of a Carrier-of-Carriers Inter-provider Layer-3 VPN on Junos vMX. I studied this as the last topic covered in the Juniper Networks JNCIS-IS MPLS Study Guide, which I suggest you read if you want to understand a lot of interesting features of the Juniper platform.

This is the topology of the lab:

[post6_fig2_carrier-of-carriers: lab topology]

As you can see in the image above, I’ve used a convention to number the Loopback and P-t-P interfaces of the routers: it seems complex, but you’ll get used to it after a while 🙂

Some examples:

  • Loopback of CE-SP-PE7: router Rx is within group 1 (the number within the golden square) and x is equal to 7, so its loopback is 192.168.1.7/32
  • Interface of CE-A-11 toward CE1-SP-PE7: it is an inter-group P-t-P link, between R11 and R7, so it is within the 172.16.*.* network. Given that one of the two routers’ numbers is greater than 10, we sum 11 and 7, so we have 18 as the third octet. The IP address is 172.16.18.11/24. The interface is blue, so it is lt-0/0/10.117 with peer unit lt-0/0/10.711: the unit is 117 because we are on the router with ID 11 and the interface points toward the router with ID 7 (so the router with ID 7 will have unit 711 paired with unit 117, following the same logic)
  • Interface of SP-P1 toward SP-PE2: it is an intra-group P-t-P link, between R1 and R2, so it is within the 10.*.*.* network. The group is 0, so the second octet is 0. The third octet is given by XY, where X is the lower-numbered router, R1, and Y is the higher-numbered router, R2, so the third octet is 12. The last octet is given by the router number of SP-P1, so the IP address is 10.0.12.1/24. The interface is green, so it is ge-0/0/0.12 with vlan-id 12: ge-0/0/0 because we are on the router with ID 1 facing the link toward the router with ID 2; the unit is 12 and the vlan-id is the same, the concatenation of the two IDs, lowest first, since both are less than 10. The corresponding interface on the router with ID 2 is ge-0/0/1.12 with vlan-id 12.

As you can see, we’ll use /24 networks even though these are P-t-P links, just to make our numbering scheme possible.

On each link I’ve put one or more letters M, L or R to quickly show which protocols among MPLS, LDP and RSVP are enabled and running on the links.

Route-Distinguishers and Route-Targets used for the L3-VPNs are shown in the graphic.

The whole topology has been built on a single Juniper vMX virtual router running within EVE-NG. Blue links are built on the lt-0/0/10 logical tunnel interface, while green links are built on the ge-0/0/0 and ge-0/0/1 physical interfaces, which are connected to each other. A pair of ge interfaces is sufficient to build as many P-t-P links as we want: it is sufficient to use a different vlan-id for each link. The reason why I’ve inserted some ge links is that EVE-NG allows me to capture packets on those links, while it wouldn’t be possible on the lt-0/0/10 interface. The reason why I’ve used only a pair of ge interfaces instead of more is that sniffing on the ge-0/0/0 interface allows me to see the traffic traversing the 4 segments in a row, with different vlan tags.

This is the lab on EVE-NG (em0 is the management interface connected to my LAN):

[post6_fig1: the lab on EVE-NG]

The switch cloud on the right is a simple way to connect ge-0/0/0 with ge-0/0/1 using EVE-NG linux bridging facility, without the need to configure a virtual switch.

Objective: Service Providers 1 and 2 want to offer an L3-VPN called vpn-a between CE-A-11 and CE-A-12. The two service providers have different AS numbers and are interconnected by Service Provider 0, which will offer them an inter-provider VPN called inter-vpn.

Problem: the vpn-a L3-VPN must be established with an MP-EBGP session between the two ASBRs of SP1 and SP2. No other router in SP1, SP2 or SP0 will know anything about vpn-a. We must build a label-switched path from CE1-SP-PE7 to CE2-SP-PE10. Usually IPv4 (family inet) i/eBGP sessions do not attach a label to the routes sent to the neighbors. This would imply that CE1-SP-PE7 would receive a route to reach the CE2-SP-PE10 loopback without a label attached to it: this would cause the sending of a packet labeled only with a vpn-a VPN label to CE1-SP-PE5, which, as I’ve previously said, doesn’t know anything about the vpn-a L3-VPN and would discard the packet. The same holds for the other routers on the path toward the destination.

Carrier-of-Carriers Service Provider L3-VPN

Service Provider 0 is configured with OSPF Area 0 as IGP. I’ve chosen to use RSVP to build the following LSPs, instead of enabling LDP:

  • Bidirectional LSPs between SP-PE2 and SP-PE3 routers: these LSPs are needed to resolve the next-hop of inter-vpn L3-VPN routes in inet.3.
  • Unidirectional LSPs from the SP-RR4 Route-Reflector toward its RR-clients: these LSPs are needed because routes received by an RR that can not resolve the corresponding next-hop in inet.3 are hidden and thus not “reflected” to other clients.

Each of the SP-PE2 and SP-PE3 routers has an MP-iBGP session with SP-RR4, which has only the inet-vpn family enabled, since we only need to carry labeled L3-VPN routes. SP-P1 is a P-router and doesn’t need to know anything about VPN routes, so it only needs to run OSPF, MPLS and RSVP on its interfaces, to allow the building of the required RSVP LSPs I’ve mentioned above.

The only routes that we need to have within the inter-vpn.inet.0 table on SP-PE2 and SP-PE3 are the loopback addresses of CE1-SP-PE7 and CE2-SP-PE10, which need to reach each other to build the outermost MP-eBGP session for vpn-a. P-t-P links in the inter-vpn routing-instance are not sent as labeled VPN routes over the MP-iBGP session between SP-PE2 and SP-PE3, because they are considered multi-access links and are only advertised if there is a route with a next-hop on that link or if we’re using vrf-table-label within the routing-instance (which is not the case for inter-vpn).

Below is the output of some show commands executed on the routers within SP0:

-- SP-PE2 --
admin> show route logical-system SP-PE2

inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

10.0.12.0/24 *[Direct/0] 01:14:57
 > via ge-0/0/1.12
10.0.12.2/32 *[Local/0] 01:14:59
 Local via ge-0/0/1.12
10.0.13.0/24 *[OSPF/10] 01:14:43, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
10.0.14.0/24 *[OSPF/10] 01:14:43, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
192.168.0.1/32 *[OSPF/10] 01:14:43, metric 1
 > to 10.0.12.1 via ge-0/0/1.12
192.168.0.2/32 *[Direct/0] 01:15:43
 > via lo0.2
192.168.0.3/32 *[OSPF/10] 01:14:33, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
192.168.0.4/32 *[OSPF/10] 01:14:38, metric 2
 > to 10.0.12.1 via ge-0/0/1.12
224.0.0.5/32 *[OSPF/10] 01:15:46, metric 1
 MultiRecv

inet.3: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.0.3/32 *[RSVP/7/1] 01:14:16, metric 2
 > to 10.0.12.1 via ge-0/0/1.12, label-switched-path from-SP-PE2-to-SP-PE3

inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

172.16.25.0/24 *[Direct/0] 01:14:58
 > via ge-0/0/0.25
172.16.25.2/32 *[Local/0] 01:14:59
 Local via ge-0/0/0.25
192.168.1.7/32 *[BGP/170] 01:14:35, localpref 100
 AS path: 65100 I, validation-state: unverified
 > to 172.16.25.5 via ge-0/0/0.25, Push 299824
192.168.2.10/32 *[BGP/170] 01:14:16, localpref 100, from 192.168.0.4
 AS path: 65200 I, validation-state: unverified
 > to 10.0.12.1 via ge-0/0/1.12, label-switched-path from-SP-PE2-to-SP-PE3
[...]

admin> show mpls lsp logical-system SP-PE2 ingress detail
Ingress LSP: 1 sessions

192.168.0.3
 From: 192.168.0.2, State: Up, ActiveRoute: 0, LSPname: from-SP-PE2-to-SP-PE3
 ActivePath: (primary)
 LSPtype: Static Configured, Penultimate hop popping
 LoadBalance: Random
 Encoding type: Packet, Switching type: Packet, GPID: IPv4
 *Primary State: Up
 Priorities: 7 0
 SmartOptimizeTimer: 180
 Computed ERO (S [L] denotes strict [loose] hops): (CSPF metric: 2)
 10.0.12.1 S 10.0.13.3 S
 Received RRO (ProtectionFlag 1=Available 2=InUse 4=B/W 8=Node 10=SoftPreempt 20=Node-ID):
 10.0.12.1 10.0.13.3
Total 1 displayed, Up 1, Down 0

-- SP-RR4 --
admin> show route logical-system SP-RR4 table inet.3

inet.3: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.0.2/32 *[RSVP/7/1] 01:18:23, metric 2
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE2
192.168.0.3/32 *[RSVP/7/1] 01:18:22, metric 2
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE3

admin> show route logical-system SP-RR4 table bgp.l3vpn.0

bgp.l3vpn.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

65000:2:192.168.1.7/32
 *[BGP/170] 01:18:47, localpref 100, from 192.168.0.2
 AS path: 65100 I, validation-state: unverified
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE2
65000:3:192.168.2.10/32
 *[BGP/170] 01:18:43, localpref 100, from 192.168.0.3
 AS path: 65200 I, validation-state: unverified
 > to 10.0.14.1 via lt-0/0/10.41, label-switched-path from-SP-RR4-to-SP-PE3

-- SP-P1 (only inet.0 and mpls.0 routes) --
admin> show route logical-system SP-P1 terse | match routes

inet.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)

mpls.0: 12 destinations, 12 routes (12 active, 0 holddown, 0 hidden)

Service Provider 1

Now let’s have a look at SP1 (SP2 is configured the same way): again, we have OSPF running on the three nodes and LSPs between CE1-SP-PE7 and CE1-SP-PE5, this time built by LDP instead of RSVP. As we’ve previously said, we need a label-switched path that spans different autonomous systems, so we must find a way for the 192.168.2.10/32 route of the CE2-SP-PE10 loopback to flow through SP2, the inter-vpn on SP0 and then SP1 toward CE1-SP-PE7 with a label attached to it. This is accomplished through two BGP sessions, an MP-eBGP session between SP0 and SP1 and an MP-iBGP session within SP1, as shown in the topology, both with the labeled-unicast feature enabled within the inet address family. This feature tells the router to attach a label to the IPv4 routes it sends to the BGP neighbor. I’ll show you the route to 192.168.2.10/32 on different routers, with the label(s) attached to it:

-- SP-PE3 inter-vpn.inet.0: route with the label attached by CE2-SP-PE8 due to labeled-unicast --
admin> show route logical-system SP-PE3 table inter-vpn.inet.0 192.168.2.10/32

inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:32:39, localpref 100
 AS path: 65200 I, validation-state: unverified
 > to 172.16.38.8 via lt-0/0/10.38, Push 299824

-- SP-PE2 inter-vpn.inet.0: route with the L3-VPN label attached to the route by SP-PE3 and the RSVP LSP label on top of it to reach SP-PE3 (some lines are omitted for brevity) --
admin> show route logical-system SP-PE2 table inter-vpn.inet.0 192.168.2.10/32 detail

inter-vpn.inet.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
192.168.2.10/32 (1 entry, 1 announced)
 *BGP Preference: 170/-101
 Source: 192.168.0.4
 Next hop: 10.0.12.1 via ge-0/0/1.12, selected
 Label-switched-path from-SP-PE2-to-SP-PE3
 Label operation: Push 299776, Push 299824(top)
 Protocol next hop: 192.168.0.3
 VPN Label: 299776

-- CE1-SP-PE5 inet.0: route with the label attached by SP-PE2 due to labeled-unicast --

admin> show route logical-system CE1-SP-PE5 table inet.0 192.168.2.10/32

inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:32:44, localpref 100
 AS path: 65000 65200 I, validation-state: unverified
 > to 172.16.25.2 via ge-0/0/1.25, Push 299808

-- CE1-SP-PE7 inet.0: route with the label attached by CE1-SP-PE5 due to labeled-unicast and the top label of the LSP toward CE1-SP-PE5 --

admin> show route logical-system CE1-SP-PE7 table inet.0 192.168.2.10/32

inet.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:32:54, localpref 100, from 192.168.1.5
 AS path: 65000 65200 I, validation-state: unverified
 > to 10.1.67.6 via ge-0/0/1.67, Push 299840, Push 299792(top)

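For reference, labeled-unicast is just a per-family knob on the BGP session. A minimal sketch of what the eBGP side of CE1-SP-PE5 toward SP-PE2 could look like (group and policy names are hypothetical, addresses follow the lab numbering):

group SP0-external {
 type external;
 family inet {
  labeled-unicast;
 }
 export export-loopback;
 peer-as 65000;
 neighbor 172.16.25.2;
}
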
I’ve omitted the route as it is seen on CE2-SP-PE8: in this case we receive a labeled IPv4 BGP route (preference 170) for 192.168.2.10/32 from CE2-SP-PE10, but the same route is also received via OSPF (preference 10), so the installed route has no label. This would cause a problem when CE1-SP-PE7 sends a packet with a vpn-a MPLS label: when it reaches CE2-SP-PE8 it would be forwarded to CE2-SP-P9 toward CE2-SP-PE10 with only the vpn-a label, which would be unknown to CE2-SP-P9. We must force the iBGP labeled route to be installed in the CE2-SP-PE8 inet.0 table (even if the label is an implicit null, as we will see: having a BGP route instead of an OSPF one forces the use of the LSP toward CE2-SP-PE10 to deliver the packet, thus adding an additional label that will be popped by CE2-SP-P9), so I’ve raised the OSPF preference to 200 (the knob itself is sketched right after the following output):

-- CE2-SP-PE8 --

admin> show route logical-system CE2-SP-PE8 table inet.0 192.168.2.10/32

inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.10/32 *[BGP/170] 01:47:34, localpref 100, from 192.168.2.10
 AS path: I, validation-state: unverified
 > to 10.2.89.9 via lt-0/0/10.89, Push 299776
 [OSPF/200] 01:47:37, metric 2
 > to 10.2.89.9 via lt-0/0/10.89
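
For reference, raising the preference is a one-liner (a sketch, assuming the same logical-system layout used throughout the lab):

[edit]
admin# set logical-systems CE2-SP-PE8 protocols ospf preference 200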

Label 299776 is not the label attached to the route by CE2-SP-PE10; it is the label used to reach the next-hop 192.168.2.10, resolved via the inet.3 table:

admin> show route logical-system CE2-SP-PE8 table inet.3

inet.3: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.2.9/32 *[LDP/9] 10:12:23, metric 1
 > to 10.2.89.9 via lt-0/0/10.89
192.168.2.10/32 *[LDP/9] 10:12:23, metric 2
 > to 10.2.89.9 via lt-0/0/10.89, Push 299776

CE2-SP-PE10 in fact sends a labeled route to CE2-SP-PE8, but the label is an implicit null (reserved label 3), which means “you do not have to use a label to reach that route, just reach me”:

admin> show route logical-system CE2-SP-PE8 receive-protocol bgp 192.168.2.10 extensive

inet.0: 10 destinations, 11 routes (10 active, 1 holddown, 0 hidden)
* 192.168.2.10/32 (2 entries, 2 announced)
 Accepted
 Route Label: 3
 Nexthop: 192.168.2.10
 Localpref: 100
 AS path: I

In fact, a packet received by CE2-SP-PE8 and directed to CE2-SP-PE10’s vpn-a needs only the vpn-a label and, on top of it, a label to reach CE2-SP-PE10 via an LDP LSP. The outer label will then be removed by CE2-SP-P9 due to Penultimate-Hop-Popping, and a packet with only the VPN label will be delivered to CE2-SP-PE10.

VPN-A customer’s L3VPN between SP1 and SP2

As a final step, once we have a bidirectional LSP between CE1-SP-PE7 and CE2-SP-PE10, we can build the vpn-a L3VPN with an MP-eBGP session between those two routers. To make vpn-a work, as we’ve explained for inter-vpn, the eBGP next-hop must be reachable in the inet.3 routing table. This is accomplished by adding the resolve-vpn keyword to labeled-unicast, which forces labeled IPv4 routes to be installed also in inet.3. Below are the BGP configuration (internal and external) of CE1-SP-PE7 and the configuration of the routing-instance for vpn-a, where I’ve used vrf-table-label (which forces a lookup in the vpn-a.inet.0 routing table instead of directly sending VPN packets destined to CE-A-11 out of the P-t-P interface) and static routing to reach the loopbacks of the connected CEs of Customer A (the CEs in turn have a default route toward their Service Provider):

-- CE1-SP-PE7 --
admin> show configuration logical-systems CE1-SP-PE7 protocols bgp
group SP1-Internal {
 type internal;
 local-address 192.168.1.7;
 family inet {
  labeled-unicast {
   resolve-vpn;
  }
 }
 export export-loopback;
 neighbor 192.168.1.5;
}
group SP-1-2-external {
 type external;
 multihop;
 local-address 192.168.1.7;
 family inet-vpn {
  unicast;
 }
 peer-as 65200;
 neighbor 192.168.2.10;
}

admin> show configuration logical-systems CE1-SP-PE7 routing-instances vpn-a
instance-type vrf;
interface lt-0/0/10.711;
route-distinguisher 65012:100;
vrf-target target:65012:0;
vrf-table-label;
routing-options {
 static {
  route 192.168.4.11/32 next-hop 172.16.18.11;
 }
}
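
On the customer side, each CE only needs a static default route toward its provider. A minimal sketch for CE-A-11 (the next-hop 172.16.18.7 is CE1-SP-PE7’s address on the shared subnet, as you can see in the vpn-a.inet.0 output below):

[edit]
admin# set logical-systems CE-A-11 routing-options static route 0.0.0.0/0 next-hop 172.16.18.7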

Below is the output of some show commands on the same router, which show the use of three labels: label 16 is the VPN label (its value is so low due to the vrf-table-label statement), 299840 is the label attached to the 192.168.2.10/32 route sent by CE1-SP-PE5 to CE1-SP-PE7, and 299792 is the LDP label associated with the LSP toward CE1-SP-PE5:

-- CE1-SP-PE7 --
admin> show route logical-system CE1-SP-PE7 table vpn-a.inet.0

vpn-a.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

172.16.18.0/24 *[Direct/0] 00:01:32
 > via lt-0/0/10.711
172.16.18.7/32 *[Local/0] 02:18:07
 Local via lt-0/0/10.711
172.16.22.0/24 *[BGP/170] 00:01:32, localpref 100, from 192.168.2.10
 AS path: 65200 I, validation-state: unverified
 > to 10.1.67.6 via ge-0/0/1.67, Push 16, Push 299840, Push 299792(top)
192.168.4.11/32 *[Static/5] 00:01:32
 > to 172.16.18.11 via lt-0/0/10.711
192.168.4.12/32 *[BGP/170] 00:01:32, localpref 100, from 192.168.2.10
 AS path: 65200 I, validation-state: unverified
 > to 10.1.67.6 via ge-0/0/1.67, Push 16, Push 299840, Push 299792(top)

As you can see, the reception of label 16 forces a second lookup within vpn-a.inet.0 routing table, instead of directly sending the packet toward CE-A-11 on the lt-0/0/10.711 interface of CE1-SP-PE7:

admin> show route logical-system CE1-SP-PE7 table mpls.0 label 16

mpls.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

16 *[VPN/0] 21:14:18
 to table vpn-a.inet.0, Pop

Verifying the connection between CE-A-11 and CE-A-12

Now it is time to verify the connectivity between the two Customer-A routers with a traceroute. In order to have information for every hop within the network, each MPLS-enabled router must have icmp-tunneling enabled within the protocols mpls stanza. Otherwise, a packet whose IP Time-To-Live expires would produce a reply addressed to the source of the packet, 192.168.4.11 in our example, which is completely unknown to all the routers except the SP1 and SP2 ASBRs, i.e. CE1-SP-PE7 and CE2-SP-PE10. Enabling icmp-tunneling forces the router where the TTL expires to build an ICMP response that is sent toward the destination instead of the source, keeping the original MPLS labels; when it reaches CE2-SP-PE10, in our example, the lookup within the vpn-a.inet.0 table shows that the ICMP destination is 192.168.4.11, and the reply is sent back toward the source of the traceroute’s UDP packets. I’ll add some info about each label on every hop.
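
The knob itself is tiny; a sketch for one node, to be repeated on every MPLS-enabled logical system in the lab:

[edit]
admin# set logical-systems CE1-SP-PE5 protocols mpls icmp-tunneling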

admin> traceroute logical-system CE-A-11 192.168.4.12 source 192.168.4.11
traceroute to 192.168.4.12 (192.168.4.12) from 192.168.4.11, 30 hops max, 40 byte packets

 1 172.16.18.7 (172.16.18.7) 1.237 ms 0.668 ms 0.545 ms

 2 10.1.67.6 (10.1.67.6) 3.833 ms 4.254 ms 3.706 ms
 MPLS Label=299792 CoS=0 TTL=1 S=0 => LDP label to reach CE1-SP-PE5
 MPLS Label=299840 CoS=0 TTL=1 S=0 => iBGP labeled-unicast label for 192.168.2.10/32 received from CE1-SP-PE5
 MPLS Label=16 CoS=0 TTL=1 S=1 => vpn-a label for 192.168.4.12, advertised by CE2-SP-PE10
 
3 10.1.56.5 (10.1.56.5) 3.704 ms 4.209 ms 3.687 ms => PHP removes the outermost label
 MPLS Label=299840 CoS=0 TTL=1 S=0 
 MPLS Label=16 CoS=0 TTL=2 S=1
 
4 172.16.25.2 (172.16.25.2) 3.770 ms 4.368 ms 3.605 ms
 MPLS Label=299792 CoS=0 TTL=1 S=0 => eBGP labeled-unicast label for 192.168.2.10/32 received from SP-PE2. It replaced label 299840
 MPLS Label=16 CoS=0 TTL=3 S=1
 
5 10.0.12.1 (10.0.12.1) 5.300 ms 5.557 ms 3.600 ms
 MPLS Label=299824 CoS=0 TTL=1 S=0 => RSVP label for from-SP-PE2-to-SP-PE3 LSP toward 192.168.0.3
 MPLS Label=299776 CoS=0 TTL=1 S=0 => labeled-unicast label for 192.168.2.10/32 in the inter-vpn table, received from 192.168.0.3 through Route-Reflector 192.168.0.4. It replaced label 299792
 MPLS Label=16 CoS=0 TTL=4 S=1
 
6 10.0.13.3 (10.0.13.3) 4.809 ms 3.749 ms 4.161 ms => PHP removes the outermost label
 MPLS Label=299776 CoS=0 TTL=1 S=0
 MPLS Label=16 CoS=0 TTL=5 S=1
 
7 172.16.38.8 (172.16.38.8) 3.610 ms 4.518 ms 3.661 ms
 MPLS Label=299824 CoS=0 TTL=1 S=0 => eBGP labeled-unicast label for 192.168.2.10/32 received from CE2-SP-PE8. It replaced label 299776
 MPLS Label=16 CoS=0 TTL=6 S=1
 
8 10.2.89.9 (10.2.89.9) 3.633 ms 4.246 ms 3.619 ms => third label not added due to the implicit-null label received for 192.168.2.10/32 from CE2-SP-PE10 through the iBGP labeled-unicast route advertisement
 MPLS Label=299776 CoS=0 TTL=1 S=0 => LDP label to reach CE2-SP-PE10. It replaced label 299824
 MPLS Label=16 CoS=0 TTL=7 S=1
 
9 172.16.22.10 (172.16.22.10) 4.002 ms 4.404 ms 3.705 ms => PHP removed the outermost label. CE2-SP-PE10 received the packet with only label 16, which was popped for a second lookup within vpn-a.inet.0 due to vrf-table-label

10 192.168.4.12 (192.168.4.12) 4.000 ms 3.831 ms 4.493 ms

Then, I’ve started a packet capture on the ge-0/0/0 interface within EVE-NG and ran the following command on the vMX:

admin> ping logical-system CE-A-12 192.168.4.11 source 192.168.4.12 count 1

As you can see from the image below, I’ve captured the same echo request and echo reply packets 4 times in a row:

post6_fig3_captured_packets

Looking at the echo request packets, you can see all the labels we’ve seen in the traceroute output above (look at the vlan-id: it tells you which green link the packet is traversing):

post6_fig4_capture_detail

These are the replies captured on the same interface (I’m showing them too because they carry the triple-label stack when the packet leaves the L3VPN routing instance on CE1-SP-PE7 and SP-PE2):

post6_fig5_capture_detail_2

vMX Configuration

Below you can find the full configuration of the lab, which you can load onto your vMX router. Adjust the IP address of the em0 interface, which is the one that can be connected to the Net cloud attached to the real LAN, if you want to manage the vMX via SSH. You can upload the configuration file via SSH (copy and paste via console can give you buffer problems) and then load it with the following commands:

[edit]
admin# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes

admin# load merge carrier-of-carriers_vMX_topology.cfg
load complete

admin# commit

commit complete

User is admin with password admin1, root has password root123.

carrier-of-carriers_vMX_topology.cfg

Conclusion

I hope you’ve read the whole thing; it has been a long post, but I hope you’ve found the topic as interesting as I did. I suggest reading the JNCIS-SP Study Guide available from Juniper Networks to learn a lot of interesting things about MPLS on Junos, which can be successfully tested on a vMX platform.

As I usually say, post comments or questions, or even tell me if I’ve made some mistakes: I’ve only gone through this stuff for a few days and there is surely room to improve my skills 🙂


Enabling compression on base-images in Unetlab/EVE-NG Alpha

Introduction

In this small tutorial we’ll see how to use QEMU image compression to compress base images in Unetlab/EVE-NG Alpha. For some details about where files are stored, have a look at my previous post Modifying base-images with snapshots on Unetlab/EVE-NG Alpha.

Update 2017-01-25: after asking for some info about QEMU compression on the QEMU users’ mailing list, Alberto Garcia clarified some aspects of how compression works that were not clear to me:

I think there’s some misunderstanding here about compressed images in
QEMU. I’ll try to clarify:

* You create a compressed image with ‘qemu-img convert -c’. That is a
copy of the original image with all the clusters compressed.

* The compression is read-only: QEMU will read the compressed clusters,
  but everything that it writes will be uncompressed (also if you
  rewrite compressed clusters).

* Therefore, there’s no such thing as an image with compression
enabled. In QEMU you don’t compress an image, you compress
individual clusters of data. An image can have a mix of compressed
and uncompressed clusters.

Compressing base images

Suppose we have an already working base image or an hda.qcow2 virtual hard disk prepared to be used. I’ll use the TinyCore Linux image from the previous post. Let’s create a new folder and clone the /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 base image, enabling compression. This time we must make a full clone; we cannot use snapshots:

root@eve-ng:/# mkdir /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed

root@eve-ng:/# cd /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed# /opt/qemu/bin/qemu-img convert -c -f qcow2 -O qcow2 /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 hda.qcow2

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed# cd ..

root@eve-ng:/opt/unetlab/addons/qemu# ls -l linux-tiny-core-7.2/hda.qcow2
-rw-r--r-- 1 root root 76414976 Nov 11 10:21 linux-tiny-core-7.2/hda.qcow2

root@eve-ng:/opt/unetlab/addons/qemu# ls -l linux-tiny-core-7.2-compressed/hda.qcow2
-rw-r--r-- 1 root root 72351744 Jan 6 14:49 linux-tiny-core-7.2-compressed/hda.qcow2

As you can see, with TinyCore Linux I won’t save much space, but with images such as Radware Alteon you can easily gain 1 GByte or more by compressing the base image.
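
If you prefer qemu-img over ls to check the allocated size, the info command reports it too (for the compressed image above it should report roughly 69M, matching the 72351744 bytes shown by ls):

root@eve-ng:/opt/unetlab/addons/qemu# /opt/qemu/bin/qemu-img info linux-tiny-core-7.2-compressed/hda.qcow2 | grep "disk size"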

Unfortunately, I’ve not found a way to explicitly show that an image has compression enabled; if you know one, let me know! => As I wrote in the update at the beginning of the post, the very concept of enabling compression with QEMU is flawed: you compress blocks of data of the qcow2 image, but there is no “compression enabled” state, and new data is written uncompressed.

Testing the compressed base image

I’ve added a new node, node 4, based on the new compressed image, to the lab we’ve used in a previous post. Everything works as expected; EVE makes a snapshot of the compressed image:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/4# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed/hda.qcow2
Format specific information:
 compat: 1.1
 lazy refcounts: false
 refcount bits: 16
 corrupt: false

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-compressed/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 68M
cluster_size: 65536
Format specific information:
 compat: 1.1
 lazy refcounts: false
 refcount bits: 16
 corrupt: false

Conclusions

This small tutorial showed you how to save some space by compressing base images with the QEMU tools. I’ve not done extensive tests on this, but everything seems to work. Any suggestions are welcome, just write a comment below!


Modifying base-images with snapshots on Unetlab/EVE-NG Alpha

Introduction

Unetlab/EVE-NG (Alpha) is a great tool you can use to learn networking on different platforms (Dynamips routers, IOL, QEMU images). I won’t get into the details of how to prepare the environment, you can find a lot of useful information on their site http://www.unetlab.com; instead, I’ll focus on how the labs’ device images are managed by EVE (let’s use EVE instead of Unetlab/EVE-NG for the rest of the post).

Browsing the temporary files for a lab with a QEMU node

Let’s create a new lab with a single QEMU node, which in my case is a TinyCore Linux image:

post5_fig1_add_tinycore

The base image I’ve chosen is linux-tiny-core-7.2, which you can find on EVE VM under /opt/unetlab/addons/qemu/linux-tiny-core-7.2:

root@eve-ng:~# ls -l /opt/unetlab/addons/qemu/linux-tiny-core-7.2/
total 74624
-rw-r--r-- 1 root root 76414976 Nov 11 10:21 hda.qcow2

hda.qcow2 is disk 1 of the Linux box.

What happens when you instantiate this image within your lab and run/modify its contents? As you can imagine, EVE does not modify the source image, otherwise it would be impossible to manage multiple instantiations of the same base image. Instead, it creates a QEMU snapshot within the lab’s temporary files. Let’s see it.

First, we must get some info about the lab: press Lab Details on the left menu bar and get the lab ID, which in my case is da2b48f4-d910-4e7d-9645-f952457cbf6d.

Go to the temporary lab folder on EVE (the “0” after /tmp is my pod number, zero as I’m working as admin):

cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/

In this folder you can find a subfolder for each device of the lab. In this example we have only device 1, so enter subfolder 1 (if multiple nodes are running, you can get the number of a node by right-clicking on it and looking at the number between the brackets after its name) and look at its contents:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d# cd 1

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1# ls -l
total 1284
-rw-r--r-- 1 root unl 1376256 Jan 6 11:43 hda.qcow2
-rw-rw-r-- 1 root unl 0 Jan 6 11:42 wrapper.txt

You can see hda.qcow2 again. Is it a copy? No, that would be a waste of space; it is a snapshot of the base image. Let’s see it:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
[...]

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 73M
cluster_size: 65536
[...]

So, every modification you make to the disk of the instantiated image is confined to the snapshot, and the base image is not modified, as you would expect.

Modifying the base image

Some time ago, I felt the need to modify my base TinyCore installation to add the tcpdump package to it, which is useful for troubleshooting. Instead of installing it on every instantiation of TinyCore within my labs, I wanted to modify the base image; but you must not modify a base image after you’ve used it in at least one lab, unless you want to corrupt all the labs that instantiated it. The quick and dirty solution could be creating a new directory /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2, copying the original hda.qcow2 file into it, and then modifying the new base image before using it in any lab. With TinyCore it is just a matter of a few MBytes, so a simple copy is perfectly fine. But what if the base image is 2-3 GBytes, such as Radware Alteon or Juniper vMX images? It would be a waste of space, and on my laptop I don’t want to waste it, so let’s try a different approach.
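
For completeness, the quick and dirty copy approach would be something like this (a sketch):

root@eve-ng:/# mkdir /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2

root@eve-ng:/# cp /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/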

Making a snapshot of the base image

We can use qemu-img to create a new hda.qcow2 which is a snapshot of the original base image. Go into the /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2 folder we’ve used in the previous section and delete the copy of the base image we’ve put there. Then create a snapshot of the original base image with qemu-img create and the -b flag, which specifies a backing file for the new image, thus creating a snapshot:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img create -f qcow2 -b /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2 hda.qcow2
Formatting 'hda.qcow2', fmt=qcow2 size=643825664 backing_file='/opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 196K
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
[...]

Preparing the new base image – method 1: running QEMU from CLI

Now you have a new base image and you want to prepare it before using it in future labs. Let’s run it from the command line. I won’t go into QEMU details: you can find useful docs on the web, or simply run ps aux on the EVE VM with some nodes running to see how it launches QEMU nodes. TinyCore can be managed via VNC, so let’s pass -vnc 0.0.0.0:100 in order to make QEMU listen for VNC connections on every interface on port 6000 (the 5900 VNC base port + 100) for the node we’re running:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-system-x86_64 -m 2048 -hda hda.qcow2 -serial telnet:0.0.0.0:44444,server,nowait -monitor tcp:127.0.0.1:42379,server,nowait -nographic -enable-kvm -vnc 0.0.0.0:100

In the command above you can also see how to enable serial connections through telnet, in case your image has console access enabled. If you need to reach the image via IP, for example to transfer some content onto it using SCP, you can manage it in this way: suppose the EVE VM has pnet0 with eth0 connected in bridge-mode to your own network; configure eth0 on TinyCore (or your own image) with an IP address compatible with your own network through the console or VNC connection, or let it get an IP via your LAN’s DHCP, and then shut down the image (if you’re using TinyCore, remember to make the changes persistent, otherwise you’ll lose them after a reboot). Now, prepare a virtual interface to put within the pnet0 bridge, to which we will connect TinyCore:
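
If tmp_iface does not exist yet on your EVE VM, create the tap device first (a sketch, assuming iproute2 is available; tunctl -t tmp_iface works too if installed):

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# ip tuntap add dev tmp_iface mode tap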

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# ifconfig tmp_iface up

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# brctl addif pnet0 tmp_iface

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# brctl show pnet0
bridge name   bridge id           STP enabled   interfaces
pnet0         8000.000c29baeb65   no            eth0
                                                tmp_iface

Start QEMU again, mapping TinyCore’s eth0 onto the tmp_iface we’ve just created:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-system-x86_64 -m 2048 -hda hda.qcow2 -serial telnet:0.0.0.0:44444,server,nowait -monitor tcp:127.0.0.1:42379,server,nowait -nographic -enable-kvm -vnc 0.0.0.0:100 -device virtio-net-pci,netdev=net0,mac=50:01:00:04:00:00 -netdev tap,id=net0,ifname=tmp_iface,script=no

Now you can reach your QEMU node via SSH/SCP and you can do whatever you want in order to prepare the new base image.

After some modifications, let’s see the status of the -v2 base image:

root@eve-ng:/opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2# /opt/qemu/bin/qemu-img info hda.qcow2 | grep "disk size"
disk size: 1.4M

Before my modifications to the image, the disk size was 196K; now it has grown to 1.4M. The original base image has not been modified.

Preparing the new base image – method 2: running a new node from GUI

If you don’t want to go through the CLI steps I’ve explained above, you can instantiate a new node based on the -v2 image and run it from the GUI:

post5_fig2_add_tinycore-v2.png

Create a network object mapped on pnet0 and connect it to eth0:

post5_fig3_add_network_pnet0
post5_fig4_connect_v2-to-pnet0

Start the node based on the new -v2 base image and modify it as you want. As we’ve already said, these modifications won’t go onto the original -v2 image we’ve prepared within /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2. Since I’ve added this new node to the same lab as the node based on the original base image, it will be node 2 of the same lab, so let’s move into its temporary folder on the EVE VM and have a look at the snapshot image:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# cd /

root@eve-ng:/# cd /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# ls -l
total 1288
-rw-r--r-- 1 root unl 1376256 Jan 6 13:52 hda.qcow2
-rw-rw-r-- 1 root unl 118 Jan 6 13:52 wrapper.txt

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]

What can we do now to put the changes of this snapshot back into the -v2 base image? Let’s use the QEMU tools to accomplish this task, but first shut down the node within the GUI (don’t delete the node yet!):

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img commit hda.qcow2
Image committed.

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 260K
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]

The commit command tells QEMU to commit the changes we’ve made on the instantiated node’s /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2 disk back to the base image /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2.

The new node we’ve added in EVE’s GUI is still valid, the operation above did not corrupt its snapshot; but now we can delete it, and if you add a new node based on the -v2 base image, you will find the modifications you’ve merged into the base image with the commit command.
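
You can also verify that the commit landed on the -v2 base image by checking its allocated size again (a sketch; it should have grown by the size of the committed changes compared to the 1.4M we saw after the CLI modifications):

root@eve-ng:/# /opt/qemu/bin/qemu-img info /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2 | grep "disk size"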

A Snapshot Chain: what happened behind the scenes

As you’ve probably already understood, when you instantiate a new node within the GUI based on the -v2 base image, you’re creating a snapshot of a snapshot. This is our snapshot tree:

/opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
 |
 +==> /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
 |     |
 |     \==> /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2
 |
 \==> /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1/hda.qcow2

Let’s see the chain of snapshots for node 2 disk:

root@eve-ng:/opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2# /opt/qemu/bin/qemu-img info --backing-chain hda.qcow2
image: hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.3M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
[...]

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 1.4M
cluster_size: 65536
backing file: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
[...]

image: /opt/unetlab/addons/qemu/linux-tiny-core-7.2/hda.qcow2
file format: qcow2
virtual size: 614M (643825664 bytes)
disk size: 73M
cluster_size: 65536
[...]

What you must remember is that modifying one of the images in the snapshot tree invalidates all the snapshots under the modified image. For example, committing the changes made on the /opt/unetlab/addons/qemu/linux-tiny-core-7.2-v2/hda.qcow2 base image forces a modification of the original base image, which invalidates the node 1 disk /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/1/hda.qcow2 but does not damage the node 2 disk /opt/unetlab/tmp/0/da2b48f4-d910-4e7d-9645-f952457cbf6d/2/hda.qcow2. If you have doubts before making such important changes, take a snapshot of your EVE VM with the tools offered by your virtualization environment (such as VMware Fusion or VirtualBox) to have a quick rollback in case of damage to your labs.

Conclusions

In this small tutorial you’ve seen how to use the QEMU tools to manage snapshots of base images in Unetlab/EVE-NG, in order to set up new base images based on older ones without wasting space on full clones. As EVE-NG is being developed (it is now in Alpha stage), a lot of this could be integrated within the GUI, but I hope it can be useful to someone experimenting with it now 🙂

In the next tutorial I’ll show you how to enable compression on base images, let’s move on with Enabling compression on base-images in Unetlab/EVE-NG Alpha!
