Creating and Troubleshooting Site to Site VPN

Branch office VPN connections are commonplace in most businesses. Even organisations with a single site will often have a connection for backups or to access online infrastructure. These tunnels use two endpoints, which can be appliance or server based and are often of varying manufacturer and age, but this shouldn’t matter. Only in very rare cases have I actually seen a pair of devices that just weren’t compatible, and I suspect that declaration was only made because a solution couldn’t be found and swapping one of the devices resolved the issue.

Initial Thoughts

To begin we need to make sure we can manage both devices, so my first piece of advice when creating new VPN tunnels is to make sure you can connect to both endpoints from wherever you are going to be working.

Usually the firewalls or concentrators handling the VPN connections will not be open to management from a remote source, so make sure you can get on both boxes through a mobile VPN connection, an RDP session or even a firewall rule you create for yourself (making sure it’s only you that can connect).

Next we make sure the endpoints can see each other. It sounds obvious, but the number of support calls that have been resolved with a single ping is much higher than it should be.

Let the ping run for a while (use ping -t on Windows for this), as the line could have periodic drop outs that won’t be shown in a standard 4 packet ping.

If you are seeing packet loss, then you will need to resolve the line issue before you can be sure of the tunnel settings. If there are drops on the line the tunnel won’t be able to complete the negotiation process and will remain offline.
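If you save the output of a long-running ping to a file, a few lines of Python can total up the drops for you. This is only a rough sketch: it assumes Windows-style ping output ("Reply from ... TTL=" and "Request timed out."), the function name summarise_ping is made up for this example, and the addresses are documentation-range placeholders.

```python
import re

def summarise_ping(lines):
    """Count replies and drops in ping output and find the longest run of drops."""
    replies = drops = run = longest_run = 0
    for line in lines:
        text = line.lower()
        if re.search(r"\bttl=\d+", text):
            # A line carrying a TTL value is a successful echo reply.
            replies += 1
            run = 0
        elif "timed out" in text or "unreachable" in text:
            # Anything reported as timed out or unreachable counts as a drop.
            drops += 1
            run += 1
            longest_run = max(longest_run, run)
    total = replies + drops
    loss_pct = 100.0 * drops / total if total else 0.0
    return {"sent": total, "lost": drops, "loss_pct": loss_pct,
            "longest_outage": longest_run}

# Placeholder sample, not real capture data.
sample = [
    "Reply from 203.0.113.10: bytes=32 time=14ms TTL=56",
    "Request timed out.",
    "Request timed out.",
    "Reply from 203.0.113.10: bytes=32 time=15ms TTL=56",
    "Request timed out.",
]
print(summarise_ping(sample))
```

The longest_outage figure is the useful one for periodic drop outs: a handful of lost packets spread across an hour looks very different from the same number lost in a row.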

Check Your Tunnel Settings

OK, so we have a tunnel set up at both ends, we have confirmed the endpoints can see each other and there is no packet loss on the line, but our tunnel still is not showing as connected.

The next check is to make sure the settings for the tunnel match at both ends.

Now this is a blog post, not a training manual, so I’m not going to list every available tunnel setting on a modern firewall. Most are self-explanatory, and honestly whether you choose to use AES128 or AES256 encryption isn’t going to make much difference to your average day. Some of the more mysterious options such as Diffie-Hellman, PFS or NAT traversal aren’t as obvious in what they do (sounds like a great idea for another blog post!) but you can get by following one general principle:

ALL SETTINGS MUST MATCH UNLESS IT’S AN ADDRESS

With addresses the general rule is that they are always mirrored: what is considered the local endpoint on device A is the remote endpoint on device B. The same can be said for tunnel route entries and any NATing the device is doing. Getting confused and entering your tunnel routes the wrong way round on one endpoint is a common mistake and will stop the tunnel from coming up.
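The match-or-mirror rule is simple enough to write down as a check. The sketch below is purely illustrative: TunnelConfig, compare_endpoints and the setting names are invented for this example and do not correspond to any vendor's configuration format, so you would have to fill them in by hand from each device.

```python
from dataclasses import dataclass, field

@dataclass
class TunnelConfig:
    local_gateway: str
    remote_gateway: str
    local_networks: frozenset
    remote_networks: frozenset
    # Everything that is not an address (encryption, DH group, PFS and so on).
    settings: dict = field(default_factory=dict)

def compare_endpoints(a, b):
    """Return a list of problems: settings must match, addresses must be mirrored."""
    problems = []
    for key in sorted(set(a.settings) | set(b.settings)):
        if a.settings.get(key) != b.settings.get(key):
            problems.append(
                f"{key}: A={a.settings.get(key)!r} B={b.settings.get(key)!r}")
    # Local on A must be remote on B, and the other way round.
    if a.local_gateway != b.remote_gateway or a.remote_gateway != b.local_gateway:
        problems.append("gateway addresses are not mirrored")
    if a.local_networks != b.remote_networks or a.remote_networks != b.local_networks:
        problems.append("tunnel routes are not mirrored")
    return problems

# Placeholder values: B has a different DH group and its routes the wrong way round.
device_a = TunnelConfig("198.51.100.1", "203.0.113.1",
                        frozenset({"10.0.1.0/24"}), frozenset({"10.0.2.0/24"}),
                        {"encryption": "AES256", "dh_group": 14, "pfs": True})
device_b = TunnelConfig("203.0.113.1", "198.51.100.1",
                        frozenset({"10.0.1.0/24"}), frozenset({"10.0.2.0/24"}),
                        {"encryption": "AES256", "dh_group": 2, "pfs": True})
for problem in compare_endpoints(device_a, device_b):
    print(problem)
```

With these placeholder values the check reports the Diffie-Hellman group difference and the unmirrored tunnel routes, while the gateway addresses pass because they are correctly swapped between the two ends.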

Different Devices = Different Settings

Yes, there is no universal standard for tunnel options, and using endpoints from two different vendors can often mean one device has a setting that is not available on the other endpoint. When this is the case I would always try to turn the setting off if that is an option. When that is not possible, all we can do is leave it on the default and see if we can address it later from whatever error it produces. I have found that errant settings can actually cause errors that are resolved by changing a different setting somewhere else in the configuration.

OK, so we think our configs match, we have our addresses correct and we know the endpoints can see each other, so why isn’t our tunnel coming up?

Using the logs and Debugging

There are some best practices for working with logs when it comes to troubleshooting VPN connections. First pick which endpoint to work on, usually the one you know best or the one with the best log output.

Set the logging levels for IKE and VPN traffic as high as they can go to ensure you see logs for all stages of the VPN negotiation.

If you have multiple tunnels, then try to filter on the remote endpoint’s IP address to ensure you only see the negotiations for the tunnel in question.
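If your device can only export the whole log, the same filtering can be done afterwards. Here is a minimal sketch, assuming one log entry per line; the log lines shown are made up for illustration and do not reflect any real vendor's format, and lines_for_peer is a hypothetical helper name.

```python
import re

def lines_for_peer(log_lines, peer_ip):
    """Keep only the lines that mention peer_ip as a complete address."""
    # The digit guards stop 198.51.100.2 from also matching 198.51.100.20.
    pattern = re.compile(r"(?<!\d)" + re.escape(peer_ip) + r"(?!\d)")
    return [line for line in log_lines if pattern.search(line)]

# Invented log lines, for illustration only.
log = [
    "ike: peer 198.51.100.2:500 phase 1 negotiation started",
    "ike: peer 198.51.100.20:500 phase 1 negotiation started",
    "ike: peer 203.0.113.7:500 phase 2 negotiation started",
]
for line in lines_for_peer(log, "198.51.100.2"):
    print(line)
```

A plain substring search would also return the second line, which is exactly the sort of noise you are trying to get rid of when several tunnels share similar addresses.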

Once you are ready to capture the logs you will need to kick start the tunnel initiation. Usually a ping from a device on the network behind one endpoint to an IP behind the other endpoint will suffice (the main thing here is that the packet’s source and destination must match a tunnel route on its local endpoint).

THIS NEEDS TO BE DONE FROM THE REMOTE ENDPOINT, NOT THE DEVICE YOU ARE USING TO CAPTURE THE LOGS.

The reason for this is simple…

If I am sat on device A watching logs and I start a ping between the networks in the direction of A > B, then device A will try to begin the negotiation. If the settings are wrong or there is an error, the remote endpoint will not offer a helpful reason to external sources. After all, we could be malicious hackers trying to make a connection, so as you would expect the only response we will ever see is “endpoint not responding”.

However, when device B tries to start the negotiation and sends its settings over, our local endpoint can report on what the mismatch is in the connection attempt.
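Before blaming the tunnel, it is worth confirming that the ping you are using to kick things off really does match a tunnel route on the endpoint it starts from. The sketch below uses Python's standard ipaddress module; the function name matches_tunnel_route and the subnets are assumptions made up for this example.

```python
import ipaddress

def matches_tunnel_route(src, dst, routes):
    """True if a packet from src to dst falls inside any (local, remote) route pair."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    return any(
        src_ip in ipaddress.ip_network(local) and dst_ip in ipaddress.ip_network(remote)
        for local, remote in routes
    )

# Tunnel routes as configured on device B: (local network, remote network).
routes_on_b = [("10.0.2.0/24", "10.0.1.0/24")]

# A host behind B pinging a host behind A matches, so it should trigger the tunnel.
print(matches_tunnel_route("10.0.2.50", "10.0.1.10", routes_on_b))
# A host on a subnet that is not in the route does not, so nothing will happen.
print(matches_tunnel_route("10.0.3.50", "10.0.1.10", routes_on_b))
```

If the second case is what you are actually doing, the endpoint never sees interesting traffic and there will be nothing in the logs to read.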

COMMON ERRORS

I will list some of the common errors I see and the usual fixes. I’m a WatchGuard engineer, so some of these are a bit WatchGuard-centric in the syntax, but the errors themselves are universal. I intend to extend this blog over time with more errors that relate to VPN connections and will also try to add more universal or multi-vendor content.

Endpoint not responding:

OK, so either you totally skimmed over the bit where I said to initiate from the remote end or you have a weirder issue (I’m going to assume you are pinging correctly). If we know both ends can see each other with a standard ping and the remote end is trying to start the negotiation, then it could be that one of the devices is behind a NAT router.

If one of the endpoints has a private address as its external address, then it is behind a NAT router. This can sometimes be the case with public addresses as well. It’s usually seen on cheaper internet connections or connections intended for home use (BT hubs for example). Normally you get traffic to a device behind a router like this using port forwarding or static NAT entries, but this won’t be enough for an IPsec VPN tunnel. Even forwarding all ports will often not get the IKE packets to the device in question.

The usual fix is to look on the router for something called DMZ mode, DMZ Server or DMZ passthrough. If you can’t find that, look for IPsec related settings and something called IPsec Passthrough or similar. These options may expect you to enter the IP of the target device, in which case you would enter the external IP address of the VPN endpoint.

If you cannot find any such settings but you think the router is NATing, then you will need to change the router’s mode or swap the router. A call to your ISP should give you your options.
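A quick way to test the "private address on the external interface" condition described above is to check the address against the well-known reserved ranges. This is a sketch only: it checks the RFC 1918 private ranges plus the 100.64.0.0/10 shared range that ISPs use for carrier-grade NAT, and the function name nat_hint is invented for this example.

```python
import ipaddress

# RFC 1918 private ranges.
PRIVATE_RANGES = [ipaddress.ip_network(n)
                  for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
# RFC 6598 shared address space, used for carrier-grade NAT.
SHARED_RANGE = ipaddress.ip_network("100.64.0.0/10")

def nat_hint(external_ip):
    """Classify the address configured on an endpoint's external interface."""
    ip = ipaddress.ip_address(external_ip)
    if any(ip in net for net in PRIVATE_RANGES):
        return "private"
    if ip in SHARED_RANGE:
        return "shared (carrier-grade NAT)"
    return "public"

for address in ("192.168.1.2", "100.64.3.4", "203.0.113.5"):
    print(address, nat_hint(address))
```

A result of "private" or "shared" means the endpoint is definitely behind NAT. A result of "public" does not rule it out, since a router can still be NATing in front of a public address.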

Received the unsupported KE group (24) or wrong KE data size:

This is usually down to a mismatch in the Diffie-Hellman settings in either the phase 1 or phase 2 proposal. Open up both your configs and make sure they all match. If they do, try a different Diffie-Hellman group.

ID Mismatch:

The clue is in the name 😊

Check your IDs. I often see this happening when one of the endpoints is on a NATed connection and uses an IP address as the ID. You need to make sure it uses what the other end expects to see, so if you enter the public address at the other end make sure you do the same here. Often the field will pre-populate, but it just uses the IP address of the external interface, which in this case is a private range address and not the public IP address the remote endpoint is seeing.

No Proposal Chosen:

This will usually be a mismatch in the general settings of the phase 2 proposal, although the same message can appear for a phase 1 mismatch too. It is something of a blanket error covering a mismatch in things such as the encryption type, authentication type or key expiration timings. Make sure everything matches, and if it does, maybe try a different encryption type or authentication type.

NAT Failure:

Usually one end has NAT settings that the other is missing, but this can also be seen when one end is using NAT traversal (NAT-T) and the other end does not have it enabled. NAT traversal allows encapsulated packets to be passed through a NAT process without corruption to the original packet. The process itself, though, can cause problems when both endpoints are not working to the same system.

If you have confirmed your NAT settings are correct for both ends (remember the remote endpoint sees the NAT addresses not the real addresses) then try turning NAT-T on for both endpoints and see if the error goes away.

I will continue to add to this over time as I think of more things.

As always if you have any questions or are interested in more detailed consultancy please contact us.


Author : Ric Gresty