Detecting Failure

Part 1: Internet Redundancy, Or Not

Part 2: Redundant Connections to a Single Host?

In the last post I discussed how devices like your laptop and mobile phone are computing devices with multiple Internet connections not all that different from a network with multiple connections. The anecdote about Skype on a mobile phone reconnecting a call after you leave the range of Wi-Fi alludes to one key difference. That is, a device directly connected to a particular network connection can easily detect a total failure of said connection. In the example, this allowed Skype to quickly try to reconnect using the phone’s cellular connection.

Think back to our initial problem, how can a normal business get redundant Internet connections? The simplest, and at best half solution, is a router with two WAN connections and NATing out each port.

Now imagine you are using a laptop which is connected to a network with dual NATed WAN connections and you are in the middle of a Skype call. The connection associated with the Skype call will use one of the two WAN network ports and since NAT is used, the source address of the connection will be the IP address associated with the chosen WAN port. As we discussed before, this ‘binds’ the connection to the given WAN connection.

In our previous example of a phone switching to its cellular connection when the Wi-Fi connection drops, Skype was able to quickly decide to try to open another connection. This was possible because when the Wi-Fi connection dropped, Skype got a notification that its connection(s) were terminated.

In the case of a device, like our laptop, which is behind the gateway there is no such notification because no network interface attached to the local device failed. All Skype knows is that it has stopped receiving data – it has no idea why. This could be a transient error or perhaps the whole Internet died. This forces applications to rely on keep alive messages to determine when the network has failed. When a failure determination occurs, the application can try to open another connection. In the case of our dual NATed WAN connected network this new connection will succeed because the new connection will be NATed out the second WAN interface.

In the mean time, the user experienced an outage even though the network did still have an active connection to the Internet. The duration of this outage depends on how aggressive the application timeouts are. It can have short timeouts and risk flapping between connections or longer timeouts and provide a poorer experience. Of course this also assumes that the application includes this non-trivial functionality, most don’t.

Isn’t delivering packets the network’s job not the application’s?

2 Replies to “Detecting Failure”

Leave a Reply

Your email address will not be published. Required fields are marked *