Making the Linux flow classifier tunnel aware

Flow Classifier

The Linux kernel has many different tools for managing traffic. One of them is the flow classifier which allows the user to configure which fields of the packet headers should be used to create a hash which is then used to identify flows and manage them. For example, if the user selects src,dst,proto,proto-src,proto-dst they get a unique value for each flow (within the limits of the hash). Alternatively, using only src as the key will result in all flows being grouped by the source IP address.

The Problem

Below is a slightly simplified version of my home network.

Figure 1: Simplified home network

All of the traffic, both IPv4 and IPv6, is tunnelled through a Linux router which lives at my service provider. The reason for this complicated setup is that it gives me control of the traffic in both the upstream and downstream. By shaping the traffic to just below the maximum rate in each direction I am able to avoid Bufferbloat problems and prioritize latency sensitive traffic such as SSH, DNS and Vonage. Especially under load, my QoS scripts make marked difference in how fast the Internet feels.

The multiple tunnels present a problem for implementing my QoS scheme because from the perspective of the underlying interface there are only two flows on the network. One for the IP-IP tunnel and one for the IP-IPv6 tunnel. A work around I used for a while was to apply the QoS rules to the IP-IP tunnel interface because that’s where the bulk of the traffic flows. However, this meant that IPv6 traffic was not properly controlled and any time I had a significant amount of IPv6 traffic I lost all the advantages of my QoS scheme.

To solve this properly I needed a way to look into the tunnels in order to identify the inner network flows. So I’ve extended the flow classifier with the keys in the following table. IP-IP, IP-IPv6, IPv6-IP and IPv6-IPv6 tunnels are supported.

Key Description
tunnel-src Extract the source IP from the inner header
tunnel-dst Extract the destination IP from the inner header
tunnel-proto Extract the protocol from the inner header
tunnel-proto-src Extract the transport protocol source port from the inner header
tunnel-proto-dst Extract the transport protocol destination port from the inner header

 Results

In order to validate that this works I started a couple SCP uploads to max the upstream bandwidth and then ran ping-exp to measure the latency. At the start of the test the flow classifier keys were src,dst,proto,proto-src,proto-dst. Approximately half way through I changed the keys to src,dst,proto,proto-src,proto-dst,tunnel-src,tunnel-dst,tunnel-proto,tunnel-proto-src,tunnel-proto-dst. The advantage of keeping the non-tunnel keys is that any traffic created by the router itself is still classified properly. Here is the tc script I used. You can see the results of this test in the figure 2 below.

Figure 2: Before and after tunnel keys

For the first half of the test you can see the high latency. This is due to all the traffic from the SCP upload and ICMP pings being placed into the same queue because from the perspective of the flow classifier there is only one flow. In the second half of the test the addition of the tunnel keys allows the flow classifier to place the ICMP packets into a different queue which is not affected by the SCP upload and therefore has much lower latency. The large amount of packet loss during the key change is because the script I used creates a large number of queues. While these queues are being created packets are dropped.

While my network setup may be a bit unique I think it’s likely that many home networks will have some form of tunnelling in the near future as tunnels are part of several IPv6 migration strategies. So hopefully this little addition will be useful in many different contexts.

Below are links to the two patches that are required. I’ll post them to Netdev for review shortly.

clsflow-tunnel-20111016.patch

iproute-clsflow-tunnel-20111016.patch

Linux flow classifier proto-dst and TOS

Recently I’ve been playing around with the Linux flow classifier on my gateway. The flow classifier provides the ability to group network flows by configuring which parts of the packet headers (referred to as keys) are used in a hash calculation which chooses the output queue.

All of my Internet traffic travels over an IPIP tunnel to another Linux box. I do this so I have control of the QoS in both the upstream and the downstream. A result of this configuration is that from the perspective of the output interface there is only a single network flow.

I configured the flow classifier to use the src,dst,proto,proto-src,proto-dst keys which aims to provide 5-tuple flow fairness. Here’s the simple tc script I used. Due to the IPIP tunnel I expected to see that all traffic would be placed into the same queue. Strangely, the below is what my little ping-exp utility showed when running at the same time as an SCP upload.

Figure 1: Unexpected flow classifier behaviour

Coincidentally I ran ping-exp configured to send three different streams of ICMP traffic with different IP TOS values. Note that SCP automatically sets the IP TOS to the equivalent of the “Low” stream in the test.

Notice that the pings using the high and default TOS values appear to be unaffected by low priority ping and SCP traffic. This was unexpected because none of src,dst,proto,proto-src or proto-dst keys should be affected by the TOS value.

After a bit of experimentation I determined that the proto-dst key was the source of the problem. If you spend a bit of time with the flow_get_proto_dst() function in cls_flow.c you’ll see that if the protocol is ICMP or IPIP, as it is in my test, then the following value is returned:

return addr_fold(skb_dst(skb)) ^ (__force u16)skb->protocol;

skb_dst() returns a pointer to a dst_entry structure. Since Linux maintains separate dst_entry structures for each destination,TOS pair the source of the unexpected behaviour is obvious.

I’m not knowledgeable enough about the Linux network stack to be certain but I don’t see any value in returning a value for proto-dst which is random with respect to the actual traffic on the wire. At the very least this is not intuitive behaviour.

If you look at flow_get_proto_src() you’ll see something similar:

return addr_fold(skb->sk);

In this case a pointer to the local socket structure is used as a fallback. Again, this has no relation to the actual packets on the wire and if the packet does not originate at the local machine then no socket exists which causes this value to be zero anyway.

It seems to me that the most intuitive behaviour would be to have the proto-src and proto-dst keys return zero when they are applied to traffic that doesn’t have the notion of transport layer ports.

I’ll post to Netdev about this and see what the kernel devs have to say.

Related to this, I have a patch to the flow classifier that adds tunnel awareness which I plan post to Netdev this weekend as well.

Network latency experiments

Recently a series of blog posts by Jim Gettys has started a lot of interesting discussions and research around the Bufferbloat problem. Bufferbloat is the term Gettys’ coined to describe huge packet buffers in network equipment which have been added through ignorance or a misguided attempt to avoid packet loss. These oversized buffers have the affect of greatly increasing latency when the network is under load.

If you’ve ever tried to use an application which requires low latency, such as VoIP or a SSH terminal at the same time as a large data transfer and experienced high latency then you have likely experienced Bufferbloat. What I find really interesting about this problem is that it is so ubiquitous that most people think this is how it is supposed to work.

I’m not going to repeat all of the details of the Bufferbloat problem here (see bufferbloat.net) but note that Bufferbloat occurs at may different places in the network. It is present within network interface device drivers, software interfaces, modems and routers.

For many the first instinct of how to respond to Bufferbloat is add traffic classification, which is often referred to simply as QoS. While this can also be a useful tool on top of the real solution it does not solve the problem. The only way to solve Bufferbloat is a combination of properly sizing the buffers and Active Queue Management (AQM).

As it turns out I’ve been mitigating the effects of Bufferbloat (to great benefit) on my home Internet connection for some time. This has been accomplished through traffic shaping, traffic classification and using sane queue lengths with Linux’s queuing disciplines. I confess to not understanding, until the recent activity, that interface queues and driver internal queues are also a big part of the latency problem. I’ve since updated my network configuration to take this into account.

In the remainder of this post I will show the effects that a few different queuing configurations have on network latency. The results will be presented using a little utility I developed called Ping-exp. The name is a bit lame but Ping-exp has made it a lot easier for me to compare the results of different network traffic configurations.

Continue reading “Network latency experiments”

LQL# HTB control

Now that LQL-Sharp has been released I thought I should put together a quick little demonstration of just how cool it is.

I have created a extremely simple GUI control that can modify the rate and ceiling parameters of a HTB class. This control should really subclass Gtk.Widget but it serves its purpose as is.

TC HTB Control

using System;
using Gtk;
using LQL;
class HTBControl {
	private LQL.ClassHTB klass;
	private LQL.Con con;
	private Gtk.SpinButton rateSpin;
	private Gtk.SpinButton ceilSpin;
	public HTBControl(LQL.ClassHTB klass, LQL.Con con)
	{
		this.klass = klass;
		this.con = con;
		Gtk.Window myWin = new Gtk.Window("TC GTK+");
		myWin.DeleteEvent += new DeleteEventHandler(WindowDelete);
		Gtk.VBox vbox = new Gtk.VBox(false, 3);
		Gtk.HBox hbox1 = new Gtk.HBox(false, 2);
		hbox1.Add(new Gtk.Label("Rate (bytes/sec): "));
		this.rateSpin = new Gtk.SpinButton(0, 10000000, 1);
		hbox1.Add(this.rateSpin);
		vbox.Add(hbox1);
		Gtk.HBox hbox2 = new Gtk.HBox(false, 2);
		hbox2.Add(new Gtk.Label("Ceiling (bytes/sec): "));		
		this.ceilSpin = new Gtk.SpinButton(0, 10000000, 1);
		hbox2.Add(this.ceilSpin);
		vbox.Add(hbox2);
		Gtk.Button modifyButton = new Gtk.Button("Modify");
		modifyButton.Clicked += new EventHandler(Modify);
		vbox.Add(modifyButton);
		rateSpin.Value = this.klass.Rate;
		ceilSpin.Value = this.klass.Ceiling;
		myWin.Add(vbox);
		myWin.ShowAll();
	}
	static void WindowDelete(object o, DeleteEventArgs args)
	{
		Gtk.Application.Quit();
		args.RetVal = true;
	}
	void Modify(object o, EventArgs args)
	{
		this.klass.Rate = (uint) this.rateSpin.Value;
		this.klass.Ceiling = (uint) this.ceilSpin.Value;
		this.klass.Modify(this.con);
	}
}
using System;
using Gtk;
using LQL;
class MainClass {
	public static void Main(string[] args)
	{
		Application.Init();
		LQL.Con con = new LQL.Con();
		LQL.Interface nIf = con.FindInterfaceByName("eth0");
		GLib.List classes = con.ListClasses(nIf);
		foreach (LQL.Class klass in classes) {
			if (klass is LQL.ClassHTB) {
				new HTBControl((LQL.ClassHTB) klass, con);
			}
		}
		Application.Run();
	}
}

LQL#

Work has begun on the long promised Mono (C#) bindings for LQL. This little C# program will display traffic statistics for all of the queueing disciplines that are supported by the C LQL library.

using System;
using LQL;
class MainClass {
    public static void Main(string[] args) {
        Gtk.Application.Init();
        LQL.Con con = new LQL.Con();
        GLib.List ifList = con.ListInterfaces();
        foreach (LQL.Interface netInf in ifList) {
            GLib.List qdiscList = con.ListQdiscs(netInf);
            foreach (LQL.QDisc qdisc in qdiscList) {
                qdisc.UpdateStats(con);
                qdisc.PrintStats();
            }
        }
    }
}

Very exciting stuff!

LQL# is not nearly polished enough for a public release yet but I am quite happy with how the work is progressing.

LQL 0.7.0 released

Another new version of LQL is available.

0.7.0 changes:

  • Add some new test programs to the tests directory.
  • Fixed a bunch of small bugs that were found with the new test programs.
  • Change the return type of a few _new() functions from GObject to the proper object type.
  • Update docs to match new API.
  • Add some background comments to the documentation.
  • Add support for the DSMark QDisc and corresponding documentation.
  • Add –with-kernel-source=PATH option to configure so alternate kernel include directories can be specified.
  • Add support for the Netem QDisc (everything but distribution tables). This means you now need the headers from a kernel with Netem support in order to compile LQL.
  • Add support for the TCIndex classifier.
  • Put some more time into the classifiers. Still not complete.