Per packet overhead on VDSL2 – part 3

Previous instalments:

For tonight’s edition I have increased the number of small packet sizes in the experiment and dropped the larger sizes. For each of the following data sizes (iperf -l) there are five seconds of traffic: 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290 and 300 bytes.

Packets per second observed at the destination

For data sizes of up to 90 bytes the packet per second value is pretty much constant. Alex Burr offered a theory on the Bufferbloat list that I may be hitting a packet rate limitation. If I understand properly, the above chart seems to support this.

Bitrate observed at the destination

From the bitrate perspective, the curve flattens around 90 bytes of data as well.

Per packet overhead on VDSL2 – part 2

A few days ago I wrote about some interesting latency results I observed on my home Internet connection with small packets. This post adds a bit more data.

In this experiment I disabled all upstream traffic shaping and then used iperf to blast UDP packets of various sizes to a destination host I control. The transmitted rate was 10Mbps and the upstream link rate is ~6.5Mbps. On the destination I captured the packets with tcpdump and generated the charts below with Wireshark.

The charts show ten sub-experiments – 10 seconds of traffic for each data size (iperf -l): 25, 50, 75, 100, 200, 300, 400, 500, 1000, 1400 bytes. T0 is when the first packet is received.

Packets per second observed at the destination

The first chart shows the packets per second received at the destination. Not surprisingly, the packet rate is much higher with small packets.

Bitrate observed at the destination

The second chart shows the bitrate observed at the destination. Notice that for small packets the effective bitrate is much lower. This seems to support the theory that this link has a lot of per-packet overhead.

Per packet overhead on VDSL2

My home router (Linux box) is configured to shape upstream traffic to just below the link rate to avoid Bufferbloat – this greatly improves interactive performance under load. Recently I’ve experimented with various packet sizes. The charts below show the effect of small packets.

Effect of per-packet overhead on VDSL2?
  1. Between 0-6 seconds the link is idle.
  2. From 6-14 seconds the upstream link is flooded with 1,400 byte packets (10Mb/sec of traffic trying to get through a 6.2Mb link)
  3. At 17 seconds the upstream link is flooded with 64 byte packets (10Mb/sec as well)

Notice how much higher the latency and jitter are with small packets.

Confusingly, these results were gathered with the bandwidth shaper configured for 53 bytes of overhead [1] which is my current understanding of the per-packet overhead on VDSL2 [53 is a coincidence with the ATM cell size].

Per-packet overhead for VDSL2 (without ATM) and PPPoE:

  • 5 Bytes for PTM
  • 40 bytes for 802.3
  • 8 bytes for PPPoE

Either the above overhead numbers or wrong or there is something else going on.

[1] – overhead argument to tc.

 

Measuring one way packet loss

I’ve recently observed packet loss on my home Internet connection and suspected that this was occurring on the downstream path. Unfortunately, ping, the tool normally used to measure packet loss doesn’t tell you anything about where the packet was lost – just that either the request to the destination or the reply from the destination was lost.

After a quick bit of Googling I couldn’t find a tool which measured one way packet loss so I fell back to combining a couple of tools. Hopefully the technique outlined here will be useful to others.

Note that the steps below require that you have control of both of the hosts you wish to measure packet loss between. It’s very easy and cheap to get a virtual machine on Amazon EC2 or other services so this shouldn’t be a huge barrier.

To measure packet loss on the path from host A to host B run the following commands.

Host B

tcpdump -nni ppp0 udp port 8787 > tmp.file

Host A

hping3 -i u100000 --destport 8787 -2 HOSTB

Let some time elapse and then press CTRL-C to stop hping3.

Host B

Press CTRL-C to stop tcpdump

cat tmp.file | egrep 8787 | wc -l

Now compare the number of UDP packets transmitted by hping3 to the number of packets captured by tcpdump (output of wc -l). If there is any difference there is packet loss from A to B.

IPv6 deployment day is here

A few minutes ago major Internet companies enabled IPv6 permanently. Happy IPv6 day!

World IPv6 Day

World IPv6 Launch

The view from my server:

[dan@alpha ~]$ ping6 -n -c 1 www.google.com; ping6 -n -c 1 www.yahoo.com; ping6 -n -c 1 www.facebook.com
PING www.google.com(2001:4860:4008:802::1012) 56 data bytes
64 bytes from 2001:4860:4008:802::1012: icmp_seq=1 ttl=59 time=5.04 ms

--- www.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.048/5.048/5.048/0.000 ms
PING www.yahoo.com(2001:4998:f00d:1fe::3001) 56 data bytes
64 bytes from 2001:4998:f00d:1fe::3001: icmp_seq=1 ttl=60 time=31.6 ms

--- www.yahoo.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 31.638/31.638/31.638/0.000 ms
PING www.facebook.com(2a03:2880:10:1f02:face:b00c:0:25) 56 data bytes
64 bytes from 2a03:2880:10:1f02:face:b00c:0:25: icmp_seq=1 ttl=54 time=87.7 ms

--- www.facebook.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 87.754/87.754/87.754/0.000 ms

System virtualization moves the edge of the network

One of the biggest innovations of the Internet was moving the intelligence from the network to the edge devices. Making the end host responsible for data delivery and creating a network architecture that is application agnostic were radical and incredibly successful ideas. Although much of the architecture made this switch the demarcation between the network owner and its users forced some features such as access control and provisioning to remain in the access routers and switches. Given the relationship of consumers to their service providers this will probably never change in the consumer Internet market but something very interesting is happening within data centres due to virtualization.

Moving to the Edge: An ACM CTO Roundtable on Network Virtualization

One of the most interesting ideas in the discussion linked to above is that the advent of system virtualization necessarily moves the point of enforcement, or intelligence, from the network access layer into the host itself. This comes in the form of the networking features of hypervisors. Hypervisors implement switching and routing but what’s really interesting is that they are also the best location for functions such as firewalls because implementing these functions as separate devices greatly limits the flexibility of the virtualized data centre. Imagine migrating a VM anywhere the data center and having its firewall rules follow automatically to the new host vs having to choose amongst N hosts which are behind the same firewall.

Two groups may be affected greatly by this change: network equipment vendors and IT networking professionals.

I do not believe that owners of existing network infrastructure need to worry about the hardware they already have in place. Chances are your existing network infrastructure provides adequate bandwidth. Longer term, networking functions are being pulled into software, and you can probably keep your infrastructure. The reason you buy hardware the next time will be because you need more bandwidth or less latency. It will not be because you need some virtualization function. (Martin Casado)

The above argues that existing network switches and routers are already good enough for this new architecture. That is, networking equipment will become further commoditized which may not be good from the perspective of Cisco and other equipment vendors.

What about switch and router experts?

The people who will be left out in the cold are the folks in IT who have built their careers tuning switches. As the edge moves into the server where enforcement is significantly improved, there will be new interfaces that we’ve not yet seen. It will not be a world of discover, learn, and snoop; it will be a world of know and cause. (Lin Nease)

and

There’s a contention over who’s providing the network edge inside the server. It’s clearly going inside the server and is forever gone from a dedicated network device. A server-based architecture will eventually emerge providing network-management edge control that will have an API for edge functionality, as well as an enforcement point. The only question in my mind is what will shake out with NICs, I/O virtualization, virtual bridges, etc. Soft switches are here to stay, and I believe the whole NIC thing is going to be an option in which only a few will partake. The services provided by software are what is of value here, and Moore’s law has cheapened CPU cycles enough to make it worthwhile to burn switching cycles inside the server.

If I’m a network guy in IT, I better much more intensely learn the concept of port groups, how VMware, Xen, etc. work, and then figure out how to get control of the password and get on the edge. Those folks now have options that they have never had before.

The guys managing the servers are not qualified to lead on this because they don’t understand the concept of a single shared network. They think in terms of bandwidth and VPLS (virtual private LAN service) instead of thinking about the network as one system that everybody shares and is way oversubscribed. (Lin Nease)

Of course networking experts will still be required but this new world may involve spending a lot more time managing servers than at the router/switch CLI.

The simple network continues to win.

Making the Linux flow classifier tunnel aware

Flow Classifier

The Linux kernel has many different tools for managing traffic. One of them is the flow classifier which allows the user to configure which fields of the packet headers should be used to create a hash which is then used to identify flows and manage them. For example, if the user selects src,dst,proto,proto-src,proto-dst they get a unique value for each flow (within the limits of the hash). Alternatively, using only src as the key will result in all flows being grouped by the source IP address.

The Problem

Below is a slightly simplified version of my home network.

Figure 1: Simplified home network

All of the traffic, both IPv4 and IPv6, is tunnelled through a Linux router which lives at my service provider. The reason for this complicated setup is that it gives me control of the traffic in both the upstream and downstream. By shaping the traffic to just below the maximum rate in each direction I am able to avoid Bufferbloat problems and prioritize latency sensitive traffic such as SSH, DNS and Vonage. Especially under load, my QoS scripts make marked difference in how fast the Internet feels.

The multiple tunnels present a problem for implementing my QoS scheme because from the perspective of the underlying interface there are only two flows on the network. One for the IP-IP tunnel and one for the IP-IPv6 tunnel. A work around I used for a while was to apply the QoS rules to the IP-IP tunnel interface because that’s where the bulk of the traffic flows. However, this meant that IPv6 traffic was not properly controlled and any time I had a significant amount of IPv6 traffic I lost all the advantages of my QoS scheme.

To solve this properly I needed a way to look into the tunnels in order to identify the inner network flows. So I’ve extended the flow classifier with the keys in the following table. IP-IP, IP-IPv6, IPv6-IP and IPv6-IPv6 tunnels are supported.

Key Description
tunnel-src Extract the source IP from the inner header
tunnel-dst Extract the destination IP from the inner header
tunnel-proto Extract the protocol from the inner header
tunnel-proto-src Extract the transport protocol source port from the inner header
tunnel-proto-dst Extract the transport protocol destination port from the inner header

 Results

In order to validate that this works I started a couple SCP uploads to max the upstream bandwidth and then ran ping-exp to measure the latency. At the start of the test the flow classifier keys were src,dst,proto,proto-src,proto-dst. Approximately half way through I changed the keys to src,dst,proto,proto-src,proto-dst,tunnel-src,tunnel-dst,tunnel-proto,tunnel-proto-src,tunnel-proto-dst. The advantage of keeping the non-tunnel keys is that any traffic created by the router itself is still classified properly. Here is the tc script I used. You can see the results of this test in the figure 2 below.

Figure 2: Before and after tunnel keys

For the first half of the test you can see the high latency. This is due to all the traffic from the SCP upload and ICMP pings being placed into the same queue because from the perspective of the flow classifier there is only one flow. In the second half of the test the addition of the tunnel keys allows the flow classifier to place the ICMP packets into a different queue which is not affected by the SCP upload and therefore has much lower latency. The large amount of packet loss during the key change is because the script I used creates a large number of queues. While these queues are being created packets are dropped.

While my network setup may be a bit unique I think it’s likely that many home networks will have some form of tunnelling in the near future as tunnels are part of several IPv6 migration strategies. So hopefully this little addition will be useful in many different contexts.

Below are links to the two patches that are required. I’ll post them to Netdev for review shortly.

clsflow-tunnel-20111016.patch

iproute-clsflow-tunnel-20111016.patch

Linux flow classifier proto-dst and TOS

Recently I’ve been playing around with the Linux flow classifier on my gateway. The flow classifier provides the ability to group network flows by configuring which parts of the packet headers (referred to as keys) are used in a hash calculation which chooses the output queue.

All of my Internet traffic travels over an IPIP tunnel to another Linux box. I do this so I have control of the QoS in both the upstream and the downstream. A result of this configuration is that from the perspective of the output interface there is only a single network flow.

I configured the flow classifier to use the src,dst,proto,proto-src,proto-dst keys which aims to provide 5-tuple flow fairness. Here’s the simple tc script I used. Due to the IPIP tunnel I expected to see that all traffic would be placed into the same queue. Strangely, the below is what my little ping-exp utility showed when running at the same time as an SCP upload.

Figure 1: Unexpected flow classifier behaviour

Coincidentally I ran ping-exp configured to send three different streams of ICMP traffic with different IP TOS values. Note that SCP automatically sets the IP TOS to the equivalent of the “Low” stream in the test.

Notice that the pings using the high and default TOS values appear to be unaffected by low priority ping and SCP traffic. This was unexpected because none of src,dst,proto,proto-src or proto-dst keys should be affected by the TOS value.

After a bit of experimentation I determined that the proto-dst key was the source of the problem. If you spend a bit of time with the flow_get_proto_dst() function in cls_flow.c you’ll see that if the protocol is ICMP or IPIP, as it is in my test, then the following value is returned:

return addr_fold(skb_dst(skb)) ^ (__force u16)skb->protocol;

skb_dst() returns a pointer to a dst_entry structure. Since Linux maintains separate dst_entry structures for each destination,TOS pair the source of the unexpected behaviour is obvious.

I’m not knowledgeable enough about the Linux network stack to be certain but I don’t see any value in returning a value for proto-dst which is random with respect to the actual traffic on the wire. At the very least this is not intuitive behaviour.

If you look at flow_get_proto_src() you’ll see something similar:

return addr_fold(skb->sk);

In this case a pointer to the local socket structure is used as a fallback. Again, this has no relation to the actual packets on the wire and if the packet does not originate at the local machine then no socket exists which causes this value to be zero anyway.

It seems to me that the most intuitive behaviour would be to have the proto-src and proto-dst keys return zero when they are applied to traffic that doesn’t have the notion of transport layer ports.

I’ll post to Netdev about this and see what the kernel devs have to say.

Related to this, I have a patch to the flow classifier that adds tunnel awareness which I plan post to Netdev this weekend as well.