There is a lot happening on the XDP front in the Linux kernel these days. This presentation provides a good overview.

I love the idea that eBPF is becoming a policy language for the kernel.

Linux 4.10

As always, KernelNewbies is the place to go for a great summary of new kernel features. Here’s some 4.10 highlights I’m interested in.

BPF for lightweight tunnel encapsulation commit, https://git.kernel.org/torvalds/c/f74599f7c5309b21151233b98139e9b723fd1110


TCP: sender chronographs instrumentation. This feature exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently it can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported simultaneously along with these timestamps allow further breaking down the various sender limitation. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing small receive window commit, commit, commit, commit, commit, commit


sched/act_mirred: Implement the corresponding ingress actions TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR (Up until now, action mirred supported only egress actions (either TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR). This allows attaching filters whose target is to hand matching skbs into the rx processing of a specified device commit, commit, commit, commit


Add the LRU versions of the existing BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_PERCPU_HASH maps: BPF_MAP_TYPE_LRU_HASH and BPF_MAP_TYPE_LRU_PERCPU_HASH sample, commit, commit, commit, commit, commit


Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (aka. background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That’s the definition of background activity…But for a long as it can be remembered, heavy buffered writers have not behaved like that. For instance, if you do something like $ dd if=/dev/zero of=foo bs=1M count=10k, or try to copy files to USB storage, and then try and start a browser or any other large app, it basically won’t start before the buffered writeback is done, and your desktop, or command shell, feels unreponsive. These problems happen because heavy writes -the kind of write activity caused by the background writeback- fill up the block layer, and other IO requests have to wait a lot to be attended (for more details, see the LWN article).

This release adds a mechanism that throttles back buffered writeback, which makes more difficult for heavy writers to monopolize the IO requests queue, and thus provides a smoother experience in Linux desktops and shells than what people was used to. The algorithm for when to throttle can monitor the latencies of requests, and shrinks or grows the request queue depth accordingly, which means that it’s auto-tunable, and generally, a user would not have to touch the settings. This feature needs to be enabled explicitly in the configuration (and, as it should be expected, there can be regressions)

Recommended LWN article: Toward less-annoying background writeback

eBPF Map Size Limits

Recently, I’ve done some work with eBPF and specifically the in-kernel maps that are manipulated and shared by both kernel and user space code.

When doing this I ran into permission errors when installing large maps. It took a little while to figure out that the cause of this was the root user’s locked memory limit being too low (thanks Daniel Borkmann).

The locked memory limit is modified with ulimit:

ulimit -l unlimited