
Monitor RT-tasks runtime and cpufreq residency with bpftrace

Jul 10, 2020

It is a nice surprise to see how fast bpf tools are progressing. Last time I used bpf (via bcc), you had to set up your environment with the proper tools, like clang. And if, like me, you often run on embedded systems, setting up cross-compilation is a bit of a headache. Thanks to adeb, setting up a Debian image with all the tools that you can push onto a device made it a lot easier, though your system needed to have the disk space for the rather large Debian image, which I luckily had.

UPDATE: Aug 09, 2020

I felt the above statement needed clarification. I had tried to use bpf in Android (and other embedded) environments, which was a bit hard; adeb helped a lot. Recently the tools have become available out of the box for most distributions, and I have made good progress in adding the packages to buildroot too.

While working on a patchset that allows controlling the default boost value of RT tasks in the system, I wanted a fast and quick way to test this on my laptop without having to collect traces and parse the results; especially since in my usual working environment I don’t use my laptop for much testing/work.

I came across bpftrace in the past but never had a chance to play with it, so this seemed the right opportunity for that.

First I wanted to get a feel for what RT tasks are running in the system and for how long they run. I would expect that, by changing the new sysctl knob I added, the runtime should increase when I disable the boosting.

Secondly I wanted to look at the frequency residency of each CPU to see the impact on that; after all, the expectation is to reduce the power consumption of RT tasks in the system.

Worth noting that I use schedutil as my governor, which luckily the intel_pstate driver now supports. The sysctl is based on the new util clamp (or uclamp) feature, which is tightly coupled to schedutil.
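
For reference, the governor in use can be checked from sysfs:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor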

Monitoring RT tasks runtime

#!/usr/bin/env bpftrace
/*
 * SPDX-License-Identifier: GPL-2.0
 *
 * Author: Qais Yousef
 */

tracepoint:sched:sched_switch
{
	$npid = args->next_pid;
	$ncomm = args->next_comm;

	$ppid = args->prev_pid;
	$pcomm = args->prev_comm;
	$pprio = 99 - args->prev_prio;

	/* An RT task is switching in: record when it started running */
	if (args->next_prio < 100) {
		@start[$npid, $ncomm] = nsecs;
	}

	/* An RT task is switching out: accumulate its runtime in us */
	if (args->prev_prio < 100) {
		$start = @start[$ppid, $pcomm];

		if ($start) {
			$runtime = (nsecs - $start) / 1000;
			@runtime[cpu, $pprio, $ppid, $pcomm] += $runtime;
		}
	}
}

/* Every 10 seconds: print what was collected, then reset */
interval:s:10
{
	print(@runtime);
	clear(@runtime);
	clear(@start);
}

The code is divided into two blocks: tracepoint:sched:sched_switch and interval:s:10. The first block executes whenever the sched_switch trace event fires; the second is called periodically every 10 seconds and simply prints the data we collected over the past 10 seconds, then clears it out.

	$npid = args->next_pid;
	$ncomm = args->next_comm;

	$ppid = args->prev_pid;
	$pcomm = args->prev_comm;
	$pprio = 99 - args->prev_prio;

In bpftrace, you can have scratch variables and maps. Scratch variables are local, temporary storage; maps are global data that persist across calls.

Scratch variables start with $ and maps with @.
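
As a quick illustration of the difference (a throwaway one-liner, separate from the scripts in this post):

bpftrace -e 'tracepoint:sched:sched_switch {
	$c = cpu;        /* scratch: lives only for this probe invocation */
	@switches[$c]++; /* map: persists, printed when bpftrace exits */
}'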

Each trace event has a set of arguments passed to it that you can access via the args variable. To find out what args are passed to sched_switch, you can look at its format file in tracefs.

# cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 316
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:char prev_comm[16];	offset:8;	size:16;	signed:1;
	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
	field:int prev_prio;	offset:28;	size:4;	signed:1;
	field:long prev_state;	offset:32;	size:8;	signed:1;
	field:char next_comm[16];	offset:40;	size:16;	signed:1;
	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
	field:int next_prio;	offset:60;	size:4;	signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) - 1), "|", { 0x0001, "S" }, { 0x0002, "D" }, { 0x0004, "T" }, { 0x0008, "t" }, { 0x0010, "X" }, { 0x0020, "Z" }, { 0x0040, "P" }, { 0x0080, "I" }) : "R", REC->prev_state & (((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio

The important bits are these

.
.
.
	field:char prev_comm[16];	offset:8;	size:16;	signed:1;
	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
	field:int prev_prio;	offset:28;	size:4;	signed:1;
	field:long prev_state;	offset:32;	size:8;	signed:1;
	field:char next_comm[16];	offset:40;	size:16;	signed:1;
	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
	field:int next_prio;	offset:60;	size:4;	signed:1;
.
.
.

This tells us the name and type of each field passed to the sched_switch trace event, which we can access in our bpftrace script via the args variable.
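
Alternatively, bpftrace can list a tracepoint's arguments directly, which saves digging through tracefs:

bpftrace -lv tracepoint:sched:sched_switch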

NOTE: kernel RT priorities are 0 for highest and 99 for lowest, the opposite of the userspace view. Hence we do 99 - args->prev_prio to ensure we record the priority the same way userspace sees it.
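
For example, starting a task at FIFO priority 10 from userspace (a throwaway demo; the sleep is just a placeholder workload):

# Userspace view: FIFO priority 10 (1 = lowest, 99 = highest)
sudo chrt -f 10 sleep 60 &
# Kernel view: sched_switch reports this task with prio 89, i.e. 99 - 10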

The rest of the code is simple. We use two maps: when a task switches in (args->next_comm), we record the start time in @start using the builtin nsecs variable; when a task switches out (args->prev_comm), we subtract the stored start time from the current time and accumulate the result in @runtime.

	if (args->next_prio < 100) {
		@start[$npid, $ncomm] = nsecs;
	}

	if (args->prev_prio < 100) {
		$start = @start[$ppid, $pcomm];

		if ($start) {
			$runtime = (nsecs - $start) / 1000;
			@runtime[cpu, $pprio, $ppid, $pcomm] += $runtime;
		}
	}

Selecting only RT tasks is simply done by checking that args->{next,prev}_prio is less than 100.
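
The same selection can also be written with bpftrace's predicate syntax when you only care about one side of the switch; a minimal sketch:

/* Fires only when an RT task is being switched out */
tracepoint:sched:sched_switch
/args->prev_prio < 100/
{
	printf("%s (prio %d) switched out on cpu%d\n",
	       args->prev_comm, 99 - args->prev_prio, cpu);
}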

Monitoring cpufreq residencies

#!/usr/bin/env bpftrace
/*
 * SPDX-License-Identifier: GPL-2.0
 *
 * Author: Qais Yousef
 */

tracepoint:power:cpu_frequency
{
	/* args->state is the new frequency in kHz; convert to MHz */
	$freq = args->state / 1000;
	$cpu = args->cpu_id;

	$start = @start[$cpu];
	$end = nsecs;

	$pfreq = @prev_freq[$cpu];

	/* Account the time spent at the previous frequency, in ms */
	if ($start) {
		$duration = ($end - $start) / 1000000;
		@residency[$cpu, $pfreq] += $duration;
	}

	/* Remember when we entered the new frequency */
	@start[$cpu] = $end;
	@prev_freq[$cpu] = $freq;
}

interval:s:10
{
	print(@residency);
	clear(@residency);
}

The code should be familiar by now. The tracepoint:power:cpu_frequency trace event reports the frequency (args->state, in kHz) a CPU (args->cpu_id) is transitioning to. By using maps to store the previous frequency and the time at which the transition happened, we can measure how long each frequency lasted.
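
Before running the full script, the event can be sanity checked with a one-liner:

bpftrace -e 'tracepoint:power:cpu_frequency { printf("cpu%d -> %d MHz\n", args->cpu_id, args->state / 1000); }'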

Output

I run my system with threadirqs passed on the kernel cmdline. I have moved almost all unpinned tasks to CPU0 to stress it more and see if I can notice any slowdown when removing the boost.
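
For reference, moving a task is just a matter of changing its affinity, e.g. with taskset (the pid here is a placeholder):

# Pin task 1234 to CPU0
sudo taskset -pc 0 1234

The threaded irq handlers follow the interrupt's affinity instead, which is controlled via /proc/irq/<nr>/smp_affinity.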

Testing with a browser playing youtube video, at max RT boost:

echo 1024 | sudo tee /proc/sys/kernel/sched_util_clamp_min_rt_default

sudo bpftrace rt_runtime.bt

output in us

@runtime[0, 99, 1294, rtkit-daemon]: 13
@runtime[2, 99, 23, migration/2]: 21
@runtime[1, 99, 17, migration/1]: 26
@runtime[0, 99, 12, migration/0]: 36
@runtime[3, 99, 29, migration/3]: 38
@runtime[0, 50, 631, irq/57-i915]: 75
@runtime[0, 5, 1664, alsa-source-CS4]: 221
@runtime[0, 50, 119, irq/53-xhci_hcd]: 1509
@runtime[3, 10, 16374, AudioIPC0]: 10289
@runtime[2, 10, 10496, AudioIPC Callba]: 14467
@runtime[0, 5, 1663, alsa-sink-CS420]: 24902
@runtime[0, 50, 94, irq/9-acpi]: 80579
@runtime[0, 50, 518, irq/18-wlan%d]: 91062

Note that AudioIPC0 runs for ~10ms in total. We’ll use this as a comparison point once we turn max boosting off.

Note also that AudioIPC0 and AudioIPC Callba are running on CPU3 and CPU2 respectively.

I have pinned some of the tasks to CPU0 to see the effect of cramming it.

The migration tasks are not actually RT tasks but a special class of tasks (the stop class, which runs above RT) to help with, as the name says, migration.

sudo bpftrace freq_residency.bt

output in ms

.
.
.
@residency[1, 3100]: 65
@residency[3, 2300]: 66
@residency[1, 2400]: 77
@residency[0, 1000]: 84
@residency[2, 1000]: 86
@residency[1, 1100]: 124
@residency[2, 2600]: 131
@residency[2, 900]: 158
@residency[1, 1300]: 159
@residency[2, 2300]: 160
@residency[1, 1000]: 191
@residency[1, 1200]: 219
@residency[1, 1400]: 255
@residency[1, 900]: 271
@residency[1, 2500]: 347
@residency[2, 1400]: 364
@residency[1, 2200]: 421
@residency[0, 800]: 2797
@residency[2, 3100]: 3100
@residency[3, 800]: 4403
@residency[3, 3100]: 5200
@residency[2, 800]: 5541
@residency[0, 3100]: 6226
@residency[1, 800]: 7465

Note that CPU0 spends ~6.2 seconds @3.1GHz. That’s ~62% of the 10 second run. Other CPUs have high residency @3.1GHz too: CPU3 ~52% and CPU2 ~31%.

Testing with a browser playing youtube video, at 0 ‘boost’:

echo 0 | sudo tee /proc/sys/kernel/sched_util_clamp_min_rt_default

sudo bpftrace rt_runtime.bt

output in us

@runtime[0, 99, 1294, rtkit-daemon]: 41
@runtime[2, 99, 23, migration/2]: 47
@runtime[0, 99, 12, migration/0]: 53
@runtime[1, 99, 17, migration/1]: 62
@runtime[3, 99, 29, migration/3]: 143
@runtime[0, 50, 631, irq/57-i915]: 190
@runtime[0, 5, 1664, alsa-source-CS4]: 273
@runtime[0, 50, 119, irq/53-xhci_hcd]: 898
@runtime[0, 50, 518, irq/18-wlan%d]: 13125
@runtime[3, 10, 16374, AudioIPC0]: 23263
@runtime[2, 10, 10496, AudioIPC Callba]: 32536
@runtime[0, 5, 1663, alsa-sink-CS420]: 49641
@runtime[0, 50, 94, irq/9-acpi]: 77025

Note that AudioIPC0 now runs for ~23ms. It is reasonably slower because it no longer runs at max frequency, though that is still only a jump from 0.1% to 0.23% of the 10 second window. There’s no noticeable impact: 10 vs 23ms of total runtime over 10 seconds is too small to be felt.

sudo bpftrace freq_residency.bt

output in ms

.
.
.
@residency[2, 2600]: 198
@residency[1, 2600]: 213
@residency[2, 2500]: 216
@residency[0, 2700]: 231
@residency[1, 2100]: 238
@residency[1, 2700]: 245
@residency[3, 2700]: 249
@residency[1, 900]: 353
@residency[3, 2200]: 354
@residency[0, 1100]: 444
@residency[0, 2900]: 475
@residency[0, 2600]: 480
@residency[0, 2800]: 486
@residency[3, 900]: 512
@residency[3, 2400]: 538
@residency[2, 900]: 645
@residency[2, 2300]: 1322
@residency[1, 2400]: 1471
@residency[1, 800]: 6613
@residency[3, 800]: 6887
@residency[0, 800]: 7113
@residency[2, 800]: 8172

Now CPU0 spends ~71% @800MHz! Similarly CPU3 spends ~68% @800MHz, and CPU2 spends ~81% @800MHz.

3.1GHz frequency is not visible at all in the top numbers.

That is a massive reduction in the residency of the highest frequency in the system, which should lead to better battery life for everyday use.

Conclusion

In less than an hour, and without having to set up any special tools, I managed to write two scripts that gave me nice visibility into the two aspects of the system I was interested in and helped me evaluate the impact of my patchset.

The scripts are portable and can easily be re-used in the future. Very nice!

Next is to see how far bcc tools have evolved..

You can find the scripts here.