Tuning the Linux Kernel’s Completely Fair Scheduler

After hours of searching the Web, I found nothing that sheds much light on the subject of tuning CFS. Most pages describe CFS as it was in version 2.6.23 of the Linux kernel, and much of the information is simply inaccurate. I really wonder about the system administrator who could not understand why he was unable to create a directory or file under /proc. While I do not know the best set of values for the tunable parameters, I can, hopefully, shed some light on the subject.

Kernel Configuration Options

Since we are dealing with the Linux kernel, the best place to start is the kernel configuration options. The following table lists the CFS-related configuration options for mainline kernel version 2.6.32.3:

CFS Kernel Configuration Options
CONFIG_PREEMPT_NONE This is the “traditional” Linux model that does not support preemption. It is a good model for servers or computation-heavy systems, as it minimizes context switching at the expense of higher latency.
CONFIG_PREEMPT_VOLUNTARY This option changes the behavior of the might_resched() function as defined in kernel.h. It is the default option for desktop systems.
CONFIG_PREEMPT This option supports the lowest latencies at the price of more context switching, and has the biggest impact on the scheduler code.
CONFIG_GROUP_SCHED This option tells CFS to use task groups to control the allocation of CPU usage.
CONFIG_FAIR_GROUP_SCHED Provides group scheduling for normal and batch tasks.
CONFIG_RT_GROUP_SCHED Provides group scheduling for real-time tasks.
CONFIG_USER_SCHED Groups tasks according to User ID.
CONFIG_CGROUP_SCHED Groups tasks according to administrator-defined control groups.
CONFIG_CGROUPS Provides support for grouping sets of processes together for CPUsets, CFS, memory control, or device isolation.
CONFIG_SCHED_SMT For Pentium 4 chips with Hyper-Threading, this option improves performance at the cost of a bit more overhead.
CONFIG_SCHED_MC Improves performance on multi-core CPUs.
CONFIG_SCHED_DEBUG Creates the /proc/sched_debug file, which presents scheduler debugging information. Requires CONFIG_DEBUG_KERNEL=y.
CONFIG_SCHEDSTATS Adds the code that collects scheduler statistics and reports them through /proc/schedstat.

You do not need the kernel source to discover the settings for the above options. You can just grep the configuration file in the /boot directory that matches your running kernel, as illustrated in the following example:

fgrep CONFIG_DEBUG_KERNEL /boot/config-$(uname -r)
fgrep CONFIG_SCHED_DEBUG /boot/config-$(uname -r)

Short of recompiling the kernel, there is no way to change these options for an existing kernel. In future posts, I will present my method for compiling a kernel that works, unless I do something really stupid.

Impact of Preemption

With version 2.6, the Linux kernel became fully preemptible. The preemption configuration options determine how often the schedule() function performs a context switch. The CONFIG_PREEMPT_NONE option was the standard behavior of the kernel prior to version 2.6; under this mode, kernel processes are not subject to preemption. The CONFIG_PREEMPT_VOLUNTARY option allows a kernel process to be voluntarily preempted, while CONFIG_PREEMPT forces preemption except in those cases where a kernel process is holding a lock. Only one of these options can be selected.

Each level of preemption reduces latency at the expense of higher context-switching overhead. While low latency is important for making videos play smoothly, it slows down server-oriented tasks. The higher rate of context switching has a noticeable impact on the performance of older, slower machines.

Preemption is a kernel configuration option; trying to change its behavior by modifying the sysctl parameters of CFS is futile. To find out which option is set, just run the command:

fgrep CONFIG_PREEMPT /boot/config-$(uname -r)

sysctl Parameters

The parameters under /proc/sys are prefixed with sysctl_ in the kernel source, and they can be accessed using the sysctl command. The underlying proc file-system also lets you view the files with the cat command and change writable files with the echo command. By habit, I use the sysctl command for everything under /proc/sys.
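For example, the following commands read the same CFS parameter both ways and should return the same value (the parameter exists in 2.6.32 and is reported in nanoseconds):

# Both commands read the same value.
cat /proc/sys/kernel/sched_min_granularity_ns
sysctl kernel.sched_min_granularity_ns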

The statement, often found in Web articles, that sched_min_granularity_ns requires CONFIG_SCHED_DEBUG=y was true for the early releases of CFS, but it is no longer true by kernel version 2.6.32.3. I am not sure exactly where it changed, but both the parameters that require CONFIG_SCHED_DEBUG and the set of available parameters change from release to release; they have evolved from debugging information into tuning parameters. The key is the sched.h header, which declares the sysctl_ parameters for the scheduler. You do not need to download the kernel source to discover this information; you can use LXR (the Linux Cross Referencer). Just look for sched.h in either the kernel directory or the include/linux directory, and to save time, use the browser's find function to search for sysctl_sched. The #ifdef CONFIG_SCHED_DEBUG block shows which parameters require that kernel option. The default values can be found in kernel/sched_fair.c. You can also check the current values using the sysctl command:

sudo sysctl -a | fgrep kernel.sched_

or

su -c "sysctl -a | fgrep kernel.sched_"

Changing these values requires a thorough understanding of the inner workings of CFS.
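If you do experiment, a change looks like the sketch below. The parameter and value are purely illustrative, not recommendations, and the setting is lost at reboot unless you also add it to /etc/sysctl.conf:

sudo sysctl -w kernel.sched_latency_ns=40000000   # illustrative value only
echo "kernel.sched_latency_ns = 40000000" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p                                     # reload /etc/sysctl.conf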

Scheduler Statistics

The /proc/schedstat data actually came into existence around kernel version 2.6.20. As with most of the /proc data, there is no fancy formatting of the fields; it is up to a user-space application to format and present data from kernel space. Being kernel data, the first place to look for details on the fields is the kernel documentation. For /proc/schedstat, the document path is Documentation/scheduler/sched-stats.txt. Again, you can find this file for your kernel version at LXR. The file has not changed since its implementation, but CFS changed the resulting values. Since CFS did away with run queues, the first three fields will be zero. If you only see nine fields for the CPU data, and not twelve, the first three fields were dropped.
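As a quick way to check which case applies to your kernel, the following sketch (the awk formatting is my own, not part of the kernel interface) prints the schedstat version line and the number of fields in each per-CPU row:

head -n 1 /proc/schedstat                          # schedstat version line
awk '/^cpu/ { printf "%s reports %d fields\n", $1, NF - 1 }' /proc/schedstat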

If CONFIG_SCHED_DEBUG is set, /proc/sched_debug reports the values of the variables at that instant. While the fields are documented, they only have meaning to those kernel programmers who are debugging the scheduler.

Grouping Processes

The standard behavior of CFS is to be completely fair to each task. However, being completely fair to every task may not yield the best overall performance of the system. To address this, CFS added support for task groups around kernel version 2.6.25. Before you can tune the behavior of CFS using groups, you need to know how the kernel is configured.

If CONFIG_GROUP_SCHED is set, then the following configuration choices apply:

  1. CONFIG_FAIR_GROUP_SCHED determines whether group scheduling applies to the SCHED_NORMAL and SCHED_BATCH policies.
  2. CONFIG_RT_GROUP_SCHED applies group scheduling to the SCHED_FIFO and SCHED_RR policies.
  3. Group scheduling uses one of the following grouping methods, which applies to all selected policies:
    1. CONFIG_USER_SCHED groups tasks by User ID.
    2. CONFIG_CGROUP_SCHED uses the “cgroup” pseudo file-system as the mechanism for configuring task groups.

The Documentation/scheduler/sched-design-CFS.txt file explains how a system administrator can modify the behavior of CFS using either of the above methods. Both solutions have their good and bad points, and both can be a challenge to implement.

User ID Groups

With the User ID method, the kernel creates the /sys/kernel/uids directory. In this directory, which is actually a kobject, there is an entry for each uid with a cpu_share attribute that lets you set that uid's CPU share. The default CPU share is 1024. Increasing the value gives the user a bigger share of the CPU resources. From the kernel code, it appears that CFS uses the real User ID, not the effective User ID.

The advantage of User ID groups is that they apply to every task started by that uid. The quirky part is that a setuid program still runs as the real uid while having the permissions of the effective uid. The drawback is that the kernel generates the /sys directory at boot time; consequently, a script would have to run at the end of the boot process to set the values for each uid.
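Such a script could be as simple as the sketch below, which follows the example in sched-design-CFS.txt; the uid (1000) and the share (2048) are only illustrations:

# Run as root: double the CPU share of uid 1000, then verify it.
echo 2048 > /sys/kernel/uids/1000/cpu_share
cat /sys/kernel/uids/1000/cpu_share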

Scheduling with cgroups

I checked Ubuntu 9.10, openSUSE 11.1, and Fedora 11. Each of these distros has CONFIG_CGROUP_SCHED set, yet not one created the /dev/cpuctl directory, nor did they mount the cgroup file-system on any other mount point. I see two major problems with cgroups as a group scheduling method:

  1. The /dev directory uses the tmpfs pseudo file-system, and the udevd daemon generates the contents. Consequently, the instructions referenced in sched-design-CFS.txt would have to be in a script run after each boot.
  2. While you can create separate groups under the /dev/cpuctl directory, you still have to save the PID of each process to the /dev/cpuctl/<group>/tasks file. This means that every process would have to be scripted to insert its PID into the proper group.
  3. The good news is that upon termination of the process, the kernel removes the PID from the group.

While it works, it means you have to create a boot script that builds the directory structure in /dev and defines the CPU share for each group, along the lines of the sketch below. You would also have to script every process to place it in the correct group. Processes that do not belong to any group still receive the default CPU share of 1024.
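Here is what such a boot script might look like, modeled on the example in sched-design-CFS.txt; the mount point, group names, and share values are illustrative assumptions:

#!/bin/sh
# Create the mount point and mount the cgroup file-system with the cpu controller.
mkdir -p /dev/cpuctl
mount -t cgroup -o cpu none /dev/cpuctl

# Create two illustrative groups and give one a larger CPU share.
mkdir -p /dev/cpuctl/multimedia /dev/cpuctl/browser
echo 2048 > /dev/cpuctl/multimedia/cpu.shares
echo 1024 > /dev/cpuctl/browser/cpu.shares

# Each process must still be placed into a group by writing its PID
# into that group's tasks file, for example:
#   echo $$ > /dev/cpuctl/browser/tasks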

Summary

CFS represents a radical change from the previous scheduler. Like all radical changes, it is going through a maturation process. Before experimenting with the scheduler parameters in /proc/sys/kernel, you should consider the following:

  1. Check the preemption configuration. The level of preemption has a tremendous impact on the latency of the kernel.
  2. Modifying the parameters in /proc/sys/kernel applies to every task.
  3. Although it is messy, consider using control groups as a way to modify the behavior of the scheduler; they give you more granular control.

Once I complete the upgrades to the latest releases of the distros, I may experiment with the BFS scheduler. If I do, I will post my opinion.