Posted on Jun 2, 2023 • Edited on Oct 4, 2023

How Linux Works: Chapter 3 Process Scheduler (Part 1)

In Chapter 2, we mentioned that most processes in the system are in the sleep state. So, how does the kernel make each process run on the CPU when there are multiple executable processes in the system? In this chapter, we will discuss the Linux kernel's process scheduler (hereinafter referred to as the scheduler), which is responsible for allocating CPU resources to processes.

In books about computer science, the scheduler is explained as follows:

Only one process can run at a time on a single logical CPU
It allows multiple executable processes to use the CPU in turn, in units called "time slices"

For example, if there are three processes, p0, p1, and p2, it would look like the following figure.

Let's confirm whether this is actually the case in Linux by conducting an experiment.

Elapsed Time and CPU Time

To understand the content of this chapter, it is essential to understand the concepts of elapsed time and CPU time related to processes. This section explains these times. Their definitions are as follows:

Elapsed Time: The time that passes from the start to the end of a process. It is like the time measured with a stopwatch from the start to the end of a process.
CPU Time: The time that a process actually uses a logical CPU.

These terms probably won't make much sense just by looking at the explanation, so let's understand them through an experiment. If you run a process using the time command, you can get the elapsed time and CPU time from the start to the end of the target process. For example, let's run the following "load.py" which terminates after consuming some CPU resources.

 #!/usr/bin/python3 # This parameter is to change the amount of CPU time used by this program. It is convenient to change this parameter to make the CPU time consumption several seconds. NLOOP=100000000 for _ in range(NLOOP): pass

The result is as follows.

 $ time ./load  real 0m2.357s user 0m2.357s sys 0m0.000s

The output includes three lines starting with real, user, and sys. Among these, real indicates the elapsed time, and user and sys indicate the CPU time. user refers to the time when the process was running. In contrast, sys refers to the time when the kernel was operating as a result of system calls issued by the process. The load program continues to use the CPU from the start to the end of its execution and does not issue any system calls during that time, so real and user are almost the same, and sys is almost zero. The reason why it is "almost" is because the Python interpreter calls a few system calls at the beginning and end of the process.

Let's also run an experiment with the sleep command, which sleeps most of the time.

 $ time sleep 3  real 0m3.009s user 0m0.002s sys 0m0.000s

Since it terminates after sleeping for 3 seconds, "real" is nearly 3 seconds. On the other hand, this command gives up the small mount of CPU time and goes to sleep right after starting, and when it begins to use only a little CPU time again 3 seconds later before terminating. So "user" and "sys" are almost 0. The differences of two values are shown in the following figure.

When Using Only One Logical CPU

To simplify the discussion, let's first consider the case of a single logical CPU. The "multiload.py" program will be used for the experiment. This program performs the following actions:

 Usage: ./multiload [-m] <number of processes> Executes a specified number of load processing processes for a set period of time and waits for all to finish. The time it took to execute each process is outputted. By default, all processes run only on a single logical CPU. Meaning of the option: -m: Allows each process to run on multiple CPUs.

Here is its source code.

 #!/bin/bash MULTICPU=0 PROGNAME=$0 SCRIPT_DIR=$(cd $(dirname $0) && pwd) usage() { exec >&2 echo Usage: ./multiload [-m] <number of processes> Executes a specified number of load processing processes for a set period of time and waits for all to finish. The time it took to execute each process is outputted. By default, all processes run only on a single logical CPU. Meaning of the option: -m: Allows each process to run on multiple CPUs." exit 1 } while getopts "m" OPT ; do case $OPT in m) MULTICPU=1 ;; \?) usage ;; esac done shift $((OPTIND - 1)) if [ $# -lt 1 ] ; then usage fi CONCURRENCY=$1 if [ $MULTICPU -eq 0 ] ; then # Pin the "load.py" program to CPU0 taskset -p -c 0 $$ >/dev/null fi for ((i=0;i<CONCURRENCY;i++)) do time "${SCRIPT_DIR}/load.py" & done for ((i=0;i<CONCURRENCY;i++)) do wait done

Let's first set "" to 1 and run it. This is almost the same as running the load program alone.

 $ ./multiload 1  real 0m2.359s user 0m2.358s sys 0m0.000s

In my environment, the elapsed time was 2.359 seconds. What about when the parallelism is 2 and 3?

 $ ./multiload 2  real 0m4.730s user 0m2.360s sys 0m0.004s real 0m4.739s user 0m2.374s sys 0m0.000s $ ./multiload 3  real 0m7.095s user 0m2.360s sys 0m0.004s real 0m7.374s user 0m2.499s sys 0m0.000s real 0m7.541s user 0m2.676s sys 0m0.000s

As the level of parallelism increased by 2 times, 3 times, the CPU time did not change much, but the elapsed time increased by about 2 times, 3 times. This is because, as mentioned at the beginning, only one process can run at the same time on one logical CPU, and the scheduler gives CPU resources to each process in turn.

When Using Multiple Logical CPUs

Next, let's also take a look at the case of multiple logical CPUs. When the multiload program is run with the "-m" option, the scheduler tries to distribute the multiple load processes evenly across all logical CPUs. As a result, for instance, if there are two logical CPUs and two load processes, as shown in the following figure, the two load processes can each monopolize the resources of a logical CPU.

The logic of load balancing is very complex, so this book avoids detailed explanation.

Let's actually verify this. The results of running the multiload program with the -m option and parallelism from 1 to 3 are shown below.

 $ ./multiload -m 1  real 0m2.361s user 0m2.361s sys 0m0.000s $ ./multiload -m 2  real 0m2.482s user 0m2.482s sys 0m0.000s real 0m2.870s user 0m2.870s sys 0m0.000s $ ./multiload -m 3  real 0m2.694s user 0m2.693s sys 0m0.000s real 0m2.857s user 0m2.853s sys 0m0.004s real 0m2.936s user 0m2.935s sys 0m0.000s

For all processes, the values of real and user+sys were almost the same. In other words, we can see that each process was able to monopolize the resources of a logical CPU.

Cases where "user + sys" is larger than "real"

Intuitively, you might think that "real >= user + sys" will always hold, but in reality, there can be cases where the value of "user + sys" is slightly larger than that of "real". This is due to the different methods used to measure each time and the fact that the precision of the measurement is not that high. There's no need to worry too much about this; it's enough to be aware that such things can happen.

Furthermore, there are cases where "user + sys" can be significantly larger than "real". For example, this occurs when you run the "multiload.py" program with the "-m" option and set the number of processes to 2 or more. Now, let's try running "./multiload -m 2" through the time command.

 $ time ./multiload -m 2  real 0m2.510s user 0m2.502s sys 0m0.008s real 0m2.725s user 0m2.716s sys 0m0.008s real 0m2.728s user 0m5.222s sys 0m0.016s

The first and second entries are data about the load processing processes of the "multiload.py" program. The third entry is data about the "multiload.py" program itself. As you can see, the value of user is about twice that of "real". In fact, the "user" and "sys" values obtained by the "time" command are the sums of the values for the target process and its terminated child processes. Therefore, if a process generates child processes and they each run on a different logical CPU, the value of "user+sys" could be greater than real. The "multiload.py" program fits exactly this condition.

previous part
next part

NOTE

This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.