Published on

Introductory notes on the new Linux scheduler (EEVDF)

Authors

I've recently updated the material of my Advanced Operating System course with an analysis of the new Linux EEVDF scheduler, which has supplanted the original CFS scheduler. I'll recap here my introductory notes, starting with a brief overview of the original scheduler (CFS).

The Completely Fair Scheduler (CFS)

CFS stands for “Completely Fair Scheduler,” and was the desktop process scheduler implemented by Ingo Molnar in Linux kernel version 2.6.23 up until version 6.14.

CFS assigned timeslices τp\tau_p to each process pp dynamically based on total system load which is derived from the nice values νi\nu_i of all runnable processes:

τp=f(ν0,,νp,,νn1,τˉ,μ)max(λpτˉλi,μ)\tau_p = f(\nu_0, \ldots, \nu_p, \ldots, \nu_{n-1}, \bar{\tau}, \mu) \sim max(\frac{\lambda_p \bar{\tau}}{\sum\lambda_i}, \mu)

where

  • λi(νi)\lambda_i(\nu_i) is the load associated with a process, which is a function of the nice value νi\nu_i. It has an exponential formula, i.e., λibνi\lambda_i \simeq b^{-\nu_i} (b=1.25). For numerical reasons λi\lambda_i is stored as weight wi=210λiw_i=2^{10}\lambda_i. Most of the time however, the 2102^{10} factor cancels out, and because it simplifies formulas, we are going to express formulas in terms of λ\lambda and not ww.
  • τˉ\bar{\tau} was called schedule latency. Configurable, default 6ms. It was the targeted time for each process to get a chance to run.
  • μ\mu is called minimum granularity. Configurable, default 0.75ms. This ensured that that every low-priority process got a minimum amount of CPU time.
  • λpλi\frac{\lambda_p}{\sum\lambda_i} is the process share of time.

To easily ensure fairness, CFS introduced a virtual runtime (vruntime) variable ρp\rho_p as an absolute measure of the dynamic priority of each process pp whereby processes with lower ρ\rho are given priority because they lag behind. ρp\rho_p is influenced by the time ϵp\epsilon_p that process pp has used and is inversely proportional to its load λp\lambda_p. The key reason for using ρ\rho is that it offers a straightforward way to directly compare processes with different priorities. Specifically, assuming that all processes start with the same ρ\rho, if each of them has been served for the assigned timeslice τp\tau_p, all of them will end up again with the same ρ\rho because they have received the same amount of Δρ\Delta \rho:

p, Δρp=τpλp=τˉλpλpλi=τˉλi\forall p, ~\Delta\rho_p = \frac{\tau_p}{\lambda_p} = \frac{\bar{\tau}\lambda_p}{\lambda_p \sum\lambda_i} = \frac{\bar{\tau}}{\sum\lambda_i}

If a process has a lower ρ\rho it means that it lags behind and thus must be given priority. In fact, the process chosen for execution is the one with the minimum value of ρ\rho, i.e., ρmin\rho_{min}. Such a variable is tracked as min_vruntime in the runqueue and will play a role also in EEVDF.

CFS aimed for fairness but did not account for latency requirements well whereby some processes need quick access to the CPU but don't require much time1. CFS didn't offer a way to express these latency needs a part from specific latency-nice patches (which add a second nice value which controls when the CPU time will be available to that process) but a different and more streamlined solution has been cooking for sometime. Enter EEVDF.

Earliest Eligible Virtual Deadline First scheduler

EEVDF, based on a 1995 paper by Ion Stoica and Hussein Abdel-Wahab, has been introduced to address the above issues with latency. Like CFS, as of kernel 6.14.4, EEVDF divides CPU time fairly among processes by calculating the processes' "lag," or the difference between expected and actual service time; processes with positive lag are prioritized for scheduling. However, unlike CFS, EEVDF uses the concept of virtual deadlines to decide which process to run next allowing it to prioritize the latency-sensitive tasks.

EEVDF, like CFS, tracks the virtual runtime of each process, but introduces also the concept of global virtual time at time tt:

ρ(t)=tpλp\rho(t) = \frac{t}{\sum_p \lambda_p}

Intuitively, in an ideal system, if a unit of ρ(t)\rho(t) is passed then every process pp should have received an amount of time proportional to its load λp\lambda_p. If for some reason the virtual runtime of a process is less than ρ(t)\rho(t) then we say that the process lags behind and is owed CPU time.

In Linux, there is not an explicit variable that tracks ρ\rho. It is usually approximated with a weighted average of virtual runtimes of all running processes:

ρˉ(t)=iλiρipλp\bar{\rho}(t) = \frac{\sum_i \lambda_i \rho_i}{\sum_p \lambda_p}

For numerical stability the kernel stores ρˉ\bar{\rho} in two different variables called avg_vruntime and avg_load ρ^,Λ\hat{\rho}, \Lambda:

ρ^=iwi(ρiρmin),  Λ=pwp\hat{\rho} = \sum_i w_i * (\rho_i - \rho_{min}), ~~\Lambda = \sum_p w_p

Here, min_vruntime ρmin(t)\rho_{min}(t) is used to reduce the magnitudes involved in computations (it is always equal to the minimum value of runtime you can find in the run queue at any given moment). Implicitly the value of ρˉ\bar{\rho} could always be reconstructed from ρ^,Λ\hat{\rho}, \Lambda and ρmin\rho_{min} as: ρˉ=ρ^/Λ+ρmin\bar{\rho} = \hat{\rho}/\Lambda + \rho_{min}.

As in CFS, tasks that are behind in virtual runtime must be given priority. In EEVDF, this means only eligible tasks (those with positive lag) can be scheduled, which ensures fairness.

To compute eligibility for process pp, the kernel measures the actual process lag, i.e., how much actual time do we owe to a process pp at time tt relative to the current virtual time:

αp(t)=wp(ρˉ(t)ρp(t))\alpha_p(t) = w_p(\bar{\rho}(t) - \rho_p(t))

A task pp is considered eligible when ρp(t)ρˉ(t)\rho_p(t) \leq \bar{\rho}(t). The kernel tries hard to avoid loss in precision to compute if a task is eligible and in particular the actual formula used is is

(ρpρmin)Λρ^(\rho_p - \rho_{min})\Lambda\leq\hat{\rho}

Ensuring latency-sensitive operation

To prioritize among eligible tasks, EEVDF uses virtual deadlines to decide which process to run next. The virtual deadline represents the urgency with which a process should be scheduled, and the process with the earliest (smallest) virtual deadline δ\delta is chosen to run first. The deadline δ\delta is computed as

δp=τpλp+ρp\delta_p = \frac{\tau_p}{\lambda_p} + \rho_p

Unlike CFS, the timeslice τp\tau_p of the process can be a default timeslice or a custom one set via the sched_setattr() system call. The smaller the timeslice, the highest priority will be given to that eligible task.

Footnotes

  1. Realtime scheduling classes could handle this, but they are somewhat used for privileged operations and can disrupt other processes if not handled well. Hence, the need for improvement in the normal scheduler.