Reading notes on "Linux Kernel Development" (LKD): processes and scheduling

Date: 2019-12-14


1. Processes

process:

executing program code (text section)
data section containing global variables
open files
pending signals
internal kernel data
address space
one or more threads of execution

Processes, in effect, are the living result of running program code.

This is LKD's classic description of a process.

1.1 Process descriptor

In Linux, the process descriptor is struct task_struct. This structure is about 1.7 KB on a 32-bit machine.

1.1.1 PID

struct task_struct {
    ...
    pid_t pid;
    ...
};

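1.1.2 The current macro

Kernel code usually obtains a pointer to the task_struct and operates on the process directly through that pointer. The current macro that returns this pointer is implemented per architecture. On x86, for example: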

     +---------+
     | current |
     +----+----+
          |
          v
    +-----+---------+
    | get_current() |
    +-----+---------+
          |
          v
+---------+------------+
| percpu_read_stable() |
+---------+------------+
          |
          v
  +-------+----------+
  | percpu_from_op() |
  +------------------+

#define __percpu_arg(x)     "%%"__stringify(__percpu_seg)":%P" #x

#ifdef CONFIG_X86_64
#define __percpu_seg        gs
#define __percpu_mov_op     movq
#else
#define __percpu_seg        fs
#define __percpu_mov_op     movl
#endif

/* 32-bit expansion */
asm("movl %%fs:%P1, %0"
    : "=r" (pfo_ret__)
    : "p" (&(var)));

/* 64-bit expansion */
asm("movq %%gs:%P1, %0"
    : "=r" (pfo_ret__)
    : "p" (&(var)));

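This assembly reads the value at offset %P1 relative to the fs segment (32-bit) or the gs segment (64-bit) (see: Linux kernel data structures). What exactly is stored at that location? (TODO) As far as I can tell, get_current() passes the per-CPU variable current_task, so the value read is the current CPU's pointer to the running task_struct.

The macro above lives in /arch/x86/include/asm; the source tree also has a generic definition in /include/asm-generic: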

       +---------+
       | current |
       +----+----+
            |
            v
    +-------+-------+
    | get_current() |
    +-------+-------+
            |
            v
+-----------+-----------+
| current_thread_info() |
+-----------+-----------+
            |
            v
 +----------+-----------+
 | percpu_read_stable() |
 +----------------------+

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};
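
Because thread_info shares the union with the kernel stack and sits at its lowest address, architectures that do not keep the pointer in per-CPU data can find it by masking the stack pointer. The following is only a user-space sketch of that idea: THREAD_SIZE = 8 KB, the *_demo types and demo_lookup_pid() are assumptions made for illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192UL  /* assumed: an 8 KB kernel stack */

struct task_struct_demo { int pid; };
struct thread_info_demo { struct task_struct_demo *task; };

union thread_union_demo {
    struct thread_info_demo thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(long)];
};

/* Round any address inside the stack down to the THREAD_SIZE boundary. */
static struct thread_info_demo *thread_info_from_sp(const void *sp)
{
    return (struct thread_info_demo *)((uintptr_t)sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Build one fake "kernel stack" and look the task up from an address inside it. */
static int demo_lookup_pid(int pid)
{
    struct task_struct_demo task = { .pid = pid };
    unsigned char *raw = malloc(2 * THREAD_SIZE);
    if (!raw)
        return -1;

    /* Align the union to THREAD_SIZE by hand inside the larger allocation. */
    uintptr_t base = ((uintptr_t)raw + THREAD_SIZE - 1) & ~(uintptr_t)(THREAD_SIZE - 1);
    union thread_union_demo *u = (union thread_union_demo *)base;
    assert((base & (THREAD_SIZE - 1)) == 0);

    u->thread_info.task = &task;

    /* Pretend the stack pointer is somewhere near the top of the stack. */
    void *fake_sp = &u->stack[THREAD_SIZE / sizeof(long) - 16];
    int found = thread_info_from_sp(fake_sp)->task->pid;

    free(raw);
    return found;
}
```

Calling demo_lookup_pid(42) walks from the fake stack pointer back to the thread_info and returns the pid stored in the task it points to.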

1.2 Process states

#define TASK_RUNNING        0
#define TASK_INTERRUPTIBLE  1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED      4
#define __TASK_TRACED       8

struct task_struct {
    ...
    volatile long state;
    ...
}

set_current_state(state);
set_task_state(current, state);

1.3 The life of a process

+----------+       +----------+      +----------+
|  fork()  +------>+  exec()  +----->+  exit()  |
+----------+       +----+-----+      +----+-----+
                        |                 |
                        |                 v
                        |            +----+-----+
                        +----------->+  wait()  +--------->
                                     +----------+
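
The same lifecycle can be walked through from user space. This is a minimal sketch, assuming a POSIX system with sh on the PATH; spawn_and_wait() is a hypothetical helper name, not something from the book.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork() a child, exec() a shell command in it, then wait() for its exit status.
 * Returns the child's exit code, or -1 on error. */
static int spawn_and_wait(const char *cmd)
{
    assert(cmd != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image. */
        execlp("sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);  /* only reached if exec failed */
    }

    /* Parent: reap the child so it does not stay a zombie. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_and_wait("exit 7") goes through fork, exec, exit and wait in that order and hands the child's exit code back to the parent.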

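1.3.1 Process creation (CoW fork)

Copy-on-Write (CoW): after a CoW fork(), there is only one copy of all the data of parent and child, that is, both map the same physical memory. Their PTEs are all marked read-only, so as soon as either the parent or the child writes to a shared region, a page fault is triggered. The kernel sees that the fault was caused by a write to a CoW region, copies the page being written, and points the PTE of the faulting process at the new copy (marking that PTE writable). The write is then retried and succeeds on the copy.

Linux implements CoW fork.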
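
CoW itself is invisible to user space; what can be observed is its result, namely that a write in the child never shows up in the parent. A small sketch under that assumption (parent_value_after_child_write() is a made-up helper name):

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child overwrites a variable it inherited from the parent.
 * Returns the value the parent still sees afterwards, or -1 on error. */
static int parent_value_after_child_write(void)
{
    volatile int shared = 1;  /* same physical page in both processes right after fork() */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        shared = 2;  /* write fault: the kernel gives the child its own copy of the page */
        _exit(shared == 2 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status));  /* the child is expected to exit normally */

    return shared;  /* the parent's page was never modified */
}
```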

+------------+   +-------------+   +-------------+   +-----------------+
| sys_fork() |   | sys_vfork() |   | sys_clone() |   | kernel_thread() |
+------+-----+   +-------------+   +----+--------+   +-------+---------+
       |               |                |                    |
       |               +------+  +------+                    |
       |                      |  |                           |
       +-------------------+  |  |  +------------------------+
                           |  |  |  |
                          +v--v--v--v--+
                          |  do_fork() |
                          +------+-----+
                                 |
                         +-------+--------+
                         | copy_process() |
                         +----+---+-------+
      +--------------------+  |   |  |------------------------------+
      |                       |   +---------------+                 |
      v                       v                   v                 v
 +----+--------+      +-------+---------+     +---+----------+    +-+---+
 | alloc_pid() |      |dup_task_struct()|     | copy_flags() |    | ... |
 +-------------+      +-----------------+     +--------------+    +-----+

Whether the child shares or copies the parent's resources depends on the flags argument:

#define CSIGNAL         0x000000ff
#define CLONE_VM        0x00000100
#define CLONE_FS        0x00000200
#define CLONE_FILES     0x00000400
#define CLONE_SIGHAND   0x00000800
...
#define CLONE_NEWNET    0x40000000
#define CLONE_IO        0x80000000

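After a successful fork, Linux usually lets the child run first. The reason is as follows:

Suppose that, once both have returned to user space, the parent is scheduled first. The parent may perform a write, which triggers a CoW copy. If the child is scheduled first instead, it usually calls exec right after fork and from then on no longer shares data with the parent, so even if the parent writes later, no CoW copy is triggered.

To Linux, a thread is a special kind of process. Whether a thread or a process is created depends on the flags passed at creation time:

// thread
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

// process
clone(SIGCHLD, 0);

In fact Linux has no strict notion of a thread; its threads are simply processes (because processes in Linux are already lightweight).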

Interestingly, note that threads share the virtual memory abstraction, whereas each receives its own virtualized processor.
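
The effect of CLONE_VM can be seen with the glibc clone() wrapper. This is a hedged sketch for Linux with glibc: the 1 MB stack size and the helper names child_fn() and value_after_clone() are assumptions, and a real thread library passes more flags than this.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Runs in the new task: overwrite the integer the creator passed in. */
static int child_fn(void *arg)
{
    *(volatile int *)arg = 2;
    return 0;
}

/* Create a task with the given extra clone flags and return the value
 * the creator sees after the new task has finished, or -1 on error. */
static int value_after_clone(int extra_flags)
{
    const size_t stack_size = 1024 * 1024;  /* assumed stack size for the new task */
    char *stack = malloc(stack_size);
    assert(stack != NULL);

    volatile int value = 1;

    /* The stack grows down on x86, so pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + stack_size, extra_flags | SIGCHLD, (void *)&value);
    if (pid < 0 || waitpid(pid, NULL, 0) != pid) {
        free(stack);
        return -1;
    }

    free(stack);
    return value;
}
```

With CLONE_VM the new task shares the address space, so the creator sees the write; without it the new task gets its own (CoW) copy and the creator's value stays unchanged.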

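1.3.2 Process termination

A process can end its life in two ways:

an explicit call to exit()
an implicit call to exit()

The second case means that the C compiler arranges for exit() to be called after main() returns.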

NORET_TYPE void do_exit(long code)
{
    ...
    exit_signals(tsk);  /* sets PF_EXITING */
    ...
    tsk->exit_code = code;
    ...
    exit_mm(tsk); /* release the mm_struct held by this process */
    ...
    exit_sem(tsk); /* leave the IPC semaphore queues */
    exit_files(tsk);
    exit_fs(tsk);
    ...
    exit_notify(tsk, group_dead);
    ...
    schedule();
    BUG();

    /* Avoid "noreturn function does return".  */
    for (;;)
        cpu_relax();    /* For when BUG is null */
}

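This function never returns. At this point the process has been marked EXIT_ZOMBIE. It is still called a process because three of its resources have not been released yet:

kernel stack
thread_info structure
task_struct structure

These three resources are kept so that the parent can be notified (of the exit code, for example); releasing them is left to the parent.

The parent calls one of the wait family of functions to release the resources above: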

                      +-------------+
                      | sys_wait4() |
                      +------+------+
                             |
                             v
                  +----------+---------+
                  | wait_task_zombie() |
                  +----------+---------+
                             |
                             v
                     +-------+--------+
                     | release_task() |
                     +------+---------+
         +---------------+  |    +-------------------+
         |                  |                        |
         v                  v                        v
+--------+--------+     +---+---------------+     +--+---+
| __exit_signal() |     | put_task_struct() |     | ...  |
+-----------------+     +-------------------+     +------+

From this point on, every trace of the process/thread in the operating system is gone for good.

2. Process scheduling

Scheduling policies:

SCHED_NORMAL/SCHED_OTHER
SCHED_FIFO
SCHED_RR
SCHED_BATCH
SCHED_IDLE

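Process categories:

Normal processes
- interactive processes
- batch processes
Real-time processes

Real-time processes use the SCHED_FIFO/SCHED_RR policies; normal processes use SCHED_NORMAL.

Priorities:

Real-time priority (0~99, a higher value means a higher priority)
Nice priority (-20~19, corresponding to 100~139; a higher value means a lower priority)

Real-time processes use the real-time priority, while normal processes use the nice value. In Linux, real-time processes are always scheduled ahead of normal processes, so the two kinds of priority do not interfere with each other.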
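
The mapping from the nice range to 100~139 is a simple offset. A sketch, assuming MAX_RT_PRIO is 100 as the ranges above imply; nice_to_prio() and is_rt_prio() are illustrative helper names:

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* priorities 0..99 are reserved for real-time tasks */

/* Map a nice value (-20..19) onto the 100..139 range. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* A priority below MAX_RT_PRIO belongs to the real-time range. */
static int is_rt_prio(int prio)
{
    return prio >= 0 && prio < MAX_RT_PRIO;
}
```

A nice value of 0 lands in the middle of the normal range, at 120.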

Scheduler classes:

rt_sched_class
fair_sched_class
idle_sched_class

All of these are of type struct sched_class. Scheduler classes have priorities as well.

Scheduler entities:

sched_entity
sched_rt_entity
sched_dl_entity

The highest priority scheduler class that has a runnable process wins, selecting who runs next.

2.1 Scheduling normal processes

In Linux, normal processes are scheduled by the Completely Fair Scheduler (CFS) algorithm.

CFS is based on a simple concept: Model process scheduling as if the system had an ideal, perfectly multitasking processor. In such a system, each process would receive 1/n of the processor’s time, where n is the number of runnable processes, and we’d schedule them for infinitely small durations, so that in any measurable period we’d have run all n processes for the same amount of time.

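What is described above is only the ideal case. Suppose the system has 100 processes and the measurable period is 1 ms (an extreme example). Every process would then have to be switched out after running for 0.01 ms, which is unrealistic.

We still need a yardstick for how well CFS does, so two concepts are introduced:

targeted latency
minimum granularity (default 1 ms)

In one sentence: within a period as long as the targeted latency, every process should get scheduled, and no process should run for less than the minimum granularity.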
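
The interplay of the two values can be sketched as a small calculation. This assumes equally weighted processes and uses a hypothetical helper cfs_slice_us(); the 20 ms target latency in the usage note is an assumed value, not one taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Timeslice for each of nr_running equally weighted processes, in microseconds:
 * an equal share of the target latency, but never below the minimum granularity. */
static uint64_t cfs_slice_us(unsigned nr_running,
                             uint64_t target_latency_us,
                             uint64_t min_granularity_us)
{
    assert(nr_running > 0);
    uint64_t slice = target_latency_us / nr_running;
    return slice < min_granularity_us ? min_granularity_us : slice;
}
```

With a 20 ms target and a 1 ms floor, 4 processes get 5 ms each, while 100 processes are held at 1 ms each, so the real period stretches beyond the target. Without a floor, the extreme example above (1 ms period, 100 processes) yields 10 us, i.e. 0.01 ms.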

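So far this is only theory. The key question is how long a process should actually run each time it is scheduled. In CFS, that time is determined by the nice values of all normal processes.

First, the weight of each process i is computed from its nice value:

weight[i] ≈ 1024 / (1.25)^(nice[i])

Then the share of the CPU that the process should get is computed from the weights:

CPU proportion[i] = weight[i] / (weight[1] + ... + weight[n])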
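
The two formulas can be checked numerically. This is a floating-point sketch of the approximation only (the helper names nice_to_weight() and cpu_share() are made up here), not the way the kernel stores its weights:

```c
#include <assert.h>

/* weight ≈ 1024 / 1.25^nice, computed by repeated multiplication or division. */
static double nice_to_weight(int nice)
{
    double w = 1024.0;
    for (int i = 0; i < nice; i++)
        w /= 1.25;  /* positive nice: lighter weight */
    for (int i = 0; i > nice; i--)
        w *= 1.25;  /* negative nice: heavier weight */
    return w;
}

/* CPU proportion of process i among n processes with the given nice values. */
static double cpu_share(const int *nice, int n, int i)
{
    assert(n > 0 && i >= 0 && i < n);
    double total = 0.0;
    for (int k = 0; k < n; k++)
        total += nice_to_weight(nice[k]);
    return nice_to_weight(nice[i]) / total;
}
```

Two processes at nice 0 each get half the CPU; a nice 0 process competing with a nice 5 process gets roughly three quarters.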

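The weighting is geometric: each nice step changes the weight by a factor of about 1.25. Scheduling normal processes this way lets CFS come close to perfect multitasking. The CFS implementation has four parts: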

Time Accounting
Process Selection
The Scheduler Entry Point
Sleeping and Waking Up

2.1.1 Time Accounting

struct task_struct {
    ...
    struct sched_entity se;
    ...
}

struct sched_entity {
    ...
    u64         vruntime;
    ...
}

In the ideal CFS model every process would have the same vruntime, but in reality they differ.

CFS uses vruntime to account for how long a process has run and thus how much longer it ought to run.

static void update_curr(struct cfs_rq *cfs_rq)
{
    ...
    delta_exec = (unsigned long)(now - curr->exec_start);
    ...
    __update_curr(cfs_rq, curr, delta_exec);
    ...
}

static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
    ...
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}

As the code shows, vruntime advances by a weighted amount of the time actually run.
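
What the weighting does can be sketched as follows, assuming the delta is scaled by NICE_0_LOAD / weight with NICE_0_LOAD = 1024 (the nice 0 weight from the formula in 2.1). The *_demo names are illustrative and the kernel's own fixed-point arithmetic is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL  /* weight of a nice 0 process */

struct sched_entity_demo {
    uint64_t weight;    /* load weight derived from the nice value */
    uint64_t vruntime;  /* weighted runtime in nanoseconds */
};

/* Scale real runtime: heavier processes accumulate vruntime more slowly. */
static uint64_t calc_delta_fair_demo(uint64_t delta_exec, uint64_t weight)
{
    assert(weight > 0);
    return delta_exec * NICE_0_LOAD / weight;
}

/* Charge delta_exec_ns of real runtime to the entity. */
static void account_runtime_demo(struct sched_entity_demo *se, uint64_t delta_exec_ns)
{
    se->vruntime += calc_delta_fair_demo(delta_exec_ns, se->weight);
}
```

For a nice 0 process vruntime equals real runtime; a process with twice the weight is charged only half as much for the same time on the CPU.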

2.1.2 Process Selection

CFS picks the process with the smallest vruntime to run next. To make that lookup fast, CFS organizes the struct cfs_rq run queue as a red-black tree:

struct cfs_rq {
    ...
    struct sched_entity *curr, *next, *last;
    ...
}

The sched_entity with the smallest vruntime is the leftmost node of the red-black tree.
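
The "leftmost node" rule can be shown with a plain binary search tree keyed by vruntime. Note the simplification: this sketch uses an unbalanced tree rather than a red-black tree, and every name in it (node_demo, tree_insert(), pick_min_vruntime_pid()) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEMO_TASKS 16

struct node_demo {
    uint64_t vruntime;
    int pid;
    struct node_demo *left, *right;
};

/* Insert n into the tree ordered by vruntime; returns the (possibly new) root. */
static struct node_demo *tree_insert(struct node_demo *root, struct node_demo *n)
{
    struct node_demo **link = &root;
    n->left = n->right = NULL;
    while (*link)
        link = (n->vruntime < (*link)->vruntime) ? &(*link)->left : &(*link)->right;
    *link = n;
    return root;
}

/* The smallest key is found by following left children all the way down. */
static struct node_demo *tree_leftmost(struct node_demo *root)
{
    if (!root)
        return NULL;
    while (root->left)
        root = root->left;
    return root;
}

/* Build a tree from an array of vruntimes (pid = array index) and
 * return the pid that would run next, or -1 if the queue is empty. */
static int pick_min_vruntime_pid(const uint64_t *vruntimes, int n)
{
    static struct node_demo pool[MAX_DEMO_TASKS];
    struct node_demo *root = NULL;

    assert(n >= 0 && n <= MAX_DEMO_TASKS);
    for (int i = 0; i < n; i++) {
        pool[i].vruntime = vruntimes[i];
        pool[i].pid = i;
        root = tree_insert(root, &pool[i]);
    }

    struct node_demo *left = tree_leftmost(root);
    return left ? left->pid : -1;
}
```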

2.1.3 The Scheduler Entry Point

The main scheduling entry point in Linux is schedule() in kernel/sched.c, and the heart of that function is pick_next_task():

static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;
    ...
    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        class = class->next;
    }
}

This function looks simple, yet it is the essence of the whole process scheduler. As mentioned above, there are three variables of type struct sched_class:

fair_sched_class
rt_sched_class
idle_sched_class

In sched_rt.c, rt_sched_class registers its own set of functions:

static const struct sched_class rt_sched_class = {
    .next           = &fair_sched_class,
    .enqueue_task       = enqueue_task_rt,
    .dequeue_task       = dequeue_task_rt,
    .yield_task     = yield_task_rt,

    .check_preempt_curr = check_preempt_curr_rt,

    .pick_next_task     = pick_next_task_rt,
    .put_prev_task      = put_prev_task_rt,
    ...
}

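So the logic of pick_next_task() is: walk the scheduler classes from highest to lowest priority and call each class's own pick_next_task_*() function, which looks for a suitable process in that class's struct *_rq run queue. The class with the highest priority is: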

#define sched_class_highest (&rt_sched_class)
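
The walk over the class chain can be imitated in user space with function pointers. This is a toy model: the run queue holds at most one candidate per class, and all *_demo names are assumptions that only mirror the shape of the kernel code above.

```c
#include <assert.h>
#include <stddef.h>

struct task_demo { const char *name; };

/* Toy run queue: one candidate per class; NULL means nothing runnable there. */
struct rq_demo {
    struct task_demo *rt_task;
    struct task_demo *fair_task;
    struct task_demo idle_task;
};

struct sched_class_demo {
    const struct sched_class_demo *next;  /* next lower-priority class */
    struct task_demo *(*pick_next_task)(struct rq_demo *rq);
};

static struct task_demo *pick_rt(struct rq_demo *rq)   { return rq->rt_task; }
static struct task_demo *pick_fair(struct rq_demo *rq) { return rq->fair_task; }
static struct task_demo *pick_idle(struct rq_demo *rq) { return &rq->idle_task; }

/* Chain: rt -> fair -> idle, highest priority first. */
static const struct sched_class_demo idle_class_demo = { NULL, pick_idle };
static const struct sched_class_demo fair_class_demo = { &idle_class_demo, pick_fair };
static const struct sched_class_demo rt_class_demo   = { &fair_class_demo, pick_rt };

/* The first class that has a runnable task decides who runs next. */
static struct task_demo *pick_next_task_demo(struct rq_demo *rq)
{
    for (const struct sched_class_demo *c = &rt_class_demo; c; c = c->next) {
        struct task_demo *p = c->pick_next_task(rq);
        if (p)
            return p;
    }
    assert(0 && "the idle class always returns a task");
    return NULL;
}
```

A runnable real-time task always wins; the fair class is consulted only when the real-time class has nothing, and the idle task is the fallback.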

2.1.4 Sleeping and Waking Up

voluntary sleep
involuntary sleep

The kernel uses a structure to organize sleeping tasks:

struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

The mechanism is similar to sleep/wakeup in xv6.
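
The bookkeeping behind a wait queue can be modelled in a few lines. This is a single-threaded toy, assuming a singly linked list instead of list_head and no locking or real scheduling; the *_demo names and struct wq_task are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_RUNNING        0
#define TASK_INTERRUPTIBLE  1

struct wq_task {
    int pid;
    long state;
    struct wq_task *next;  /* link inside the wait queue */
};

struct wait_queue_head_demo {
    struct wq_task *first;
};

/* Going to sleep: mark the task as sleeping and put it on the queue. */
static void sleep_on_demo(struct wait_queue_head_demo *q, struct wq_task *t)
{
    assert(t->next == NULL);  /* the task must not be queued already */
    t->state = TASK_INTERRUPTIBLE;
    t->next = q->first;
    q->first = t;
}

/* Waking up: take every task off the queue and mark it runnable again.
 * Returns the number of tasks woken. */
static int wake_up_all_demo(struct wait_queue_head_demo *q)
{
    int woken = 0;
    while (q->first) {
        struct wq_task *t = q->first;
        q->first = t->next;
        t->next = NULL;
        t->state = TASK_RUNNING;
        woken++;
    }
    return woken;
}
```

In the real kernel the sleeper would call schedule() after queueing itself and re-check its condition after being woken; the toy only tracks the state changes.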

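2.2 Scheduling real-time processes

Real-time processes are scheduled in a different way, and the implementation is much simpler than CFS. In kernel/sched_rt.c, there are two real-time policies: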

SCHED_FIFO
SCHED_RR

SCHED_RR is SCHED_FIFO with timeslices.

struct task_struct {
    ...
    struct sched_rt_entity rt;
    ...
}
