Linux Process Scheduling and Preemption

Date: 2019-11-14

Part 1. Introduction to Linux kernel preemption

1. Necessary conditions for preemption

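a. The preemption count preempt_count must be 0. A non-zero value means that some code has called a function that disables preemption, such as one of the spin_lock functions.
b. Interrupts must be enabled, because preemption relies on interrupts.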

The source of preempt_schedule() is shown below for reference:

asmlinkage __visible void __sched notrace preempt_schedule(void)
{
    /*
     * If there is a non-zero preempt_count or interrupts are disabled,
     * we do not want to preempt the current task. Just return..
     */
    /* preempt_disable() increments preempt_count */
    if (likely(!preemptible()))
        return;
    preempt_schedule_common();
}
#define preemptible()    (preempt_count() == 0 && !irqs_disabled())
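
The two conditions can be modelled in user space. The following is a minimal sketch, not kernel code: struct cpu_model and the model_* functions are hypothetical stand-ins for the per-CPU preemption count and interrupt flag, introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for per-CPU state; not a kernel type. */
struct cpu_model {
    int  preempt_count;   /* models preempt_count() */
    bool irqs_disabled;   /* models irqs_disabled() */
};

/* Mirrors the logic of the kernel's preemptible() macro. */
static bool model_preemptible(const struct cpu_model *cpu)
{
    return cpu->preempt_count == 0 && !cpu->irqs_disabled;
}

/*
 * Mirrors the early-return structure of preempt_schedule():
 * returns true only when a reschedule would actually be attempted.
 */
static bool model_preempt_schedule(const struct cpu_model *cpu)
{
    if (!model_preemptible(cpu))
        return false;
    return true; /* the kernel would call preempt_schedule_common() here */
}
```

Called from a main(), model_preempt_schedule() returns true only for a CPU whose count is 0 and whose interrupts are enabled; either condition failing alone is enough to block preemption.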

2. The spin_lock family of functions

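a. spin_lock() calls preempt_disable() to disable preemption.
b. spin_lock_irq() disables local interrupts with local_irq_disable() and then does the work of spin_lock().
c. spin_lock_irqsave() saves the CPU's current interrupt-mask state and disables local interrupts with local_irq_save(), and then does the work of spin_lock().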

spin_lock():

/*include/linux/spinlock.h*/
static __always_inline void spin_lock(spinlock_t *lock)
{
    raw_spin_lock(&lock->rlock);
}
#define raw_spin_lock(lock)    _raw_spin_lock(lock)

/*kernel/locking/spinlock.c*/
void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
    __raw_spin_lock(lock);
}

/*include/linux/spinlock_api_smp.h*/
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    preempt_disable(); /* disable preemption */
    spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
    LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

spin_unlock():

/*kernel/locking/spinlock.c*/
void __lockfunc _raw_spin_unlock(raw_spinlock_t *lock)
{
    __raw_spin_unlock(lock);
}

/*include/linux/spinlock_api_smp.h*/
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
    spin_release(&lock->dep_map, 1, _RET_IP_);
    do_raw_spin_unlock(lock);
    preempt_enable();
}

preempt_enable():

/*include/linux/preempt.h*/
#define preempt_enable() \
do { \
    barrier(); \
    if (unlikely(preempt_count_dec_and_test())) \
        /* preemption point: __preempt_schedule() lets a higher-priority process preempt right here */ \
        __preempt_schedule(); \
} while (0)

As the code above shows, the spin_unlock() family can trigger kernel preemption directly, because it contains a preemption point.
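
The counting behaviour can be sketched in user space. This is a simplified model under stated assumptions: struct preempt_model and the model_* functions are hypothetical, and the pending-reschedule flag is kept as a separate field, whereas the kernel's preempt_count_dec_and_test() combines both checks.

```c
#include <stdbool.h>

/* Hypothetical model of one CPU; the names are illustrative only. */
struct preempt_model {
    int  count;          /* nesting depth of preempt_disable() */
    bool need_resched;   /* a higher-priority task is waiting */
    int  preemptions;    /* how many times the preemption point fired */
};

static void model_preempt_disable(struct preempt_model *m)
{
    m->count++;
}

/*
 * Decrement the count; fire the preemption point only when the
 * outermost section ends and a reschedule is pending.
 */
static void model_preempt_enable(struct preempt_model *m)
{
    m->count--;
    if (m->count == 0 && m->need_resched) {
        m->need_resched = false;
        m->preemptions++;   /* stands in for __preempt_schedule() */
    }
}

/* spin_lock()/spin_unlock() reduced to their effect on preemption. */
static void model_spin_lock(struct preempt_model *m)
{
    model_preempt_disable(m);
}

static void model_spin_unlock(struct preempt_model *m)
{
    model_preempt_enable(m);
}
```

With two nested locks held and a reschedule pending, the first model_spin_unlock() only lowers the count; the preemption point fires on the second, outermost unlock.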

3. The difference between preempt_disable() and local_irq_disable()

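From the necessary conditions for preemption it follows that both functions prevent preemption. The difference is not in the disabling functions but in the matching enabling functions, preempt_enable() and local_irq_enable(). preempt_enable() re-enables preemption and also provides a preemption point, whereas local_irq_enable() merely re-enables interrupts (which makes preemption possible again) and provides no preemption point.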

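4. Preemption points include: return from the timer tick interrupt handler, return from an interrupt, the end of softirq processing, yield() (called by a process to give up the CPU), and others.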

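5. Note that the spin_lock family disables preemption but does not disable scheduling! Code that holds a spinlock can still call schedule() explicitly, which is a bug.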

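6. Sleeping is not allowed in atomic context. The kernel option CONFIG_DEBUG_ATOMIC_SLEEP can be enabled; at run time, whenever a possible sleep in atomic context is detected, a stack backtrace is printed.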

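7. The priority of a normal process is expressed by its nice value, which ranges from -20 (highest priority) to 19 (lowest).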

Part 2. Process scheduling

1. The 4.14.35 kernel contains only the following sched_class instances:

fair_sched_class: .next = idle_sched_class
rt_sched_class  : .next = fair_sched_class
dl_sched_class  : .next = rt_sched_class
idle_sched_class: .next = NULL
stop_sched_class: .next = dl_sched_class

All scheduling classes form a singly linked list:

stop_sched_class --> dl_sched_class --> rt_sched_class --> fair_sched_class --> idle_sched_class --> NULL

#ifdef CONFIG_SMP
#define sched_class_highest (&stop_sched_class)
#else
#define sched_class_highest (&dl_sched_class)
#endif
#define for_each_class(class)   for (class = sched_class_highest; class; class = class->next)
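
How such a list is walked can be sketched in user space. This is an illustrative model only: struct class_model and its nr_runnable counter are hypothetical. The kernel instead calls each class's pick_next_task method and moves on when it returns no task.

```c
#include <stddef.h>

/*
 * Simplified stand-in for struct sched_class: only the link and a
 * hypothetical runnable-task counter are kept.
 */
struct class_model {
    const char *name;
    int nr_runnable;
    const struct class_model *next;
};

/*
 * Walk from the highest-priority class, as for_each_class() does, and
 * return the first class that has something to run (NULL if none).
 */
static const struct class_model *
first_runnable_class(const struct class_model *highest)
{
    const struct class_model *c;

    for (c = highest; c; c = c->next)
        if (c->nr_runnable > 0)
            return c;
    return NULL;
}
```

Linking five instances in the order stop, dl, rt, fair, idle and passing the stop instance as the head reproduces the priority order of the list shown above.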

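SCHED_NORMAL: ordinary time-sharing processes, handled by the fair_sched_class scheduling class.

SCHED_FIFO: first-in, first-out real-time processes, handled by the rt_sched_class scheduling class.
When the scheduler assigns the CPU to the process, it leaves the process descriptor at its current position in the run-queue list. A process using this policy keeps running once it has the CPU. As long as no higher-priority real-time process is runnable, it keeps the CPU for as long as it wants, even if other real-time processes of the same priority are runnable.

SCHED_RR: round-robin real-time processes with time slices, handled by the rt_sched_class scheduling class.
When the scheduler assigns the CPU to the process, it puts the process descriptor at the end of the run-queue list. This policy guarantees a fair share of CPU time to all SCHED_RR real-time processes of the same priority.

SCHED_BATCH: a variant of SCHED_NORMAL, handled by the fair_sched_class scheduling class.
It uses time-sharing and allocates CPU according to dynamic priority. When real-time processes are present, they are scheduled first. The policy is tuned for throughput: apart from not preempting other tasks, it behaves like a normal process, so tasks run longer and make better use of the cache, which suits batch work.

SCHED_IDLE: the lowest priority; such tasks run when the system is otherwise idle. Tasks with this policy are handled by fair_sched_class with a very small weight (see the CFS section below). The idle_sched_class scheduling class is separate and is used by process 0, the idle task.

SCHED_DEADLINE: a newly supported real-time scheduling policy, handled by the dl_sched_class scheduling class.
It targets bursty computations that are sensitive to latency and completion time, and is based on EDF (earliest deadline first).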
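
The difference between SCHED_FIFO and SCHED_RR described above shows up at the moment a round-robin time slice would run out. The sketch below models one priority level as a small array; struct level_model, enum policy_model and slice_expired() are hypothetical names, and the kernel itself uses per-priority linked lists rather than an array.

```c
#include <stddef.h>

enum policy_model { MODEL_FIFO, MODEL_RR };

#define QUEUE_LEN 3

/*
 * One priority level of a run queue: task ids in run order, head at [0].
 * A hypothetical model, assuming all tasks on the level share one policy.
 */
struct level_model {
    int tasks[QUEUE_LEN];
};

/*
 * Called when the running task (the head) has used up a time slice.
 * SCHED_RR: rotate the head to the tail.
 * SCHED_FIFO: leave the queue untouched, so the head keeps the CPU.
 */
static void slice_expired(struct level_model *q, enum policy_model policy)
{
    int head;
    size_t i;

    if (policy != MODEL_RR)
        return;
    head = q->tasks[0];
    for (i = 0; i + 1 < QUEUE_LEN; i++)
        q->tasks[i] = q->tasks[i + 1];
    q->tasks[QUEUE_LEN - 1] = head;
}
```

Starting from the order 1, 2, 3, one call with MODEL_RR yields 2, 3, 1, while a call with MODEL_FIFO leaves the order unchanged.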

2. The scheduling class: struct sched_class

struct sched_class {
    const struct sched_class *next;

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task) (struct rq *rq);
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

    /*
     * It is the responsibility of the pick_next_task() method that will
     * return the next task to call put_prev_task() on the @prev task or
     * something equivalent.
     *
     * May return RETRY_TASK when it finds a higher prio class has runnable tasks.
     */
    struct task_struct * (*pick_next_task) (struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
    void (*migrate_task_rq)(struct task_struct *p);

    void (*task_woken)(struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p, const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);
    void (*task_dead) (struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serliazed by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task, int oldprio);

    unsigned int (*get_rr_interval) (struct rq *rq, struct task_struct *task);

    void (*update_curr) (struct rq *rq);

#define TASK_SET_GROUP  0
#define TASK_MOVE_GROUP    1

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_change_group) (struct task_struct *p, int type);
#endif
};

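next: points to the next scheduling class. It is used in pick_next_task, check_preempt_curr, set_rq_online and set_rq_offline to walk all scheduling classes and select one according to class priority.
The priority order is: stop_sched_class --> dl_sched_class --> rt_sched_class --> fair_sched_class --> idle_sched_class
enqueue_task: adds a task to the scheduling class
dequeue_task: removes a task from the scheduling class
yield_task/yield_to_task: voluntarily give up the CPU
check_preempt_curr: checks whether the current process should be preempted
pick_next_task: selects the next process to run from the scheduling class
put_prev_task: puts the process back into the scheduling class
select_task_rq: selects a suitable CPU run queue for the process
migrate_task_rq: migrates the process to another CPU's run queue
pre_schedule: called before scheduling (not present in the struct shown above)
post_schedule: notifies the scheduler that the switch is complete (not present in the struct shown above)
task_woken: used when a process is woken up
set_cpus_allowed: changes the CPU affinity of the process
rq_online: brings the run queue online
rq_offline: takes the run queue offline
set_curr_task: called when a process changes its scheduling class or its task group
task_tick: called from the timer tick; it may lead to a process switch and so drives preemption of the running task
task_fork: called when a process is created; initialization differs between scheduling policies
task_dead: called when a process exits
switched_from/switched_to: used when a process changes scheduling class
prio_changed: called when the priority of a process changes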

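3. How scheduling is triggered

Scheduling is triggered mainly in two ways:

(1) The local timer interrupt calls scheduler_tick(), which in turn calls the task_tick method of the scheduling class of the currently running process.
(2) schedule() is called explicitly.
Either way, execution eventually reaches __schedule(), which calls pick_next_task. The test (rq->nr_running == rq->cfs.h_nr_running) determines whether all processes on the current run queue belong to the CFS scheduler; if so, the CFS scheduling class is called directly (the kernel wraps this test in likely(), which indicates that the condition holds in most cases). If not all runnable processes are in CFS, the classes are walked in the priority order stop_sched_class --> dl_sched_class --> rt_sched_class --> fair_sched_class --> idle_sched_class to select the next process to run, and then the task switch is performed.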
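
The fast-path decision can be sketched as follows. This is a hedged model, not the kernel function: struct rq_model and choose_pick_path() are hypothetical, and only the counter comparison quoted above is reproduced, leaving out any other conditions the real pick_next_task may check.

```c
/* Hypothetical subset of struct rq: just the two counters compared. */
struct rq_model {
    unsigned int nr_running;        /* all runnable tasks on this CPU */
    unsigned int cfs_h_nr_running;  /* runnable tasks managed by CFS */
};

enum pick_path { PICK_FAIR_DIRECT, PICK_WALK_CLASSES };

/*
 * If every runnable task is a CFS task, go straight to the fair class;
 * otherwise fall back to walking the classes in priority order.
 */
static enum pick_path choose_pick_path(const struct rq_model *rq)
{
    if (rq->nr_running == rq->cfs_h_nr_running)
        return PICK_FAIR_DIRECT;
    return PICK_WALK_CLASSES;
}
```

A run queue with three runnable tasks, all in CFS, takes the direct path; if one of the three is a real-time task, the counters differ and the class walk is used.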

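4. When scheduling happens

Only processes in the TASK_RUNNING state are selected by the scheduler; processes in other states are not considered. Scheduling happens at the following times:
a. when cond_resched() is called
b. when schedule() is called explicitly
c. on return from interrupt context
When kernel preemption is enabled, additional scheduling opportunities appear:
d. when preempt_enable() is called in system-call or interrupt context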

5. Implementation of __schedule()
TODO: analyze it

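6. CFS (Completely Fair Scheduler) scheduling

This code lives in linux/kernel/sched/fair.c, which defines const struct sched_class fair_sched_class, the object that defines the CFS scheduling class. It contains essentially the whole implementation of CFS scheduling.

CFS implements three scheduling policies:
SCHED_NORMAL: the policy used by regular tasks.
SCHED_BATCH: does not preempt nearly as often as regular tasks would, so tasks run longer and make better use of caches at the cost of interactivity. This policy suits batch jobs.
SCHED_IDLE: even weaker than nice 19, but it is not a true idle-time scheduler; this avoids priority-inversion problems that would deadlock the machine.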

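The CFS scheduling class fair_sched_class:
enqueue_task(): called when a task becomes runnable. It puts the task's scheduling entity into the red-black tree and increments the nr_running variable.
dequeue_task(): called when a task is no longer runnable. It takes the task's scheduling entity out of the red-black tree and decrements the nr_running variable.
yield_task(): basically a dequeue followed by an enqueue, unless the compat_yield sysctl is turned on; in that case it places the scheduling entity at the rightmost end of the red-black tree.
check_preempt_curr(): checks whether a task that has become runnable should preempt the currently running task.
pick_next_task(): selects the most appropriate task to run next.
set_curr_task(): called when a task changes its scheduling class or its task group.
task_tick(): mostly called from the timer tick. It may lead to a process switch, and this is what drives run-time preemption.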

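/*
 * A scheduling entity (one node of the red-black tree). It represents either a
 * group of processes or a single process, and has its own run queue, a parent
 * pointer, and a pointer to the queue it is to be scheduled on.
 */
struct sched_entity {
    /* For load-balancing: */
    struct load_weight        load; /* weight; sched_prio_to_weight[] (prio_to_weight[] in older kernels) maps priority to weight */
    struct rb_node            run_node; /* this entity's node in the red-black tree */
    struct list_head        group_node; /* the process group the entity belongs to */
    unsigned int            on_rq;  /* whether the entity is on the red-black tree run queue */

    u64                exec_start; /* time at which it started running */
    u64                sum_exec_runtime;  /* total run time */
    /*
        Virtual run time, updated on the timer interrupt or when the task state changes.
        It keeps growing, at a rate inversely proportional to the load weight: the higher
        the load, the slower it grows, and the more likely the entity is to sit at the
        leftmost position of the red-black tree and be scheduled. Every timer interrupt
        modifies it; see calc_delta_fair() for details.
    */
    u64                vruntime;
    /* value of sum_exec_runtime when the process was switched onto the CPU */
    u64                prev_sum_exec_runtime;
    /* number of times processes in this entity were moved to other CPU groups */
    u64                nr_migrations;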

    struct sched_statistics        statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED
    int                depth;
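    /*
     Pointer to the parent scheduling entity. For a process it points to the scheduling
     entity of its run queue; for a process group it points to the scheduling entity of
     the enclosing process group. Set in set_task_rq().
    */
    struct sched_entity        *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq            *cfs_rq; /* the red-black tree run queue this entity is on */
    /* rq "owned" by this entity/group: */
    struct cfs_rq            *my_q; /* this entity's own run queue; NULL means a process, non-NULL means a scheduling group */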
#endif

#ifdef CONFIG_SMP
    /*
     * Per entity load average tracking.
     *
     * Put into separate cache line so it does not
     * collide with read-mostly values above.
     */
    struct sched_avg        avg ____cacheline_aligned_in_smp;
#endif
};

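load
The weight, which determines each entity's share of the total load on the queue. Computing load weights is one of the scheduler's main jobs, because the speed of the virtual clock that CFS depends on is ultimately derived from the load. The weight is converted from the priority and is the key input to the vruntime calculation.
run_node
The entity's node in the red-black tree, which allows scheduling entities to be sorted on the tree.
sum_exec_runtime
Records the CPU time consumed by the program while running, for use by the completely fair scheduler (CFS).
on_rq
Whether the scheduling entity is on the run queue, that is, whether it is on the CFS run queue. One point needs to be clear: a CFS run queue contains a red-black tree, but the tree is not the whole run queue, because the tree is only the structure used to select the next entity to schedule. A simple example: while an ordinary program is running it is not in the red-black tree, but it is still on the CFS run queue and its on_rq is true. on_rq is false only for processes that are about to exit, are about to sleep and wait, or are being turned into real-time processes.
vruntime
The virtual run time, the key to scheduling. It is computed as: virtual run time of one scheduling interval = actual run time * (NICE_0_LOAD / weight). It therefore depends on the actual run time and on the weight, and the red-black tree is sorted by it. The vruntime of a higher-priority process grows more slowly while it runs, so the process can run relatively longer and is more likely to be the leftmost node of the tree; the scheduler always picks the leftmost node as the next process to run. Note that the value increases monotonically: at every scheduler timer interrupt the virtual run time of the current process is increased. Put simply, processes compete for the smallest vruntime, and the smallest one gets scheduled.
cfs_rq
The CFS run queue this scheduling entity is on.
my_q
If this scheduling entity represents a process group, it owns a CFS run queue of its own, which holds the processes of that group. Those processes are not contained in the red-black tree of any other CFS run queue (not even the top-level tree; they belong only to the tree of this group).
sum_exec_runtime
Run time is tracked by continuous accumulation in update_curr. The kernel calls this function in many places, for example when a new process is added to the run queue or from the periodic scheduler. On each call it computes the difference between the current time and exec_start, updates exec_start to the current time, and adds the difference to sum_exec_runtime.
The amount of time that elapses on the virtual clock while the process executes is accumulated in vruntime.
When the process is taken off the CPU, its current sum_exec_runtime is saved in prev_sum_exec_runtime; this value is needed later when deciding on preemption. Note that although prev_sum_exec_runtime holds a copy of sum_exec_runtime, sum_exec_runtime itself is never reset and keeps growing monotonically.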
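
The accounting described above, together with the vruntime formula, can be sketched in user space. This is a minimal illustration under stated assumptions: struct entity_model and the model_* functions are hypothetical, times are in nanoseconds, and plain integer division replaces the fixed-point arithmetic the kernel uses in calc_delta_fair().

```c
#include <stdint.h>

#define MODEL_NICE_0_LOAD 1024ULL  /* weight of a nice-0 task */

/* Hypothetical cut-down sched_entity with only the accounting fields. */
struct entity_model {
    uint64_t weight;            /* load weight derived from the nice value */
    uint64_t exec_start;        /* timestamp of the last update, ns */
    uint64_t sum_exec_runtime;  /* total real CPU time, ns */
    uint64_t vruntime;          /* weighted virtual time, ns */
};

/* delta_vruntime = delta_exec * NICE_0_LOAD / weight */
static uint64_t model_calc_delta_fair(uint64_t delta_exec, uint64_t weight)
{
    return delta_exec * MODEL_NICE_0_LOAD / weight;
}

/* Account the time run since exec_start, in the spirit of update_curr(). */
static void model_update_curr(struct entity_model *se, uint64_t now)
{
    uint64_t delta_exec = now - se->exec_start;

    se->exec_start = now;
    se->sum_exec_runtime += delta_exec;
    se->vruntime += model_calc_delta_fair(delta_exec, se->weight);
}
```

For the same 3,121,000 ns of real run time, an entity of weight 1024 (nice 0) advances its vruntime by 3,121,000, while one of weight 3121 (the kernel's table value for nice -5) advances by only 1,024,000, so the higher-priority entity stays further left in the tree.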

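The task_struct of every process embeds a sched_entity object, so a process is a schedulable entity. A schedulable entity is not necessarily a process, however; it may also be a process group.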

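7. CFS scheduling summary:

The tick interrupt mainly updates the scheduling information and then adjusts the position of the current process in the red-black tree. If, after the adjustment, the current process is no longer the leftmost node, the need_resched flag is set and schedule() is called on return from the interrupt to perform the switch; otherwise the current process keeps the CPU. This shows that CFS has abandoned the traditional notion of a time slice: the tick interrupt only needs to update the red-black tree.

The key of the red-black tree is vruntime, which is updated by calling update_curr. It is a 64-bit variable that keeps increasing. __enqueue_entity uses vruntime as the key when inserting the entity being enqueued into the red-black tree. __pick_first_entity returns the leftmost entity of the tree, that is, the one with the smallest vruntime.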
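
The tick logic summarized above can be sketched as follows. Note the substitution: a linear scan over an array stands in for the red-black tree, so only the "smallest vruntime wins" rule is modelled, not the tree itself. struct cfs_entity_model and the model_* functions are hypothetical, and the sketch assumes at least one entity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical runnable entity; only the sort key is modelled. */
struct cfs_entity_model {
    uint64_t vruntime;
};

/* Linear scan standing in for "leftmost node of the red-black tree". */
static size_t model_pick_leftmost(const struct cfs_entity_model *e, size_t n)
{
    size_t i, best = 0;

    for (i = 1; i < n; i++)
        if (e[i].vruntime < e[best].vruntime)
            best = i;
    return best;
}

/*
 * One tick: charge the running entity, then report whether another
 * entity now has the smallest vruntime (non-zero means a reschedule
 * would be requested).
 */
static int model_tick(struct cfs_entity_model *e, size_t n,
                      size_t curr, uint64_t delta_vruntime)
{
    e[curr].vruntime += delta_vruntime;
    return model_pick_leftmost(e, n) != curr;
}
```

With entities at vruntime 100, 150 and 300 and entity 0 running, a tick that adds 40 leaves it in front (140 < 150); a second such tick pushes it to 180, and entity 1 becomes the leftmost.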

Recommended reading:

Inside the Linux 2.6 Completely Fair Scheduler: https://www.ibm.com/developerworks/cn/linux/l-completely-fair-scheduler/index.html