From a CFS Scheduling Case Study to the Root Cause of Linux System Stutter

Linux is a system that feels laggy. Before you flame me, hear me out:

  • The stutter stems from the fact that the Linux kernel's scheduler has never cared about business scenarios!

The Linux kernel sees only the machine and refuses to see the applications. It prefers to improve throughput bottom-up, from the CPU's point of view, rather than improve user experience top-down, from the business's point of view.

Personified, Linux is a good programmer, but not a good manager.

Everything has its origins: Linux was started by one programmer and hacked on by a crowd of programmers, with almost no suit-wearing managers involved.

What programmers talk about all day is performance, time complexity, cache utilization, CPU, memory; managers, on the other hand, shout customers, customers, customers, experience, experience, experience!


The night before last I got back from work quite late, and a deputy manager surnamed Liu came to me with a question. He was debugging a message queue component involving producers, consumers, and several other cooperating threads — very complex. After deploying it to production, he noticed something strange:

  • The component's threads seemed to monopolize the system's resources; the message queue itself was extremely efficient, but the rest of the system stuttered badly!

I asked whether he had any cgroup or cpuset configuration deployed; he said no.

I asked how many threads the component had in total; he said not many, fewer than 20.

I kept asking... and he kept answering...

I found it odd. I told Deputy Manager Liu to let me log into the machine and debug it myself, but he would not agree; all he could do was give me as many details as possible while I assisted remotely.

I don't understand message queues, nor any middleware for that matter; debugging any such system is beyond me, regrettably. All I could do was try to explain and solve the problem from the operating system's angle. Quite obviously, this was an operating system problem — and more precisely, an OS scheduling problem.

It involves producers and consumers; if I could reproduce the problem locally, then I could certainly solve it.

So I started Googling the Linux scheduler subsystem for producer/consumer material: producer, consumer, schedule, cfs, linux kernel…

I mulled over the problem for the whole of the next day; with no environment to reproduce it in, thinking about it in spare moments was all I could do.

After dinner, I found the following patch:
https://git.bricked.de/Bricked/flo/commit/4793241be408b3926ee00c704d7da3b3faf3a05f

Impact: improve/change/fix wakeup-buddy scheduling

Currently we only have a forward looking buddy, that is, we prefer to
schedule to the task we last woke up, under the presumption that its
going to consume the data we just produced, and therefore will have
cache hot benefits.

This allows co-waking producer/consumer task pairs to run ahead of the
pack for a little while, keeping their cache warm. Without this, we
would interleave all pairs, utterly trashing the cache.

This patch introduces a backward looking buddy, that is, suppose that
in the above scenario, the consumer preempts the producer before it
can go to sleep, we will therefore miss the wakeup from consumer to
producer (its already running, after all), breaking the cycle and
reverting to the cache-trashing interleaved schedule pattern.

The backward buddy will try to schedule back to the task that woke us
up in case the forward buddy is not available, under the assumption
that the last task will be the one with the most cache hot task around
barring current.

This will basically allow a task to continue after it got preempted.

In order to avoid starvation, we allow either buddy to get wakeup_gran
ahead of the pack.

This seemed related to the CFS scheduler's LAST_BUDDY feature, which involves the run queue's next and last pointers. Having finally dug this up, I decided to design the simplest possible experiment, to try to reproduce the problem and probe the LAST_BUDDY feature.

My experiment is extremely simple:

  • A producer wakes up a consumer in a loop.

To make the wakeup as direct as possible, I wanted to call wake_up_process directly rather than go through complicated mechanisms like signals, so I needed kernel support. First, a character device to back this operation:

// wakedev.c
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/uaccess.h>
#include <linux/err.h>

#define CMD_WAKE 122

dev_t dev = 0;
static struct class *dev_class;
static struct cdev wake_cdev;

static long _ioctl(struct file *file, unsigned int cmd, unsigned long arg);

static struct file_operations fops = {
	.owner          = THIS_MODULE,
	.unlocked_ioctl = _ioctl,
};

static long _ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	u32 pid = 0;
	struct task_struct *task = NULL;

	switch (cmd) {
		// the ioctl wake command: fetch the target pid from userspace
		case CMD_WAKE:
			if (copy_from_user(&pid, (u32 *) arg, sizeof(pid)))
				return -EFAULT;
			task = pid_task(find_vpid(pid), PIDTYPE_PID);
			if (task) {
				wake_up_process(task);
			}
			break;
	}
	return 0;
}

static int __init crosswake_init(void)
{
	if ((alloc_chrdev_region(&dev, 0, 1, "test_dev")) < 0) {
		printk("alloc failed\n");
		return -1;
	}
	printk("major = %d minor = %d\n", MAJOR(dev), MINOR(dev));

	cdev_init(&wake_cdev, &fops);

	if ((cdev_add(&wake_cdev, dev, 1)) < 0) {
		printk("add failed\n");
		goto err_class;
	}

	// class_create()/device_create() return ERR_PTR() on failure (never NULL),
	// so check them with IS_ERR() instead of comparing against NULL.
	dev_class = class_create(THIS_MODULE, "etx_class");
	if (IS_ERR(dev_class)) {
		printk("class failed\n");
		goto err_class;
	}

	if (IS_ERR(device_create(dev_class, NULL, dev, NULL, "etx_device"))) {
		printk(KERN_INFO "create failed\n");
		goto err_device;
	}

	return 0;

err_device:
	class_destroy(dev_class);
err_class:
	unregister_chrdev_region(dev,1);
	return -1;
}

void __exit crosswake_exit(void)
{
	device_destroy(dev_class,dev);
	class_destroy(dev_class);
	cdev_del(&wake_cdev);
	unregister_chrdev_region(dev, 1);
}

module_init(crosswake_init);
module_exit(crosswake_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("shabi");

Compile it, load it, and create the device node:

[root@localhost test]# insmod ./wakedev.ko
[root@localhost test]# dmesg |grep major.*minor
[   68.385310] major = 248 minor = 0
[root@localhost test]# mknod /dev/test c 248 0
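
If you would rather not fish the numbers out of dmesg, the assigned major can also be read from /proc/devices, under the name the region was registered with (test_dev); it should show something like:

[root@localhost test]# grep test_dev /proc/devices
248 test_dev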

OK. Next, the producer program:

// producer.c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;
	volatile int j; // volatile so the busy loop survives optimization
	int pid = -1;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <consumer-pid>\n", argv[0]);
		return 1;
	}

	pid = atoi(argv[1]);
	fd = open("/dev/test", O_RDWR);
	perror("open");

	while (1) {
		j = 0xfffff;
		ioctl(fd, 122, &pid); // 122 == CMD_WAKE: wake the consumer process
		while (j--) {}        // burn CPU to simulate the production work!
	}
}

Next, the consumer program. It simply blocks in sleep(); each wake_up_process() from the driver briefly yanks it out of that interruptible sleep:

// consumer.c
#include <stdio.h>
#include <unistd.h>

int main()
{
	while (1) {
		sleep(1); // stay blocked; the producer's ioctl wakes us via wake_up_process()
	}
}

Compile and run them:

# The long names are there to stand out in the logs!
[root@localhost test]# gcc consumer.c -O0 -o consumerAconsumerACconsumerAconsumer
[root@localhost test]# gcc producer.c -O0 -o producerAproducerAproducerAproducer
# start the consumer
[root@localhost test]# ./consumerAconsumerACconsumerAconsumer &
# start the producer, pointing it at the consumer's pid
[root@localhost test]# ./producerAproducerAproducerAproducer 26274

The experiment ran more or less as above. I tried many times and observed no stutter.

Rather disappointing, but I suppose it was inevitable: if the problem were this easy to reproduce, the community would have fixed it long ago. Evidently there is no glaring bug waiting to be fixed here; perhaps the system was merely being used improperly.

Regardless, let's look at the data first and pick the details out of it.

First, with the experiment running, use perf to dump the process-switch activity:

[root@localhost test]# perf record -e sched:sched_switch -e sched:sched_wakeup -a -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.303 MB perf.data (1303 samples) ]
[root@localhost test]# perf script -i perf.data > final.data

Below is a fragment — the one I managed to catch in roughly ten capture attempts:
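
(A key for reading these lines: each record is task pid [cpu] timestamp: event: details, and the number in square brackets after a task name is its kernel priority; 120 corresponds to nice 0, while the producer's 130 corresponds to nice +10, i.e. it runs at a lower priority than everyone else here.)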

loop_sleep  6642 [000] 23085.198476: sched:sched_switch: loop_sleep:6642 [120] R ==> xfsaild/dm-0:400 [120]
    xfsaild/dm-0   400 [000] 23085.198482: sched:sched_switch: xfsaild/dm-0:400 [120] S ==> loop_sleep:6642 [120]
      loop_sleep  6642 [000] 23085.217752: sched:sched_switch: loop_sleep:6642 [120] R ==> producerAproduc:26285 [130]
## the screen-hogging starts here
 producerAproduc 26285 [000] 23085.220257: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.220259: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.220273: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.269921: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.269923: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.269927: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.292748: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.292749: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.292752: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.320205: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.320208: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.320212: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.340971: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.340973: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.340977: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.369630: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.369632: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.369637: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.400818: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.400821: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.400825: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.426043: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.426045: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.426048: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.447646: sched:sched_wakeup: xfsaild/dm-0:400 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.447649: sched:sched_switch: producerAproduc:26285 [130] R ==> xfsaild/dm-0:400 [120]
## It has simply been too long; the R-state producer finally gives way!
## Only here does it end!!!
    xfsaild/dm-0   400 [000] 23085.447654: sched:sched_switch: xfsaild/dm-0:400 [120] S ==> loop_sleep:6642 [120]
      loop_sleep  6642 [000] 23085.468047: sched:sched_switch: loop_sleep:6642 [120] R ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.469862: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.469863: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.469867: sched:sched_switch: consumerAconsum:26274 [120] S ==> loop_sleep:6642 [120]
      loop_sleep  6642 [000] 23085.488800: sched:sched_switch: loop_sleep:6642 [120] R ==> producerAproduc:26285 [130]

So the producer and consumer really do team up to hog the CPU; it is just fairly rare.

What on earth glues these two processes so tightly together? Now that is an interesting question.

The CFS scheduler has not honored its promise of "removing tricks and staying simple"; more and more "heuristic features" have been piled in, retreading the old path of the O(1) scheduler!!

We can see these trick-style features in /sys/kernel/debug/sched_features:

[root@localhost test]# cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_POWER TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN NO_NUMA NUMA_FAVOUR_HIGHER NO_NUMA_RESIST_LOWER

Go through them one by one against the documentation if you like. Here we only care about LAST_BUDDY.

Reviewing the relevant code, we find the following fragment in the wakeup path:

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	...
	if (wakeup_preempt_entity(se, pse) == 1) {
		/*
		 * Bias pick_next to pick the sched entity that is
		 * triggering this preemption.
		 */
		if (!next_buddy_marked)
			set_next_buddy(pse);
		goto preempt;
	}
	...
preempt:
	resched_task(curr);
	/*
	 * Only set the backward buddy when the current task is still
	 * on the rq. This can happen when a wakeup gets interleaved
	 * with schedule on the ->pre_schedule() or idle_balance()
	 * point, either of which can drop the rq lock.
	 *
	 * Also, during early boot the idle thread is in the fair class,
	 * for obvious reasons its a bad idea to schedule back to it.
	 */
	if (unlikely(!se->on_rq || curr == rq->idle))
		return;
	// Here is the key: the preempted current task is recorded as "last"!
	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
		set_last_buddy(se);
}

And the following fragment in the pick-next path:

/*
 * Reading these comments alone is enough!
 *
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	...
	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	/*
	 * Someone really wants this to run. If it's not unfair, run it.
	 */
	// next is checked after last, so next takes priority over last!
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	clear_buddies(cfs_rq, se);
}

Here is where it gets suspicious.

Let the producer be P and the consumer be C. Then:

  • P wakes C, so P is recorded as the last buddy.
  • C is put on the CPU.
  • C finishes its work and blocks.
  • The CPU reschedules: pick next.
  • last beats the leftmost node and wins.
    Although last's vruntime is larger than leftmost's, the gap is less than one wakeup granularity, so last is preferred!

That final point needs unpacking; let me explain via the comment on the kernel function wakeup_preempt_entity:
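The original screenshot is lost, so, quoting from memory, here is roughly what the function and its ASCII-art comment look like in kernels of that era (the exact text varies by version):

/*
 * Should 'se' preempt 'curr'.
 *
 *             |s1
 *        |s2
 *   |s3
 *         g
 *      |<--->|c
 *
 *  w(c, s1) = -1
 *  w(c, s2) =  0
 *  w(c, s3) =  1
 */
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	// curr's vruntime is no larger than se's: se has no claim, do not preempt
	if (vdiff <= 0)
		return -1;

	// se's vruntime is smaller by more than one wakeup granularity: preempt
	gran = wakeup_gran(curr, se);
	if (vdiff > gran)
		return 1;

	// se is ahead of curr, but by less than one granularity
	return 0;
}

So the '< 1' tests in pick_next_entity read: keep the buddy as long as its vruntime does not exceed the leftmost entity's by more than one wakeup granularity; only a return value of 1 disqualifies it.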

At this point, the solution is essentially in hand. Roughly:

  • Disable the LAST_BUDDY feature.

And that takes just one command:

[root@localhost test]# echo NO_LAST_BUDDY >/sys/kernel/debug/sched_features
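
To flip it back on later, write the name without the NO_ prefix: echo LAST_BUDDY >/sys/kernel/debug/sched_features.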

Having reported back to the deputy manager, my own curiosity was still unsatisfied; I wanted to see this through to the bottom.

To dig out how it happens, perf alone is not enough; systemtap has to step in.

The following script probes what the details look like when a process is woken up and when a scheduling switch happens:

#!/usr/bin/stap -g
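#
# A summary of the probes below:
#  - pick_next_task_fair / pick_next_entity: print the "last" buddy and the
#    leftmost (FIRST) entity, with their vruntimes and the delta between them;
#  - set_last_buddy / __clear_buddies_last: log when the last anchor is set and cleared;
#  - check_preempt_wakeup: print the waker's and the wakee's vruntimes at
#    wakeup-preemption time.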

global g_cfs_rq;

probe begin {
	g_cfs_rq = 0;
}

function container_of_entity:long(se:long)
{
	offset = &@cast(0, "struct task_struct")->se;
	return se - offset;
}

function container_to_entity:long(task:long)
{
	offset = &@cast(0, "struct task_struct")->se;
	return task +  offset;
}

function entity_to_rbnode:long(rb:long)
{
	offset = &@cast(0, "struct sched_entity")->run_node;
	return rb - offset;
}

function print_task(s:string, se:long, verbose:long, min_vruntime:long)
{
	my_q = @cast(se, "struct sched_entity")->my_q;
	if(my_q == 0) {
		t_se = container_of_entity(se);
		printf("%8s %p %s %d \n", s, t_se, task_execname(t_se), task_pid(t_se));
	}
}

probe kernel.function("pick_next_task_fair")
{
	printf("--------------- begin pick_next_task_fair --------------->\n");
	g_cfs_rq = &$rq->cfs;
}

probe kernel.function("pick_next_entity")
{
	if (g_cfs_rq == 0)
		next;

	printf("------- begin pick_next_entity ------->\n");
	cfsq = g_cfs_rq;
	vrun_first = 0;
	vrun_last = 0;
	last = @cast(cfsq, "struct cfs_rq")->last;
	if (last) {
		my_q = @cast(last, "struct sched_entity")->my_q;
		if(my_q != 0) {
			cfsq = @cast(last, "struct sched_entity")->my_q;
			last = @cast(cfsq, "struct cfs_rq")->last;
		}
		t_last = container_of_entity(last);
		vrun_last = @cast(last, "struct sched_entity")->vruntime;
		printf("LAST:[%s] vrun:%d\t", task_execname(t_last), vrun_last);
	}
	firstrb = @cast(cfsq, "struct cfs_rq")->rb_leftmost;
	if (firstrb) {
		firstse = entity_to_rbnode(firstrb);
		my_q = @cast(firstse, "struct sched_entity")->my_q;
		if(my_q != 0) {
			firstrb = @cast(my_q, "struct cfs_rq")->rb_leftmost;
			firstse = entity_to_rbnode(firstrb);
		}
		t_first = container_of_entity(firstse);
		vrun_first = @cast(firstse, "struct sched_entity")->vruntime;
		printf("FIRST:[%s] vrun:%d\t", task_execname(t_first), vrun_first);
	}
	if (last && firstrb) {
		printf("delta: %d\n", vrun_last - vrun_first);
	} else {
		printf("delta: N/A\n");
	}
	printf("<------- end pick_next_entity -------\n");
	printf("###################\n");
}

probe kernel.function("pick_next_task_fair").return
{
	if($return != 0) {
		se = &$return->se;
		t_se = container_of_entity(se);
		t_curr = task_current();
		printf("Return task: %s[%d] From current: %s[%d]\n", task_execname(t_se), task_pid(t_se), task_execname(t_curr), task_pid(t_curr));
	}

	printf("<--------------- end pick_next_task_fair ---------------\n");
	printf("###########################################################\n");
	g_cfs_rq = 0;
}

probe kernel.function("set_last_buddy")
{
	se_se = $se;
	print_task("=== set_last_buddy", se_se, 0, 0);
}

probe kernel.function("__clear_buddies_last")
{
	se_se = $se;
	print_task("=== __clear_buddies_last", se_se, 0, 0);
}

probe kernel.function("check_preempt_wakeup")
{
	printf("--------------- begin check_preempt_wakeup --------------->\n");
	_cfs_rq = &$rq->cfs;
	min_vruntime = @cast(_cfs_rq, "struct cfs_rq")->min_vruntime;
	ok = @cast(_cfs_rq, "struct cfs_rq")->nr_running - sched_nr_latency;
	t_curr = task_current();
	t_se = $p;
	se_curr = container_to_entity(t_curr);
	se_se = container_to_entity(t_se);
	vrun_curr = @cast(se_curr, "struct sched_entity")->vruntime;
	vrun_se = @cast(se_se, "struct sched_entity")->vruntime;

	printf("curr wake:[%s] woken:[%s]\t", task_execname(t_curr), task_execname(t_se));
	printf("UUUUU curr:%d se:%d min:%d\t", vrun_curr, vrun_se, min_vruntime);
	printf("VVVVV delta:%d %d\n", vrun_curr - vrun_se, ok);
}

probe kernel.function("check_preempt_wakeup").return
{
	printf("<--------------- end check_preempt_wakeup ---------------\n");
	printf("###########################################################\n");
}

Redo the experiment and run the script a few times; at last we capture a case. Look at the output below:

.....
--------------- begin check_preempt_wakeup --------------->
curr wake:[producerAproduc]   woken:[consumerAconsum]   UUUUU curr:17886790442766  se:17886787442766 min:20338095223270 VVVVV delta:3000000   1
=== set_last_buddy 0xffff8800367b4a40  producerAproduc 26285
<--------------- end check_preempt_wakeup ---------------
###########################################################
--------------- begin pick_next_task_fair --------------->
------- begin pick_next_entity ------->
LAST:[producerAproduc] vrun:17886790442766  FIRST:[consumerAconsum] vrun:17886787442766 delta: 3000000
<------- end pick_next_entity -------
###################
Return task: consumerAconsum[26274]  From current: producerAproduc[26285]
<--------------- end pick_next_task_fair ---------------
###########################################################
--------------- begin pick_next_task_fair --------------->
------- begin pick_next_entity ------->
#[Note this case!]
#[loop_sleep was about to go on CPU, but producerAproduc, set as last when it was preempted earlier, snatches the CPU first!!]
LAST:[producerAproduc] vrun:17886790442766  FIRST:[loop_sleep] vrun:17886790410519  delta: 32247
<------- end pick_next_entity -------
###################
=== __clear_buddies_last 0xffff8800367b4a40  producerAproduc 26285
#[loop_sleep is passed over; producerAproduc is chosen]
Return task: producerAproduc[26285]  From current: consumerAconsum[26274]
<--------------- end pick_next_task_fair ---------------
###########################################################
--------------- begin pick_next_task_fair --------------->
------- begin pick_next_entity ------->
FIRST:[loop_sleep] vrun:17886790410519  delta: N/A
<------- end pick_next_entity -------
###################
Return task: loop_sleep[4227]  From current: producerAproduc[26285]
<--------------- end pick_next_task_fair ---------------
.......
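
To put a number on it: in the middle capture, the last buddy producerAproduc leads FIRST loop_sleep by a delta of only 32247 ns. Assuming a stock configuration (sysctl_sched_wakeup_granularity defaults to 1 ms, scaled up on SMP and by task weight), 32 µs is far inside one granularity, so wakeup_preempt_entity returns 0, the '< 1' test passes, and last snatches the slot from the leftmost task.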

What this capture shows: the loop_sleep process, which by rights should have run, was beaten to the CPU — for the sake of cache affinity — by a process previously anchored as last. From the CPU's viewpoint this maximizes cache utilization; from the business's viewpoint, it is simply unreasonable.

Because the code and the timestamps are tightly coupled, it is hard to construct a scenario where LAST_BUDDY preemption persists. But finding even one case like the above basically proves the point: once a globally synchronized resonance sets in, the scene reappears — coupled processes hog the CPU and the system stutters.

Overall, the LAST_BUDDY feature introduces a powerful competitor against the leftmost node of the CFS red-black tree, and the leftmost node may well lose and be kept off the CPU. Doesn't that squander the very benefit CFS prides itself on — fast task selection from an exquisitely simple red-black tree?

A fine red-black tree, pecked full of holes, woodpecker-style, by one heuristic feature after another!

Remember how it went: it was precisely the pile-up of compensating and penalizing heuristic tricks in the O(1) scheduler that pushed the next-generation CFS onto the throne, by way of transitional designs such as RSDL (the Rotating Staircase Deadline Scheduler), none of which were as simple and direct as CFS. Now CFS itself is steadily growing complex and bloated. In effect, the CFS scheduler too has ended up as a bundle of trade-offs.

Next, a few words of appraisal.

Perhaps loop_sleep is a process desperate to run; it might be about to rescue the manager from fire and flood. An application can ignore CPU cache utilization, but it cannot ignore its own business logic, nor the consequences of delayed execution!

Of course, Linux as a general-purpose operating system has never claimed to be a real-time one. But even granting all that, a stuttering system is a dreadful experience for the user! Windows drains the battery and switches threads incessantly, yet it feels silky smooth to use. That is the difference, and the difference is more than a pair of leather shoes and a suit.

Look again at that big pile of features in /sys/kernel/debug/sched_features. Tuning parameters well is not a skill everyone possesses — the manager certainly doesn't. As for arranging and combining so many features into something moderate for the general case, I doubt even guru-level experts could manage it, all the more so because many feature combinations hold each other back.

Meanwhile, a single dynamic priority boosting scheduling algorithm could address all of these problems; as far as improving user experience goes, the Windows scheduling algorithm is rather good.

As for the LAST_BUDDY feature itself: before kernel 2.6.39-rc1 reordered the pick-next checks, it was outright buggy:

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_next_entity(cfs_rq);
	struct sched_entity *left = se;

	// next is checked first...
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	// ...and last is checked afterwards! So last can overwrite the already
	// chosen next, producing a severe ping-pong effect!
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	clear_buddies(cfs_rq, se);

	return se;
}

I shall laugh loudly, and sing.


Wenzhou, Zhejiang: leather shoes get wet; rain seeping in won't make them fat.
