Linux Process/Thread Creation、Linux Process Principle、sys_fork、sys_execve、glibc fork/execve api sour

Date: 2019-12-02


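Related study material

Linux Kernel Development, 3rd edition (Chinese edition, PDF), section 3.3
Professional Linux Kernel Architecture (Chinese edition, PDF)
Understanding the Linux Kernel, 3rd edition (Chinese edition, PDF)
獨闢蹊徑品內核: Linux內核源代碼導讀 (a guided reading of the Linux kernel source)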
http://www.yanyulin.info/pages/2013/11/linux0.html
http://blog.csdn.net/ddna/article/details/4958058
http://www.cnblogs.com/coolgestar02/archive/2010/12/31/1922629.html
http://blog.sina.com.cn/s/blog_4ba5b45e0102e3to.html
http://www.kernel.org/

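Contents

1. Basics of process creation on Linux/Unix
2. Linux process management
3. sys_fork()
4. sys_execve()
5. Copy-on-write (COW)
6. The seven process-creation APIs provided by glibc
7. Source analysis of the glibc execve and fork APIs
8. Tools for observing process startup
9. Thread creation on Linux
10. POSIX threads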

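1. Basics of process creation on Linux/Unix

0x1: How Linux and Windows differ in process creation

Process creation on Unix/Linux is quite different from Windows. Windows implements threads and processes in a conventional way: the Windows kernel has explicit notions of both, and the Windows API exposes dedicated calls, CreateProcess and CreateThread, to create them, along with a family of APIs to manipulate them. On Linux, by contrast, a thread is not a mandatory, distinct concept.

The Linux kernel has no separate thread abstraction. It calls every execution entity (process or thread) a "task", and each task is conceptually similar to a single-threaded process with its own memory space, execution context and file resources. Different tasks can, however, choose to share a memory space. In practice, the set of tasks that share one memory space forms a process, and those tasks are the threads of that process.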

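1. Windows
Windows creates a new process with CreateProcess(). The rough flow is:
    1) Allocate a brand-new memory space (kernel side and user side)
    2) Open the executable file on disk for the new process and copy its contents into the newly allocated memory
    3) Start the main thread, which begins executing at the new process's entry point (main by default)
In the Windows philosophy every new process gets a new, independent memory space, and processes are largely independent of one another.
The kernel objects do carry parent-process and child-process fields, but the relationship is weak: a Windows child process has no enforced dependency on its parent.
For the Windows process-creation path, see this other article:
http://www.cnblogs.com/LittleHann/p/3458736.html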

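2. Linux/Unix
Linux/Unix does not "spawn" processes the way Windows does.
Instead it splits process creation into two separate functions:
    1) fork()
    fork() creates a child process by "copying" the current process. At that point the child differs from the parent only in its PID, its PPID (parent PID), and certain resources and statistics
    2) exec()
    The exec() family reads an executable file, loads it into the address space and starts running it
Combining fork and exec gives the same end result as CreateProcess on Windows

Note that fork and exec do not have to be called one after the other. A program can call only fork, only exec, or fork followed by exec, and it can run extra code between the two calls. In principle fork and exec are two independent concepts.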
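
The fork-then-exec pattern described above can be sketched in user space as follows. This is a minimal illustration, not code from the original article; the helper name spawn_and_wait is hypothetical, and the sketch assumes a POSIX system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child, exec `path` in it with `argv`,
 * and return the child's exit status (or -1 on error). */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();              /* step 1: copy the current process */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child. Any code placed here runs between fork and exec. */
        execv(path, argv);           /* step 2: load and run the new program */
        _exit(127);                  /* reached only if execv failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_wait("/bin/sh", argv) with argv set to {"sh", "-c", "exit 3", NULL} would be expected to return 3, assuming /bin/sh exists on the system.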

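0x2: Process 0 and process 1 in Linux

1. Process 0
The first process, created during boot. Once the system is loaded it turns into the process responsible for scheduling, swapping and memory management. (After creating process 1, process 0 creates no further processes; from then on process 1 is responsible for creating new children.)
Process 1 is created by process 0. During boot the code has been running in kernel mode, whereas processes run in user mode, so getting process 0 running as a process involves a privilege-level change from level 0 to level 3. Early kernels such as Linux 0.11 do this by simulating a return from interrupt: the kernel pushes process 0's code segment selector and program counter (EIP) onto the kernel stack and then executes the iret instruction, which "returns" into process 0.

2. Process 1
The init process, created by process 0. It completes system initialization and is the ancestor of every other user process in the system.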

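2. Linux process management
0x1: The process concept

A process is a program in execution (its object code is stored on some storage medium). In the broad sense it includes:

1. The executable code (the text segment)
2. Open files
3. Pending signals
4. Internal kernel data structures
5. Processor state
6. One or more memory address spaces with memory mappings
7. One or more threads of execution
8. A data segment holding global variables
//A process is the "live result" of running program code; the kernel has to manage all of these details efficiently and transparently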

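0x2: Creating a process vs. creating a new program
When studying process creation on Linux it helps to separate "creating a process" from "starting a new program". In full, Linux has the following scenarios:

1. Copy the current process into a new process identical to the parent (more precisely, the new process is a copy of the parent)
In terms of system calls, a process comes into existence through fork() (or its relatives clone() and vfork()) and is alive from the moment that call creates it. fork() creates a new process by "copying" an existing one (every process on Linux is "copied" into being). The process that calls fork() is the parent and the new process is the child. When the call finishes, the parent resumes and the child starts, both at the same point in the code.
The fork() system call returns from the kernel twice: once in the parent and once in the newly created child
    1) Call fork()
    or
    2) Call clone()
/*
Like a cell dividing into two identical cells that both keep running
*/

2. Create a new process that runs new code: call fork, then call exec() to read and load the new program and continue running
Usually a new process is created in order to run new, different code right away. Calling one of the exec functions next creates a new "address space" and loads the new program into it. In modern Linux kernels, fork() is actually implemented on top of the clone() machinery
    1) fork()/clone() + exec()
/*
Like a cell dividing and then having a new nucleus placed in the copy, after which both cells keep running
*/

3. Run a new program in place: turn the current process directly into one running different code
    1) exec()
/*
Like a cell replacing its own nucleus with new proteins and carrying on
*/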

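0x3: The process descriptor and the task structure

The kernel keeps the list of processes in the "task list", a circular doubly linked list. Each entry is a struct task_struct, called the process descriptor, which holds all the information about one process, for example:

1. Open files
2. The process's address space
3. Pending signals
4. The process state
..

For details of the task_struct data structure, see this other article:

http://www.cnblogs.com/LittleHann/p/3865490.html
//search for: 0x1: struct task_struct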

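0x4: Ways to create a process on Linux

From the programmer's point of view, a process can be created on Linux in the following ways:

1. Through system calls
    1) fork()/clone(): copy the current process
    2) exec(): run a new program
    3) fork()/clone() + exec(): copy, then run a new program (parent and child run different code)
2. Through API functions provided by glibc
    1) The exec family: glibc wrappers around the execve system call
    2) The fork API

We start with the kernel-side fork and execve system calls, then move on to the user-space process-creation APIs that glibc provides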

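3. sys_fork()

A process created with fork is called the child process of the original parent process. From the user's point of view the child is an exact copy of the parent; the two differ only in PID. The fork system call returns from kernel mode twice, with these return values:

1. In the child: 0
2. In the parent: the PID of the child
//A program checks the return value of fork to tell whether it is running as the parent or the child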
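
As a small illustration of telling parent from child by the return value, consider the sketch below. The function name fork_return_values_ok and the exit code 42 are arbitrary choices made for this example, not anything taken from the kernel or glibc.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if fork() returned 0 in the child and the
 * child's PID in the parent, 0 if not, -1 if fork() itself failed. */
int fork_return_values_ok(void)
{
    pid_t ret = fork();
    if (ret < 0)
        return -1;

    if (ret == 0) {
        /* Child branch: this code runs only where fork() returned 0. */
        _exit(42);
    }

    /* Parent branch: ret is the PID of the child, so waiting on it must
     * reap exactly that child, which exits with the code chosen above. */
    int status;
    pid_t reaped = waitpid(ret, &status, 0);
    return reaped == ret && WIFEXITED(status) && WEXITSTATUS(status) == 42;
}
```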

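Taken as a whole, one fork call consists of the following steps:

1. Allocate and initialize a new task_struct for the child
    1) Copied from the parent: all privileges and limits inherited from the parent
        1.1) Process group and session information
        1.2) Signal state (the masks of ignored, caught and blocked signals)
        1.3) The nice scheduling parameter (kg_nice in the BSD kernels this outline is modeled on)
        1.4) A reference to the parent's credentials
        1.5) A reference to the parent's open files (the file descriptor table and the data structures behind it, through which the files can be operated on)
        1.6) A reference to the parent's resource limits
    2) Zeroed
        2.1) Recent CPU utilization
        2.2) Wait channel
        2.3) Swap and sleep time
        2.4) Timers
        2.5) Tracing
        2.6) Pending-signal information
    3) Explicitly initialized
        3.1) The entry in the list of all processes
        3.2) The entry in the parent's list of children, and the back pointer to the parent
        3.3) The entry in the parent's process-group list
        3.4) The entry in the hash structure that lets the process be looked up by PID
        3.5) A pointer to the process statistics structure, which lives in the user structure
        3.6) A pointer to the process signal-handling structure, which lives in the user structure
        3.7) The new PID of the process
2. Duplicate the parent's address space
When duplicating a process image the kernel calls into the memory-management layer. The name used in the BSD description this outline follows is vm_forkproc(); in Linux 2.6 the equivalent work is done by copy_mm() and copy_thread(), both analyzed below. The routine receives a pointer to the child's already-initialized task_struct and allocates all the resources the child needs in order to run. In the child it returns along a separate path that goes straight to user mode, while in the parent it returns along the normal path (one call, two returns)
The parent's context is copied to the child, including:
    1) The thread structure
    2) The parent's register state; after the fork system call finishes, parent and child continue from the same code location
    3) Virtual memory resources; only references are copied, using the copy-on-write mechanism
3. Schedule the child to run
Once the child is fully created it is placed on the run queue, so the scheduler knows about the new process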

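The fork system call is implemented in \linux-2.6.32.63\arch\x86\kernel\process.c

/*
sys_fork is implemented by calling do_fork(). fork, clone and vfork all go through do_fork() and differ only in the clone_flags they pass:
1. fork
2. clone
3. vfork
*/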
int sys_fork(struct pt_regs *regs)
{
    return do_fork(SIGCHLD, regs->sp, regs, 0, NULL, NULL);
}

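Next, follow the code of do_fork()

\linux-2.6.32.63\kernel\fork.c

/*
1. clone_flags: the low byte (CSIGNAL) holds the signal sent to the parent when the child exits, normally SIGCHLD; the remaining bits select which of the parent's resources the child shares
2. stack_start: start address of the child's user-mode stack. Callers of clone often pass 0, in which case the child keeps the parent's stack pointer (sys_fork passes regs->sp to the same effect). When either process later writes to the stack, the page-fault handler sets up a new physical page (copy-on-write)
3. regs: a pt_regs structure holding the register values saved on entry to kernel mode; the parent's register state is copied to the child in full
4. stack_size: 0 by default
5. parent_tidptr: user-space pointer; when CLONE_PARENT_SETTID is set, the kernel stores the new child's ID through parent_tidptr
6. child_tidptr: user-space pointer; when CLONE_CHILD_SETTID is set, the kernel stores the new child's ID through child_tidptr
*/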
long do_fork(unsigned long clone_flags, unsigned long stack_start, struct pt_regs *regs, unsigned long stack_size, int __user *parent_tidptr, int __user *child_tidptr)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * Do some preliminary argument and permissions checking before we actually start allocating stuff
    */
    if (clone_flags & CLONE_NEWUSER) 
    {
        if (clone_flags & CLONE_THREAD)
            return -EINVAL;
        /* hopefully this check will go away when userns support is
         * complete
         */
        if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) || !capable(CAP_SETGID))
            return -EPERM;
    }

    /*
    We hope to recycle these flags after 2.6.26
    Kept for backward compatibility: CLONE_STOPPED is deprecated and slated for removal after 2.6.26
    */
    if (unlikely(clone_flags & CLONE_STOPPED)) 
    {
        static int __read_mostly count = 100;

        if (count > 0 && printk_ratelimit()) 
        {
            char comm[TASK_COMM_LEN];

            count--;
            printk(KERN_INFO "fork(): process `%s' used deprecated clone flags 0x%lx\n", get_task_comm(comm, current), clone_flags & CLONE_STOPPED);
        }
    }

    /*
    When called from kernel_thread, don't do user tracing stuff.
    */
    if (likely(user_mode(regs)))
        trace = tracehook_prepare_clone(clone_flags);
    
    /*
    copy_process() is the core of do_fork() and does most of the work of creating a process:
    it allocates the child's task_struct and copies the parent's resources
    */
    p = copy_process(clone_flags, stack_start, regs, stack_size, child_tidptr, NULL, trace);
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) 
    {
        struct completion vfork;

        trace_sched_process_fork(current, p);
        /*
        /source/include/linux/sched.h
        /source/kernel/pid.c
        Get the child's PID as seen from the caller's PID namespace; different namespaces may contain processes with the same PID number
        */
        nr = task_pid_vnr(p);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        /*
        CLONE_VFORK requires the parent to sleep until the child exits or calls exec, so initialize a completion object to wait on
        */
        if (clone_flags & CLONE_VFORK) 
        {
            p->vfork_done = &vfork;
            init_completion(&vfork);
        }

        audit_finish_fork(p);
        tracehook_report_clone(regs, clone_flags, nr, p);

        /*
         We set PF_STARTING at creation in case tracing wants to use this to distinguish a fully live task from one that hasn't gotten to tracehook_report_clone() yet.  
         Now we clear it and set the child going.
         */
        p->flags &= ~PF_STARTING;

        /*
        If CLONE_STOPPED is set, queue a SIGSTOP for the child so that it starts in the stopped state
        */
        if (unlikely(clone_flags & CLONE_STOPPED)) 
        {
            /*
            We'll start up with an immediate SIGSTOP.
            */
            sigaddset(&p->pending.signal, SIGSTOP);
            set_tsk_thread_flag(p, TIF_SIGPENDING);
            __set_task_state(p, TASK_STOPPED);
        } 
        else 
        {
            //Otherwise wake the child and put it on the run queue
            wake_up_new_task(p, clone_flags);
        }

        tracehook_report_clone_complete(trace, regs, clone_flags, nr, p);

        if (clone_flags & CLONE_VFORK) 
        {
            freezer_do_not_count();
            //The parent sleeps on the completion initialized above until the child signals it
            wait_for_completion(&vfork);
            freezer_count();
            tracehook_report_vfork_done(p, nr);
        }
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}
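
The CLONE_VFORK branch above is what makes vfork() block the caller. A minimal user-space sketch of that behaviour follows. The function name and the exit code 7 are arbitrary, and the child deliberately does nothing except _exit(), since a vfork child may only call _exit or one of the exec functions.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: vfork a child that exits with code 7 and return
 * the exit status observed by the parent (or -1 on error). */
int vfork_child_status(void)
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(7);   /* child: runs while the parent is suspended */

    /* The parent only gets here after the child has exited (or exec'ed),
     * which corresponds to wait_for_completion(&vfork) in do_fork(). */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```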

copy_process() is the core of do_fork() and does most of the work of creating a process.
Next, follow copy_process()
\linux-2.6.32.63\kernel\fork.c

/*
This creates a new process as a copy of the old one, but does not actually start it yet.
It copies the registers, and all the appropriate parts of the process environment (as per the clone * flags). The actual kick-off is left to the caller.
*/
static struct task_struct *copy_process(unsigned long clone_flags,
                    unsigned long stack_start,
                    struct pt_regs *regs,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace)
{
    int retval;
    struct task_struct *p;
    int cgroup_callbacks_done = 0;

    /*
    1. Validate the clone_flags that were passed in
    */
    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);
 
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return ERR_PTR(-EINVAL);
 
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return ERR_PTR(-EINVAL);
 
    if ((clone_flags & CLONE_PARENT) &&
                current->signal->flags & SIGNAL_UNKILLABLE)
        return ERR_PTR(-EINVAL);

    /*
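    LSM (Linux Security Module) hook: lets a security module decide, before the process is created, whether creation is allowed. Process-monitoring features can be built on this framework. When no module overrides it, the default hook does nothing and returns 0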
    */
    retval = security_task_create(clone_flags);
    if (retval)
        goto fork_out;

    retval = -ENOMEM;
    /*
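    2. Call dup_task_struct(), which
        1) allocates a new kernel stack for the child
        2) copies the parent's task_struct and thread_info. This is a plain structure copy, so at this point the child's process descriptor is identical to the parent's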
    */
    p = dup_task_struct(current);
    if (!p)
        goto fork_out;

    ftrace_graph_init_task(p);

    rt_mutex_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
    DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
    DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
    retval = -EAGAIN;
    /*
    Check the per-user process limit (RLIMIT_NPROC)
    */
    if (atomic_read(&p->real_cred->user->processes) >= p->signal->rlim[RLIMIT_NPROC].rlim_cur)
    {
        if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) && p->real_cred->user != INIT_USER)
            goto bad_fork_free;
    }

    /*
    Copy the parent's cred structure, which holds the process's identity and privilege information (for example the UID)
    */
    retval = copy_creds(p, clone_flags);
    if (retval < 0)
        goto bad_fork_free;
 
    retval = -EAGAIN;
    /*
    3. Check that creating this process would not exceed the system-wide limit on the number of threads
    */
    if (nr_threads >= max_threads)
        goto bad_fork_cleanup_count;

    if (!try_module_get(task_thread_info(p)->exec_domain->module))
        goto bad_fork_cleanup_count;

    p->did_exec = 0;
    delayacct_tsk_init(p);    /* Must remain after dup_task_struct() */
    //Initialize the child's process flags (p->flags)
    copy_flags(clone_flags, p);
    INIT_LIST_HEAD(&p->children);
    INIT_LIST_HEAD(&p->sibling);
    rcu_copy_process(p);
    p->vfork_done = NULL;
    spin_lock_init(&p->alloc_lock);

    init_sigpending(&p->pending);

    /*
    4. Start initializing the fields of the child's task_struct
    */
    p->utime = cputime_zero;
    p->stime = cputime_zero;
    p->gtime = cputime_zero;
    p->utimescaled = cputime_zero;
    p->stimescaled = cputime_zero;
    p->prev_utime = cputime_zero;
    p->prev_stime = cputime_zero;

    p->default_timer_slack_ns = current->timer_slack_ns;

    task_io_accounting_init(&p->ioac);
    acct_clear_integrals(p);

    posix_cpu_timers_init(p);

    p->lock_depth = -1;        /* -1 = no lock */
    do_posix_clock_monotonic_gettime(&p->start_time);
    p->real_start_time = p->start_time;
    monotonic_to_bootbased(&p->real_start_time);
    p->io_context = NULL;
    p->audit_context = NULL;
    cgroup_fork(p);
#ifdef CONFIG_NUMA
    p->mempolicy = mpol_dup(p->mempolicy);
     if (IS_ERR(p->mempolicy)) {
         retval = PTR_ERR(p->mempolicy);
         p->mempolicy = NULL;
         goto bad_fork_cleanup_cgroup;
     }
    mpol_fix_fork_child_flag(p);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
    p->irq_events = 0;
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
    p->hardirqs_enabled = 1;
#else
    p->hardirqs_enabled = 0;
#endif
    p->hardirq_enable_ip = 0;
    p->hardirq_enable_event = 0;
    p->hardirq_disable_ip = _THIS_IP_;
    p->hardirq_disable_event = 0;
    p->softirqs_enabled = 1;
    p->softirq_enable_ip = _THIS_IP_;
    p->softirq_enable_event = 0;
    p->softirq_disable_ip = 0;
    p->softirq_disable_event = 0;
    p->hardirq_context = 0;
    p->softirq_context = 0;
#endif
#ifdef CONFIG_LOCKDEP
    p->lockdep_depth = 0; /* no locks held yet */
    p->curr_chain_key = 0;
    p->lockdep_recursion = 0;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
    p->blocked_on = NULL; /* not blocked yet */
#endif

    p->bts = NULL;

    /* Perform scheduler related setup. Assign this task to a CPU. */
    sched_fork(p, clone_flags);

    retval = perf_event_init_task(p);
    if (retval)
        goto bad_fork_cleanup_policy;

    if ((retval = audit_alloc(p)))
        goto bad_fork_cleanup_policy;
    /* 
    copy all the process information 
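    Copy the parent's resources to the child as directed by clone_flags. For resources that clone_flags marks as shared, the child only gets a pointer to the parent's structure and that structure's reference count is incremented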
    */
    if ((retval = copy_semundo(clone_flags, p)))
        goto bad_fork_cleanup_audit;
    if ((retval = copy_files(clone_flags, p)))
        goto bad_fork_cleanup_semundo;
    if ((retval = copy_fs(clone_flags, p)))
        goto bad_fork_cleanup_files;
    if ((retval = copy_sighand(clone_flags, p)))
        goto bad_fork_cleanup_fs;
    if ((retval = copy_signal(clone_flags, p)))
        goto bad_fork_cleanup_sighand;
    if ((retval = copy_mm(clone_flags, p)))
        goto bad_fork_cleanup_signal;
    if ((retval = copy_namespaces(clone_flags, p)))
        goto bad_fork_cleanup_mm;
    if ((retval = copy_io(clone_flags, p)))
        goto bad_fork_cleanup_namespaces;
    //Set up the child's kernel stack with a copy of the parent's saved registers (pt_regs)
    retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);
    if (retval)
        goto bad_fork_cleanup_io;

    if (pid != &init_struct_pid) 
    {
        retval = -ENOMEM;
        pid = alloc_pid(p->nsproxy->pid_ns);
        if (!pid)
            goto bad_fork_cleanup_io;

        if (clone_flags & CLONE_NEWPID) 
        {
            retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
            if (retval < 0)
                goto bad_fork_free_pid;
        }
    }

    p->pid = pid_nr(pid);
    /*
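    5. Set the TGID. For an ordinary process TGID equals PID. All threads in a thread group share the same TGID, which is why getpid() returns the same value in every thread
    If a lightweight process (thread) is being created with CLONE_THREAD, parent and child are in the same thread group, so the child inherits the parent's tgid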
    */
    p->tgid = p->pid;
    if (clone_flags & CLONE_THREAD)
        p->tgid = current->tgid;

    //If the child got a new set of namespaces, create a matching ns cgroup for it
    if (current->nsproxy != p->nsproxy) 
    {
        retval = ns_cgroup_clone(p, pid);
        if (retval)
            goto bad_fork_free_pid;
    }

    p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
    /*
     * Clear TID on mm_release()?
     */
    p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr: NULL;
#ifdef CONFIG_FUTEX
    p->robust_list = NULL;
#ifdef CONFIG_COMPAT
    p->compat_robust_list = NULL;
#endif
    INIT_LIST_HEAD(&p->pi_state_list);
    p->pi_state_cache = NULL;
#endif 

    if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM)
        p->sas_ss_sp = p->sas_ss_size = 0;
 
    clear_tsk_thread_flag(p, TIF_SYSCALL_TRACE);
#ifdef TIF_SYSCALL_EMU
    clear_tsk_thread_flag(p, TIF_SYSCALL_EMU);
#endif
    clear_all_latency_tracing(p);

    /*
    Record which signal, if any, is sent to the parent when the child exits
    ok, now we should be set up.. 
    */
    p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
    p->pdeath_signal = 0;
    //Initial exit state of the child
    p->exit_state = 0;
 
    p->group_leader = p;
    INIT_LIST_HEAD(&p->thread_group);
 
    cgroup_fork_callbacks(p);
    cgroup_callbacks_done = 1;

    /* Need tasklist lock for parent etc handling! */
    write_lock_irq(&tasklist_lock);

    /* 
    CLONE_PARENT re-uses the old parent 
    */
    if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) 
    {
        //The child's real_parent becomes the caller's real_parent (the two are siblings)
        p->real_parent = current->real_parent;
        p->parent_exec_id = current->parent_exec_id;
    } 
    else 
    {
        p->real_parent = current;
        p->parent_exec_id = current->self_exec_id;
    }

    spin_lock(&current->sighand->siglock);
 
    recalc_sigpending();
    if (signal_pending(current)) {
        spin_unlock(&current->sighand->siglock);
        write_unlock_irq(&tasklist_lock);
        retval = -ERESTARTNOINTR;
        goto bad_fork_free_pid;
    }

    if (clone_flags & CLONE_THREAD) {
        atomic_inc(&current->signal->count);
        atomic_inc(&current->signal->live);
        p->group_leader = current->group_leader;
        list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
    }

    if (likely(p->pid)) 
    {
        //Add the child to its parent's list of children, which links it with its siblings
        list_add_tail(&p->sibling, &p->real_parent->children);
        tracehook_finish_clone(p, clone_flags, trace);

        if (thread_group_leader(p)) 
        {
            if (clone_flags & CLONE_NEWPID)
                p->nsproxy->pid_ns->child_reaper = p;

            p->signal->leader_pid = pid;
            tty_kref_put(p->signal->tty);
            p->signal->tty = tty_kref_get(current->signal->tty);
            attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
            attach_pid(p, PIDTYPE_SID, task_session(current));
            list_add_tail_rcu(&p->tasks, &init_task.tasks);
            __get_cpu_var(process_counts)++;
        }
        attach_pid(p, PIDTYPE_PID, pid);
        nr_threads++;
    }

    total_forks++;
    spin_unlock(&current->sighand->siglock);
    write_unlock_irq(&tasklist_lock);
    proc_fork_connector(p);
    cgroup_post_fork(p);
    perf_event_fork(p);
    return p;
/*
Error exits: undo the steps above in reverse order
*/
bad_fork_free_pid:
    if (pid != &init_struct_pid)
        free_pid(pid);
bad_fork_cleanup_io:
    if (p->io_context)
        exit_io_context(p);
bad_fork_cleanup_namespaces:
    exit_task_namespaces(p);
bad_fork_cleanup_mm:
    if (p->mm)
        mmput(p->mm);
bad_fork_cleanup_signal:
    if (!(clone_flags & CLONE_THREAD))
        __cleanup_signal(p->signal);
bad_fork_cleanup_sighand:
    __cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:
    exit_fs(p); /* blocking */
bad_fork_cleanup_files:
    exit_files(p); /* blocking */
bad_fork_cleanup_semundo:
    exit_sem(p);
bad_fork_cleanup_audit:
    audit_free(p);
bad_fork_cleanup_policy:
    perf_event_free_task(p);
#ifdef CONFIG_NUMA
    mpol_put(p->mempolicy);
bad_fork_cleanup_cgroup:
#endif
    cgroup_exit(p, cgroup_callbacks_done);
    delayacct_tsk_free(p);
    module_put(task_thread_info(p)->exec_domain->module);
bad_fork_cleanup_count:
    atomic_dec(&p->cred->user->processes);
    exit_creds(p);
bad_fork_free:
    free_task(p);
fork_out:
    return ERR_PTR(retval);
}
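
The flag checks at the top of copy_process() can be mirrored in user space to see which combinations the kernel rejects. The sketch below is a hypothetical re-statement of the first three checks only (the CLONE_PARENT check depends on kernel-internal state and is left out). It assumes the Linux-specific header <linux/sched.h> is installed, which supplies the CLONE_* values.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_* flag values (Linux-specific header) */

/* Hypothetical user-space mirror of the first checks in copy_process():
 * returns 0 if the combination is acceptable, -EINVAL otherwise. */
int check_clone_flags(unsigned long clone_flags)
{
    /* A new mount namespace cannot share filesystem info with the parent. */
    if ((clone_flags & (CLONE_NEWNS | CLONE_FS)) == (CLONE_NEWNS | CLONE_FS))
        return -EINVAL;

    /* Threads in one thread group must share signal handlers... */
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return -EINVAL;

    /* ...and shared signal handlers require a shared address space. */
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return -EINVAL;

    return 0;
}
```

Under these rules CLONE_THREAD always implies CLONE_SIGHAND and CLONE_VM, which is why every thread in a thread group shares one address space.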

Next, follow dup_task_struct()
\linux-2.6.32.63\kernel\fork.c

static struct task_struct *dup_task_struct(struct task_struct *orig)
{
    struct task_struct *tsk;
    struct thread_info *ti;
    unsigned long *stackend;

    int err;

    prepare_to_copy(orig);

    /*
    1. Allocate space for the task_struct with alloc_task_struct()
    */
    tsk = alloc_task_struct();
    if (!tsk)
        return NULL;
    /*
    2. Allocate the thread_info structure, which shares its allocation with the new kernel stack
    */
    ti = alloc_thread_info(tsk);
    if (!ti) {
        free_task_struct(tsk);
        return NULL;
    }
    /*
    3. Copy the whole task_struct from the parent
    */
     err = arch_dup_task_struct(tsk, orig);
    if (err)
        goto out;

    tsk->stack = ti;

    err = prop_local_init_single(&tsk->dirties);
    if (err)
        goto out;
    /*
    4. Call setup_thread_stack() to copy the parent's thread_info into the new one
    */
    setup_thread_stack(tsk, orig);
    stackend = end_of_stack(tsk);
    *stackend = STACK_END_MAGIC;    /* for overflow detection */

#ifdef CONFIG_CC_STACKPROTECTOR
    tsk->stack_canary = get_random_int();
#endif

    /*
    Set the reference count of the new task_struct to 2: one reference for the task itself and one for whoever later releases it
    */
    atomic_set(&tsk->usage,2);
    atomic_set(&tsk->fs_excl, 0);
#ifdef CONFIG_BLK_DEV_IO_TRACE
    tsk->btrace_seq = 0;
#endif
    tsk->splice_pipe = NULL;

    account_kernel_stack(ti, 1);

    return tsk;

out:
    free_thread_info(ti);
    free_task_struct(tsk);
    return NULL;
}

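copy_process() mainly performs the necessary checks, initialization, and copying of data structures. Two of the functions it calls deserve a closer look:

1. copy_mm(): implements copy-on-write between parent and child and the sharing of kernel virtual addresses
2. copy_thread(): implements the two returns from a single fork call, one in the parent and one in the child

//Set up the child's kernel stack with a copy of the parent's saved registers
retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);

An application enters the kernel through the fork() system call, and its kernel stack holds the process context (register state) saved on entry. copy_thread() copies that context from the parent's kernel stack to the child's and sets the saved EAX on the child's stack to 0. Because parent and child share the same code and data, both carry on from the same point after returning, which explains the following:

1. Parent and child continue from the same code location, because their process contexts are identical
2. In the parent fork() returns the child's PID, because the parent returns from the call normally
3. In the child fork() returns 0, because the saved EAX was set to 0
4. Parent and child do not necessarily start running at the same time, but the kernel returns twice: once in the parent and once in the child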
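
The "one call, two returns" behaviour listed above can be cross-checked from user space: the value fork() returns in the parent should equal what getpid() reports inside the child. The sketch below passes the child's PID back over a pipe; the helper name is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if the PID returned to the parent by
 * fork() matches getpid() as seen by the child, 0 if not, -1 on error. */
int fork_pid_matches_child(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t ret = fork();
    if (ret < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (ret == 0) {
        /* Second return: the child sees 0 and reports its own PID. */
        close(fds[0]);
        pid_t self = getpid();
        ssize_t n = write(fds[1], &self, sizeof self);
        _exit(n == (ssize_t)sizeof self ? 0 : 1);
    }

    /* First return: the parent sees the child's PID in ret. */
    close(fds[1]);
    pid_t reported = -1;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);
    waitpid(ret, NULL, 0);
    return n == (ssize_t)sizeof reported && reported == ret;
}
```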

if ((retval = copy_mm(clone_flags, p)))
goto bad_fork_cleanup_signal;

The kernel calls copy_mm() to set up the child's memory regions

static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
{
    struct mm_struct * mm, *oldmm;
    int retval;

    tsk->min_flt = tsk->maj_flt = 0;
    tsk->nvcsw = tsk->nivcsw = 0;
#ifdef CONFIG_DETECT_HUNG_TASK
    tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
#endif

    tsk->mm = NULL;
    tsk->active_mm = NULL;

    /*
     * Are we cloning a kernel thread?
     *
     * We need to steal a active VM for that..
     */
    oldmm = current->mm;
    if (!oldmm)
        return 0;

    /*
    If the mm is to be shared (CLONE_VM), increment the reference count on the parent's mm and use current->mm as the child's mm
    */
    if (clone_flags & CLONE_VM) 
    {
        atomic_inc(&oldmm->mm_users);
        mm = oldmm;
        goto good_mm;
    }

    retval = -ENOMEM;
    /*
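    dup_mm() does the work of copying the mm_struct. It copies the parent's page tables to the child, so parent and child initially share the same physical pages, and they also share the whole kernel address space.
    For writable private user-space mappings, however, the page table entries are marked read-only in both processes. When either the parent or the child writes to such a page, the page-fault path (entered through do_page_fault()) allocates a new physical page and gives the writer its own private copy. This is the copy-on-write mechanism
    As soon as either process returns to user mode and first writes to its stack, parent and child have separate user stacks. The code segment, on the other hand, stays shared on the same physical pages unless the child calls one of the exec functions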
    */
    mm = dup_mm(tsk);
    if (!mm)
        goto fail_nomem;

good_mm:
    /* Initializing for Swap token stuff */
    mm->token_priority = 0;
    mm->last_interval = 0;

    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;

fail_nomem:
    return retval;
}

Next, follow dup_mm()

/*
Allocate a new mm structure and copy contents from the mm structure of the passed in task structure.
*/
struct mm_struct *dup_mm(struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm = current->mm;
    int err;

    if (!oldmm)
        return NULL;
    //Allocate an mm_struct
    mm = allocate_mm();
    if (!mm)
        goto fail_nomem;
    //Copy the parent's mm_struct into it
    memcpy(mm, oldmm, sizeof(*mm));

    /* Initializing for Swap token stuff */
    mm->token_priority = 0;
    mm->last_interval = 0;

    /*
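    Initialize the mm and allocate its page table
    mm_init() initializes the spinlocks, lists and other resources in the mm_struct, then calls mm_alloc_pgd() to allocate the page global directory. The entries that map kernel virtual addresses are copied into the child's page table at this point, so parent and child share the kernel address space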
    */
    if (!mm_init(mm, tsk))
        goto fail_nomem;

    if (init_new_context(tsk, mm))
        goto fail_nocontext;

    dup_mm_exe_file(oldmm, mm);

    //Copy the vm_area_struct structures
    err = dup_mmap(mm, oldmm);
    if (err)
        goto free_pt;

    mm->hiwater_rss = get_mm_rss(mm);
    mm->hiwater_vm = mm->total_vm;

    if (mm->binfmt && !try_module_get(mm->binfmt->module))
        goto free_pt;

    return mm;

free_pt:
    /* don't put binfmt in mmput, we haven't got module yet */
    mm->binfmt = NULL;
    mmput(mm);

fail_nomem:
    return NULL;

fail_nocontext:
    /*
     * If init_new_context() failed, we cannot use mmput() to free the mm
     * because it calls destroy_context()
     */
    mm_free_pgd(mm);
    free_mm(mm);
    return NULL;
}

Next, follow mm_alloc_pgd()

\linux-2.6.32.63\arch\x86\mm\pgtable.c

static inline int mm_alloc_pgd(struct mm_struct * mm)
{
    mm->pgd = pgd_alloc(mm);
    if (unlikely(!mm->pgd))
        return -ENOMEM;
    return 0;
}

pgd_t *pgd_alloc(struct mm_struct *mm)
{
    pgd_t *pgd;
    pmd_t *pmds[PREALLOCATED_PMDS];

    pgd = (pgd_t *)__get_free_page(PGALLOC_GFP);

    if (pgd == NULL)
        goto out;

    mm->pgd = pgd;

    if (preallocate_pmds(pmds) != 0)
        goto out_free_pgd;

    if (paravirt_pgd_alloc(mm) != 0)
        goto out_free_pmds;

    /*
    Make sure that pre-populating the pmds is atomic with respect to anything walking the pgd_list, 
    so that they never see a partially populated pgd.
    */
    spin_lock(&pgd_lock);

    pgd_ctor(pgd);
    pgd_prepopulate_pmd(mm, pgd, pmds);

    spin_unlock(&pgd_lock);

    return pgd;

out_free_pmds:
    free_pmds(pmds);
out_free_pgd:
    free_page((unsigned long)pgd);
out:
    return NULL;
}

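At this point the page table has been allocated and the kernel address-space mappings are in place. Next, look at how dup_mmap() handles the data structures for the user address space. It does the following:

1. Allocates and copies the vm_area_struct structures
2. Sets up page table entries according to the flags of each vm_area_struct, marking writable private regions read-only
3. When a process (parent or child) later writes to such a region, the page-fault path entered through do_page_fault() allocates a new physical page, gives that process its private copy, and updates the page table to point at the new page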
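
The visible effect of copy-on-write is that a write made by the child after fork() never shows up in the parent. A minimal sketch follows; the variable and function names are hypothetical, and the sketch only demonstrates the isolation that COW provides, not the page-fault mechanics themselves.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in the writable data segment; volatile so that every access in
 * this example really goes to memory. */
static volatile int counter = 1;

/* Hypothetical helper: the child overwrites `counter`, and the function
 * returns the value the parent still sees afterwards (or -1 on error). */
int parent_value_after_child_write(void)
{
    counter = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* The page is shared read-only after fork; this write faults and
         * the kernel gives the child its own private copy of the page. */
        counter = 99;
        _exit(counter == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return counter;   /* the parent's page was never modified */
}
```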

err = dup_mmap(mm, oldmm);

static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
    struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
    struct rb_node **rb_link, *rb_parent;
    int retval;
    unsigned long charge;
    struct mempolicy *pol;

    down_write(&oldmm->mmap_sem);
    flush_cache_dup_mm(oldmm);
    /*
     * Not linked in yet - no deadlock potential:
     */
    down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);

    mm->locked_vm = 0;
    mm->mmap = NULL;
    mm->mmap_cache = NULL;
    mm->free_area_cache = oldmm->mmap_base;
    mm->cached_hole_size = ~0UL;
    mm->map_count = 0;
    cpumask_clear(mm_cpumask(mm));
    mm->mm_rb = RB_ROOT;
    rb_link = &mm->mm_rb.rb_node;
    rb_parent = NULL;
    pprev = &mm->mmap;
    retval = ksm_fork(mm, oldmm);
    if (retval)
        goto out;

    prev = NULL;
    //Walk every vm_area_struct of the parent
    for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) 
    {
        struct file *file;
        //This region is marked VM_DONTCOPY: do not copy it to the child
        if (mpnt->vm_flags & VM_DONTCOPY) 
        {
            long pages = vma_pages(mpnt);
            mm->total_vm -= pages;
            vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
                                -pages);
            continue;
        }
        charge = 0;
        //This region is accounted (VM_ACCOUNT): check that enough memory can be committed
        if (mpnt->vm_flags & VM_ACCOUNT) 
        {
            unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
            if (security_vm_enough_memory(len))
                goto fail_nomem;
            charge = len;
        }
        //Allocate a new vm_area_struct for the child
        tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        if (!tmp)
            goto fail_nomem;
        //Copy the whole structure
        *tmp = *mpnt;
        pol = mpol_dup(vma_policy(mpnt));
        retval = PTR_ERR(pol);
        if (IS_ERR(pol))
            goto fail_nomem_policy;
        vma_set_policy(tmp, pol);
        tmp->vm_flags &= ~VM_LOCKED;
        tmp->vm_mm = mm;
        tmp->vm_next = tmp->vm_prev = NULL;
        anon_vma_link(tmp);
        file = tmp->vm_file;
        
        //If this region is a file mapping, set up the file-related state and take a reference on the file
        if (file) 
        {
            struct inode *inode = file->f_path.dentry->d_inode;
            struct address_space *mapping = file->f_mapping;

            get_file(file);
            if (tmp->vm_flags & VM_DENYWRITE)
                atomic_dec(&inode->i_writecount);
            spin_lock(&mapping->i_mmap_lock);
            if (tmp->vm_flags & VM_SHARED)
                mapping->i_mmap_writable++;
            tmp->vm_truncate_count = mpnt->vm_truncate_count;
            flush_dcache_mmap_lock(mapping);
            /* insert tmp into the share list, just after mpnt */
            vma_prio_tree_add(tmp, mpnt);
            flush_dcache_mmap_unlock(mapping);
            spin_unlock(&mapping->i_mmap_lock);
        }

        /*
         * Clear hugetlb-related page reserves for children. This only
         * affects MAP_PRIVATE mappings. Faults generated by the child
         * are not guaranteed to succeed, even if read-only
         */
        if (is_vm_hugetlb_page(tmp))
            reset_vma_resv_huge_pages(tmp);

        /*
        Link in the new vma and copy the page table entries.
        */
        //Link the new vm_area_struct into the child's list
        *pprev = tmp;
        pprev = &tmp->vm_next;
        tmp->vm_prev = prev;
        prev = tmp;
        //Insert it into the child's red-black tree
        __vma_link_rb(mm, tmp, rb_link, rb_parent);
        rb_link = &tmp->vm_rb.rb_right;
        rb_parent = &tmp->vm_rb;

        mm->map_count++;
        /*
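        Allocate and fill in page tables; no physical data pages are allocated here
        copy_page_range() allocates and sets up the page tables for the region described by the vm_area_struct, and stores each page table's physical address in the page directory (the level above)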
        */
        retval = copy_page_range(mm, oldmm, mpnt);

        if (tmp->vm_ops && tmp->vm_ops->open)
            tmp->vm_ops->open(tmp);

        if (retval)
            goto out;
    }
    /* a new mm has just been created */
    arch_dup_mmap(oldmm, mm);
    retval = 0;
out:
    up_write(&mm->mmap_sem);
    flush_tlb_mm(oldmm);
    up_write(&oldmm->mmap_sem);
    return retval;
fail_nomem_policy:
    kmem_cache_free(vm_area_cachep, tmp);
fail_nomem:
    retval = -ENOMEM;
    vm_unacct_memory(charge);
    goto out;
}
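
The VM_DONTCOPY branch in dup_mmap() can be exercised from user space: on Linux, madvise() with MADV_DONTFORK marks a mapping so that it is skipped at fork time. The sketch below assumes that behaviour and a Linux system; the helper name is hypothetical. The child is expected to die with SIGSEGV when it touches the address, because the region was never copied into its address space.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if a child faults on a mapping marked
 * MADV_DONTFORK in the parent, 0 if it does not, -1 on error. */
int child_faults_on_dontfork_mapping(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 'x';                              /* make the page present */

    if (madvise(p, (size_t)page, MADV_DONTFORK) < 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);            /* make sure the fault is fatal */
        char c = *(volatile char *)p;        /* region is absent in the child */
        (void)c;
        _exit(0);                            /* reached only if no fault */
    }

    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    munmap(p, (size_t)page);
    if (reaped < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```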

Next, follow copy_page_range(). It walks down the page-table levels and ends up in copy_pte_range(), shown here

\linux-2.6.32.63\mm\memory.c

static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, unsigned long addr, unsigned long end)
{
    pte_t *orig_src_pte, *orig_dst_pte;
    pte_t *src_pte, *dst_pte;
    spinlock_t *src_ptl, *dst_ptl;
    int progress = 0;
    int rss[2];

again:
    rss[1] = rss[0] = 0;
    dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
    if (!dst_pte)
        return -ENOMEM;
    src_pte = pte_offset_map_nested(src_pmd, addr);
    src_ptl = pte_lockptr(src_mm, src_pmd);
    spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
    orig_src_pte = src_pte;
    orig_dst_pte = dst_pte;
    arch_enter_lazy_mmu_mode();

    do {
        /*
         * We are holding two locks at this point - either of them
         * could generate latencies in another task on another CPU.
         */
        if (progress >= 32) {
            progress = 0;
            if (need_resched() ||
                spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                break;
        }
        if (pte_none(*src_pte)) {
            progress++;
            continue;
        }
        /*
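        Supporting multi-level paging makes this code path look involved: the range helpers walk the virtual addresses between vm_start and vm_end level by level, and at the lowest level copy_pte_range() calls copy_one_pte() for each page table entry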
        */
        copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
        progress += 8;
    } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

    arch_leave_lazy_mmu_mode();
    spin_unlock(src_ptl);
    pte_unmap_nested(orig_src_pte);
    add_mm_rss(dst_mm, rss[0], rss[1]);
    pte_unmap_unlock(orig_dst_pte, dst_ptl);
    cond_resched();
    if (addr != end)
        goto again;
    return 0;
}

Next, follow copy_one_pte()

/*
copy one vm_area from one task to the other. Assumes the page tables already present in the new task to be cleared in the whole range  covered by this vma.
*/
static inline void copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,    unsigned long addr, int *rss)
{
    unsigned long vm_flags = vma->vm_flags;
    pte_t pte = *src_pte;
    struct page *page;

    //pte contains position in swap or file, so copy. 
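    /*
    The page behind this virtual address has been swapped out to disk. Note that the page-fault handler can bring pages back from the swap area, but the memory and page tables used by the page-fault handler itself cannot be swapped, so the page tables used by kernel space are not swappable
    */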
    if (unlikely(!pte_present(pte))) {
        if (!pte_file(pte)) {
            swp_entry_t entry = pte_to_swp_entry(pte);

            swap_duplicate(entry);
            /* make sure dst_mm is on swapoff's mmlist. */
            if (unlikely(list_empty(&dst_mm->mmlist))) {
                spin_lock(&mmlist_lock);
                if (list_empty(&dst_mm->mmlist))
                    list_add(&dst_mm->mmlist,
                         &src_mm->mmlist);
                spin_unlock(&mmlist_lock);
            }
            if (is_write_migration_entry(entry) && is_cow_mapping(vm_flags)) 
            {
                /*
                 * COW mappings require pages in both parent
                 * and child to be set to read.
                 */
                make_migration_entry_read(&entry);
                pte = swp_entry_to_pte(entry);
                set_pte_at(src_mm, addr, src_pte, pte);
            }
        }
        goto out_set_pte;
    }

    /*
    If it's a COW mapping, write protect it both in the parent and the child
    That is, for a private writable region the page-table entries are marked read-only in both processes, which is what implements copy on write
    */
    if (is_cow_mapping(vm_flags)) 
    {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }

    /*
     * If it's a shared mapping, mark it clean in
     * the child
     */
    if (vm_flags & VM_SHARED)
        pte = pte_mkclean(pte);
    pte = pte_mkold(pte);

    page = vm_normal_page(vma, addr, pte);
    if (page) {
        get_page(page);
        page_dup_rmap(page);
        rss[PageAnon(page)]++;
    }

out_set_pte:
    set_pte_at(dst_mm, addr, dst_pte, pte);
}

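Back in copy_process(): when the parent entered the system call, its process context (the general-purpose registers) was saved on its kernel-mode stack as a pt_regs structure. copy_thread() copies the parent's pt_regs onto the child's kernel-mode stack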

\linux-2.6.32.63\arch\x86\kernel\process_32.c

int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long unused, struct task_struct *p, struct pt_regs *regs)
{
    struct pt_regs *childregs;
    struct task_struct *tsk;
    int err;

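    //the child's kernel-mode stack
    childregs = task_pt_regs(p);
    //copy the pt_regs on the parent's kernel-mode stack to the child's kernel-mode stack
    *childregs = *regs;
    //set eax in the child's pt_regs to 0, which is why fork() "returns" 0 in the child
    childregs->ax = 0;
    //set the user-mode stack pointer saved in the child's pt_regs
    childregs->sp = sp;

    p->thread.sp = (unsigned long) childregs;
    p->thread.sp0 = (unsigned long) (childregs+1);

    //set the child's thread.ip so that, when the child is first scheduled, it starts at ret_from_fork; this is the mechanism behind "one call, two returns"
    p->thread.ip = (unsigned long) ret_from_fork;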

    task_user_gs(p) = get_user_gs(regs);

    tsk = current;
    // I/O permission bitmap
    if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) 
    {
        p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
                        IO_BITMAP_BYTES, GFP_KERNEL);
        if (!p->thread.io_bitmap_ptr) {
            p->thread.io_bitmap_max = 0;
            return -ENOMEM;
        }
        set_tsk_thread_flag(p, TIF_IO_BITMAP);
    }

    err = 0;

    /*
    Set a new TLS for the child thread
    (the thread-local storage mechanism)
    */
    if (clone_flags & CLONE_SETTLS)
        err = do_set_thread_area(p, -1,
            (struct user_desc __user *)childregs->si, 0);

    if (err && p->thread.io_bitmap_ptr) {
        kfree(p->thread.io_bitmap_ptr);
        p->thread.io_bitmap_max = 0;
    }

    clear_tsk_thread_flag(p, TIF_DS_AREA_MSR);
    p->thread.ds_ctx = NULL;

    clear_tsk_thread_flag(p, TIF_DEBUGCTLMSR);
    p->thread.debugctlmsr = 0;

    return err;
}

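At this point the creation of the child is finished. When the scheduler picks this process, it starts executing at ret_from_fork and then jumps to syscall_exit; this is the "one call to fork, two returns from kernel mode to user space". The user-space return address is stored in the pt_regs structure on the kernel-mode stack, and it is the same address as the parent's
Finally, back in do_fork(): if copy_process() returns successfully, the newly created child is woken up and made runnable. The kernel tries to let the child run first, because the child usually calls exec() right away, which avoids the copy-on-write overhead that would arise if the parent ran first and started writing to the address space. This ordering is a heuristic and is not guaranteed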
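
The "one call, two returns" behaviour can be observed from user space. The following is a minimal sketch, not taken from the kernel source; the helper name fork_returns_twice and the exit code 7 are assumptions chosen for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the child's exit status, or -1 on error. */
int fork_returns_twice(void)
{
    pid_t pid = fork();            /* one call ... */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* ... first return: in the child, fork() yields 0
           (copy_thread() stored 0 in the child's saved ax) */
        _exit(7);
    }
    /* ... second return: in the parent, fork() yields the child's PID */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling fork_returns_twice() from main() and printing the result shows the value the child passed to _exit().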

0x1: Creating threads

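First, processes and threads are very close on Linux, largely because both are represented by the same data structure, task_struct. The difference is that a process has its own user address space, while threads share one
To understand thread creation on Linux, keep the following points in mind

1. The Linux thread implementation is unusual: it does not really distinguish threads from processes; to Linux a thread is just a special kind of process
2. There are three user-space APIs with which a parent duplicates itself into a child
    1) fork
    2) clone
    3) vfork
//Internally all three call the same kernel function, do_fork(), with different arguments; the flags argument states which resources parent and child share
3. From the kernel's point of view there is no thread concept. Linux implements every thread as a task (task_struct); the kernel provides no special scheduling algorithm or data structure for threads. A thread is simply a process that shares certain resources with other processes, and each thread has its own task_struct, so inside the kernel a thread is an ordinary process. On Linux, what separates threads from processes is how much parent and child share; in that sense a thread is just a way of sharing resources between processes

Creating a thread is like creating an ordinary process (it is process creation), except that the call to clone() passes flags naming the resources to share (resource sharing is the core idea of Linux threads)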

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
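/*
The code above is much like calling fork, except that parent and child share the address space, filesystem resources, file descriptors and signal handlers
In that case the new process and its parent are what is commonly called threads
*/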

The flags passed to clone() determine how the new process behaves and which kinds of resources parent and child share

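#define CSIGNAL            0x000000ff    /* signal mask to be sent to the parent when the child exits */
#define CLONE_VM        0x00000100    /* parent and child share the address space */
#define CLONE_FS        0x00000200    /* parent and child share filesystem information */
#define CLONE_FILES        0x00000400    /* parent and child share open files */
#define CLONE_SIGHAND        0x00000800    /* parent and child share the signal handler table */
#define CLONE_PTRACE        0x00002000    /* continue tracing in the child: if the parent is being traced, the child is traced as well */
#define CLONE_VFORK        0x00004000    /* after the child is created the parent stays blocked until the child exits or calls execve() */
#define CLONE_PARENT        0x00008000    /* the child gets the same parent as the calling process */
#define CLONE_THREAD        0x00010000    /* put parent and child in the same thread group, so the child's tgid and group_leader are set accordingly; think of them as threads of one process */
#define CLONE_NEWNS        0x00020000    /* create a new mount namespace for the child */
#define CLONE_SYSVSEM        0x00040000    /* parent and child share System V IPC semaphore undo semantics */
#define CLONE_SETTLS        0x00080000    /* set up a separate thread-local storage (TLS) area for the child */
#define CLONE_PARENT_SETTID    0x00100000    /* store the child's TID in the parent's memory */
#define CLONE_CHILD_CLEARTID    0x00200000    /* clear the TID in the child's memory when the child exits */
#define CLONE_DETACHED        0x00400000    /* Unused, ignored */
#define CLONE_UNTRACED        0x00800000    /* the tracing process cannot force CLONE_PTRACE on this child; creates a process that is not traced, usually set for kernel threads */
#define CLONE_CHILD_SETTID    0x01000000    /* store the child's TID in the child's memory */
#define CLONE_STOPPED        0x02000000    /* start the child in the TASK_STOPPED state; another process changes that state later */
#define CLONE_NEWUTS        0x04000000    /* New utsname group */
#define CLONE_NEWIPC        0x08000000    /* New ipcs */
#define CLONE_NEWUSER        0x10000000    /* New user namespace */
#define CLONE_NEWPID        0x20000000    /* New pid namespace */
#define CLONE_NEWNET        0x40000000    /* New network namespace */
#define CLONE_IO        0x80000000    /* Clone io context */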
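
The effect of CLONE_VM can be seen with the glibc clone() wrapper. This is a minimal sketch under stated assumptions: the helper run_clone, the 64 KiB stack size and the value 42 are illustrative, and SIGCHLD is placed in the low flag byte (the CSIGNAL field) so that the parent can wait for the child with waitpid().

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_counter = 0;

static int child_fn(void *arg)
{
    (void)arg;
    shared_counter = 42;   /* the parent sees this only with CLONE_VM */
    return 0;
}

/* Hypothetical helper: runs child_fn via clone() and returns the value the
   parent observes in shared_counter afterwards, or -1 on error. */
int run_clone(int share_vm)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    shared_counter = 0;
    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);
    /* the stack grows downward on x86, so pass the top of the buffer */
    pid_t pid = clone(child_fn, stack + stack_size, flags, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    waitpid(pid, NULL, 0);
    free(stack);
    return shared_counter;
}
```

With CLONE_VM the child writes into the parent's address space; without it the child works on its own copy-on-write copy and the parent's variable stays untouched.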

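4. The sys_execve() function

Every function in the exec family is ultimately implemented by the sys_execve() system call. An exec call does not create a new process, so the process ID is the same before and after; exec only replaces the text, data, heap and stack segments of the current process with a brand-new program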
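
That the PID survives exec can be checked from user space. The sketch below is an illustration only; the helper name exec_keeps_pid is hypothetical and it assumes /bin/sh exists. The child replaces itself with a shell that prints its own PID ($$), and the parent compares that with the PID fork() returned.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the exec'ed program reports the same PID
   that fork() returned to the parent, 0 if not, -1 on error. */
int exec_keeps_pid(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* child: redirect stdout into the pipe, then replace the program image */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", "echo $$", (char *)NULL);
        _exit(127);                 /* reached only if execl() failed */
    }

    close(fds[1]);
    char buf[32] = {0};
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n <= 0)
        return -1;
    return atol(buf) == (long)pid;
}
```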

\arch\x86\kernel\process_32.c

int sys_execve(struct pt_regs *regs)
{
    int error;
    char *filename;
    /*
    1. Copy the name of the executable from user space into a newly allocated page
    */
    filename = getname((char __user *) regs->bx);
    error = PTR_ERR(filename);
    if (IS_ERR(filename))
        goto out;

    /*
    Call do_execve() to run the executable
    */
    error = do_execve(filename,
            (char __user * __user *) regs->cx,
            (char __user * __user *) regs->dx,
            regs);
    if (error == 0) {
        /* Make sure we don't return using sysenter.. */
        set_thread_flag(TIF_IRET);
    }
    putname(filename);
out:
    return error;
}

Continue tracing into do_execve()

linux-2.6.32.63\fs\exec.c

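/*
sys_execve() executes a new program.
1. filename: name of the executable file
2. argv: array of pointers to the process arguments
3. envp: array of pointers to the process environment variables
4. regs: the register set

The kernel has to be very careful when it accesses user-space memory; the __user annotation lets automated tools check that such accesses are handled properly
*/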
int do_execve(char * filename, char __user *__user *argv, char __user *__user *envp, struct pt_regs * regs)
{
    //collects the information needed to run the executable
    struct linux_binprm *bprm;
    struct file *file;
    struct files_struct *displaced;
    bool clear_in_exec;
    int retval;

    retval = unshare_files(&displaced);
    if (retval)
        goto out_ret;

    retval = -ENOMEM;
    bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
    if (!bprm)
        goto out_files;

    /*
    1. LSM Hook Point 1: 
    retval = prepare_bprm_creds(bprm);
    prepare_exec_creds->security_prepare_creds
    int security_prepare_creds(struct cred *new, const struct cred *old, gfp_t gfp)
    {
        return security_ops->cred_prepare(new, old, gfp);
    }
    */
    retval = prepare_bprm_creds(bprm);
    if (retval)
        goto out_free;

    retval = check_unsafe_exec(bprm);
    if (retval < 0)
        goto out_free;
    clear_in_exec = retval;
    current->in_execve = 1;
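    //find and open the given executable; open_exec() returns a file pointer that represents the executable being read
    file = open_exec(filename);
    //PTR_ERR() converts the pointer into an error code
    retval = PTR_ERR(file);
    //check whether open_exec() returned an error pointer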
    if (IS_ERR(file))
        goto out_unmark;
     
    sched_exec();

    //the file to execute
    bprm->file = file;
    //the name of the file to execute
    bprm->filename = filename;
    bprm->interp = filename;

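    /*
    Handle several administrative tasks
    1. mm_alloc creates a new mm_struct instance to manage the process address space
    2. init_new_context is an architecture-specific function that initializes that instance
    3. __bprm_mm_init sets up the initial stack
    */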
    retval = bprm_mm_init(bprm);
    if (retval)
        goto out_file;

    //count the command-line arguments
    bprm->argc = count(argv, MAX_ARG_STRINGS);
    if ((retval = bprm->argc) < 0)
        goto out;
    //count the environment variables
    bprm->envc = count(envp, MAX_ARG_STRINGS);
    if ((retval = bprm->envc) < 0)
        goto out;
    
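    //the parameters of the new process (euid, egid, argument list, environment, file name, ...) are gathered in a struct linux_binprm that is later handed to kernel functions

    /*
    Read the first 128 bytes of the executable into the buffer of the linux_binprm structure bprm; the kernel later uses these 128 bytes to decide which binary-format handler should process the file
    prepare_binprm also supplies some values taken from the parent process
    */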
    retval = prepare_binprm(bprm);
    if (retval < 0)
        goto out;

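    //copy the file name into the newly allocated pages
    retval = copy_strings_kernel(1, &bprm->filename, bprm);
    if (retval < 0)
        goto out;

    bprm->exec = bprm->p;
    //copy the environment variables into the newly allocated pages
    retval = copy_strings(bprm->envc, envp, bprm);
    if (retval < 0)
        goto out;

    //copy the command-line arguments into the newly allocated pages
    retval = copy_strings(bprm->argc, argv, bprm);
    if (retval < 0)
        goto out;

    //all preparation is done; everything needed has been collected in the linux_binprm structure bprm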
    current->flags &= ~PF_KTHREAD;
    
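    //call search_binary_handler() to load and run the target program; the loader is chosen from the magic bytes in the 128-byte file header held in linux_binprm
    /*
    LSM Hook Point 2: 
    retval = security_bprm_check(bprm); 
    int security_bprm_check(struct linux_binprm *bprm) 
    { 
        return security_ops->bprm_check_security(bprm); 
    } 

    At the end of do_execve, search_binary_handler looks for a suitable binary format for the file to be executed (usually identified by a "magic number" in the file header)
    The binary-format handler is responsible for loading the new program's data into the old address space; it usually does the following
    1. Release "all" resources used by the original process
    2. Map the application into the virtual address space. The following segments have to be handled (the variables involved are members of the process's mm_struct and are set to the correct values by the binary-format handler)
        1) the text segment holds the executable code; start_code and end_code delimit its region in the address space
        2) pre-initialized data (variables given a value at compile time) lies between start_data and end_data and is mapped from the corresponding segment of the executable
        3) the heap, used for dynamic memory allocation, is also placed in the virtual address space; start_brk and brk mark its bounds
        4) the position of the stack is defined by start_stack; it grows downward
        5) the program arguments and environment are mapped into the virtual address space as well, between arg_start and arg_end and between env_start and env_end
    3. Set the instruction pointer and other architecture-specific registers so that, when the scheduler selects the process, execution starts at the program's entry point (which eventually reaches main)
    */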
    retval = search_binary_handler(bprm, regs);
    if (retval < 0)
        goto out;

    /* execve succeeded */
    current->fs->in_exec = 0;
    current->in_execve = 0;
    acct_update_integrals(current);    

    free_bprm(bprm);
    if (displaced)
        put_files_struct(displaced);
    return retval;

out:
    if (bprm->mm) {
        acct_arg_size(bprm, 0);
        mmput(bprm->mm);
    }

out_file:
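    //on error, release the file reference and free the resources
    if (bprm->file) 
    {
        //open_exec() denied write access so that no other process could modify the file through a memory mapping while it was being read; allow_write_access() lifts that restriction again
        allow_write_access(bprm->file);
        //drop the reference count on the file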
        fput(bprm->file);
    }

out_unmark:
    if (clear_in_exec)
        current->fs->in_exec = 0;
    current->in_execve = 0;

out_free:
    free_bprm(bprm);

out_files:
    if (displaced)
        reset_files_struct(displaced);
out_ret:
    return retval;
}
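
The idea of choosing a loader from the header bytes can be sketched in user space. In the kernel each registered binary-format handler inspects the buffer itself; the single classifier below, with the hypothetical name guess_format, is only an illustration and recognizes just two formats.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: classify the first bytes of a file, in the spirit of
   the magic-number checks done on the 128-byte header. */
const char *guess_format(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "\177ELF", 4) == 0)
        return "elf";        /* ELF magic: 0x7f 'E' 'L' 'F' */
    if (len >= 2 && buf[0] == '#' && buf[1] == '!')
        return "script";     /* interpreter line such as #!/bin/sh */
    return "unknown";
}
```

Reading the first 128 bytes of a file with fread() and passing them to guess_format() mirrors what prepare_binprm and the format handlers do together.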

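5. Copy On Write (COW)

fork creates a new task very quickly because it does not copy the original task's memory; instead both tasks share a "copy-on-write memory space". Two things about copy on write are worth understanding: the concept and the reason it exists

1. Two tasks can both read the memory freely, but as soon as either one tries to modify it, the memory is copied and the copy is given to the writer for its private use, so that the other task is not affected
2. In terms of motivation, copy on write is close in spirit to lazy binding in dynamic libraries. Normally, after fork the operating system would have to copy the parent's (parent task's) memory for the child, but that copy can take a long time and the child may never write to that memory. The operating system therefore "defers the copy": a page is copied only when it is actually written to
/*
The real cost of fork() is copying the parent's page tables and creating a unique process descriptor for the child. A process usually runs an executable (an ELF file) right after it is created, so this optimization avoids copying large amounts of data that would never be used (an address space often holds tens of megabytes), which greatly speeds up process creation on Linux
*/

Copy-on-write is an optimization strategy used in programming. The basic idea is that when several callers ask for the same resource, they all receive the same pointer to that resource; only when one caller tries to modify it does the system make a private copy for that caller, so that the change is not visible to the others. The process is transparent to all other callers. The main advantage is that no private copy is made if no caller ever modifies the resource

Copy on write is an example of the evaluation technique called lazy evaluation, which memory managers use widely. Lazy evaluation performs an expensive operation only when it is absolutely necessary; if the operation is never needed, no time is spent on it.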
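
The visible effect of copy on write after fork can be shown with a small sketch. The helper name parent_value_after_child_write is hypothetical; the code only shows that a write in the child does not reach the parent, not how the kernel copies the page.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child overwrites a variable; returns the value
   the parent still sees afterwards, or -1 on error. */
int parent_value_after_child_write(void)
{
    volatile int value = 1;
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        value = 2;                  /* the write goes to the child's private copy of the page */
        _exit(value == 2 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value;                   /* unchanged in the parent */
}
```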

Relevant Link:

http://zh.wikipedia.org/wiki/%E5%AF%AB%E5%85%A5%E6%99%82%E8%A4%87%E8%A3%BD
http://cookies5000.blog.163.com/blog/static/995922052009223112797/
http://www.cnblogs.com/biyeymyhjob/archive/2012/07/20/2601655.html
http://www.programlife.net/copy-on-write.html

6. The 7 APIs for executing a program on Linux

#include <unistd.h>
1. int execl(const char *path, const char *arg, ...);
2. int execlp(const char *file, const char *arg, ...);
3. int execle(const char *path, const char *arg, ..., char *const envp[]);
4. int execv(const char *path, char *const argv[]);
5. int execvp(const char *file, char *const argv[]);
6. int execve(const char *path, char *const argv[], char *const envp[]);
7. int fexecve(int fd, char *const argv[], char *const envp[]);

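These functions all execute an executable file and are collectively called the "exec functions"; they differ in how the command-line arguments and the environment variables are passed
All of them end up in the system call sys_execve(), implemented in \arch\x86\kernel\process_32.c, to run the executable
The exec family offers 7 functions, which differ in the following ways

1. The location of the new program is given as a "path"
2. The location of the new program is given as a "file name"
    1) with a file name, the program is searched for in the directories listed in the PATH environment variable
3. The arguments are passed as a parameter list
4. The arguments are passed as an argv[] array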

0x1: int execl(const char *pathname, const char *arg0, ... /* (char *)0 */ );

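execl() executes the program named by the pathname string; the second and following arguments are the argument list passed to the program, and the last argument must be a null pointer that marks the end of the list.

//File: execl.c 
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // run ls from /bin: argv[0] is the program name "ls", then "-al", then "/etc/passwd"
    execl("/bin/ls", "ls", "-al", "/etc/passwd", (char *) 0);
    
    // reached only if the first execl() failed; NULL also works as the terminator when cast to a pointer type
    execl("/bin/ls", "ls", "-al", "/etc/", (char *) NULL);
    perror("execl");
    return 1;
}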

0x2: int execv(const char *pathname, char *const argv[]);

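execv() executes the program named by the pathname string; the second argument is an array of pointers holding the program's argument list, and the last element of the array must be a null pointer.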

#include <unistd.h>

int main()
{
        char *argv[] = {"ls", "-l", "/etc", (char *)0};
        execv("/bin/ls", argv);
        return 0;
}

0x3: int execle(const char *pathname, const char *arg0, .../* (char *)0, char *const envp[] */ );

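execle() executes the program named by the pathname string; the second and following arguments are the argument list passed to the program and are terminated by a null pointer, and the final argument must point to a new environment array, which becomes the environment of the new program.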

#include <unistd.h>

int main(int argc, char *argv[], char *env[])
{
        execle("/bin/ls", "ls", "-l", "/etc", (char *)0, env);
        return 0;
}

0x4: int execve(const char *pathname, char *const argv[], char *const envp[]);

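execve() executes the file whose path is given by the pathname string; the second argument is an array of pointers passed to the program as its arguments and must end with a null pointer (NULL), and the last argument is the new environment array passed to the program

#include <unistd.h>

int main(void)
{
    char * argv[ ]={"ls", "-al", "/etc/passwd", (char *)0};
    char * envp[ ]={"PATH=/bin", 0};
    execve("/bin/ls", argv, envp);
    return 1;    // reached only if execve() failed
}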

0x5: int execlp(const char *filename, const char *arg0, ... /* (char *)0 */ );

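execlp() searches the directories listed in the PATH environment variable for a file whose name is given by the first argument, filename, and executes it once found; the second and following arguments are the argument list passed to the program, and the last argument must be a null pointer.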

#include <unistd.h>

int main()
{
        execlp("ls", "ls", "-l", "/etc", (char *)0);
        return 0;
}

0x6: int execvp(const char *filename, char *const argv[]);

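execvp() searches the directories listed in the PATH environment variable for a file whose name is given by the first argument, filename, and executes it once found; the second argument is an array of pointers holding the argument list, and its last element must be a null pointer.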

#include <unistd.h>

int main()
{
        char *argv[] = {"ls", "-l", "/etc", (char *)0};
        execvp("ls", argv);
        return 0;
}

0x7: int fexecve(int fd, char *const argv[], char *const envp[]);

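fexecve() performs the same task as execve(), except that the file to execute is specified by the file descriptor fd rather than by a path. The descriptor must be opened read-only and the caller must have permission to execute the file. On Linux, fexecve() is implemented using the proc filesystem, so /proc must be mounted and available at the time of the call
Note that fexecve operates on a file descriptor (fd). On Linux the descriptor can come from open() on an executable file or be inherited from the parent process. A named pipe is sometimes mentioned as a source as well, but the kernel only executes regular files, so that case is unlikely to work as is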

//lgo.c 
#include <stdio.h> 
#include <unistd.h> 

int main(int argc, char **argv) 
{ 
    extern char **environ; 
    (void) argc; 
    fexecve(0, argv, environ); 
    perror("fexecve"); 
    return 1; 
}
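
For comparison, here is a sketch that uses a descriptor obtained from open() on a regular executable. The helper name run_sh_via_fexecve is hypothetical, and the code assumes that /bin/sh exists and that fexecve() is usable on the system.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `sh -c command` through fexecve() in a child and
   return its exit status, or -1 on error. */
int run_sh_via_fexecve(const char *command)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        int fd = open("/bin/sh", O_RDONLY);   /* a descriptor instead of a path */
        if (fd == -1)
            _exit(126);
        char *const argv[] = {"sh", "-c", (char *)command, NULL};
        fexecve(fd, argv, environ);
        _exit(127);                           /* reached only if fexecve() failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```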

Relevant Link:

http://www.2cto.com/os/201410/342362.html
http://cpp.ezbty.org/import_doc/linux_manpage/fexecve.3.html
http://stackoverflow.com/questions/13690454/how-to-compile-and-execute-from-memory-directly
http://security.stackexchange.com/questions/20974/how-a-malware-executes-remote-payload

Note that the 7 process-execution APIs provided by glibc only act as adapters; internally they all end up calling the same function, "__execve"

\glibc-2.18\posix\execle.c

/* Execute PATH with all arguments after PATH until a NULL pointer,
   and the argument after that for environment.  */
int execle (const char *path, const char *arg, ...)
{
#define INITIAL_ARGV_MAX 1024
  size_t argv_max = INITIAL_ARGV_MAX;
  const char *initial_argv[INITIAL_ARGV_MAX];
  const char **argv = initial_argv;
  va_list args;
  argv[0] = arg;

  va_start (args, arg);
  unsigned int i = 0;
  while (argv[i++] != NULL)
    {
      if (i == argv_max)
    {
      argv_max *= 2;
      const char **nptr = realloc (argv == initial_argv ? NULL : argv,
                       argv_max * sizeof (const char *));
      if (nptr == NULL)
        {
          if (argv != initial_argv)
        free (argv);
          return -1;
        }
      if (argv == initial_argv)
        /* We have to copy the already filled-in data ourselves.  */
        memcpy (nptr, argv, i * sizeof (const char *));

      argv = nptr;
    }

      argv[i] = va_arg (args, const char *);
    }

  const char *const *envp = va_arg (args, const char *const *);
  va_end (args);

  int ret = __execve (path, (char *const *) argv, (char *const *) envp);
  if (argv != initial_argv)
    free (argv);

  return ret;
}
libc_hidden_def (execle)

\glibc-2.18\sysdeps\unix\sysv\linux\execve.c

int __execve (file, argv, envp)
     const char *file;
     char *const argv[];
     char *const envp[];
{
  return INLINE_SYSCALL (execve, 3, file, argv, envp);
}
weak_alias (__execve, execve)

7. Source analysis of the glibc execve and fork APIs

0x1: execve

\glibc-2.18\sysdeps\unix\sysv\linux\execve.c

int __execve (file, argv, envp)
     const char *file;
     char *const argv[];
     char *const envp[];
{
  return INLINE_SYSCALL (execve, 3, file, argv, envp);
}
weak_alias (__execve, execve)
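
What the wrapper boils down to can be imitated with the generic syscall() function. This is a sketch rather than glibc's code: the helper name raw_execve_exit_status is hypothetical, and /bin/sh with the command "exit 3" is just a convenient target.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: exec /bin/sh through the raw system call in a child
   and return the child's exit status, or -1 on error. */
int raw_execve_exit_status(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        char *const argv[] = {"sh", "-c", "exit 3", NULL};
        char *const envp[] = {NULL};
        /* the same three arguments that __execve hands to INLINE_SYSCALL */
        syscall(SYS_execve, "/bin/sh", argv, envp);
        _exit(127);                 /* reached only if the system call failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```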

0x2: fork

pid_t __libc_fork (void)
{
  pid_t pid;
  struct used_handler
  {
    struct fork_handler *handler;
    struct used_handler *next;
  } *allp = NULL;

  /* Run all the registered preparation handlers.  In reverse order.
     While doing this we build up a list of all the entries.  */
  struct fork_handler *runp;
  while ((runp = __fork_handlers) != NULL)
    {
      /* Make sure we read from the current RUNP pointer.  */
      atomic_full_barrier ();

      unsigned int oldval = runp->refcntr;

      if (oldval == 0)
    /* This means some other thread removed the list just after
       the pointer has been loaded.  Try again.  Either the list
       is empty or we can retry it.  */
    continue;

      /* Bump the reference counter.  */
      if (atomic_compare_and_exchange_bool_acq (&__fork_handlers->refcntr,
                        oldval + 1, oldval))
    /* The value changed, try again.  */
    continue;

      /* We bumped the reference counter for the first entry in the
     list.  That means that none of the following entries will
     just go away.  The unloading code works in the order of the
     list.

     While executing the registered handlers we are building a
     list of all the entries so that we can go backward later on.  */
      while (1)
    {
      /* Execute the handler if there is one.  */
      if (runp->prepare_handler != NULL)
        runp->prepare_handler ();

      /* Create a new element for the list.  */
      struct used_handler *newp
        = (struct used_handler *) alloca (sizeof (*newp));
      newp->handler = runp;
      newp->next = allp;
      allp = newp;

      /* Advance to the next handler.  */
      runp = runp->next;
      if (runp == NULL)
        break;

      /* Bump the reference counter for the next entry.  */
      atomic_increment (&runp->refcntr);
    }

      /* We are done.  */
      break;
    }

  _IO_list_lock ();

#ifndef NDEBUG
  pid_t ppid = THREAD_GETMEM (THREAD_SELF, tid);
#endif

  /* We need to prevent the getpid() code to update the PID field so
     that, if a signal arrives in the child very early and the signal
     handler uses getpid(), the value returned is correct.  */
  pid_t parentpid = THREAD_GETMEM (THREAD_SELF, pid);
  THREAD_SETMEM (THREAD_SELF, pid, -parentpid);

#ifdef ARCH_FORK
  pid = ARCH_FORK ();
#else
# error "ARCH_FORK must be defined so that the CLONE_SETTID flag is used"
  pid = INLINE_SYSCALL (fork, 0);
#endif


  if (pid == 0)
    {
      struct pthread *self = THREAD_SELF;

      assert (THREAD_GETMEM (self, tid) != ppid);

      if (__fork_generation_pointer != NULL)
    *__fork_generation_pointer += 4;

      /* Adjust the PID field for the new process.  */
      THREAD_SETMEM (self, pid, THREAD_GETMEM (self, tid));

#if HP_TIMING_AVAIL
      /* The CPU clock of the thread and process have to be set to zero.  */
      hp_timing_t now;
      HP_TIMING_NOW (now);
      THREAD_SETMEM (self, cpuclock_offset, now);
      GL(dl_cpuclock_offset) = now;
#endif

#ifdef __NR_set_robust_list
      /* Initialize the robust mutex list which has been reset during
     the fork.  We do not check for errors since if it fails here
     it failed at process start as well and noone could have used
     robust mutexes.  We also do not have to set
     self->robust_head.futex_offset since we inherit the correct
     value from the parent.  */
# ifdef SHARED
      if (__builtin_expect (__libc_pthread_functions_init, 0))
    PTHFCT_CALL (ptr_set_robust, (self));
# else
      extern __typeof (__nptl_set_robust) __nptl_set_robust
    __attribute__((weak));
      if (__builtin_expect (__nptl_set_robust != NULL, 0))
    __nptl_set_robust (self);
# endif
#endif

      /* Reset the file list.  These are recursive mutexes.  */
      fresetlockfiles ();

      /* Reset locks in the I/O code.  */
      _IO_list_resetlock ();

      /* Reset the lock the dynamic loader uses to protect its data.  */
      __rtld_lock_initialize (GL(dl_load_lock));

      /* Run the handlers registered for the child.  */
      while (allp != NULL)
    {
      if (allp->handler->child_handler != NULL)
        allp->handler->child_handler ();

      /* Note that we do not have to wake any possible waiter.
         This is the only thread in the new process.  The count
         may have been bumped up by other threads doing a fork.
         We reset it to 1, to avoid waiting for non-existing
         thread(s) to release the count.  */
      allp->handler->refcntr = 1;

      /* XXX We could at this point look through the object pool
         and mark all objects not on the __fork_handlers list as
         unused.  This is necessary in case the fork() happened
         while another thread called dlclose() and that call had
         to create a new list.  */

      allp = allp->next;
    }

      /* Initialize the fork lock.  */
      __fork_lock = LLL_LOCK_INITIALIZER;
    }
  else
    {
      assert (THREAD_GETMEM (THREAD_SELF, tid) == ppid);

      /* Restore the PID value.  */
      THREAD_SETMEM (THREAD_SELF, pid, parentpid);

      /* We execute this even if the 'fork' call failed.  */
      _IO_list_unlock ();

      /* Run the handlers registered for the parent.  */
      while (allp != NULL)
    {
      if (allp->handler->parent_handler != NULL)
        allp->handler->parent_handler ();

      if (atomic_decrement_and_test (&allp->handler->refcntr)
          && allp->handler->need_signal)
        lll_futex_wake (allp->handler->refcntr, 1, LLL_PRIVATE);

      allp = allp->next;
    }
    }

  return pid;
}
weak_alias (__libc_fork, __fork)
libc_hidden_def (__fork)
weak_alias (__libc_fork, fork)
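
The prepare, parent and child handlers that __libc_fork runs are the ones registered with pthread_atfork(). The following sketch shows which handlers run on which side of the fork; the helper name atfork_trace and the bit values are assumptions made for illustration.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int trace = 0;                          /* bit flags: which handlers ran */

static void on_prepare(void) { trace |= 1; }   /* before fork, in the caller */
static void on_parent(void)  { trace |= 2; }   /* after fork, in the parent */
static void on_child(void)   { trace |= 4; }   /* after fork, in the child */

/* Hypothetical helper: stores the bits seen in the parent and in the child;
   returns 0 on success, -1 on error. */
int atfork_trace(int *parent_bits, int *child_bits)
{
    trace = 0;
    if (pthread_atfork(on_prepare, on_parent, on_child) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(trace);                          /* report the child's bits as exit status */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    *parent_bits = trace;
    *child_bits = WEXITSTATUS(status);
    return 0;
}
```

The parent ends up with the prepare and parent bits set, the child with the prepare and child bits.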

8. Tools for observing process startup

To observe how a process starts, two tools can be used: strace and LD_DEBUG

source:

#include <stdlib.h>  
#include <stdio.h>  
 
int main() { printf("hello world\n"); return 0; }

Compile the program:

gcc -o hello -O2 hello.c

strace -tt ./hello

05:47:11.645477 execve("./hello", ["./hello"], [/* 38 vars */]) = 0
05:47:11.646521 brk(0) = 0x82f8000
05:47:11.646660 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77fd000
05:47:11.646745 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
05:47:11.646929 open("/etc/ld.so.cache", O_RDONLY) = 3
05:47:11.647012 fstat64(3, {st_mode=S_IFREG|0644, st_size=50450, ...}) = 0
05:47:11.647176 mmap2(NULL, 50450, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77f0000
05:47:11.647223 close(3) = 0
05:47:11.647348 open("/lib/libc.so.6", O_RDONLY) = 3
05:47:11.647409 read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@\356X\0004\0\0\0"..., 512) = 512
05:47:11.647496 fstat64(3, {st_mode=S_IFREG|0755, st_size=1906124, ...}) = 0
05:47:11.647605 mmap2(0x578000, 1665416, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x578000
05:47:11.647648 mprotect(0x708000, 4096, PROT_NONE) = 0
05:47:11.647693 mmap2(0x709000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x190) = 0x709000
05:47:11.647761 mmap2(0x70c000, 10632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x70c000
05:47:11.647819 close(3) = 0
05:47:11.648707 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77ef000
05:47:11.648797 set_thread_area({entry_number:-1 -> 6, base_addr:0xb77ef6c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
05:47:11.649201 mprotect(0x709000, 8192, PROT_READ) = 0
05:47:11.649272 mprotect(0x570000, 4096, PROT_READ) = 0
05:47:11.649326 munmap(0xb77f0000, 50450) = 0
05:47:11.649560 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
05:47:11.649678 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77fc000
05:47:11.649754 write(1, "hello world\n", 12hello world
) = 12
05:47:11.649829 exit_group(0) = ?

LD_DEBUG=libs ./hello

     26605:    find library=libc.so.6 [0]; searching
     26605:     search cache=/etc/ld.so.cache
     26605:      trying file=/lib/libc.so.6
     26605:
     26605:
     26605:    calling init: /lib/libc.so.6
     26605:
     26605:
     26605:    initialize program: ./hello
     26605:
     26605:
     26605:    transferring control: ./hello
     26605:
hello world
     26605:
     26605:    calling fini: ./hello [0]
     26605:
     26605:
     26605:    calling fini: /lib/libc.so.6 [0]
     26605:

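9. Thread creation on Linux

Linux has no thread definition as explicit as the one in Windows; seen another way, threads on Linux are closer to what a "thread" essentially is: the difference between a thread and a process is a matter of how many resources are shared, and the boundary between the two need not be sharp. Threads can be created on Linux in the following ways

1. clone() with sharing flags: the resources to share are CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND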
2. Linux POSIX Thread
3. LinuxThreads:was a partial implementation of POSIX Threads
4. Native POSIX Thread Library(NPTL)

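Note that the lowest-level kernel support for thread creation on Linux is still the fork path (do_fork(), reached through clone()), which accepts a rich set of sharing flags and so offers different degrees of sharing between "parent and child processes (threads)". POSIX threads and the thread libraries that implement POSIX are built on it. Note, however, that Linux threads are not a simple wrapper around user-space fork, but a thread library implementation backed by kernel-level support

0x1: Linux POSIX Thread

POSIX (Portable Operating System Interface) threads are a powerful way to improve the responsiveness and performance of code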

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

 void *thread_function(void *arg) 
 {
    int i;
    for ( i=0; i<20; i++) 
    {
        printf("Thread says hi!\n");
        sleep(1);
    }
    return NULL;
}

int main(void) 
{
    pthread_t mythread;

    if ( pthread_create( &mythread, NULL, thread_function, NULL) ) 
    {
        printf("error creating thread.");
        abort();
    }
    if ( pthread_join ( mythread, NULL ) ) 
    {
        printf("error joining thread.");
        abort();
    }
    exit(0);
}

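When fork() creates a new process, the new process is the child and the original process is the parent. This creates a hierarchy that can be very useful, especially when waiting for children to terminate. For example, waitpid() lets the current process wait for its child processes to terminate, and is used to implement simple cleanup in the parent
POSIX threads have no such hierarchy. Although the main thread can create a new thread, and that thread can create another, the POSIX threads standard treats them all as peers. The notion of waiting for a child thread to exit therefore has no meaning here; the POSIX threads standard records no "family" information

The lack of family information has one major implication: to wait for a thread to terminate, you must pass that thread's tid to pthread_join(). The thread library cannot determine the tid for you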

Relevant Link:

http://www.ibm.com/developerworks/cn/linux/thread/posix_thread1/

0x2: LinuxThreads:was a partial implementation of POSIX Threads

In the Linux operating system, LinuxThreads was a partial implementation of POSIX Threads. It has since been superseded by the Native POSIX Thread Library (NPTL)

Relevant Link:

http://www.ibm.com/developerworks/cn/linux/l-threading.html
http://en.wikipedia.org/wiki/LinuxThreads

0x3: Native POSIX Thread Library(NPTL)

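The Native POSIX Thread Library (NPTL) is software that lets programs written with POSIX Threads run more efficiently on the Linux kernel
NPTL's approach is similar to that of LinuxThreads: the primary abstraction the kernel sees is still a process, and new threads are created with the clone() system call. NPTL, however, needs special kernel support to handle contention on synchronization primitives, where threads must be able to sleep and be woken again. The primitive used for this is called a futex
NPTL is a so-called 1:1 threading library: each thread created by the user corresponds to exactly one schedulable entity in the kernel. This is the simplest of the possible threading models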

getconf GNU_LIBPTHREAD_VERSION
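
The same information is available to a program through confstr(). This sketch assumes glibc, which defines the _CS_GNU_LIBPTHREAD_VERSION constant; the helper name libpthread_version is hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: copy the thread library version string into buf;
   returns 0 on success, -1 if it is unavailable or buf is too small. */
int libpthread_version(char *buf, size_t len)
{
    /* confstr() returns the size needed for the string including the NUL,
       or 0 if the name is not supported */
    size_t needed = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    return (needed == 0 || needed > len) ? -1 : 0;
}
```

On an NPTL system the string starts with "NPTL", followed by the library version.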

Relevant Link:

http://zh.wikipedia.org/wiki/Native_POSIX_Thread_Library

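10. POSIX threads

0x1: Thread creation

1. Threads and processes

Compared with a process, a thread is a concept closer to an execution unit: it can share data with the other threads of the same process, but it has its own stack and its own execution sequence. Threads and processes were introduced on top of serial programs to increase concurrency and thereby improve efficiency and response time. On Linux, the thread is really a pseudo-concept

1. Every new process on Linux is "duplicated" from its parent, and parent and child can share resources to different degrees
2. Parent and child are different processes to begin with, so it is natural that each has its own stack and execution sequence
3. Linux distinguishes processes from threads only through clone_flags, that is, through the degree of resource sharing

Threads and processes each have advantages and disadvantages in use: threads are cheap to run but make resource management and protection harder; processes are the opposite. In addition, threads suit SMP machines, while processes can be migrated across machines

2. Creating a thread

POSIX creates threads with pthread_create(), declared as follows

int  pthread_create(pthread_t * thread, const pthread_attr_t * attr, void * (*start_routine)(void *), void * arg);

Unlike creating a process with fork()

1. A thread created by pthread_create() does not continue the execution sequence of the main thread (the thread that called pthread_create()); it runs start_routine(arg) instead
2. The thread parameter receives the ID of the created thread
3. The attr parameter supplies the attributes to use for the new thread (NULL selects the defaults)
4. The return value of pthread_create() says whether thread creation succeeded
5. Although arg has type void *, it can carry an argument of any type to start_routine()
6. start_routine() returns a void * value, which can likewise stand for another type and is retrieved with pthread_join()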
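
Points 5 and 6 can be sketched as follows. The helper name square_in_thread is hypothetical, and carrying an integer inside the void * argument is an illustrative shortcut, not the only way to pass data.

```c
#include <pthread.h>
#include <stdint.h>

/* start_routine: receives its argument and returns its result as void * */
static void *square(void *arg)
{
    intptr_t n = (intptr_t)arg;     /* arg carries an integer here, not a real pointer */
    return (void *)(n * n);
}

/* Hypothetical helper: compute n * n in a new thread; returns -1 on error. */
intptr_t square_in_thread(intptr_t n)
{
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, square, (void *)n) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)   /* fetches start_routine's return value */
        return -1;
    return (intptr_t)result;
}
```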

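Now look at thread creation in the Linux POSIX thread library from a more fundamental angle

1. Both the fork and clone system calls create new processes on Linux; internally both call the kernel function do_fork(), only with different arguments
2. The POSIX thread library is essentially a wrapper around the clone system call, as strace on code that calls pthread_create shows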
/*clone(child_stack=0x7fe4b92daff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fe4b92db9d0, tls=0x7fe4b92db700, child_tidptr=0x7fe4b92db9d0) = 4634
*/
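3. Here fork and clone differ as follows
    1) fork cannot change the stack of the duplicated child, so it can only produce a "complete" copy of the parent
    2) clone lets the caller specify the child's stack, so a child with an arbitrary entry function can be created; that is, each thread can run whatever task function is needed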
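
The effect of CLONE_THREAD in the strace output above can be observed directly: a thread reports the same PID (thread group ID) as its creator but has its own TID. This is a minimal sketch; the helper name thread_shares_pid is hypothetical, and the TID is read with the raw gettid system call.

```c
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids { pid_t pid; pid_t tid; };

static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->pid = getpid();                     /* thread group ID, shared */
    out->tid = (pid_t)syscall(SYS_gettid);   /* ID of this task only */
    return NULL;
}

/* Hypothetical helper: returns 1 if the new thread has the caller's PID but
   a different TID, 0 otherwise, -1 on error. */
int thread_shares_pid(void)
{
    struct ids worker = {0, 0};
    pthread_t t;

    if (pthread_create(&t, NULL, record_ids, &worker) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    pid_t my_tid = (pid_t)syscall(SYS_gettid);
    return worker.pid == getpid() && worker.tid != my_tid;
}
```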

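3. Thread creation attributes

The attr parameter of pthread_create() is a pointer to a structure whose members correspond to the run-time attributes of the new thread. The main ones are

typedef struct
{
    /*
    Detach state of the thread
    detachstate says whether the new thread is detached from synchronization with the other threads in the process
    1. If set, the new thread cannot be synchronized with pthread_join() and releases its resources by itself on exit. The default is PTHREAD_CREATE_JOINABLE
    2. This attribute can also be set with pthread_detach() after the thread has been created and is running
    3. Once a thread is in the PTHREAD_CREATE_DETACHED state (whether set at creation or at run time) it cannot return to PTHREAD_CREATE_JOINABLE
    */
    int detachstate;  
    
    /*
    Scheduling policy
    schedpolicy is the scheduling policy of the new thread, mainly
    1. SCHED_OTHER (normal, non-real-time): the default
    2. SCHED_RR (real-time, round robin): only available to the superuser
    3. SCHED_FIFO (real-time, first in first out): only available to the superuser
    It can be changed at run time with pthread_setschedparam() 
    */
    int schedpolicy;   
    
    /*
    Scheduling parameters
    schedparam is a struct sched_param, which currently has a single integer member, sched_priority, giving the thread's priority. It only takes effect when the policy is real-time (SCHED_RR or SCHED_FIFO)
    It can be changed at run time with pthread_setschedparam(); the default is 0
    */
    struct sched_param schedparam;  

    /*
    Scheduling inheritance
    inheritsched takes one of two values, PTHREAD_EXPLICIT_SCHED and PTHREAD_INHERIT_SCHED: the former means the new thread uses the explicitly specified policy and parameters (the values in attr), the latter means it inherits the values of the calling thread. The default given here is PTHREAD_EXPLICIT_SCHED
    */
    int inheritsched;   
    
    /*
    Contention scope
    scope is the range within which threads compete for the CPU, that is, the range in which thread priorities are valid. The POSIX standard defines two values
    1. PTHREAD_SCOPE_SYSTEM: compete for CPU time with all threads in the system; LinuxThreads currently implements only PTHREAD_SCOPE_SYSTEM
    2. PTHREAD_SCOPE_PROCESS: compete only with threads of the same process 
    */
    int scope;            
    size_t    guardsize;        //size of the guard area at the end of the thread stack
    int stackaddr_set;
    void * stackaddr;        //location of the thread stack
    size_t stacksize;        //size of the thread stack
}pthread_attr_t;

The members of pthread_attr_t are not set through pthread_create() itself; to set them, POSIX defines a family of attribute functions, including

1. pthread_attr_init()
2. pthread_attr_destroy()
3. the pthread_attr_getxxx / pthread_attr_setxxx functions for the individual attributes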
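
A typical use of these functions is sketched below: initialize an attribute object, set the detach state and stack size, create a thread with it, and destroy the object. The helper name create_with_stack_size is hypothetical and the error handling is kept to a minimum.

```c
#include <pthread.h>
#include <stddef.h>

static void *noop(void *arg) { return arg; }

/* Hypothetical helper: create a joinable thread with the requested stack
   size; returns the size stored in the attribute object, or 0 on error. */
size_t create_with_stack_size(size_t wanted)
{
    pthread_attr_t attr;
    pthread_t tid;
    size_t actual = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE) != 0 ||
        pthread_attr_setstacksize(&attr, wanted) != 0 ||
        pthread_attr_getstacksize(&attr, &actual) != 0 ||
        pthread_create(&tid, &attr, noop, NULL) != 0) {
        pthread_attr_destroy(&attr);
        return 0;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);    /* the attribute object is no longer needed */
    return actual;
}
```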

4. 線程建立的Linux POSIX實現

咱們知道，Linux的線程實現是內核外進行的，內核內提供的是建立進程的接口do_fork()。內核提供了兩個系統調用__clone()和fork()，最終都用不一樣的參數調用do_fork()核內API。固然，要想實現線程，沒有核心對多進程(實際上是輕量級進程)共享數據段的支持是不行的，所以，do_fork()提供了不少參數，包括

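1. CLONE_VM (share the memory space)
2. CLONE_FS (share filesystem information)
3. CLONE_FILES (share the file descriptor table)
4. CLONE_SIGHAND (share the signal handler table)
5. CLONE_PID (share the process ID; valid only for the in-kernel process, i.e. process 0)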

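When the fork system call is used, the kernel calls do_fork() without any of the sharing flags, so the new process has an independent execution environment
When pthread_create() is used to create a thread, these sharing flags are eventually all set in the call to __clone() and passed on in full to the in-kernel do_fork(), so the "process" that is created shares its execution environment with its creator; only the stack is separate
Inside the kernel a Linux thread exists as a lightweight process with its own process table entry, while creation, synchronization, destruction and so on are all handled outside the kernel in the pthread library. The pthread library uses a manager thread (__pthread_manager(), exactly one per process) to manage thread creation and termination, assign thread IDs, and send thread-related signals (such as Cancel); the main thread (the caller of pthread_create()) passes its requests to the manager thread over a pipe. The JVM on Linux works on a similar principle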
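
The contrast between the two execution environments can be observed from user space. The sketch below is a minimal illustration with hypothetical names (shared_value, value_after_fork, value_after_thread): a write made by a fork()ed child stays in the child's private copy of memory, while a write made by a thread is visible to its creator.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value;   /* global that both experiments try to modify */

static void *set_in_thread(void *arg)
{
    (void)arg;
    shared_value = 1;      /* same address space as the creator */
    return NULL;
}

/* Returns the creator's view of shared_value after a fork()ed child wrote to it,
   or -1 on error. */
int value_after_fork(void)
{
    shared_value = 0;
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        shared_value = 1;  /* changes only the child's private copy */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) != pid)
        return -1;
    return shared_value;
}

/* Returns the creator's view of shared_value after a thread wrote to it,
   or -1 on error. */
int value_after_thread(void)
{
    pthread_t tid;

    shared_value = 0;
    if (pthread_create(&tid, NULL, set_in_thread, NULL) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return shared_value;
}
```

After the fork experiment the creator still sees 0; after the thread experiment it sees 1, which is the effect of CLONE_VM being set for threads and not for fork().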

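0x2: Thread cancellation

1. Definition of thread cancellation

Normally a thread terminates on its own when its main function returns, but it can also be forced to terminate when it receives a termination (cancellation) request from another thread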

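2. Semantics of thread cancellation

A thread is cancelled by sending the Cancel signal to the target thread, but how that signal is handled is up to the target thread and depends on its cancellation state. It may

1. ignore the signal
2. terminate immediately
3. keep running until it reaches a cancellation point

The default handling of the CANCEL signal (that is, the default for a thread created by pthread_create()) is to keep running until a cancellation point: the thread is marked CANCELED and continues to run, and it only exits when it reaches a cancellation point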

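3. Cancellation points

According to the POSIX standard, pthread_join(), pthread_testcancel(), pthread_cond_wait(), pthread_cond_timedwait(), sem_wait(), sigwait() and blocking system calls such as read() and write() are cancellation points, while the other pthread functions do not trigger cancellation. Under LinuxThreads, however, the C library functions were not implemented as cancellation points (the LinuxThreads pthread_cancel man page lists this as a known limitation); what the CANCEL signal does there is make the thread return from a blocking system call with the EINTR error code. To get the behavior POSIX requires, pthread_testcancel() can therefore be called before and after any system call that should act as a cancellation point, as in the following fragment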

pthread_testcancel();
retcode = read(fd, buffer, length);
pthread_testcancel();

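4. Programming considerations

If a thread is in an infinite loop and no path through the loop body is guaranteed to reach a cancellation point, a cancellation request from another thread can never terminate it. Such a loop should therefore call pthread_testcancel() on a path that every iteration takes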
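
A minimal sketch of this advice, assuming the default cancellation settings (enabled, deferred). The names spin and cancel_spinning_thread are hypothetical. The loop body has no cancellation point of its own, so the explicit pthread_testcancel() call is what lets the cancellation request take effect.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical worker: an endless loop with one explicit cancellation point. */
static void *spin(void *arg)
{
    (void)arg;
    for (;;) {
        /* ... loop body with no cancellation point of its own ... */
        pthread_testcancel();   /* explicit cancellation point on every iteration */
    }
    return NULL;                /* not reached */
}

/* Starts the spinning thread, cancels it and joins it.
   Returns 1 if the thread ended because of cancellation, 0 if it ended
   some other way, and -1 if one of the pthread calls failed. */
int cancel_spinning_thread(void)
{
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, spin, NULL) != 0)
        return -1;
    if (pthread_cancel(tid) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;
    return result == PTHREAD_CANCELED;
}
```

Without the pthread_testcancel() call the pthread_join() here would block forever, because the request would stay pending and the loop would never reach a cancellation point.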

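5. pthread functions related to thread cancellation

1. int pthread_cancel(pthread_t thread);
Sends a cancellation request to the thread given by thread. Returns 0 on success and a non-zero value otherwise. A successful send does not mean that the thread will terminate

2. int pthread_setcancelstate(int state, int *oldstate);
Sets how the calling thread reacts to the Cancel signal. state takes one of two values
    1) PTHREAD_CANCEL_ENABLE (default): on receiving the signal the thread is marked CANCELED
    2) PTHREAD_CANCEL_DISABLE: the CANCEL signal has no effect and the thread keeps running (POSIX holds the request pending until cancellation is enabled again)
If oldstate is not NULL, the previous cancellation state is stored there so that it can be restored

3. int pthread_setcanceltype(int type, int *oldtype);
Sets when the calling thread acts on a cancellation request. type takes one of two values
    1) PTHREAD_CANCEL_DEFERRED: after receiving the signal, keep running until the next cancellation point and exit there
    2) PTHREAD_CANCEL_ASYNCHRONOUS: after receiving the signal, perform the cancellation (exit) immediately
The type only matters while the cancellation state is enabled
If oldtype is not NULL, the previous cancellation type is stored there

4. void pthread_testcancel(void);
Checks whether the calling thread is in the CANCELED state; if so, performs the cancellation, otherwise returns immediately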
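
These calls are often combined to protect a section of code that must not be interrupted. The sketch below is one way to do that, assuming POSIX semantics in which a request made while cancellation is disabled stays pending. The names guarded_worker, run_guarded_cancel, cancel_sent and steps_done are hypothetical. The worker waits until the request is known to have been sent, so the guarded section demonstrably runs with a request pending.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int cancel_sent;   /* set by the creator once pthread_cancel() has returned */
static int steps_done;           /* written by the worker, read after pthread_join() */

static void *guarded_worker(void *arg)
{
    int oldstate;
    (void)arg;

    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
    while (!atomic_load(&cancel_sent))
        sched_yield();               /* wait until a request is known to be pending */
    for (int i = 0; i < 3; i++) {
        pthread_testcancel();        /* no effect while cancellation is disabled */
        steps_done++;
    }
    pthread_setcancelstate(oldstate, NULL);   /* restore the previous state */

    pthread_testcancel();            /* the pending request is acted on here */
    return NULL;                     /* not reached when a request was pending */
}

/* Returns the number of guarded steps that completed before the thread was
   cancelled, or -1 on error or if the thread was not cancelled. */
int run_guarded_cancel(void)
{
    pthread_t tid;
    void *result = NULL;

    steps_done = 0;
    atomic_store(&cancel_sent, 0);
    if (pthread_create(&tid, NULL, guarded_worker, NULL) != 0)
        return -1;
    int rc = pthread_cancel(tid);
    atomic_store(&cancel_sent, 1);   /* release the worker even if the cancel failed */
    if (pthread_join(tid, &result) != 0 || rc != 0)
        return -1;
    return result == PTHREAD_CANCELED ? steps_done : -1;
}
```

All three guarded steps complete even though the request arrives first; the thread exits only at the pthread_testcancel() call made after the old state has been restored.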

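0x3: Using the flags to tell whether a thread is being created

This covers threads created by pthread_create, by any other POSIX library, and by native thread creation. Underneath, they are all created through the clone system call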

int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, ... /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

flags:
1. CLONE_CHILD_CLEARTID 
2. CLONE_CHILD_SETTID 
3. CLONE_FILES 
4. CLONE_FS 
5. CLONE_IO 
6. CLONE_NEWIPC 
7. CLONE_NEWNET 
8. CLONE_NEWNS 
9. CLONE_NEWPID 
10. CLONE_NEWUTS 
11. CLONE_PARENT 
12. CLONE_PARENT_SETTID 
13. CLONE_PID 
14. CLONE_PTRACE 
15. CLONE_SETTLS 
16. CLONE_SIGHAND: share the signal handlers
17. CLONE_STOPPED 
18. CLONE_SYSVSEM 
19. CLONE_THREAD: place the new task in the same thread group as its creator, so that all threads of one process belong to one thread group
20. CLONE_UNTRACED
21. CLONE_VFORK
22. CLONE_VM: share the virtual memory space

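At the most basic level, a thread has to meet the following requirements

1. A thread must share its address space with the other threads of the same process, so CLONE_VM is required
2. By POSIX convention, threads must share signal handling, so CLONE_SIGHAND is required as well
3. All threads of the same process must be in one thread group, so CLONE_THREAD is also required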

/*
As the man page shows, setting CLONE_THREAD requires CLONE_SIGHAND to be set, and setting CLONE_SIGHAND requires CLONE_VM to be set. The kernel's copy_process function contains:
if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND)) 
                 return ERR_PTR(-EINVAL); 
if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM)) 
                 return ERR_PTR(-EINVAL); 
*/

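By examining the clone_flags argument that is passed in, we can tell whether a process or a thread is being created

CLONE_VM | CLONE_THREAD | CLONE_SIGHAND
//If all three flags are present together, a thread is being created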
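
The check can be written as a small helper. This is a user-space sketch that takes the flag constants from <sched.h>; is_thread_creation and THREAD_FLAGS are hypothetical names, and code running inside the kernel would take the same constants from the kernel headers instead.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* The three flags that, by the reasoning above, every thread-creating clone() carries. */
#define THREAD_FLAGS (CLONE_VM | CLONE_SIGHAND | CLONE_THREAD)

/* Hypothetical helper: true when clone_flags describes thread creation. */
bool is_thread_creation(unsigned long clone_flags)
{
    /* All three bits must be set; any extra flags are ignored. */
    return (clone_flags & THREAD_FLAGS) == THREAD_FLAGS;
}
```

Applied to the flag set in the strace output of pthread_create() shown earlier, the helper returns true; for a flag set without CLONE_THREAD, such as the one used for fork(), it returns false.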

Relevant Link:

https://www.ibm.com/developerworks/cn/linux/thread/posix_threadapi/part1/
