Analyzing a crash caused by a custom std::sort comparison function

Date: 2019-12-15

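I have not written a summary post in two years, so as a warm-up, here is a write-up of a crash case I ran into recently.
Note: the analysis below is based on GCC 4.4.6.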

1. Fixing the crash

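We have a complex sort that involves many factors, implemented with std::sort and a custom comparison function. The compare function looks like the following pseudocode: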

bool compare(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) {
        return left.a > right.a;
    }
    if (left.b != right.b) {
        return left.b > right.b;
    }
    // ... (further factors elided)
}

Later, we added more complex logic to the comparison function:

bool compare(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) {
        return left.a > right.a;
    }
    if (left.b != right.b) {
        return left.b > right.b;
    }
    if (left.c != 0 && right.c != 0 && left.c != right.c) {
        // Compare by c only when both objects have the c attribute
        return left.c > right.c;
    }
    if (left.d != right.d) {
        return left.d > right.d;
    }
    // ... (further factors elided)
}

After the service was released, the process began to crash intermittently. Inspecting it with gdb showed the following call stack:

/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:5260
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:2194
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:2161
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:2084

Crash location: an out-of-bounds access occurred while the standard library was calling the compare function to perform a comparison.

At this point we began to suspect that the compare function did not follow the standard library's requirements, and looked up related resources:

    https://stackoverflow.com/questions/41488093/why-do-i-get-runtime-error-when-comparison-function-in-stdsort-always-return-t

    https://en.cppreference.com/w/cpp/named_req/Compare

A careful read of the official documentation reveals the problem:

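Our compare function's handling of the c attribute does not strictly satisfy transitivity: if comp(a,b)==true and comp(b,c)==true then comp(a,c)==true. Suppose there are three objects A, B and C whose a and b attributes are equal:

(1) A and B both have attribute c, and A.c > B.c, so by our comparison function A > B;

(2) C has no attribute c, and C.d > A.d, so C > A;

(3) C has no attribute c, and B.d > C.d, so B > C.

In summary, A > B and B > C, but C > A, which violates the transitivity requirement of strict weak ordering.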
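
The cycle described above can be checked with a small program. This is only a minimal sketch: the FakeObj layout, the concrete values, the helpers has_cycle and demo_cycle, and the final "return false" that stands in for the elided tail of the comparator are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the article's FakeObj (layout assumed).
struct FakeObj {
    int a;
    int b;
    int c;  // 0 means "attribute c is absent"
    int d;
};

// The comparator from the article; "return false" replaces the elided tail (assumption).
bool compare(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) {
        return left.a > right.a;
    }
    if (left.b != right.b) {
        return left.b > right.b;
    }
    if (left.c != 0 && right.c != 0 && left.c != right.c) {
        return left.c > right.c;
    }
    if (left.d != right.d) {
        return left.d > right.d;
    }
    return false;
}

// True when x > y, y > z and z > x all hold, i.e. the comparator contains a cycle.
bool has_cycle(const FakeObj& x, const FakeObj& y, const FakeObj& z) {
    return compare(x, y) && compare(y, z) && compare(z, x);
}

// Illustrative values: A.c > B.c, C has no c, and B.d > C.d > A.d.
const FakeObj A = {1, 1, 5, 1};
const FakeObj B = {1, 1, 3, 3};
const FakeObj C = {1, 1, 0, 2};

// Usage example: the three objects form the cycle A > B > C > A.
void demo_cycle() {
    assert(compare(A, B));
    assert(compare(B, C));
    assert(compare(C, A));
    assert(has_cycle(A, B, C));
}
```

Because compare(A, B) and compare(B, C) are both true while compare(A, C) is false, no consistent ordering of the three objects exists, which is exactly what std::sort is entitled to assume never happens.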

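With that, our case was solved. In practice, though, it took a long time, for the following reasons:

(1) The code of our compare function was not added step by step but written all at once, so we did not immediately suspect a bug in the comparison of the c attribute;

(2) We did not pay enough attention to the official documentation: we noticed only asymmetry (comp(a,b)==true then comp(b,a)==false) and overlooked transitivity.

It took many detours before we noticed the transitivity requirement. When troubleshooting in the future, we should be more thorough and not overlook any detail.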
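
One way to catch this class of bug earlier is a brute-force check of the comparator over a small sample in a unit test. The sketch below is not something the original investigation used; the names equiv, is_strict_weak_ordering_on, descending, descending_or_equal and demo_check are hypothetical. It checks irreflexivity, asymmetry, transitivity, and transitivity of equivalence over every pair and triple of the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when x and y are equivalent under comp (neither is ordered before the other).
template <typename T, typename Compare>
bool equiv(const T& x, const T& y, Compare comp) {
    return !comp(x, y) && !comp(y, x);
}

// Brute-force check over every pair and triple in sample.
// O(n^3), so it is meant for small samples in tests only.
template <typename T, typename Compare>
bool is_strict_weak_ordering_on(const std::vector<T>& sample, Compare comp) {
    const std::size_t n = sample.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (comp(sample[i], sample[i])) return false;  // irreflexivity
        for (std::size_t j = 0; j < n; ++j) {
            // asymmetry
            if (comp(sample[i], sample[j]) && comp(sample[j], sample[i])) return false;
            for (std::size_t k = 0; k < n; ++k) {
                // transitivity of comp
                if (comp(sample[i], sample[j]) && comp(sample[j], sample[k]) &&
                    !comp(sample[i], sample[k])) return false;
                // transitivity of equivalence
                if (equiv(sample[i], sample[j], comp) && equiv(sample[j], sample[k], comp) &&
                    !equiv(sample[i], sample[k], comp)) return false;
            }
        }
    }
    return true;
}

// Two toy comparators for the usage example (hypothetical).
bool descending(int left, int right) { return left > right; }
bool descending_or_equal(int left, int right) { return left >= right; }  // not irreflexive

// Usage example: the first comparator passes, the second is rejected.
void demo_check() {
    std::vector<int> sample;
    sample.push_back(3);
    sample.push_back(1);
    sample.push_back(2);
    sample.push_back(3);
    assert(is_strict_weak_ordering_on(sample, descending));
    assert(!is_strict_weak_ordering_on(sample, descending_or_equal));
}
```

A check like this can only prove a violation for the sample it is given, so the sample should include the awkward cases, such as objects with and without an optional attribute.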

2. The deeper cause of the crash

The crash is fixed at the business level, but its direct cause is still unknown and needs further investigation.

The source code of std::sort can be found here:

https://github.com/gcc-mirror/gcc/blob/gcc-4_4-branch/libstdc%2B%2B-v3/include/bits/stl_algo.h

Other people's write-ups of the std::sort source code are also helpful:

http://www.javashuo.com/article/p-keruwjzp-er.html

https://liam.page/2018/09/18/std-sort-in-STL/

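A brief summary: for efficiency, std::sort combines quicksort, heapsort and insertion sort, and can be divided into two phases:

(1) Quicksort + heapsort (__introsort_loop): sequences with more than _S_threshold elements are quicksorted; once the quicksort recursion reaches a certain depth (__depth_limit), it stops recursing and heapsorts the elements still to be sorted. Sequences with fewer than _S_threshold elements are left untouched for the later insertion sort.

(2) Insertion sort (__final_insertion_sort): when the number of elements is less than _S_threshold, an ordinary insertion sort (__insertion_sort) is performed. When it is greater than _S_threshold, insertion sort runs in two batches: first an ordinary insertion sort over [0, _S_threshold), then an unguarded insertion sort (__unguarded_insertion_sort) that starts at position _S_threshold and runs to end. Note that this second batch may also touch elements before _S_threshold, because the function relies only on the comparison result to decide when to stop and never forces a stop at a particular position.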
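
The two phases above can be sketched as follows. This is a simplified illustration, not the libstdc++ implementation: it needs C++11 lambdas, it uses a three-way std::partition split around a middle pivot rather than libstdc++'s own partitioning, and the names kThreshold, introsort_loop, guarded_insertion_sort, unguarded_insertion_sort, sketch_sort, descending and demo_sketch_sort are all hypothetical. The value 16 mirrors _S_threshold in libstdc++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

const int kThreshold = 16;  // mirrors _S_threshold (16 in libstdc++)

// Phase 1: quicksort, falling back to heapsort when depth_limit is exhausted.
// Ranges of at most kThreshold elements are left unsorted for phase 2.
template <typename It, typename Compare>
void introsort_loop(It first, It last, int depth_limit, Compare comp) {
    typedef typename std::iterator_traits<It>::value_type T;
    while (last - first > kThreshold) {
        if (depth_limit == 0) {
            std::make_heap(first, last, comp);
            std::sort_heap(first, last, comp);
            return;
        }
        --depth_limit;
        const T pivot = *(first + (last - first) / 2);
        // Three-way split: [before pivot | equivalent to pivot | after pivot].
        It mid1 = std::partition(first, last, [&](const T& x) { return comp(x, pivot); });
        It mid2 = std::partition(mid1, last, [&](const T& x) { return !comp(pivot, x); });
        introsort_loop(mid2, last, depth_limit, comp);  // recurse on the right part
        last = mid1;                                    // loop on the left part
    }
}

// Phase 2a: ordinary insertion sort; "j != first" is the bounds check.
template <typename It, typename Compare>
void guarded_insertion_sort(It first, It last, Compare comp) {
    if (first == last) return;
    for (It i = first + 1; i != last; ++i) {
        typename std::iterator_traits<It>::value_type val = *i;
        It j = i;
        while (j != first && comp(val, *(j - 1))) {
            *j = *(j - 1);
            --j;
        }
        *j = val;
    }
}

// Phase 2b: unguarded insertion sort; it relies on an element to the left of
// first that no element of [first, last) is ordered before.
template <typename It, typename Compare>
void unguarded_insertion_sort(It first, It last, Compare comp) {
    for (It i = first; i != last; ++i) {
        typename std::iterator_traits<It>::value_type val = *i;
        It j = i;
        while (comp(val, *(j - 1))) {  // no bounds check
            *j = *(j - 1);
            --j;
        }
        *j = val;
    }
}

template <typename It, typename Compare>
void sketch_sort(It first, It last, Compare comp) {
    if (first == last) return;
    int depth_limit = 0;  // roughly 2 * log2(n)
    for (typename std::iterator_traits<It>::difference_type n = last - first; n > 1; n >>= 1) {
        depth_limit += 2;
    }
    introsort_loop(first, last, depth_limit, comp);
    if (last - first > kThreshold) {
        guarded_insertion_sort(first, first + kThreshold, comp);
        unguarded_insertion_sort(first + kThreshold, last, comp);
    } else {
        guarded_insertion_sort(first, last, comp);
    }
}

bool descending(int left, int right) { return left > right; }

// Usage example: sorts 100 values in descending order and checks the result.
void demo_sketch_sort() {
    std::vector<int> v;
    for (int i = 0; i < 100; ++i) v.push_back((i * 37) % 101);
    sketch_sort(v.begin(), v.end(), descending);
    for (std::size_t i = 1; i < v.size(); ++i) assert(!descending(v[i], v[i - 1]));
}
```

The unguarded batch is only safe here because phase 1, given a valid comparator, leaves an element in the first kThreshold positions that nothing else is ordered before.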

Our crash happened in the __unguarded_insertion_sort phase, i.e. the unguarded insertion sort. Here is the relevant code:

/// This is a helper function for the sort routine.
template<typename _RandomAccessIterator, typename _Compare>
inline void __unguarded_insertion_sort(_RandomAccessIterator __first,
               _RandomAccessIterator __last, _Compare __comp)
{
    typedef typename iterator_traits<_RandomAccessIterator>::value_type _ValueType;
    for (_RandomAccessIterator __i = __first; __i != __last; ++__i)
        std::__unguarded_linear_insert(__i, _ValueType(*__i), __comp);
}


/// This is a helper function for the sort routine.
template<typename _RandomAccessIterator, typename _Tp, typename _Compare>
void __unguarded_linear_insert(_RandomAccessIterator __last, _Tp __val,
              _Compare __comp) {
    _RandomAccessIterator __next = __last;
    --__next;
    while (__comp(__val, *__next)) {
        *__last = *__next;
        __last = __next;
        --__next;
    }
    *__last = __val;
}

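As you can see, the only termination condition in __unguarded_linear_insert is that the compare function returns false; otherwise it keeps moving toward the front. This is safe because the preceding quicksort + heapsort code guarantees that the elements of [0,X) are all greater than those of [X, end) (assuming a descending sort), where 0<X<=_S_threshold. Once that guarantee no longer holds, --__next goes out of bounds and eventually causes a crash.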

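Back to our crash case: because the compare function does not satisfy transitivity, even though every element in [0,X) is greater than X and every element in (X,end] is less than X, there is no guarantee that the elements of (X,end] are all less than the elements of [0,X). When __unguarded_linear_insert performs insertion sort on an element of (X,end] that is greater than every element of [0,X), the out-of-bounds crash occurs.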
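
The overrun can be modelled safely with an index-based version of the same loop. This is a simplified model for a descending sort of ints, not a reproduction of the original crash; the names insert_stays_in_bounds and demo_unguarded are hypothetical. Unlike the real __unguarded_linear_insert, the model checks for index 0 so that it can report the overrun instead of reading before the start of the array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index-based model of __unguarded_linear_insert for a descending sort of ints.
// Returns true if the scan was stopped by the comparison, and false if it
// reached index 0, which is the point where the real code would go on to
// evaluate *(begin - 1).
bool insert_stays_in_bounds(std::vector<int>& v, std::size_t pos) {
    const int val = v[pos];
    std::size_t last = pos;
    // "val > x" plays the role of comp(val, x); "last > 0" is the check
    // that the unguarded version omits.
    while (last > 0 && val > v[last - 1]) {
        v[last] = v[last - 1];
        --last;
    }
    v[last] = val;
    return last > 0;
}

// Usage example: one insertion with a sentinel in place, one without.
void demo_unguarded() {
    // 9 acts as the sentinel: nothing to its right is greater than it.
    int ok_raw[] = {9, 7, 5, 8};
    std::vector<int> ok(ok_raw, ok_raw + 4);
    assert(insert_stays_in_bounds(ok, 3));
    assert(ok[1] == 8);

    // Guarantee broken: the element at index 3 is greater than the whole prefix,
    // so the scan is never stopped by the comparison.
    int bad_raw[] = {9, 7, 5, 10};
    std::vector<int> bad(bad_raw, bad_raw + 4);
    assert(!insert_stays_in_bounds(bad, 3));
}
```

In the second case the real code would keep decrementing past the first element and compare against whatever memory lies before the array, which matches the out-of-bounds access seen in the call stack.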

The benefit of using __unguarded_insertion_sort here, rather than only __insertion_sort, is that it saves the bounds check. Related discussion: https://bytes.com/topic/c/answers/819473-questions-about-stl-sort