Introduction to Algorithms, Chapter 18: B-Trees

Date: 2019-11-13


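This article was first published on my personal WeChat public account, Linux雲計算網絡 (id: cloud_dev), which focuses on practical technical content. The account offers 10 TB of books and video resources; reply "1024" in the account to get them. You are welcome to follow it, and the QR code is at the end of the article.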

1. Advanced Data Structures

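From this chapter through Chapter 21 (disjoint sets), the book covers advanced data structures. Two earlier chapters, greedy algorithms and amortized analysis, are still outstanding, and I plan to come back to them later. The dynamic-set operations discussed in earlier chapters, such as search, insert, and delete, were built on simple structures like linear lists, linked lists, and trees. The chapters from here on revisit those operations at a higher level: the structures are more complex, but the time bounds (including amortized bounds) are lower.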

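B-trees are balanced search trees designed specifically for disk storage. Because disk operations are far slower than memory accesses, the performance of a B-tree is measured not only by the computing time its dynamic-set operations consume but also by the number of disk accesses they perform. B-trees are therefore designed to keep the number of disk accesses as small as possible. With that in mind, the B+ tree variant is easy to understand: it keeps all record data in the leaves, so internal nodes hold only keys, each node can index more children, and the number of disk accesses drops further.
Mergeable heaps support five operations: make-heap, insert, minimum, extract-min, and union. Binary heaps, covered in the heapsort chapter, perform well on all of these except union. The binomial heaps and Fibonacci heaps discussed in this part handle union efficiently and also improve on some of the other operations.
This part also presents the van Emde Boas tree, which further improves dynamic-set operations when the keys are integers from a bounded range of size u: each operation runs in O(lg lg u) time.
Disjoint sets (union-find): representing each set as a simple rooted tree gives remarkably fast operations. A sequence of m operations on n elements runs in O(m α(n)) time, where α(n) grows extremely slowly; even for n as large as the number of atoms in the universe, α(n) <= 4, so in practice the running time can be treated as O(m).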

2. B-Trees

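In terms of how the ideas developed, the B-tree can be viewed as a generalization of the 2-3 tree. The 2-3 tree is a kind of balanced search tree. An AVL tree stays balanced by "keeping the heights of the left and right subtrees of every node within 1 of each other", a red-black tree by "coloring the nodes and constraining how the colors may appear", and a 2-3 tree by "constraining the degree of internal nodes": every internal node has either two children or three children, hence the name 2-3 tree.

[Figure: example of a 2-3 tree]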

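Going one step further, a red-black tree can be seen as a simple binary encoding of a 2-3 tree (strictly speaking, this correspondence holds for left-leaning red-black trees; general red-black trees correspond to 2-3-4 trees). A 2-3 tree is awkward to code directly and is no longer a binary tree, which makes it harder to understand and adopt. With a small transformation, however, it becomes a binary tree: use two kinds of links, where a red link joins the two keys of a node with three children and black links are the ordinary links. This turns a 2-3 tree into a red-black tree.

[Figure: a 2-3 tree and the corresponding red-black tree]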

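A 2-3 tree achieves balance by giving internal nodes a degree of 2 or 3. More generally, it is natural to allow internal nodes a much larger degree, which lowers the height of the tree and suits a wider range of scenarios. In this sense the B-tree builds on that earlier work.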

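In terms of applications, large-scale data stores such as databases and distributed systems need index lookups, and accessing the data usually means reading and writing disk. The bottleneck is then disk I/O. An ordinary binary search tree is too deep, so a lookup triggers too many disk reads and access becomes slow (typically one tree node corresponds to one disk page, and reading one disk page costs as much as a very large number of memory accesses). How can the depth be reduced? A basic and natural idea is to use a multiway tree: the larger the branching factor (the number of children per node), the lower the tree and the cheaper the lookup. This is the motivation for the B-tree structure.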

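As mentioned above, on most systems the running time of a B-tree algorithm is dominated by the number of disk-read and disk-write operations it performs; the remaining time is spent computing in memory, which is orders of magnitude faster. These two operations should therefore be used efficiently, transferring as much information as possible in as few operations as possible. For this reason a B-tree node is usually as large as one whole disk page, and the number of children a node can have is limited by the page size. In principle, more children is better, since the tree gets shorter and lookups get cheaper, but a node must not be larger than one disk page; otherwise reading a single node would take more than one disk access, which slows things down.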
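To make the page-size constraint concrete, here is a minimal sketch that computes the largest minimum degree t for which a full node still fits in one page. The function name MaxMinDegree and the node layout (a fixed header followed by 2t-1 keys and 2t child pointers) are assumptions made for illustration, not something the article or the book prescribes.

```cpp
#include <cassert>
#include <cstddef>

// Largest minimum degree t such that a full node (2t-1 keys, 2t child
// pointers, plus a fixed header) still fits in one disk page.
// All sizes are in bytes; the layout is a hypothetical one for illustration.
std::size_t MaxMinDegree( std::size_t page_size, std::size_t key_size,
                          std::size_t ptr_size, std::size_t header_size )
{
    assert( page_size > header_size && key_size + ptr_size > 0 );
    // header + (2t-1)*key + 2t*ptr <= page
    //   =>  t <= (page - header + key) / (2 * (key + ptr))
    return ( page_size - header_size + key_size ) / ( 2 * ( key_size + ptr_size ) );
}
```

With a 4096-byte page, 8-byte keys, 8-byte child pointers, and a 16-byte header, this gives t = 127, that is, up to 253 keys and 254 children per node.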

3. B-Tree Definition and Dynamic-Set Operations

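A B-tree has the following properties:

1) Every node x has three attributes:

    a. x.n: the number of keys stored in x

    b. the keys themselves, stored in nondecreasing order

    c. x.leaf: whether x is a leaf

2) Every internal node x has x.n + 1 children (a leaf has none).

3) The keys of a node separate the key ranges of its subtrees: every key in the subtree to the left of a key k is <= k, and every key in the subtree to its right is >= k.

4) All leaves have the same depth, which is the height h of the tree.

5) The minimum degree t (t >= 2) bounds the number of keys in a node and, equivalently, the number of children (the branching factor). The number of keys j in every non-root node satisfies:

t-1 <= j <= 2*t - 1

The root of a nonempty tree holds at least one key and, if it is not a leaf, has at least two children. A non-root internal node therefore has between t and 2*t children.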
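The properties above can be turned into a small checker. The sketch below is only an illustration: the Node struct mirrors the shape of the BTreeNode used in this article (without the parent pointer), and the names CheckSubtree and IsValidBTree are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory node, shaped like the BTreeNode used in this article.
struct Node
{
    std::vector<int>    Keys;
    std::vector<Node *> Childs;
    bool                IsLeaf = true;
};

// Checks the subtree rooted at x. lo and hi are the separator keys inherited
// from the ancestors (nullptr means unbounded on that side); leaf_depth is -1
// until the first leaf has been seen.
bool CheckSubtree( const Node *x, int t, bool is_root, int depth, int &leaf_depth,
                   const int *lo, const int *hi )
{
    const std::size_t n = x->Keys.size();
    // Property 5: key-count bounds (the root only needs one key).
    const std::size_t min_keys = is_root ? 1 : static_cast<std::size_t>( t - 1 );
    if ( n < min_keys || n > static_cast<std::size_t>( 2 * t - 1 ) ) return false;
    // Property 1b: keys in nondecreasing order.
    for ( std::size_t i = 1; i < n; ++i )
        if ( x->Keys[i - 1] > x->Keys[i] ) return false;
    // Property 3: all keys lie between the separators inherited from above.
    if ( lo && x->Keys.front() < *lo ) return false;
    if ( hi && x->Keys.back() > *hi ) return false;
    if ( x->IsLeaf )
    {
        // Property 4: every leaf sits at the same depth.
        if ( leaf_depth == -1 ) leaf_depth = depth;
        return x->Childs.empty() && leaf_depth == depth;
    }
    // Property 2: an internal node has n + 1 children.
    if ( x->Childs.size() != n + 1 ) return false;
    for ( std::size_t i = 0; i <= n; ++i )
    {
        const int *child_lo = ( i == 0 ) ? lo : &x->Keys[i - 1];
        const int *child_hi = ( i == n ) ? hi : &x->Keys[i];
        if ( !CheckSubtree( x->Childs[i], t, false, depth + 1, leaf_depth, child_lo, child_hi ) )
            return false;
    }
    return true;
}

// Returns true if the tree satisfies properties 1) to 5) for minimum degree t.
bool IsValidBTree( const Node *root, int t )
{
    assert( t >= 2 );
    if ( root == nullptr ) return true;   // the empty tree is trivially valid
    int leaf_depth = -1;
    return CheckSubtree( root, t, true, 0, leaf_depth, nullptr, nullptr );
}
```

A checker like this is handy in unit tests: run it after every insert and delete to catch a violated invariant at the operation that caused it.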

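Compared with a red-black tree, the height of both grows as O(lg n), but for a B-tree the base of the logarithm is many times larger. For most tree operations, a B-tree therefore examines fewer nodes than a red-black tree by a factor of about lg t. Since reaching any node usually requires a disk access, the number of disk accesses drops substantially.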
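As a worked illustration of the height claim: a B-tree of height h with minimum degree t holds at least 2*t^h - 1 keys, so an n-key tree has height at most log_t((n+1)/2). The sketch below computes that bound with integer arithmetic; the function name MaxBTreeHeight is an assumption made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Largest height h that an n-key B-tree of minimum degree t can have.
// A tree of height h holds at least 2*t^h - 1 keys, so h is the largest
// integer with 2*t^h <= n + 1.
int MaxBTreeHeight( std::uint64_t n, std::uint64_t t )
{
    assert( n >= 1 && t >= 2 );
    int h = 0;
    std::uint64_t bound = 2;              // 2 * t^h for the current h
    while ( bound <= ( n + 1 ) / t )      // same as 2 * t^(h+1) <= n + 1, without overflow
    {
        bound *= t;
        ++h;
    }
    return h;
}
```

For one billion keys, t = 1001 gives a height of at most 2, while t = 2 allows a height of up to 28, which shows how much a large branching factor flattens the tree.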

The following code represents a node of the B-tree:

/// A node of the B-tree
struct BTreeNode
{
    vector<int>            Keys;
    vector<BTreeNode *>    Childs;
    BTreeNode            *Parent;    ///< Parent node; nullptr when this node is the root
    bool                IsLeaf;        ///< Whether this node is a leaf

    BTreeNode() : Parent( nullptr ), IsLeaf( true ) {}

    size_t KeysSize()
    {
        return Keys.size();
    }
};

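I will not go through the dynamic-set operations on B-trees one by one. Introduction to Algorithms explains them clearly and with plenty of figures, and a careful read of the book is enough to follow them. My implementation is below: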

#ifndef _B_TREE_H_
#define _B_TREE_H_


#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <sstream>
#include <cassert>

using namespace std;


class BTree
{
public:
    /// A node of the B-tree
    struct BTreeNode
    {
        vector<int>            Keys;
        vector<BTreeNode *>    Childs;
        BTreeNode            *Parent;        ///< Parent node; nullptr when this node is the root
        bool                IsLeaf;            ///< Whether this node is a leaf

        BTreeNode() : Parent( nullptr ), IsLeaf( true ) {}

        size_t KeysSize()
        {
            return Keys.size();
        }
    };

    /// Construct a B-tree with minimum degree t (t >= 2)
    BTree( int t ) : _root( nullptr ), _t( t )
    {
        assert( t >= 2 );
    }

    ~BTree()
    {
        _ReleaseNode( _root );
    }

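    /// @brief Search the B-tree for a key
    ///
    /// Searching a B-tree works like searching a binary search tree, except that
    /// each node offers keynum+1 branches to choose from instead of two.
    /// @return the node holding the key and the key's index in it, or (nullptr, 0) if absent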
    pair<BTreeNode *, size_t> Search( int key )
    {
        return _SearchInNode( _root, key );
    }

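    /// @brief Insert a key
    ///
    /// This does not use the one-pass method from Introduction to Algorithms; it is a
    /// two-pass method of my own. It is less elegant than the book's, but it works.
    /// Insertion always happens in a leaf: walk down the tree to the target leaf and
    /// insert the new key there. If the leaf then holds more than 2*t-1 keys, split
    /// it and repeat upward as needed. If the split reaches the root, the tree grows
    /// one level taller.
    /// @return false if the key is already present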
    bool Insert( int new_key )
    {
        if ( _root == nullptr )    // empty tree
        {
            _root = new BTreeNode();
            _root->IsLeaf = true;
            _root->Keys.push_back( new_key );
            return true;
        }

        if ( Search( new_key ).first == nullptr )    // only insert keys that are not present yet
        {
            BTreeNode *node = _root;
            while ( !node->IsLeaf )
            {
                int index = 0;
                while ( index < node->Keys.size() && new_key >= node->Keys[index] )
                {
                    ++index;
                }
                node = node->Childs[index];
            }

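            // Insert the key into the leaf, keeping Keys sorted
            // (upper_bound replaces bind2nd, which was removed in C++17)
            node->Keys.insert( upper_bound( node->Keys.begin(), node->Keys.end(), new_key ), new_key );

            // Walk back up, splitting every node that now holds too many keys
            while ( node->KeysSize() > static_cast<size_t>( 2 * _t - 1 ) )
            {
                // ===== split =====
                int prove_node_key = node->Keys[node->KeysSize() / 2 - 1];            // key promoted to the parent

                // The upper half becomes a new node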
                BTreeNode *new_node = new BTreeNode();
                new_node->IsLeaf = node->IsLeaf;
                new_node->Keys.insert( new_node->Keys.begin(), node->Keys.begin() + node->KeysSize() / 2, node->Keys.end() );
                new_node->Childs.insert( new_node->Childs.begin(), node->Childs.begin() + node->Childs.size() / 2, node->Childs.end() );
                assert( new_node->Childs.empty() || new_node->Childs.size() == new_node->Keys.size() + 1 );
                for_each( new_node->Childs.begin(), new_node->Childs.end(), [&]( BTreeNode * c )
                {
                    c->Parent = new_node;
                } );

                // Remove the promoted key and the upper half from the original node
                node->Keys.erase( node->Keys.begin() + node->KeysSize() / 2 - 1, node->Keys.end() );
                node->Childs.erase( node->Childs.begin() + node->Childs.size() / 2, node->Childs.end() );
                assert( node->Childs.empty() || node->Childs.size() == node->Keys.size() + 1 );

                BTreeNode *parent_node = node->Parent;
                if ( parent_node == nullptr )    // the split reached the root: the tree grows taller and needs a new root
                {
                    parent_node = new BTreeNode();
                    parent_node->IsLeaf = false;
                    parent_node->Childs.push_back( node );
                    _root = parent_node;
                }
                node->Parent = new_node->Parent = parent_node;

                auto insert_pos = upper_bound( parent_node->Keys.begin(), parent_node->Keys.end(), prove_node_key ) - parent_node->Keys.begin();
                parent_node->Keys.insert( parent_node->Keys.begin() + insert_pos, prove_node_key );
                parent_node->Childs.insert( parent_node->Childs.begin() + insert_pos + 1, new_node );

                node = parent_node;
            }

            return true;
        }
        return false;
    }

    /// @brief Delete a key
    bool Delete( int key_to_del )
    {
        auto found_node = Search( key_to_del );
        if ( found_node.first == nullptr )        // key_to_del is not in the tree
        {
            return false;
        }

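        if ( !found_node.first->IsLeaf )        // the key is in an internal node: overwrite it with its predecessor, then delete the predecessor
        {
            // Predecessor: the rightmost key of the subtree to the left of the key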
            BTreeNode *previous_node = found_node.first->Childs[found_node.second];
            while ( !previous_node->IsLeaf )
            {
                previous_node = previous_node->Childs[previous_node->Childs.size() - 1];
            }

            // Overwrite the key, then continue with the predecessor's slot in the leaf
            found_node.first->Keys[found_node.second] = previous_node->Keys[previous_node->Keys.size() - 1];
            found_node.first = previous_node;
            found_node.second = previous_node->Keys.size() - 1;
        }

        // From here on found_node always refers to a leaf
        assert( found_node.first->IsLeaf );
        _DeleteLeafNode( found_node.first, found_node.second );

        return true;
    }

private:
    void _ReleaseNode( BTreeNode *node )
    {
        if ( node == nullptr )    // empty tree: nothing to free
        {
            return;
        }
        for_each( node->Childs.begin(), node->Childs.end(), [&]( BTreeNode * c )
        {
            _ReleaseNode( c );
        } );
        delete node;
    }

    /// @brief Delete one key from a leaf of the B-tree
    ///
    /// @param    node    the leaf to delete from
    /// @param    index    index of the key to delete within the leaf
    /// @note            the caller must make sure node is a leaf
    void _DeleteLeafNode( BTreeNode *node, size_t index )
    {
        assert( node && node->IsLeaf );

        if ( node == _root )
        {
            // The key is in the root, and the root is also a leaf (the caller guarantees node is a leaf)
            _root->Keys.erase( _root->Keys.begin() + index );
            if ( _root->Keys.empty() )
            {
                // The tree is now empty
                delete _root;
                _root = nullptr;
            }
            return;
        }

        // Below: node is not the root

        if ( node->Keys.size() > _t - 1 )
        {
            // The node holds more than t-1 keys, so removing one keeps the B-tree properties
            node->Keys.erase( node->Keys.begin() + index );
        }
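        else    // removing a key would leave the node with too few keys
        {
            // Whether a key was borrowed from a sibling
            bool        borrowed = false;

            // Try to borrow a key from the left sibling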
            BTreeNode    *left_brother = _GetLeftBrother( node );
            if ( left_brother && left_brother->Keys.size() > _t - 1 )
            {
                int index_in_parent = _GetIndexInParent( left_brother );
                BTreeNode *parent = node->Parent;

                node->Keys.insert( node->Keys.begin(), parent->Keys[index_in_parent] );
                parent->Keys[index_in_parent] = left_brother->Keys[left_brother->KeysSize() - 1];
                left_brother->Keys.erase( left_brother->Keys.end() - 1 );

                ++index;
                borrowed = true;
            }
            else
            {
                // The left sibling cannot spare a key: try the right sibling
                BTreeNode    *right_brother = _GetRightBrother( node );
                if ( right_brother && right_brother->Keys.size() > _t - 1 )
                {
                    int index_in_parent = _GetIndexInParent( node );
                    BTreeNode *parent = node->Parent;

                    node->Keys.push_back( parent->Keys[index_in_parent] );
                    parent->Keys[index_in_parent] = right_brother->Keys[0];
                    right_brother->Keys.erase( right_brother->Keys.begin() );

                    borrowed = true;
                }
            }

            if ( borrowed )
            {
                // A key was borrowed, so the target key can now be removed directly
                _DeleteLeafNode( node, index );
            }
            else
            {
                // Neither sibling can spare a key: remove the key, then merge with a sibling
                node->Keys.erase( node->Keys.begin() + index );
                _UnionNodes( node );
            }
        }
    }

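    /// @brief Restore the minimum-key rule at node after a deletion below it
    ///
    /// Borrows a key from a sibling that can spare one; otherwise merges node with
    /// a sibling (the left one if it exists) and repeats the check on the parent.
    /// Borrowing first keeps a merged node from exceeding 2*t-1 keys.
    void _UnionNodes( BTreeNode * node )
    {
        if ( node == nullptr )
        {
            return;
        }

        if ( node == _root )
        {
            if ( _root->Keys.empty() )
            {
                // The root ran out of keys: its only child becomes the new root
                // and the height of the tree shrinks by one
                _root = _root->Childs[0];
                _root->Parent = nullptr;
                delete node;
            }
            return;
        }

        if ( node->KeysSize() >= static_cast<size_t>( _t - 1 ) )
        {
            return;    // enough keys, nothing to repair
        }

        BTreeNode *parent = node->Parent;
        BTreeNode *left_brother = _GetLeftBrother( node );
        BTreeNode *right_brother = _GetRightBrother( node );

        if ( left_brother && left_brother->KeysSize() > static_cast<size_t>( _t - 1 ) )
        {
            // Rotate right: separator moves down, the left sibling's last key moves up
            int sep = _GetIndexInParent( left_brother );
            node->Keys.insert( node->Keys.begin(), parent->Keys[sep] );
            parent->Keys[sep] = left_brother->Keys.back();
            left_brother->Keys.pop_back();
            if ( !left_brother->IsLeaf )
            {
                left_brother->Childs.back()->Parent = node;
                node->Childs.insert( node->Childs.begin(), left_brother->Childs.back() );
                left_brother->Childs.pop_back();
            }
            return;
        }
        if ( right_brother && right_brother->KeysSize() > static_cast<size_t>( _t - 1 ) )
        {
            // Rotate left: separator moves down, the right sibling's first key moves up
            int sep = _GetIndexInParent( node );
            node->Keys.push_back( parent->Keys[sep] );
            parent->Keys[sep] = right_brother->Keys.front();
            right_brother->Keys.erase( right_brother->Keys.begin() );
            if ( !right_brother->IsLeaf )
            {
                right_brother->Childs.front()->Parent = node;
                node->Childs.push_back( right_brother->Childs.front() );
                right_brother->Childs.erase( right_brother->Childs.begin() );
            }
            return;
        }

        // No sibling can spare a key: merge. Arrange the pair so that
        // left_brother is the left node and node is the right node.
        if ( left_brother == nullptr )
        {
            left_brother = node;
            node = right_brother;
        }
        int sep = _GetIndexInParent( left_brother );
        left_brother->Keys.push_back( parent->Keys[sep] );
        left_brother->Keys.insert( left_brother->Keys.end(), node->Keys.begin(), node->Keys.end() );
        for ( BTreeNode *c : node->Childs )
        {
            c->Parent = left_brother;
        }
        // The right node's children go after the left node's children
        left_brother->Childs.insert( left_brother->Childs.end(), node->Childs.begin(), node->Childs.end() );
        parent->Keys.erase( parent->Keys.begin() + sep );
        parent->Childs.erase( parent->Childs.begin() + sep + 1 );

        delete node;
        _UnionNodes( parent );    // the parent lost a key and may now be too small
    }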

    pair<BTreeNode *, size_t> _SearchInNode( BTreeNode *node, int key )
    {
        if ( !node )
        {
            // Not found: the tree is empty
            return make_pair( static_cast<BTreeNode *>( nullptr ), 0 );
        }
        else
        {
            int index = 0;
            while ( index < node->Keys.size() && key >= node->Keys[index] )
            {
                if ( key == node->Keys[index] )
                {
                    return make_pair( node, index );
                }
                else
                {
                    ++index;
                }
            }

            if ( node->IsLeaf )
            {
                // Reached a leaf without finding the key: there is nowhere further down to look
                return make_pair( static_cast<BTreeNode *>( nullptr ), 0 );
            }
            else
            {
                return _SearchInNode( node->Childs[index], key );
            }
        }
    }

    void _GetDotLanguageViaNodeAndEdge( stringstream &ss, BTreeNode *node )
    {
        if ( node && !node->Keys.empty() )
        {
            int index = 0;
            ss << "    node" << node->Keys[0] << "[label = \"";
            while ( index < node->Keys.size() )
            {
                ss << "<f" << 2 * index << ">|";
                ss << "<f" << 2 * index + 1 << ">" << node->Keys[index] << "|";
                ++index;
            }
            ss << "<f" << 2 * index << ">\"];" << endl;;

            if ( !node->IsLeaf )
            {
                for( int i = 0; i < node->Childs.size(); ++i )
                {
                    BTreeNode *c = node->Childs[i];
                    ss << "    \"node" << node->Keys[0] << "\":f" << 2 * i << " -> \"node" << c->Keys[0] << "\":f" << ( 2 * c->Keys.size() + 1 ) / 2 << ";" << endl;
                }
            }

            for_each( node->Childs.begin(), node->Childs.end(), [&]( BTreeNode * c )
            {
                _GetDotLanguageViaNodeAndEdge( ss, c );
            } );
        }
    }

    /// Return the left sibling of a node, or nullptr if it has none
    BTreeNode * _GetLeftBrother( BTreeNode *node )
    {
        if ( node && node->Parent )
        {
            BTreeNode *parent = node->Parent;
            for ( int i = 1; i < parent->Childs.size(); ++i )
            {
                if ( parent->Childs[i] == node )
                {
                    return parent->Childs[i - 1];
                }
            }
        }
        return nullptr;
    }

    /// Return the right sibling of a node, or nullptr if it has none
    BTreeNode * _GetRightBrother( BTreeNode *node )
    {
        if ( node && node->Parent )
        {
            BTreeNode *parent = node->Parent;
            for ( int i = 0; i < static_cast<int>( parent->Childs.size() ) - 1; ++i )
            {
                if ( parent->Childs[i] == node )
                {
                    return parent->Childs[i + 1];
                }
            }
        }
        return nullptr;
    }

    /// Return the position of a node among the children of its parent
    /// @return    -1 on error
    int _GetIndexInParent( BTreeNode *node )
    {
        assert( node && node->Parent );

        for ( int i = 0; i < node->Parent->Childs.size(); ++i )
        {
            if ( node->Parent->Childs[i] == node )
            {
                return i;
            }
        }

        return -1;
    }


    BTreeNode    *_root;            ///< Root of the B-tree
    int            _t;                ///< Minimum degree of the B-tree: every node holds t-1 <= n <= 2t-1 keys, except the root, which may hold as few as 1
};

#endif//_B_TREE_H_
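For comparison with the two-pass insertion above, here is a minimal sketch of the one-pass approach described in Introduction to Algorithms: every full node met on the way down is split before descending into it, so the walk never has to go back up. The names OnePassNode, SplitChild, InsertOnePass, and InOrder are made up for this sketch, nodes are owned through std::unique_ptr instead of carrying a parent pointer, and std::make_unique needs C++14.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node type for this sketch; unlike BTreeNode it has no parent pointer.
struct OnePassNode
{
    std::vector<int>                          keys;
    std::vector<std::unique_ptr<OnePassNode>> kids;
    bool leaf() const { return kids.empty(); }
};

// Split the full child parent->kids[i] (2t-1 keys) around its median key.
void SplitChild( OnePassNode *parent, std::size_t i, int t )
{
    OnePassNode *left = parent->kids[i].get();
    assert( left->keys.size() == static_cast<std::size_t>( 2 * t - 1 ) );
    auto right = std::make_unique<OnePassNode>();
    const int median = left->keys[t - 1];
    // The upper t-1 keys and, for an internal node, the upper t children move right.
    right->keys.assign( left->keys.begin() + t, left->keys.end() );
    left->keys.resize( t - 1 );
    if ( !left->leaf() )
    {
        std::move( left->kids.begin() + t, left->kids.end(), std::back_inserter( right->kids ) );
        left->kids.erase( left->kids.begin() + t, left->kids.end() );
    }
    parent->keys.insert( parent->keys.begin() + i, median );
    parent->kids.insert( parent->kids.begin() + i + 1, std::move( right ) );
}

// Insert key into the tree rooted at root; duplicates are ignored.
void InsertOnePass( std::unique_ptr<OnePassNode> &root, int key, int t )
{
    assert( t >= 2 );
    const std::size_t full = static_cast<std::size_t>( 2 * t - 1 );
    if ( !root ) root = std::make_unique<OnePassNode>();
    if ( root->keys.size() == full )
    {
        // Grow the tree at the top: a new empty root adopts the old one.
        auto new_root = std::make_unique<OnePassNode>();
        new_root->kids.push_back( std::move( root ) );
        root = std::move( new_root );
        SplitChild( root.get(), 0, t );
    }
    OnePassNode *x = root.get();
    for ( ;; )
    {
        auto it = std::lower_bound( x->keys.begin(), x->keys.end(), key );
        if ( it != x->keys.end() && *it == key ) return;      // already present
        std::size_t i = static_cast<std::size_t>( it - x->keys.begin() );
        if ( x->leaf() )
        {
            x->keys.insert( it, key );
            return;
        }
        if ( x->kids[i]->keys.size() == full )
        {
            SplitChild( x, i, t );
            if ( key == x->keys[i] ) return;                   // the promoted median is the key
            if ( key > x->keys[i] ) ++i;
        }
        x = x->kids[i].get();
    }
}

// Append the keys of the subtree rooted at x to out in sorted order.
void InOrder( const OnePassNode *x, std::vector<int> &out )
{
    if ( !x ) return;
    for ( std::size_t i = 0; i < x->keys.size(); ++i )
    {
        if ( !x->leaf() ) InOrder( x->kids[i].get(), out );
        out.push_back( x->keys[i] );
    }
    if ( !x->leaf() ) InOrder( x->kids.back().get(), out );
}
```

The trade-off is that a node may be split even though the key turns out to be a duplicate or the leaf below had room; in exchange no parent pointers are needed and each node on the path is touched once.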


4. Extensions of the B-Tree: B+ Trees and B* Trees

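A B+ tree is a variant of the B-tree. In the form described here, it differs from a B-tree as follows:

A node with k children holds exactly k keys.
Internal nodes serve only as an index; all record data is stored in the leaves.
All leaves are linked into an ordered list, so every record can be visited in key order by walking that list.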
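The linked leaves are what make range queries cheap. The sketch below illustrates the idea under simplifying assumptions: BPlusLeaf and RangeScan are hypothetical names, leaves hold only integer keys, and the root-to-leaf descent that locates the first leaf is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical B+ tree leaf: sorted keys plus a link to the next leaf.
struct BPlusLeaf
{
    std::vector<int> keys;
    BPlusLeaf       *next = nullptr;
};

// Collect every key in [lo, hi], starting from the leaf where lo would live.
// Finding that first leaf is the usual root-to-leaf descent and is not shown.
std::vector<int> RangeScan( const BPlusLeaf *first, int lo, int hi )
{
    assert( lo <= hi );
    std::vector<int> out;
    for ( const BPlusLeaf *leaf = first; leaf != nullptr; leaf = leaf->next )
    {
        for ( int k : leaf->keys )
        {
            if ( k > hi ) return out;        // keys are sorted across leaves, so the scan is done
            if ( k >= lo ) out.push_back( k );
        }
    }
    return out;
}
```

After the first leaf is found, the scan only follows next pointers and never goes back to the internal nodes, which is the access pattern the ordered leaf list is meant to support.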

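B-trees and B+ trees each have strengths and weaknesses. The advantages of the B+ tree are:

Lower disk I/O cost: the internal nodes of a B+ tree carry no pointers to the record data for their keys, so they are smaller than the internal nodes of a B-tree. If the keys of one internal node are stored in one disk page, a page holds more keys, more of the keys being searched are brought into memory per read, and the number of I/O operations goes down.
Better cache hit rate: first, internal nodes hold no data items, so keys are packed more densely and have better spatial locality, and the search for the data items in the leaves hits the cache more often. Second, the leaves are linked together, so traversing the whole tree is a single linear pass over the leaves, whereas a B-tree needs a recursive traversal through every level; neighboring elements may not be adjacent in memory, so its cache behavior is worse than that of a B+ tree.
More stable query cost: internal nodes are only an index to the data items in the leaves, so every lookup must follow a path from the root to a leaf. All lookups traverse paths of the same length, so every query costs about the same.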

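The B-tree has its own advantage, however: every node stores both keys and values, so a lookup can stop as soon as it finds the key in an internal node, and elements that sit closer to the root are reached faster.

Because of its better overall access behavior, the B+ tree is generally a better fit than the B-tree for file-system indexes in operating systems and for database indexes.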

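The B* tree adds further rules on top of the B+ tree: internal nodes also keep a pointer to their sibling, and every non-root node must be at least 2/3 full, instead of the 1/2 required by a B+ tree. When a node fills up, a B* tree can first move keys to a sibling that still has room instead of splitting right away, so its space utilization is higher.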
