Particle swarm optimization (PSO) is an algorithm that simulates the foraging behavior of bird flocks and bee swarms. The basic idea is to find the optimal solution through cooperation and information sharing among the individuals of a swarm. Imagine a flock of birds searching for food in an area that contains only one worm. None of the birds knows where the food is, but each bird knows how far its current position is from the food, and all of them know the position of the bird currently closest to it. What happens then?
Bird A: Ha, the worm is closest to me!
Birds B, C, D: I'd better hurry over to where A is and take a look!
At the same time, since each bird's distance to the food keeps changing as it moves, every bird has passed through some position of its own that was closest to the food, and that position serves as a second reference. Some bird: the spot I just passed seemed closer to the food; I should move back toward it!
The update formulas are easy to find on Baidu or Zhihu; for convenience, the textbook forms are reproduced below, and the rest of this post walks through the concrete code.
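For reference, these are the standard PSO update rules (textbook PSO, nothing specific to this post). Each particle i keeps a velocity v_i and a position x_i, and per iteration:

    v_i <- w*v_i + c1*r1*(pbest_i - x_i) + c2*r2*(gbest - x_i)
    x_i <- x_i + v_i

where r1 and r2 are fresh uniform random numbers in [0, 1), pbest_i is the best position particle i itself has visited so far, and gbest is the best position the whole swarm has visited. The code below uses w = 1 (no inertia weight) and scales the position step by a factor of 0.5.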
This post describes how to use particle swarm optimization to search for a good set of parameters for LightGBM (lgb).
The general parameter-tuning procedure is as follows:
1. Set base parameters {parm0} and a base evaluation metric {metrics0};
2. Run cross-validation on the training set, and plot error/variance on the training and validation sets against the number of trees (see the sketch just below this list);
3. Judge whether the model is overfitting or underfitting, and update the parameters accordingly to {parm1};
4. Repeat steps 2 and 3 to settle on the number of trees n_estimators;
5. Train the model with parameters {parm1} and n_estimators, and apply it to the test set.
Ideally, the loss-evaluation step should randomly sample the original data: train on one half, then predict the other half, so that the parameters move in the direction of lower variance.
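To make steps 2 and 4 concrete, here is a minimal sketch using LightGBM's built-in cross-validation. It assumes the train1 DataFrame and feature_use column list defined later in this post; the metric, learning rate, and round count are placeholder choices, not the author's.

import lightgbm as lgb
import matplotlib.pyplot as plt

dtrain = lgb.Dataset(train1[feature_use].fillna(-1), label=train1['target'])
cv_results = lgb.cv({'objective': 'regression', 'learning_rate': 0.05},
                    dtrain, num_boost_round=1200, nfold=5,
                    metrics='l2', stratified=False)
# Result key names vary slightly across LightGBM versions, so match on the suffix
for name, values in cv_results.items():
    if name.endswith('l2-mean'):
        plt.plot(values, label=name)
plt.xlabel('number of trees')
plt.ylabel('cross-validated l2 error')
plt.legend()
plt.show()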
First, define a loss function:
import numpy as np

def gini_coef(wealths):
    # Gini coefficient of the "wealth" distribution: area between the
    # Lorenz curve and the line of perfect equality, normalized
    cum_wealths = np.cumsum(sorted(np.append(wealths, 0)))
    sum_wealths = cum_wealths[-1]
    # np.float was removed in NumPy 1.24; plain float is equivalent here
    xarray = np.array(range(0, len(cum_wealths))) / float(len(cum_wealths) - 1)
    yarray = cum_wealths / sum_wealths
    B = np.trapz(yarray, x=xarray)  # area under the Lorenz curve (np.trapezoid on NumPy >= 2.0)
    A = 0.5 - B                     # area between equality line and Lorenz curve
    return A / (A + B)
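As a quick sanity check (illustrative values only): a perfectly even distribution should give 0, and a fully concentrated one should approach 1.

print(gini_coef(np.array([1.0, 1.0, 1.0, 1.0])))  # 0.0  (perfect equality)
print(gini_coef(np.array([0.0, 0.0, 0.0, 4.0])))  # 0.75 (all "wealth" in one entry)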
Of course, you could instead pass in the training labels together with the predictions and use their covariance; here the Gini coefficient is used as the loss function.
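If you want to go the covariance route instead, a minimal hypothetical alternative that scores predictions against the held-out labels could look like this (cov_fitness is not part of the original post):

def cov_fitness(y_true, y_pred):
    # Covariance between held-out labels and predictions; higher is better
    return np.cov(y_true, y_pred)[0, 1]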
Next, define an evaluation function that measures how well a given parameter vector performs:
import lightgbm as lgb
from sklearn.model_selection import train_test_split

def evaluate(train1, feature_use, parent):
    np.set_printoptions(suppress=True)
    print("*************************************")
    print(parent)
    # Build a LightGBM regressor from the particle's position vector `parent`
    model_lgb = lgb.LGBMRegressor(objective='regression',
                                  min_sum_hessian_in_leaf=parent[0],
                                  learning_rate=parent[1],
                                  bagging_fraction=parent[2],
                                  feature_fraction=parent[3],
                                  num_leaves=int(parent[4]),
                                  n_estimators=int(parent[5]),
                                  max_bin=int(parent[6]),
                                  bagging_freq=int(parent[7]),
                                  feature_fraction_seed=int(parent[8]),
                                  min_data_in_leaf=int(parent[9]),
                                  is_unbalance=True)
    targetme = train1['target']
    # Random 50/50 split: train on one half, score on the other,
    # as recommended in the tuning steps above
    X_train, X_test, y_train, y_test = train_test_split(
        train1[feature_use], targetme, test_size=0.5)
    model_lgb.fit(X_train.fillna(-1), y_train)
    y_pred = model_lgb.predict(X_test.fillna(-1))
    return gini_coef(y_pred)
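Before handing evaluate over to the swarm, it is worth smoke-testing it on one hand-picked particle. The values below are hypothetical (chosen from the middle of the search ranges used later), and train1 and feature_use must already be defined:

parent0 = [0.5, 0.05, 0.8, 0.8, 31, 1000, 255, 5, 3, 20]
print(evaluate(train1, feature_use, parent0))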
The parameter-initialization code:
## Parameter initialization
import random
import numpy as np

# The two acceleration coefficients of PSO
c1 = 1.49445
c2 = 1.49445
maxgen = 50    # number of generations
sizepop = 100  # swarm size
Vmax1 = 0.1    # velocity limits for the first dimension
Vmin1 = -0.1   # (per-dimension limits are applied in the update loop below)

## Generate the initial particles and velocities
pop = []
V = []
fitness = []
for i in range(sizepop):
    # Randomly generate one particle (a 10-dimensional parameter vector)
    temp_pop = []
    temp_v = []
    min_sum_hessian_in_leaf = random.random()
    temp_pop.append(min_sum_hessian_in_leaf)
    temp_v.append(random.random())
    learning_rate = random.uniform(0.001, 0.2)
    temp_pop.append(learning_rate)
    temp_v.append(random.random())
    bagging_fraction = random.uniform(0.5, 1)
    temp_pop.append(bagging_fraction)
    temp_v.append(random.random())
    feature_fraction = random.uniform(0.3, 1)
    temp_pop.append(feature_fraction)
    temp_v.append(random.random())
    num_leaves = random.randint(3, 100)
    temp_pop.append(num_leaves)
    temp_v.append(random.randint(-3, 3))
    n_estimators = random.randint(800, 1200)
    temp_pop.append(n_estimators)
    temp_v.append(random.randint(-3, 3))
    max_bin = random.randint(100, 500)
    temp_pop.append(max_bin)
    temp_v.append(random.randint(-3, 3))
    bagging_freq = random.randint(1, 10)
    temp_pop.append(bagging_freq)
    temp_v.append(random.randint(-3, 3))
    feature_fraction_seed = random.randint(1, 10)
    temp_pop.append(feature_fraction_seed)
    temp_v.append(random.randint(-3, 3))
    min_data_in_leaf = random.randint(1, 20)
    temp_pop.append(min_data_in_leaf)
    temp_v.append(random.randint(-3, 3))
    pop.append(temp_pop)  # initial swarm
    V.append(temp_v)      # initial velocities
    # Fitness of this particle
    fitness.append(evaluate(train1, feature_use, temp_pop))

pop = np.array(pop)
V = np.array(V)

# Individual bests and the global best (higher Gini = better, so take max;
# the original used min(fitness), which contradicts the maximizing update loop)
bestfitness = max(fitness)
bestIndex = fitness.index(bestfitness)
zbest = pop[bestIndex, :].copy()  # global best position
gbest = pop.copy()                # per-particle best positions (a copy, not a view)
fitnessgbest = list(fitness)      # per-particle best fitness values
fitnesszbest = bestfitness        # global best fitness value
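For readability, this is the index layout of each 10-dimensional particle, as established by the loop above:

# 0: min_sum_hessian_in_leaf   1: learning_rate   2: bagging_fraction
# 3: feature_fraction          4: num_leaves      5: n_estimators
# 6: max_bin                   7: bagging_freq    8: feature_fraction_seed
# 9: min_data_in_leaf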
Now start the iterative search:
count = 0
# Per-dimension velocity limits (same values as the original if-chains)
v_min = np.array([-0.1, -0.02, -0.1, -0.1, -2, -10, -5, -1, -1, -1])
v_max = np.array([0.1, 0.02, 0.1, 0.1, 2, 10, 5, 1, 1, 1])
# Position bounds for dimensions 2..9 (dimensions 0 and 1 have asymmetric resets)
p_min = np.array([0.5, 0.3, 3, 800, 100, 1, 1, 1])
p_max = np.array([1, 1, 100, 1200, 500, 10, 10, 20])

## Iterative optimization
for i in range(maxgen):
    for j in range(sizepop):
        count = count + 1
        print(count)  # progress indicator
        # Velocity update: pull toward the particle's own best (gbest[j])
        # and the swarm's global best (zbest)
        V[j, :] = (V[j, :]
                   + c1 * random.random() * (gbest[j, :] - pop[j, :])
                   + c2 * random.random() * (zbest - pop[j, :]))
        V[j, :] = np.clip(V[j, :], v_min, v_max)

        # Position update, with each parameter clamped back into its search range
        pop[j, :] = pop[j, :] + 0.5 * V[j, :]
        if pop[j, 0] < 0:
            pop[j, 0] = 0.001
        if pop[j, 0] > 1:
            pop[j, 0] = 0.9
        if pop[j, 1] < 0:
            pop[j, 1] = 0.001
        if pop[j, 1] > 0.2:
            pop[j, 1] = 0.2
        pop[j, 2:] = np.clip(pop[j, 2:], p_min, p_max)

        fitness[j] = evaluate(train1, feature_use, pop[j, :])

    # Update the per-particle bests and the global best
    # (the original looped over range(1, sizepop), silently skipping particle 0)
    for k in range(sizepop):
        if fitness[k] > fitnessgbest[k]:
            gbest[k, :] = pop[k, :]
            fitnessgbest[k] = fitness[k]
        # Global best update
        if fitness[k] > fitnesszbest:
            zbest = pop[k, :].copy()
            fitnesszbest = fitness[k]
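When the loop finishes, zbest holds the best parameter vector found and fitnesszbest its Gini score. A small helper (hypothetical, not in the original post) prints them in readable form:

param_names = ['min_sum_hessian_in_leaf', 'learning_rate', 'bagging_fraction',
               'feature_fraction', 'num_leaves', 'n_estimators', 'max_bin',
               'bagging_freq', 'feature_fraction_seed', 'min_data_in_leaf']
print("best fitness:", fitnesszbest)
print("best params:", dict(zip(param_names, zbest)))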
Finally, load the best parameters and predict:
# Fit an lgb regressor with the best parameters found by the swarm
model_lgb = lgb.LGBMRegressor(objective='regression',
                              min_sum_hessian_in_leaf=zbest[0],
                              learning_rate=zbest[1],
                              bagging_fraction=zbest[2],
                              feature_fraction=zbest[3],
                              num_leaves=int(zbest[4]),
                              n_estimators=int(zbest[5]),
                              max_bin=int(zbest[6]),
                              bagging_freq=int(zbest[7]),
                              feature_fraction_seed=int(zbest[8]),
                              min_data_in_leaf=int(zbest[9]),
                              is_unbalance=True)
model_lgb.fit(train1[feature_use].fillna(-1), train1['target'])
y_pred = model_lgb.predict(test1[feature_use].fillna(-1))
print("lgb success")
In the code above, train1 (and test1) are pandas DataFrames. (The original post showed a screenshot of the code running at this point.)