# Random Forest

The previous post covered decision trees; this one covers an ensemble of trees: the random forest.

# ① Aggregation Model

The random forest is still an aggregation model. We have already met two aggregation models: bagging and the decision tree. Bagging learns while blending uniformly: it draws datasets D1, D2, ... from D by bootstrap sampling, trains a model on each, and averages (or votes) the results. The decision tree also learns as it goes, but it works by condition: it splits the data D directly with conditional branches. A random forest combines the two: bagging over many decision trees.
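The two ingredients above can be sketched in a few lines. This is a minimal illustration, not the implementation used later in this post; the helper names (`bootstrap_sample`, `bagging_predict`) are my own, and the base learners are assumed to be ±1 classifiers:

```python
import random


def bootstrap_sample(data, n_prime=None):
    """Draw n' rows from data with replacement (bagging's resampling step)."""
    n = len(data)
    n_prime = n_prime or n
    return [data[random.randint(0, n - 1)] for _ in range(n_prime)]


def bagging_predict(models, x):
    """Uniformly blend T classifiers that each output +1 or -1:
    the ensemble answer is the sign of the vote sum."""
    votes = sum(g(x) for g in models)
    return 1 if votes > 0 else -1
```

With decision trees as the base models `g`, this uniform vote over bootstrap-trained trees is exactly the random-forest recipe.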
# ② Random Forest
# ④ Feature Selection

Feature selection comes with its own pitfalls. One is redundancy: we may pick features that carry the same information, such as both birthday and age, which is awkward. The other is picking features that are simply irrelevant to the label.
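One simple way to spot the birthday-vs-age kind of redundancy is to look for feature pairs with near-perfect linear correlation. This is only a quick sketch of that idea (the function name and threshold are mine), not the importance-based selection a random forest itself provides:

```python
import numpy as np


def redundant_pairs(X, threshold=0.95):
    """Flag feature pairs whose absolute Pearson correlation exceeds the
    threshold, e.g. 'birthday' and 'age' encode the same information."""
    corr = np.corrcoef(X, rowvar=False)  # d-by-d correlation matrix
    d = corr.shape[0]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(corr[i, j]) > threshold]
```

A pair flagged here is a candidate for dropping one of its two columns before training.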
# ⑤ Code Implementation
Finally, let us look at RF's behavior through concrete examples. The first is a binary classification problem. As shown in the figure below, the left panel is the classification result of a C&RT tree without bootstrap, where different features were randomly combined, so the decision boundary contains slanted lines. The middle panel is a random forest made of a single decision tree grown on a bootstrap sample (N' = N/2); the bold points are the ones selected by the bootstrap. The right panel is the model after bagging that one tree; since there is only one tree, it is identical to the middle panel.
Next, a more complex example: many scattered points on a 2-D plane whose true boundary looks like a sin curve. With only one tree (t = 1), the left plot below shows the RF made of that single tree, and the right plot shows the RF formed by bagging all the trees together. Because there is only one tree, the two sides are identical.
Now for the actual implementation, again using random feature selection. First, a sampling function:
```python
def choose_samples(self, data, k):
    '''Randomly pick k feature columns and a bootstrap sample of the rows.
    input:  data (list of rows, last column is the label), k
    output: data_samples, feature (the chosen column indices)
    '''
    n, d = np.shape(data)
    # k random feature indices; the last column is the label, so d-2 is the max.
    feature = [rd.randint(0, d - 2) for _ in range(k)]
    # n bootstrap row indices, drawn with replacement.
    index = [rd.randint(0, n - 1) for _ in range(n)]
    data_samples = []
    for i in range(n):
        row = index[i]  # use the bootstrapped row, not row i itself
        data_tmp = [data[row][fea] for fea in feature]
        data_tmp.append(data[row][-1])
        data_samples.append(data_tmp)
    return data_samples, feature
```
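Pulling the same sampling logic out as a standalone function makes it easy to sanity-check on toy data. This mirrors the method above but is a sketch outside the class, with `random` standing in for the `rd` alias:

```python
import random


def choose_samples(data, k):
    """Random-subspace + bootstrap sampling, mirroring the method above."""
    n, d = len(data), len(data[0])
    feature = [random.randint(0, d - 2) for _ in range(k)]  # k feature columns
    index = [random.randint(0, n - 1) for _ in range(n)]    # bootstrap rows
    samples = [[data[i][f] for f in feature] + [data[i][-1]] for i in index]
    return samples, feature


# Toy data: 3 rows, 3 features plus a label in the last column.
data = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 0]]
random.seed(0)
samples, feature = choose_samples(data, 2)
```

Each returned row has k feature values plus the label, and rows can repeat because the bootstrap draws with replacement.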
This draws a bootstrap sample of the rows and keeps k feature columns from data. Next comes building the forest itself. The decision tree used here is the one implemented in the previous post, so that as much as possible is written by hand:
```python
def random_forest(self, data, trees_num):
    '''Build a forest of trees_num trees.
    input:  data (list), trees_num
    output: trees_result (trained trees), trees_feature (features used per tree)
    '''
    decisionTree = tree.decision_tree()
    trees_result = []
    trees_feature = []
    d = np.shape(data)[1]
    # Use roughly log2 of the feature count per tree.
    if d > 2:
        k = int(math.log(d - 1, 2)) + 1
    else:
        k = 1
    for i in range(trees_num):
        print('Training tree', i)
        # Each tree sees its own bootstrap sample and feature subset.
        data_samples, feature = self.choose_samples(data, k)
        t = decisionTree.build_tree(data_samples)
        trees_result.append(t)
        trees_feature.append(feature)
    return trees_result, trees_feature
```
Nothing unusual here: the function returns the list of trained trees and the feature subset each one used. Next are two utility functions, one to project data onto selected features and one to load data:
```python
def split_data(data_train, feature):
    '''Keep only the given feature columns (plus the label) from data_train.'''
    m = np.shape(data_train)[0]
    data = []
    for i in range(m):
        data_tmp = [data_train[i][x] for x in feature]
        data_tmp.append(data_train[i][-1])
        data.append(data_tmp)
    return data


def load_data():
    '''Load the breast-cancer dataset from sklearn and split it 70/30.'''
    print('loading data......')
    dataSet = load_breast_cancer()
    data = dataSet.data
    target = dataSet.target
    # Map the {0, 1} labels to {-1, +1} so that sign-voting works.
    for i in range(len(target)):
        if target[i] == 0:
            target[i] = -1
    dataframe = pd.DataFrame(data)
    dataframe.insert(np.shape(data)[1], 'target', target)
    dataMat = np.mat(dataframe)
    X_train, X_test, y_train, y_test = train_test_split(
        dataMat[:, 0:-1], dataMat[:, -1], test_size=0.3, random_state=0)
    data_train = np.hstack((X_train, y_train))
    data_train = data_train.tolist()
    X_test = X_test.tolist()
    return data_train, X_test, y_test
```
load_data splits the data 70/30 into training and test sets. Next are the prediction function and the accuracy function:
```python
def get_predict(self, trees_result, trees_feature, data_train):
    '''Aggregate the per-tree predictions by summing the ±1 votes.
    input:  trees_result, trees_feature, data_train
    output: final_predict
    '''
    decisionTree = tree.decision_tree()
    m_tree = len(trees_result)
    m = np.shape(data_train)[0]
    result = []
    for i in range(m_tree):
        clf = trees_result[i]
        feature = trees_feature[i]
        # Project the data onto the features this tree was trained on.
        data = tool.split_data(data_train, feature)
        result_i = []
        for j in range(m):  # j, not i: i still indexes the tree
            result_i.append(
                list(decisionTree.predict(data[j][0:-1], clf).keys())[0])
        result.append(result_i)
    final_predict = np.sum(result, axis=0)
    return final_predict


def cal_correct_rate(self, target, final_predict):
    '''A sample is correct when the vote sum has the same sign as its label.'''
    m = len(final_predict)
    corr = 0.0
    for i in range(m):
        if target[i] * final_predict[i] > 0:
            corr += 1
    return corr / m
```
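The voting step in get_predict and the sign check in cal_correct_rate are easiest to see in isolation. Here is the same arithmetic on hypothetical ±1 predictions from three trees over four samples:

```python
import numpy as np

# Each row: one tree's ±1 predictions on four samples (hypothetical values).
result = [[ 1, -1,  1,  1],
          [ 1,  1, -1,  1],
          [-1,  1,  1,  1]]
final_predict = np.sum(result, axis=0)  # column-wise vote total per sample

target = [1, 1, 1, -1]
# A sample counts as correct when label and vote sum share a sign.
correct = sum(1 for t, f in zip(target, final_predict) if t * f > 0)
accuracy = correct / len(target)
```

The last sample gets a vote total of +3 against a label of -1, so three of four samples are correct.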
These work much like the earlier decision-tree code and call into it. Last is the entry function:
```python
def running():
    '''Entry point: train forests of 1 to 19 trees and plot accuracy.'''
    data_train, X_test, target = load_data()
    forest = randomForest()
    accuracies = []
    for i in range(1, 20):
        trees, features = forest.random_forest(data_train, i)
        predictions = forest.get_predict(trees, features, X_test)
        accuracy = forest.cal_correct_rate(target, predictions)
        print('The forest has', i, 'trees.', 'Accuracy :', accuracy)
        accuracies.append(accuracy)
    plt.xlabel('Number of trees')
    plt.ylabel('Accuracy')
    plt.title('The relationship between tree number and accuracy')
    plt.plot(range(1, 20), accuracies, color='orange')
    plt.show()


if __name__ == '__main__':
    running()
```
This computes how accuracy changes as the forest grows from 1 to 19 trees and plots the comparison.
All the code is on GitHub: github.com/GreenArrow2…