feature_importance in RandomForest

Date: 2019-12-05

Tags: randomforest, feature importance

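The random forest algorithm (RandomForest) exposes an output attribute called feature_importances_, which means feature importance. This post tries to explain what it actually measures.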

According to the official site and other material, RF can output two kinds of feature importance: Variable importance and Gini importance. Both are feature importances; they differ only in how they are computed.

Variable importance

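Pick a feature M, artificially add noise to feature M in all OOB (out-of-bag) samples, and re-measure the model's accuracy on the OOB samples. How much the accuracy drops compared with the noise-free case indicates how important the feature is.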

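If a feature matters a lot for classifying the data, then once its values are no longer accurate the test results change substantially, whereas corrupting an unimportant feature with noise barely affects them. That is the simple idea behind the Variable importance method.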

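[On "adding noise": the official site's wording is "randomly permute the values of variable m in the oob cases". Some sources read "permute" as shuffling the order and others as adding white noise to the data. The standard meaning is the former: the values of variable m are randomly reordered across the OOB cases, so the feature keeps its marginal distribution but loses its association with the label.]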
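The permutation idea can be sketched in a few lines of Python. This is a minimal illustration, not Breiman's implementation: it shuffles each column on a held-out test split instead of on the per-tree OOB samples, and the helper name permutation_importance_sketch, the dataset, and the parameters n_repeats and seed are assumptions chosen for the example. scikit-learn also ships sklearn.inspection.permutation_importance, which follows the same idea.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # accuracy on the untouched data
    drops = np.zeros(X.shape[1])
    for m in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Reorder the values of feature m across rows; other columns stay as they are.
            X_perm[:, m] = rng.permutation(X_perm[:, m])
            drops[m] += baseline - model.score(X_perm, y)
    return drops / n_repeats


# Assumption: a held-out split stands in for the OOB samples.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

importances = permutation_importance_sketch(rf, X_test, y_test)
for m in np.argsort(importances)[::-1][:5]:
    print(f"feature {m}: mean accuracy drop {importances[m]:.4f}")
```

A score near zero (or slightly negative) suggests the model does not rely on that feature; note that correlated features can share importance, so a low score does not by itself prove a feature is uninformative.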

Gini importance

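Pick a feature M and, across every tree in the RF, sum the decrease in Gini index (or, more generally, in impurity) at the split nodes that use M. That sum is the importance of M.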
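For a single tree, the impurity-decrease bookkeeping can be reproduced by hand from the fitted tree structure. The sketch below assumes scikit-learn's DecisionTreeClassifier and its tree_ arrays (children_left, children_right, feature, impurity, weighted_n_node_samples); the helper name impurity_decrease_importance is hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def impurity_decrease_importance(fitted_tree, n_features):
    """Sum the weighted impurity decrease of every split, grouped by feature."""
    t = fitted_tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, nothing to add
            continue
        # Impurity of the parent minus the impurity of its two children,
        # each weighted by the number of samples reaching that node.
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    return importances / importances.sum()  # normalize so the scores sum to 1


X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
manual = impurity_decrease_importance(tree, X.shape[1])
print("manual :", manual)
print("sklearn:", tree.feature_importances_)
```

The two printed vectors should agree, which suggests that the per-tree feature_importances_ is this normalized sum of impurity decreases.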

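Comparing the two, the former is more expensive to compute, while the latter only needs some bookkeeping while the DTs (decision trees) are being built. Judging from the description of the feature_importances_ attribute in the official sklearn documentation, sklearn appears to rank features by Gini importance, and it normalizes the Gini importances so that they sum to 1 to produce the final feature_importances_ output.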
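The normalization is easy to check on a fitted forest. The following sketch assumes, based on my reading of the scikit-learn behavior, that the forest-level value is the average of the per-tree importances rescaled to sum to 1; the dataset and hyperparameters are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree importances, then rescale so they sum to 1.
per_tree = np.array([est.feature_importances_ for est in forest.estimators_])
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

print("sum of importances:", forest.feature_importances_.sum())
print("matches per-tree average:", np.allclose(averaged, forest.feature_importances_))

# Rank features from most to least important.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("ranking (feature indices):", ranking)
```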

References:

RandomForest official site: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Variable importance

The variable importances are critical. The run computing importances is done by switching imp =0 to imp =1 in the above parameter list. The output has four columns:

	gene number 
	the raw importance score 
	the z-score obtained by dividing the raw score by its standard error 
	the significance level.

The highest 25 gene importances are listed sorted by their z-scores. To get the output on a disk file, put impout =1, and give a name to the corresponding output file. If impout is put equal to 2 the results are written to screen and you will see a display similar to that immediately below:

gene       raw     z-score  significance
number    score
  667     1.414     1.069     0.143
  689     1.259     0.961     0.168
  666     1.112     0.903     0.183
  668     1.031     0.849     0.198
  682     0.820     0.803     0.211
  878     0.649     0.736     0.231
 1080     0.514     0.729     0.233
 1104     0.514     0.718     0.237
  879     0.591     0.713     0.238
  895     0.519     0.685     0.247
 3621     0.552     0.684     0.247
 3529     0.650     0.683     0.247
 3404     0.453     0.661     0.254
  623     0.286     0.655     0.256
 3617     0.498     0.654     0.257
  650     0.505     0.650     0.258
  645     0.380     0.644     0.260
 3616     0.497     0.636     0.262
  938     0.421     0.635     0.263
  915     0.426     0.631     0.264
  669     0.484     0.626     0.266
  663     0.550     0.625     0.266
  723     0.334     0.610     0.271
  685     0.405     0.605     0.272
 3631     0.402     0.603     0.273
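The relationship between the z-score and significance columns can be checked numerically. The excerpt does not say how the significance level is derived; the sketch below assumes it is the one-sided upper-tail probability of a standard normal at the z-score, an assumption that happens to reproduce the first rows of the table to three decimals. The helper name upper_tail is hypothetical.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# (gene number, raw score, z-score, significance) copied from the table above.
rows = [
    (667, 1.414, 1.069, 0.143),
    (689, 1.259, 0.961, 0.168),
    (666, 1.112, 0.903, 0.183),
]

for gene, raw, z, sig in rows:
    implied_se = raw / z  # z-score = raw score / standard error
    print(f"gene {gene}: implied std. error {implied_se:.3f}, "
          f"upper tail {upper_tail(z):.3f}, table {sig:.3f}")
```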

Using important variables

Another useful option is to do an automatic rerun using only those variables that were most important in the original run. Say we want to use only the 15 most important variables found in the first run in the second run. Then in the options change mdim2nd=0 to mdim2nd=15 , keep imp=1 and compile. Directing output to screen, you will see the same output as above for the first run plus the following output for the second run. Then the importances are output for the 15 variables used in the 2nd run.

gene       raw     z-score  significance
number    score
 3621     6.235     2.753     0.003
 1104     6.059     2.709     0.003
 3529     5.671     2.568     0.005
  666     7.837     2.389     0.008
 3631     4.657     2.363     0.009
  667     7.005     2.275     0.011
  668     6.828     2.255     0.012
  689     6.637     2.182     0.015
  878     4.733     2.169     0.015
  682     4.305     1.817     0.035
  644     2.710     1.563     0.059
  879     1.750     1.283     0.100
  686     1.937     1.261     0.104
 1080     0.927     0.906     0.183
  623     0.564     0.847     0.199
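A comparable two-pass workflow can be sketched with scikit-learn. This is not the Fortran program's mdim2nd option: it ranks features by sklearn's feature_importances_ (Gini importance) rather than by the z-scores above, and the dataset, K = 15, and the hyperparameters are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

K = 15  # number of top-ranked variables kept for the second run

X, y = load_breast_cancer(return_X_y=True)

# First run: fit on all variables and rank them by importance.
first = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
first.fit(X, y)
top_k = np.argsort(first.feature_importances_)[::-1][:K]

# Second run: refit using only the K most important variables.
second = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
second.fit(X[:, top_k], y)

print("kept feature indices:", top_k)
print(f"OOB accuracy, all features:   {first.oob_score_:.3f}")
print(f"OOB accuracy, top {K} features: {second.oob_score_:.3f}")
```

Because the second forest sees fewer competing variables, its importances are recomputed on the reduced set, which is one plausible reason the scores in the second run differ from the first.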

Variable interactions

Another option is looking at interactions between variables. If variable m1 is correlated with variable m2 then a split on m1 will decrease the probability of a nearby split on m2. The distance between splits on any two variables is compared with their theoretical difference if the variables were independent. The latter is subtracted from the former; a large resulting value is an indication of a repulsive interaction. To get this output, change interact=0 to interact=1, leaving imp=1 and mdim2nd=10.

The output consists of a code list: telling us the numbers of the genes corresponding to id. 1-10. The interactions are rounded to the closest integer and given in the matrix following two column list that tells which gene number is number 1 in the table, etc.

     1   2   3   4   5   6   7   8   9  10
 1   0  13   2   4   8  -7   3  -1  -7  -2
 2  13   0  11  14  11   6   3  -1   6   1
 3   2  11   0   6   7  -4   3   1   1  -2
 4   4  14   6   0  11  -2   1  -2   2  -4
 5   8  11   7  11   0  -1   3   1  -8   1
 6  -7   6  -4  -2  -1   0   7   6  -6  -1
 7   3   3   3   1   3   7   0  24  -1  -1
 8  -1  -1   1  -2   1   6  24   0  -2  -3
 9  -7   6   1   2  -8  -6  -1  -2   0  -5
10  -2   1  -2  -4   1  -1  -1  -3  -5   0

There are large interactions between gene 2 and genes 1,3,4,5 and between 7 and 8.
