These are notes from the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.
Statistics and distance based features
This part focuses on advanced feature engineering: computing various statistics of one feature grouped by another, and deriving features from the analysis of a given point's neighborhood.
groupby and nearest neighbor methods
Example: here is some data from a CTR task.
We can hypothesize that the ad with the lowest price on a page will attract most of the attention, while the other ads on the page will be less attractive. Features that capture this intuition are easy to compute: for every user and web page, we can add the lowest and highest ad price. The position of the lowest-priced ad can also be used.
Code implementation
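A minimal pandas sketch of such groupby statistics (the column names `user_id`, `page_id`, and `ad_price` are illustrative, not taken from the course data):

```python
import pandas as pd

# Toy CTR-style data: several ads shown to users on pages.
df = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "page_id":  [10, 10, 20, 10, 10],
    "ad_price": [5.0, 3.0, 7.0, 4.0, 9.0],
})

# Min/max price among all ads shown to the same user on the same page.
gb = df.groupby(["user_id", "page_id"])["ad_price"]
df["price_min"] = gb.transform("min")
df["price_max"] = gb.transform("max")

# Position of each ad by price within its group; rank 1 = cheapest ad.
df["price_rank"] = gb.rank(method="dense")
```

`transform` keeps the original row order, so the new columns can be attached directly to the training table.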
- More features
- How many pages user visited
- Standard deviation of prices
- Most visited page
- Many, many more
What if there is no explicit feature to group by like this? We can use nearest neighbors.
Neighbors
- Explicit group is not needed
- More flexible
- Much harder to implement
Examples
- Number of houses in 500m, 1000m,..
- Average price per square meter in 500m, 1000m,..
- Number of schools/supermarkets/parking lots in 500m, 1000m,..
- Distance to closest subway station
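Features like these can be sketched with sklearn's `BallTree`; the house coordinates and prices below are toy data in a planar approximation (meters):

```python
import numpy as np
from sklearn.neighbors import BallTree

# Toy house coordinates in meters and prices per square meter.
coords = np.array([[0.0, 0.0], [100.0, 0.0], [300.0, 300.0], [2000.0, 2000.0]])
price_per_m2 = np.array([1000.0, 1200.0, 800.0, 500.0])

tree = BallTree(coords)
# For every house: indices of all houses within 500 m (each house finds itself too).
idx = tree.query_radius(coords, r=500.0)

n_houses_500m = np.array([len(i) - 1 for i in idx])                # exclude the house itself
avg_price_500m = np.array([price_per_m2[i].mean() for i in idx])   # includes the house itself
```

For real geographic data, `BallTree` also supports the haversine metric on latitude/longitude given in radians.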
The instructor used this in the Springleaf competition.
KNN features in Springleaf
- Mean encode all the variables
- For every point, find 2000 nearest neighbors using the Bray-Curtis metric
$$\frac{\sum_i{|u_i - v_i|}}{\sum_i{|u_i + v_i|}}$$
- Calculate various features from those 2000 neighbors
Evaluate
- Mean target of nearest 5, 10, 15, 500, 2000 neighbors
- Mean distance to 10 closest neighbors
- Mean distance to 10 closest neighbors with target 1
- Mean distance to 10 closest neighbors with target 0
- Example of feature fusion
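A sketch of such KNN features on synthetic data (the random matrix stands in for the mean-encoded variables, and 10 neighbors are used instead of 2000 to keep it small):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100, 5))            # stand-in for mean-encoded variables
y = (X[:, 0] > 0.5).astype(int)     # toy binary target

# Bray-Curtis requires the brute-force algorithm; ask for k+1 neighbors
# because each point's nearest neighbor is itself.
k = 10
nn = NearestNeighbors(n_neighbors=k + 1, metric="braycurtis", algorithm="brute")
nn.fit(X)
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop self

mean_target_k = y[idx].mean(axis=1)  # mean target of the k nearest neighbors
mean_dist_k = dist.mean(axis=1)      # mean distance to the k nearest neighbors
```

Features conditioned on the target (e.g. mean distance to neighbors with target 1) follow the same pattern, with the neighbor set filtered by `y[idx]`.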
Notes about Matrix Factorization
- Can be applied only to some of the columns
- Can provide additional diversity
- Good for ensembles
- It is a lossy transformation. Its efficiency depends on:
- Particular task
- Number of latent factors
Implementation
- Several MF methods can be found in sklearn
- SVD and PCA
- Standard tools for Matrix Factorization
- TruncatedSVD
- Works with sparse matrices
- Non-negative Matrix Factorization (NMF)
- Ensures that all latent factors are non-negative
- Good for count-like data
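A short sketch of both tools on a toy sparse count matrix (e.g. bag-of-words counts):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF, TruncatedSVD

# Toy count matrix; rows are objects, columns are count features.
counts = csr_matrix(np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 0, 2],
    [0, 3, 1, 2],
], dtype=float))

svd = TruncatedSVD(n_components=2, random_state=0)  # works directly on sparse input
svd_feats = svd.fit_transform(counts)

nmf = NMF(n_components=2, random_state=0)           # all latent factors non-negative
nmf_feats = nmf.fit_transform(counts)
```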
NMF for tree-based methods
Non-negative matrix factorization, NMF for short, transforms data in a way that makes it more suitable for decision trees.
As can be seen, NMF transforms the data so that it forms lines parallel to the axes.
Factorization
Tricks used with linear models can also be applied when factorizing a matrix.
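For example, one common linear-model trick, the log(x+1) transform, can be applied before factorization so that heavy-tailed values do not dominate the decomposition (a sketch on toy data):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy matrix with one heavy-tailed column.
X = np.array([[1.0, 100.0],
              [2.0, 10000.0],
              [3.0, 1.0]])

# Shrink the large values with log1p before factorizing.
svd = TruncatedSVD(n_components=1, random_state=0)
feats = svd.fit_transform(np.log1p(X))
```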
Conclusion
- Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
- It can be applied to transform categorical features into real-valued ones
- Many tricks suitable for linear models can be useful for MF
Feature interactions
All combinations of feature values
Suppose we are building a model that predicts the best ad banner to show on a website.
| ... | ad category | site category | ... | target |
| --- | --- | --- | --- | --- |
| ... | auto_part | game_news | ... | 0 |
| ... | music_tickets | music_news | ... | 1 |
| ... | mobile_phones | auto_blog | ... | 0 |
Combining the category of the ad banner itself with the category of the site where the banner will be shown forms a very strong feature.
| ... | ad_site | ... | target |
| --- | --- | --- | --- |
| ... | auto_part \| game_news | ... | 0 |
| ... | music_tickets \| music_news | ... | 1 |
| ... | mobile_phones \| auto_blog | ... | 0 |
Construct the combined feature ad_site from these two features.
From a technical point of view, there are two ways to build such an interaction.
Method 1: concatenate the two string values into a single categorical feature, then one-hot encode it.
Method 2: one-hot encode each feature separately, then take element-wise products of all pairs of the resulting columns.
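Both methods can be sketched with pandas (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ad":   ["auto_part", "music_tickets", "mobile_phones"],
    "site": ["game_news", "music_news", "auto_blog"],
})

# Method 1: concatenate the values, then one-hot encode the combined feature.
df["ad_site"] = df["ad"] + "_" + df["site"]
method1 = pd.get_dummies(df["ad_site"], dtype=int)

# Method 2: one-hot encode each feature, then multiply every pair of columns.
ad_ohe = pd.get_dummies(df["ad"], dtype=int)
site_ohe = pd.get_dummies(df["site"], dtype=int)
method2 = pd.DataFrame({
    f"{a}_{s}": ad_ohe[a] * site_ohe[s]
    for a in ad_ohe.columns for s in site_ohe.columns
})
```

Note that method 1 only creates columns for combinations that actually occur in the data, while method 2 materializes all pairwise products.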
- A similar idea can also be applied to numeric variables
In fact, this is not limited to multiplication; other operations can be used as well:
- Multiplication
- Sum
- Diff
- Division
- ..
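A minimal sketch of such numeric interactions for a pair of illustrative features `x` and `y`:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 4.0], "y": [2.0, 2.0, 8.0]})

# Pairwise interactions of two numeric features.
df["x_mul_y"] = df["x"] * df["y"]
df["x_plus_y"] = df["x"] + df["y"]
df["x_minus_y"] = df["x"] - df["y"]
df["x_div_y"] = df["x"] / df["y"]
```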
Practical Notes
- We have a lot of possible interactions: N*N for N features.
- a. Even more if several types of interactions are used
- Need to reduce their number
- a. Dimensionality reduction
- b. Feature selection
This approach generates a large number of features; feature selection or dimensionality reduction can be used to reduce them. Below, feature selection is taken as the example.
Interactions' order
- We looked at 2nd order interactions.
- Such approach can be generalized for higher orders.
- It is hard to do generation and selection automatically.
- Manual building of high-order interactions is some kind of art.
Consider a decision tree. Let us map each leaf to a binary feature: the index of the leaf an object falls into can be used as the value of a new categorical feature. If we use an ensemble of trees instead of a single tree, for example a random forest, this operation can be applied to every tree in it. This is a powerful way to extract high-order interactions.
In sklearn:
tree_model.apply()
In xgboost:
booster.predict(pred_leaf=True)
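A sketch of extracting leaf indices from a random forest and one-hot encoding them (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
rf.fit(X, y)

# apply() returns, for every sample, the index of the leaf it lands in for each tree.
leaves = rf.apply(X)                    # shape: (n_samples, n_trees)

# Treat the leaf indices as categorical values and one-hot encode them.
leaf_features = OneHotEncoder().fit_transform(leaves)
```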
Conclusion
- We looked at ways to build an interaction of categorical attributes
- Extended this approach to real-valued features
- Learned how to extract features via decision trees
t-SNE
It is used for exploratory data analysis, and can also be viewed as a method of deriving features from the data.
Practical Notes
- Results heavily depend on hyperparameters (perplexity)
- Good practice is to use several projections with different perplexities (5-100)
- Due to its stochastic nature, tSNE provides different projections even for the same data/hyperparameters
- Train and test should be projected together
- tSNE runs for a long time with a big number of features
- It is common to do dimensionality reduction before projection.
- An implementation of tSNE can be found in the sklearn library.
- But personally I prefer the stand-alone python package tsne due to its faster speed.
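A sketch combining these notes: train and test are projected together, with PCA applied first to reduce dimensionality (sklearn's TSNE is used here to stay self-contained; data and shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.random((80, 50))
X_test = rng.random((20, 50))

# Project train and test together, reducing dimensionality with PCA first.
X_all = np.vstack([X_train, X_test])
X_pca = PCA(n_components=10, random_state=0).fit_transform(X_all)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

train_emb, test_emb = emb[:80], emb[80:]
```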
Conclusion
- tSNE is a great tool for visualization
- It can be used as feature as well
- Be careful with interpretation of results
- Try different perplexities