Questions I ran into recently, to be organized

Date: 2019-12-08

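Describe the prenatal screening (產篩) algorithm from your work at PE. For your own project, were oversampling and undersampling applicable?
In your own project, how much sample data did you process with Pandas? (On the order of millions of rows.)
Are you proficient with SQL? I answered that I have mostly used MongoDB.
Have you used distributed graph computing or big data platforms?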
How do you explain a t-test?
Significance testing: how do you interpret a p-value?
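As a note on the two questions above, here is a minimal sketch using `scipy.stats.ttest_ind` on synthetic data (the sample sizes, means, and seed are made up for illustration). The p-value is the probability, assuming the null hypothesis of equal means is true, of seeing a t statistic at least as extreme as the observed one. It is not the probability that the null hypothesis is true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two synthetic groups whose true means differ by 1.0 (illustrative values).
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=1.0, scale=1.0, size=200)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# The same t statistic by hand: difference in means over its standard error.
standard_error = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / standard_error

print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

A small p-value here only says the observed difference would be unlikely if the two means were equal; the significance threshold (commonly 0.05) is a convention chosen before looking at the data.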
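Assumptions behind regression algorithms
Which basic algorithms have you used? (Logistic regression, tree models, and tuning with grid_search are assumed knowledge.)
Linear regression (have you looked into generalized linear regression?) (see http://www.javashuo.com/article/p-cjrjvqrm-es.html)
The principle behind PCA dimensionality reduction (understand it from the perspective of the essence of linear algebra, or follow the Zhihu articles; this still needs to be written up)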
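Until that write-up exists, a minimal sketch of PCA from the linear algebra side: center the data, take the SVD, and project onto the top right singular vectors. The function name `pca` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal directions using an SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions (eigenvectors of the covariance).
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance of the data along each principal direction.
    explained_variance = singular_values**2 / (X.shape[0] - 1)
    projected = X_centered @ components.T
    return projected, components, explained_variance[:n_components]

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
# 2-D points lying close to the line y = 2x, plus a little noise.
X = np.hstack([latent, 2.0 * latent]) + 0.05 * rng.normal(size=(500, 2))

Z, components, variance = pca(X, n_components=1)
print(components[0], variance)
```

The first component should point (up to sign) along the direction (1, 2), which is where almost all of the variance lies.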
How is feature importance computed in a random forest? (The first link gives the most detailed explanation; see https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting, https://stackoverflow.com/questions/34218245/how-is-the-feature-score-importance-in-the-xgboost-package-calculated, http://www.javashuo.com/article/p-undiasof-gt.html) (http://www.javashuo.com/article/p-bghajdgy-o.html, https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp)
- Feature importance in GBDT (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py)
Do you know how it is actually computed? (I answered "by Gini index or information gain", but that is really the feature selection criterion used when splitting a node.)
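To make the distinction concrete: the split criterion picks a feature at each node, while impurity-based importance adds up, per feature, the weighted impurity decrease of every node that splits on it. The sketch below recomputes this for a single scikit-learn decision tree from its public `tree_` arrays; the helper name `impurity_importances` is made up for this note, and the iris dataset is just a convenient example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def impurity_importances(fitted_tree, n_features):
    """Sum the weighted impurity decrease of every split, grouped by feature."""
    t = fitted_tree.tree_
    w = t.weighted_n_node_samples
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, so no contribution
            continue
        decrease = (w[node] * t.impurity[node]
                    - w[left] * t.impurity[left]
                    - w[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    importances /= w[0]  # scale by the number of samples at the root
    total = importances.sum()
    return importances / total if total > 0 else importances

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
manual = impurity_importances(clf, X.shape[1])
print(manual)
print(clf.feature_importances_)
```

A forest then combines these per-tree values across its trees, so a feature that is rarely chosen, or chosen only near the leaves where few samples remain, ends up with a low score.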
Where does the randomness in a random forest come from? How does bootstrap sampling work?
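The usual answer names two sources: each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. A minimal NumPy sketch of both, with made-up sizes and a square-root rule for the feature subset as an assumed (but common) default:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 16  # illustrative sizes

# Bootstrap: draw n_samples row indices with replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)
in_bag = np.unique(boot_idx)

# Rows never drawn are "out of bag" and can serve as a validation set.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False

unique_fraction = in_bag.size / n_samples  # tends to 1 - 1/e, about 0.632

# Feature subsampling: candidate features considered at one split.
max_features = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=max_features, replace=False)

print(unique_fraction, candidate_features)
```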
A and B take turns tossing a coin, and whoever tosses heads first wins. What is the probability that A, who tosses first, wins?
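One way to work this out: A wins immediately with probability p, or both players toss tails (probability (1 - p)^2) and the game restarts with A tossing first again. So P(A) = p + (1 - p)^2 P(A), which gives P(A) = p / (1 - (1 - p)^2). The sketch below evaluates that for a fair coin and checks it against a simulation; the function names are made up for this note.

```python
import random
from fractions import Fraction

def first_player_win_probability(p):
    """Closed form from P = p + (1 - p)^2 * P."""
    q = 1 - p
    return p / (1 - q * q)

def simulate(p, n_games, seed=0):
    """Estimate the first player's win rate by playing n_games."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        tails_before_heads = 0
        while rng.random() >= p:  # this toss came up tails
            tails_before_heads += 1
        # A tosses on turns 0, 2, 4, ..., so A wins after an even number of tails.
        wins += tails_before_heads % 2 == 0
    return wins / n_games

exact = first_player_win_probability(Fraction(1, 2))
estimate = simulate(0.5, 100_000)
print(exact, estimate)
```

For a fair coin the exact value is 2/3.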
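Ability to reproduce algorithms from papers, which mainly comes down to coding ability
Feature engineering (http://www.cnblogs.com/jasonfreak/p/5448385.html)
Why does the logistic regression model use the sigmoid function?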
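- It follows from the derivation of the generalized linear model
- It satisfies the maximum entropy model in statistics
- It has convenient properties (the sigmoid function is smooth and differentiable to any order, and its first and second derivatives can be obtained directly from the function value without further differentiation, which is very practical in implementations)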
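The last point can be checked numerically. With s = sigmoid(z), the first derivative is s(1 - s) and the second is s(1 - s)(1 - 2s), so both come from the function value alone. The sketch below compares them against finite differences; the function names and the step size `h` are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivatives(s):
    """First and second derivatives, given only the function value s."""
    first = s * (1.0 - s)
    second = first * (1.0 - 2.0 * s)
    return first, second

z = np.linspace(-4.0, 4.0, 9)
s = sigmoid(z)
d1, d2 = sigmoid_derivatives(s)

# Central finite differences as an independent check.
h = 1e-3
d1_numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
d2_numeric = (sigmoid(z + h) - 2 * s + sigmoid(z - h)) / h**2

print(np.max(np.abs(d1 - d1_numeric)), np.max(np.abs(d2 - d2_numeric)))
```

This is why gradient and Hessian computations for logistic regression can reuse the predicted probabilities instead of evaluating extra exponentials.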
Common questions about logistic regression
- http://www.javashuo.com/article/p-efupxmyy-bx.html
  \[ \begin{array}{l} \operatorname{logit}(\mathrm{x}) = \ln\left(\frac{P(\mathrm{y}=1 \mid \mathrm{x})}{P(\mathrm{y}=0 \mid \mathrm{x})}\right) \\ = \ln\left(\frac{P(\mathrm{y}=1 \mid \mathrm{x})}{1 - P(\mathrm{y}=1 \mid \mathrm{x})}\right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_m x_m \end{array} \]
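Solving the log-odds relation above for the probability recovers the sigmoid, which ties this back to the earlier question of why logistic regression uses it. Here z is shorthand introduced only for this note, standing for the linear term \theta_0 + \theta_1 x_1 + \ldots + \theta_m x_m:

\[ \begin{array}{l} \frac{P(\mathrm{y}=1 \mid \mathrm{x})}{1 - P(\mathrm{y}=1 \mid \mathrm{x})} = e^{z} \\ \Longrightarrow P(\mathrm{y}=1 \mid \mathrm{x}) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \end{array} \]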
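Feature importance in sklearn GBDT (understanding the feature importance source code for gradient boosting decision trees)

First, compute the feature importances within each decision tree
Then, average each feature's importance over all trees
Finally, compute each feature's relative importance (normalization)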
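The three steps above can be sketched directly with scikit-learn's public attributes. This follows the steps as written in these notes rather than the library's exact code: the details (for example, whether per-tree values are normalized before averaging) have differed between scikit-learn versions, so the result may not match `model.feature_importances_` exactly. The dataset and `n_estimators` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Step 1: importances of every individual regression tree.
# estimators_ is a 2-D array: one row per boosting stage, one tree per class.
per_tree = np.array(
    [tree.feature_importances_ for tree in model.estimators_.ravel()]
)

# Step 2: average each feature's importance over all trees.
mean_importance = per_tree.mean(axis=0)

# Step 3: normalize so the relative importances sum to 1.
relative_importance = mean_importance / mean_importance.sum()

print(relative_importance)
print(model.feature_importances_)  # library value, for comparison
```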
