class lightgbm.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=200000, objective=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=1, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent=True, **kwargs)
boosting_type | (string, optional (default="gbdt")) | "gbdt": Gradient Boosting Decision Tree; "dart": Dropouts meet Multiple Additive Regression Trees; "goss": Gradient-based One-Side Sampling; "rf": Random Forest | |
num_leaves | (int, optional (default=31)) | Maximum number of leaves for each base learner | Should be <= 2^max_depth |
max_depth | (int, optional (default=-1)) | Maximum depth of each base learner; -1 means no limit | When the model overfits, reduce max_depth first |
learning_rate | (float, optional (default=0.1)) | Boosting learning rate | |
n_estimators | (int, optional (default=10)) | Number of base learners | |
max_bin | (int, optional (default=255)) | Maximum number of bins that feature values are bucketed into; presumably the k of the histogram | |
subsample_for_bin | (int, optional (default=200000)) | Number of samples for constructing bins | |
objective | (string, callable or None, optional (default=None)) | Default: 'regression' for LGBMRegressor, 'binary' or 'multiclass' for LGBMClassifier, 'lambdarank' for LGBMRanker | |
min_split_gain | (float, optional (default=0.)) | Minimum loss reduction required to split a leaf node further | |
min_child_weight | (float, optional (default=1e-3)) | Minimum sum of instance weight (hessian) needed in a child (leaf) | |
min_child_samples | (int, optional (default=20)) | Minimum number of records in a leaf node | |
subsample | (float, optional (default=1.)) | Fraction of the training data sampled during training | |
subsample_freq | (int, optional (default=1)) | Frequency of subsampling; <=0 disables it | |
colsample_bytree | (float, optional (default=1.)) | Subsample ratio of columns when constructing each tree | |
reg_alpha | (float, optional (default=0.)) | L1 regularization term on weights | |
reg_lambda | (float, optional (default=0.)) | L2 regularization term on weights | |
random_state | (int or None, optional (default=None)) | | |
silent | (bool, optional (default=True)) | | |
n_jobs | (int, optional (default=-1)) | | |
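The note on num_leaves (it should stay within 2^max_depth) can be checked before the estimator is built. The sketch below is only an illustration: `leaf_budget`, `check_tree_shape`, and `build_classifier` are hypothetical helper names, the parameter values are arbitrary, and `build_classifier` assumes the `lightgbm` package is installed.

```python
def leaf_budget(max_depth):
    """Upper bound on leaves for a tree of the given depth (None if unlimited)."""
    # The table says -1 means no limit; this sketch treats any
    # non-positive depth the same way (an assumption).
    if max_depth <= 0:
        return None
    return 2 ** max_depth


def check_tree_shape(params):
    """Return True if num_leaves fits within 2^max_depth."""
    budget = leaf_budget(params.get("max_depth", -1))
    return budget is None or params.get("num_leaves", 31) <= budget


def build_classifier(params):
    """Construct the estimator; requires the lightgbm package."""
    # Imported lazily so the checks above run without lightgbm installed.
    from lightgbm import LGBMClassifier
    if not check_tree_shape(params):
        raise ValueError("num_leaves exceeds 2^max_depth")
    return LGBMClassifier(**params)


# Illustrative values only, not tuned recommendations.
params = {"boosting_type": "gbdt", "num_leaves": 31, "max_depth": 5,
          "learning_rate": 0.1, "n_estimators": 100}
print(check_tree_shape(params))                      # 31 leaves vs. a budget of 32
print(check_tree_shape({**params, "max_depth": 4}))  # 31 leaves vs. a budget of 16
```

With max_depth left at -1 the check always passes, so num_leaves is then the only control on tree size.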
######################################################################################################
The table below lists the parameters that can be adjusted for each of three goals: faster speed, better accuracy, and dealing with over-fitting:
###########################################################################################
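Class attributes:
n_features_ | int | Number of features |
classes_ | array of shape = [n_classes] | Array of class labels (classification only) |
n_classes_ | int | Number of classes (classification only) |
best_score_ | dict or None | Best score of the fitted model |
best_iteration_ | int or None | Best iteration of the fitted model, if early_stopping_rounds was specified |
objective_ | string or callable | The concrete objective used while fitting the model |
booster_ | Booster | The underlying Booster of this model |
evals_result_ | dict or None | Evaluation results, if early_stopping_rounds was specified |
feature_importances_ | array of shape = [n_features] | Feature importances |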
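One way feature_importances_ might be read after fitting is to pair each value with its feature name and sort. `rank_features` is a hypothetical helper, and the names and numbers below are made up; in practice the second argument would be `model.feature_importances_`.

```python
def rank_features(feature_names, importances):
    """Pair names with importances and sort from most to least important."""
    if len(feature_names) != len(importances):
        raise ValueError("expected one importance per feature")
    pairs = zip(feature_names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up values standing in for model.feature_importances_.
names = ["age", "income", "tenure"]
importances = [12, 40, 7]
for name, score in rank_features(names, importances):
    print(f"{name}: {score}")
```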
###########################################################################################
Class methods:
fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_metric='logloss', early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)
X | array-like or sparse matrix of shape = [n_samples, n_features] | Feature matrix |
y | array-like of shape = [n_samples] | The target values (class labels in classification, real numbers in regression) |
sample_weight | array-like of shape = [n_samples] or None, optional (default=None) | Sample weights; can be set with np.where |
init_score | array-like of shape = [n_samples] or None, optional (default=None) | Init score of training data |
group | array-like of shape = [n_samples] or None, optional (default=None) | Group data of training data. |
eval_set | list or None, optional (default=None) | A list of (X, y) tuple pairs to use as validation sets for early stopping |
eval_names | list of strings or None, optional (default=None) | Names of eval_set |
eval_sample_weight | list of arrays or None, optional (default=None) | Weights of eval data |
eval_init_score | list of arrays or None, optional (default=None) | Init score of eval data |
eval_group | list of arrays or None, optional (default=None) | Group data of eval data |
eval_metric | string, list of strings, callable or None, optional (default="logloss") | "mae","mse",... |
early_stopping_rounds | int or None, optional (default=None) | Stop training if the validation score does not improve for this many rounds |
verbose | bool, optional (default=True) | |
feature_name | list of strings or 'auto', optional (default="auto") | If ‘auto’ and data is pandas DataFrame, data columns names are used |
categorical_feature | list of strings or int, or 'auto', optional (default="auto") | If ‘auto’ and data is pandas DataFrame, pandas categorical columns are used |
callbacks | list of callback functions or None, optional (default=None) |
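To make the fit() arguments more concrete, here is a minimal sketch that builds sample weights, defines a callable eval_metric, and passes both to fit(). It rests on several assumptions: the helper names are hypothetical, the weight of 5.0 is arbitrary, the callable follows the (y_true, y_pred) -> (eval_name, eval_result, is_higher_better) convention, and y_pred is taken to be the probability of the positive class in a binary task. The keyword arguments follow the signature shown above; releases that no longer accept early_stopping_rounds or verbose in fit() would need the callbacks argument instead.

```python
def class_weights(y, positive_weight=5.0):
    """Per-sample weights that up-weight the positive class.

    With NumPy arrays the equivalent is np.where(y == 1, positive_weight, 1.0).
    """
    return [positive_weight if label == 1 else 1.0 for label in y]


def error_at_half(y_true, y_pred):
    """Custom eval metric: share of samples misclassified at a 0.5 threshold.

    Returns (eval_name, eval_result, is_higher_better).
    """
    wrong = sum(1 for t, p in zip(y_true, y_pred) if (p > 0.5) != (t == 1))
    return "error_at_half", wrong / len(y_true), False


def fit_with_validation(model, X_train, y_train, X_valid, y_valid):
    """Call fit() with the arguments described in the table above."""
    return model.fit(
        X_train, y_train,
        sample_weight=class_weights(y_train),
        eval_set=[(X_valid, y_valid)],
        eval_names=["valid"],
        eval_metric=error_at_half,
        early_stopping_rounds=10,  # keyword taken from the signature above
        verbose=False,
    )


# Quick check of the metric on made-up labels and probabilities.
print(error_at_half([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7]))
```

After such a fit, best_iteration_ and evals_result_ would hold the early-stopping outcome listed under the class attributes.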
###############################################################################################
predict_proba(X, raw_score=False, num_iteration=0)
X | array-like or sparse matrix of shape = [n_samples, n_features] | Input features matrix |
raw_score | bool, optional (default=False) | Whether to predict raw scores |
num_iteration | int, optional (default=0) | Limit number of iterations in the prediction; defaults to 0 (use all trees). |
Returns | predicted_probability | The predicted probability for each class for each sample. |
Return type | array-like of shape = [n_samples, n_classes] |
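The returned array has one row per sample and one column per class, so a class label can be recovered by taking the column with the highest probability and looking it up in classes_. The sketch below shows this with made-up numbers; `labels_from_proba` is a hypothetical helper, not a LightGBM function.

```python
def labels_from_proba(proba, classes):
    """Pick, for each row, the class whose predicted probability is highest."""
    labels = []
    for row in proba:
        best = max(range(len(row)), key=lambda j: row[j])
        labels.append(classes[best])
    return labels


# Made-up probabilities standing in for model.predict_proba(X);
# classes stands in for model.classes_.
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6]]
classes = ["cat", "dog", "bird"]
print(labels_from_proba(proba, classes))
```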
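Parameters for handling imbalanced data:
1. A simple approach is to set is_unbalance to True or to set scale_pos_weight; only one of the two may be used. With is_unbalance set to True, the weight of negative samples is set to (number of positive samples) / (number of negative samples). This parameter applies only to binary classification.
2. Custom evaluation function: https://cloud.tencent.com/developer/article/1357671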
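For the first option, a value for scale_pos_weight has to come from somewhere. A common heuristic, which is an assumption here and not something stated above, is the ratio of negative to positive samples. `suggest_scale_pos_weight` is a hypothetical helper, and the resulting dictionary would be passed as keyword arguments to the estimator.

```python
from collections import Counter


def suggest_scale_pos_weight(y):
    """Ratio of negative to positive samples for a 0/1 label vector."""
    counts = Counter(int(label) for label in y)
    if counts[1] == 0 or counts[0] == 0:
        raise ValueError("need at least one sample of each class")
    return counts[0] / counts[1]


# Made-up labels: 90 negatives and 10 positives.
y = [0] * 90 + [1] * 10
params = {"objective": "binary", "scale_pos_weight": suggest_scale_pos_weight(y)}
# Alternative: {"objective": "binary", "is_unbalance": True}; do not set both.
print(params)
```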
Summary of how LightGBM works:
http://www.cnblogs.com/gczr/p/9024730.html
Paper translations: https://blog.csdn.net/u010242233/article/details/79769950, https://zhuanlan.zhihu.com/p/42939089
How categorical variables are handled: https://blog.csdn.net/anshuai_aw1/article/details/83275299