https://www.kaggle.com/c/home-credit-default-risk/discussion/57918
How does LightGBM handle missing values?
posted in Home Credit Default Risk 2 months ago
8
Hi everyone, can anyone explain how LightGBM handles missing data? Will manually handling the missing data help LightGBM?
Thanks in advance.
Comments (7)
JohnM•(81st in this Competition)•2 months ago•Options•Reply
10
Good question, Usman, I wasn't sure myself. From what I understand, LightGBM ignores missing values when evaluating a split, then allocates them to whichever side reduces the loss the most. Section 3.2 of this reference explains it.
There are some options you can set, such as use_missing=false, which disables special handling of missing values. You can also use the zero_as_missing option to change which values are treated as missing. GitHub reference.
Manually dealing with missing values will often improve model performance. It sounds like if you set missing values to something like -99, those values will be treated as ordinary numbers during a split (not sure though). And of course, if you impute missing values and the imputation is directionally correct, you should also see an improvement as long as the feature itself is meaningful.
Adrien•(1859th in this Competition)•9 days ago•Options•Reply
0
Hello JohnM, I have a little question about this handling of missing values in LightGBM. You said that LightGBM ignores missing values for the split and then allocates them to whichever side reduces the loss the most. But in the case of the test set, how can the model allocate them to reduce the loss when it doesn't have any 'TARGET'?
Thank you for your answer and your help !
JohnM•(81st in this Competition)•9 days ago•Options•Reply
1
The trees and cutoffs, including the routing of NAs, are established during training, where we have the targets. The test observations are then run through the fitted model for prediction. Here's a cool viz that shows what I'm trying to say: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Usman AbbasTopic Author•8 days ago•Options•Reply
0
Hi JohnM, would you please tell me what CV you are getting for your model with LB 0.805? Right now I am struggling with overfitting: CV 0.794 and LB 0.802. Any suggestions on how to reduce the gap? Thanks :)
JohnM•(81st in this Competition)•7 days ago•Options•Reply
0
Usman, I wish I had a better CV strategy. My CV for the 0.805 submission is 0.796, which seems to mirror your situation. It's an ensemble. I have other ensembles with slightly better CVs but worse LBs. I'm going to try a couple more things before resigning myself to stratified K-fold and luck.
Usman AbbasTopic Author•7 days ago•Options•Reply
0
Thanks for your answer, and all the best :)
Usman AbbasTopic Author•2 months ago•Options•Reply
0
Thanks JohnM for your answer :) It gave me a good insight.