How does LightGBM handle missing values?

https://www.kaggle.com/c/home-credit-default-risk/discussion/57918

posted in Home Credit Default Risk • 2 months ago

8

Hi everyone, can anyone explain how LightGBM handles missing data? Will manually handling the missing data help LightGBM?

Thanks in advance.

Comments (7)


JohnM

JohnM • (81st in this Competition) • 2 months ago

10

Good question, Usman, I wasn't sure myself. From what I understand, LightGBM will ignore missing values during a split, then allocate them to whichever side reduces the loss the most. Section 3.2 of this reference explains it.
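
Not from the original post, just to illustrate: a minimal sketch on synthetic data showing that LightGBM accepts NaNs directly, and that each split learns a default direction for them during training.

```python
# Minimal sketch: LightGBM trains directly on data containing NaNs.
# Each split stores a learned "default direction" for missing values,
# chosen during training to minimize the loss.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% missing values

train_set = lgb.Dataset(X, label=y)
model = lgb.train({"objective": "binary", "verbose": -1},
                  train_set, num_boost_round=20)
# No imputation was needed: rows with NaNs were routed down the learned
# default direction at every split.
```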

There are some options you can set, such as use_missing=false, which disables the special handling of missing values. You can also use the zero_as_missing option to change the behavior. GitHub reference.
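
For reference, here is roughly how those two options look in the Python API (a sketch; the parameter names are LightGBM's, the surrounding setup is made up):

```python
# Sketch of the two options mentioned above. use_missing=False disables
# the special missing-value handling entirely; zero_as_missing=True
# additionally treats zeros (e.g. from sparse matrices) as missing.
params_no_missing = {
    "objective": "binary",
    "use_missing": False,      # NaNs no longer get a learned default direction
}
params_zero_missing = {
    "objective": "binary",
    "zero_as_missing": True,   # zeros are treated as missing values too
}
# Either dict can be passed as the params argument of lgb.train(...).
```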

Manually dealing with missing values will often improve model performance. It sounds like if you set missing values to something like -99, those values will be treated as ordinary feature values and considered during a split (not sure though). And of course, if you impute missing values and the imputation is directionally correct, you should also see an improvement, as long as the feature itself is meaningful.
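
To make the two manual strategies concrete, here is an illustrative sketch (my own, not from any competition kernel): a sentinel fill so missing values become an ordinary split candidate, versus column-wise median imputation.

```python
# Two manual strategies for missing values (illustrative only).
import numpy as np

def fill_sentinel(X, value=-99.0):
    """Replace NaNs with a sentinel; the tree can then split on it directly."""
    X = X.copy()
    X[np.isnan(X)] = value
    return X

def fill_median(X):
    """Column-wise median imputation."""
    X = X.copy()
    med = np.nanmedian(X, axis=0)      # per-column median, ignoring NaNs
    idx = np.where(np.isnan(X))        # (row, col) positions of the NaNs
    X[idx] = np.take(med, idx[1])      # fill each NaN with its column's median
    return X
```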

Adrien

Adrien • (1859th in this Competition) • 9 days ago

0

Hello JohnM, I have a little question about LightGBM's handling of missing values. You said that LightGBM ignores missing values for the split and then allocates them to whichever side reduces the loss the most. But in the case of the test set, how could the model allocate them to reduce the loss when it doesn't have any 'TARGET'?

Thank you for your answer and your help!

JohnM

JohnM • (81st in this Competition) • 9 days ago

1

The trees and cutoffs, including the default directions for NAs, are established during training, where we have the targets. The test observations are then simply run through the trained model for prediction. Here's a cool viz that shows what I'm trying to say: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
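
A tiny illustration of that point on made-up data: the default directions for NaNs are fixed when the model is trained, so predicting on a test set with NaNs needs no target at all.

```python
# The default directions for NaNs are learned at training time;
# prediction just replays them, with no TARGET involved.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 1] > 0).astype(int)
X_train[rng.random(X_train.shape) < 0.1] = np.nan

model = lgb.train({"objective": "binary", "verbose": -1},
                  lgb.Dataset(X_train, label=y_train), num_boost_round=10)

X_test = rng.normal(size=(5, 3))
X_test[0, 1] = np.nan             # a missing value in the test set
print(model.predict(X_test))      # NaNs follow the stored default directions
```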

Usman Abbas

Usman Abbas (Topic Author) • 8 days ago

0

Hi JohnM, would you please tell me what CV you are getting for your model with LB 0.805? Right now I am struggling with overfitting, with CV 0.794 and LB 0.802. Any suggestions on how to reduce the gap? Thanks :)

JohnM

JohnM • (81st in this Competition) • 7 days ago

0

Usman, I wish I had a better CV strategy. My CV for the 0.805 submission is 0.796 which seems to mirror your situation. It's an ensemble. I have other ensembles with slightly better CVs but worse LBs. I'm going to try a couple more things before resigning myself to stratified K-Fold and luck.
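
For anyone following along, a generic stratified K-fold CV loop for a binary target looks roughly like this (a sketch with placeholder settings, not JohnM's actual setup):

```python
# Generic stratified K-fold cross-validation for a binary target,
# scored with AUC (the metric used on this competition's leaderboard).
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, n_splits=5, seed=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr_idx, va_idx in skf.split(X, y):
        train_set = lgb.Dataset(X[tr_idx], label=y[tr_idx])
        model = lgb.train({"objective": "binary", "verbose": -1},
                          train_set, num_boost_round=100)
        scores.append(roc_auc_score(y[va_idx], model.predict(X[va_idx])))
    return np.mean(scores)
```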

Usman Abbas

Usman Abbas (Topic Author) • 7 days ago

0

Thanks for your answer and all the best :)

Usman Abbas

Usman Abbas (Topic Author) • 2 months ago

0

Thanks JohnM for your answer :) It gave me a good insight.
