https://www.kaggle.com/c/home-credit-default-risk/discussion/57918
How does LightGBM handle missing values?
posted in Home Credit Default Risk 2 months ago
8
Hi everyone, can anyone explain how LightGBM handles missing data? Will manually handling the missing data help LightGBM?
Thanks in advance.
Comments (7)
JohnM•(81st in this Competition)•2 months ago•Options•Reply
10
Good question, Usman, I wasn't sure myself. From what I understand, LightGBM ignores missing values when evaluating a split, then allocates them to whichever side reduces the loss the most. Section 3.2 of this reference explains it.
There are some options you can set, such as use_missing=false, which disables special handling of missing values. You can also use the zero_as_missing option to change which values are treated as missing. GitHub reference.
Manually dealing with missing values will often improve model performance. It sounds like if you set missing values to something like -99, those values will be treated as ordinary numbers during a split (not sure though). And of course, if you impute missing values and the imputation is directionally correct, you should also see an improvement as long as the feature itself is meaningful.
Adrien•(1859th in this Competition)•9 days ago•Options•Reply
0
Hello JohnM, I have a little question about this handling of missing values in LightGBM. You said that LightGBM ignores missing values for the split and then allocates them to whichever side reduces the loss the most. But in the case of the test set, how can the model allocate them to reduce the loss when it doesn't have any 'TARGET'?
Thank you for your answer and your help !
JohnM•(81st in this Competition)•9 days ago•Options•Reply
1
The trees and cutoffs, including the routing of NAs, are established during training, where we have the targets. The test observations are then run through the fitted model for prediction. Here's a cool viz that shows what I'm trying to say: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Usman AbbasTopic Author•8 days ago•Options•Reply
0
Hi JohnM, would you please tell me what CV you are getting for your model with LB 0.805? Right now I am struggling with overfitting: CV 0.794 and LB 0.802. Any suggestions on how to reduce the gap? Thanks :)
JohnM•(81st in this Competition)•7 days ago•Options•Reply
0
Usman, I wish I had a better CV strategy. My CV for the 0.805 submission is 0.796, which seems to mirror your situation. It's an ensemble. I have other ensembles with slightly better CVs but worse LBs. I'm going to try a couple more things before resigning myself to stratified K-fold and luck.
Usman AbbasTopic Author•7 days ago•Options•Reply
0
Thanks for your answer, and all the best :)
Usman AbbasTopic Author•2 months ago•Options•Reply
0
Thanks JohnM for your answer :) It gave me a good insight.