Dask (xgboost and lightgbm with Dask) -- likely WRONG results #49
xgboost setup:
regular XGBoost (without Dask):
Results:
With Dask:
Results:
Changing number of workers, threads, partitions:
UPDATE: I suspect data leakage in the Dask version, introduced when train and test are lumped together for consistent label encoding and then split back into train and test. The leakage might occur because of partitions (??), and it might be responsible for the higher AUC. Instead of this issue, see the new GitHub issue #50, where the analysis is redone using integer encoding outside of Dask.
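To illustrate the "integer encoding outside of Dask" idea, here is a minimal sketch in plain Python. It is not the code from this issue or from #50: the function names, the column name and the sample values are all hypothetical. The assumption is that the category-to-integer mapping is built from the training column only and then applied to both sets, so train and test are never concatenated and the split never depends on row order or partitioning.

```python
# Hypothetical sketch: all names and data here are illustrative assumptions,
# not taken from the benchmark code discussed in this issue.

def fit_integer_encoding(values):
    """Assign an integer code to each distinct category, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def apply_integer_encoding(values, mapping, unseen_code=-1):
    """Encode values with a fitted mapping; categories not seen during fitting get unseen_code."""
    return [mapping.get(value, unseen_code) for value in values]


# Toy categorical column, kept as two separate lists (never concatenated).
train_carrier = ["AA", "UA", "AA", "DL"]
test_carrier = ["UA", "DL", "WN"]

# Fit on train only, then apply the same mapping to both sets.
carrier_codes = fit_integer_encoding(train_carrier)
train_encoded = apply_integer_encoding(train_carrier, carrier_codes)
test_encoded = apply_integer_encoding(test_carrier, carrier_codes)

print(train_encoded)
print(test_encoded)
```

The encoded columns can then be handed to Dask (or to regular XGBoost / lightgbm) as ordinary integers. Whether this removes the AUC difference is exactly what the follow-up issue is meant to check.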
10M rows:
RAM usage (1M rows):
regular XGBoost: ~1GB (data + while training)
Dask, 16 workers (1 thread each): ~3GB for data, ~5GB when training
lightgbm without Dask:
Results:
lightgbm with Dask: still in development, so for now use this:
Results:
m5.4xlarge, 16 cores (8 physical + 8 hyperthreads)
1M rows
integer encoding for simplicity