CART Regression throws ArrayIndexOutOfBoundsException when using TrainTestSplit with proportion 1.0 #374
Comments
Can you provide the code where you construct the dataset with and without the train test split? Could you also ask the training dataset how big it is? I agree that using a train test split of 1.0 shouldn't crash the training run (though the test dataset will be malformed, and we should put proper validation on the trainProportion argument), but I can't quite see where it's triggering the issue, especially if XGBoost is fine. While the trees and XGBoost use different methods to iterate the dataset, they both rely on the underlying list inside the dataset for their size information, so if that's an odd size I'd expect both of them to break.
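To make the validation point concrete, here is a minimal sketch of what a check on the proportion argument could look like. It is not Tribuo's code: the class `TrainProportionCheck`, its methods, and the accepted range (0, 1] are all assumptions made for illustration.

```java
// Hypothetical helper, not part of Tribuo: validates a train proportion
// and derives the training split size from it.
final class TrainProportionCheck {
    private TrainProportionCheck() {}

    // Rejects NaN and anything outside (0, 1]. The accepted range is an assumption.
    static double validate(double trainProportion) {
        if (Double.isNaN(trainProportion) || trainProportion <= 0.0 || trainProportion > 1.0) {
            throw new IllegalArgumentException(
                "trainProportion must be in (0, 1], found " + trainProportion);
        }
        return trainProportion;
    }

    // Number of examples assigned to the training split, rounded down.
    static int trainSize(int totalSize, double trainProportion) {
        if (totalSize < 0) {
            throw new IllegalArgumentException("totalSize must be non-negative, found " + totalSize);
        }
        return (int) Math.floor(totalSize * validate(trainProportion));
    }

    public static void main(String[] args) {
        int total = 1107; // tuple count reported in this issue
        int train = trainSize(total, 1.0);
        System.out.println("train=" + train + " test=" + (total - train));
    }
}
```

With a check like this, a proportion of 1.0 is accepted and yields an empty test split, while values such as 1.5 or NaN fail fast with a clear message instead of surfacing later inside a trainer.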
I can provide the code, but I don't think it will help much because it is very generic as it is part of a bigger system.
var sourceQuery = config.getDatasourceQuery();
SQLDataSource<O> sqlSource = null;
try {
sqlSource = new SQLDataSource<>(
sourceQuery,
new SQLDBConfig("jdbc:duckdb:" + dbPath, Map.of()),
outputFactorySupplier.get(),
rowProcessor,
true);
} catch (SQLException e) {
throw new RuntimeException(e);
}
var totalSize = sizeFunction.apply(config);
double splitProportion;
if (config.getSampleAmount() < totalSize) {
splitProportion = ((double) config.getSampleAmount()) / totalSize;
} else {
splitProportion = 1;
}
MutableDataset<O> trainingData;
MutableDataset<O> testData = null;
// Avoid passing the data through the train test splitter with a split proportion of 1 because of a bug in Tribuo
// that causes an exception in the CART regression trainer: https://github.com/oracle/tribuo/issues/374
// It might also occur when using a proportion that's slightly less than 1.0 (like 0.99).
if (splitProportion < 1) {
var splitter = trainTestSplitter.apply(sqlSource, splitProportion);
trainingData = new MutableDataset<>(splitter.getTrain());
testData = new MutableDataset<>(splitter.getTest());
} else {
trainingData = new MutableDataset<>(sqlSource);
}
Can you check if the featureIDMap from the |
Sorry for my late response, I'm quite busy with my master's thesis at the moment; I'll have a bit more time in October. In the meantime, I got the same exception when training a random forest without splitting, this time using MeanSquaredError on a dataset that contains doubles about as close to zero as doubles allow, but only when the feature set contains both categorical and real features. I think it is a bit difficult to debug this by just describing the characteristics of the datasets; I'd need to share them with you. At the moment I can't do that as we might also use the datasets in a publication, but afterwards I can provide the exact datasets and code to replicate the errors.
Ok, so that sounds a lot more like a bug in the tree implementation itself rather than an issue with the train test splitter. Which is good, because the splitter is very simple and I really couldn't see what could go wrong there, but the tree implementation code is tricky and may well still have bugs.
Describe the bug
When using the TrainTestSplitter with split proportion 1.0 and seed 1L on a SQLDataSource that retrieves 1107 tuples, with the trainer configuration below, the following exception is thrown.
To Reproduce
I use the following configuration of the trainer:
The error does not occur when using XGBoost, or when using the SQLDataSource directly without passing it through the splitter, even though the number of tuples is the same.
Expected behaviour
I expect that using the TrainTestSplitter with a proportion of 1.0 behaves the same way as not using it at all (or at least does not produce an error).
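The expectation can be illustrated with a plain-Java shuffle-and-split over a list. This is only a sketch of the general technique, not Tribuo's TrainTestSplitter; the class `ListSplitSketch` and its `split` method are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, not Tribuo's implementation.
final class ListSplitSketch {
    private ListSplitSketch() {}

    // Shuffles a copy of the data with the given seed and returns [train, test].
    static <T> List<List<T>> split(List<T> data, double trainProportion, long seed) {
        if (Double.isNaN(trainProportion) || trainProportion < 0.0 || trainProportion > 1.0) {
            throw new IllegalArgumentException("trainProportion must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int trainSize = (int) (shuffled.size() * trainProportion);
        List<T> train = new ArrayList<>(shuffled.subList(0, trainSize));
        List<T> test = new ArrayList<>(shuffled.subList(trainSize, shuffled.size()));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1107; i++) {
            data.add(i);
        }
        List<List<Integer>> parts = split(data, 1.0, 1L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

In this sketch a proportion of 1.0 puts every element in the training part, in shuffled order, and leaves the test part empty, so downstream code sees the same examples as it would without a split.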
System information: