CART Regression throws ArrayIndexOutOfBoundsException when using TrainTestSplit with proportion 1.0 #374

Open
Artraxon opened this issue Jul 30, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@Artraxon

Artraxon commented Jul 30, 2024

Describe the bug
When I use the TrainTestSplitter with a split proportion of 1.0 and seed 1L on a SQLDataSource that retrieves 1107 tuples, training with the trainer configuration below throws the following exception.

java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 1116 out of bounds for int[1107]
	at java.base/java.lang.System.arraycopy(Native Method)
	at org.tribuo.regression.rtree.impl.InvertedFeature.split(InvertedFeature.java:173)
	at org.tribuo.regression.rtree.impl.TreeFeature.split(TreeFeature.java:155)
	at org.tribuo.regression.rtree.impl.RegressorTrainingNode.splitAtBest(RegressorTrainingNode.java:322)
	at org.tribuo.regression.rtree.impl.RegressorTrainingNode.buildGreedyTree(RegressorTrainingNode.java:204)
	at org.tribuo.regression.rtree.impl.RegressorTrainingNode.buildTree(RegressorTrainingNode.java:152)
	at org.tribuo.regression.rtree.CARTRegressionTrainer.train(CARTRegressionTrainer.java:210)
	at org.tribuo.regression.rtree.CARTRegressionTrainer.train(CARTRegressionTrainer.java:60)
	at org.tribuo.ensemble.BaggingTrainer.trainSingleModel(BaggingTrainer.java:186)
	at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:168)
	at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:145)
	at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:140)
	at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:54)
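
For readers unfamiliar with this exception: the message says a System.arraycopy call would have written up to destination index 1116 in an int array of length 1107, nine elements past the end of an array sized like the dataset. The sketch below is not Tribuo's code. It only reproduces the same kind of failure with the numbers from the trace, and the class and method names are hypothetical.

```java
public class ArrayCopyBounds {

    /**
     * Tries to copy `count` ints into a destination array of length `destLength`.
     * Returns true if the copy fits, false if System.arraycopy rejects it.
     */
    static boolean copyFits(int destLength, int count) {
        int[] source = new int[count];
        int[] dest = new int[destLength];
        try {
            System.arraycopy(source, 0, dest, 0, count);
            return true;
        } catch (ArrayIndexOutOfBoundsException e) {
            // Recent HotSpot JVMs include the offending index and array type in the message.
            System.out.println(e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Copying exactly as many indices as there are examples fits.
        System.out.println(copyFits(1107, 1107));
        // Copying nine more indices than the destination can hold fails.
        System.out.println(copyFits(1107, 1116));
    }
}
```

If the tree code sizes its index arrays from the dataset, a failure like this suggests that some count of example indices came out larger than the number of examples, though where that happens is not established in this report.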

To Reproduce
I use the following configuration of the trainer:

CARTRegressionTrainer cartTrainer = new CARTRegressionTrainer(10,
                                                              AbstractCARTTrainer.MIN_EXAMPLES,
                                                              0.0F,
                                                              0.5F,
                                                              false,
                                                              new MeanAbsoluteError(),
                                                              Trainer.DEFAULT_SEED);
Trainer<Regressor> rfTrainer = new RandomForestTrainer<>(cartTrainer,
                                                         new AveragingCombiner(),
                                                         100,
                                                         5);

The error does not occur when using XGBoost, or when using the SQLDataSource directly without passing it through the splitter, even though the number of tuples is the same.

Expected behaviour

I expect that using the TrainTestSplitter with a proportion of 1.0 behaves the same way as not using it at all (or at least does not produce an error).

System information:

  • Tribuo Version: 4.3.1
  • OS: Arch Linux with linux 6.10.2, but runs in an Ubuntu 22.04 container
  • Java Version: 21 ( openjdk-21-jdk 21.0.3+9-1ubuntu1~22.04.1)
  • JDK Vendor: openjdk
@Artraxon Artraxon added the bug Something isn't working label Jul 30, 2024
@Craigacp
Member

Can you provide the code where you construct the dataset with and without the train test split? And also ask the training dataset how big it is?

I agree that using a train test split of 1.0 shouldn't crash the training run (though the test dataset will be malformed and we should put proper validation on the trainProportion argument), but I can't quite see where it's triggering the issue, especially if XGBoost is fine. While the trees & XGBoost have different methods to iterate the dataset, they both rely on the underlying list inside the dataset for their size information, so if that's an odd size I'd expect both of them to break.
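
As a rough illustration of the validation mentioned above, here is a minimal sketch. It assumes the training size is the truncated product of the dataset size and the proportion, which may not match how Tribuo computes it, and all names are hypothetical. Whether 1.0 should be rejected or allowed is a design choice; this sketch rejects it because it leaves the test set empty.

```java
public class SplitSizes {

    /** Hypothetical check: only proportions strictly between 0 and 1 leave both sides non-empty. */
    static void validateTrainProportion(double trainProportion) {
        // Written as a negated conjunction so that NaN is rejected as well.
        if (!(trainProportion > 0.0 && trainProportion < 1.0)) {
            throw new IllegalArgumentException(
                    "trainProportion must be in (0, 1), found " + trainProportion);
        }
    }

    /** Assumed size rule: the training set gets the truncated product total * proportion. */
    static int trainSize(int total, double trainProportion) {
        return (int) (total * trainProportion);
    }

    public static void main(String[] args) {
        int total = 1107;
        for (double p : new double[] {0.5, 0.99, 1.0}) {
            int train = trainSize(total, p);
            System.out.println(p + " -> train=" + train + ", test=" + (total - train));
        }
        try {
            validateTrainProportion(1.0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this size rule a proportion of 1.0 gives 1107 training examples and an empty test set for the reported dataset.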

@Artraxon
Author

Artraxon commented Aug 1, 2024

I can provide the code, but I don't think it will help much because it is very generic, as it is part of a bigger system.
The training set contains 1107 tuples. As far as I can tell, the dataset is the same whether I use the TrainTestSplitter or take it directly from the SQLDataSource, although of course I don't know which properties to look for.

        var sourceQuery = config.getDatasourceQuery();

        SQLDataSource<O> sqlSource = null;
        try {
            sqlSource = new SQLDataSource<>(
                    sourceQuery,
                    new SQLDBConfig("jdbc:duckdb:" + dbPath, Map.of()),
                    outputFactorySupplier.get(),
                    rowProcessor,
                    true);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        var totalSize = sizeFunction.apply(config);
        double splitProportion;
        if (config.getSampleAmount() < totalSize) {
            splitProportion = ((double) config.getSampleAmount()) / totalSize;
        } else splitProportion = 1;

        MutableDataset<O> trainingData;
        MutableDataset<O> testData = null;
        // Avoid passing it through the train test splitter with a split proportion of 1 because of a bug in Tribuo
        // causing an exception in the CART Regression Trainer: https://github.com/oracle/tribuo/issues/374
        // Might also occur when using a proportion that's slightly less than 1.0 (like 0.99)
        if (splitProportion < 1) {
            var splitter = trainTestSplitter.apply(sqlSource, splitProportion);
            trainingData = new MutableDataset<>(splitter.getTrain());
            testData = new MutableDataset<>(splitter.getTest());
        } else {
            trainingData = new MutableDataset<>(sqlSource);
        }

@Craigacp
Member

Craigacp commented Aug 2, 2024

Can you check whether the featureIDMap from the MutableDataset is the same with and without the splitter? And does the problem still occur if you use CARTJointRegressionTrainer instead of CARTRegressionTrainer?

@Artraxon
Author

Sorry for my late response. I'm quite busy with my master's thesis at the moment, but I'll have a bit more time in October.
Right now I don't have the time to reproduce the exact error again, as I'm also changing the training data and setup a lot, but I'd be able to reproduce it later and help find the error.

In the meantime, I got the same exception when training a random forest without splitting, but instead using MeanSquaredError on a dataset that contains doubles about as close to zero as doubles allow, and only when the feature set contains both categorical and real features.

I think it is difficult to debug this from a description of the datasets' characteristics alone, so I'd need to share them with you. At the moment I can't do that, as we might also use the datasets in a publication, but afterwards I can provide the exact datasets and code to replicate the errors.

@Craigacp
Member

Ok, so that sounds a lot more like a bug in the tree implementation itself rather than an issue with the train test splitter. Which is good, because the splitter is very simple and I really couldn't see what could go wrong there, but the tree implementation code is tricky and may well still have bugs. CARTJointRegressionTrainer and CARTRegressionTrainer have different underlying tree implementations, so if you could compare those two on identical datasets (as they should perform identically with only a single output dimension) that will help me narrow down where the issue is.
