
Slow performance while running predict for quantile regression #1308

Open
gautham2492 opened this issue May 25, 2023 · 5 comments
I am using the quantile forest in the grf package on my data, which is around 12 million records. I have a couple of questions:

  • The model trains fine, i.e., in around 1 hour when I use 300 trees (I am running it on a Databricks cluster with 192 GB of RAM and 48 cores). However, it takes an extremely long time (around 1 full day) to predict on 200k records. Why is this, and is there a way I can speed it up? My model has only 4 features.

  • Is there any way to do hyperparameter tuning on the quantile forest? I notice tuning is available for regression forests but not for quantile forests.

Any leads would be helpful.


Ri0016 commented May 30, 2023

Hi gautham,

I'm not a member of the grf lab, but I work on random forest algorithms. Here is my guess at an explanation of your situation.

The cost depends on the tree size (the number of nodes in each tree, or equivalently the depth of each tree).

Suppose the average depth of each tree is d.
The complexity of constructing a forest of B trees is O(B * mtry * 2^d).
The complexity of predicting one point is O(B * d).

For quantile estimation there is an additional sorting cost, so the complexity of predicting one point becomes O(B * d * n * log(n)), where n is the training sample size. In your case n is very large, and I think this is the key point.

I think d is not very large in your situation (less than 20), so B * d * 200k exceeds B * mtry * 2^d, and it is plausible that prediction is slower than construction.
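Those estimates can be plugged into a quick back-of-envelope calculation. This is an illustrative sketch only; the constants B, d, and mtry below are assumptions, not values measured from the model:

```r
# Rough operation counts under the complexity estimates above.
B     <- 300    # number of trees
d     <- 20     # assumed average tree depth
mtry  <- 2      # roughly sqrt(4 features)
n_new <- 200e3  # number of points to predict

train_ops   <- B * mtry * 2^d  # O(B * mtry * 2^d) to build the forest
predict_ops <- B * d * n_new   # O(B * d) per point, times n_new points

predict_ops / train_ops  # ratio > 1 means prediction dominates
```

With these assumed constants the two costs are already of the same order, before even counting the extra per-point sorting cost for the quantiles.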

Also, there is another possible explanation: you may have forgotten to pass a value to num.threads.
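A minimal sketch of what that could look like with the grf API on a 48-core machine (the variable names X.train, Y.train, and X.test are placeholders, not from the original post):

```r
library(grf)

# Train and predict with an explicit thread count matching the 48 cores.
q.forest <- quantile_forest(X.train, Y.train,
                            num.trees = 300, num.threads = 48)
q.hat <- predict(q.forest, X.test,
                 quantiles = c(0.1, 0.5, 0.9), num.threads = 48)
```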

Hope it's helpful


mayer79 commented Aug 31, 2024

Same here. Prediction on data of a similar size to the training data takes about as much time as fitting the model. Don't get me wrong: fitting is extremely fast.

erikcs (Member) commented Sep 6, 2024

Hi all, you are completely right that quantile forests can take some time to predict compared to training. Unfortunately, this is due to a design choice that reduces memory use at the cost of more CPU use. Quantile forests use grf's DefaultPredictionStrategy (some details here), since predicting many quantiles at the same time could blow up memory use with the OptimizedPredictionStrategy, which needs to store sufficient statistics in each tree node.

This is also the reason why survival forest predictions can be a bit slow. In the future we could consider implementing an OptimizedPredictionStrategy for these forests and toggling between it and the default based on some memory-use heuristic (such as the number of quantiles for quantile forests, or the length of the survival curve for survival forests). #1419
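As a practical stopgap while prediction remains slow, a long predict call can be split into chunks so that progress is visible and peak memory stays bounded; this does not reduce the total work, only how it is scheduled. A hedged sketch (predict_in_chunks and chunk_size are made-up names for illustration, not part of grf):

```r
library(grf)

# Predict block-by-block instead of in one call; total cost is unchanged,
# but each call is small enough to monitor and keeps memory use bounded.
predict_in_chunks <- function(forest, newdata, quantiles, chunk_size = 10000) {
  groups <- ceiling(seq_len(nrow(newdata)) / chunk_size)
  chunks <- lapply(split(seq_len(nrow(newdata)), groups), function(i) {
    predict(forest, newdata[i, , drop = FALSE], quantiles = quantiles)$predictions
  })
  do.call(rbind, chunks)
}
```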


mayer79 commented Sep 6, 2024

Thanks a lot for the detailed answer.

I actually noticed it while running Friedman's H statistic and Kernel SHAP to explain a quantile random forest with 5 quantiles. Both methods need to make predictions on very large datasets (and do so multiple times), but this is not the typical use of predict(), of course.

This is how such explanations look for a causal forest:

https://lorentzen.ch/index.php/2024/09/02/explaining-a-causal-forest/


erikcs commented Sep 6, 2024

Thank you @mayer79, that is a very nice blog!
