-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
insert_point is slow #28
Comments
Storing the bounding box for each branch could cut ~70% of time expended. |
After storing the bboxes, time breakdown for
And within
|
Hello, This algorithm performs really great but insert_point is really slow. |
If using batch mode is an option, it can improve speeds by quite a bit. For the taxicab dataset, changing from streaming to batch mode brought the total computation time from ~20 minutes down to 20-30 seconds. See this example: https://klabum.github.io/rrcf/taxi.html As far as speeding up Because trees are independent, benefits could be gained through parallelization. My knowledge of parallelization in python is pretty limited though. |
I've seen this example. Does it matter if all the points are not in each of the trees of your forest? Because I'm currently inserting around 50 000 points by tree (it takes around 200 seconds). Maybe it's a bit too much? :) I was thinking of a batch_insert_point that take an |
Parallelization in python is incredibly easy. I already did it myself so I can write it down.
|
Hi @atthom, Thanks for taking a look at this. The points don't need to be in every single tree as long as you make sure you're averaging codisp properly. Ultimately, you're generating an estimator for codisp. As a side note, it would be interesting to look at the properties of the codisp estimator as a function of the number of trees, tree size, etc. Have you tested how much of a speedup you're getting from parallelizing? I briefly experimented with using the multiprocessing module but wasn't able to get a significant speedup. For batch insertion, you would need to show that the implementation works out mathematically: in other words, that the new tree is drawn from the distribution T' (S' U {p}): https://klabum.github.io/rrcf/insert-and-delete.html It would probably require a completely different |
I tested the parallelization. It's the perfect use case of parallelization because the trees are independents. Each processor can compute its tree separately. Now, I understand why it's not easy to create this insert multiple points! |
Sounds good. If you want to contribute any parallelization code, feel free to submit a pull request. |
I wonder if Dask would make sense here? |
Definitely worth looking into 👍 |
Running a line profiler on insert_point shows the top time consumers:
Within _insert_point_cut:
The text was updated successfully, but these errors were encountered: