
inference on new data samples #70

Open
sophark opened this issue Dec 16, 2019 · 3 comments

Comments


sophark commented Dec 16, 2019

Hi, thanks for implementing this. I have a use case that requires training the rrcf forest on a given dataset and then predicting on unseen data samples that were not available during the training period.

Can we use batch mode to achieve that? One simple solution I can come up with is to first insert the point into the forest, compute the codisp, and then delete it. I am wondering whether there is a smarter way to save inference time?

Thanks.


mdbartos commented Dec 17, 2019

That's probably the most flexible way to do it (create forest from point set S using batch mode -> insert new point x into each tree -> compute codisp -> delete point x). But yes, it will probably be slow. Parallelizing can help though.

I'm not sure if this is helpful, but note that the insert_point algorithm is guaranteed to produce a tree drawn from RRCF(S ∪ {x}), where S is a point set and x is an additional point.

In other words, the following two trees are statistically indistinguishable (drawn from the same distribution), and the codisp of x will be the same in expectation:

  • Create tree T' from point set (S ∪ {x}) via batch mode.
  • Create tree T'' from point set S via batch mode, then insert x.


sophark commented Dec 17, 2019

> That's probably the most flexible way to do it (create forest from point set S using batch mode -> insert new point x into each tree -> compute codisp -> delete point x). But yes, it will probably be slow. Parallelizing can help though.

Thanks for your hints. Yes, it is indeed a little slow without parallelizing. Do you know which of the steps above consumes most of the time, and what its time complexity is? I guess maybe the insert_point step?

@mdbartos
Member

Yeah, I would say insert_point is the slowest step. I have time breakdowns here:
#28
