Streaming code #62
I have the same question as @ashujainml. Shouldn't the points inserted into each tree in the forest be different, similar to what is accomplished in the batch examples with uniform random sampling?
@mdbartos if you have time could you please comment on this issue? I am using the same code as the sine wave streaming example, and when I use a seed value for the random number generator, all of the trees are similar. I get the same results with 1 tree as I do with 40 trees, and I see the same behavior with my own sample data.
Greetings,

Thanks for the question. Can you tell me what you mean by similar? Do you mean the trees are all exactly the same? This should not happen unless there's a bug. Even if the trees incorporate the exact same points, the structure of the trees should be different because the partitioning algorithm is randomized. See, for example:

    import numpy as np
    import rrcf

    np.random.seed(0)
    X = np.random.randn(8, 3)
    tree_0 = rrcf.RCTree()
    tree_1 = rrcf.RCTree()
    for index, point in enumerate(X):
        tree_0.insert_point(point, index=index)
        tree_1.insert_point(point, index=index)

    print(tree_0)
    ─+
     ├───+
     │   ├───+
     │   │   ├──(6)
     │   │   └───+
     │   │       ├───+
     │   │       │   ├───+
     │   │       │   │   ├───+
     │   │       │   │   │   ├──(4)
     │   │       │   │   │   └──(2)
     │   │       │   │   └──(0)
     │   │       │   └──(7)
     │   │       └──(5)
     │   └──(1)
     └──(3)

    print(tree_1)
    ─+
     ├───+
     │   ├──(6)
     │   └──(2)
     └───+
         ├───+
         │   ├───+
         │   │   ├───+
         │   │   │   ├──(7)
         │   │   │   └───+
         │   │   │       ├──(3)
         │   │   │       └──(4)
         │   │   └──(5)
         │   └──(0)
         └──(1)

Thus, even with the exact same point set you should be able to build an ensemble. Let me know if this helps.

Thanks,
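For anyone who wants to see why identical point sets can still produce different trees, here is a minimal toy sketch of randomized partitioning. It is not rrcf's implementation: `random_cut_tree` and `leaves` are hypothetical helpers written only for illustration, and the sketch assumes the usual random cut rule (pick a dimension with probability proportional to its bounding-box span, then cut uniformly within that span) and that no two points are identical.

```python
import random


def random_cut_tree(points, rng):
    """Build a toy random cut tree over `points`; leaves are point indices."""

    def build(indices):
        if len(indices) == 1:
            return indices[0]
        dims = list(range(len(points[0])))
        # Bounding box of the points that fall in this node.
        lows = [min(points[i][d] for i in indices) for d in dims]
        highs = [max(points[i][d] for i in indices) for d in dims]
        spans = [hi - lo for lo, hi in zip(lows, highs)]
        # Choose the cut dimension with probability proportional to its span.
        d = rng.choices(dims, weights=spans)[0]
        while True:
            # Draw a uniform cut; retry in the rare case one side is empty.
            cut = rng.uniform(lows[d], highs[d])
            left = [i for i in indices if points[i][d] <= cut]
            right = [i for i in indices if points[i][d] > cut]
            if left and right:
                break
        return (build(left), build(right))

    return build(list(range(len(points))))


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf indices."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


# Eight 3-dimensional points, shared by every tree.
data_rng = random.Random(0)
points = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]

tree_a = random_cut_tree(points, random.Random(1))
tree_b = random_cut_tree(points, random.Random(2))
tree_c = random_cut_tree(points, random.Random(1))

print(tree_a)            # nested tuples of point indices
print(tree_b)            # same leaves, cuts drawn from a different stream
print(tree_a == tree_c)  # True: the same seed replays the same cuts
```

Every tree contains the same eight leaves; only the sequence of random cuts, and therefore the nesting, depends on the random stream.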
@mdbartos thank you very much for the reply. After reviewing your example and replicating it with my data, I believe the error was in my understanding of how random_state works when constructing the random-cut trees. If I pass the same integer value as random_state to RCTree(), I get the same trees every time when I use the same points. For example, the code below yields identical trees:

    import numpy as np
    import rrcf

    tree_0 = rrcf.RCTree(random_state=123)
    tree_1 = rrcf.RCTree(random_state=123)
    X = np.random.randn(8, 3)
    for index, point in enumerate(X):
        tree_0.insert_point(point, index=index)
        tree_1.insert_point(point, index=index)
Ah, now I see. Yes, this is the reason that random_state was added: it gives you a way of replicating exactly the same tree (for testing purposes, replicating results, etc.). Otherwise you'd get a different tree each time.
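If the goal is a forest that is reproducible but still varied, one option is to give each tree its own seed, derived from a single master seed. The sketch below uses only the Python standard library; `make_tree_seeds` is a hypothetical helper, not part of rrcf, and the commented-out line relies only on what this thread states, namely that `RCTree` accepts an integer `random_state`.

```python
import random


def make_tree_seeds(master_seed, num_trees):
    """Derive one distinct integer seed per tree from a single master seed."""
    rng = random.Random(master_seed)
    # Sampling without replacement guarantees no two trees share a seed.
    return rng.sample(range(2**31 - 1), num_trees)


seeds = make_tree_seeds(master_seed=123, num_trees=40)

# Assumed usage with rrcf (requires `import rrcf`):
# forest = [rrcf.RCTree(random_state=seed) for seed in seeds]

print(len(set(seeds)))  # 40: every tree gets a different seed
```

Rerunning with the same master seed yields the same list of seeds, so the whole forest can be rebuilt exactly, while no two trees replay the same sequence of cuts.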
Hi,
I tested the streaming code (https://klabum.github.io/rrcf/streaming.html) and found that all trees are similar to each other, compared to the batch processing code (https://klabum.github.io/rrcf/batch.html). I spent some time understanding the issue and noticed that if we always remove the oldest point from all trees and insert the new point into all trees, all trees will be identical in nature.
If tree is above permitted size...
Am I missing something?