Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming code #62

Open
ashujainml opened this issue Sep 20, 2019 · 5 comments
Open

Streaming code #62

ashujainml opened this issue Sep 20, 2019 · 5 comments

Comments

@ashujainml
Copy link

Hi,

I tested the streaming code( https://klabum.github.io/rrcf/streaming.html) and found that all trees are similar to each other compared to the batch processing code (https://klabum.github.io/rrcf/batch.html). I spent some time understanding the issue and noticed that if we always remove the oldest point from all trees and insert new point in all trees, all trees will be identical in nature.

If tree is above permitted size...

    if len(tree.leaves) > tree_size:
        # Drop the oldest point (FIFO)
        tree.forget_point(index - tree_size)
    # Insert the new point into the tree
    tree.insert_point(point, index=index)

Am I missing something?

@Chmavo
Copy link

Chmavo commented Jan 14, 2020

I have the same question as @ashujainml. Shouldn't the points inserted in to each tree in the forest be different, similar to what is accomplished in the batch examples with uniform random sampling?

@Chmavo
Copy link

Chmavo commented May 20, 2020

@mdbartos if you have time could you please comment on this issue? I am using the same code as the sine wave streaming example, and when I use a seed value for the random number generator, all of the trees are similar. I get the same results with 1 tree as I do with 40 trees, and I see the same behavior with my own sample data.

@mdbartos
Copy link
Member

Greetings,

Thanks for the question. Can you tell me what you mean by similar? Do you mean the trees are all exactly the same?

This should not happen unless there's a bug. Even if the trees incorporate the exact same points, the structure of the trees should be different because the partitioning algorithm is randomized.

See, for example:

import numpy as np
import rrcf
np.random.seed(0)

X = np.random.randn(8, 3)

tree_0 = rrcf.RCTree()
tree_1 = rrcf.RCTree()

for index, point in enumerate(X):
    tree_0.insert_point(point, index=index)
    tree_1.insert_point(point, index=index)
print(tree_0)

─+
 ├───+
 │   ├───+
 │   │   ├──(6)
 │   │   └───+
 │   │       ├───+
 │   │       │   ├───+
 │   │       │   │   ├───+
 │   │       │   │   │   ├──(4)
 │   │       │   │   │   └──(2)
 │   │       │   │   └──(0)
 │   │       │   └──(7)
 │   │       └──(5)
 │   └──(1)
 └──(3)
print(tree_1)

─+
 ├───+
 │   ├──(6)
 │   └──(2)
 └───+
     ├───+
     │   ├───+
     │   │   ├───+
     │   │   │   ├──(7)
     │   │   │   └───+
     │   │   │       ├──(3)
     │   │   │       └──(4)
     │   │   └──(5)
     │   └──(0)
     └──(1)

Thus, even with the exact same point set you should be able to build an ensemble. Let me know if this helps.

Thanks,
MDB

@Chmavo
Copy link

Chmavo commented May 21, 2020

@mdbartos thank you very much for the reply. After reviewing your example and replicating it with my data, I believe the error was in my understanding of how random_state worked when constructing the random-cut trees. If i pass an integer value as random_state to RCTree(), I get the same trees every time when I use the same points. For example, the code below yields identical trees:

tree_0 = rrcf.RCTree(random_state=123)
tree_1 = rrcf.RCTree(random_state=123)
X = np.random.randn(8, 3)
for index, point in enumerate(X):
    tree_0.insert_point(point, index=index)
    tree_1.insert_point(point, index=index)

@mdbartos
Copy link
Member

Ah now I see. Yes, this is the reason that random_state was added--it gives you a way of replicating exactly the same tree (for testing purposes, replicating results, etc.). Otherwise you'd get a different tree each time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants