QUESTION: Simulating sampling of points in streaming detection #91

Open
stianvale opened this issue Jun 16, 2021 · 2 comments

Comments

@stianvale

Hi!
I've tested both your 'streaming detection' and 'batch detection' implementations. So far, I'm getting the best results with 'batch detection'. However, I want to use the streaming approach so that the model is updated dynamically from a continuous stream of data.

My current understanding is that 'batch detection' performs better because of the random sampling of points. With 'streaming detection', all trees contain the same points. Therefore, I tested an approach where some points are randomly deleted from trees after calculating the codisp. That way, the trees contain different points, which in a way simulates random sampling of points. My results so far suggest that this works well.

Does this sound like a valid alternative to the standard 'streaming detection', or are there some traps I'm missing here?
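
For readers following along, the random-deletion idea described above might be sketched roughly as follows. This is one interpretation under stated assumptions, not the code used in this issue: the names `stream_point`, `tree_size`, and `drop_prob` are hypothetical, the choice of which point to delete (one uniformly random leaf per tree) is a guess, and the trees are only assumed to expose `insert_point`, `forget_point`, `codisp`, and `leaves` the way `rrcf.RCTree` does.

```python
import random


def stream_point(forest, point, index, tree_size=256, drop_prob=0.1, rng=random):
    """Insert one point into every tree, score it, then randomly thin each tree.

    `forest` is a list of tree-like objects with insert_point / forget_point /
    codisp methods and a `leaves` dict keyed by integer index.
    `drop_prob` is a hypothetical tuning knob, not an rrcf parameter.
    """
    total = 0.0
    for tree in forest:
        # Keep the tree bounded: forget the oldest point still present.
        if len(tree.leaves) >= tree_size:
            tree.forget_point(min(tree.leaves))
        # Insert the new point and accumulate its collusive displacement.
        tree.insert_point(point, index=index)
        total += tree.codisp(index)
        # After scoring, delete one random point from this tree with
        # probability drop_prob, so the trees drift apart in content.
        if rng.random() < drop_prob:
            tree.forget_point(rng.choice(list(tree.leaves)))
    # Average codisp across the forest is the anomaly score.
    return total / len(forest)
```

With rrcf installed, `forest` could be built as `[rrcf.RCTree() for _ in range(num_trees)]` and `stream_point` called once per incoming point. Using `min(tree.leaves)` for the oldest point avoids trying to forget an index that an earlier random deletion already removed.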

@stianvale stianvale changed the title QUESTION: Simulating sampling of points in streaming algorithm QUESTION: Simulating sampling of points in streaming detection Jun 16, 2021
@mdbartos
Member

Greetings,

The sampling method included in the README was chosen for demonstration purposes, since the implementation is short and easy to read. It's definitely not the only way to do sampling, and different sampling methods are encouraged.

The original RRCF paper proposes 'reservoir sampling', which would correspond to uniform sampling in time for the batch mode case. (See: https://en.wikipedia.org/wiki/Reservoir_sampling)
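
As a rough illustration of the reservoir sampling mentioned above, here is a minimal sketch of the classic Algorithm R written for a stream. It is not taken from this repository; the `Reservoir` class and its `offer` method are hypothetical names introduced only for this example.

```python
import random


class Reservoir:
    """Fixed-size uniform random sample over a stream (Algorithm R)."""

    def __init__(self, k, rng=random):
        self.k = k          # maximum number of items kept
        self.rng = rng      # random module or a random.Random instance
        self.items = []     # current sample
        self.n_seen = 0     # number of items offered so far

    def offer(self, item):
        """Offer one stream item; return (accepted, evicted_item_or_None)."""
        self.n_seen += 1
        # The first k items always enter the reservoir.
        if len(self.items) < self.k:
            self.items.append(item)
            return True, None
        # Afterwards, item number n is accepted with probability k / n
        # and replaces a uniformly chosen slot.
        j = self.rng.randrange(self.n_seen)
        if j < self.k:
            evicted = self.items[j]
            self.items[j] = item
            return True, evicted
        return False, None
```

One way this could be combined with the trees, assuming one reservoir of point indices per tree: when `offer(index)` accepts a point, insert it with `tree.insert_point(point, index=index)`, and if it reports an evicted index, remove that point with `tree.forget_point(evicted)`.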

Ultimately, the choice of sampling method will depend on the user's needs, namely how far back in time you want the algorithm to 'remember'.

MDB

@stianvale
Author

Thanks for your response, @mdbartos !

Cool, yeah, I see that's the default sampling technique of SageMaker's RRCF as well. I'll test out reservoir sampling then. Have you implemented it for this repo before? If so, maybe you could share the code?
