This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Adding text vectorization #177

Open
ghost opened this issue Apr 24, 2017 · 5 comments

Comments

@ghost

ghost commented Apr 24, 2017

Hey guys and ladies!

I was wondering (and I'm offering to work a little bit on this myself) whether you'd consider it appropriate to add some text vectorization to rusty-machine, based on sklearn's current features:

  • Simple frequency count
  • TF-IDF
  • Hashing techniques (frequency count + hashing trick)

It'd be pretty cool to add some examples of sentiment analysis or something like that using rusty-machine only :P

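For reference, here is a minimal sketch of what the first two items could look like in plain Rust, using only the standard library. None of this is rusty-machine API: `tokenize`, `build_vocabulary`, `count_vectorize` and `tfidf` are hypothetical names, the tokenizer is a naive whitespace split, and the weighting is the plain textbook idf = ln(N / df), which is not necessarily the variant sklearn applies.

```rust
use std::collections::BTreeMap;

/// Naive tokenizer (an assumption): split on whitespace and lowercase.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Map each distinct token to a column index.
/// A BTreeMap keeps the column order deterministic (sorted by token).
fn build_vocabulary(docs: &[&str]) -> BTreeMap<String, usize> {
    let mut vocab = BTreeMap::new();
    for doc in docs {
        for token in tokenize(doc) {
            vocab.entry(token).or_insert(0);
        }
    }
    // Assign column indices in sorted token order.
    for (i, index) in vocab.values_mut().enumerate() {
        *index = i;
    }
    vocab
}

/// Frequency count: one row per document, one column per vocabulary term.
/// Tokens missing from the vocabulary are ignored.
fn count_vectorize(docs: &[&str], vocab: &BTreeMap<String, usize>) -> Vec<Vec<f64>> {
    docs.iter()
        .map(|doc| {
            let mut row = vec![0.0; vocab.len()];
            for token in tokenize(doc) {
                if let Some(&col) = vocab.get(&token) {
                    row[col] += 1.0;
                }
            }
            row
        })
        .collect()
}

/// TF-IDF over a dense count matrix, with idf = ln(n_docs / df).
fn tfidf(counts: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_docs = counts.len() as f64;
    let n_terms = counts.first().map_or(0, |row| row.len());
    // Document frequency: the number of documents containing each term.
    let mut df = vec![0.0; n_terms];
    for row in counts {
        for (col, &count) in row.iter().enumerate() {
            if count > 0.0 {
                df[col] += 1.0;
            }
        }
    }
    counts
        .iter()
        .map(|row| {
            row.iter()
                .zip(&df)
                .map(|(&tf, &d)| if d > 0.0 { tf * (n_docs / d).ln() } else { 0.0 })
                .collect()
        })
        .collect()
}
```

With this weighting, a term that appears in every document gets a weight of zero, since ln(N / N) = 0. The output here is dense, so it would not scale to a large vocabulary.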
@tafia
Contributor

tafia commented Apr 24, 2017

I'd personally love that!

@ghost
Author

ghost commented Apr 24, 2017

I was thinking of using the Transformer trait. However, it's not appropriate because it requires the input and output to be of the same type.

@AtheMathmo
Owner

I agree that this is a really great idea!

It seems an unfortunate restriction that you cannot use the Transformer trait. I think that it might be worth changing the trait to allow different input and output types. Do either of you see any reason why this might cause issues? It would be a fairly minor breaking change (for users who have implemented the trait themselves).
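To make the idea concrete, here is a rough sketch of a transformer trait whose input and output types can differ. This is a hypothetical definition for illustration, not rusty-machine's actual trait: the method names, the `String` error type, and the two example implementors (`MeanCenter`, `TokenCounter`) are all assumptions. Defaulting the output parameter to the input type is one possible way to keep same-type implementations reading as they do today.

```rust
/// Hypothetical generalisation: input type `T` and output type `U` may differ.
/// `U` defaults to `T`, so `Transformer<X>` still means "X in, X out".
trait Transformer<T, U = T> {
    /// Learn whatever state the transformation needs from the inputs.
    fn fit(&mut self, inputs: &T) -> Result<(), String>;
    /// Map inputs of type `T` to outputs of type `U`.
    fn transform(&mut self, inputs: T) -> Result<U, String>;
}

/// Same-type case: centre values on their mean (Vec<f64> in, Vec<f64> out).
struct MeanCenter {
    mean: f64,
}

impl Transformer<Vec<f64>> for MeanCenter {
    fn fit(&mut self, inputs: &Vec<f64>) -> Result<(), String> {
        if inputs.is_empty() {
            return Err("cannot fit on empty input".to_string());
        }
        self.mean = inputs.iter().sum::<f64>() / inputs.len() as f64;
        Ok(())
    }

    fn transform(&mut self, inputs: Vec<f64>) -> Result<Vec<f64>, String> {
        Ok(inputs.into_iter().map(|x| x - self.mean).collect())
    }
}

/// Mixed-type case: documents in, number of whitespace tokens per document out.
struct TokenCounter;

impl Transformer<Vec<String>, Vec<usize>> for TokenCounter {
    fn fit(&mut self, _inputs: &Vec<String>) -> Result<(), String> {
        Ok(()) // nothing to learn for a plain token count
    }

    fn transform(&mut self, inputs: Vec<String>) -> Result<Vec<usize>, String> {
        Ok(inputs.iter().map(|doc| doc.split_whitespace().count()).collect())
    }
}
```

Under this sketch, code that names the trait with a single type parameter keeps compiling, while a text vectorizer can declare a different output type.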

@ghost
Author

ghost commented Apr 24, 2017

I've just implemented a Vectorizer trait that is pretty similar to Transformer. It could be used as a base for non-text stuff, like images or nested data for example. Here is a little proof of concept:

https://github.com/z1mvader/rusty-machine/blob/master/src/data/vectorizers/text.rs

But if @AtheMathmo wants, we could just modify the Transformer trait instead.

@ghost
Author

ghost commented Apr 24, 2017

Besides the Transformer trait, I believe that there are two main needs for the text vectorization workflow. First, to be able to set your own tokenizer. And second, to allow sparse matrices/vectors. I don't know if rusty-machine supports sparse matrices right now.
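As a rough illustration of both needs, here is a sketch of a hashing vectorizer that takes a user-supplied tokenizer and returns a sparse row. Everything here is hypothetical and standard-library only: `HashingVectorizer`, `SparseRow` and `fnv1a` are made-up names, the sparse row is just a sorted list of (column, value) pairs rather than any rusty-machine type, and FNV-1a is used only because it is small and deterministic.

```rust
use std::collections::BTreeMap;

/// A sparse row: (column index, value) pairs sorted by column, zeros omitted.
type SparseRow = Vec<(usize, f64)>;

/// FNV-1a (64-bit), a small deterministic hash used to map tokens to columns.
fn fnv1a(token: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in token.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical hashing vectorizer, generic over the tokenizer `F`.
struct HashingVectorizer<F> {
    n_features: usize,
    tokenizer: F,
}

impl<F: Fn(&str) -> Vec<String>> HashingVectorizer<F> {
    fn new(n_features: usize, tokenizer: F) -> Self {
        assert!(n_features > 0, "n_features must be positive");
        HashingVectorizer { n_features, tokenizer }
    }

    /// Count tokens per hashed column; tokens that collide share a column.
    fn transform(&self, doc: &str) -> SparseRow {
        let mut counts: BTreeMap<usize, f64> = BTreeMap::new();
        for token in (self.tokenizer)(doc) {
            let col = (fnv1a(&token) % self.n_features as u64) as usize;
            *counts.entry(col).or_insert(0.0) += 1.0;
        }
        // BTreeMap iterates in key order, so the row comes out sorted by column.
        counts.into_iter().collect()
    }
}
```

A caller would plug in any closure as the tokenizer, for example `HashingVectorizer::new(1024, |s: &str| s.split_whitespace().map(str::to_string).collect::<Vec<String>>())`. Because the hashing trick needs no vocabulary, there is no fit step, and memory use is bounded by the number of non-zero columns per document.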

@ghost ghost mentioned this issue Apr 25, 2017