what exactly is the input data format expected by Metronome? #2

Open
pchalasani opened this issue Jul 3, 2014 · 4 comments

Comments

@pchalasani

subject says it all

@jpatanooga
Owner

It's very similar to the SVMLight format: a line-oriented, CSV-style
format, which we changed slightly to accommodate multiple outputs.
The best reference is the unit test:

https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/io/records/TestMetronomeVectorizatonFormat.java

but in general it comes down to a mapping of an input vector to an output
vector:

[i0 i1 i2 | o0 o1 o2]

where spaces separate the vector entries, and each entry is indexed to save
space. We provide the vectorization class (MetronomeRecordFactory) with a
schema, as shown in the unit test.

So yes, it's a bit custom, but after looking around and thinking about it we
just wanted something simple to map input to output, and this made sense.
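
A minimal sketch of reading and writing such a record, not Metronome's actual code. It assumes each entry is written as `index:value`, as in SVMLight, with `|` separating the input side from the output side. The names `parse_record` and `format_record` are hypothetical; the unit test linked above remains the authoritative reference for the exact syntax.

```python
from typing import Dict, Tuple

# Sparse vector: index -> value. Only non-zero entries need to be stored.
SparseVector = Dict[int, float]


def parse_side(text: str) -> SparseVector:
    """Parse space-separated `index:value` tokens (assumed syntax)."""
    vector: SparseVector = {}
    for token in text.split():
        index, _, value = token.partition(":")
        # int()/float() raise ValueError on malformed tokens.
        vector[int(index)] = float(value)
    return vector


def parse_record(line: str) -> Tuple[SparseVector, SparseVector]:
    """Split one record at `|` into (input vector, output vector)."""
    inputs, sep, outputs = line.partition("|")
    if not sep:
        raise ValueError(f"missing '|' separator: {line!r}")
    return parse_side(inputs), parse_side(outputs)


def format_side(vector: SparseVector) -> str:
    """Write entries in index order as `index:value` tokens."""
    return " ".join(f"{i}:{v}" for i, v in sorted(vector.items()))


def format_record(inputs: SparseVector, outputs: SparseVector) -> str:
    """Inverse of parse_record under the same assumed syntax."""
    return f"{format_side(inputs)} | {format_side(outputs)}"


if __name__ == "__main__":
    features, labels = parse_record("0:1.0 2:0.5 | 1:1.0")
    print(features, labels)
    print(format_record(features, labels))
```

Storing entries as index/value pairs means a mostly-zero vector only lists its non-zero positions, which is one reading of "indexed to save space".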

Adam and I are working on some more robust and complete vectorization tools
( https://github.com/jpatanooga/Canova - still a work in progress) that
will interoperate with a number of formats and run serially or in MapReduce,
which should make all of this simpler. Today Metronome should be considered
alpha/beta software at best, and that's why you don't see a more robust set
of input formats for every tool. If you compare it to, say, MLlib in Spark,
you'll see that we're at about a similar state (some of their stuff is
hardcoded to arbitrary CSV formats).

TLDR: yes, vectorization and input formats are important, and we're thinking
hard about it all holistically (Canova).

Thanks!

JP


@agibsonccc
Collaborator

I would like to add here that this is a big problem. Rather than take an ad hoc approach, Canova will also support different modes of feature extraction for various kinds of data.

Lots of people don't think about word vectors, moving windows on images, and other kinds of harder formats.

Featurization is a huge problem we'll be tackling here in the coming weeks. As ambitious as it sounds, much of this is being incubated in the deeplearning4j project now, and Canova will provide a more "neutral" version of it with support for SVMLight and other formats.

@pchalasani
Author

Thanks for the clarifications. I was just trying to figure out how I can (say) use Metronome to deploy deep-learning on Hadoop for one of our data-sets. Eventually, I'll probably put a friendly Clojure wrapper around it.

@jpatanooga
Owner

Glad we could help. Let me know if you need help getting it going; I can
help you triage errors, etc.

JP

