[EGGO-30] Generate partitioned data. #33
I think this is ready to review. @laserson can you have a look please? I have run this successfully locally, and on EC2 I generated flat data after using the workaround described in #43. Partitioning isn't working on EC2 since it has an old version of MR on it, so we need to work out what to do there.
Do we need Hadoop 2.x for the partitioning to work? If so, we can just run the Spark EC2 scripts with
Thanks for the pointer Frank. I tried it, but it doesn't start the Hadoop cluster daemons correctly, so I need to debug a bit more.
Ah, odd. I haven't tried it myself as I don't often have an explicit need for Hadoop 2.
@@ -0,0 +1,10 @@
{
Can this file be combined with the other test-genotypes.json file, or do you want to keep the partitioning separate?
Yes, I want to have an example that exercises the partitioning.
lgtm, generally. if we end up pulling in other tools in the hadoop stack, it perhaps provides further rationale for switching to using cloudera director. in my experience, the spark-ec2 scripts are a bit uneven.
I didn't have much luck with the MR2 installation on EC2, as it's using an old version. I'm looking into #44 to improve the cluster experience, so I won't commit this until I have a better idea of how feasible that is. It will also be useful if we want to use Impala for querying genome data.
https://github.com/tomwhite/adam-partitioning, while Spark version is being debugged.
This is #30 rebased on #29