[EGGO-30] Generate partitioned data. #33

Open
wants to merge 6 commits into
base: master
from

Conversation

tomwhite
Member

This is #30 rebased on #29

@tomwhite tomwhite force-pushed the EGGO-30-locuspart-on-refactor-luigi-code branch 2 times, most recently from bbc5eed to 5f18147 Compare April 21, 2015 16:58
@tomwhite tomwhite force-pushed the EGGO-30-locuspart-on-refactor-luigi-code branch from 5f18147 to e186f7b Compare April 22, 2015 12:42
@tomwhite
Member Author

I think this is ready to review. @laserson can you have a look please?

I have run this successfully locally, and on EC2 I generated flat data after using the workaround described in #43. Partitioning isn't working on EC2 since it has an old version of MR on it, so we need to work out what to do there.

@fnothaft
Member

> Partitioning isn't working on EC2 since it has an old version of MR on it, so we need to work out what to do there.

Do we need Hadoop 2.x for the partitioning to work? If so, we can just run the Spark EC2 scripts with --hadoop-major-version 2. This launches the cluster with Hadoop 2.0.0, IIRC.

@tomwhite
Member Author

Thanks for the pointer Frank. I tried it, but it doesn't start the Hadoop cluster daemons correctly, so I need to debug a bit more.

@fnothaft
Member

Ah, odd. I haven't tried it myself as I don't often have an explicit need for Hadoop 2.

@@ -0,0 +1,10 @@
{
Contributor

can this file be combined with the other test-genotypes.json file? Or you want to keep the partitioning separate?

Member Author

Yes, I want to have an example that exercises the partitioning.

@laserson
Contributor

lgtm, generally. if we end up pulling in other tools in the hadoop stack, it perhaps provides further rationale for switching to using cloudera director. in my experience, the spark-ec2 scripts are a bit uneven.

@tomwhite
Member Author

I didn't have much luck with the MR2 installation on EC2, as it's using an old version. I'm looking into #44 to improve the cluster experience, so I won't commit this until I have a better idea of how feasible that is. It will also be useful if we want to use Impala for querying genome data.

@tomwhite tomwhite force-pushed the EGGO-30-locuspart-on-refactor-luigi-code branch from e186f7b to 1e23617 Compare April 29, 2015 11:22
3 participants