[EGGO-30] Generate partitioned data. #33

tomwhite · 2015-04-17T11:29:27Z

This is #30 rebased on #29

tomwhite · 2015-04-22T15:12:11Z

I think this is ready to review. @laserson can you have a look please?

I have run this successfully locally, and on EC2 I generated flat data after using the workaround described in #43. Partitioning isn't working on EC2 since it has an old version of MR on it, so we need to work out what to do there.

fnothaft · 2015-04-22T15:20:21Z

> Partitioning isn't working on EC2 since it has an old version of MR on it, so we need to work out what to do there.

Do we need Hadoop 2.x for the partitioning to work? If so, we can just run the Spark EC2 scripts with --hadoop-major-version 2. This launches the cluster with Hadoop 2.0.0, IIRC.
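For reference, the launch suggested above could be assembled roughly as follows. This is only a sketch: the key pair name, identity file path, cluster name, and slave count are placeholders (assumptions, not values from this PR), and the helper function is hypothetical. Only the `--hadoop-major-version` option comes from the comment itself.

```python
def spark_ec2_launch_args(key_pair, identity_file, cluster_name,
                          slaves=2, hadoop_major_version="2"):
    """Build the argument list for a spark-ec2 launch.

    All argument values are supplied by the caller; the defaults here are
    illustrative placeholders, not settings taken from the PR.
    """
    return [
        "./spark-ec2",
        "--key-pair", key_pair,
        "--identity-file", identity_file,
        "--slaves", str(slaves),
        # Ask spark-ec2 for a Hadoop 2 based cluster instead of the default.
        "--hadoop-major-version", str(hadoop_major_version),
        "launch", cluster_name,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, since launching costs money.
    args = spark_ec2_launch_args("my-keypair", "my-keypair.pem", "eggo-test")
    print(" ".join(args))
```

Building the list separately from running it makes it easy to inspect the command before passing it to something like `subprocess.check_call`.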

tomwhite · 2015-04-22T16:36:41Z

Thanks for the pointer, Frank. I tried it, but it doesn't start the Hadoop cluster daemons correctly, so I need to debug a bit more.

fnothaft · 2015-04-22T16:40:49Z

Ah, odd. I haven't tried it myself as I don't often have an explicit need for Hadoop 2.

laserson · 2015-04-22T18:29:41Z

test/registry/test-1kg-genotypes-subset.json

@@ -0,0 +1,10 @@
+{

Can this file be combined with the other test-genotypes.json file? Or do you want to keep the partitioning separate?

tomwhite · 2015-04-23

Yes, I want to have an example that exercises the partitioning.
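To illustrate what partitioning by locus can look like, here is a minimal sketch. It assumes records are grouped by contig name and a fixed-width position bin; the bin width, the field names `contig` and `position`, and the `chr=.../pos=...` directory layout are all hypothetical and are not taken from this PR or from adam-partitioning.

```python
from collections import defaultdict

BIN_WIDTH = 10000000  # hypothetical bin width in base pairs


def partition_key(contig, position, bin_width=BIN_WIDTH):
    """Return the (contig, bin index) pair for a genomic locus."""
    return contig, position // bin_width


def partition_path(contig, position, bin_width=BIN_WIDTH):
    """Return a hypothetical directory name for the partition holding a locus."""
    name, pos_bin = partition_key(contig, position, bin_width)
    return "chr={}/pos={}".format(name, pos_bin)


def partition_records(records, bin_width=BIN_WIDTH):
    """Group records (dicts with 'contig' and 'position') by partition path."""
    groups = defaultdict(list)
    for rec in records:
        path = partition_path(rec["contig"], rec["position"], bin_width)
        groups[path].append(rec)
    return dict(groups)
```

With a layout like this, a query restricted to one region only needs to read the directories whose bins overlap that region.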

laserson · 2015-04-22T18:34:52Z

lgtm, generally. if we end up pulling in other tools in the hadoop stack, it perhaps provides further rationale for switching to using cloudera director. in my experience, the spark-ec2 scripts are a bit uneven.

tomwhite · 2015-04-23T16:36:55Z

I didn't have much luck with the MR2 installation on EC2, as it's using an old version. I'm looking into #44 to improve the cluster experience, so I won't commit this until I have a better idea of how feasible that is. It will also be useful if we want to use Impala for querying genome data.

tomwhite force-pushed the EGGO-30-locuspart-on-refactor-luigi-code branch 2 times, most recently from bbc5eed to 5f18147 on April 21, 2015 16:58

tomwhite mentioned this pull request Apr 21, 2015

Generate partitioned data #30

Closed

tomwhite force-pushed the EGGO-30-locuspart-on-refactor-luigi-code branch from 5f18147 to e186f7b on April 22, 2015 12:42

laserson reviewed Apr 22, 2015
tomwhite added 6 commits April 29, 2015 12:18

[EGGO-30] Generate partitioned data.

bdb7ad8

[EGGO-30] Generate flattened, partitioned data.

ce6e215

[EGGO-30] Use Crunch-based MR partitioner, from https://github.com/tomwhite/adam-partitioning, while Spark version is being debugged.

0366ec2

[EGGO-30] Add support for partitioning BAM/SAM.

d5f2d98

Add hint for parallelism in registry files.

1cad68c

Fix typo

1e23617

tomwhite force-pushed the EGGO-30-locuspart-on-refactor-luigi-code branch from e186f7b to 1e23617 on April 29, 2015 11:22

laserson force-pushed the master branch from 38b890a to fbffdb6 on August 18, 2015 00:21
