new "parallel unix" backend for running jobs over multiple local processors #75

Open · wants to merge 6 commits into master
Conversation


@jso commented Aug 14, 2013

I wrote this backend to enable local dumbo jobs to leverage multiple processor cores.

Minimal usage example, which will run 4 mappers in parallel and then run 4 reducers:
dumbo -input [infiles] -output [outpath] -punix yes -tmpdir [tmppath] -nmappers 4 -nreducers 4
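
For context, the job being launched is just an ordinary dumbo script; the canonical wordcount from dumbo's short tutorial is enough to try this out (the script name below is only a placeholder):

```python
# wordcount.py -- the standard dumbo wordcount, unchanged; the new backend
# only changes how the map and reduce processes are scheduled locally.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```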

Along the way, I also added a few other options and features.

These new options are backend-agnostic:

  • inputfile: used instead of "input" to pass a file listing all the input paths. This is helpful when the list of input files is too long for the command-line buffer, e.g. for big jobs.
  • shell: specifies the shell used to execute commands instead of the default '/bin/sh'

These are specific to the new parallel unix backend:

  • punix: flag to enable this backend
  • nmappers: the number of mappers to run at the same time
  • nreducers: the number of reducers to use, which all run simultaneously
  • tmpdir: local file system path to store temporary files
  • permapper: the number of input files handled by each mapper process (optional). Assigning several files per process reduces the rate at which new processes are created, which can lower file system load when dumbo runs on a clustered file system. (A rough sketch of how these options fit together follows this list.)
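
To make the semantics of nmappers/permapper/tmpdir concrete, here is a rough, hypothetical sketch of the kind of fan-out they control. It is not the code in this PR, and the command strings and file names are made up:

```python
# Hypothetical sketch only -- not the actual backend code in this PR.
import subprocess
from multiprocessing.pool import ThreadPool

def run_map_phase(input_files, nmappers, permapper, tmpdir, shell='/bin/sh'):
    # Batch the inputs so each mapper process handles `permapper` files
    # before exiting, reducing process turnover.
    batches = [input_files[i:i + permapper]
               for i in range(0, len(input_files), permapper)]
    cmds = ['cat %s | python job.py map > %s/map.%d'
            % (' '.join(batch), tmpdir, i)
            for i, batch in enumerate(batches)]
    # Run at most `nmappers` mapper pipelines at the same time.
    pool = ThreadPool(nmappers)
    return pool.map(lambda cmd: subprocess.call(cmd, shell=True,
                                                executable=shell),
                    cmds)
```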

jso added 6 commits July 6, 2011 14:07
…tiple cores. also, added an option to specify the shell to be used for executing jobs instead of the default /bin/sh.
…ng all the input files to be processed. in case of very large jobs, this works around the limit of what can be passed on the command line
@klbostee (Owner) commented Sep 2, 2013

Sounds good! Will try to find some time to review soonish.

@klbostee (Owner) commented Sep 4, 2013

Two comments:

  • Doesn't adhere to PEP 8 style guide.
  • The Hadoop backend already has "nummaptasks" and "numreducetasks" options — maybe it would be better to use the same option names for this backend, instead of "nmappers" and "nreducers"?

@jso (Author) commented Sep 4, 2013

Thanks for the comments, Klaas, and your patience -- I'm new to this!

I will fix up the code to comply with PEP 8 and follow up here when I have addressed these issues.

Regarding the option names, I think there is a distinction to be made in this case. My intention with these options was to control the degree of multiprogramming used by the backend. In my mind, there are several reasons to set this separately from the input-/output-specific nummaptasks/numreducetasks options for the Hadoop backend, such as wanting to split the output across 100 reduce tasks but only processing 8 at a time.

Granted, as the code stands, numreducetasks is redundant with nreducers -- but given this discussion, perhaps it would be worthwhile to account for the distinction I described above.
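
To illustrate the distinction with purely hypothetical numbers and names (not code from this PR): the output could still be split into 100 reduce partitions while only ever running 8 reducer processes at once:

```python
# Hypothetical: 100 output partitions (numreducetasks) processed by at most
# 8 concurrent reducer processes (nreducers).
import subprocess
from multiprocessing.pool import ThreadPool

numreducetasks = 100  # how the output is split
nreducers = 8         # how many reducers run at the same time

def run_reducer(part):
    # Each partition's sorted map output is piped through the reduce script.
    cmd = 'sort /tmp/part.%d | python job.py red > /tmp/out.%d' % (part, part)
    return subprocess.call(cmd, shell=True)

ThreadPool(nreducers).map(run_reducer, range(numreducetasks))
```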


@igorgatis commented
I think new flags with such similar names will introduce confusion.

+1 to reusing the Hadoop ones. I also think the number of processes should be capped at the number of cores (does it make any sense to have more processes than cores?)
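
For reference, such a cap would be a one-line change in the backend (hypothetical helper, not in this PR):

```python
import multiprocessing

def cap_at_cores(requested):
    # Never launch more worker processes than there are cores.
    return min(requested, multiprocessing.cpu_count())
```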
