new "parallel unix" backend for running jobs over multiple local processors #75
base: master
Conversation
…tiple cores. also, added an option to specify the shell to be used for executing jobs instead of the default /bin/sh.
…ng all the input files to be processed. in case of very large jobs, this works around the limit of what can be passed on the command line
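The work-around mentioned in the commit note above (passing input paths via a file rather than on the command line) could be sketched roughly as follows. This is an illustrative sketch, not dumbo's actual code; the function names are hypothetical:

```python
import os
import tempfile

def write_input_list(paths, tmpdir):
    """Write one input path per line to a temp file and return its name.
    Handing workers this file name instead of the raw path list avoids
    the OS limit (ARG_MAX) on command-line length for very large jobs."""
    fd, name = tempfile.mkstemp(prefix="inputs-", dir=tmpdir)
    with os.fdopen(fd, "w") as f:
        for p in paths:
            f.write(p + "\n")
    return name

def read_input_list(name):
    """Read the path list back inside a worker process."""
    with open(name) as f:
        return [line.rstrip("\n") for line in f]
```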
Sounds good! Will try to find some time to review soonish.
Two comments:
Thanks for the comments, Klaas, and your patience -- I'm new to this! I will fix up the code to comply with PEP 8 and follow up here when I have addressed these issues.

Regarding the option names, I think there is a distinction to be made in this case. My intention with these options was to control the degree of multiprogramming used by the backend. In my mind, there are several reasons to set this separately from the input-/output-specific nummaptasks/numreducetasks options for the Hadoop backend, such as wanting to split the output across 100 reduce tasks but only processing 8 at a time. Granted, as the code stands, numreducetasks is redundant with nreducers -- but given this discussion, perhaps it would be worthwhile to account for the distinction described above.

On Wednesday, September 4, 2013 at 10:33 AM, Klaas Bosteels wrote:
I think new flags with such similar names will introduce confusion. +1 to reuse the Hadoop ones. I think the number of processes should be capped at the number of cores (does it make any sense to have more processes than cores?)
I wrote this backend to enable local dumbo jobs to leverage multiple processor cores.
Minimal usage example, which will run 4 mappers in parallel and then run 4 reducers:
dumbo -input [infiles] -output [outpath] -punix yes -tmpdir [tmppath] -nmappers 4 -nreducers 4
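Conceptually, what the backend does for a step like the one above can be sketched as a pool of local processes each running one mapper (or reducer) command. This is a simplified illustration under assumed names, not dumbo's internal interface; the `shell` parameter mirrors the new option for overriding the default /bin/sh:

```python
import subprocess
from multiprocessing.dummy import Pool  # thread pool; each thread drives one subprocess

def run_jobs(commands, nworkers, shell="/bin/sh"):
    """Run shell commands with at most `nworkers` executing at once,
    using the configured shell instead of the default /bin/sh.
    Returns the exit codes in the same order as `commands`."""
    def run(cmd):
        return subprocess.call([shell, "-c", cmd])
    with Pool(nworkers) as pool:
        return pool.map(run, commands)
```

With `-nmappers 4`, for example, four mapper commands would be in flight at a time until all inputs are processed, after which the reducer commands are run the same way.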
Along the way, I also added a few additional options and features.
These new options are backend-agnostic:
These are specific to the new parallel unix backend: