new "parallel unix" backend for running jobs over multiple local processors #75

Open · wants to merge 6 commits into master
Conversation


@jso commented Aug 14, 2013

I wrote this backend to enable local dumbo jobs to leverage multiple processor cores.

Minimal usage example, which will run 4 mappers in parallel and then run 4 reducers:
dumbo -input [infiles] -output [outpath] -punix yes -tmpdir [tmppath] -nmappers 4 -nreducers 4
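
For context, the job being launched is just an ordinary dumbo script; the canonical wordcount from dumbo's short tutorial is enough to try this out (the script name below is only a placeholder):

```python
# wordcount.py -- the standard dumbo wordcount, unchanged; the new backend
# only changes how the map and reduce processes are scheduled locally.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```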

Along the way, I also added a few other options and features.

These new options are backend-agnostic:

  • inputfile: used instead of "input" to pass a file listing all the input paths. This is helpful when the list of input files is too long for the command-line buffer, e.g. for big jobs.
  • shell: specifies the shell used to execute commands instead of the default '/bin/sh'

These are specific to the new parallel unix backend:

  • punix: flag to enable this backend
  • nmappers: the number of mappers to run at the same time
  • nreducers: the number of reducers to use, which all run simultaneously
  • tmpdir: local file system path to store temporary files
  • permapper: the number of input files handled by each mapper process (optional). Assigning several files per process reduces the rate at which new processes are created, which can lower file system load when dumbo runs on a clustered file system. (A rough sketch of how these options fit together follows this list.)
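
To make the semantics of nmappers/permapper/tmpdir concrete, here is a rough, hypothetical sketch of the kind of fan-out they control. It is not the code in this PR, and the command strings and file names are made up:

```python
# Hypothetical sketch only -- not the actual backend code in this PR.
import subprocess
from multiprocessing.pool import ThreadPool

def run_map_phase(input_files, nmappers, permapper, tmpdir, shell='/bin/sh'):
    # Batch the inputs so each mapper process handles `permapper` files
    # before exiting, reducing process turnover.
    batches = [input_files[i:i + permapper]
               for i in range(0, len(input_files), permapper)]
    cmds = ['cat %s | python job.py map > %s/map.%d'
            % (' '.join(batch), tmpdir, i)
            for i, batch in enumerate(batches)]
    # Run at most `nmappers` mapper pipelines at the same time.
    pool = ThreadPool(nmappers)
    return pool.map(lambda cmd: subprocess.call(cmd, shell=True,
                                                executable=shell),
                    cmds)
```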

jso added 6 commits July 6, 2011 14:07
…tiple cores. also, added an option to specify the shell to be used for executing jobs instead of the default /bin/sh.
…ng all the input files to be processed. in case of very large jobs, this works around the limit of what can be passed on the command line
@klbostee (Owner) commented Sep 2, 2013

Sounds good! Will try to find some time to review soonish.

@klbostee (Owner) commented Sep 4, 2013

Two comments:

  • Doesn't adhere to PEP 8 style guide.
  • The Hadoop backend already has "nummaptasks" and "numreducetasks" options — maybe it would be better to use the same option names for this backend, instead of "nmappers" and "nreducers"?

@jso (Author) commented Sep 4, 2013

Thanks for the comments, Klaas, and your patience -- I'm new to this!

I will fix up the code to comply with PEP 8 and follow up here when I have addressed these issues.

Regarding the option names, I think there is a distinction to be made in this case. My intention with these options was to control the degree of multiprogramming used by the backend. In my mind, there are several reasons to set this separately from the input-/output-specific nummaptasks/numreducetasks options for the Hadoop backend, such as wanting to split the output across 100 reduce tasks but only processing 8 at a time.

Granted, as the code stands, numreducetasks is redundant with nreducers -- but given this discussion, perhaps it would be worthwhile to account for the distinction I described above.
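
To illustrate the distinction with purely hypothetical numbers and names (not code from this PR): the output could still be split into 100 reduce partitions while only ever running 8 reducer processes at once:

```python
# Hypothetical: 100 output partitions (numreducetasks) processed by at most
# 8 concurrent reducer processes (nreducers).
import subprocess
from multiprocessing.pool import ThreadPool

numreducetasks = 100  # how the output is split
nreducers = 8         # how many reducers run at the same time

def run_reducer(part):
    # Each partition's sorted map output is piped through the reduce script.
    cmd = 'sort /tmp/part.%d | python job.py red > /tmp/out.%d' % (part, part)
    return subprocess.call(cmd, shell=True)

ThreadPool(nreducers).map(run_reducer, range(numreducetasks))
```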


@igorgatis commented
I think new flags with such similar names will introduce confusion.

+1 to reusing the Hadoop ones. I also think the number of processes should be capped at the number of cores (does it make any sense to have more processes than cores?)
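
For reference, such a cap would be a one-line change in the backend (hypothetical helper, not in this PR):

```python
import multiprocessing

def cap_at_cores(requested):
    # Never launch more worker processes than there are cores.
    return min(requested, multiprocessing.cpu_count())
```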
