Added support for PyTorch Lightning in the DDP backend. #162

Draft
wants to merge 1 commit into
base: master

gopaljigaur
Collaborator

This pull request includes several changes to improve the handling of distributed data parallel (DDP) setups and trial evaluation in the neps runtime. The changes focus on adding support for evaluating trials in a DDP context and ensuring proper state management.

DDP and Trial Evaluation Enhancements:

  • Added a new function _is_ddp_and_not_rank_zero to check if the current process is part of a DDP setup and is not the rank zero process. (neps/runtime.py, R49-R66)
  • Introduced the _launch_ddp_runtime function to handle the evaluation of trials in a DDP setup. This function ensures that only the rank zero process launches a new worker. (neps/runtime.py, R512-R531)
  • Modified the _launch_runtime function to use _launch_ddp_runtime when in a DDP setup and not rank zero. This prevents non-rank-zero processes from launching new workers. (neps/runtime.py, R550-R556)
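
For readers unfamiliar with the pattern, here is a minimal sketch of a rank check and the dispatch built on it. It is not the code from this PR: the function bodies, the `env` parameter, and the returned strings are hypothetical, and it assumes the launcher exposes the process rank through the `RANK` or `LOCAL_RANK` environment variables (as `torchrun` does). On multi-node jobs `LOCAL_RANK` alone does not identify the global rank zero process, so a real check may need more than this.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def is_ddp_and_not_rank_zero(env: Mapping[str, str] | None = None) -> bool:
    """Return True when a rank variable is set and it is not 0.

    Assumption: the DDP launcher sets RANK (global) or LOCAL_RANK
    (per node). If neither is set, we treat the process as non-DDP.
    """
    env = os.environ if env is None else env
    rank = env.get("RANK", env.get("LOCAL_RANK"))
    if rank is None:
        return False
    return int(rank) != 0


def launch(env: Mapping[str, str] | None = None) -> str:
    """Hypothetical dispatch: only rank zero (or a non-DDP process)
    starts a new worker; every other rank takes the DDP path."""
    if is_ddp_and_not_rank_zero(env):
        return "ddp-runtime"
    return "new-worker"
```

Passing the environment as an argument keeps the check easy to test without mutating `os.environ`.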

State Management Improvements:

  • Added the evaluating method to the FileBasedTrialStore class to retrieve all evaluating trials. (neps/state/filebased.py, R212-R220)
  • Added the get_current_evaluating_trial method to the NepsState class to get the current trial being evaluated. (neps/state/neps_state.py, R217-R222)
  • Defined the evaluating method in the TrialStore protocol to standardize the retrieval of evaluating trials across different implementations. (neps/state/protocols.py, R141-R144)
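
The relationship between the protocol method and the lookup built on it can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `Trial` dataclass, the `"evaluating"` state string, the in-memory store, and the rule of returning the first evaluating trial are all hypothetical stand-ins for the file-based classes named above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Trial:
    id: str
    state: str  # hypothetical states: "pending", "evaluating", "success"


class TrialStore(Protocol):
    """Any store that can report the trials currently being evaluated."""

    def evaluating(self) -> dict[str, Trial]: ...


class InMemoryTrialStore:
    """Stand-in for a file-based store; keeps trials in a dict."""

    def __init__(self, trials: list[Trial]) -> None:
        self._trials = {t.id: t for t in trials}

    def evaluating(self) -> dict[str, Trial]:
        # Keep only the trials whose state marks them as in progress.
        return {i: t for i, t in self._trials.items() if t.state == "evaluating"}


def get_current_evaluating_trial(store: TrialStore) -> Trial | None:
    """Return the first evaluating trial, or None if there is none.

    Assumption: at most one trial is evaluating at a time; how to
    choose among several is left open here.
    """
    evaluating = store.evaluating()
    return next(iter(evaluating.values()), None)
```

Because the lookup depends only on the protocol, any store that implements `evaluating` can be swapped in without changing the caller.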

Together, these changes let the neps runtime evaluate trials under DDP without non-rank-zero processes launching extra workers, and give every TrialStore implementation a standard way to look up the trials that are currently evaluating.

@gopaljigaur gopaljigaur requested review from eddiebergman and DaStoll and removed request for eddiebergman December 8, 2024 17:44