Some adjustments for supporting DeepSpeed-Ulysses #2877

Open · wants to merge 3 commits into main from pass-mpu-into-deepspeed

Conversation

@zeyugao commented Jun 20, 2024

What does this PR do?

In this PR, I made some necessary modifications to accelerate to enable sequence parallelism with DeepSpeed-Ulysses when used together with transformers.

More concretely:

  • Copy parallel_state.py (mpu) from Megatron-DeepSpeed and add several helper functions, including (see the sketch after this list):
get_sequence_parallel_world_size_or_one  # Get the sequence parallel world size if initialized, otherwise return 1
sequence_parallel_is_enabled  # Check whether model parallel is initialized and the sequence parallel world size is larger than 1
  • Adjust the multiplier of the world size when calculating train_batch_size.
  • Pass the mpu argument into deepspeed.initialize when model parallel is initialized.
  • Do not shard the dataset when the sampler is a DistributedSampler, which means the dataset is manually controlled.
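
For illustration, a minimal sketch of how these pieces could fit together. The import path of the copied mpu module, the helper bodies, and the deepspeed.initialize wiring are assumptions based on the description above, not the PR's actual code.

```python
# Illustrative sketch only: `parallel_state` stands for the mpu module copied from
# Megatron-DeepSpeed; the import path and wiring here are assumptions, not this PR's code.
import deepspeed
import parallel_state  # the copied Megatron-DeepSpeed mpu module (assumed import path)


def get_sequence_parallel_world_size_or_one() -> int:
    # Sequence parallel world size if mpu is initialized, otherwise 1.
    if parallel_state.model_parallel_is_initialized():
        return parallel_state.get_sequence_parallel_world_size()
    return 1


def sequence_parallel_is_enabled() -> bool:
    # True when model parallel is initialized and the SP world size is larger than 1.
    return get_sequence_parallel_world_size_or_one() > 1


def deepspeed_init(model, ds_config):
    # Pass the mpu module into deepspeed.initialize so DeepSpeed derives its data
    # parallel group from the mpu topology instead of assuming DP == all ranks.
    mpu = parallel_state if sequence_parallel_is_enabled() else None
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
        mpu=mpu,
    )
    return engine, optimizer, lr_scheduler
```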

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr @BenjaminBossan @SunMarc

@SunMarc (Member) commented Jun 26, 2024

Thanks for the PR @zeyugao. We will review this PR ASAP, but just to let you know, we have quite a lot of backlog since @pacman100 left, so it might take a bit of time before we get to it. Hope you understand! If anyone is interested in merging this ASAP, please add a thumbs up!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zeyugao force-pushed the pass-mpu-into-deepspeed branch from 315bc81 to 13dad66 on July 3, 2024
@zeyugao force-pushed the pass-mpu-into-deepspeed branch from 13dad66 to 2e33e77 on July 15, 2024
@stas00 (Contributor) left a comment

Thank you for adding support for Deepspeed SP, @zeyugao

I don't think all of parallel_state.py needs to be copied when all you want to use is just a few functions. It's bad practice to include huge chunks of unused code in a public repo, so perhaps you could massively reduce it?

And the key missing part of this PR is a test that exercises this feature.

@@ -0,0 +1,790 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
@stas00 (Contributor) commented on Jul 29, 2024

Suggested change
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Adapted from https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/core/parallel_state.py

It's probably a good idea to mention the source. I didn't check, but it looks like this is where you took it from.

@stas00 (Contributor) commented Jul 29, 2024

BTW, an alternative solution has just been posted here: huggingface/transformers#32305

So perhaps wait until the maintainers decide which version is the most fitting before you invest more time into improving this PR, @zeyugao (I'm not a maintainer).

@zeyugao (Author) commented Jul 30, 2024

Thank you for commenting and mentioning that alternative PR.

I acknowledge that parallel_state.py may be too large and redundant. I can adjust this PR accordingly if needed. If the alternative PR is chosen, I can help identify any potentially mishandled behaviors, as my PR is currently being used internally for training models.

Here are some differences between the two PRs for reference (in my understanding):

  1. Utilization of _SeqAllToAll from DeepSpeed:

    • The alternative PR uses the original _SeqAllToAll provided by DeepSpeed. Upon reviewing the DeepSpeed repository, it appears that the bug I encountered while developing my PR has been fixed (see the relevant commit microsoft/DeepSpeed@3bdd187, a concurrent work). Edit: it seems that it is still not working, in my understanding; see [BUG] sequence parallel alltoall with batch size > 1 microsoft/DeepSpeed#5808.
    • Previously, I struggled to work around this bug, incorrectly identified the root cause, and employed a more complex reshape workflow to use all_to_all_single. Now we can safely switch to using all_to_all_single for better performance.
  2. Sequence Parallel Initialization:

    • Additional preparation is needed to enable sequence parallelism, including how the sequence parallel group is initialized in DeepSpeed. The alternative PR omits these steps and leaves that customization to the user (unless I missed it).
    • In the alternative PR, groups._get_sequence_parallel_world_size is used, which internally utilizes mpu.get_sequence_parallel_world_size. This approach might be more flexible, as the user may not want to use the mpu provided by accelerate.
  3. Batch Size Calculation and Data Preparation:

    • The batch size calculation and other dataset/dataloader preparations should be adjusted accordingly. When using sequence parallelism, the data parallel world size is no longer as simple as transformers previously assumed (i.e., data parallel world size == GPU count); see the sketch after this list.
  4. Hijacking the Sequence Parallel Location:

    • When I was developing my PR, _flash_attention_forward lived in each model's xxFlashAttention module in its modeling file, so I had to hijack each attention forward method individually.
    • Since transformers has now switched to a globally defined _flash_attention_forward, it is a much better idea to move the hijacking into that function so that all models can benefit from sequence parallelism. However, we need to handle the length-related variables correctly, as there are some assumptions about these variables that need to be addressed.
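
To make point 3 concrete, here is a hedged sketch of the batch-size arithmetic. The function name and the sp_world_size parameter are illustrative, not this PR's actual API.

```python
# Hedged sketch for point 3: with Ulysses, ranks within one sequence parallel group
# process the same samples, so the data parallel degree is world_size // sp_world_size.
import torch.distributed as dist


def effective_train_batch_size(micro_batch_size: int,
                               gradient_accumulation_steps: int,
                               sp_world_size: int) -> int:
    world_size = dist.get_world_size()
    dp_world_size = world_size // sp_world_size  # no longer simply the GPU count
    return micro_batch_size * gradient_accumulation_steps * dp_world_size
```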

@stas00 (Contributor) commented Jul 30, 2024

That's a wonderfully written follow-up, Elsa.

@samadejacobs, could you please have a look at Elsa's comments above and see if anything still needs to be added to your PR, and if so, hopefully you can share the credit with Elsa.

One thing I definitely agree with Elsa on is that this is Accelerate, not HF transformers, so we need to support things like the correct world size.

I'm pretty sure the final result should be an amalgamation of these two PRs. Since these are two different repos, perhaps merge Sam's PR first, as it's more upstream to Accelerate, and then have Elsa adapt her PR to add the missing parts?

@zeyugao (Author) commented Jul 30, 2024

One more comment: both accelerate and transformers should be modified to provide more out-of-the-box usage of sequence parallelism: huggingface/transformers#31525 and this one.

@samadejacobs commented Jul 30, 2024

@zeyugao, thanks for your contribution. A few things to note/recommend:

  1. PR32305 is compatible with the recent (latest) refactored flash_attn in transformers models.
  2. The PR32305 implementation leverages torch.distributed.device_mesh in lieu of an MPU/parallel state, thereby avoiding code duplication (see the sketch after this list).
  3. A fine-tuning example script for Llama is available here (you need both the HF PR and the DS PR).
  4. I recommend that Ulysses-specific bug fixes/enhancements be made as PR(s) to the DeepSpeed repo.
  5. I agree with @stas00's suggestion that PR32305 go first; yours can follow with support for accelerate and other nice enhancements.
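
For reference, a rough sketch of the device-mesh idea from point 2, assuming 8 GPUs split into 2-way data parallel by 4-way sequence parallel, a recent PyTorch that ships torch.distributed.device_mesh, and a torchrun launch. The shapes and dimension names are illustrative, not PR32305's actual code.

```python
# Illustrative only: build a 2 x 4 (dp x sp) device mesh instead of maintaining an
# MPU module. Assumes 8 ranks launched with torchrun and a recent PyTorch.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "sp"))

dp_group = mesh.get_group("dp")  # process group for data parallel collectives
sp_group = mesh.get_group("sp")  # process group for the Ulysses all-to-all
```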

@stas00 (Contributor) commented Jul 30, 2024

@zeyugao, so I suppose huggingface/transformers#31525 will have to be revisited as well once huggingface/transformers#32305 has been merged.

@huggingface deleted a comment from the github-actions bot on Sep 2, 2024
@muellerzr added the wip (Work in progress) label on Sep 2, 2024
@pavelgein commented

Hi, I have a couple of questions.

First of all, what is the status of this? I would be glad to give a hand if it would be useful.

Secondly, should we adjust the prepare_data_loader method of Accelerator?
Specifically, I think the arguments num_processes and process_index should reflect the actual world size (or, more precisely, the data parallel world size) and the process rank within that group (a sketch follows below).
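
A hedged sketch of this point, assuming the sequence parallel ranks are laid out contiguously; the helper name and the rank arithmetic are assumptions for illustration, not accelerate's API.

```python
# Illustrative sketch: shard data over the data parallel group only. Assumes a
# contiguous SP rank layout (ranks r, r+1, ..., r+sp-1 form one SP group).
import torch.distributed as dist
from torch.utils.data import DistributedSampler


def build_dp_sampler(dataset, sp_world_size: int, shuffle: bool = True):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    num_replicas = world_size // sp_world_size  # data parallel world size
    dp_rank = rank // sp_world_size             # ranks in one SP group share a shard
    return DistributedSampler(dataset, num_replicas=num_replicas, rank=dp_rank, shuffle=shuffle)
```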

@zeyugao (Author) commented Nov 18, 2024

@pavelgein The Ulysses integration for transformers is tracked in huggingface/transformers#32305, as mentioned before.

If you use transformers to train the model, there is no need to adjust prepare_data_loader, as the data loader is handled on the transformers side.

I think we can try to find more inconsistencies after huggingface/transformers#32305 is merged.
