Workflow for AutoTP #4961
The specific error below occurs because the container was not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for the container and post them here.
|
On my system, the Docker container needs to be started with the SYS_NICE capability, using the following flag.
I'm not sure how to turn this on for the DeepSpeed runner. Without this capability, we have to remove |
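For anyone who wants to confirm whether a container actually has this capability, here is a minimal sketch that decodes the effective capability mask from `/proc/self/status`. It assumes a Linux host; the helper names `has_capability` and `read_cap_eff` are hypothetical and not part of DeepSpeed or this PR.

```python
CAP_SYS_NICE = 23  # bit index of CAP_SYS_NICE in linux/capability.h


def has_capability(cap_eff_hex: str, cap_bit: int = CAP_SYS_NICE) -> bool:
    """Return True if bit `cap_bit` is set in a hexadecimal CapEff mask."""
    return bool((int(cap_eff_hex, 16) >> cap_bit) & 1)


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Read the effective capability mask of the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                # The line looks like "CapEff:\t<hex mask>"
                return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)
```

Calling `has_capability(read_cap_eff())` inside the container reports whether CAP_SYS_NICE is in the effective set of the current process.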
A proper behavior of DeepSpeed |
Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks! |
@tjruwase Thanks! The autotp workflow currently passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across runs; downloading it is the most time-consuming part of this workflow. I will need some guidance (i.e., which directory on the runner is persistent?) or will have to observe another run to see whether the checkpoint is preserved. |
@delock, it is great to see the CI now passing. I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint. |
@mrwyattii @loadams it would be great if there is a link showing how persistence is handled on this runner.
|
I know @mrwyattii and I still need to leave feedback on this PR, but there is an example of where things are on the blob storage here. I'm not sure that's the best example, but it is one that shows persisting a larger download/install. |
Thanks for the suggestion, @loadams. By looking at the usage of '/blob' in DeepSpeed workflows, I found that I need to use the default value of TRANSFORMERS_CACHE. Let me make the change and see if it persists. |
Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested. |
Apologies, I was out but it should be running now. |
Thanks! The failure in the workflow should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in |
@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure.
|
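A small sketch of the kind of check that would surface this mismatch early. It assumes that compatible PyTorch and Intel Extension for PyTorch builds share the same major.minor version, which matches the 2.2.0 versus 2.1 case described above; the function names are hypothetical and not part of the workflow.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.2.0' or '2.1.0+cpu'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(m.group(1)), int(m.group(2))


def versions_compatible(torch_version: str, ipex_version: str) -> bool:
    """Assumption: the two packages are compatible when major.minor match."""
    return major_minor(torch_version) == major_minor(ipex_version)
```

In a workflow step, the inputs could come from `torch.__version__` and `intel_extension_for_pytorch.__version__`, failing fast when `versions_compatible` returns False.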
@loadams The Falcon 7b model is not supported by DeepSpeed AutoTP yet. I updated the workflow to test Baichuan 7b instead. Can you help restart the workflow? Thanks! |
Hi @loadams, the command line for the Baichuan model has been changed to fix the test error. The reason is that the Baichuan model contains remote code, so trust_remote_code needs to be set to true. Can you help restart the workflow? Thanks! |
Hi @loadams, it looks like the environment issue has been fixed. Can you help restart the workflow? Thanks! |
@loadams I ran these two tests in my local environment. They didn't take that long. Can you help run this workflow again to see whether it is reproducible? Thanks! |
Re-running now |
Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since CPU UT is already covered by workflow |
Done |
Regarding the Baichuan model failure: I'm seeing it pass in my local environment with exactly the same arguments. From the failed log in the workflow, I see a 'file not found' error when acquiring a lock. I suspect this is because of
|
@loadams |
Hi @loadams, after reading the error log I suspect the Baichuan model under TRANSFORMERS_CACHE is corrupted. I unset TRANSFORMERS_CACHE since we set HF_HOME for this model. I also added a peek at TRANSFORMERS_CACHE and HF_HOME in case a manual cleanup is needed. Can you help start the workflow? Thanks! |
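To illustrate why unsetting TRANSFORMERS_CACHE matters once HF_HOME is set, here is a simplified sketch of how the cache directory is typically resolved. The precedence shown (TRANSFORMERS_CACHE, then HF_HUB_CACHE, then `$HF_HOME/hub`) is an assumption based on the documented Hugging Face environment variables; it ignores XDG_CACHE_HOME and older aliases, and `resolve_hub_cache` is a hypothetical helper, not a library function.

```python
import os
from typing import Mapping


def resolve_hub_cache(env: Mapping[str, str]) -> str:
    """Return the directory where model files would be cached (simplified)."""
    # Assumption: an explicit TRANSFORMERS_CACHE overrides everything else.
    if env.get("TRANSFORMERS_CACHE"):
        return env["TRANSFORMERS_CACHE"]
    # Assumption: HF_HUB_CACHE comes next.
    if env.get("HF_HUB_CACHE"):
        return env["HF_HUB_CACHE"]
    # Otherwise fall back to $HF_HOME/hub, with HF_HOME defaulting to
    # ~/.cache/huggingface.
    hf_home = env.get("HF_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache", "huggingface"
    )
    return os.path.join(hf_home, "hub")
```

Under this sketch, leaving a stale TRANSFORMERS_CACHE set would keep pointing at the suspect directory even after HF_HOME is changed, which is consistent with the cleanup described above. Passing `os.environ` shows what the current job would use.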
@loadams @tjruwase the latest error in Baichuan model AutoTP is very weird. It complains about a lock file not found or an attribute not found, which I cannot reproduce locally. This probably indicates some corrupted state in the hf_hub downloaded data. Currently, bloom and opt model AutoTP run consistently well. Can we merge this baseline first and then add new AutoTP model validation in a follow-up PR? It might take a while to debug this issue. I'll submit a commit to disable the baichuan test first. |
This PR adds a new extensible workflow for automatic tensor parallelism (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/). The workflow aims to provide a way to validate AutoTP for LLMs.