[DeepSpeed Example] Fail on AWS T4 due to package import issue #4434

Open
yika-luo opened this issue Dec 3, 2024 · 0 comments
Labels
good first issue Good for newcomers

Collaborator

yika-luo commented Dec 3, 2024

To reproduce: set accelerators: T4:1 in examples/deepspeed-multinode/sky.yaml, then run:

sky launch examples/deepspeed-multinode/sky.yaml -c ds -i1 --down --cloud=aws

The datasets import failure is interesting: after SSHing into the sky cluster and running pip install "datasets>=2.8.0", import datasets works. I then checked the setup log and saw a few attempts at Collecting datasets>=2.8.0 (from -r requirements.txt (line 1)), but possibly none of them succeeded.
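One way to narrow this down is to check, on each node, whether the interpreter that runs main.py can actually see the package. The following is a minimal sketch, not part of the example itself: the helper name module_available is hypothetical, and it assumes the script is run with the same Python that appears in the log (/home/ubuntu/miniconda3/envs/deepspeed/bin/python).

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` is importable here.

    Hypothetical helper for illustration; it only looks the module up
    and does not execute the module's code.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Print the interpreter path so it can be compared with the one in the
    # job log; a mismatch would suggest pip installed into another env.
    print("interpreter:", sys.executable)
    print("datasets importable:", module_available("datasets"))
```

If the interpreter path printed here differs from the one in the log, the setup step and the run step may be using different environments, which would be consistent with the install appearing in the setup log while the import still fails.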

(head, rank=0, pid=2867) 172.31.10.116: [2024-12-03 20:53:28,710] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  async_io: please install the libaio-dev package with apt
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  async_io: please install the libaio-dev package with apt
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
(head, rank=0, pid=2867) 172.31.15.28:  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
(head, rank=0, pid=2867) 172.31.15.28: Traceback (most recent call last):
(head, rank=0, pid=2867) 172.31.15.28:   File "main.py", line 27, in <module>
(head, rank=0, pid=2867) 172.31.15.28:     from utils.data.data_utils import create_prompt_dataset
(head, rank=0, pid=2867) 172.31.15.28:   File "/home/ubuntu/sky_workdir/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 12, in <module>
(head, rank=0, pid=2867) 172.31.15.28:     from datasets import load_dataset
(head, rank=0, pid=2867) 172.31.15.28: ModuleNotFoundError: No module named 'datasets'
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
(head, rank=0, pid=2867) 172.31.10.116:  [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
(head, rank=0, pid=2867) 172.31.10.116: Traceback (most recent call last):
(head, rank=0, pid=2867) 172.31.10.116:   File "main.py", line 27, in <module>
(head, rank=0, pid=2867) 172.31.10.116:     from utils.data.data_utils import create_prompt_dataset
(head, rank=0, pid=2867) 172.31.10.116:   File "/home/ubuntu/sky_workdir/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 12, in <module>
(head, rank=0, pid=2867) 172.31.10.116:     from datasets import load_dataset
(head, rank=0, pid=2867) 172.31.10.116: ModuleNotFoundError: No module named 'datasets'
(head, rank=0, pid=2867) 172.31.15.28: [2024-12-03 20:53:31,139] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3352
(head, rank=0, pid=2867) 172.31.15.28: [2024-12-03 20:53:31,140] [ERROR] [launch.py:325:sigkill_handler] ['/home/ubuntu/miniconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '1e-3', '--weight_decay', '0.1', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--only_optimize_lora', '--deepspeed', '--output_dir', './output'] exits with return code = 1
(head, rank=0, pid=2867) 172.31.10.116: [2024-12-03 20:53:31,657] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2799
(head, rank=0, pid=2867) 172.31.10.116: [2024-12-03 20:53:31,657] [ERROR] [launch.py:325:sigkill_handler] ['/home/ubuntu/miniconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '1e-3', '--weight_decay', '0.1', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--only_optimize_lora', '--deepspeed', '--output_dir', './output'] exits with return code = 1
(head, rank=0, pid=2867) pdsh@ip-172-31-15-28: 172.31.15.28: ssh exited with exit code 1
(head, rank=0, pid=2867) pdsh@ip-172-31-15-28: 172.31.10.116: ssh exited with exit code 1
✓ Job finished (status: SUCCEEDED).
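Note that the final status line reports SUCCEEDED even though both nodes hit ModuleNotFoundError and exited with return code 1. As a rough sketch for triaging logs like the one above, the snippet below groups missing modules by node IP. The function name missing_modules_by_host and the regular expression are assumptions based only on the line format shown in this log, not on any SkyPilot or DeepSpeed API.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log above:
# (head, rank=0, pid=2867) 172.31.15.28: ModuleNotFoundError: No module named 'datasets'
_MISSING_MODULE = re.compile(
    r"\)\s+(?P<host>\d{1,3}(?:\.\d{1,3}){3}):\s+"
    r"ModuleNotFoundError: No module named '(?P<module>[^']+)'"
)


def missing_modules_by_host(log_text: str) -> dict[str, set[str]]:
    """Map each node IP to the set of modules it failed to import."""
    found: dict[str, set[str]] = defaultdict(set)
    for line in log_text.splitlines():
        match = _MISSING_MODULE.search(line)
        if match:
            found[match.group("host")].add(match.group("module"))
    return dict(found)
```

Run against the log in this issue, this would report datasets as missing on both 172.31.15.28 and 172.31.10.116, which points at the setup step rather than a single flaky node.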
@Michaelvll Michaelvll added the good first issue Good for newcomers label Dec 12, 2024