To reproduce: use `accelerators: T4:1` in `examples/deepspeed-multinode/sky.yaml` and run `sky launch examples/deepspeed-multinode/sky.yaml -c ds -i1 --down --cloud=aws`.
This dataset import issue is interesting: after SSHing into the sky cluster and running `pip install "datasets>=2.8.0"`, `import datasets` works. I checked the setup log and saw a few attempts at `Collecting datasets>=2.8.0 (from -r requirements.txt (line 1))`, but it seems none of them succeeded.
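For reference, a minimal sketch of the manual check described above (it assumes the cluster was launched with `-c ds`, so SkyPilot's generated SSH config lets `ssh ds` reach the head node; the `deepspeed` env name is taken from the log below):

```bash
# Sketch of the manual workaround, run interactively (not the example's setup script).
ssh ds

# ...then, on the cluster:
conda activate deepspeed                        # env name seen in the log below
pip install "datasets>=2.8.0"                   # quote the spec so '>' is not a shell redirect
python -c "import datasets; print(datasets.__version__)"
```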
(head, rank=0, pid=2867) 172.31.10.116: [2024-12-03 20:53:28,710] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] async_io: please install the libaio-dev package with apt
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] async_io: please install the libaio-dev package with apt
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
(head, rank=0, pid=2867) 172.31.15.28: [WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
(head, rank=0, pid=2867) 172.31.15.28: Traceback (most recent call last):
(head, rank=0, pid=2867) 172.31.15.28: File "main.py", line 27, in <module>
(head, rank=0, pid=2867) 172.31.15.28: from utils.data.data_utils import create_prompt_dataset
(head, rank=0, pid=2867) 172.31.15.28: File "/home/ubuntu/sky_workdir/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 12, in <module>
(head, rank=0, pid=2867) 172.31.15.28: from datasets import load_dataset
(head, rank=0, pid=2867) 172.31.15.28: ModuleNotFoundError: No module named 'datasets'
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
(head, rank=0, pid=2867) 172.31.10.116: [WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
(head, rank=0, pid=2867) 172.31.10.116: Traceback (most recent call last):
(head, rank=0, pid=2867) 172.31.10.116: File "main.py", line 27, in <module>
(head, rank=0, pid=2867) 172.31.10.116: from utils.data.data_utils import create_prompt_dataset
(head, rank=0, pid=2867) 172.31.10.116: File "/home/ubuntu/sky_workdir/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 12, in <module>
(head, rank=0, pid=2867) 172.31.10.116: from datasets import load_dataset
(head, rank=0, pid=2867) 172.31.10.116: ModuleNotFoundError: No module named 'datasets'
(head, rank=0, pid=2867) 172.31.15.28: [2024-12-03 20:53:31,139] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3352
(head, rank=0, pid=2867) 172.31.15.28: [2024-12-03 20:53:31,140] [ERROR] [launch.py:325:sigkill_handler] ['/home/ubuntu/miniconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '1e-3', '--weight_decay', '0.1', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--only_optimize_lora', '--deepspeed', '--output_dir', './output'] exits with return code = 1
(head, rank=0, pid=2867) 172.31.10.116: [2024-12-03 20:53:31,657] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2799
(head, rank=0, pid=2867) 172.31.10.116: [2024-12-03 20:53:31,657] [ERROR] [launch.py:325:sigkill_handler] ['/home/ubuntu/miniconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '1e-3', '--weight_decay', '0.1', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--only_optimize_lora', '--deepspeed', '--output_dir', './output'] exits with return code = 1
(head, rank=0, pid=2867) pdsh@ip-172-31-15-28: 172.31.15.28: ssh exited with exit code 1
(head, rank=0, pid=2867) pdsh@ip-172-31-15-28: 172.31.10.116: ssh exited with exit code 1
✓ Job finished (status: SUCCEEDED).
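If the `pip install -r requirements.txt` step in the example's setup is failing silently, one possible mitigation (a sketch only, assuming a typical SkyPilot `setup:` script; not the actual contents of `examples/deepspeed-multinode/sky.yaml`) is to install `datasets` explicitly and make setup abort when the import still fails, so the problem surfaces during provisioning instead of at run time:

```bash
# Hypothetical lines to append to the setup commands in the sky.yaml (illustrative only):
set -e                                # stop setup on the first failing command
pip install "datasets>=2.8.0"         # explicit install, quoted version spec
python -c "import datasets"           # abort setup here if the package is still missing
```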