❓ Question
I'm following the instructions in the README to set up a local environment to run the project. When I ran the unit tests, some of them failed. Going through the error messages, they all seem to be related to an incorrect device.type somehow. The simplest failing test is tests/models/llm_embed/test_llm_embedding.py::test_contrastive_eval_loss_device_handling:
tests/models/llm_embed/test_llm_embedding.py:303: in test_contrastive_eval_loss_device_handling
assert metric.loss.device.type == device
E AssertionError: assert 'cpu' == 'cuda'
E
E - cuda
E + cpu
It looks like the tensor is supposed to be on cuda, but it is on cpu instead. I feel like I'm missing something obvious, but I couldn't figure out what. Any insight is appreciated!
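For context, this is the shape of the check that fails, paraphrased from the assertion above rather than copied from the real test body: with the cuda parametrisation, the metric's loss tensor is expected to already be on the GPU.

```python
import torch

# Paraphrase of the failing assertion (my own sketch, not the actual test code):
# the test runs with device == "cuda" and expects the metric's loss tensor to
# already live on that device.
device = "cuda"

loss_on_cpu = torch.tensor(0.0)
print(loss_on_cpu.device.type)       # "cpu"  -- this is what the failing test sees

if torch.cuda.is_available():
    loss_on_gpu = loss_on_cpu.to(device)
    print(loss_on_gpu.device.type)   # "cuda" -- what the assertion expects
```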
Additional context
I set up my environment on a Google Cloud machine with one A100 GPU. I installed all the necessary packages and drivers for CUDA and Docker, and pulled the Docker image mosaicml/llm-foundry:2.5.1_cu124-latest. This is how I started the container with access to the GPU:
sudo docker run -it --shm-size=2g --rm --gpus all mosaicml/llm-foundry:2.5.1_cu124-latest bash
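For completeness, here is a quick sanity check I can run inside the container if it helps (my own snippet, not from the repo); the expected values in the comments are just my assumptions based on the image tag and the A100:

```python
import torch

# Quick diagnostic inside the container (my own snippet, not part of llm-foundry):
# checks that the PyTorch build is CUDA-enabled and that `--gpus all` actually
# exposed the A100 to the container.
print(torch.__version__)              # expecting a 2.5.1+cu124 build, per the image tag
print(torch.version.cuda)             # expecting "12.4"
print(torch.cuda.is_available())      # should be True if the GPU is visible
print(torch.cuda.device_count())      # should be 1 for a single A100
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # should name the A100-SXM4-40GB
```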
Within the docker container, the output of nvidia-smi is:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:00:04.0 Off | 0 |
| N/A 30C P0 42W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
After successfully installing llm-foundry in the Docker container, I ran pytest, and about 16 test cases failed, all with errors like the one above. I can recreate the failure by running that specific test alone (pytest tests/models/llm_embed/test_llm_embedding.py::test_contrastive_eval_loss_device_handling).
I also ran the first example in the README (python data_prep/convert_dataset_hf.py --dataset allenai/c4 --data_subset en --out_root my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>') without any issue.
Another example of a failing test is:
__________________________________________________________________________________________________ test_icl_eval __________________________________________________________________________________________________
/llm-foundry/tests/a_scripts/eval/test_eval.py:78: in test_icl_eval
evaluate(eval_cfg)
/llm-foundry/llmfoundry/command_utils/eval.py:267: in evaluate
dist.initialize_dist(get_device(None), timeout=eval_config.dist_timeout)
/usr/lib/python3/dist-packages/composer/utils/dist.py:539: in initialize_dist
raise RuntimeError(
E RuntimeError: The requested backend (nccl) differs from the backend of the current process group (gloo). If you wish to change backends, please restart the python process.
---------------------------------------------------------------------------------------------- Captured stderr setup ----------------------------------------------------------------------------------------------
MPTForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
This also looks like a case where something is supposed to be initialised on cuda (nccl) but ends up on cpu (gloo) instead.
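If it helps, my reading of that error, sketched below against plain torch.distributed rather than llm-foundry's own code, is that a process group created with the gloo (CPU) backend earlier in the pytest run cannot be switched to nccl later in the same process:

```python
import os
import torch.distributed as dist

# Minimal illustration (my own sketch, not llm-foundry code) of the backend
# conflict in the RuntimeError above.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Something earlier in the test session sets up a CPU (gloo) process group...
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())        # "gloo"

# ...so when a later GPU test asks Composer's initialize_dist for nccl, the
# existing group still reports gloo and Composer raises the RuntimeError,
# telling you to restart the Python process to change backends.
dist.destroy_process_group()
```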
Please let me know if any further information is needed.