Tests fail in local environment due to device type mismatch #1705

Open
fan-turintech opened this issue Jan 8, 2025 · 0 comments
Labels
question Further information is requested

Comments

@fan-turintech

❓ Question

I'm following the instructions in the README to set up a local environment to run the project. When I ran the unit tests, some of them failed. Going through the error messages, they all seem to be related to an incorrect device.type. The simplest failing test is tests/models/llm_embed/test_llm_embedding.py::test_contrastive_eval_loss_device_handling:

tests/models/llm_embed/test_llm_embedding.py:303: in test_contrastive_eval_loss_device_handling
    assert metric.loss.device.type == device
E   AssertionError: assert 'cpu' == 'cuda'
E     
E     - cuda
E     + cpu

It looks like the tensor is supposed to be on cuda, but it is on cpu instead. I feel like I'm missing something obvious, but I couldn't figure out what. Any insight is appreciated!
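
For clarity, this is roughly what the assertion boils down to (a minimal sketch, not the actual test code):

import torch

# A loss tensor created without an explicit device lands on cpu; the test,
# where device is 'cuda' in this failure, expects it to have been moved to the GPU.
loss = torch.tensor(0.5)
print(loss.device.type)        # 'cpu'

if torch.cuda.is_available():
    loss = loss.to("cuda")
    print(loss.device.type)    # 'cuda'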

Additional context

I set up my environment on a Google Cloud machine with a single A100 GPU. This is the output of uname -a:

Linux research-llmfoundry 6.1.0-28-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.119-1 (2024-11-22) x86_64 GNU/Linux

I installed all the necessary packages and drivers for CUDA and docker, and pulled the docker image mosaicml/llm-foundry:2.5.1_cu124-latest. This is how I started the docker container with access to the GPU:

sudo docker run -it --shm-size=2g --rm --gpus all mosaicml/llm-foundry:2.5.1_cu124-latest bash

Within the docker container, the output of nvidia-smi is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:00:04.0 Off |                    0 |
| N/A   30C    P0             42W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
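
To rule out a basic driver/runtime mismatch, PyTorch inside the container can be sanity-checked with something like the following (a minimal sketch using plain PyTorch calls):

import torch

# Quick check that PyTorch inside the container can actually use the GPU.
print(torch.version.cuda)           # CUDA version PyTorch was built against
print(torch.cuda.is_available())    # should be True on this machine
print(torch.cuda.device_count())    # should be 1 (one A100)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. NVIDIA A100-SXM4-40GB
    print(torch.ones(1, device="cuda").device)  # cuda:0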

After successfully installing llm-foundry in the docker container, I ran pytest, and about 16 test cases failed, all related to the above error. I can reproduce the error by running that specific test alone:

pytest tests/models/llm_embed/test_llm_embedding.py::test_contrastive_eval_loss_device_handling

I also ran the first example in the README (python data_prep/convert_dataset_hf.py --dataset allenai/c4 --data_subset en --out_root my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>') without any issue.

Another example of a failing test is:

__________________________________________________________________________________________________ test_icl_eval __________________________________________________________________________________________________
/llm-foundry/tests/a_scripts/eval/test_eval.py:78: in test_icl_eval
    evaluate(eval_cfg)
/llm-foundry/llmfoundry/command_utils/eval.py:267: in evaluate
    dist.initialize_dist(get_device(None), timeout=eval_config.dist_timeout)
/usr/lib/python3/dist-packages/composer/utils/dist.py:539: in initialize_dist
    raise RuntimeError(
E   RuntimeError: The requested backend (nccl) differs from the backend of the current process group (gloo). If you wish to change backends, please restart the python process.
---------------------------------------------------------------------------------------------- Captured stderr setup ----------------------------------------------------------------------------------------------
MPTForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.

This also suggests that something which should be initialised on cuda ends up on cpu instead: the eval path requests nccl (the GPU backend), but the current process group was apparently created with gloo (the CPU backend).
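
For reference, the backend of an already-initialised process group can be inspected with plain torch.distributed (a minimal sketch, not an llm-foundry helper):

import torch.distributed as dist

# The RuntimeError above indicates the current process group uses 'gloo'
# (the CPU backend) while composer's initialize_dist requests 'nccl'
# (the GPU backend).
if dist.is_initialized():
    print(dist.get_backend())   # 'nccl' would be expected on a GPU machine
else:
    print("no process group is initialised in this process")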

Please let me know if any further information is needed.

fan-turintech added the question label on Jan 8, 2025