Tests fail in local environment due to device type mismatch #1705

Open
fan-turintech opened this issue Jan 8, 2025 · 0 comments
Labels
question Further information is requested

Comments

@fan-turintech

❓ Question

I'm following the instructions in the README to set up a local environment to run the project. When I ran the unit tests, some of them failed. Going through the error messages, they all seem to be related to an incorrect device.type. The simplest failing test is tests/models/llm_embed/test_llm_embedding.py::test_contrastive_eval_loss_device_handling:

tests/models/llm_embed/test_llm_embedding.py:303: in test_contrastive_eval_loss_device_handling
    assert metric.loss.device.type == device
E   AssertionError: assert 'cpu' == 'cuda'
E     
E     - cuda
E     + cpu

It looks like the tensor is supposed to be on cuda, but it is on cpu instead. I feel like I'm missing something obvious, but I couldn't figure out what. Any insight is appreciated!
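
For clarity, this is roughly what the assertion boils down to (a minimal sketch, not the actual test code):

import torch

# A loss tensor created without an explicit device lands on cpu; the test,
# where device is 'cuda' in this failure, expects it to have been moved to the GPU.
loss = torch.tensor(0.5)
print(loss.device.type)        # 'cpu'

if torch.cuda.is_available():
    loss = loss.to("cuda")
    print(loss.device.type)    # 'cuda'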

Additional context

I set up my environment on a Google Cloud machine with a single A100 GPU. This is the output of uname -a:

Linux research-llmfoundry 6.1.0-28-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.119-1 (2024-11-22) x86_64 GNU/Linux

I installed all the necessary packages and drivers for CUDA and docker, and pulled the docker image mosaicml/llm-foundry:2.5.1_cu124-latest. This is how I started the docker container with access to the GPU:

sudo docker run -it --shm-size=2g --rm --gpus all mosaicml/llm-foundry:2.5.1_cu124-latest bash

Within the docker container, the output of nvidia-smi is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:00:04.0 Off |                    0 |
| N/A   30C    P0             42W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
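
To rule out a basic driver/runtime mismatch, PyTorch inside the container can be sanity-checked with something like the following (a minimal sketch using plain PyTorch calls):

import torch

# Quick check that PyTorch inside the container can actually use the GPU.
print(torch.version.cuda)           # CUDA version PyTorch was built against
print(torch.cuda.is_available())    # should be True on this machine
print(torch.cuda.device_count())    # should be 1 (one A100)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. NVIDIA A100-SXM4-40GB
    print(torch.ones(1, device="cuda").device)  # cuda:0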

After successfully installing llm-foundry in the docker container, I ran pytest, and about 16 test cases failed, all related to the above error. I can reproduce the error by running that specific test alone:

pytest tests/models/llm_embed/test_llm_embedding.py::test_contrastive_eval_loss_device_handling

I also ran the first example in the README (python data_prep/convert_dataset_hf.py --dataset allenai/c4 --data_subset en --out_root my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>') without any issue.

Another example of a failing test is:

__________________________________________________________________________________________________ test_icl_eval __________________________________________________________________________________________________
/llm-foundry/tests/a_scripts/eval/test_eval.py:78: in test_icl_eval
    evaluate(eval_cfg)
/llm-foundry/llmfoundry/command_utils/eval.py:267: in evaluate
    dist.initialize_dist(get_device(None), timeout=eval_config.dist_timeout)
/usr/lib/python3/dist-packages/composer/utils/dist.py:539: in initialize_dist
    raise RuntimeError(
E   RuntimeError: The requested backend (nccl) differs from the backend of the current process group (gloo). If you wish to change backends, please restart the python process.
---------------------------------------------------------------------------------------------- Captured stderr setup ----------------------------------------------------------------------------------------------
MPTForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.

This also suggests that something which should be initialised on cuda ends up on cpu instead: the eval path requests nccl (the GPU backend), but the current process group was apparently created with gloo (the CPU backend).
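
For reference, the backend of an already-initialised process group can be inspected with plain torch.distributed (a minimal sketch, not an llm-foundry helper):

import torch.distributed as dist

# The RuntimeError above indicates the current process group uses 'gloo'
# (the CPU backend) while composer's initialize_dist requests 'nccl'
# (the GPU backend).
if dist.is_initialized():
    print(dist.get_backend())   # 'nccl' would be expected on a GPU machine
else:
    print("no process group is initialised in this process")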

Please let me know if any further information is needed.

fan-turintech added the question label on Jan 8, 2025