deepspeed strategy can't save checkpoint, TypeError: cannot pickle torch._C._distributed_c10d.ProcessGroup object
#17369
Comments
Having the same issue with the torch._C._distributed_c10d.ProcessGroup object.
Hello!
Hi @SpirinEgor, I can't remember how I fixed it. I tried installing torch with conda using the right CUDA version, installing deepspeed with conda, etc.
Hi @SpirinEgor
Same issue here. deepspeed: 0.9.2
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Same issue here.
Same issue here.
Same issue.
Same issue. deepspeed: 0.12.6
Same issue.
Having the same issue when using DeepSpeedStrategy https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/strategies/deepspeed.py
Checked the keys which make sense because
Bug description
I am trying to use https://github.com/ashleve/lightning-hydra-template with the deepspeed strategy.
Here is my fork: https://github.com/dmitrymailk/ru_lm/tree/61ab735110b3c80a3cb3d58b3d7c5c05d4cf56af
I get this error: TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
I don't think it is a pytorch-lightning problem itself, because
the error is raised in deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py
state_dict is
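One way to narrow down which entry of the checkpoint dictionary triggers the pickling error is to walk it and try pickling each leaf separately. The sketch below is only an illustration, not part of Lightning or DeepSpeed: find_unpicklable and example_state are hypothetical names, and a threading.Lock stands in for the ProcessGroup so the snippet runs without torch or deepspeed installed.

```python
import pickle
import threading
from typing import Any, Iterator, Tuple


def find_unpicklable(obj: Any, path: str = "state_dict") -> Iterator[Tuple[str, str]]:
    """Yield (key_path, error_message) for every leaf that pickle rejects."""
    if isinstance(obj, dict):
        # Recurse into mappings, extending the path with each key.
        for key, value in obj.items():
            yield from find_unpicklable(value, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        # Recurse into sequences, extending the path with each index.
        for index, value in enumerate(obj):
            yield from find_unpicklable(value, f"{path}[{index}]")
    else:
        try:
            pickle.dumps(obj)
        except Exception as exc:  # pickle may raise TypeError, PicklingError, etc.
            yield path, f"{type(exc).__name__}: {exc}"


# Hypothetical checkpoint contents. The lock is an assumption that stands in
# for an unpicklable object such as a ProcessGroup.
example_state = {
    "epoch": 3,
    "hyper_parameters": {"lr": 1e-4, "group": threading.Lock()},
    "lr_schedulers": [{"last_epoch": 3}],
}

for key_path, error in find_unpicklable(example_state):
    print(key_path, "->", error)
```

In a real run, the same helper could be called on the dictionary that is about to be saved (for example from a debugger at the failing torch.save call) to see which key holds the offending object.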
What version are you seeing the problem on?
2.0+
How to reproduce the bug
You must change devices in configs/trainer/deepspeed.yaml.
Error messages and logs
Environment
Current environment
- CUDA:
  - GPU:
    - NVIDIA A100-SXM4-40GB
    - NVIDIA A100-SXM4-40GB
    - NVIDIA A100-SXM4-40GB
    - NVIDIA A100-SXM4-40GB
  - available: True
  - version: 11.8
- lightning: 2.0.1.post0
- lightning-cloud: 0.5.33
- lightning-colossalai: 0.1.0
- lightning-utilities: 0.8.0
- pytorch-lightning: 2.0.1.post0
- torch: 2.0.0+cu118
- torchaudio: 2.0.1+cu118
- torchmetrics: 0.11.4
- torchvision: 0.15.1+cu118
- absl-py: 1.4.0
- accelerate: 0.18.0
- aiofiles: 23.1.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- alembic: 1.10.3
- altair: 4.2.2
- antlr4-python3-runtime: 4.9.3
- anyio: 3.6.2
- apex: 0.1
- appdirs: 1.4.4
- arrow: 1.2.3
- asttokens: 2.2.1
- async-timeout: 4.0.2
- attrs: 22.2.0
- autopage: 0.5.1
- backcall: 0.2.0
- backports.functools-lru-cache: 1.6.4
- bcrypt: 4.0.1
- beautifulsoup4: 4.12.2
- bitsandbytes: 0.37.2
- black: 23.3.0
- blessed: 1.20.0
- boltons: 23.0.0
- brotlipy: 0.7.0
- cachetools: 5.3.0
- certifi: 2022.12.7
- cffi: 1.15.1
- cfgv: 3.3.1
- charset-normalizer: 2.0.4
- click: 8.1.3
- cliff: 4.2.0
- cmaes: 0.9.1
- cmake: 3.25.0
- cmd2: 2.4.3
- colorlog: 6.7.0
- colossalai: 0.2.8
- conda: 23.3.1
- conda-content-trust: 0.1.3
- conda-package-handling: 2.0.2
- conda-package-streaming: 0.7.0
- contexttimer: 0.3.3
- contourpy: 1.0.7
- croniter: 1.3.14
- cryptography: 38.0.4
- cycler: 0.11.0
- datasets: 2.11.0
- dateutils: 0.6.12
- debugpy: 1.5.1
- decorator: 5.1.1
- deepdiff: 6.3.0
- deepspeed: 0.8.3
- dill: 0.3.6
- distlib: 0.3.6
- docker-pycreds: 0.4.0
- einops: 0.6.0
- entrypoints: 0.4
- evaluate: 0.4.0
- exceptiongroup: 1.1.1
- executing: 1.2.0
- fabric: 3.0.0
- fastapi: 0.88.0
- ffmpy: 0.3.0
- filelock: 3.9.0
- fire: 0.5.0
- flash-attn: 0.2.8
- flit-core: 3.8.0
- fonttools: 4.39.3
- frozenlist: 1.3.3
- fschat: 0.1.10
- fsspec: 2023.4.0
- gitdb: 4.0.10
- gitpython: 3.1.31
- gmpy2: 2.1.2
- google-auth: 2.17.3
- google-auth-oauthlib: 1.0.0
- gradio: 3.23.0
- gradio-client: 0.0.8
- greenlet: 2.0.2
- grpcio: 1.53.0
- h11: 0.14.0
- hjson: 3.1.0
- html2text: 2020.1.16
- httpcore: 0.16.3
- httpx: 0.23.3
- huggingface-hub: 0.13.4
- hydra-colorlog: 1.2.0
- hydra-core: 1.3.2
- hydra-optuna-sweeper: 1.2.0
- identify: 2.5.22
- idna: 3.4
- importlib-metadata: 6.3.0
- iniconfig: 2.0.0
- inquirer: 3.1.3
- invoke: 2.0.0
- ipykernel: 6.15.0
- ipython: 8.12.0
- itsdangerous: 2.1.2
- jedi: 0.18.2
- jinja2: 3.1.2
- joblib: 1.2.0
- jsonlines: 3.1.0
- jsonpatch: 1.32
- jsonpointer: 2.1
- jsonschema: 4.17.3
- jupyter-client: 7.3.4
- jupyter-core: 4.12.0
- kiwisolver: 1.4.4
- lightning: 2.0.1.post0
- lightning-cloud: 0.5.33
- lightning-colossalai: 0.1.0
- lightning-utilities: 0.8.0
- linkify-it-py: 2.0.0
- lit: 15.0.7
- loralib: 0.1.1
- mako: 1.2.4
- markdown: 3.4.3
- markdown-it-py: 2.2.0
- markdown2: 2.4.8
- markupsafe: 2.1.1
- matplotlib: 3.7.1
- matplotlib-inline: 0.1.6
- mdit-py-plugins: 0.3.3
- mdurl: 0.1.2
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- mpmath: 1.2.1
- multidict: 6.0.4
- multiprocess: 0.70.14
- mypy-extensions: 1.0.0
- nest-asyncio: 1.5.6
- networkx: 2.8.4
- ninja: 1.11.1
- nodeenv: 1.7.0
- numpy: 1.23.5
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- omegaconf: 2.3.0
- optuna: 2.10.1
- ordered-set: 4.1.0
- orjson: 3.8.10
- packaging: 23.0
- pandas: 2.0.0
- paramiko: 3.1.0
- parso: 0.8.3
- pathspec: 0.11.1
- pathtools: 0.1.2
- pbr: 5.11.1
- peft: 0.3.0.dev0
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.4.0
- pip: 22.3.1
- platformdirs: 3.2.0
- pluggy: 1.0.0
- pre-commit: 3.2.2
- prettytable: 3.7.0
- prompt-toolkit: 3.0.38
- protobuf: 3.20.3
- psutil: 5.9.4
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- py-cpuinfo: 9.0.0
- pyarrow: 11.0.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycosat: 0.6.4
- pycparser: 2.21
- pydantic: 1.10.7
- pydeprecate: 0.3.2
- pydub: 0.25.1
- pygments: 2.14.0
- pyjwt: 2.6.0
- pynacl: 1.5.0
- pyopenssl: 22.0.0
- pyparsing: 3.0.9
- pyperclip: 1.8.2
- pyrootutils: 1.0.4
- pyrsistent: 0.19.3
- pysocks: 1.7.1
- pytest: 7.3.0
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytorch-lightning: 2.0.1.post0
- pytz: 2023.3
- pyyaml: 6.0
- pyzmq: 23.2.0
- readchar: 4.0.5
- regex: 2023.3.23
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- responses: 0.18.0
- rfc3986: 1.5.0
- rich: 13.3.3
- rsa: 4.9
- ruamel.yaml: 0.17.21
- ruamel.yaml.clib: 0.2.6
- safetensors: 0.3.0
- scikit-learn: 1.2.2
- scipy: 1.10.1
- semantic-version: 2.10.0
- sentencepiece: 0.1.97
- sentry-sdk: 1.19.1
- setproctitle: 1.3.2
- setuptools: 65.6.3
- six: 1.16.0
- smmap: 5.0.0
- sniffio: 1.3.0
- soupsieve: 2.4
- sqlalchemy: 2.0.9
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- stevedore: 5.0.0
- svgwrite: 1.4.3
- sympy: 1.11.1
- tensorboard: 2.12.2
- tensorboard-data-server: 0.7.0
- tensorboard-plugin-wit: 1.8.1
- termcolor: 2.2.0
- threadpoolctl: 3.1.0
- tokenize-rt: 5.0.0
- tokenizers: 0.13.3
- tomli: 2.0.1
- toolz: 0.12.0
- torch: 2.0.0+cu118
- torchaudio: 2.0.1+cu118
- torchmetrics: 0.11.4
- torchvision: 0.15.1+cu118
- tornado: 6.1
- tqdm: 4.64.1
- traitlets: 5.9.0
- transformers: 4.28.0.dev0
- triton: 2.0.0
- typing-extensions: 4.4.0
- tzdata: 2023.3
- uc-micro-py: 1.0.1
- urllib3: 1.26.14
- uvicorn: 0.21.1
- virtualenv: 20.21.0
- wandb: 0.14.2
- wavedrom: 2.0.3.post3
- wcwidth: 0.2.6
- websocket-client: 1.5.1
- websockets: 11.0.1
- werkzeug: 2.2.3
- wheel: 0.37.1
- xxhash: 3.2.0
- yarl: 1.8.2
- zipp: 3.15.0
- zstandard: 0.18.0
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.9
- version: #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023
More info
No response
cc @awaelchli