Replies: 1 comment
-
I fixed the error. I should have set the '--shm-size' option so tensor parallelism has enough shared memory. Thanks! One more question: how do I choose an appropriate shm size? Does each model need a different shm-size?
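For reference, a minimal sketch of the corrected launch, reusing the image and model from the question below; the 2g value is only an illustrative placeholder, not a recommendation, and --ipc=host is an alternative that shares the host's shared memory instead of sizing a private /dev/shm (token elided):
$ sudo docker run --runtime nvidia --gpus all --shm-size=2g -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=<your token>" vllm/vllm-openai:latest --model facebook/opt-125m --tensor-parallel-size 4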
-
I want to run vLLM with the 'tensor-parallel' option, but I get the error 'RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)'; the detailed log is attached below.
The instance is an AWS g4dn.12xlarge with 4 GPUs, and I also tried another (bigger) model but got the same error.
How can I use the tensor-parallel option (https://docs.vllm.ai/en/stable/serving/distributed_serving.html#multi-node-inference-and-serving)?
Is there anything I need to configure for NCCL?
$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=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" vllm/vllm-openai:latest --model facebook/opt-125m --tensor-parallel-size 4
INFO 08-13 05:59:12 api_server.py:339] vLLM API server version 0.5.4
INFO 08-13 05:59:12 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-13 05:59:12 config.py:729] Defaulting to use mp for distributed inference
INFO 08-13 05:59:12 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-13 05:59:13 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-13 05:59:13 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=110) INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=110) INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=111) INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=111) INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=112) INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=112) INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=111) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=112) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=110) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=112)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=111)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=110)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=112) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=111) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=110) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=112)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=111)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=110)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=110) INFO 08-13 05:59:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=111) INFO 08-13 05:59:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=112) INFO 08-13 05:59:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=110) INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=111) INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=112) INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=110) INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=111) INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=112) INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
ERROR 08-13 05:59:16 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 110 died, exit code: -15
INFO 08-13 05:59:16 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in init
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in init
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in init
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in init
super().init(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
super().init(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in init
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
self._run_workers("init_device")
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 348, in init_worker_distributed_environment
ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
initialize_model_parallel(tensor_model_parallel_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
_TP = init_model_parallel_group(group_ranks,
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
return GroupCoordinator(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in init
self.pynccl_comm = PyNcclCommunicator(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in init
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
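A sketch of how the same launch can be rerun with NCCL debug logging enabled, as the error message suggests; the command is identical to the one above except for the added NCCL_DEBUG environment variable (token elided):
$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --env "NCCL_DEBUG=INFO" --env "HUGGING_FACE_HUB_TOKEN=<your token>" vllm/vllm-openai:latest --model facebook/opt-125m --tensor-parallel-size 4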