
[Bug] Tried to run DeepSeek V3 following the AMD instructions #3200

Open
5 tasks done
Vladimir-A opened this issue Jan 28, 2025 · 1 comment

Comments

@Vladimir-A

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I tried to follow the AMD instructions, but I got an error.

Reproduction

After running the following command in a container:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --port 30000 --tp 8 --trust-remote-code

Log:

/opt/conda/envs/py_3.9/lib/python3.9/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
[2025-01-28 22:32:01 TP6] Process 97 gpu_id 6 is running on CPUs: [6, 14]
[2025-01-28 22:32:01 TP2] Process 63 gpu_id 2 is running on CPUs: [2, 10]
[2025-01-28 22:32:01 TP7] Process 113 gpu_id 7 is running on CPUs: [7, 15]
[2025-01-28 22:32:02 TP5] Process 66 gpu_id 5 is running on CPUs: [5, 13]
[2025-01-28 22:32:02 TP4] Process 65 gpu_id 4 is running on CPUs: [4, 12]
[2025-01-28 22:32:02 TP3] Process 64 gpu_id 3 is running on CPUs: [3, 11]
[2025-01-28 22:32:02 TP1] Process 62 gpu_id 1 is running on CPUs: [1, 9]
[2025-01-28 22:32:03 TP0] Process 61 gpu_id 0 is running on CPUs: [0, 8]
[2025-01-28 22:32:03 TP2] MLA optimization is turned on. Use triton backend.
[2025-01-28 22:32:03 TP2] Init torch distributed begin.
[2025-01-28 22:32:03 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1609, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 203, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 159, in __init__
    min_per_gpu_memory = self.init_torch_distributed()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 197, in init_torch_distributed
    torch.get_device_module(self.device).set_device(self.gpu_id)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/cuda/__init__.py", line 478, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: HIP error: invalid device ordinal
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.


Killed
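
The traceback fails in torch.cuda.set_device(...) with "HIP error: invalid device ordinal", which HIP raises when the requested device index exceeds the number of visible GPUs. With --tp 8, sglang spawns one rank per gpu_id 0-7, but (per the environment below) only one GPU is visible, so every rank with gpu_id >= 1 dies this way. A minimal diagnostic sketch, run inside the same container, to confirm how many devices PyTorch actually sees (on ROCm builds, torch.cuda is the HIP backend):

# check_devices.py -- diagnostic sketch, not part of sglang
import torch

n = torch.cuda.device_count()  # also respects HIP_VISIBLE_DEVICES
print("visible devices:", n)
for i in range(n):
    print(i, torch.cuda.get_device_name(i))
# --tp must not exceed n; set_device(i) for any i >= n raises
# "invalid device ordinal"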

Environment

Python: 3.9.19 (main, May  6 2024, 19:43:03) [GCC 11.2.0]
ROCM available: True
GPU 0: AMD Radeon RX 6800 XT
GPU 0 Compute Capability: 10.3
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.2.41133-dd7f95766
ROCM Driver Version: 6.12.10-zen1-1-zen
PyTorch: 2.5.0a0+gitcedc116
sglang: 0.4.1.post4
flashinfer: Module Not Found
triton: 3.0.0
transformers: 4.46.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.4
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
modelscope: 1.21.1
orjson: 3.10.13
packaging: 24.1
psutil: 6.1.0
pydantic: 2.9.2
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.3.post2.dev1+g1ef171e0
openai: 1.59.3
anthropic: 0.42.0
decord: 0.6.0
AMD Topology: 


============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0         
GPU0   0            
================================== End of ROCm SMI Log ===================================

ulimit soft: 1024
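
Note: the environment above lists a single GPU (GPU 0, an RX 6800 XT), while the launch command requests tensor parallelism across 8 devices. Assuming the goal is only to get past the device-ordinal error, --tp has to match the visible device count, e.g.:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --port 30000 --tp 1 --trust-remote-code

This only sidesteps the ordinal error: DeepSeek-V3's weights are far larger than the 16 GB on an RX 6800 XT, so a full run would still fail when loading the model.
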
@MrRothstein

I'm seeing the same issue. Here's my rocminfo in case it helps:

root@Ubuntu:/workspace# /opt/rocm/bin/rocminfo
ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 5600 6-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 5600 6-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3500
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    32782028(0x1f436cc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32782028(0x1f436cc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32782028(0x1f436cc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1032
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 6600 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      2048(0x800) KB
    L3:                      32768(0x8000) KB
  Chip ID:                 29695(0x73ff)
  ASIC Revision:           0(0x0)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   2900
  BDFID:                   11520
  Internal Node ID:        1
  Compute Unit:            32
  SIMDs per CU:            2
  Shader Engines:          2
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 120
  SDMA engine uCode::      76
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    8372224(0x7fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    8372224(0x7fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1032
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
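
Worth noting: this rocminfo likewise lists only one GPU agent (Agent 2, gfx1032 / RX 6600 XT), so if the launch command above was reused unchanged, --tp 8 would hit the same invalid-ordinal failure for ranks 1-7.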
