[Bug]: Getting `Failed to get instance IP address. Instance not found` when running on TPU #2012

peterschmidt85 · 2024-11-19T19:20:44Z

Steps to reproduce

type: service
# The name is optional, if not specified, generated randomly
name: llama31

# Using a Docker image with a fix instead of the official one
# More details at https://github.com/huggingface/optimum-tpu/pull/87
image: dstackai/optimum-tpu:llama31
# Required environment variables
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_TOTAL_TOKENS=4096
  - MAX_BATCH_PREFILL_TOKENS=4095
commands:
  - text-generation-launcher --port 8000
# Expose the TGI port
port: 8000
model: meta-llama/Meta-Llama-3.1-8B-Instruct

resources:
  # Required resources
  gpu: v5litepod-4

The error occurs with both spot and on-demand instances.

Actual behaviour

dstack apply -f tpu/tgi.dstack.yml
 Project                main
 User                   admin
 Configuration          tpu/tgi.dstack.yml
 Type                   service
 Resources              2..xCPU, 8GB.., 1xv5litepod-4, 100GB.. (disk)
 Max price              -
 Max duration           -
 Spot policy            on-demand
 Retry policy           no
 Creation policy        reuse-or-create
 Termination policy     destroy-after-idle
 Termination idle time  5m

 #  BACKEND  REGION       INSTANCE     RESOURCES                      SPOT  PRICE
 1  gcp      us-central1  v5litepod-4  1xv5litepod-4, 100.0GB (disk)  no    $4.8
 2  gcp      us-east5     v5litepod-4  1xv5litepod-4, 100.0GB (disk)  no    $4.8
 3  gcp      us-south1    v5litepod-4  1xv5litepod-4, 100.0GB (disk)  no    $4.8
    ...
 Shown 3 of 8 offers, $6.24 max

Finished run llama31 already exists.
Override the run? [y/n]: y
llama31 provisioning completed (terminating)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and
server logs for more details.

Expected behaviour

No response

dstack version

master

Server logs

[20:17:29] INFO     dstack._internal.server.services.backends:404 Requesting instance offers from backends:
                    ['runpod', 'cudo', 'lambda', 'gcp', 'aws', 'azure']
[20:17:55] INFO     dstack._internal.server.background.tasks.process_submitted_jobs:255 job(7602a8)llama31-0-0: now
                    is provisioning a new instance
           INFO     dstack._internal.server.background.tasks.process_submitted_jobs:280 The job llama31-0-0 created
                    the new instance llama31-0
[20:17:57] INFO     dstack._internal.server.background.tasks.process_runs:330 run(4b9cae)llama31: run status has
                    changed SUBMITTED -> PROVISIONING
[20:18:13] WARNING  dstack._internal.server.background.tasks.process_instances:718 Error while waiting for instance
                    llama31-0 to become running: ProvisioningError('Failed to get instance IP address. Instance not
                    found.')
[20:18:18] INFO     dstack._internal.server.background.tasks.process_instances:783 Instance llama31-0 terminated
[20:18:21] INFO     dstack._internal.server.background.tasks.process_runs:330 run(4b9cae)llama31: run status has
                    changed PROVISIONING -> TERMINATING
[20:18:27] INFO     dstack._internal.server.services.jobs:268 job(7602a8)llama31-0-0: instance 'llama31-0' has been
                    released, new status is TERMINATED
           INFO     dstack._internal.server.services.jobs:283 job(7602a8)llama31-0-0: job status is FAILED, reason:
                    FAILED_TO_START_DUE_TO_NO_CAPACITY
[20:18:29] INFO     dstack._internal.server.services.runs:952 run(4b9cae)llama31: run status has changed
                    TERMINATING -> FAILED, reason: JOB_FAILED
[20:18:40] INFO     dstack._internal.server.background.tasks.process_fleets:72 Automatic cleanup of an empty fleet
                    llama31
           INFO     dstack._internal.server.background.tasks.process_fleets:78 Fleet llama31 deleted
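
In the logs above, the warning carries the actual `ProvisioningError`, while the job's final failure reason only says `FAILED_TO_START_DUE_TO_NO_CAPACITY`. A minimal sketch for pulling both out of a wrapped server log so they can be compared side by side. The helper names `unwrap` and `summarize` are hypothetical and not part of dstack, and the parsing assumes the console log layout shown in this report (timestamp, level, then a message wrapped across indented lines).

```python
import re

# Excerpt of the server log from this report, wrapped exactly as pasted.
LOG = """\
[20:18:13] WARNING  dstack._internal.server.background.tasks.process_instances:718 Error while waiting for instance
                    llama31-0 to become running: ProvisioningError('Failed to get instance IP address. Instance not
                    found.')
[20:18:18] INFO     dstack._internal.server.background.tasks.process_instances:783 Instance llama31-0 terminated
[20:18:27] INFO     dstack._internal.server.services.jobs:268 job(7602a8)llama31-0-0: instance 'llama31-0' has been
                    released, new status is TERMINATED
           INFO     dstack._internal.server.services.jobs:283 job(7602a8)llama31-0-0: job status is FAILED, reason:
                    FAILED_TO_START_DUE_TO_NO_CAPACITY
"""

# Assumption: a record starts with "[HH:MM:SS]" or directly with a level name.
TIMESTAMP = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]")
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def unwrap(log):
    """Join wrapped continuation lines so each log record is one string."""
    records = []
    for line in log.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        starts_record = bool(TIMESTAMP.match(stripped)) or stripped.split()[0] in LEVELS
        if starts_record or not records:
            records.append(stripped)
        else:
            records[-1] += " " + stripped
    # Collapse the column padding into single spaces.
    return [re.sub(r"\s+", " ", record) for record in records]


def summarize(log):
    """Return (provisioning error messages, final job failure reasons)."""
    records = unwrap(log)
    errors = [
        m.group(1)
        for r in records
        if (m := re.search(r"ProvisioningError\('(.+?)'\)", r))
    ]
    reasons = [
        m.group(1)
        for r in records
        if (m := re.search(r"job status is FAILED, reason: (\w+)", r))
    ]
    return errors, reasons


if __name__ == "__main__":
    errors, reasons = summarize(LOG)
    print("provisioning errors:", errors)
    print("failure reasons:", reasons)
```

Run against the excerpt, this surfaces the "Instance not found" message next to the no-capacity reason, which is the mismatch this issue is about.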



Additional information

No response


jvstme · 2024-11-20T17:57:58Z

I can't create a v5litepod-4 in the GCP console either. This could be a GCP issue; we can try reproducing it again later.

peterschmidt85 added the bug (Something isn't working) and tpu labels on Nov 19, 2024
