[Bug]: Getting Failed to get instance IP address. Instance not found when running on TPU #2012

Open
peterschmidt85 opened this issue Nov 19, 2024 · 1 comment
Labels
bug Something isn't working tpu

Comments

@peterschmidt85
Contributor

Steps to reproduce

type: service
# The name is optional, if not specified, generated randomly
name: llama31

# Using a Docker image with a fix instead of the official one
# More details at https://github.com/huggingface/optimum-tpu/pull/87
image: dstackai/optimum-tpu:llama31
# Required environment variables
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_TOTAL_TOKENS=4096
  - MAX_BATCH_PREFILL_TOKENS=4095
commands:
  - text-generation-launcher --port 8000
# Expose the TGI port
port: 8000
model: meta-llama/Meta-Llama-3.1-8B-Instruct

resources:
  # Required resources
  gpu: v5litepod-4

This happens with both spot and on-demand instances.

Actual behaviour

dstack apply -f tpu/tgi.dstack.yml
 Project                main
 User                   admin
 Configuration          tpu/tgi.dstack.yml
 Type                   service
 Resources              2..xCPU, 8GB.., 1xv5litepod-4, 100GB.. (disk)
 Max price              -
 Max duration           -
 Spot policy            on-demand
 Retry policy           no
 Creation policy        reuse-or-create
 Termination policy     destroy-after-idle
 Termination idle time  5m

 #  BACKEND  REGION       INSTANCE     RESOURCES                      SPOT  PRICE
 1  gcp      us-central1  v5litepod-4  1xv5litepod-4, 100.0GB (disk)  no    $4.8
 2  gcp      us-east5     v5litepod-4  1xv5litepod-4, 100.0GB (disk)  no    $4.8
 3  gcp      us-south1    v5litepod-4  1xv5litepod-4, 100.0GB (disk)  no    $4.8
    ...
 Shown 3 of 8 offers, $6.24 max

Finished run llama31 already exists.
Override the run? [y/n]: y
llama31 provisioning completed (terminating)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and
server logs for more details.

Expected behaviour

No response

dstack version

master

Server logs

[20:17:29] INFO     dstack._internal.server.services.backends:404 Requesting instance offers from backends:
                    ['runpod', 'cudo', 'lambda', 'gcp', 'aws', 'azure']
[20:17:55] INFO     dstack._internal.server.background.tasks.process_submitted_jobs:255 job(7602a8)llama31-0-0: now
                    is provisioning a new instance
           INFO     dstack._internal.server.background.tasks.process_submitted_jobs:280 The job llama31-0-0 created
                    the new instance llama31-0
[20:17:57] INFO     dstack._internal.server.background.tasks.process_runs:330 run(4b9cae)llama31: run status has
                    changed SUBMITTED -> PROVISIONING
[20:18:13] WARNING  dstack._internal.server.background.tasks.process_instances:718 Error while waiting for instance
                    llama31-0 to become running: ProvisioningError('Failed to get instance IP address. Instance not
                    found.')
[20:18:18] INFO     dstack._internal.server.background.tasks.process_instances:783 Instance llama31-0 terminated
[20:18:21] INFO     dstack._internal.server.background.tasks.process_runs:330 run(4b9cae)llama31: run status has
                    changed PROVISIONING -> TERMINATING
[20:18:27] INFO     dstack._internal.server.services.jobs:268 job(7602a8)llama31-0-0: instance 'llama31-0' has been
                    released, new status is TERMINATED
           INFO     dstack._internal.server.services.jobs:283 job(7602a8)llama31-0-0: job status is FAILED, reason:
                    FAILED_TO_START_DUE_TO_NO_CAPACITY
[20:18:29] INFO     dstack._internal.server.services.runs:952 run(4b9cae)llama31: run status has changed
                    TERMINATING -> FAILED, reason: JOB_FAILED
[20:18:40] INFO     dstack._internal.server.background.tasks.process_fleets:72 Automatic cleanup of an empty fleet
                    llama31
           INFO     dstack._internal.server.background.tasks.process_fleets:78 Fleet llama31 deleted
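The wrapped log above is a bit hard to scan, so here is a minimal Python sketch that joins the wrapped lines back into records and pulls out the run status transitions, the warning, and the job failure reason. The helper names (`unwrap`, `run_transitions`, `warnings`, `job_failure_reasons`) are made up for illustration and are not part of dstack; the regexes only assume the log layout shown in this issue.

```python
import re

# Excerpt copied from the server logs in this issue (lines are wrapped by the logger).
LOG = """\
[20:17:57] INFO     dstack._internal.server.background.tasks.process_runs:330 run(4b9cae)llama31: run status has
                    changed SUBMITTED -> PROVISIONING
[20:18:13] WARNING  dstack._internal.server.background.tasks.process_instances:718 Error while waiting for instance
                    llama31-0 to become running: ProvisioningError('Failed to get instance IP address. Instance not
                    found.')
[20:18:18] INFO     dstack._internal.server.background.tasks.process_instances:783 Instance llama31-0 terminated
[20:18:21] INFO     dstack._internal.server.background.tasks.process_runs:330 run(4b9cae)llama31: run status has
                    changed PROVISIONING -> TERMINATING
[20:18:27] INFO     dstack._internal.server.services.jobs:268 job(7602a8)llama31-0-0: instance 'llama31-0' has been
                    released, new status is TERMINATED
           INFO     dstack._internal.server.services.jobs:283 job(7602a8)llama31-0-0: job status is FAILED, reason:
                    FAILED_TO_START_DUE_TO_NO_CAPACITY
[20:18:29] INFO     dstack._internal.server.services.runs:952 run(4b9cae)llama31: run status has changed
                    TERMINATING -> FAILED, reason: JOB_FAILED
"""

# A record starts with an optional [HH:MM:SS] timestamp followed by a log level.
RECORD_START = re.compile(r"^(?:\[\d{2}:\d{2}:\d{2}\])?\s*(?:DEBUG|INFO|WARNING|ERROR)\s+")
TRANSITION = re.compile(r"run status has changed (\w+) -> (\w+)")
JOB_REASON = re.compile(r"job status is (\w+), reason: (\w+)")


def unwrap(log: str) -> list[str]:
    """Join wrapped continuation lines onto the record they belong to."""
    records: list[str] = []
    for line in log.splitlines():
        if not line.strip():
            continue
        if RECORD_START.match(line):
            records.append(line.strip())
        elif records:
            records[-1] += " " + line.strip()
    return records


def run_transitions(log: str) -> list[tuple[str, str]]:
    """Return (old_status, new_status) pairs for the run, in log order."""
    return [m.groups() for r in unwrap(log) if (m := TRANSITION.search(r))]


def warnings(log: str) -> list[str]:
    """Return the unwrapped WARNING records."""
    return [r for r in unwrap(log) if re.search(r"\bWARNING\b", r)]


def job_failure_reasons(log: str) -> list[tuple[str, str]]:
    """Return (status, reason) pairs reported for jobs."""
    return [m.groups() for r in unwrap(log) if (m := JOB_REASON.search(r))]


print(run_transitions(LOG))
print(warnings(LOG))
print(job_failure_reasons(LOG))
```

Run against the excerpt, this puts the only warning (`Failed to get instance IP address. Instance not found.`) next to the reason the job ends up with (`FAILED_TO_START_DUE_TO_NO_CAPACITY`), which is the mismatch this issue is about.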


Additional information

No response
@peterschmidt85 peterschmidt85 added bug Something isn't working tpu labels Nov 19, 2024
@jvstme
Collaborator

jvstme commented Nov 20, 2024

I can't create a v5litepod-4 in the GCP console either. This could be a GCP issue; we can try reproducing it again later.
Screenshot From 2024-11-20 18-55-40
