[BUG] GPU tolerations are not correctly passed to MPI workers #4422

Closed
gdabisias opened this issue Nov 14, 2023 · 2 comments · Fixed by #4467
Labels: bug

gdabisias (Contributor) commented Nov 14, 2023

Describe the bug

When defining an MPITask, the tolerations that are set as defaults for GPU usage in the deployment charts are not applied to the created workers, i.e., this configuration is not applied:

configmap:
  k8s:
    plugins:
      k8s:
        "resource-tolerations":
          "nvidia.com/gpu":
            key: "flyte/GPU"
            operator: "Exists"
            effect: "NoSchedule"

This also has a bad side effect: the MPI launcher ends up on a GPU node even though it could run on a CPU-only node, wasting resources.

Expected behavior

The GPU tolerations should be applied automatically to MPI workers when GPUs are requested.
Currently this can be worked around by passing a flytekit.PodTemplate to the MPI task, but doing so makes the code very cumbersome, e.g.:

import flytekit
from kubernetes.client import V1PodSpec, V1Toleration

flytekit.PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],  # required by the client model; only the tolerations matter here
        tolerations=[
            V1Toleration(
                key="flyte/GPU",
                operator="Exists",
                effect="NoSchedule",
            ),
        ],
    )
)
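
For reference, a minimal sketch of wiring that workaround into an MPI task via flytekit's pod_template task argument (the task name and body are hypothetical placeholders):

from flytekit import PodTemplate, Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker
from kubernetes.client import V1PodSpec, V1Toleration

# Pod template carrying the GPU toleration, as in the snippet above.
gpu_toleration_template = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],
        tolerations=[
            V1Toleration(key="flyte/GPU", operator="Exists", effect="NoSchedule"),
        ],
    )
)

@task(
    task_config=MPIJob(
        launcher=Launcher(replicas=1),
        worker=Worker(replicas=1, limits=Resources(cpu="2", gpu="1")),
    ),
    pod_template=gpu_toleration_template,
)
def mpi_train() -> None:  # hypothetical placeholder task
    ...

This works, but it has to be repeated on every GPU-requesting MPI task, which is the cumbersomeness described above.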

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
gdabisias added the bug and untriaged labels on Nov 14, 2023
gdabisias (Contributor, Author) commented:

Note that this issue is similar to #4378.

jeevb (Contributor) commented Nov 15, 2023

Confirmed that this can be reproduced with:

from flytekit import Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker

@task(
    task_config=MPIJob(
        launcher=Launcher(
            replicas=1,
        ),
        worker=Worker(
            replicas=1,
            limits=Resources(cpu="2", gpu="1"),
        ),
    ),
    retries=3,
    cache=True,
    cache_version="0.1",
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2"),
)
def mpi_repro() -> None:  # stub body added for completeness
    ...

The tolerations are applied based on the universal resources specified in the @task decorator, but not based on the launcher/worker-specific resource specifications.
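
To make the distinction concrete, here is a hedged sketch of the case that does work per the comment above: the GPU limit moved into the task-level (universal) resources, where the configured resource-tolerations are picked up (the task name is a hypothetical placeholder):

from flytekit import Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker

# GPU requested via the task-level limits rather than the Worker spec;
# per the comment above, the configured tolerations are applied in this case.
@task(
    task_config=MPIJob(
        launcher=Launcher(replicas=1),
        worker=Worker(replicas=1),
    ),
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2", gpu="1"),
)
def mpi_task_level_gpu() -> None:  # hypothetical placeholder task
    ...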

eapolinario removed the untriaged label on Nov 30, 2023