[BUG] GPU tolerations are not correctly passed to MPI workers #4422

Closed
gdabisias opened this issue Nov 14, 2023 · 2 comments · Fixed by #4467
Labels: bug

gdabisias (Contributor) commented Nov 14, 2023

Describe the bug

When defining an MPITask, the tolerations that are set as defaults for GPU usage in the deployment charts are not applied to the created workers, i.e., this configuration is not applied:

configmap:
  k8s:
    plugins:
      k8s:
        "resource-tolerations":
          "nvidia.com/gpu":
            key: "flyte/GPU"
            operator: "Exists"
            effect: "NoSchedule"

This also has a bad side effect: the MPI launcher ends up on a GPU node even though it could run on a CPU-only node, wasting resources.

Expected behavior

The GPU tolerations should be applied automatically to MPI workers when GPUs are requested.
Currently this can be worked around by passing a flytekit.PodTemplate to the MPI task, but doing so makes the code very cumbersome, e.g.:

import flytekit
from kubernetes.client import V1PodSpec, V1Toleration

flytekit.PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],  # required by the client model; only the tolerations matter here
        tolerations=[
            V1Toleration(
                key="flyte/GPU",
                operator="Exists",
                effect="NoSchedule",
            ),
        ],
    )
)
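
For reference, a minimal sketch of wiring that workaround into an MPI task via flytekit's pod_template task argument (the task name and body are hypothetical placeholders):

from flytekit import PodTemplate, Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker
from kubernetes.client import V1PodSpec, V1Toleration

# Pod template carrying the GPU toleration, as in the snippet above.
gpu_toleration_template = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],
        tolerations=[
            V1Toleration(key="flyte/GPU", operator="Exists", effect="NoSchedule"),
        ],
    )
)

@task(
    task_config=MPIJob(
        launcher=Launcher(replicas=1),
        worker=Worker(replicas=1, limits=Resources(cpu="2", gpu="1")),
    ),
    pod_template=gpu_toleration_template,
)
def mpi_train() -> None:  # hypothetical placeholder task
    ...

This works, but it has to be repeated on every GPU-requesting MPI task, which is the cumbersomeness described above.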

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
gdabisias added the bug and untriaged labels on Nov 14, 2023
gdabisias (Contributor, Author) commented:

Note that this issue is similar to #4378.

jeevb (Contributor) commented Nov 15, 2023

Confirmed that this can be reproduced with:

from flytekit import Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker

@task(
    task_config=MPIJob(
        launcher=Launcher(
            replicas=1,
        ),
        worker=Worker(
            replicas=1,
            limits=Resources(cpu="2", gpu="1"),
        ),
    ),
    retries=3,
    cache=True,
    cache_version="0.1",
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2"),
)
def mpi_repro() -> None:  # stub body added for completeness
    ...

The tolerations are applied based on the universal resources specified in the @task decorator, but not based on the launcher/worker-specific resource specifications.
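
To make the distinction concrete, here is a hedged sketch of the case that does work per the comment above: the GPU limit moved into the task-level (universal) resources, where the configured resource-tolerations are picked up (the task name is a hypothetical placeholder):

from flytekit import Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker

# GPU requested via the task-level limits rather than the Worker spec;
# per the comment above, the configured tolerations are applied in this case.
@task(
    task_config=MPIJob(
        launcher=Launcher(replicas=1),
        worker=Worker(replicas=1),
    ),
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2", gpu="1"),
)
def mpi_task_level_gpu() -> None:  # hypothetical placeholder task
    ...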

eapolinario removed the untriaged label on Nov 30, 2023