
[Core feature] Decouple submitterPod resources from ray task pod_template #5666

Open
jpoler opened this issue Aug 16, 2024 · 8 comments
Labels: enhancement New feature or request

jpoler commented Aug 16, 2024

Motivation: Why do you think this is important?

Currently the ray plugin uses the pod_template provided to the task as the basis for all pod specs:

  • The RayCluster head
  • RayCluster workers
  • The ray job submit kubernetes Job

This is a pain point when the RayCluster head and workers are intended to be scheduled on GPU nodes. I do not want to waste an entire GPU node for the submitter.
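For concreteness, a minimal sketch of the behavior described above (container name and sizes are illustrative): the single pod_template passed to the task shapes every pod, including the submitter Job.

from flytekit import task, PodTemplate
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements

# One GPU-sized template for the whole task. The Ray plugin reuses this
# single spec for the RayCluster head, the workers, AND the
# "ray job submit" Job, so the submitter pod also claims a GPU node.
gpu_template = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="ray-job",  # illustrative container name
                resources=V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

@task(
    task_config=RayJobConfig(
        worker_node_config=[WorkerNodeConfig(group_name="gpu-group", replicas=2)]
    ),
    pod_template=gpu_template,
)
def train() -> None:
    ...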

Goal: What should the final outcome look like, ideally?

It is not possible to configure the RayCluster pod templates and the submitter pod template separately. If it were, the submitter could be scheduled with appropriately minimal resource requests, leaving out configuration that has nothing to do with the submitter pod (in my use case, for example, only the Ray head/workers need the GPU, the shared-memory volume mount, the service account, etc.).

I found #4170, which looks like an attempt to address this issue, but it hasn't seen any progress since October 2023. At a high level its approach makes sense to me: the pod_template provided to the task configures the resources for the submitter job, and the Ray head/worker get new config fields that configure their resources explicitly. In my opinion the change is headed in the right direction, but it would be improved by a slight adaptation that lets the user provide an entire pod template alongside resources; otherwise it won't be possible to configure things like volume mounts and env vars on the Ray head/worker.
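To make that adaptation concrete, a purely hypothetical sketch (the pod_template fields on HeadNodeConfig and WorkerNodeConfig below do not exist; they are the shape I am proposing, and #4170 as written only adds resources):

from flytekit import task, PodTemplate
from flytekitplugins.ray import RayJobConfig, HeadNodeConfig, WorkerNodeConfig
from kubernetes.client import V1PodSpec

# Assume gpu_template carries the GPU request, /dev/shm mount, service
# account, etc., and submitter_template is deliberately minimal.
gpu_template = PodTemplate(pod_spec=V1PodSpec(containers=[]))
submitter_template = PodTemplate(pod_spec=V1PodSpec(containers=[]))

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(pod_template=gpu_template),  # hypothetical field
    worker_node_config=[
        # pod_template here is also hypothetical
        WorkerNodeConfig(group_name="gpu-group", replicas=2, pod_template=gpu_template),
    ],
)

# Under this proposal the task-level template would apply only to the submitter.
@task(task_config=ray_config, pod_template=submitter_template)
def train() -> None:
    ...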

Describe alternatives you've considered

I don't see an alternative to adding separate config parameters for separate pod specs. Hard-coding the submitter pod spec to minimal resource requests (e.g. a small request/limit for CPU and memory) doesn't seem like a good idea, because there could well be a use case where someone wants a GPU for the submitter; it wouldn't make sense to preclude that use case, IMO.

I do see this PR that adds a Resource config to

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@jpoler jpoler added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels Aug 16, 2024

runllm bot commented Aug 16, 2024

To decouple the submitterPod resources from the Ray task pod_template, the RayCluster head, workers, and submitter pod would need to be configurable separately. That would allow you to allocate minimal resources to the submitter while providing the necessary resources for the RayCluster head and workers.

According to Flyte issue #4377 and Flyte issue #4674, you can specify different configurations for the head node and worker nodes using RayJobConfig; however, the current implementation does not support a separate configuration for the submitter pod. The proposed solution adds new configuration fields to explicitly configure the resources and pod templates for the Ray head and worker nodes, with the submitter pod configured separately.

Here's an example of what that configuration could look like, with different resources for the Ray head, workers, and submitter pod (note that submitter_pod_config and PodConfig are proposed names, not part of flytekitplugins.ray today):

from flytekit import task, Resources
from flytekitplugins.ray import RayJobConfig, HeadNodeConfig, WorkerNodeConfig

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        requests=Resources(mem="64Gi", cpu="4"),
        limits=Resources(mem="64Gi", cpu="4"),
        pod_template_name="ray-head-node",  # K8s names may not contain underscores
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="cpu-group",
            replicas=4,
            requests=Resources(mem="256Gi", cpu="64"),
            limits=Resources(mem="256Gi", cpu="64"),
            pod_template_name="ray-worker-cpu",
        ),
        WorkerNodeConfig(
            group_name="gpu-group",
            replicas=2,
            requests=Resources(mem="480Gi", cpu="60", gpu="2"),
            limits=Resources(mem="480Gi", cpu="60", gpu="2"),
            pod_template_name="ray-worker-gpu",
        ),
    ],
    # Proposed field: PodConfig / submitter_pod_config do not exist in
    # flytekitplugins.ray today; this shows the intended shape only.
    submitter_pod_config=PodConfig(
        requests=Resources(mem="2Gi", cpu="1"),
        limits=Resources(mem="2Gi", cpu="1"),
        pod_template_name="ray-submitter",
    ),
)

@task(task_config=ray_config)
def ray_task():
    # Task implementation
    pass


@eapolinario eapolinario removed the untriaged (This issue has not yet been looked at by the Maintainers) label Aug 22, 2024
@pingsutw
Member

How about adding a submitter spec to the RayCluster proto, and adding resources to the head, worker, and submitter specs?

@kumare3
Contributor

kumare3 commented Sep 11, 2024

@jpoler would you be open to contributing?

@Sovietaced
Contributor

@pingsutw mind if I take this issue off your hands? I'm working on Flyte + Ray at work and we'll need this change.

@amitani
Contributor

amitani commented Nov 5, 2024

How about using a subset of TaskExecutionMetadata, instead of just resources?
That is what's used when creating the podSpec for tasks. TaskNodeOverrides might work too, but I'm hoping we can set interruptible separately for the head node and the worker nodes.

@Sovietaced
Contributor

> How about using a subset of TaskExecutionMetadata, instead of just resources?
> That is what's used when creating the podSpec for tasks. TaskNodeOverrides might work too, but I'm hoping we can set interruptible separately for the head node and the worker nodes.

We ended up adding support for plumbing the whole pod spec through, which I think will be sufficient.
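Roughly, assuming the head/worker configs now accept a full pod template (treat this as a sketch; check the released flytekit API for the exact field names), usage looks like:

from flytekit import Resources, task, PodTemplate
from flytekitplugins.ray import RayJobConfig, HeadNodeConfig, WorkerNodeConfig
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements

# GPU-sized template used only by the Ray head/workers.
gpu_template = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="ray-worker",
                resources=V1ResourceRequirements(
                    requests={"cpu": "60", "memory": "480Gi", "nvidia.com/gpu": "2"},
                    limits={"cpu": "60", "memory": "480Gi", "nvidia.com/gpu": "2"},
                ),
            )
        ],
    ),
)

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(pod_template=gpu_template),
    worker_node_config=[
        WorkerNodeConfig(group_name="gpu-group", replicas=2, pod_template=gpu_template)
    ],
)

# The submitter keeps the small task-level resources instead of the GPU template.
@task(
    task_config=ray_config,
    requests=Resources(cpu="1", mem="2Gi"),
    limits=Resources(cpu="1", mem="2Gi"),
)
def train() -> None:
    ...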

@Sovietaced
Contributor

Sovietaced commented Nov 19, 2024

The flytepropeller and flytekit changes have landed. I think we're just waiting on a flytekit release at this point, which should hopefully come in December.
