This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

Add driver and executor pod template fields to SparkJob #450

Closed
wants to merge 1 commit

Conversation

andrewwdye
Contributor

TL;DR

Add driver and executor pod template fields to SparkJob

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

This change adds optional fields for a pod spec and a pod template name for both the driver and executor pods. This mirrors the interface we have for container tasks in TaskTemplate and TaskMetadata.
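Roughly, the new fields would look like the following sketch (the executor fields match the proto diff discussed below; the driver counterparts and their field numbers are shown here only as placeholders):

message SparkJob {
    // ... existing fields elided ...

    // Driver pod spec override (placeholder field numbers).
    oneof driverPodValue {
        core.K8sPod driverPod = 10;
    }
    // Reference to an existing PodTemplate k8s resource to use as the base configuration for the driver pod.
    // +optional
    string driverPodTemplateName = 11;

    // Executor pod spec override (field numbers as in the diff below).
    oneof executorPodValue {
        core.K8sPod executorPod = 12;
    }
    // Reference to an existing PodTemplate k8s resource to use as the base configuration for the executor pods.
    // +optional
    string executorPodTemplateName = 13;
}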

Tracking Issue

flyteorg/flyte#4105

@andrewwdye andrewwdye marked this pull request as ready for review September 29, 2023 23:18
@codecov

codecov bot commented Sep 29, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (b130a81) 75.92% compared to head (83993d9) 78.48%.
Report is 2 commits behind head on master.

❗ Current head 83993d9 differs from the pull request's most recent head 4e55aa4. Consider uploading reports for commit 4e55aa4 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #450      +/-   ##
==========================================
+ Coverage   75.92%   78.48%   +2.55%     
==========================================
  Files          18       18              
  Lines        1458     1250     -208     
==========================================
- Hits         1107      981     -126     
+ Misses        294      212      -82     
  Partials       57       57              
Flag        Coverage Δ
unittests   ?

Flags with carried forward coverage won't be shown.

see 18 files with indirect coverage changes


Comment on lines +48 to +55
    oneof executorPodValue {
        core.K8sPod executorPod = 12;
    }
    // Reference to an existing PodTemplate k8s resource to be used as the base configuration when creating the
    // executor Pods for this task. If this value is set, the specified PodTemplate will be used instead of, but applied
    // identically as, the default PodTemplate configured in FlytePropeller.
    // +optional
    string executorPodTemplateName = 13;
Contributor

Is there any reason these can't just be the PodTemplate / PodTemplateName from the TaskExecutionContext? IIRC this is how we built some of the other plugins (e.g., Dask, KF operator).

Contributor Author

Do you mean use the same pair for both executor and driver, use the existing pair from TaskExecutionContext for one of these and plumb the other here, or something else?

Contributor

Either one. Both of these are available in the task decorator in flytekit, so if we added them both at the spark_config level -- e.g.:

@task(
    task_config=Spark(
        executor_pod_template=...,
        driver_pod_template=...,
    ),
    pod_template=PodTemplate(
        primary_container_name="primary",
        labels={"lKeyA": "lValA", "lKeyB": "lValB"},
        annotations={"aKeyA": "aValA", "aKeyB": "aValB"},
        pod_spec=V1PodSpec(
            volumes=[V1Volume(name="volume")],
            tolerations=[
                V1Toleration(
                    key="num-gpus",
                    operator="Equal",
                    value="1",
                    effect="NoSchedule",
                ),
            ],
        ),
    ),
)

it isn't intuitive to me how they would be applied. For things like resources that are already available in the task decorator, we implemented (in other plugins) the behavior that task-level settings apply to all replicas (i.e., executor / driver) unless there is a replica-specific override. So IMO we should apply the task-level pod template to both the executor and driver by default and add an override for the driver only.

Contributor Author

I waffled on this a bit since the Spark on k8s operator itself has moved to an abstraction with explicit configs for the driver and executor, but yeah, I tend to agree that using the top-level TaskTemplate.PodTemplate as the primary override for both, plus a driver-specific PodTemplate in the SparkJob, is more consistent with flytekit and other plugins.

We'll need to clarify the order of precedence w.r.t. top-level configs (i.e., TaskTemplate.Container), the top-level pod template (i.e., TaskTemplate.PodTemplate.PrimaryContainerName), and driver-specific config (i.e., SparkJob.PodTemplate.PrimaryContainerName). The same set of clarifications is needed for other configs (e.g., memory/cores, which can also be specified as spark config -- spark.executor.requests.memory).

So the proposed Python devx would be:

@task(
    task_config=Spark(
        spark_conf={
           "spark.driver.memory": "4G",
           "spark.executor.memory": "8G",
        },
        # driver only pod template overrides (takes precedence over top level pod_template)
        driver_pod_template=...
    ),
    # default pod template overrides for BOTH driver and executor
    pod_template=PodTemplate(
        primary_container_name="primary",
        labels={"lKeyA": "lValA", "lKeyB": "lValB"},
        annotations={"aKeyA": "aValA", "aKeyB": "aValB"},
        pod_spec=V1PodSpec(
            volumes=[V1Volume(name="volume")],
            tolerations=[
                V1Toleration(
                    key="num-gpus",
                    operator="Equal",
                    value="1",
                    effect="NoSchedule",
                ),
            ],
        )
    ),
)

@pingsutw @wild-endeavor any thoughts?

Member

I agree it's very confusing to have podTemplate in both @task and the spark config, but I still think we need worker_pod_template in the spark config.

People may want to override only the worker_pod_template, for example to add a GPU selector or something else.

Contributor

I'd personally rather have the original approach, where both pod templates are in the spark config, and then just error if the task-level pod template is specified. But I'm okay either way. I just feel that one could make equally convincing arguments for whether the normal top-level config should apply to the executors or to the driver.

Contributor Author

@andrewwdye andrewwdye Oct 8, 2023

Thanks for the thoughtful feedback. After weighing this plus the suggestions in the migrated PR, I'd like to propose:

  • A shared RoleSpec definition (could be used by other plugins in the future) -- see the sketch after this list
  • Inclusion of driver and executor role specs in SparkJob
  • Error in spark plugin if TaskTemplate.PodTemplate is set
  • Use this pattern as a deprecation path for TaskTemplate.PodTemplate for Dask, MPI, PyTorch, and Ray, should we choose to have per-role PodTemplates/overrides (i.e., start erroring if both TaskTemplate.PodTemplate and a worker RoleSpec are specified)
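
For illustration only, one possible shape of the shared message -- the names and field numbers below are assumptions, not the final design (see flyteorg/flyte#4179 for the actual proposal):

// Hypothetical shared per-role override message (sketch only).
message RoleSpec {
    // Pod spec to use as the base configuration for pods of this role.
    oneof podValue {
        core.K8sPod pod = 1;
    }
    // Reference to an existing PodTemplate k8s resource to use as the base configuration for pods of this role.
    // +optional
    string podTemplateName = 2;
}

message SparkJob {
    // ... existing fields elided ...
    RoleSpec driver = 14;   // driver-specific overrides (placeholder field number)
    RoleSpec executor = 15; // executor-specific overrides (placeholder field number)
}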

Will summarize fully in flyteorg/flyte#4179 and try to move the remainder of the discussion there.

@andrewwdye
Contributor Author

Closing in favor of flyteorg/flyte#4179

@andrewwdye andrewwdye closed this Oct 6, 2023