Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controller Impl #5

Merged
merged 3 commits into from
Jun 12, 2024
Merged

Controller Impl #5

merged 3 commits into from
Jun 12, 2024

Conversation

yhwang
Copy link
Collaborator

@yhwang yhwang commented Jun 12, 2024

Use the kubebuilder to generate the skeleton of the controller and implement the first version of EvalJob CRD

  • Create the EvalJob CRD
  • Implement the controller to handle the CR objects
  • Implement a driver to wrap the Python program that runs the lm-eval job
  • Update the manifests to deploy the CRD, controller, and related resources

@yhwang yhwang merged commit 9bac6f9 into main Jun 12, 2024
3 checks passed
@yhwang yhwang deleted the controller branch June 12, 2024 06:53
// +kubebuilder:subresource:status

// EvalJob is the Schema for the evaljobs API
type EvalJob struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want to be careful about how we name the CR. I think EvalJob is too vague since there are numerous things that users may want to evaluate in a kube cluster. I think we could call it LMEvalJob to explicitly link it to the lm-eval python package.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing. This should be related to your other comment about the GroupName and Kind. The reason I have the EvalJob in the first place is that the GroupName is lm-eval-service.github.com.

If we change the GroupName to foundation-model-stack.github.com, then the LMEvalJob is definitely better.

Context("When creating EvalJob under Defaulting Webhook", func() {
It("Should fill in the default value if a required field is empty", func() {

// TODO(user): Add your logic here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we should do these!


var (
// GroupVersion is group version used to register these objects
GroupVersion = schema.GroupVersion{Group: "lm-eval-service.github.com", Version: "v1beta1"}
Copy link

@gabe-l-hart gabe-l-hart Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should use something like foundation-model-stack.github.com as the group with LMEvalJob as the kind. Feels like we may have other kinds we want to bundle under this. I could also see not coupling this to other projects in the org. If we go that route, I'd drop the "service" since this is not managing the REST microservice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the GroupName to foundation-model-stack.github.com and Kind to LMEvalJob. Then we can expand to other kinds for other services in the future.

type EvalJobSpec struct {
// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
// Important: Run "make" to regenerate code after modifying this file

Copy link

@gabe-l-hart gabe-l-hart Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these fields under spec relate to the CLI args of the job. Do you think we need any further config around the job itself? I would think at a minimum we'd need:

  • image name
  • resources
  • passthrough labels and annotations
  • config for mounting storage volumes for inputs/outputs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also eventually need the ability to run sidecars and the like inside the job (e.g. for proprietary storage connectivity)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do use ConfigMap to store some settings which may relate to the things you listed. We definitely need to expand the settings to support more configuration-wide resources.

}

type ServiceOptions struct {
PodImage string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anywhere that this can be set in the CRD that I'm not seeing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, it's not linked to any field in the CRD yet. I use ConfigMap to store these settings for now. we can have further discussion about how this should be mapped to the REST API and how it converts to CRD fields and etc.

return ctrl.Result{}, client.IgnoreNotFound(err)
}

if !evalJob.ObjectMeta.DeletionTimestamp.IsZero() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Is this how a finalizer is triggered for a kubebuilder controller?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, for the controller, it's reconciliation fashion, which means you don't know which event triggers the reconcile function. and the DeletionTimestamp is the field that says the current state of the CR is being deleted. After the CR is deleted, there will be another reconcile call. but I use a predicate to ignore the deletion events since we have nothing to do after the CR has been deleted.

return r.checkScheduledPod(ctx, log, evalJob)
case lmevalservicev1beta1.CompleteJobState:
return r.handleComplete(ctx, log, evalJob)
case lmevalservicev1beta1.CancelledJobState:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have cancellation separate from deletion? I could see that making sense so that the status and artifacts from a cancelled job. I think we may also want to consider some kind of retention policy for old jobs so that they don't pile up in the kube master.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Cancelled state is used to trigger the cancellation and is different from the Complete state in the current design. After a job is canceled, it will change to the Complete state and the reason will be Cancelled. Yes, we do need a retention policy but it could be either a separate mechanism from the controller or in the controller. We need to figure out a good way to consolidate reconcile-based and time-based logic.

client.Client
Scheme *runtime.Scheme
Recorder record.EventRecorder
ConfigMap string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the user point to this ConfigMap? I'm not seeing it in the CRD (could be missing it somewhere).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to use the ConfigMap to store settings for the controller and the fields in the REST APIs. For the sake of the argument, if there is a size property in the REST API that allows users to specify the pod's spec. we may have a big-size settings in the ConfigMap and it contains the 4000m CPU and 8GB memory settings. and a median and small size settings. So it's not directly used by the CR, but somehow related.

// construct a new pod and create a pod for the job
currentTime := v1.Now()
pod := r.createPod(job)
if err := r.Create(ctx, pod, &client.CreateOptions{}); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with how kubebuilder controllers get wired into the operator process. Is there any chance of a race here where the check to see if it's a new CR says "yes," but by the time this r.Create happens it's already been reconciled by another reconciliation process or a sibling operator? I suppose there's always the race that a user might manually make the pod themselves, but we probably don't need to account for that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, you are talking about the scaling of the controller now. for scaling, each controller should work on its own bucket and different CRs should go to different buckets. At least that's how knative works. I will check on how multiple replica of the kubebuilder controller works. For the racing condition, theoretically, one reconciliation would throw an error and the other reconciliation would pass. So we can have a logic to validate the pod status to correct the state of the CR. I noted this down and will add some logic to handle that.


if pod.Status.ContainerStatuses == nil {
// wait for the pod to initialize and run the containers
return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth making this requeue timing configurable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I will put it into the ConfigMap

for _, cstatus := range pod.Status.ContainerStatuses {
if cstatus.Name == "main" {
if cstatus.LastTerminationState.Terminated == nil {
return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These conditionals of when a short-circuit return is used vs when the status is updated without returning and it falls through to the final return are a little tangled. I think this could all be done more simply by pulling the "main" container out instead of looping and then having all conditionals fall through to a single return statement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. but the containerStatuses is not map-based structure, need to loop through them to get the main one. let me see if I can simplify this part.


func (r *EvalJobReconciler) getPod(ctx context.Context, job *lmevalservicev1beta1.EvalJob) (*corev1.Pod, error) {
var pod = corev1.Pod{}
if err := r.Get(ctx, types.NamespacedName{Namespace: job.Namespace, Name: job.Name}, &pod); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's using the name of the job as the name of the pod? Is that always going to be the case? Seems like we might want to do some kind of hash suffix to correlate with the kind of naming used for other parent/child resource types like Job or ReplicaSet. This would be a nice shorthand for users to identify the pod as "machine created" vs "human created."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the pod name is stored in the status subresource. we definitely can have our own naming logic. apology for lazy on this in the first draft. :-p. I put the OwnerReference in the pod, so even from the pod, we know which job it belongs to.

return &pod, nil
}
}
return nil, fmt.Errorf("pod doesn't have proper entry in the OwnerReferences")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this error go if it triggers? Would this prevent future reconciliations from mucking with the pod? In other words, could a user manually remove the ownerReferences entry to cause the controller to "disown" the pod?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if users manually modify the OwnerReference entry, then the controller won't be able to get Delete events for the pod and no reconcile cycle. That's how the controller works. But with the driver embedded in the pod. We still know whether the job is done or not

APIVersion: "v1",
},
ObjectMeta: v1.ObjectMeta{
Name: job.Status.PodName,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, here it looks like we're pulling the pod name from status.podName

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, store it here. so we can have some naming logic for the pod. and because of the reconcile mechanism, we can only use the data in the LMEvalJob data struct to get the corresponding Pod. the OwnerRefernece is for watching the deletion event from the pod

fmt.Sprintf("Tthe EvalJob %s in namespace %s has completed",
job.Name,
job.Namespace))
// TODO: final wrap up/clean up

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be interesting. I suspect one thing that would be really important is persisting the pod's logs somewhere that the user would be responsible for cleaning up. Those could probably get pretty verbose, but it would be great to be able to store the pod logs with the job somehow.

The internet is not very helpful, which indicates that it's probably not possible, but it would be really cool if we could somehow extend the kube logging architecture such that kubectl logs lmevaljob/<jobname> would do the same thing as kubectl logs <job pod name>. Probably not a thing though!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting idea. right now, the stderr/stdout are stored in the specific location in the pod and the driver only retrieves the results.json. the logs would only contain information/messages from the driver for now.

we can adjust the behavior in the driver and store the logs somewhere else. Not sure if we can customize the kubectl get lmevaljob command to get logs. I know you can customize the get in the CRD but the info should come from its status.

}

func (r *EvalJobReconciler) handleCancel(ctx context.Context, log logr.Logger, job *lmevalservicev1beta1.EvalJob) (ctrl.Result, error) {
// delete the pod and update the state to complete

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same concern here around log retention would be at play. A user may want to retain the logs from the pod even after cancellation so that they can analyze the progress/problems from the job.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. we can do that in the driver directly.

log.Error(err, "failed to update status for cancellation")
}
r.Recorder.Event(job, "Normal", "Cancelled",
fmt.Sprintf("Tthe EvalJob %s in namespace %s has cancelled and changed its state to Complete",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: s,Tthe,The,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix this in the new PR

}

func (r *EvalJobReconciler) createPod(job *lmevalservicev1beta1.EvalJob) *corev1.Pod {
var allowPrivilegeEscalation = false

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I think it should be safe to always have these settings set, right? We certainly want to allow these settings to be set, but we might want to make sure that it's possible for users to do otherwise. I could imagine the case where a user has a custom image with custom Task that does things like use temp files and read from disk. While this would certainly be an anti-pattern for a production-grade eval task, I wouldn't be at all surprised if users still wanted to be able to do it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wolud say this kind of images will hit the SCC on OpenShift and fail. I hope we don't need to support that use case.

var allowPrivilegeEscalation = false
var runAsNonRootUser = true
var ownerRefController = true
var runAsUser int64 = 1001030000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to just let this setting be auto-assigned, right? We certainly want non-root, but I think the best practice is to not hard-code an explicit non-root UID and instead let a random (anonymous) one be assigned.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this would be auto-assigned on OpenShift. I forgot to remove it since my local dev env is using kind. The proper way to support both local dev env and OpenShift is:

  • getting the uid range from the namespace's annotation
  • specify an ID in that range or a random id if the annotation is not found (which is other k8s cluster)

UID: job.UID,
},
},
Labels: map[string]string{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want some kind of user passthrough label/annotation ability I think (useful for monitoring and visibility of all resources owned by a parent application).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me create an issue to trace this feature
#7

{
Name: "driver",
Image: r.options.DriverImage,
ImagePullPolicy: corev1.PullAlways,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will probably need to be configurable too. I always struggle with this kind of controller on how much to make the CRD's spec include a generic "put anything to configure the pod here" section vs being prescriptive about the body of the pod.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is used by the controller and we do have a configmap for the settings. I can add the image pull policy there.

VolumeMounts: []corev1.VolumeMount{
{
Name: "shared",
MountPath: "/opt/app-root/src/bin",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Why is this emptydir mounting to a bin path?

Copy link

@gabe-l-hart gabe-l-hart Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooh, I think I see now. Is this getting the binaries dropped into it during the init container? Why not just require that the pod's image have the necessary binaries? Maybe I'll get there once I make it to driver.go

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I think I understand the role of the driver now. Is this basically just a wrapper around executing the python script that will push the status updates to the parent resource?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this is correct, I think we may need to be pretty careful here using a push model for status updates rather than a pull model where the controller polls the pod. The reason for this is that some security-focused users will want to isolate the pod where the job body runs in a namespace that fully disallows any RBAC for pods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's used to copy the driver from the driver image in the init container to lm-eval + unitxt image in the main container. So in the main container, we are still running the driver and have the driver to run the "lm-eval + unitxt" python program.

SecurityContext: &corev1.PodSecurityContext{
RunAsNonRoot: &runAsNonRootUser,
SeccompProfile: &corev1.SeccompProfile{
Type: corev1.SeccompProfileTypeRuntimeDefault,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere in the back of my mind, this is different depending on the version of kubernetes/openshift you're running. We probably don't need to go back in time too far though?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lost track too. but most of the recent k8s clusters need it.

},
},
Command: generateCmd(job),
Args: generateArgs(job),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need resources here too eventually, especially for custom tasks that may do meaty work inline (vs calling out to an external model)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created the issue to track it: #8

}

cmds := make([]string, 0, 10)
cmds = append(cmds, "python", "-m", "lm_eval", "--output_path", "/opt/app-root/src/output")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will need to be a mounted volume and/or configurable. A user will need to be able to get the output results stored somewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let me create an issue for this feature too
#9

return err
}

if err := d.updateStatus(lmevalservicev1beta1.RunningJobState); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is going to pose a problem for running jobs without RBACs. There will definitely be users that want to do this, so I think we may need to consider if we can publish status updates to some medium that would be visible to the controller for polling. This might just be logs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now, if we want to avoid RBACs, the solutions I have on top of my mind for now are:

  • a polling mechanism on the controller which I don't like
  • internal API calls for the driver to update the status and upload logs

Let me also create an issue to track this
#10

d.job.Status.Message = err.Error()
} else {
// read the content of result*.json
pattern := filepath.Join(d.Option.OutputPath, "result*.json")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this "result*.json" something that is static to lm-eval, or is it something a user might configure at some point with a custom task?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's static for lm-eval. but I will keep an eye on that to see if it's still true in the future release of lm-eval

if err != nil {
d.Option.Logger.Error(err, "failed to retrieve the results")
} else {
d.job.Status.Results = string(bytes)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Looking at this, it seems like we're just putting the whole content of the output file into the job's status. If we expect that content to always be small, I think this is OK (and then I would think we remove the ability to set outputPath in the CR). If we think the user would need the ability to fetch an arbitrary set of output files, we're going to need some kind of storage pointer or mounted volume.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if the results would exceed 2.5 MB. I guess for the lm-evaluation job, it wouldn't. but we definitely need the storage feature. I believe I created an issue for this when replying to one of your comments above. #9

for the outputPath, I actually don't have it in my draft CRD :-p

yhwang added a commit that referenced this pull request Jun 13, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
yhwang added a commit that referenced this pull request Jun 13, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
yhwang added a commit that referenced this pull request Jun 13, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
yhwang added a commit that referenced this pull request Jun 24, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
yhwang added a commit that referenced this pull request Jun 25, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
yhwang added a commit that referenced this pull request Jun 26, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
yhwang added a commit that referenced this pull request Jun 26, 2024
- Add `image-pull-policy` in the ConfigMap to specify the
  imagePullPolicy of the job's pod
- Add  `pod-checking-interval` in the ConfigMap to set
  the pod checking interval
- Update the Group and Kind of the CRD to
  `foundation-model-stack.github.com.github.com` and
  `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller
  based on the comment

Signed-off-by: Yihong Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants