Installing ODH on OpenShift and adding LM-Eval

Installing ODH on an OpenShift cluster:

Note, this is the method I use to get LM-Eval installed. It's not the only method and I'm not even sure it's the best method, but perhaps you can leverage it to quickly get an environment up and running.

0. Prerequisites

0.1 You need an OpenShift cluster. (Mine was OpenShift 4.16.11.)

0.2 You need to be logged in to the OpenShift web console. Note: I install everything from the command line, but I use the console's "Copy login command" option to get the oc login token.

0.3 You need to be logged in to the cluster with oc login. Something like this:

oc login --token=sha256~XXXX --server=https://api.jim414fips.cp.fyre.ibm.com:6443
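
If you want to double-check that the login worked before going further, these read-only commands show which user and API server the CLI is pointing at:

# Confirm the current user and target cluster
oc whoami
oc whoami --show-server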

1. Install ODH with Fast Channel

Using your terminal where you're logged in with oc login, issue this command:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.22.0
EOF

You can check it started with:

watch oc get pods,csv -n openshift-operators
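
If you'd rather block until the operator install finishes instead of watching, something like this should work on recent oc versions (it assumes the CSV name still matches the startingCSV above; a newer channel may resolve to a newer version):

# Wait for the operator's CSV to reach the Succeeded phase
oc wait csv/opendatahub-operator.v2.22.0 -n openshift-operators \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=300s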

2. Install the DSCI prerequisite Operators

2.1 Install service mesh

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.4
EOF

And then check it with:

watch oc get pods,csv -n openshift-operators
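
If the stable channel has moved past v2.6.4, you can ask the Subscription which CSV it actually installed rather than guessing:

# Print the CSV name the subscription resolved to
oc get subscription servicemeshoperator -n openshift-operators \
  -o jsonpath='{.status.installedCSV}{"\n"}'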

2.2 Install the serverless operator

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable 
  name: serverless-operator 
  source: redhat-operators 
  sourceNamespace: openshift-marketplace 
EOF

And then check it with:

watch oc get pods -n openshift-serverless

3. Install the DSCI

cat << EOF | oc apply -f -
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: dscinitialization
    app.kubernetes.io/part-of: opendatahub-operator
spec:
  applicationsNamespace: opendatahub
  devFlags:
    logmode: production
  monitoring:
    namespace: opendatahub
    managementState: Managed
  serviceMesh:
    auth:
      audiences:
        - 'https://kubernetes.default.svc'
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ''
    managementState: Managed
EOF

And then check it (it should go into the "Ready" state after a minute or so):

watch oc get dsci
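
If you prefer a blocking check, something like this should work (it assumes the DSCInitialization status exposes its phase as Ready, which is what the oc get dsci output reflects):

# Wait for the DSCI to report Ready
oc wait dsci/default-dsci --for=jsonpath='{.status.phase}'=Ready --timeout=300s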

Also note that the Istio control plane (data-science-smcp) starts up at this point; its pods land in the istio-system namespace:

oc get pods -n istio-system

4. Install the DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Managed
    workbenches:
      managementState: Removed
EOF

Check that the pods are running:

watch oc get pods -n opendatahub

You should see these pods:

oc get pods -n opendatahub
NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-5766998974-mjxjc                      1/1     Running   0          21m
kubeflow-training-operator-5dbf85f955-j9cf6                     1/1     Running   0          5h26m
kueue-controller-manager-5449d484c7-phmm6                       1/1     Running   0          5h27m
odh-model-controller-688594d55b-qwfxm                           1/1     Running   0          22m
trustyai-service-operator-controller-manager-5d7f76d9fb-8xc2r   1/1     Running   0          21m
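
Because kserve serving is Managed in the DSC above, the knative-serving control plane should come up as well; a quick sanity check (the namespace matches the serving name in the DSC spec):

# Knative Serving pods created for KServe
oc get pods -n knative-serving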

5. Configure your Kueue minimum requirements

Note that the LocalQueue below is created in the lm-eval-test namespace, so create that namespace first (see step 6.1) or that part of the apply will fail.

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: lm-eval-test
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
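
To confirm the Kueue objects exist, you can list them; the ResourceFlavor, ClusterQueue, and WorkloadPriorityClass objects are cluster-scoped, while the LocalQueue lives in lm-eval-test:

# Cluster-scoped Kueue resources
oc get resourceflavors,clusterqueues,workloadpriorityclasses
# Namespaced LocalQueue
oc get localqueues -n lm-eval-test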

6. Configure LM-Eval

See the "Getting started with LM-Eval" article for the latest information.

6.1 Create a namespace to use. (Note: the default namespace can't be used until Ted Chang's PR is approved and included in ODH.)

oc create namespace lm-eval-test

7. Submit a sample LM-Eval job

cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
  namespace: lm-eval-test
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n lm-eval-test

Once it has pulled the image and started running (the whole job takes about 5 minutes), it should look like this:

oc get pods,lmevaljobs -n lm-eval-test

NAME                 READY   STATUS    RESTARTS   AGE
pod/evaljob-sample   1/1     Running   0          25s

NAME                                               STATE
lmevaljob.trustyai.opendatahub.io/evaljob-sample   Running
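
Once the job reports a completed state, the evaluation results should be available in the LMEvalJob status; a hedged way to pull them out (this assumes the status.results field used by current TrustyAI releases):

# Dump the raw results JSON from the job status (pipe to jq if you have it installed)
oc get lmevaljob evaljob-sample -n lm-eval-test -o jsonpath='{.status.results}'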

Cleanup

Clean up your lmevaljob(s), for example:

oc delete lmevaljob evaljob-sample -n lm-eval-test

Clean up your Kueue resources, if you want that:

cat <<EOF | kubectl delete -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: lm-eval-test
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF

Clean up the DSC (if you want that):

oc delete dsc default-dsc

Clean up the DSCI (if you want that):

oc delete dsci default-dsci

Clean up the ODH operators (if you want that). Note: the CSV versions change over time, so run oc get csv to get the current version(s):

oc delete sub authorino-operator opendatahub-operator servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv authorino-operator.v0.13.0 opendatahub-operator.v2.22.0 servicemeshoperator.v2.6.4 -n openshift-operators
oc delete csv serverless-operator.v1.34.1 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io  servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io  servicemeshpolicies.authentication.maistra.io  servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io