Installing RHOAI on OpenShift and adding LM-Eval

James Busche edited this page Dec 19, 2024 · 10 revisions

Refer to the Red Hat docs here for more detail: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.12/html-single/installing_and_uninstalling_openshift_ai_self-managed/index

0. Prerequisites

0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)

0.2 Logged in to the OpenShift UI. I install everything from the command line, but the UI's "Copy login command" option is needed to get the oc login token.

0.3 Logged in to the cluster from the terminal with oc login. For example:

oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
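
Before continuing, it can help to confirm that the terminal session really is logged in to the intended cluster. The following is a minimal sketch; check_login is a hypothetical helper name, while oc whoami and oc whoami --show-server are standard oc subcommands.

```shell
#!/usr/bin/env bash
# Sketch: confirm the current oc session is logged in and show user and API server.
# check_login is a hypothetical helper name, not part of the oc CLI.
check_login() {
  local user server
  # "oc whoami" exits non-zero when there is no valid session.
  if ! user="$(oc whoami 2>/dev/null)"; then
    echo "Not logged in: paste the 'Copy login command' oc login line first" >&2
    return 1
  fi
  server="$(oc whoami --show-server)"
  echo "Logged in as ${user} on ${server}"
}

# Usage: check_login || exit 1
```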

0.4 If you have a GPU cluster, also install the GPU prerequisites described here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
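
One way to check that the GPU prerequisites took effect is to look for nodes that advertise an allocatable nvidia.com/gpu resource. This is a sketch under the assumption that the NVIDIA GPU Operator exposes GPUs under that resource name; gpu_nodes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the nodes that report at least one allocatable nvidia.com/gpu.
# Assumption: GPUs are exposed under the "nvidia.com/gpu" resource name.
# gpu_nodes is a hypothetical helper name.
gpu_nodes() {
  # One line per node: "<name><TAB><gpu count or empty>".
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk -F'\t' '$2 + 0 > 0 { print $1 ": " $2 " GPU(s)" }'
}

# Usage: gpu_nodes
```

An empty result suggests the GPU operator is not ready yet or the cluster has no GPU nodes.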

1. Install the Red Hat OpenShift AI Operator

1.1 Create a namespace:

cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator 
EOF

1.2 Create an OperatorGroup

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF

1.3 Install the Service Mesh operator

Note: if you are installing in production, you probably want installPlanApproval: Manual so that you're not surprised by operator updates before you've had a chance to verify them on a dev/stage server first.

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.4
EOF

Then check that the operator pod and CSV come up:

watch oc get pods,csv -n openshift-operators

The pod list should look something like this:

NAME                              READY   STATUS    RESTARTS   AGE
istio-operator-6c99f6bf7b-rrh2j   1/1     Running   0          13m
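
For scripted installs, watch can be replaced with a polling loop that returns once every CSV in the namespace reports the Succeeded phase. The sketch below is one way to do that; wait_for_csvs and its timeout and interval arguments are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll until all ClusterServiceVersions in a namespace are Succeeded.
# wait_for_csvs, the timeout, and the interval are hypothetical helper details.
wait_for_csvs() {
  local ns="$1" timeout="${2:-600}" interval="${3:-10}" elapsed=0 phases=""
  while [ "$elapsed" -lt "$timeout" ]; do
    phases="$(oc get csv -n "$ns" -o jsonpath='{.items[*].status.phase}' 2>/dev/null)"
    # Succeed only when at least one CSV exists and none is in another phase.
    if [ -n "$phases" ] && ! echo "$phases" | tr ' ' '\n' | grep -qv '^Succeeded$'; then
      echo "All CSVs in ${ns} are Succeeded"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out waiting for CSVs in ${ns}: ${phases:-none found}" >&2
  return 1
}

# Usage: wait_for_csvs openshift-operators
```

The same helper can be pointed at openshift-serverless and redhat-ods-operator in the later steps.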

1.4 Install the serverless operator

cat <<EOF | oc apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable 
  name: serverless-operator 
  source: redhat-operators 
  sourceNamespace: openshift-marketplace 
EOF

And then check it with:

watch oc get pods,csv -n openshift-serverless

1.5 Create a subscription for the Red Hat OpenShift AI operator (changing installPlanApproval to Manual is recommended in production)

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator 
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic 
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

And watch that it starts:

watch oc get pods,csv -n redhat-ods-operator
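
If installPlanApproval is set to Manual, as suggested for production in steps 1.3 and 1.5, the operator will not install until its InstallPlan is approved. The following sketch approves every unapproved InstallPlan in a namespace by patching spec.approved; approve_installplans is a hypothetical helper name, and in practice you would review each plan before approving it.

```shell
#!/usr/bin/env bash
# Sketch: approve all pending InstallPlans in one namespace.
# approve_installplans is a hypothetical helper name.
approve_installplans() {
  local ns="$1" plan
  # List the names of InstallPlans whose spec.approved is still false.
  for plan in $(oc get installplan -n "$ns" \
      -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}'); do
    echo "Approving ${plan} in ${ns}"
    oc patch installplan "$plan" -n "$ns" --type merge \
      --patch '{"spec":{"approved":true}}'
  done
}

# Usage: approve_installplans redhat-ods-operator
```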

2. Monitor DSCI

Watch the DSCInitialization (dsci) until its phase is Ready:

watch oc get dsci

and it'll finish up like this:

NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z
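
In a script, oc wait can block until the DSCI is ready instead of watching it by hand. This sketch assumes the PHASE column shown above is backed by the .status.phase field; wait_for_dsci is a hypothetical helper name. Note that oc wait fails immediately if the resource does not exist yet, so it may need a retry right after the operator starts.

```shell
#!/usr/bin/env bash
# Sketch: block until the DSCInitialization reports phase Ready.
# Assumption: the PHASE column corresponds to .status.phase.
# wait_for_dsci is a hypothetical helper name.
wait_for_dsci() {
  local name="${1:-default-dsci}" timeout="${2:-900s}"
  oc wait "dsci/${name}" --for=jsonpath='{.status.phase}'=Ready --timeout="$timeout"
}

# Usage: wait_for_dsci default-dsci 900s
```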

3. Install the Red Hat OpenShift AI components via DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving 
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/rhoai
            uri: 'https://github.com/ruivieira/trustyai-service-operator/tarball/test/rhoai-2.16.1'
      managementState: Managed
EOF

4. Check that everything is running

4.1 Check that your operators are running:

oc get pods -n redhat-ods-operator

Will return:

NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h

4.2 Check that the service mesh operator is running:

oc get pods -n openshift-operators 

Will return:

NAME                              READY   STATUS    RESTARTS        AGE
istio-cni-node-v2-5-9qkw7         1/1     Running   0               84s
istio-cni-node-v2-5-dbtz5         1/1     Running   0               84s
istio-cni-node-v2-5-drc9l         1/1     Running   0               84s
istio-cni-node-v2-5-k4x4t         1/1     Running   0               84s
istio-cni-node-v2-5-pbltn         1/1     Running   0               84s
istio-cni-node-v2-5-xbmz5         1/1     Running   0               84s
istio-operator-6c99f6bf7b-4ckdx   1/1     Running   1 (2m39s ago)   2m56s

4.3 Check that the DSC components are running:

watch oc get pods -n redhat-ods-applications

Will return:

NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-7784c9878b-4fkv9                      1/1     Running   0          51s
kubeflow-training-operator-cb487d469-s78ch                      1/1     Running   0          2m11s
kueue-controller-manager-5fb585c7c4-zpdcj                       1/1     Running   0          4m21s
odh-model-controller-7b57f4b9d8-ztrgx                           1/1     Running   0          5m6s
trustyai-service-operator-controller-manager-5745f74966-2hc2z   1/1     Running   0          2m16
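
Before moving on to LM-Eval, it may be worth checking non-interactively that the component pods are Ready and that the LMEvalJob CRD has been installed by the TrustyAI operator. A minimal sketch follows; check_lmeval_ready is a hypothetical helper name, and the pod check assumes the namespace contains no Completed pods (which never become Ready).

```shell
#!/usr/bin/env bash
# Sketch: verify the DSC component pods are Ready and the LMEvalJob CRD exists.
# check_lmeval_ready is a hypothetical helper name.
check_lmeval_ready() {
  local ns="${1:-redhat-ods-applications}"
  # Wait for every pod in the namespace to report the Ready condition.
  oc wait pods --all -n "$ns" --for=condition=Ready --timeout=300s || return 1
  # The CRD is installed by the TrustyAI operator when trustyai is Managed.
  oc get crd lmevaljobs.trustyai.opendatahub.io >/dev/null || {
    echo "LMEvalJob CRD not found; is trustyai set to Managed in the DSC?" >&2
    return 1
  }
  echo "LM-Eval prerequisites look ready in ${ns}"
}

# Usage: check_lmeval_ready
```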

6. Configure LM-Eval

See the "Getting started with LM-Eval" article for the latest information.

6.1 Create a namespace to use. (Note: LM-Eval jobs can't run in the default namespace until Ted Chang's PR is approved and part of ODH.)

oc create namespace lm-eval-test

6.2 Submit a sample LM-Eval job

cat <<EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
  namespace: lm-eval-test
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF

And then watch that it starts and runs:

watch oc get pods,lmevaljobs -n lm-eval-test

Once the image has been pulled and the job is running (it runs for about 5 minutes), it should look like this:

oc get pods,lmevaljobs -n lm-eval-test                                                                                                 api.jim414fips.cp.fyre.ibm.com: Tue Oct 29 14:58:47 2024

NAME                 READY   STATUS    RESTARTS   AGE
pod/evaljob-sample   1/1     Running   0          25s

NAME                                               STATE
lmevaljob.trustyai.opendatahub.io/evaljob-sample   Running
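
To collect the results without watching, a script can poll the job until it finishes and then print its results. The sketch below assumes the STATE column is backed by .status.state, that a finished job reports the state Complete, and that the results are stored as JSON in .status.results; those field names and the wait_for_lmeval helper are assumptions, so check them against the LM-Eval article for your version.

```shell
#!/usr/bin/env bash
# Sketch: poll an LMEvalJob until it is Complete, then print its results.
# Assumptions: .status.state reaches "Complete" and .status.results holds JSON.
# wait_for_lmeval and its timeout/interval arguments are hypothetical.
wait_for_lmeval() {
  local job="$1" ns="$2" timeout="${3:-1800}" interval="${4:-15}" elapsed=0 state
  while [ "$elapsed" -lt "$timeout" ]; do
    state="$(oc get lmevaljob "$job" -n "$ns" -o jsonpath='{.status.state}' 2>/dev/null)"
    if [ "$state" = "Complete" ]; then
      oc get lmevaljob "$job" -n "$ns" -o jsonpath='{.status.results}'
      return 0
    fi
    echo "state=${state:-unknown}, waiting..." >&2
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out waiting for ${job} in ${ns}" >&2
  return 1
}

# Usage: wait_for_lmeval evaljob-sample lm-eval-test > results.json
```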

7. Cleanup

7.1 Cleanup of your lmevaljob(s). For example:

oc delete lmevaljob evaljob-sample -n lm-eval-test

7.2 Cleanup of your Kueue resources, if you want that:

oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2

7.3 Cleanup of dsc items (if you want that)

oc delete dsc default-dsc

7.4 Cleanup of DSCI (if you want that)

oc delete dsci default-dsci

7.5 Cleanup of the Operators (if you want that)

oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv servicemeshoperator.v2.6.4 -n openshift-operators
oc delete csv serverless-operator.v1.34.1 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io  servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io  servicemeshpolicies.authentication.maistra.io  servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.2.16.0 -n redhat-ods-operator
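
The CSV names above include specific versions, which will differ if a newer operator version was installed. As a sketch, the installed CSV name can be read from the Subscription's status.installedCSV field before the Subscription is deleted; remove_operator is a hypothetical helper name, and the fully qualified resource name is used to avoid clashing with other resources that share the sub short name.

```shell
#!/usr/bin/env bash
# Sketch: delete an operator Subscription and whichever CSV it installed.
# remove_operator is a hypothetical helper name.
remove_operator() {
  local sub="$1" ns="$2" csv
  # Look up the installed CSV first; the field is gone once the Subscription is deleted.
  csv="$(oc get subscriptions.operators.coreos.com "$sub" -n "$ns" \
    -o jsonpath='{.status.installedCSV}')"
  oc delete subscriptions.operators.coreos.com "$sub" -n "$ns"
  if [ -n "$csv" ]; then
    oc delete csv "$csv" -n "$ns"
  fi
}

# Usage: remove_operator rhods-operator redhat-ods-operator
```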

7.6 Cleanup of the operatorgroup

oc delete OperatorGroup rhods-operator -n redhat-ods-operator