Installing RHOAI on OpenShift and adding LM‐Eval
Refer to the Red Hat docs here for more detail: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.12/html-single/installing_and_uninstalling_openshift_ai_self-managed/index
Table of Contents
- Prerequisites
- Install the Red Hat OpenShift AI Operator
- Monitor DSCI
- Install the Red Hat OpenShift AI components via DSC
- Check that everything is running
- TBD - Kueue setup
- Configure LM-Eval
- Cleanup
Prerequisites
0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)
0.2 Logged into the OpenShift UI. Note, I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.
0.3 Also logged into the terminal with oc login. For example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
0.4 You also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
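The prerequisite checks above can be scripted. A minimal preflight sketch: it assumes `oc` is on your PATH, that you've already pasted the login command, and that the NVIDIA GPU Operator has labeled your nodes with the standard `nvidia.com/gpu` capacity.

```shell
#!/bin/sh
# Preflight sketch: confirm we're logged in and count schedulable GPUs.

check_login() {
  # Prints the current user if the token is valid, fails otherwise.
  oc whoami || { echo "not logged in - run the 'oc login --token=...' command" >&2; return 1; }
}

count_gpus() {
  # Sums nvidia.com/gpu capacity across all nodes; nodes without the
  # capacity key contribute nothing (awk treats the empty field as 0).
  oc get nodes -o jsonpath='{range .items[*]}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}' \
    | awk '{ total += $1 } END { print total + 0 }'
}
```

Run `check_login && count_gpus` before continuing; a GPU count of 0 means the NVIDIA prerequisites aren't in place yet.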
Install the Red Hat OpenShift AI Operator
1.1 Create a namespace:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
EOF
1.2 Create an OperatorGroup
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF
1.3 Install the Service Mesh operator
Note, if you are installing in production, you probably want installPlanApproval: Manual
so that you're not surprised by operator updates until you've had a chance to verify them on a dev/stage server first.
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.4
EOF
and make sure it works:
watch oc get pods,csv -n openshift-operators
and it should look something like this:
NAME                              READY   STATUS    RESTARTS   AGE
istio-operator-6c99f6bf7b-rrh2j   1/1     Running   0          13m
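If you'd rather not sit on `watch`, the wait can be scripted. A minimal polling sketch: pass the CSV name from the Subscription's startingCSV; the 60 × 10 s timeout is an arbitrary choice.

```shell
#!/bin/sh
# Poll until a ClusterServiceVersion reports phase Succeeded.
# Usage: wait_for_csv <namespace> <csv-name>
#   e.g. wait_for_csv openshift-operators servicemeshoperator.v2.6.4
wait_for_csv() {
  ns="$1"; csv="$2"
  i=0
  while [ "$i" -lt 60 ]; do
    phase=$(oc get csv "$csv" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Succeeded" ]; then
      echo "$csv is ready"
      return 0
    fi
    sleep 10
    i=$((i + 1))
  done
  echo "timed out waiting for $csv" >&2
  return 1
}
```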
1.4 Install the Serverless operator
cat << EOF | oc apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And then check it with:
watch oc get pods,csv -n openshift-serverless
1.5 Create a subscription (recommend changing installPlanApproval to Manual in production)
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And watch that it starts:
watch oc get pods,csv -n redhat-ods-operator
Monitor DSCI
Watch the DSCI until it's complete:
watch oc get dsci
and it'll finish up like this:
NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z
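For a script, the interactive watch can be replaced with a blocking wait. A sketch, assuming your oc client is new enough to support `--for=jsonpath` (kubectl 1.23 vintage or later):

```shell
#!/bin/sh
# Block until the DSCInitialization's status.phase becomes Ready,
# or give up after 10 minutes (the timeout is an arbitrary choice).
wait_for_dsci() {
  oc wait dsci/default-dsci --for=jsonpath='{.status.phase}'=Ready --timeout=600s
}
```

Call `wait_for_dsci` before applying the DataScienceCluster in the next step.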
Install the Red Hat OpenShift AI components via DSC
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/rhoai
            uri: 'https://github.com/ruivieira/trustyai-service-operator/tarball/test/rhoai-2.16.1'
      managementState: Managed
EOF
Check that everything is running
4.1 Check that your operators are running:
oc get pods -n redhat-ods-operator
Will return:
NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h
4.2 Check that the service mesh operator is running:
oc get pods -n openshift-operators
Will return:
NAME                              READY   STATUS    RESTARTS        AGE
istio-cni-node-v2-5-9qkw7         1/1     Running   0               84s
istio-cni-node-v2-5-dbtz5         1/1     Running   0               84s
istio-cni-node-v2-5-drc9l         1/1     Running   0               84s
istio-cni-node-v2-5-k4x4t         1/1     Running   0               84s
istio-cni-node-v2-5-pbltn         1/1     Running   0               84s
istio-cni-node-v2-5-xbmz5         1/1     Running   0               84s
istio-operator-6c99f6bf7b-4ckdx   1/1     Running   1 (2m39s ago)   2m56s
4.3 Check that the DSC components are running:
watch oc get pods -n redhat-ods-applications
Will return:
NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-7784c9878b-4fkv9                      1/1     Running   0          51s
kubeflow-training-operator-cb487d469-s78ch                      1/1     Running   0          2m11s
kueue-controller-manager-5fb585c7c4-zpdcj                       1/1     Running   0          4m21s
odh-model-controller-7b57f4b9d8-ztrgx                           1/1     Running   0          5m6s
trustyai-service-operator-controller-manager-5745f74966-2hc2z   1/1     Running   0          2m16s
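The same check can be made non-interactive. A sketch that loops over the deployments the DSC should create; the deployment names are inferred from the pod names above and may drift between RHOAI versions.

```shell
#!/bin/sh
# Report any expected RHOAI deployment that is not fully available yet.
check_deployments() {
  ns="redhat-ods-applications"
  missing=0
  for d in kserve-controller-manager kubeflow-training-operator \
           kueue-controller-manager odh-model-controller \
           trustyai-service-operator-controller-manager; do
    # availableReplicas is absent until at least one pod passes readiness
    ready=$(oc get deployment "$d" -n "$ns" -o jsonpath='{.status.availableReplicas}' 2>/dev/null)
    if [ "${ready:-0}" -lt 1 ]; then
      echo "NOT READY: $d"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ] && echo "all components available"
  return "$missing"
}
```

The function's exit code is the number of missing components, so `check_deployments || echo "keep waiting"` works in a retry loop.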
Configure LM-Eval
See the TrustyAI "Getting started with LM-Eval" article for the latest info.
6.1 Create a namespace to use. (Note, it can't use the default namespace until Ted Chang's PR is approved and part of ODH.)
oc create namespace lm-eval-test
6.2 Submit a sample LM-Eval job
cat << EOF | oc apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
  namespace: lm-eval-test
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        template: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n lm-eval-test
And once it pulls the image and runs for about 5 minutes it should look like this:
oc get pods,lmevaljobs -n lm-eval-test
NAME                 READY   STATUS    RESTARTS   AGE
pod/evaljob-sample   1/1     Running   0          25s

NAME                                               STATE
lmevaljob.trustyai.opendatahub.io/evaljob-sample   Running
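Once the STATE flips to Complete, the evaluation output is stored on the custom resource itself, in the job's status.results field, as a JSON string. A small sketch to pull it out:

```shell
#!/bin/sh
# Print the raw results JSON stored on the finished LMEvalJob.
get_results() {
  oc get lmevaljob evaljob-sample -n lm-eval-test -o jsonpath='{.status.results}'
}
```

If you have `jq` installed, `get_results | jq '.results'` trims the output down to just the metric values.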
Cleanup
7.1 Cleanup of your lmevaljob(s), for example:
oc delete lmevaljob evaljob-sample -n lm-eval-test
7.2 Cleanup of your Kueue resources, if you want that:
oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2
7.3 Cleanup of dsc items (if you want that)
oc delete dsc default-dsc
7.4 Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
7.5 Cleanup of the Operators (if you want that)
oc delete sub servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv servicemeshoperator.v2.6.4 -n openshift-operators
oc delete csv serverless-operator.v1.34.1 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.2.16.0 -n redhat-ods-operator
7.6 Cleanup of the operatorgroup
oc delete OperatorGroup rhods-operator -n redhat-ods-operator