Installing ODH on OpenShift and adding LM‐Eval
Note, this is the method I use to get LM-Eval installed. It's not the only method and I'm not even sure it's the best method, but perhaps you can leverage it to quickly get an environment up and running.
- Prerequisites
- Install ODH with Fast Channel
- Install the DSCI prerequisite Operators
- Install the DSCI
- Install the DSC
- Configure your Kueue minimum requirements
- Configure LM-Eval
- Submit a sample LM-Eval job
- Cleanup
Prerequisites
0.1 Need an OpenShift cluster. (Mine was OpenShift 4.16.11.)
0.2 Be logged onto the OpenShift web console. Note: I install everything from the command line, but I need the UI to get the "Copy login command" for the oc login token.
0.3 Be logged into the cluster with oc login. Something like this:
oc login --token=sha256~XXXX --server=https://api.jim414fips.cp.fyre.ibm.com:6443
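You can confirm the login took with a couple of quick checks:
# Who am I logged in as, and does the API server answer?
oc whoami
oc cluster-info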
1. Install ODH with Fast Channel
Using your terminal where you're logged in with oc login, issue this command:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.22.0
EOF
You can check that it started with:
watch oc get pods,csv -n openshift-operators
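If you'd rather have a scriptable check than watch, you can poll the CSV phase directly; this assumes the CSV name still matches the startingCSV pinned above:
# Prints "Succeeded" once the operator has finished installing
oc get csv opendatahub-operator.v2.22.0 -n openshift-operators -o jsonpath='{.status.phase}{"\n"}'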
2. Install the DSCI prerequisite Operators
2.1 Install the Service Mesh operator
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.4
EOF
And then check it with:
watch oc get pods,csv -n openshift-operators
2.2 Install the serverless operator
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-serverless
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: serverless-operators
  namespace: openshift-serverless
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-serverless
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And then check it with:
watch oc get pods -n openshift-serverless
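Since this Subscription doesn't pin a startingCSV, you get whatever is latest on the stable channel; it's worth noting which CSV was installed, since you'll need the exact version string for cleanup later:
# Note the serverless-operator.vX.Y.Z version in the output
oc get csv -n openshift-serverless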
3. Install the DSCI
cat << EOF | oc apply -f -
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: dscinitialization
    app.kubernetes.io/part-of: opendatahub-operator
spec:
  applicationsNamespace: opendatahub
  devFlags:
    logmode: production
  monitoring:
    namespace: opendatahub
    managementState: Managed
  serviceMesh:
    auth:
      audiences:
        - 'https://kubernetes.default.svc'
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ''
    managementState: Managed
EOF
And then check it (it should go into the "Ready" state after a minute or so):
watch oc get dsci
Also note that you'll see the Istio control plane start up as well, in the namespace the DSCI specified:
oc get pods -n istio-system
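For a scriptable alternative to watch, the DSCI publishes its overall state in .status.phase (my assumption based on the operator's status output; confirm with oc get dsci default-dsci -o yaml):
# Prints "Ready" once the DSCI has reconciled
oc get dsci default-dsci -o jsonpath='{.status.phase}{"\n"}'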
4. Install the DSC
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Managed
    workbenches:
      managementState: Removed
EOF
Check that the pods are running:
watch oc get pods -n opendatahub
You should see these pods:
oc get pods -n opendatahub
NAME                                                            READY   STATUS    RESTARTS   AGE
kserve-controller-manager-5766998974-mjxjc                      1/1     Running   0          21m
kubeflow-training-operator-5dbf85f955-j9cf6                     1/1     Running   0          5h26m
kueue-controller-manager-5449d484c7-phmm6                       1/1     Running   0          5h27m
odh-model-controller-688594d55b-qwfxm                           1/1     Running   0          22m
trustyai-service-operator-controller-manager-5d7f76d9fb-8xc2r   1/1     Running   0          21m
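Rather than eyeballing the list, you can also block until everything in the namespace reports Ready:
# Waits up to 5 minutes for all opendatahub pods to become Ready
oc wait pods --all -n opendatahub --for=condition=Ready --timeout=300s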
5. Configure your Kueue minimum requirements
Note: the LocalQueue below lives in the lm-eval-test namespace, which isn't created until step 6.1. Create it first (oc create namespace lm-eval-test) or that one resource will fail to apply.
cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "cpu-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 5
            - name: "memory"
              nominalQuota: 20Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: lm-eval-test
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
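You can confirm the queueing objects exist using the Kueue CRD plural names:
oc get resourceflavors,clusterqueues,workloadpriorityclasses
oc get localqueues -n lm-eval-test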
6. Configure LM-Eval
See this Getting started with LM-Eval article for the latest info.
6.1 Create a namespace to use. (Note: it can't use the default namespace until Ted Chang's PR is approved and part of ODH.)
oc create namespace lm-eval-test
7. Submit a sample LM-Eval job
cat <<EOF | kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
  namespace: lm-eval-test
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskRecipes:
      - card:
          name: "cards.wnli"
        template: "templates.classification.multi_class.relation.default"
  logSamples: true
EOF
And then watch that it starts and runs:
watch oc get pods,lmevaljobs -n lm-eval-test
Once it pulls the image, the job runs for about 5 minutes; while it's running it should look like this:
oc get pods,lmevaljobs -n lm-eval-test
NAME                 READY   STATUS    RESTARTS   AGE
pod/evaljob-sample   1/1     Running   0          25s

NAME                                               STATE
lmevaljob.trustyai.opendatahub.io/evaljob-sample   Running
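While it runs you can follow the evaluation through the pod logs, and once the LMEvalJob reaches the Complete state the scores land on the custom resource itself. The .status.results path is an assumption based on the TrustyAI examples, so verify with oc get lmevaljob evaljob-sample -n lm-eval-test -o yaml if it comes back empty:
# Follow the harness output while the job runs
oc logs -f pod/evaljob-sample -n lm-eval-test
# After completion, dump the results JSON stored on the CR (assumed field)
oc get lmevaljob evaljob-sample -n lm-eval-test -o jsonpath='{.status.results}'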
8. Cleanup
Clean up your lmevaljob(s), for example:
oc delete lmevaljob evaljob-sample -n lm-eval-test
Clean up your Kueue resources, if you want that:
cat <<EOF | kubectl delete -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: "cpu-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 5
            - name: "memory"
              nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: lm-eval-test
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
Clean up the DSC (if you want that):
oc delete dsc default-dsc
Clean up the DSCI (if you want that):
oc delete dsci default-dsci
Clean up the ODH operators (if you want that). Note! The CSV versions update on occasion; run oc get csv to get the latest version(s).
oc delete sub authorino-operator opendatahub-operator servicemeshoperator -n openshift-operators
oc delete sub serverless-operator -n openshift-serverless
oc delete csv authorino-operator.v0.13.0 opendatahub-operator.v2.22.0 servicemeshoperator.v2.6.4 -n openshift-operators
oc delete csv serverless-operator.v1.34.1 -n openshift-serverless
oc delete crd servicemeshcontrolplanes.maistra.io servicemeshmemberrolls.maistra.io servicemeshmembers.maistra.io servicemeshpeers.federation.maistra.io servicemeshpolicies.authentication.maistra.io servicemeshrbacconfigs.rbac.maistra.io lmevaljobs.trustyai.opendatahub.io
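As a final sanity check, these should come back empty (or at least free of the pieces you just removed):
# Any leftover subscriptions/CSVs in the operator namespaces?
oc get sub,csv -n openshift-operators
oc get sub,csv -n openshift-serverless
# Any leftover mesh or TrustyAI CRDs?
oc get crd | grep -E 'maistra|trustyai'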