In this part, you will setup Kubeflow on Google Kubernetes Engine.
- A GCP project with sufficient GPU quota to train the model with the desired GPU, e.g., K80, P100, V100...
kubectl
CLI (command line interface) pointing to the kubernetes cluster- Make sure that you can run
kubectl get nodes
from your terminal successfully
- Make sure that you can run
- The ksonnet CLI, v0.9.2 or higher: ks
-
In case you want to install a particular version of ksonnet, you can run
export KS_VER=ks_0.11.0_linux_amd64 wget -O /tmp/$KS_VER.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v0.11.0/$KS_VER.tar.gz mkdir -p ${HOME}/bin tar -xvf /tmp/$KS_VER.tar.gz -C ${HOME}/bin export PATH=$PATH:${HOME}/bin/$KS_VER
-
Refer to the user guide for detailed instructions on how to setup Kubeflow on your kubernetes cluster. Specifically, complete the following sections:
- Deploy
Kubeflow on Google Kubernetes Engine
- Zone(s) must have the GPU types you specify, e.g., us-central1-b for Nvidia V100
- If you run into
API rate limiting errors,
ensure you have a
${GITHUB_TOKEN}
environment variable set. - If you run into
RBAC permissions issues
running
ks apply
commands, be sure you have created acluster-admin
ClusterRoleBinding for your username.
- Setup a persistent disk using Google Filestore
- We need a shared persistent disk to store our trained model as containers' filesystems are ephemeral.
- For this example, provision a
10GB
cluster-wide shared NFS mount with the namekubeflow-gcfs
. - Enable the component in the Kubeflow cluster with
ks apply default -c google-cloud-filestore-pv
- Enable Seldon component to serve our Pytorch models
- Install the seldon package, generate the core component as per the instructions
- Enable the component in the Kubeflow cluster with
ks apply default -c seldon
After completing that, you should have the following ready:
- A ksonnet app in a directory named
ks_app
- An output similar to this for
kubectl get pods
command
ambassador-7fb86f6bc5-8xj5h 3/3 Running 0 13m
ambassador-7fb86f6bc5-h2hmt 3/3 Running 0 13m
ambassador-7fb86f6bc5-tz98s 3/3 Running 0 13m
argo-ui-7b6585d85d-mmxbb 1/1 Running 0 13m
centraldashboard-f8d7d97fb-5g76z 1/1 Running 0 13m
cert-manager-798d77c76-wflcb 1/1 Running 0 12m
cloud-endpoints-controller-b55c586b6-7pdnn 1/1 Running 0 12m
cm-acme-http-solver-8lcvn 1/1 Running 0 11m
envoy-79ff8d86b-4dgg7 2/2 Running 2 12m
envoy-79ff8d86b-t6mv8 2/2 Running 2 12m
envoy-79ff8d86b-xbfdj 2/2 Running 2 12m
iap-enabler-6565b855d6-j2rt6 1/1 Running 0 12m
ingress-bootstrap-kfzmg 1/1 Running 0 12m
kube-metacontroller-6d567f9fb4-6g5sr 1/1 Running 0 12m
modeldb-backend-69dfc464df-h9lt7 1/1 Running 0 12m
modeldb-db-6cf5bb764-qcxc7 1/1 Running 0 12m
modeldb-frontend-795bcf6df9-8smn8 1/1 Running 0 12m
pytorch-operator-76f89745c-2qtt9 1/1 Running 0 12m
seldon-redis-6cfc779655-lvphr 1/1 Running 0 6m
seldon-seldon-cluster-manager-54c7794dcb-jdq6p 1/1 Running 0 6m
set-gcfs-permissions-w65j9 0/1 Completed 0 18s
spartakus-volunteer-77446c94b5-7f5vf 1/1 Running 0 12m
studyjob-controller-68f5948984-c66g5 1/1 Running 0 12m
tf-hub-0 1/1 Running 0 13m
tf-job-dashboard-7cddcdf9c4-bqwbg 1/1 Running 0 13m
tf-job-operator-v1alpha2-6566f45db-v78tp 1/1 Running 0 13m
vizier-core-6d8c9d7bf7-chkkc 1/1 Running 1 12m
vizier-db-cc59bc8bd-bgvp6 1/1 Running 0 12m
vizier-suggestion-grid-76fb6b49c7-c22ss 1/1 Running 0 12m
vizier-suggestion-random-c5c64dfc9-svqc8 1/1 Running 0 12m
whoami-app-b7fb9f875-hchzc 1/1 Running 0 12m
workflow-controller-59c7967f59-df424 1/1 Running 0 13m
- A seldon component to serve our models
- A Pytorch operator to train our models
- A 10GB Filestore instance volume "kubeflow-gcfs"
- We deployed a Kubernetes cluster with Deployment Manager using kfctl.sh script
- We created a ksonnet app for our Kubeflow deployment
- We deployed the Kubeflow components to our kubernetes cluster
- We created a persistent disk using Google Filestore for storing our trained model
Next: Training the model