diff --git a/master/modelserving/v1beta1/llm/vllm/index.html b/master/modelserving/v1beta1/llm/vllm/index.html
index 8aee8a450..94687965f 100644
--- a/master/modelserving/v1beta1/llm/vllm/index.html
+++ b/master/modelserving/v1beta1/llm/vllm/index.html
@@ -1178,7 +1178,7 @@

Deploy the LLaMA model with vL
 command:
 - python3
 - -m
-- vllm.entrypoints.api_server
+- vllm.entrypoints.openai.api_server
 env:
 - name: STORAGE_URI
   value: gs://kfserving-examples/llm/huggingface/llama
@@ -1217,7 +1217,7 @@
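Because the container now starts `vllm.entrypoints.openai.api_server`, clients talk to the runtime through OpenAI-style routes such as `/v1/completions`. The following is a minimal sketch, not part of this change, of how such a request could be assembled with the Python standard library. The helper name `build_completion_request`, the host name `llama.default.example.com`, and the model name `/mnt/models` are illustrative assumptions; substitute the values from your own deployment.

```python
import json
import urllib.request


def build_completion_request(host, port, service_hostname, model, prompt, max_tokens=64):
    """Build (but do not send) a POST request for the OpenAI-style completions route.

    All argument values are supplied by the caller; nothing here is read
    from the cluster. The helper itself is hypothetical.
    """
    url = f"http://{host}:{port}/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # The Host header lets the ingress gateway route to the InferenceService.
        headers={"Host": service_hostname, "Content-Type": "application/json"},
        method="POST",
    )


# Assumed example values: replace with your INGRESS_HOST, INGRESS_PORT,
# service host name, and the model name your server actually reports.
req = build_completion_request(
    "localhost", 8080, "llama.default.example.com", "/mnt/models", "San Francisco is a"
)
print(req.get_method(), req.full_url)
```

Once the ingress host and port are known, the request can be sent with `urllib.request.urlopen(req)`; the sketch stops short of that so it runs without a cluster.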

Benchmarking vLLM Runtime
 this instruction to find out your ingress IP and port.

You can run the benchmarking script and send the inference request to the exposed URL.

-python benchmark.py --backend vllm --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
+python benchmark_serving.py --backend openai --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
 

Expected Output

diff --git a/master/search/search_index.json b/master/search/search_index.json index 1ccc6cd58..eb4372811 100644 --- a/master/search/search_index.json +++ b/master/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"","title":"Home"},{"location":"admin/kubernetes_deployment/","text":"Kubernetes Deployment Installation Guide \u00b6 KServe supports RawDeployment mode to enable InferenceService deployment with Kubernetes resources Deployment , Service , Ingress and Horizontal Pod Autoscaler . Comparing to serverless deployment it unlocks Knative limitations such as mounting multiple volumes, on the other hand Scale down and from Zero is not supported in RawDeployment mode. Kubernetes 1.22 is the minimally required version and please check the following recommended Istio versions for the corresponding Kubernetes version. Recommended Version Matrix \u00b6 Kubernetes Version Recommended Istio Version 1.27 1.18, 1.19 1.28 1.19, 1.20 1.29 1.20, 1.21 1. Install Istio \u00b6 The minimally required Istio version is 1.13 and you can refer to the Istio install guide . Once Istio is installed, create IngressClass resource for istio. apiVersion : networking.k8s.io/v1 kind : IngressClass metadata : name : istio spec : controller : istio.io/ingress-controller Note Istio ingress is recommended, but you can choose to install with other Ingress controllers and create IngressClass resource for your Ingress option. 2. Install Cert Manager \u00b6 The minimally required Cert Manager version is 1.9.0 and you can refer to Cert Manager installation guide . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script. 3. Install KServe \u00b6 Note The default KServe deployment mode is Serverless which depends on Knative. 
The following step changes the default deployment mode to RawDeployment before installing KServe. i. Install KServe kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve.yaml Install KServe default serving runtimes: kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve-cluster-resources.yaml ii. Change default deployment mode and ingress option First in ConfigMap inferenceservice-config modify the defaultDeploymentMode in the deploy section, kubectl kubectl patch configmap/inferenceservice-config -n kserve --type = strategic -p '{\"data\": {\"deploy\": \"{\\\"defaultDeploymentMode\\\": \\\"RawDeployment\\\"}\"}}' then modify the ingressClassName in ingress section to point to IngressClass name created in step 1 . ingress : |- { \"ingressClassName\" : \"your-ingress-class\" , }","title":"Kubernetes deployment installation"},{"location":"admin/kubernetes_deployment/#kubernetes-deployment-installation-guide","text":"KServe supports RawDeployment mode to enable InferenceService deployment with Kubernetes resources Deployment , Service , Ingress and Horizontal Pod Autoscaler . Comparing to serverless deployment it unlocks Knative limitations such as mounting multiple volumes, on the other hand Scale down and from Zero is not supported in RawDeployment mode. Kubernetes 1.22 is the minimally required version and please check the following recommended Istio versions for the corresponding Kubernetes version.","title":"Kubernetes Deployment Installation Guide"},{"location":"admin/kubernetes_deployment/#recommended-version-matrix","text":"Kubernetes Version Recommended Istio Version 1.27 1.18, 1.19 1.28 1.19, 1.20 1.29 1.20, 1.21","title":"Recommended Version Matrix"},{"location":"admin/kubernetes_deployment/#1-install-istio","text":"The minimally required Istio version is 1.13 and you can refer to the Istio install guide . Once Istio is installed, create IngressClass resource for istio. 
apiVersion : networking.k8s.io/v1 kind : IngressClass metadata : name : istio spec : controller : istio.io/ingress-controller Note Istio ingress is recommended, but you can choose to install with other Ingress controllers and create IngressClass resource for your Ingress option.","title":"1. Install Istio"},{"location":"admin/kubernetes_deployment/#2-install-cert-manager","text":"The minimally required Cert Manager version is 1.9.0 and you can refer to Cert Manager installation guide . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script.","title":"2. Install Cert Manager"},{"location":"admin/kubernetes_deployment/#3-install-kserve","text":"Note The default KServe deployment mode is Serverless which depends on Knative. The following step changes the default deployment mode to RawDeployment before installing KServe. i. Install KServe kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve.yaml Install KServe default serving runtimes: kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve-cluster-resources.yaml ii. Change default deployment mode and ingress option First in ConfigMap inferenceservice-config modify the defaultDeploymentMode in the deploy section, kubectl kubectl patch configmap/inferenceservice-config -n kserve --type = strategic -p '{\"data\": {\"deploy\": \"{\\\"defaultDeploymentMode\\\": \\\"RawDeployment\\\"}\"}}' then modify the ingressClassName in ingress section to point to IngressClass name created in step 1 . ingress : |- { \"ingressClassName\" : \"your-ingress-class\" , }","title":"3. Install KServe"},{"location":"admin/migration/","text":"Migrating from KFServing \u00b6 This doc explains how to migrate existing inference services from KFServing to KServe without downtime. 
Note The migration job will by default delete the leftover KFServing installation after migrating the inference services from serving.kubeflow.org to serving.kserve.io . Migrating from standalone KFServing \u00b6 Install KServe v0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve.yaml Run the KServe Migration YAML This will begin the migration. Any errors here may affect your existing services. If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it If your KFServing is installed in a namespace other than kfserving-system , then download and set the env KFSERVING_NAMESPACE in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kserve Migrating from Kubeflow-based KFServing \u00b6 Install Kubeflow-based KServe 0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve_kubeflow.yaml Run the KServe Migration YAML for Kubeflow-based installations This will begin the migration. Any errors here may affect your existing services. 
If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job_kubeflow.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kubeflow Update the models web app to use the new InferenceService API group serving.kserve.io Change the deployment image to kserve/models-web-app:v0.7.0-rc0 This is a temporary fix until the next Kubeflow release includes these changes kubectl edit deployment kfserving-models-web-app -n kubeflow Update the cluster role to be able to access the new InferenceService API group serving.kserve.io Edit the apiGroups from serving.kubeflow.org to serving.kserve.io This is a temporary fix until the next Kubeflow release includes these changes kubectl edit clusterrole kfserving-models-web-app-cluster-role","title":"Migrating from KFServing"},{"location":"admin/migration/#migrating-from-kfserving","text":"This doc explains how to migrate existing inference services from KFServing to KServe without downtime. Note The migration job will by default delete the leftover KFServing installation after migrating the inference services from serving.kubeflow.org to serving.kserve.io .","title":"Migrating from KFServing"},{"location":"admin/migration/#migrating-from-standalone-kfserving","text":"Install KServe v0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve.yaml Run the KServe Migration YAML This will begin the migration. Any errors here may affect your existing services. 
If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it If your KFServing is installed in a namespace other than kfserving-system , then download and set the env KFSERVING_NAMESPACE in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kserve","title":"Migrating from standalone KFServing"},{"location":"admin/migration/#migrating-from-kubeflow-based-kfserving","text":"Install Kubeflow-based KServe 0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve_kubeflow.yaml Run the KServe Migration YAML for Kubeflow-based installations This will begin the migration. Any errors here may affect your existing services. 
If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job_kubeflow.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kubeflow Update the models web app to use the new InferenceService API group serving.kserve.io Change the deployment image to kserve/models-web-app:v0.7.0-rc0 This is a temporary fix until the next Kubeflow release includes these changes kubectl edit deployment kfserving-models-web-app -n kubeflow Update the cluster role to be able to access the new InferenceService API group serving.kserve.io Edit the apiGroups from serving.kubeflow.org to serving.kserve.io This is a temporary fix until the next Kubeflow release includes these changes kubectl edit clusterrole kfserving-models-web-app-cluster-role","title":"Migrating from Kubeflow-based KFServing"},{"location":"admin/modelmesh/","text":"ModelMesh Installation Guide \u00b6 KServe ModelMesh installation enables high-scale, high-density and frequently-changing model serving use cases. A Kubernetes cluster is required. You will need cluster-admin authority. Additionally, kustomize and an etcd server on the Kubernetes cluster are required. 1. Standard Installation \u00b6 You can find the standard installation instructions in the ModelMesh Serving installation guide . This approach assumes you have installed the prerequisites such as etcd and S3-compatible object storage. 2. Quick Installation \u00b6 A quick installation allows you to quickly get ModelMesh Serving up and running without having to manually install the prerequisites. The steps are described in the ModelMesh Serving quick start guide . 
Note ModelMesh Serving is namespace scoped, meaning all of its components must exist within a single namespace and only one instance of ModelMesh Serving can be installed per namespace. For more details, you can check out the ModelMesh Serving getting started guide .","title":"ModelMesh installation"},{"location":"admin/modelmesh/#modelmesh-installation-guide","text":"KServe ModelMesh installation enables high-scale, high-density and frequently-changing model serving use cases. A Kubernetes cluster is required. You will need cluster-admin authority. Additionally, kustomize and an etcd server on the Kubernetes cluster are required.","title":"ModelMesh Installation Guide"},{"location":"admin/modelmesh/#1-standard-installation","text":"You can find the standard installation instructions in the ModelMesh Serving installation guide . This approach assumes you have installed the prerequisites such as etcd and S3-compatible object storage.","title":"1. Standard Installation"},{"location":"admin/modelmesh/#2-quick-installation","text":"A quick installation allows you to quickly get ModelMesh Serving up and running without having to manually install the prerequisites. The steps are described in the ModelMesh Serving quick start guide . Note ModelMesh Serving is namespace scoped, meaning all of its components must exist within a single namespace and only one instance of ModelMesh Serving can be installed per namespace. For more details, you can check out the ModelMesh Serving getting started guide .","title":"2. Quick Installation"},{"location":"admin/serverless/serverless/","text":"Serverless Installation Guide \u00b6 KServe Serverless installation enables autoscaling based on request volume and supports scale down to and from zero. It also supports revision management and canary rollout based on revisions. Kubernetes 1.22 is the minimally required version and please check the following recommended Knative, Istio versions for the corresponding Kubernetes version. 
Recommended Version Matrix \u00b6 Kubernetes Version Recommended Istio Version Recommended Knative Version 1.27 1.18,1.19 1.10,1.11 1.28 1.19,1.20 1.11,1.12.4 1.29 1.20,1.21 1.12.4,1.13.1 1. Install Knative Serving \u00b6 Please refer to Knative Serving install guide . Note If you are looking to use PodSpec fields such as nodeSelector, affinity or tolerations which are now supported in the v1beta1 API spec, you need to turn on the corresponding feature flags in your Knative configuration. Warning Knative 1.13.1 requires Istio 1.20+, gRPC routing does not work with previous Istio releases, see release notes . 2. Install Networking Layer \u00b6 The recommended networking layer for KServe is Istio as currently it works best with KServe, please refer to the Istio install guide . Alternatively you can also choose other networking layers like Kourier or Contour , see how to install Kourier with KServe guide . 3. Install Cert Manager \u00b6 The minimally required Cert Manager version is 1.9.0 and you can refer to Cert Manager . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script. 4. Install KServe \u00b6 kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve.yaml 5. Install KServe Built-in ClusterServingRuntimes \u00b6 kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve-cluster-resources.yaml Note ClusterServingRuntimes are required to create InferenceService for built-in model serving runtimes with KServe v0.8.0 or higher.","title":"Serverless installation"},{"location":"admin/serverless/serverless/#serverless-installation-guide","text":"KServe Serverless installation enables autoscaling based on request volume and supports scale down to and from zero. It also supports revision management and canary rollout based on revisions. 
Kubernetes 1.22 is the minimally required version and please check the following recommended Knative, Istio versions for the corresponding Kubernetes version.","title":"Serverless Installation Guide"},{"location":"admin/serverless/serverless/#recommended-version-matrix","text":"Kubernetes Version Recommended Istio Version Recommended Knative Version 1.27 1.18,1.19 1.10,1.11 1.28 1.19,1.20 1.11,1.12.4 1.29 1.20,1.21 1.12.4,1.13.1","title":"Recommended Version Matrix"},{"location":"admin/serverless/serverless/#1-install-knative-serving","text":"Please refer to Knative Serving install guide . Note If you are looking to use PodSpec fields such as nodeSelector, affinity or tolerations which are now supported in the v1beta1 API spec, you need to turn on the corresponding feature flags in your Knative configuration. Warning Knative 1.13.1 requires Istio 1.20+, gRPC routing does not work with previous Istio releases, see release notes .","title":"1. Install Knative Serving"},{"location":"admin/serverless/serverless/#2-install-networking-layer","text":"The recommended networking layer for KServe is Istio as currently it works best with KServe, please refer to the Istio install guide . Alternatively you can also choose other networking layers like Kourier or Contour , see how to install Kourier with KServe guide .","title":"2. Install Networking Layer"},{"location":"admin/serverless/serverless/#3-install-cert-manager","text":"The minimally required Cert Manager version is 1.9.0 and you can refer to Cert Manager . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script.","title":"3. Install Cert Manager"},{"location":"admin/serverless/serverless/#4-install-kserve","text":"kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve.yaml","title":"4. 
Install KServe"},{"location":"admin/serverless/serverless/#5-install-kserve-built-in-clusterservingruntimes","text":"kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve-cluster-resources.yaml Note ClusterServingRuntimes are required to create InferenceService for built-in model serving runtimes with KServe v0.8.0 or higher.","title":"5. Install KServe Built-in ClusterServingRuntimes"},{"location":"admin/serverless/kourier_networking/","text":"Deploy InferenceService with Alternative Networking Layer \u00b6 KServe creates the top level Istio Virtual Service for routing to InferenceService components based on the virtual host or path based routing. Now KServe provides an option for disabling the top level virtual service to allow configuring other networking layers Knative supports. For example, Kourier is an alternative networking layer and the following steps show how you can deploy KServe with Kourier . Install Kourier Networking Layer \u00b6 Please refer to the Serverless Installation Guide and change the second step to install Kourier instead of Istio . 
Install the Kourier networking layer: kubectl apply -f https://github.com/knative/net-kourier/releases/download/ ${ KNATIVE_VERSION } /kourier.yaml Configure Knative Serving to use Kourier: kubectl patch configmap/config-network \\ --namespace knative-serving \\ --type merge \\ --patch '{\"data\":{\"ingress-class\":\"kourier.ingress.networking.knative.dev\"}}' Verify Kourier installation: kubectl get pods -n knative-serving && kubectl get pods -n kourier-system Expected Output NAME READY STATUS RESTARTS AGE activator-77db7d9dd7-kbrgr 1 /1 Running 0 10m autoscaler-67dbf79b95-htnp9 1 /1 Running 0 10m controller-684b6bc97f-ffm58 1 /1 Running 0 10m domain-mapping-6d99d99978-ktmrf 1 /1 Running 0 10m domainmapping-webhook-5f998498b6-sddnm 1 /1 Running 0 10m net-kourier-controller-68967d76dc-ncj2n 1 /1 Running 0 10m webhook-97bdc7b4d-nr7qf 1 /1 Running 0 10m NAME READY STATUS RESTARTS AGE 3scale-kourier-gateway-54c49c8ff5-x8tgn 1 /1 Running 0 10m Edit inferenceservice-config configmap to disable Istio top level virtual host: kubectl edit configmap/inferenceservice-config --namespace kserve # Add the flag `\"disableIstioVirtualHost\": true` under the ingress section ingress : | - { \"disableIstioVirtualHost\" : true } Restart the KServe Controller kubectl rollout restart deployment kserve-controller-manager -n kserve Deploy InferenceService for Testing Kourier Gateway \u00b6 Create the InferenceService \u00b6 New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : model : modelFormat : name : pmml storageUri : \"gs://kfserving-examples/models/pmml\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : pmml : storageUri : gs://kfserving-examples/models/pmml kubectl apply -f pmml.yaml Expected Output $ inferenceservice.serving.kserve.io/pmml-demo created Run a Prediction \u00b6 Note that when setting INGRESS_HOST and 
INGRESS_PORT following the determining the ingress IP and ports guide you need to replace istio-ingressgateway with kourier-gateway . For example if you choose to do Port Forward for testing you need to select the kourier-gateway pod as following. kubectl port-forward --namespace kourier-system \\ $( kubectl get pod -n kourier-system -l \"app=3scale-kourier-gateway\" --output = jsonpath = \"{.items[0].metadata.name}\" ) 8080 :8080 export INGRESS_HOST = localhost export INGRESS_PORT = 8080 Make sure that you create a file named pmml-input.json with the following content, under your current terminal path. { \"instances\" : [ [ 5.1 , 3.5 , 1.4 , 0.2 ] ] } Send a prediction request to the InferenceService and check the output. MODEL_NAME = pmml-demo INPUT_PATH = @./pmml-input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice pmml-demo -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict -d $INPUT_PATH Expected Output * Trying 127 .0.0.1... 
* TCP_NODELAY set * Connected to localhost ( 127 .0.0.1 ) port 8080 ( #0) > POST /v1/models/pmml-demo:predict HTTP/1.1 > Host: pmml-demo-predictor-default.default.example.com > User-Agent: curl/7.58.0 > Accept: */* > Content-Length: 45 > Content-Type: application/x-www-form-urlencoded > * upload completely sent off: 45 out of 45 bytes < HTTP/1.1 200 OK < content-length: 144 < content-type: application/json ; charset = UTF-8 < date: Wed, 14 Sep 2022 13 :30:09 GMT < server: envoy < x-envoy-upstream-service-time: 58 < * Connection #0 to host localhost left intact { \"predictions\" : [{ \"Species\" : \"setosa\" , \"Probability_setosa\" : 1 .0, \"Probability_versicolor\" : 0 .0, \"Probability_virginica\" : 0 .0, \"Node_Id\" : \"2\" }]}","title":"Kourier Networking Layer"},{"location":"admin/serverless/kourier_networking/#deploy-inferenceservice-with-alternative-networking-layer","text":"KServe creates the top level Istio Virtual Service for routing to InferenceService components based on the virtual host or path based routing. Now KServe provides an option for disabling the top level virtual service to allow configuring other networking layers Knative supports. For example, Kourier is an alternative networking layer and the following steps show how you can deploy KServe with Kourier .","title":"Deploy InferenceService with Alternative Networking Layer"},{"location":"admin/serverless/kourier_networking/#install-kourier-networking-layer","text":"Please refer to the Serverless Installation Guide and change the second step to install Kourier instead of Istio . 
Install the Kourier networking layer: kubectl apply -f https://github.com/knative/net-kourier/releases/download/ ${ KNATIVE_VERSION } /kourier.yaml Configure Knative Serving to use Kourier: kubectl patch configmap/config-network \\ --namespace knative-serving \\ --type merge \\ --patch '{\"data\":{\"ingress-class\":\"kourier.ingress.networking.knative.dev\"}}' Verify Kourier installation: kubectl get pods -n knative-serving && kubectl get pods -n kourier-system Expected Output NAME READY STATUS RESTARTS AGE activator-77db7d9dd7-kbrgr 1 /1 Running 0 10m autoscaler-67dbf79b95-htnp9 1 /1 Running 0 10m controller-684b6bc97f-ffm58 1 /1 Running 0 10m domain-mapping-6d99d99978-ktmrf 1 /1 Running 0 10m domainmapping-webhook-5f998498b6-sddnm 1 /1 Running 0 10m net-kourier-controller-68967d76dc-ncj2n 1 /1 Running 0 10m webhook-97bdc7b4d-nr7qf 1 /1 Running 0 10m NAME READY STATUS RESTARTS AGE 3scale-kourier-gateway-54c49c8ff5-x8tgn 1 /1 Running 0 10m Edit inferenceservice-config configmap to disable Istio top level virtual host: kubectl edit configmap/inferenceservice-config --namespace kserve # Add the flag `\"disableIstioVirtualHost\": true` under the ingress section ingress : | - { \"disableIstioVirtualHost\" : true } Restart the KServe Controller kubectl rollout restart deployment kserve-controller-manager -n kserve","title":"Install Kourier Networking Layer"},{"location":"admin/serverless/kourier_networking/#deploy-inferenceservice-for-testing-kourier-gateway","text":"","title":"Deploy InferenceService for Testing Kourier Gateway"},{"location":"admin/serverless/kourier_networking/#create-the-inferenceservice","text":"New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : model : modelFormat : name : pmml storageUri : \"gs://kfserving-examples/models/pmml\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : pmml : 
storageUri : gs://kfserving-examples/models/pmml kubectl apply -f pmml.yaml Expected Output $ inferenceservice.serving.kserve.io/pmml-demo created","title":"Create the InferenceService"},{"location":"admin/serverless/kourier_networking/#run-a-prediction","text":"Note that when setting INGRESS_HOST and INGRESS_PORT following the determining the ingress IP and ports guide you need to replace istio-ingressgateway with kourier-gateway . For example if you choose to do Port Forward for testing you need to select the kourier-gateway pod as following. kubectl port-forward --namespace kourier-system \\ $( kubectl get pod -n kourier-system -l \"app=3scale-kourier-gateway\" --output = jsonpath = \"{.items[0].metadata.name}\" ) 8080 :8080 export INGRESS_HOST = localhost export INGRESS_PORT = 8080 Make sure that you create a file named pmml-input.json with the following content, under your current terminal path. { \"instances\" : [ [ 5.1 , 3.5 , 1.4 , 0.2 ] ] } Send a prediction request to the InferenceService and check the output. MODEL_NAME = pmml-demo INPUT_PATH = @./pmml-input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice pmml-demo -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict -d $INPUT_PATH Expected Output * Trying 127 .0.0.1... 
* TCP_NODELAY set * Connected to localhost ( 127 .0.0.1 ) port 8080 ( #0) > POST /v1/models/pmml-demo:predict HTTP/1.1 > Host: pmml-demo-predictor-default.default.example.com > User-Agent: curl/7.58.0 > Accept: */* > Content-Length: 45 > Content-Type: application/x-www-form-urlencoded > * upload completely sent off: 45 out of 45 bytes < HTTP/1.1 200 OK < content-length: 144 < content-type: application/json ; charset = UTF-8 < date: Wed, 14 Sep 2022 13 :30:09 GMT < server: envoy < x-envoy-upstream-service-time: 58 < * Connection #0 to host localhost left intact { \"predictions\" : [{ \"Species\" : \"setosa\" , \"Probability_setosa\" : 1 .0, \"Probability_versicolor\" : 0 .0, \"Probability_virginica\" : 0 .0, \"Node_Id\" : \"2\" }]}","title":"Run a Prediction"},{"location":"admin/serverless/servicemesh/","text":"Secure InferenceService with ServiceMesh \u00b6 A service mesh is a dedicated infrastructure layer that you can add to your InferenceService to allow you to transparently add capabilities like observability, traffic management and security. In this example we show how you can turn on the Istio service mesh mode to provide a uniform and efficient way to secure service-to-service communication in a cluster with TLS encryption, strong identity-based authentication and authorization. Turn on strict mTLS and Authorization Policy \u00b6 For namespace traffic isolation, we lock down the in cluster traffic to only allow requests from the same namespace and enable mTLS for TLS encryption and strong identity-based authentication. Because Knative requests are frequently routed through activator, when turning on mTLS additional traffic rules are required and activator/autoscaler in knative-serving namespace must have sidecar injected as well. For more details please see mTLS in Knative , to understand when requests are forwarded through the activator, see target burst capacity docs. Create the namespace user1 which is used for this example. 
kubectl create namespace user1 When activator is not on the request path, the rule checks if the source namespace of the request is the same as the destination namespace of InferenceService . When activator is on the request path, the rule checks the source namespace knative-serving namespace as the request is proxied through activator. Warning Currently when activator is on the request path, it is not able to check the originated namespace or original identity due to the net-istio issue . apiVersion : security.istio.io/v1beta1 kind : PeerAuthentication metadata : name : default namespace : user1 spec : mtls : mode : STRICT --- apiVersion : security.istio.io/v1beta1 kind : AuthorizationPolicy metadata : name : allow-serving-tests namespace : user1 spec : action : ALLOW rules : # 1. mTLS for service from source \"user1\" namespace to destination service when TargetBurstCapacity=0 without local gateway and activator on the path # Source Service from \"user1\" namespace -> Destination Service in \"user1\" namespace - from : - source : namespaces : [ \"user1\" ] # 2. mTLS for service from source \"user1\" namespace to destination service with activator on the path # Source Service from \"user1\" namespace -> Activator(Knative Serving namespace) -> Destination service in \"user1\" namespace # unfortunately currently we could not lock down the source namespace as Activator does not capture the source namespace when proxying the request, see https://github.com/knative-sandbox/net-istio/issues/554. - from : - source : namespaces : [ \"knative-serving\" ] # 3. 
allow metrics and probes from knative serving namespaces - from : - source : namespaces : [ \"knative-serving\" ] to : - operation : paths : [ \"/metrics\" , \"/healthz\" , \"/ready\" , \"/wait-for-drain\" ] Apply the PeerAuthentication and AuthorizationPolicy rules with auth.yaml : kubectl apply -f auth.yaml Disable Top Level Virtual Service \u00b6 KServe currently creates an Istio top level virtual service to support routing between InferenceService components like predictor, transformer and explainer, as well as support path based routing as an alternative routing with service hosts. In serverless service mesh mode this creates a problem that in order to route through the underlying virtual service created by Knative Service, the top level virtual service is required to route to the Istio Gateway instead of to the InferenceService component on the service mesh directly. By disabling the top level virtual service, it eliminates the extra route to Istio local gateway and the authorization policy can check the source namespace when mTLS is established directly between service to service and activator is not on the request path. To disable the top level virtual service, add the flag \"disableIstioVirtualHost\": true under the ingress config in inferenceservice configmap. kubectl edit configmap/inferenceservice-config --namespace kserve ingress : | - { \"disableIstioVirtualHost\" : true } Deploy InferenceService with Istio sidecar injection \u00b6 First label the namespace with istio-injection=enabled to turn on the sidecar injection for the namespace. 
kubectl label namespace user1 istio-injection = enabled --overwrite Create the InferenceService with and without Knative activator on the path: When autoscaling.knative.dev/targetBurstCapacity is set to 0, Knative removes the activator from the request path so the test service can directly establish the mTLS connection to the InferenceService and the authorization policy can check the original namespace of the request to lock down the traffic for namespace isolation. InferenceService with activator on path InferenceService without activator on path apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"sklearn-iris-burst\" namespace : user1 annotations : \"sidecar.istio.io/inject\" : \"true\" spec : predictor : model : modelFormat : name : sklearn storageUri : \"gs://kfserving-examples/models/sklearn/1.0/model\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"sklearn-iris\" namespace : user1 annotations : \"autoscaling.knative.dev/targetBurstCapacity\" : \"0\" \"sidecar.istio.io/inject\" : \"true\" spec : predictor : model : modelFormat : name : sklearn storageUri : \"gs://kfserving-examples/models/sklearn/1.0/model\" kubectl apply -f sklearn_iris.yaml Expected Output $ inferenceservice.serving.kserve.io/sklearn-iris created $ inferenceservice.serving.kserve.io/sklearn-iris-burst created kubectl get pods -n user1 NAME READY STATUS RESTARTS AGE httpbin-6484879498-qxqj8 2 /2 Running 0 19h sklearn-iris-burst-predictor-default-00001-deployment-5685n46f6 3 /3 Running 0 12h sklearn-iris-predictor-default-00001-deployment-985d5cd46-zzw4x 3 /3 Running 0 12h Run a prediction from the same namespace \u00b6 Deploy a test service in user1 namespace with httpbin.yaml . kubectl apply -f httpbin.yaml Run a prediction request to the sklearn-iris InferenceService without activator on the path, you are expected to get HTTP 200 as the authorization rule allows traffic from the same namespace. 
kubectl exec -it httpbin-6484879498-qxqj8 -c istio-proxy -n user1 -- curl -v sklearn-iris-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris Expected Output * Connected to sklearn-iris-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris HTTP/1.1 > Host: sklearn-iris-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.81.0 > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 200 OK < content-length: 36 < content-type: application/json < date: Sat, 26 Nov 2022 01 :45:10 GMT < server: istio-envoy < x-envoy-upstream-service-time: 42 < * Connection #0 to host sklearn-iris-predictor-default.user1.svc.cluster.local left intact { \"name\" : \"sklearn-iris\" , \"ready\" :true } Run a prediction request to the sklearn-iris-burst InferenceService with activator on the path, you are expected to get HTTP 200 as the authorization rule allows traffic from knative-serving namespace. kubectl exec -it httpbin-6484879498-qxqj8 -c istio-proxy -n user1 -- curl -v sklearn-iris-burst-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris-burst Expected Output * Connected to sklearn-iris-burst-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris-burst HTTP/1.1 > Host: sklearn-iris-burst-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.81.0 > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 200 OK < content-length: 42 < content-type: application/json < date: Sat, 26 Nov 2022 13 :55:14 GMT < server: istio-envoy < x-envoy-upstream-service-time: 209 < * Connection #0 to host sklearn-iris-burst-predictor-default.user1.svc.cluster.local left intact { \"name\" : \"sklearn-iris-burst\" , \"ready\" :true } Run a prediction from a different namespace \u00b6 Deploy a test service in default namespace with sleep.yaml which is different from the namespace the InferenceService is deployed to. 
kubectl apply -f sleep.yaml When you send a prediction request from a different namespace to the sklearn-iris InferenceService, which has no activator on the request path, you should get HTTP 403 \"RBAC denied\" because the authorization rule only allows traffic from the user1 namespace where the InferenceService is deployed. kubectl exec -it sleep-6d6b49d8b8-6ths6 -- curl -v sklearn-iris-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris Expected Output * Connected to sklearn-iris-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris HTTP/1.1 > Host: sklearn-iris-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.86.0-DEV > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 403 Forbidden < content-length: 19 < content-type: text/plain < date: Sat, 26 Nov 2022 02 :45:46 GMT < server: envoy < x-envoy-upstream-service-time: 14 < * Connection #0 to host sklearn-iris-predictor-default.user1.svc.cluster.local left intact When you send a prediction request from a different namespace to the sklearn-iris-burst InferenceService, which has the activator on the request path, you currently get an HTTP 200 response because of the limitation above: the request is proxied through the activator in the knative-serving namespace, so the authorization policy cannot restrict traffic to the same namespace. We expect HTTP 403 once the upstream Knative net-istio issue is fixed. 
kubectl exec -it sleep-6d6b49d8b8-6ths6 -- curl -v sklearn-iris-burst-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris-burst Expected Output * Connected to sklearn-iris-burst-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris-burst HTTP/1.1 > Host: sklearn-iris-burst-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.86.0-DEV > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 200 OK < content-length: 42 < content-type: application/json < date: Sat, 26 Nov 2022 13 :59:04 GMT < server: envoy < x-envoy-upstream-service-time: 6 < * Connection #0 to host sklearn-iris-burst-predictor-default.user1.svc.cluster.local left intact { \"name\" : \"sklearn-iris-burst\" , \"ready\" :true }","title":"Istio Service Mesh"},{"location":"admin/serverless/servicemesh/#secure-inferenceservice-with-servicemesh","text":"A service mesh is a dedicated infrastructure layer that you can add to your InferenceService to allow you to transparently add capabilities like observability, traffic management and security. In this example we show how you can turn on the Istio service mesh mode to provide a uniform and efficient way to secure service-to-service communication in a cluster with TLS encryption, strong identity-based authentication and authorization.","title":"Secure InferenceService with ServiceMesh"},{"location":"admin/serverless/servicemesh/#turn-on-strict-mtls-and-authorization-policy","text":"For namespace traffic isolation, we lock down the in cluster traffic to only allow requests from the same namespace and enable mTLS for TLS encryption and strong identity-based authentication. Because Knative requests are frequently routed through activator, when turning on mTLS additional traffic rules are required and activator/autoscaler in knative-serving namespace must have sidecar injected as well. 
For more details please see mTLS in Knative , to understand when requests are forwarded through the activator, see target burst capacity docs. Create the namespace user1 which is used for this example. kubectl create namespace user1 When activator is not on the request path, the rule checks if the source namespace of the request is the same as the destination namespace of InferenceService . When activator is on the request path, the rule checks the source namespace knative-serving namespace as the request is proxied through activator. Warning Currently when activator is on the request path, it is not able to check the originated namespace or original identity due to the net-istio issue . apiVersion : security.istio.io/v1beta1 kind : PeerAuthentication metadata : name : default namespace : user1 spec : mtls : mode : STRICT --- apiVersion : security.istio.io/v1beta1 kind : AuthorizationPolicy metadata : name : allow-serving-tests namespace : user1 spec : action : ALLOW rules : # 1. mTLS for service from source \"user1\" namespace to destination service when TargetBurstCapacity=0 without local gateway and activator on the path # Source Service from \"user1\" namespace -> Destination Service in \"user1\" namespace - from : - source : namespaces : [ \"user1\" ] # 2. mTLS for service from source \"user1\" namespace to destination service with activator on the path # Source Service from \"user1\" namespace -> Activator(Knative Serving namespace) -> Destination service in \"user1\" namespace # unfortunately currently we could not lock down the source namespace as Activator does not capture the source namespace when proxying the request, see https://github.com/knative-sandbox/net-istio/issues/554. - from : - source : namespaces : [ \"knative-serving\" ] # 3. 
allow metrics and probes from knative serving namespaces - from : - source : namespaces : [ \"knative-serving\" ] to : - operation : paths : [ \"/metrics\" , \"/healthz\" , \"/ready\" , \"/wait-for-drain\" ] Apply the PeerAuthentication and AuthorizationPolicy rules with auth.yaml : kubectl apply -f auth.yaml","title":"Turn on strict mTLS and Authorization Policy"},{"location":"admin/serverless/servicemesh/#disable-top-level-virtual-service","text":"KServe currently creates an Istio top level virtual service to support routing between InferenceService components like predictor, transformer and explainer, as well as support path based routing as an alternative routing with service hosts. In serverless service mesh mode this creates a problem that in order to route through the underlying virtual service created by Knative Service, the top level virtual service is required to route to the Istio Gateway instead of to the InferenceService component on the service mesh directly. By disabling the top level virtual service, it eliminates the extra route to Istio local gateway and the authorization policy can check the source namespace when mTLS is established directly between service to service and activator is not on the request path. To disable the top level virtual service, add the flag \"disableIstioVirtualHost\": true under the ingress config in inferenceservice configmap. kubectl edit configmap/inferenceservice-config --namespace kserve ingress : | - { \"disableIstioVirtualHost\" : true }","title":"Disable Top Level Virtual Service"},{"location":"admin/serverless/servicemesh/#deploy-inferenceservice-with-istio-sidecar-injection","text":"First label the namespace with istio-injection=enabled to turn on the sidecar injection for the namespace. 
kubectl label namespace user1 istio-injection = enabled --overwrite Create the InferenceService with and without Knative activator on the path: When autoscaling.knative.dev/targetBurstCapacity is set to 0, Knative removes the activator from the request path so the test service can directly establish the mTLS connection to the InferenceService and the authorization policy can check the original namespace of the request to lock down the traffic for namespace isolation. InferenceService with activator on path InferenceService without activator on path apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"sklearn-iris-burst\" namespace : user1 annotations : \"sidecar.istio.io/inject\" : \"true\" spec : predictor : model : modelFormat : name : sklearn storageUri : \"gs://kfserving-examples/models/sklearn/1.0/model\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"sklearn-iris\" namespace : user1 annotations : \"autoscaling.knative.dev/targetBurstCapacity\" : \"0\" \"sidecar.istio.io/inject\" : \"true\" spec : predictor : model : modelFormat : name : sklearn storageUri : \"gs://kfserving-examples/models/sklearn/1.0/model\" kubectl apply -f sklearn_iris.yaml Expected Output $ inferenceservice.serving.kserve.io/sklearn-iris created $ inferenceservice.serving.kserve.io/sklearn-iris-burst created kubectl get pods -n user1 NAME READY STATUS RESTARTS AGE httpbin-6484879498-qxqj8 2 /2 Running 0 19h sklearn-iris-burst-predictor-default-00001-deployment-5685n46f6 3 /3 Running 0 12h sklearn-iris-predictor-default-00001-deployment-985d5cd46-zzw4x 3 /3 Running 0 12h","title":"Deploy InferenceService with Istio sidecar injection"},{"location":"admin/serverless/servicemesh/#run-a-prediction-from-the-same-namespace","text":"Deploy a test service in user1 namespace with httpbin.yaml . 
kubectl apply -f httpbin.yaml Run a prediction request to the sklearn-iris InferenceService without activator on the path, you are expected to get HTTP 200 as the authorization rule allows traffic from the same namespace. kubectl exec -it httpbin-6484879498-qxqj8 -c istio-proxy -n user1 -- curl -v sklearn-iris-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris Expected Output * Connected to sklearn-iris-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris HTTP/1.1 > Host: sklearn-iris-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.81.0 > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 200 OK < content-length: 36 < content-type: application/json < date: Sat, 26 Nov 2022 01 :45:10 GMT < server: istio-envoy < x-envoy-upstream-service-time: 42 < * Connection #0 to host sklearn-iris-predictor-default.user1.svc.cluster.local left intact { \"name\" : \"sklearn-iris\" , \"ready\" :true } Run a prediction request to the sklearn-iris-burst InferenceService with activator on the path, you are expected to get HTTP 200 as the authorization rule allows traffic from knative-serving namespace. 
kubectl exec -it httpbin-6484879498-qxqj8 -c istio-proxy -n user1 -- curl -v sklearn-iris-burst-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris-burst Expected Output * Connected to sklearn-iris-burst-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris-burst HTTP/1.1 > Host: sklearn-iris-burst-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.81.0 > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 200 OK < content-length: 42 < content-type: application/json < date: Sat, 26 Nov 2022 13 :55:14 GMT < server: istio-envoy < x-envoy-upstream-service-time: 209 < * Connection #0 to host sklearn-iris-burst-predictor-default.user1.svc.cluster.local left intact { \"name\" : \"sklearn-iris-burst\" , \"ready\" :true }","title":"Run a prediction from the same namespace"},{"location":"admin/serverless/servicemesh/#run-a-prediction-from-a-different-namespace","text":"Deploy a test service in default namespace with sleep.yaml which is different from the namespace the InferenceService is deployed to. kubectl apply -f sleep.yaml When you send a prediction request to the sklearn-iris InferenceService without activator on the request path from a different namespace, you are expected to get HTTP 403 \"RBAC denied\" as the authorization rule only allows the traffic from the same namespace user1 where the InferenceService is deployed. 
kubectl exec -it sleep-6d6b49d8b8-6ths6 -- curl -v sklearn-iris-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris Expected Output * Connected to sklearn-iris-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris HTTP/1.1 > Host: sklearn-iris-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.86.0-DEV > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 403 Forbidden < content-length: 19 < content-type: text/plain < date: Sat, 26 Nov 2022 02 :45:46 GMT < server: envoy < x-envoy-upstream-service-time: 14 < * Connection #0 to host sklearn-iris-predictor-default.user1.svc.cluster.local left intact When you send a prediction request from a different namespace to the sklearn-iris-burst InferenceService, which has the activator on the request path, you currently get an HTTP 200 response because of the limitation above: the request is proxied through the activator in the knative-serving namespace, so the authorization policy cannot restrict traffic to the same namespace. We expect HTTP 403 once the upstream Knative net-istio issue is fixed. 
kubectl exec -it sleep-6d6b49d8b8-6ths6 -- curl -v sklearn-iris-burst-predictor-default.user1.svc.cluster.local/v1/models/sklearn-iris-burst Expected Output * Connected to sklearn-iris-burst-predictor-default.user1.svc.cluster.local ( 10 .96.137.152 ) port 80 ( #0) > GET /v1/models/sklearn-iris-burst HTTP/1.1 > Host: sklearn-iris-burst-predictor-default.user1.svc.cluster.local > User-Agent: curl/7.86.0-DEV > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 200 OK < content-length: 42 < content-type: application/json < date: Sat, 26 Nov 2022 13 :59:04 GMT < server: envoy < x-envoy-upstream-service-time: 6 < * Connection #0 to host sklearn-iris-burst-predictor-default.user1.svc.cluster.local left intact { \"name\" : \"sklearn-iris-burst\" , \"ready\" :true }","title":"Run a prediction from a different namespace"},{"location":"api/api/","text":"KServe API \u00b6","title":"KServe API"},{"location":"api/api/#kserve-api","text":"","title":"KServe API"},{"location":"blog/_index/","text":"","title":" index"},{"location":"blog/articles/2021-09-27-kfserving-transition/","text":"Authors \u00b6 Dan Sun and Animesh Singh on behalf of the Kubeflow Serving Working Group KFServing is now KServe \u00b6 We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the KFServing GitHub repository has now been transferred to an independent KServe GitHub organization under the stewardship of the Kubeflow Serving Working Group leads. The project has been rebranded from KFServing to KServe , and we are planning to graduate the project from Kubeflow Project later this year. Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon in 2019, KFServing was published as open source in early 2019. The project sets out to provide the following features: - A simple, yet powerful, Kubernetes Custom Resource for deploying machine learning (ML) models on production across ML frameworks. 
- Provide performant, standardized inference protocol. - Serverless inference according to live traffic patterns, supporting \u201cScale-to-zero\u201d on both CPUs and GPUs. - Complete story for production ML Model Serving including prediction, pre/post-processing, explainability, and monitoring. - Support for deploying thousands of models at scale and inference graph capability for multiple models. KFServing was created to address the challenges of deploying and monitoring machine learning models on production for organizations. After publishing the open source project, we\u2019ve seen an explosion in demand for the software, leading to strong adoption and community growth. The scope of the project has since increased, and we have developed multiple components along the way, including our own growing body of documentation that needs its own website and independent GitHub organization. What's Next \u00b6 Over the coming weeks, we will be releasing KServe 0.7 outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruptions. KFServing 0.5.x/0.6.x releases will remain supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for the conformance program . For contributors, please follow the KServe developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users! KServe Key Links \u00b6 Website GitHub Slack(#kubeflow-kfserving) Contributor Acknowledgement \u00b6 We'd like to thank all the KServe contributors for this transition work! 
Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"KFserving Transition"},{"location":"blog/articles/2021-09-27-kfserving-transition/#authors","text":"Dan Sun and Animesh Singh on behalf of the Kubeflow Serving Working Group","title":"Authors"},{"location":"blog/articles/2021-09-27-kfserving-transition/#kfserving-is-now-kserve","text":"We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the KFServing GitHub repository has now been transferred to an independent KServe GitHub organization under the stewardship of the Kubeflow Serving Working Group leads. The project has been rebranded from KFServing to KServe , and we are planning to graduate the project from Kubeflow Project later this year. Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon in 2019, KFServing was published as open source in early 2019. The project sets out to provide the following features: - A simple, yet powerful, Kubernetes Custom Resource for deploying machine learning (ML) models on production across ML frameworks. - Provide performant, standardized inference protocol. - Serverless inference according to live traffic patterns, supporting \u201cScale-to-zero\u201d on both CPUs and GPUs. - Complete story for production ML Model Serving including prediction, pre/post-processing, explainability, and monitoring. - Support for deploying thousands of models at scale and inference graph capability for multiple models. KFServing was created to address the challenges of deploying and monitoring machine learning models on production for organizations. After publishing the open source project, we\u2019ve seen an explosion in demand for the software, leading to strong adoption and community growth. 
The scope of the project has since increased, and we have developed multiple components along the way, including our own growing body of documentation that needs its own website and independent GitHub organization.","title":"KFServing is now KServe"},{"location":"blog/articles/2021-09-27-kfserving-transition/#whats-next","text":"Over the coming weeks, we will be releasing KServe 0.7 outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruptions. KFServing 0.5.x/0.6.x releases will remain supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for the conformance program . For contributors, please follow the KServe developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!","title":"What's Next"},{"location":"blog/articles/2021-09-27-kfserving-transition/#kserve-key-links","text":"Website GitHub Slack(#kubeflow-kfserving)","title":"KServe Key Links"},{"location":"blog/articles/2021-09-27-kfserving-transition/#contributor-acknowledgement","text":"We'd like to thank all the KServe contributors for this transition work! Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"Contributor Acknowledgement"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/","text":"Authors \u00b6 Dan Sun , Animesh Singh , Yuzhui Liu , Vedant Padwal on behalf of the KServe Working Group. KFServing is now KServe and the KServe 0.7 release is available; the release also ensures a smooth user migration experience from KFServing to KServe. What's Changed? 
\u00b6 InferenceService API group is changed from serving.kubeflow.org to serving.kserve.io #1826 , the migration job is created for smooth transition. Python SDK name is changed from kfserving to kserve . KServe Installation manifests #1824 . Models-web-app is separated out of the kserve repository to models-web-app . Docs and examples are moved to separate repository website . KServe images are migrated to kserve docker hub account. v1alpha2 API group is deprecated #1850 . \ud83c\udf08 What's New? \u00b6 ModelMesh project is joining KServe under repository modelmesh-serving ! ModelMesh is designed for high-scale, high-density and frequently-changing model use cases. ModelMesh intelligently loads and unloads AI models to and from memory to strike an intelligent trade-off between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the ModelMesh announcement blog and Join talk at #KubeCon NA to get a deeper dive into ModelMesh and KServe . (Alpha feature) Raw Kubernetes deployment support, Istio/Knative dependency is now optional and please follow the guide to install and turn on RawDeployment mode. KServe now has its own documentation website temporarily hosted on website . Support v1 crd and webhook configuration for Kubernetes 1.22 #1837 . Triton model serving runtime now defaults to 21.09 version #1840 . \ud83d\udc1e What's Fixed? \u00b6 Bug fix for Azure blob storage #1845 . Tar/Zip support for all storage options #1836 . Fix AWS_REGION env variable and add AWS_CA_BUNDLE for S3 #1780 . Torchserve custom package install fix #1619 . Join the community \u00b6 Visit our Website or GitHub Join the Slack(#kubeflow-kfserving) Attend a Biweekly community meeting on Wednesday 9am PST Contribute at developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users! 
Contributors \u00b6 We would like to thank everyone for their efforts on v0.7 Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"KServe 0.7 Release"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#authors","text":"Dan Sun , Animesh Singh , Yuzhui Liu , Vedant Padwal on behalf of the KServe Working Group. KFServing is now KServe and KServe 0.7 release is available, the release also ensures a smooth user migration experience from KFServing to KServe.","title":"Authors"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#whats-changed","text":"InferenceService API group is changed from serving.kubeflow.org to serving.kserve.io #1826 , the migration job is created for smooth transition. Python SDK name is changed from kfserving to kserve . KServe Installation manifests #1824 . Models-web-app is separated out of the kserve repository to models-web-app . Docs and examples are moved to separate repository website . KServe images are migrated to kserve docker hub account. v1alpha2 API group is deprecated #1850 .","title":"What's Changed?"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#whats-new","text":"ModelMesh project is joining KServe under repository modelmesh-serving ! ModelMesh is designed for high-scale, high-density and frequently-changing model use cases. ModelMesh intelligently loads and unloads AI models to and from memory to strike an intelligent trade-off between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the ModelMesh announcement blog and Join talk at #KubeCon NA to get a deeper dive into ModelMesh and KServe . (Alpha feature) Raw Kubernetes deployment support, Istio/Knative dependency is now optional and please follow the guide to install and turn on RawDeployment mode. 
KServe now has its own documentation website temporarily hosted on website . Support v1 crd and webhook configuration for Kubernetes 1.22 #1837 . Triton model serving runtime now defaults to 21.09 version #1840 .","title":"\ud83c\udf08 What's New?"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#whats-fixed","text":"Bug fix for Azure blob storage #1845 . Tar/Zip support for all storage options #1836 . Fix AWS_REGION env variable and add AWS_CA_BUNDLE for S3 #1780 . Torchserve custom package install fix #1619 .","title":"\ud83d\udc1e What's Fixed?"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack(#kubeflow-kfserving) Attend a Biweekly community meeting on Wednesday 9am PST Contribute at developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!","title":"Join the community"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#contributors","text":"We would like to thank everyone for their efforts on v0.7 Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"Contributors"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/","text":"Authors \u00b6 Dan Sun , Paul Van Eck , Vedant Padwal , Andrews Arokiam on behalf of the KServe Working Group. Announcing: KServe v0.8 \u00b6 February 18, 2022 Today, we are pleased to announce the v0.8.0 release of KServe! While the last release was focused on the transition of KFServing to KServe, this release was focused on unifying the InferenceService API for deploying models on KServe and ModelMesh. Note : For current users of KFServing/KServe, please take a few minutes to answer this short survey and provide your feedback! 
Now, let's take a look at some of the changes and additions to KServe. What\u2019s changed? \u00b6 ONNX Runtime Server has been removed from the supported serving runtime list. KServe by default now uses the Triton Inference Server to serve ONNX models. KServe\u2019s PyTorchServer has been removed from the supported serving runtime list. KServe by default now uses TorchServe to serve PyTorch models. A few main KServe SDK class names have been changed: KFModel is renamed to Model KFServer is renamed to ModelServer KFModelRepository is renamed to ModelRepository What's new? \u00b6 Some notable updates are: ClusterServingRuntime and ServingRuntime CRDs are introduced. Learn more below . A new Model Spec was introduced to the InferenceService Predictor Spec as a new way to specify models. Learn more below . Knative 1.0 is now supported and certified for the KServe Serverless installation. gRPC is now supported for transformer to predictor network communication. TorchServe Serving runtime has been updated to 0.5.2 which now supports the KServe V2 REST protocol. ModelMesh now has multi-namespace support, and users can now deploy GCS or HTTP(S) hosted models. To see all release updates, check out the KServe release notes and ModelMesh Serving release notes ! ServingRuntimes and ClusterServingRuntimes \u00b6 This release introduces two new CRDs ServingRuntimes and ClusterServingRuntimes with the only difference between these two is that one is namespace-scoped and one is cluster-scoped. A ServingRuntime defines the templates for Pods that can serve one or more particular model formats. Each ServingRuntime defines key information such as the container image of the runtime and a list of the model formats that the runtime supports. In previous versions of KServe, supported predictor formats and container images were defined in a config map in the control plane namespace. 
The ServingRuntime CRD allows improved flexibility and extensibility for defining or customizing runtimes as you see fit, without having to modify any controller code or any resources in the controller namespace. Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can continue to use KServe as they did before without having to define the runtimes themselves. Example SKLearn ClusterServingRuntime: apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : kserve-sklearnserver spec : supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true containers : - name : kserve-container image : kserve/sklearnserver:latest args : - --model_name={{.Name}} - --model_dir=/mnt/models - --http_port=8080 resources : requests : cpu : \"1\" memory : 2Gi limits : cpu : \"1\" memory : 2Gi Updated InferenceService Predictor Spec \u00b6 A new Model spec was also introduced as a part of the Predictor spec for InferenceServices. One of the problems KServe was having was that the InferenceService CRD was becoming unwieldy, with each model serving runtime being an object in the Predictor spec. This generated a lot of field duplication in the schema, bloating the overall size of the CRD. If a user wanted to introduce a new model serving framework for KServe to support, the CRD would have to be modified, and subsequently the controller code. Now, with the Model spec, a user can specify a model format and optionally a corresponding version. The KServe control plane will automatically select and use the ClusterServingRuntime or ServingRuntime that supports the given format. Each ServingRuntime maintains a list of supported model formats and versions. If a format has autoSelect set to true , the ServingRuntime becomes eligible for automatic model placement for that model format. 
New Schema Previous Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : sklearn : storageUri : s3://bucket/sklearn/mnist.joblib The previous way of defining predictors is still supported, however, the new approach will be the preferred one going forward. Eventually, the previous schema, with the framework names as keys in the predictor spec, will be removed. ModelMesh Updates \u00b6 ModelMesh has been in the process of integrating as KServe\u2019s multi-model serving backend. With the inclusion of the aforementioned ServingRuntime CRDs and the Predictor Model spec, the two projects are now much more aligned, with continual improvements underway. ModelMesh now supports multi-namespace reconciliation. Previously, the ModelMesh controller would only reconcile against resources deployed in the same namespace as the controller. Now, by default, ModelMesh will be able to handle InferenceService deployments in any \"modelmesh-enabled\" namespace. Learn more here . Also, while ModelMesh previously only supported S3-based storage, we are happy to share that ModelMesh now works with models hosted using GCS and HTTP(S). Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kubeflow-kfserving ) Attend a biweekly community meeting on Wednesday 9am PST View our developer and doc contribution guides to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! 
Thank you for trying out KServe!","title":"KServe 0.8 Release"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#authors","text":"Dan Sun , Paul Van Eck , Vedant Padwal , Andrews Arokiam on behalf of the KServe Working Group.","title":"Authors"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#announcing-kserve-v08","text":"February 18, 2022 Today, we are pleased to announce the v0.8.0 release of KServe! While the last release was focused on the transition of KFServing to KServe, this release was focused on unifying the InferenceService API for deploying models on KServe and ModelMesh. Note : For current users of KFServing/KServe, please take a few minutes to answer this short survey and provide your feedback! Now, let's take a look at some of the changes and additions to KServe.","title":"Announcing: KServe v0.8"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#whats-changed","text":"ONNX Runtime Server has been removed from the supported serving runtime list. KServe by default now uses the Triton Inference Server to serve ONNX models. KServe\u2019s PyTorchServer has been removed from the supported serving runtime list. KServe by default now uses TorchServe to serve PyTorch models. A few main KServe SDK class names have been changed: KFModel is renamed to Model KFServer is renamed to ModelServer KFModelRepository is renamed to ModelRepository","title":"What\u2019s changed?"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#whats-new","text":"Some notable updates are: ClusterServingRuntime and ServingRuntime CRDs are introduced. Learn more below . A new Model Spec was introduced to the InferenceService Predictor Spec as a new way to specify models. Learn more below . Knative 1.0 is now supported and certified for the KServe Serverless installation. gRPC is now supported for transformer to predictor network communication. TorchServe Serving runtime has been updated to 0.5.2 which now supports the KServe V2 REST protocol. 
ModelMesh now has multi-namespace support, and users can now deploy GCS or HTTP(S) hosted models. To see all release updates, check out the KServe release notes and ModelMesh Serving release notes !","title":"What's new?"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#servingruntimes-and-clusterservingruntimes","text":"This release introduces two new CRDs, ServingRuntimes and ClusterServingRuntimes; the only difference between the two is that the former is namespace-scoped and the latter is cluster-scoped. A ServingRuntime defines the templates for Pods that can serve one or more particular model formats. Each ServingRuntime defines key information such as the container image of the runtime and a list of the model formats that the runtime supports. In previous versions of KServe, supported predictor formats and container images were defined in a config map in the control plane namespace. The ServingRuntime CRD should allow for improved flexibility and extensibility for defining or customizing runtimes as you see fit without having to modify any controller code or any resources in the controller namespace. Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can continue to use KServe as they did before without having to define the runtimes themselves. 
Example SKLearn ClusterServingRuntime: apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : kserve-sklearnserver spec : supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true containers : - name : kserve-container image : kserve/sklearnserver:latest args : - --model_name={{.Name}} - --model_dir=/mnt/models - --http_port=8080 resources : requests : cpu : \"1\" memory : 2Gi limits : cpu : \"1\" memory : 2Gi","title":"ServingRuntimes and ClusterServingRuntimes"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#updated-inferenceservice-predictor-spec","text":"A new Model spec was also introduced as a part of the Predictor spec for InferenceServices. One of the problems KServe was having was that the InferenceService CRD was becoming unwieldy with each model serving runtime being an object in the Predictor spec. This generated a lot of field duplication in the schema, bloating the overall size of the CRD. If a user wanted to introduce a new model serving framework for KServe to support, the CRD would have to be modified, and subsequently the controller code. Now, with the Model spec, a user can specify a model format and optionally a corresponding version. The KServe control plane will automatically select and use the ClusterServingRuntime or ServingRuntime that supports the given format. Each ServingRuntime maintains a list of supported model formats and versions. If a format has autoselect as true , then that opens the ServingRuntime up for automatic model placement for that model format. 
New Schema Previous Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : sklearn : storageUri : s3://bucket/sklearn/mnist.joblib The previous way of defining predictors is still supported, however, the new approach will be the preferred one going forward. Eventually, the previous schema, with the framework names as keys in the predictor spec, will be removed.","title":"Updated InferenceService Predictor Spec"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#modelmesh-updates","text":"ModelMesh has been in the process of integrating as KServe\u2019s multi-model serving backend. With the inclusion of the aforementioned ServingRuntime CRDs and the Predictor Model spec, the two projects are now much more aligned, with continual improvements underway. ModelMesh now supports multi-namespace reconciliation. Previously, the ModelMesh controller would only reconcile against resources deployed in the same namespace as the controller. Now, by default, ModelMesh will be able to handle InferenceService deployments in any \"modelmesh-enabled\" namespace. Learn more here . Also, while ModelMesh previously only supported S3-based storage, we are happy to share that ModelMesh now works with models hosted using GCS and HTTP(S).","title":"ModelMesh Updates"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kubeflow-kfserving ) Attend a biweekly community meeting on Wednesday 9am PST View our developer and doc contribution guides to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! 
Thank you for trying out KServe!","title":"Join the community"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/","text":"Announcing: KServe v0.9.0 \u00b6 Today, we are pleased to announce the v0.9.0 release of KServe! KServe has now fully onboarded to LF AI & Data Foundation as an Incubation Project ! In this release we are excited to introduce the new InferenceGraph feature which has long been asked from the community. Also continuing the effort from the last release for unifying the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with KServe InferenceService API! Introduce InferenceGraph \u00b6 The ML Inference system is getting bigger and more complex. It often consists of many models to make a single prediction. The common use cases are image classification and natural language multi-stage processing pipelines. For example, an image classification pipeline needs to run top level classification first then downstream further classification based on previous prediction results. KServe has the unique strength to build the distributed inference graph with its native integration of InferenceServices, standard inference protocol for chaining models and serverless auto-scaling capabilities. KServe leverages these strengths to build the InferenceGraph and enable users to deploy complex ML Inference pipelines to production in a declarative and scalable way. InferenceGraph is made up of a list of routing nodes with each node consisting of a set of routing steps. Each step can either route to an InferenceService or another node defined on the graph which makes the InferenceGraph highly composable. The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four different types of routing nodes: Sequence , Switch , Ensemble , Splitter . 
Sequence Node : It allows users to define multiple Steps with InferenceServices or Nodes as routing targets in a sequence. The Steps are executed in sequence, and the request/response from the previous step can be passed to the next step as input based on configuration. Switch Node : It allows users to define routing conditions and select a Step to execute if it matches the condition. The response is returned as soon as it finds the first step that matches the condition. If no condition is matched, the graph returns the original request. Ensemble Node : A model ensemble requires scoring each model separately and then combining the results into a single prediction response. You can then use different combination methods to produce the final result. Multiple classification trees, for example, are commonly combined using a \"majority vote\" method. Multiple regression trees are often combined using various averaging techniques. Splitter Node : It allows users to split the traffic to multiple targets using a weighted distribution. 
apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"cat-dog-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/cat_dog_classification --- apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"dog-breed-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/dog_breed_classification --- apiVersion : \"serving.kserve.io/v1alpha1\" kind : \"InferenceGraph\" metadata : name : \"dog-breed-pipeline\" spec : nodes : root : routerType : Sequence steps : - serviceName : cat-dog-classifier name : cat_dog_classifier # step name - serviceName : dog-breed-classifier name : dog_breed_classifier data : $request condition : \"[@this].#(predictions.0==\\\"dog\\\")\" Currently InferenceGraph is supported with the Serverless deployment mode. You can try it out following the tutorial . InferenceService API for ModelMesh \u00b6 The InferenceService CRD is now the primary interface for interacting with ModelMesh. Some changes were made to the InferenceService spec to better facilitate ModelMesh\u2019s needs. Storage Spec \u00b6 To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret holding config/credentials for each of the storage backends from which models can be loaded. Example: storage : key : localMinIO # Credential key for the destination storage in the common secret path : sklearn # Model path inside the bucket # schemaPath: null # Optional schema files for payload schema parameters : # Parameters to override the default values inside the common secret. bucket : example-models Learn more here . 
Model Status \u00b6 For further alignment between ModelMesh and KServe, some additions to the InferenceService status were made. There is now a Model Status section which contains information about the model loaded in the predictor. New fields include: states - State information of the predictor's model. activeModelState - The state of the model currently being served by the predictor's endpoints. targetModelState - This will be set only when transitionStatus is not UpToDate , meaning that the target model differs from the currently-active model. transitionStatus - Indicates state of the predictor relative to its current spec. modelCopies - Model copy information of the predictor's model. lastFailureInfo - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value. Deploying on ModelMesh \u00b6 For deploying InferenceServices on ModelMesh, the ModelMesh and KServe controllers will still require that the user specifies the serving.kserve.io/deploymentMode: ModelMesh annotation. A complete example of an InferenceService with the new storage spec is shown below: apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-tensorflow-mnist annotations : serving.kserve.io/deploymentMode : ModelMesh spec : predictor : model : modelFormat : name : tensorflow storage : key : localMinIO path : tensorflow/mnist.savedmodel Other New Features: \u00b6 Support serving MLFlow model format via MLServer serving runtime. Support unified autoscaling target and metric fields for InferenceService components with both Serverless and RawDeployment mode. Support InferenceService ingress class and url domain template configuration for RawDeployment mode. ModelMesh now has a default OpenVINO Model Server ServingRuntime. What\u2019s Changed? \u00b6 The KServe controller manager is changed from StatefulSet to Deployment to support HA mode. 
log4j security vulnerability fix Upgrade TorchServe serving runtime to 0.6.0 Update MLServer serving runtime to 1.0.0 Check out the full release notes for KServe and ModelMesh for more details. Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thank you for contributing or checking out KServe! \u2013 The KServe Working Group","title":"KServe 0.9 Release"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#announcing-kserve-v090","text":"Today, we are pleased to announce the v0.9.0 release of KServe! KServe has now fully onboarded to LF AI & Data Foundation as an Incubation Project ! In this release we are excited to introduce the new InferenceGraph feature which has long been asked from the community. Also continuing the effort from the last release for unifying the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with KServe InferenceService API!","title":"Announcing: KServe v0.9.0"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#introduce-inferencegraph","text":"The ML Inference system is getting bigger and more complex. It often consists of many models to make a single prediction. The common use cases are image classification and natural language multi-stage processing pipelines. For example, an image classification pipeline needs to run top level classification first then downstream further classification based on previous prediction results. KServe has the unique strength to build the distributed inference graph with its native integration of InferenceServices, standard inference protocol for chaining models and serverless auto-scaling capabilities. 
KServe leverages these strengths to build the InferenceGraph and enable users to deploy complex ML Inference pipelines to production in a declarative and scalable way. InferenceGraph is made up of a list of routing nodes with each node consisting of a set of routing steps. Each step can either route to an InferenceService or another node defined on the graph which makes the InferenceGraph highly composable. The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four different types of routing nodes: Sequence , Switch , Ensemble , Splitter . Sequence Node : It allows users to define multiple Steps with InferenceServices or Nodes as routing targets in a sequence. The Steps are executed in sequence, and the request/response from the previous step can be passed to the next step as input based on configuration. Switch Node : It allows users to define routing conditions and select a Step to execute if it matches the condition. The response is returned as soon as it finds the first step that matches the condition. If no condition is matched, the graph returns the original request. Ensemble Node : A model ensemble requires scoring each model separately and then combining the results into a single prediction response. You can then use different combination methods to produce the final result. Multiple classification trees, for example, are commonly combined using a \"majority vote\" method. Multiple regression trees are often combined using various averaging techniques. Splitter Node : It allows users to split the traffic to multiple targets using a weighted distribution. 
apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"cat-dog-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/cat_dog_classification --- apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"dog-breed-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/dog_breed_classification --- apiVersion : \"serving.kserve.io/v1alpha1\" kind : \"InferenceGraph\" metadata : name : \"dog-breed-pipeline\" spec : nodes : root : routerType : Sequence steps : - serviceName : cat-dog-classifier name : cat_dog_classifier # step name - serviceName : dog-breed-classifier name : dog_breed_classifier data : $request condition : \"[@this].#(predictions.0==\\\"dog\\\")\" Currently InferenceGraph is supported with the Serverless deployment mode. You can try it out following the tutorial .","title":"Introduce InferenceGraph"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#inferenceservice-api-for-modelmesh","text":"The InferenceService CRD is now the primary interface for interacting with ModelMesh. Some changes were made to the InferenceService spec to better facilitate ModelMesh\u2019s needs.","title":"InferenceService API for ModelMesh"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#storage-spec","text":"To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret holding config/credentials for each of the storage backends from which models can be loaded. 
Example: storage : key : localMinIO # Credential key for the destination storage in the common secret path : sklearn # Model path inside the bucket # schemaPath: null # Optional schema files for payload schema parameters : # Parameters to override the default values inside the common secret. bucket : example-models Learn more here .","title":"Storage Spec"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#model-status","text":"For further alignment between ModelMesh and KServe, some additions to the InferenceService status were made. There is now a Model Status section which contains information about the model loaded in the predictor. New fields include: states - State information of the predictor's model. activeModelState - The state of the model currently being served by the predictor's endpoints. targetModelState - This will be set only when transitionStatus is not UpToDate , meaning that the target model differs from the currently-active model. transitionStatus - Indicates state of the predictor relative to its current spec. modelCopies - Model copy information of the predictor's model. lastFailureInfo - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value.","title":"Model Status"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#deploying-on-modelmesh","text":"For deploying InferenceServices on ModelMesh, the ModelMesh and KServe controllers will still require that the user specifies the serving.kserve.io/deploymentMode: ModelMesh annotation. 
A complete example of an InferenceService with the new storage spec is shown below: apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-tensorflow-mnist annotations : serving.kserve.io/deploymentMode : ModelMesh spec : predictor : model : modelFormat : name : tensorflow storage : key : localMinIO path : tensorflow/mnist.savedmodel","title":"Deploying on ModelMesh"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#other-new-features","text":"Support serving MLFlow model format via MLServer serving runtime. Support unified autoscaling target and metric fields for InferenceService components with both Serverless and RawDeployment mode. Support InferenceService ingress class and url domain template configuration for RawDeployment mode. ModelMesh now has a default OpenVINO Model Server ServingRuntime.","title":"Other New Features:"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#whats-changed","text":"The KServe controller manager is changed from StatefulSet to Deployment to support HA mode. log4j security vulnerability fix Upgrade TorchServe serving runtime to 0.6.0 Update MLServer serving runtime to 1.0.0 Check out the full release notes for KServe and ModelMesh for more details.","title":"What\u2019s Changed?"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thank you for contributing or checking out KServe! \u2013 The KServe Working Group","title":"Join the community"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/","text":"Announcing: KServe v0.10.0 \u00b6 We are excited to announce KServe 0.10 release. 
In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes and increased support coverage for Open(aka v2) inference protocol for both standard and ModelMesh InferenceService. KServe Networking Options \u00b6 Istio is now optional for both Serverless and RawDeployment mode. Please see the alternative networking guide for how you can enable other ingress options supported by Knative with Serverless mode. For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the service mesh setup guideline . KServe Telemetry for Serving Runtimes \u00b6 We have instrumented additional latency metrics in KServe Python ServingRuntimes for preprocess , predict and postprocess handlers. In Serverless mode we have extended Knative queue-proxy to enable metrics aggregation for both metrics exposed in queue-proxy and kserve-container from each ServingRuntime . Please read the prometheus metrics setup guideline for how to enable the metrics scraping and aggregations. Open(v2) Inference Protocol Support Coverage \u00b6 As there have been increasing adoptions for KServe v2 Inference Protocol from AMD Inference ServingRuntime which supports FPGAs and OpenVINO which now provides KServe REST and gRPC compatible API, in the issue we have proposed to rename to KServe Open Inference Protocol . In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes. Now, you can enable v2 REST/gRPC for both custom transformer and predictor with images built by implementing KServe Python SDK API. gRPC enables high performance inference data plane as it is built on top of HTTP/2 and binary data transportation which is more efficient to send over the wire compared to REST. Please see the detailed example for transformer and predictor . from kserve import Model def image_transform ( byte_array ): image_processing = transforms . 
Compose ([ transforms . ToTensor (), transforms . Normalize (( 0.1307 ,), ( 0.3081 ,)) ]) image = Image . open ( io . BytesIO ( byte_array )) tensor = image_processing ( image ) . numpy () return tensor class CustomModel ( Model ): def predict ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferResponse : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) output = self . model ( input_tensors ) torch . nn . functional . softmax ( output , dim = 1 ) values , top_5 = torch . topk ( output , 5 ) result = values . flatten () . tolist () response_id = generate_uuid () infer_output = InferOutput ( name = \"output-0\" , shape = list ( values . shape ), datatype = \"FP32\" , data = result ) infer_response = InferResponse ( model_name = self . name , infer_outputs = [ infer_output ], response_id = response_id ) return infer_response class CustomTransformer ( Model ): def preprocess ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferRequest : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) infer_inputs = [ InferInput ( name = \"INPUT__0\" , datatype = 'FP32' , shape = list ( input_tensors . shape ), data = input_tensors )] infer_request = InferRequest ( model_name = self . model_name , infer_inputs = infer_inputs ) return infer_request You can use the same Python API type InferRequest and InferResponse for both REST and gRPC protocol. KServe handles the underlying decoding and encoding according to the protocol. Warning A new headers argument is added to the custom handlers to pass http/gRPC headers or other metadata. You can also use this as context dict to pass data between handlers. If you have existing custom transformer or predictor, the headers argument is now required to add to the preprocess , predict and postprocess handlers. 
Please check the following matrix for supported ModelFormats and ServingRuntimes . Model Format v1 Open(v2) REST/gRPC Tensorflow \u2705 TFServing \u2705 Triton PyTorch \u2705 TorchServe \u2705 TorchServe TorchScript \u2705 TorchServe \u2705 Triton ONNX \u274c \u2705 Triton Scikit-learn \u2705 KServe \u2705 MLServer XGBoost \u2705 KServe \u2705 MLServer LightGBM \u2705 KServe \u2705 MLServer MLFlow \u274c \u2705 MLServer Custom \u2705 KServe \u2705 KServe Multi-Arch Image Support \u00b6 KServe control plane images kserve-controller , kserve/agent , kserve/router are now supported for multiple architectures: ppc64le , arm64 , amd64 , s390x . KServe Storage Credentials Support \u00b6 Currently, AWS users need to create a secret with long term/static IAM credentials for downloading models stored in S3. Security best practice is to use IAM role for service account(IRSA) which enables automatic credential rotation and fine-grained access control, see how to setup IRSA . Support Azure Blobs with managed identity . ModelMesh updates \u00b6 ModelMesh has continued to integrate itself as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing use of cluster-scoped ServingRuntimes, originally introduced in KServe 0.8. Additionally, ModelMesh introduced support for TorchServe enabling users to serve arbitrary PyTorch models (e.g. eager-mode) in the context of distributed-multi-model serving. Other limitations have been addressed as well, such as adding support for BYTES/string type tensors when using the REST inference API for inference requests that require them. Other Changes: \u00b6 For a complete change list please read the release notes from KServe v0.10 and ModelMesh v0.10 . Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . 
View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.10 release! Steve Larkin Stephan Schielke Curtis Maddalozzo Zhongcheng Lao Dimitris Aragiorgis Pan Li tjandy98 Sukumar Gaonkar Rachit Chauhan Rafael Vasquez Tim Kleinloog Christian Kadner ddelange Lize Cai sangjune.park Suresh Nakkeran Konstantinos Messis Matt Rose Alexa Griffith Jagadeesh J Alex Lembiyeuski Yuki Iwai Andrews Arokiam Xin Fu adilhusain-s Pranav Pandit C1berwiz dilverse Yuan Tang Dan Sun Nick Hill The KServe Working Group","title":"KServe 0.10 Release"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#announcing-kserve-v0100","text":"We are excited to announce KServe 0.10 release. In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes and increased support coverage for Open(aka v2) inference protocol for both standard and ModelMesh InferenceService.","title":"Announcing: KServe v0.10.0"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#kserve-networking-options","text":"Istio is now optional for both Serverless and RawDeployment mode. Please see the alternative networking guide for how you can enable other ingress options supported by Knative with Serverless mode. For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the service mesh setup guideline .","title":"KServe Networking Options"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#kserve-telemetry-for-serving-runtimes","text":"We have instrumented additional latency metrics in KServe Python ServingRuntimes for preprocess , predict and postprocess handlers. 
In Serverless mode we have extended Knative queue-proxy to enable metrics aggregation for both metrics exposed in queue-proxy and kserve-container from each ServingRuntime . Please read the prometheus metrics setup guideline for how to enable the metrics scraping and aggregations.","title":"KServe Telemetry for Serving Runtimes"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#openv2-inference-protocol-support-coverage","text":"As there have been increasing adoptions for KServe v2 Inference Protocol from AMD Inference ServingRuntime which supports FPGAs and OpenVINO which now provides KServe REST and gRPC compatible API, in the issue we have proposed to rename to KServe Open Inference Protocol . In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes. Now, you can enable v2 REST/gRPC for both custom transformer and predictor with images built by implementing KServe Python SDK API. gRPC enables high performance inference data plane as it is built on top of HTTP/2 and binary data transportation which is more efficient to send over the wire compared to REST. Please see the detailed example for transformer and predictor . from kserve import Model def image_transform ( byte_array ): image_processing = transforms . Compose ([ transforms . ToTensor (), transforms . Normalize (( 0.1307 ,), ( 0.3081 ,)) ]) image = Image . open ( io . BytesIO ( byte_array )) tensor = image_processing ( image ) . numpy () return tensor class CustomModel ( Model ): def predict ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferResponse : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) output = self . model ( input_tensors ) torch . nn . functional . softmax ( output , dim = 1 ) values , top_5 = torch . topk ( output , 5 ) result = values . flatten () . 
tolist () response_id = generate_uuid () infer_output = InferOutput ( name = \"output-0\" , shape = list ( values . shape ), datatype = \"FP32\" , data = result ) infer_response = InferResponse ( model_name = self . name , infer_outputs = [ infer_output ], response_id = response_id ) return infer_response class CustomTransformer ( Model ): def preprocess ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferRequest : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) infer_inputs = [ InferInput ( name = \"INPUT__0\" , datatype = 'FP32' , shape = list ( input_tensors . shape ), data = input_tensors )] infer_request = InferRequest ( model_name = self . model_name , infer_inputs = infer_inputs ) return infer_request You can use the same Python API type InferRequest and InferResponse for both REST and gRPC protocol. KServe handles the underlying decoding and encoding according to the protocol. Warning A new headers argument is added to the custom handlers to pass http/gRPC headers or other metadata. You can also use this as context dict to pass data between handlers. If you have existing custom transformer or predictor, the headers argument is now required to add to the preprocess , predict and postprocess handlers. Please check the following matrix for supported ModelFormats and ServingRuntimes . 
Model Format v1 Open(v2) REST/gRPC Tensorflow \u2705 TFServing \u2705 Triton PyTorch \u2705 TorchServe \u2705 TorchServe TorchScript \u2705 TorchServe \u2705 Triton ONNX \u274c \u2705 Triton Scikit-learn \u2705 KServe \u2705 MLServer XGBoost \u2705 KServe \u2705 MLServer LightGBM \u2705 KServe \u2705 MLServer MLFlow \u274c \u2705 MLServer Custom \u2705 KServe \u2705 KServe","title":"Open(v2) Inference Protocol Support Coverage"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#multi-arch-image-support","text":"KServe control plane images kserve-controller , kserve/agent , kserve/router are now supported for multiple architectures: ppc64le , arm64 , amd64 , s390x .","title":"Multi-Arch Image Support"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#kserve-storage-credentials-support","text":"Currently, AWS users need to create a secret with long term/static IAM credentials for downloading models stored in S3. Security best practice is to use IAM role for service account(IRSA) which enables automatic credential rotation and fine-grained access control, see how to setup IRSA . Support Azure Blobs with managed identity .","title":"KServe Storage Credentials Support"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#modelmesh-updates","text":"ModelMesh has continued to integrate itself as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing use of cluster-scoped ServingRuntimes, originally introduced in KServe 0.8. Additionally, ModelMesh introduced support for TorchServe enabling users to serve arbitrary PyTorch models (e.g. eager-mode) in the context of distributed-multi-model serving. 
Other limitations have been addressed as well, such as adding support for BYTES/string type tensors when using the REST inference API for inference requests that require them.","title":"ModelMesh updates"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#other-changes","text":"For a complete change list please read the release notes from KServe v0.10 and ModelMesh v0.10 .","title":"Other Changes:"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.10 release! Steve Larkin Stephan Schielke Curtis Maddalozzo Zhongcheng Lao Dimitris Aragiorgis Pan Li tjandy98 Sukumar Gaonkar Rachit Chauhan Rafael Vasquez Tim Kleinloog Christian Kadner ddelange Lize Cai sangjune.park Suresh Nakkeran Konstantinos Messis Matt Rose Alexa Griffith Jagadeesh J Alex Lembiyeuski Yuki Iwai Andrews Arokiam Xin Fu adilhusain-s Pranav Pandit C1berwiz dilverse Yuan Tang Dan Sun Nick Hill The KServe Working Group","title":"Join the community"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/","text":"Announcing: KServe v0.11 \u00b6 We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes and made enhancements to the KServe control plane, Python SDK Open Inference Protocol support, and dependency management. For ModelMesh we have added PVC, HPA, and payload logging features to ensure feature parity with KServe. Here is a summary of the key changes: KServe Core Inference Enhancements \u00b6 Support path based routing, which serves as an alternative to host based routing; the URL of the InferenceService could look like http:///serving// .
Please refer to the doc for how to enable path based routing. Introduced priority field for Serving Runtime custom resource to handle the case when you have multiple serving runtimes which support the same model formats, see more details from the serving runtime doc . Introduced Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes, example use cases are private model registry integration: apiVersion : \"serving.kserve.io/v1alpha1\" kind : ClusterStorageContainer metadata : name : default spec : container : name : storage-initializer image : kserve/model-registry:latest resources : requests : memory : 100Mi cpu : 100m limits : memory : 1Gi cpu : \"1\" supportedUriFormats : - prefix : model-registry:// Inference Graph enhancements for improving the API spec to support pod affinity and resource requirement fields. Dependency field with options Soft and Hard is introduced to handle error responses from the inference steps to decide whether to short-circuit the request in case of errors, see the following example with hard dependency with the node steps: apiVersion : serving.kserve.io/v1alpha1 kind : InferenceGraph metadata : name : graph_with_switch_node spec : nodes : root : routerType : Sequence steps : - name : \"rootStep1\" nodeName : node1 dependency : Hard - name : \"rootStep2\" serviceName : {{ success_200_isvc_id }} node1 : routerType : Switch steps : - name : \"node1Step1\" serviceName : {{ error_404_isvc_id }} condition : \"[@this].#(decision_picker==ERROR)\" dependency : Hard For more details please refer to the issue . Improved InferenceService debugging experience by adding the aggregated RoutesReady status and LastDeploymentReady condition to the InferenceService Status to differentiate the endpoint and deployment status. This applies to the serverless mode and for more details refer to the API docs . Enhanced Python SDK Dependency Management \u00b6 KServe has adopted poetry to manage python dependencies. 
You can now install the KServe SDK with locked dependencies using poetry install . While pip install still works, we highly recommend using poetry to ensure predictable dependency management. The KServe SDK is also slimmed down by making the cloud storage dependency optional, if you require storage dependency for custom serving runtimes you can still install with pip install kserve[storage] . KServe Python Runtimes Improvements \u00b6 KServe Python Runtimes including sklearnserver , lgbserver , xgbserver now support the open inference protocol for both REST and gRPC. Logging improvements including adding Uvicorn access logging and a default KServe logger. Postprocess handler has been aligned with open inference protocol, simplifying the underlying transportation protocol complexities. LLM Runtimes \u00b6 TorchServe LLM Runtime \u00b6 KServe now integrates with TorchServe 0.8, offering the support for LLM models that may not fit onto a single GPU. Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the detailed example for how to serve the LLM on KServe with TorchServe runtime. vLLM Runtime \u00b6 Serving LLM models can be surprisingly slow even on high-end GPUs; vLLM is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers. It supports continuous batching for increased throughput and GPU utilization, and paged attention to address the memory bottleneck in autoregressive decoding, where all the attention key-value tensors (KV cache) are kept in GPU memory to generate the next tokens. In the example we show how to deploy vLLM on KServe, and we expect further integration in KServe 0.12 with the proposed generate endpoint for the open inference protocol.
ModelMesh Updates \u00b6 Storing Models on Kubernetes Persistent Volumes (PVC) \u00b6 ModelMesh now allows to directly mount model files onto serving runtimes pods using Kubernetes Persistent Volumes . Depending on the selected storage solution this approach can significantly reduce latency when deploying new predictors, potentially remove the need for additional S3 cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether. Horizontal Pod Autoscaling (HPA) \u00b6 Kubernetes Horizontal Pod Autoscaling can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a HorizontalPodAutoscaler automatically updates the serving runtime deployment with the number of Pods to best match the demand. Model Metrics, Metrics Dashboard, Payload Event Logging \u00b6 ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or \"heavy hitter\" models and consequently fine-tune the deployments of those inference services, like allocating more resources or increasing the number of replicas for improved responsiveness or avoid frequent cache misses. A new Grafana dashboard was added to display the comprehensive set of Prometheus metrics like model loading and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment. The new PayloadProcessor interface can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems. What's Changed? \u00b6 To allow longer InferenceService name due to DNS max length limits from issue , the Default suffix in the inference service component(predictor/transformer/explainer) name has been removed for newly created InferenceServices. 
This affects the client that is using the component url directly instead of the top level InferenceService url. Status.address.url is now consistent for both serverless and raw deployment mode, the url path portion is dropped in serverless mode. Raw bytes are now accepted in v1 protocol, setting the right content-type header to application/json is required to recognize and decode the json payload if content-type is specified. curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json For a complete change list please read the release notes from KServe v0.11 and ModelMesh v0.11 . Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.11 release! The KServe Working Group","title":"KServe 0.11 Release"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#announcing-kserve-v011","text":"We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes and made enhancements to the KServe control plane, Python SDK Open Inference Protocol support, and dependency management. For ModelMesh we have added PVC, HPA, and payload logging features to ensure feature parity with KServe. Here is a summary of the key changes:","title":"Announcing: KServe v0.11"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#kserve-core-inference-enhancements","text":"Support path based routing, which serves as an alternative to host based routing; the URL of the InferenceService could look like http:///serving// . Please refer to the doc for how to enable path based routing.
Introduced priority field for Serving Runtime custom resource to handle the case when you have multiple serving runtimes which support the same model formats, see more details from the serving runtime doc . Introduced Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes, example use cases are private model registry integration: apiVersion : \"serving.kserve.io/v1alpha1\" kind : ClusterStorageContainer metadata : name : default spec : container : name : storage-initializer image : kserve/model-registry:latest resources : requests : memory : 100Mi cpu : 100m limits : memory : 1Gi cpu : \"1\" supportedUriFormats : - prefix : model-registry:// Inference Graph enhancements for improving the API spec to support pod affinity and resource requirement fields. Dependency field with options Soft and Hard is introduced to handle error responses from the inference steps to decide whether to short-circuit the request in case of errors, see the following example with hard dependency with the node steps: apiVersion : serving.kserve.io/v1alpha1 kind : InferenceGraph metadata : name : graph_with_switch_node spec : nodes : root : routerType : Sequence steps : - name : \"rootStep1\" nodeName : node1 dependency : Hard - name : \"rootStep2\" serviceName : {{ success_200_isvc_id }} node1 : routerType : Switch steps : - name : \"node1Step1\" serviceName : {{ error_404_isvc_id }} condition : \"[@this].#(decision_picker==ERROR)\" dependency : Hard For more details please refer to the issue . Improved InferenceService debugging experience by adding the aggregated RoutesReady status and LastDeploymentReady condition to the InferenceService Status to differentiate the endpoint and deployment status. 
This applies to the serverless mode and for more details refer to the API docs .","title":"KServe Core Inference Enhancements"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#enhanced-python-sdk-dependency-management","text":"KServe has adopted poetry to manage python dependencies. You can now install the KServe SDK with locked dependencies using poetry install . While pip install still works, we highly recommend using poetry to ensure predictable dependency management. The KServe SDK is also slimmed down by making the cloud storage dependency optional, if you require storage dependency for custom serving runtimes you can still install with pip install kserve[storage] .","title":"Enhanced Python SDK Dependency Management"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#kserve-python-runtimes-improvements","text":"KServe Python Runtimes including sklearnserver , lgbserver , xgbserver now support the open inference protocol for both REST and gRPC. Logging improvements including adding Uvicorn access logging and a default KServe logger. Postprocess handler has been aligned with open inference protocol, simplifying the underlying transportation protocol complexities.","title":"KServe Python Runtimes Improvements"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#llm-runtimes","text":"","title":"LLM Runtimes"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#torchserve-llm-runtime","text":"KServe now integrates with TorchServe 0.8, offering the support for LLM models that may not fit onto a single GPU. Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. 
You can see the detailed example for how to serve the LLM on KServe with TorchServe runtime.","title":"TorchServe LLM Runtime"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#vllm-runtime","text":"Serving LLM models can be surprisingly slow even on high-end GPUs; vLLM is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers. It supports continuous batching for increased throughput and GPU utilization, and paged attention to address the memory bottleneck in autoregressive decoding, where all the attention key-value tensors (KV cache) are kept in GPU memory to generate the next tokens. In the example we show how to deploy vLLM on KServe, and we expect further integration in KServe 0.12 with the proposed generate endpoint for the open inference protocol.","title":"vLLM Runtime"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#modelmesh-updates","text":"","title":"ModelMesh Updates"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#storing-models-on-kubernetes-persistent-volumes-pvc","text":"ModelMesh now allows to directly mount model files onto serving runtimes pods using Kubernetes Persistent Volumes . Depending on the selected storage solution this approach can significantly reduce latency when deploying new predictors, potentially remove the need for additional S3 cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.","title":"Storing Models on Kubernetes Persistent Volumes (PVC)"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#horizontal-pod-autoscaling-hpa","text":"Kubernetes Horizontal Pod Autoscaling can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas.
Instead, a HorizontalPodAutoscaler automatically updates the serving runtime deployment with the number of Pods to best match the demand.","title":"Horizontal Pod Autoscaling (HPA)"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#model-metrics-metrics-dashboard-payload-event-logging","text":"ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or \"heavy hitter\" models and consequently fine-tune the deployments of those inference services, like allocating more resources or increasing the number of replicas for improved responsiveness or avoid frequent cache misses. A new Grafana dashboard was added to display the comprehensive set of Prometheus metrics like model loading and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment. The new PayloadProcessor interface can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.","title":"Model Metrics, Metrics Dashboard, Payload Event Logging"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#whats-changed","text":"To allow longer InferenceService name due to DNS max length limits from issue , the Default suffix in the inference service component(predictor/transformer/explainer) name has been removed for newly created InferenceServices. This affects the client that is using the component url directly instead of the top level InferenceService url. Status.address.url is now consistent for both serverless and raw deployment mode, the url path portion is dropped in serverless mode. Raw bytes are now accepted in v1 protocol, setting the right content-type header to application/json is required to recognize and decode the json payload if content-type is specified. 
curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json For a complete change list please read the release notes from KServe v0.11 and ModelMesh v0.11 .","title":"What's Changed?"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.11 release! The KServe Working Group","title":"Join the community"},{"location":"blog/articles/_index/","text":"","title":" index"},{"location":"community/adopters/","text":"Adopters of KServe \u00b6 This page contains a list of organizations who are using KServe either in production, or providing integrations or deployment options with their Cloud or product offerings. If you'd like to be included here, please send a pull request which modifies this file. Please keep the list in alphabetical order. 
Organization Contact Advanced Micro Devices Varun Sharma Amazon Web Services Ellis Tarn Bloomberg Dan Sun Cars24 Swapnesh Khare Charmed Kubeflow from Canonical Daniela Plasencia Cisco Krishna Durai Cloudera Zoram Thanga CoreWeave Peter Salanki Gojek Willem Pienaar Deeploy Tim Kleinloog Halodoc ID Joinal Ahmed IBM Nick Hill Kubeflow on Google Cloud James Liu Inspur Qingshan Chen Max Kelsen Jacob O'Farrell Naver Mark Winter Nuance Jeff Griffith NVIDIA David Goodwin One Convergence Subra Ongole PITS Global Data Recovery Services Pheianox Red Hat Taneem Ibrahim Seldon Clive Cox Patterson Consulting Josh Patterson Samsung SDS Hanbae Seo Striveworks Jordan Yono Zillow Peilun Li Upstage JuHyung Son Intuit Rachit Chauhan","title":"Adopters"},{"location":"community/adopters/#adopters-of-kserve","text":"This page contains a list of organizations who are using KServe either in production, or providing integrations or deployment options with their Cloud or product offerings. If you'd like to be included here, please send a pull request which modifies this file. Please keep the list in alphabetical order.
Organization Contact Advanced Micro Devices Varun Sharma Amazon Web Services Ellis Tarn Bloomberg Dan Sun Cars24 Swapnesh Khare Charmed Kubeflow from Canonical Daniela Plasencia Cisco Krishna Durai Cloudera Zoram Thanga CoreWeave Peter Salanki Gojek Willem Pienaar Deeploy Tim Kleinloog Halodoc ID Joinal Ahmed IBM Nick Hill Kubeflow on Google Cloud James Liu Inspur Qingshan Chen Max Kelsen Jacob O'Farrell Naver Mark Winter Nuance Jeff Griffith NVIDIA David Goodwin One Convergence Subra Ongole PITS Global Data Recovery Services Pheianox Red Hat Taneem Ibrahim Seldon Clive Cox Patterson Consulting Josh Patterson Samsung SDS Hanbae Seo Striveworks Jordan Yono Zillow Peilun Li Upstage JuHyung Son Intuit Rachit Chauhan","title":"Adopters of KServe"},{"location":"community/presentations/","text":"KServe(Formerly KFServing) Presentations and Demos \u00b6 This page contains a list of presentations and demos. If you'd like to add a presentation or demo here, please send a pull request. Presentation/Demo Presenters Distributed Machine Learning Patterns from Manning Publications Yuan Tang KubeCon 2019: Introducing KFServing: Serverless Model Serving on Kubernetes Dan Sun, Ellis Tarn KubeCon 2019: Advanced Model Inferencing Leveraging KNative, Istio & Kubeflow Serving Animesh Singh, Clive Cox KubeflowDojo: KFServing - Production Model Serving Platform Animesh Singh, Tommy Li NVIDIA: Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing Dan Sun, David Goodwin KF Community: KFServing - Enabling Serverless Workloads Across Model Frameworks Ellis Tarn KubeflowDojo: Demo - KFServing End to End through Notebook Animesh Singh, Tommy Li KubeflowDojo: Demo - KFServing with Kafka and Kubeflow Pipelines Animesh Singh Anchor MLOps Podcast: Serving Models with KFServing David Aponte, Demetrios Brinkmann Kubeflow 101: What is KFServing?
Stephanie Wong ICML 2020, Workshop on Challenges in Deploying and Monitoring Machine Learning Systems : Serverless inferencing on Kubernetes Clive Cox Serverless Practitioners Summit 2020: Serverless Machine Learning Inference with KFServing Clive Cox, Yuzhui Liu MLOps Meetup: KServe Live Coding Session Theofilos Papapanagiotou KubeCon AI Days 2021: Serving Machine Learning Models at Scale Using KServe Yuzhui Liu KubeCon 2021: Serving Machine Learning Models at Scale Using KServe Animesh Singh KubeCon China 2021: Accelerate Federated Learning Model Deployment with KServe Fangchi Wang & Jiahao Chen KubeCon AI Days 2022: Exploring ML Model Serving with KServe Alexa Nicole Griffith KubeCon AI Days 2022: Enhancing the Performance Testing Process for gRPC Model Inferencing at Scale Ted Chang, Paul Van Eck KubeCon Edge Days 2022: Model Serving at the Edge Made Easier Paul Van Eck KnativeCon 2022: How We Built an ML inference Platform with Knative Dan Sun KubeCon EU 2023: The state and future of cloud native model serving Dan Sun, Theofilos Papapanagiotou Kubeflow Summit 2023: Scale your Models to Zero with Knative and KServe Jooho Lee Kubeflow Summit 2023: What to choose? ModelMesh vs Model Serving? Vaibhav Jain","title":"Demos and Presentations"},{"location":"community/presentations/#kserveformally-kfserving-presentations-and-demoes","text":"This page contains a list of presentations and demos. If you'd like to add a presentation or demo here, please send a pull request. 
Presentation/Demo Presenters Distributed Machine Learning Patterns from Manning Publications Yuan Tang KubeCon 2019: Introducing KFServing: Serverless Model Serving on Kubernetes Dan Sun, Ellis Tarn KubeCon 2019: Advanced Model Inferencing Leveraging KNative, Istio & Kubeflow Serving Animesh Singh, Clive Cox KubeflowDojo: KFServing - Production Model Serving Platform Animesh Singh, Tommy Li NVIDIA: Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing Dan Sun, David Goodwin KF Community: KFServing - Enabling Serverless Workloads Across Model Frameworks Ellis Tarn KubeflowDojo: Demo - KFServing End to End through Notebook Animesh Singh, Tommy Li KubeflowDojo: Demo - KFServing with Kafka and Kubeflow Pipelines Animesh Singh Anchor MLOps Podcast: Serving Models with KFServing David Aponte, Demetrios Brinkmann Kubeflow 101: What is KFServing? Stephanie Wong ICML 2020, Workshop on Challenges in Deploying and Monitoring Machine Learning Systems : Serverless inferencing on Kubernetes Clive Cox Serverless Practitioners Summit 2020: Serverless Machine Learning Inference with KFServing Clive Cox, Yuzhui Liu MLOps Meetup: KServe Live Coding Session Theofilos Papapanagiotou KubeCon AI Days 2021: Serving Machine Learning Models at Scale Using KServe Yuzhui Liu KubeCon 2021: Serving Machine Learning Models at Scale Using KServe Animesh Singh KubeCon China 2021: Accelerate Federated Learning Model Deployment with KServe Fangchi Wang & Jiahao Chen KubeCon AI Days 2022: Exploring ML Model Serving with KServe Alexa Nicole Griffith KubeCon AI Days 2022: Enhancing the Performance Testing Process for gRPC Model Inferencing at Scale Ted Chang, Paul Van Eck KubeCon Edge Days 2022: Model Serving at the Edge Made Easier Paul Van Eck KnativeCon 2022: How We Built an ML inference Platform with Knative Dan Sun KubeCon EU 2023: The state and future of cloud native model serving Dan Sun, Theofilos Papapanagiotou Kubeflow Summit 2023: Scale your Models to Zero with Knative and 
KServe Jooho Lee Kubeflow Summit 2023: What to choose? ModelMesh vs Model Serving? Vaibhav Jain","title":"KServe(Formerly KFServing) Presentations and Demos"},{"location":"developer/debug/","text":"KServe Debugging Guide \u00b6 Debug KServe InferenceService Status \u00b6 You deployed an InferenceService to KServe, but it is not in the ready state. Go through this step-by-step guide to understand what failed. kubectl get inferenceservices sklearn-iris NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE model-example False 1m IngressNotConfigured \u00b6 If you see an IngressNotConfigured error, this indicates Istio Ingress Gateway probes are failing. kubectl get ksvc NAME URL LATESTCREATED LATESTREADY READY REASON sklearn-iris-predictor-default http://sklearn-iris-predictor-default.default.example.com sklearn-iris-predictor-default-jk794 mnist-sample-predictor-default-jk794 Unknown IngressNotConfigured You can then check Knative networking-istio pod logs for more details. kubectl logs -l app = networking-istio -n knative-serving If you are seeing HTTP 403, then you may have Istio RBAC turned on which blocks the probes to your service.
{ \"level\" : \"error\" , \"ts\" : \"2020-03-26T19:12:00.749Z\" , \"logger\" : \"istiocontroller.ingress-controller.status-manager\" , \"caller\" : \"ingress/status.go:366\" , \"msg\" : \"Probing of http://flowers-sample-predictor-default.kubeflow-jeanarmel-luce.example.com:80/ failed, IP: 10.0.0.29:80, ready: false, error: unexpected status code: want [200], got 403 (depth: 0)\" , \"commit\" : \"6b0e5c6\" , \"knative.dev/controller\" : \"ingress-controller\" , \"stacktrace\" : \"knative.dev/serving/pkg/reconciler/ingress.(*StatusProber).processWorkItem\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:366\\nknative.dev/serving/pkg/reconciler/ingress.(*StatusProber).Start.func1\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:268\" } RevisionMissing Error \u00b6 If you see RevisionMissing error, then your service pods are not in ready state. Knative Service creates Knative Revision which represents a snapshot of the InferenceService code and configuration. Storage Initializer fails to download model \u00b6 kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-csjpw sklearn-iris-predictor-default sklearn-iris-predictor-default-csjpw 2 Unknown Deploying If you see READY status in Unknown error, this usually indicates that the KServe Storage Initializer init container fails to download the model and you can check the init container logs to see why it fails, note that the pod scales down after sometime if the init container fails . 
kubectl get pod -l serving.kserve.io/inferenceservice = sklearn-iris NAME READY STATUS RESTARTS AGE sklearn-iris-predictor-default-29jks-deployment-5f7d4b9996hzrnc 0 /3 Init:Error 1 10s kubectl logs -l model = sklearn-iris -c storage-initializer [ I 200517 03 :56:19 initializer-entrypoint:13 ] Initializing, args: src_uri [ gs://kfserving-examples/models/sklearn/iris-1 ] dest_path [ [ /mnt/models ] [ I 200517 03 :56:19 storage:35 ] Copying contents of gs://kfserving-examples/models/sklearn/iris-1 to local Traceback ( most recent call last ) : File \"/storage-initializer/scripts/initializer-entrypoint\" , line 14 , in kserve.Storage.download ( src_uri, dest_path ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 48 , in download Storage._download_gcs ( uri, out_dir ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 116 , in _download_gcs The path or model %s does not exist. \" % (uri)) RuntimeError: Failed to fetch model. The path or model gs://kfserving-examples/models/sklearn/iris-1 does not exist. [I 200517 03:40:19 initializer-entrypoint:13] Initializing, args: src_uri [gs://kfserving-examples/models/sklearn/iris] dest_path[ [/mnt/models] [I 200517 03:40:19 storage:35] Copying contents of gs://kfserving-examples/models/sklearn/iris to local [I 200517 03:40:20 storage:111] Downloading: /mnt/models/model.joblib [I 200517 03:40:20 storage:60] Successfully copied gs://kfserving-examples/models/sklearn/iris to /mnt/models Inference Service in OOM status \u00b6 If you see ExitCode137 from the revision status, this means the revision has failed and this usually happens when the inference service pod is out of memory. To address it, you might need to bump up the memory limit of the InferenceService . 
kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-84bzf sklearn-iris-predictor-default sklearn-iris-predictor-default-84bzf 8 False ExitCode137s Inference Service fails to start \u00b6 If you see other exit codes from the revision status you can further check the pod status. kubectl get pods -l serving.kserve.io/inferenceservice = sklearn-iris sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n 1 /3 CrashLoopBackOff 3 80s If you see the CrashLoopBackOff , then check the kserve-container log to see more details where it fails, the error log is usually propagated on revision container status also. kubectl logs sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n kserve-container [ I 200517 04 :58:21 storage:35 ] Copying contents of /mnt/models to local Traceback ( most recent call last ) : File \"/usr/local/lib/python3.7/runpy.py\" , line 193 , in _run_module_as_main \"__main__\" , mod_spec ) File \"/usr/local/lib/python3.7/runpy.py\" , line 85 , in _run_code exec ( code, run_globals ) File \"/sklearnserver/sklearnserver/__main__.py\" , line 33 , in model.load () File \"/sklearnserver/sklearnserver/model.py\" , line 36 , in load model_file = next ( path for path in paths if os.path.exists ( path )) StopIteration Inference Service cannot fetch docker images from AWS ECR \u00b6 If you don't see the inference service created at all for custom images from private registries (such as AWS ECR), it might be that the Knative Serving Controller fails to authenticate itself against the registry. failed to resolve image to digest: failed to fetch image information: unsupported status code 401 ; body: Not Authorized You can verify that this is actually the case by spinning up a pod that uses your image. 
The pod should be able to fetch it, if the correct IAM roles are attached, while Knative is not able to. To circumvent this issue you can either skip tag resolution or provide certificates for your registry as detailed in the official knative docs . kubectl -n knative-serving edit configmap config-deployment The resultant yaml will look like something below. apiVersion : v1 kind : ConfigMap metadata : name : config-deployment namespace : knative-serving data : # List of repositories for which tag to digest resolving should be skipped (for AWS ECR: {account_id}.dkr.ecr.{region}.amazonaws.com) registriesSkippingTagResolving : registry.example.com Debug KServe Request flow \u00b6 +----------------------+ +-----------------------+ +--------------------------+ |Istio Virtual Service | |Istio Virtual Service | | K8S Service | | | | | | | |sklearn-iris | |sklearn-iris-predictor | | sklearn-iris-predictor | | +------->|-default +----->| -default-$revision | | | | | | | |KServe Route | |Knative Route | | Knative Revision Service | +----------------------+ +-----------------------+ +------------+-------------+ Knative Ingress Gateway Knative Local Gateway Kube Proxy (Istio gateway) (Istio gateway) | | | +-------------------------------------------------------+ | | Knative Revision Pod | | | | | | +-------------------+ +-----------------+ | | | | | | | | | | |kserve-container |<-----+ Queue Proxy | |<------------------+ | | | | | | | +-------------------+ +--------------^--+ | | | | +-----------------------^-------------------------------+ | scale deployment | +--------+--------+ | pull metrics | Knative | | | Autoscaler |----------- | KPA/HPA | +-----------------+ 1.Traffic arrives through Knative Ingress/Local Gateway for external/internal traffic \u00b6 Istio Gateway resource describes the edge of the mesh receiving incoming or outgoing HTTP/TCP connections. The specification describes a set of ports that should be exposed and the type of protocol to use. 
If you are using Standalone mode, the Gateway is installed in the knative-serving namespace; if you are using Kubeflow KServe (KServe installed with Kubeflow), it is installed in the kubeflow namespace. For example, on GCP the gateway is protected behind IAP with an Istio authentication policy . kubectl get gateway knative-ingress-gateway -n knative-serving -oyaml kind : Gateway metadata : labels : networking.knative.dev/ingress-provider : istio serving.knative.dev/release : v0.12.1 name : knative-ingress-gateway namespace : knative-serving spec : selector : istio : ingressgateway servers : - hosts : - '*' port : name : http number : 80 protocol : HTTP - hosts : - '*' port : name : https number : 443 protocol : HTTPS tls : mode : SIMPLE privateKey : /etc/istio/ingressgateway-certs/tls.key serverCertificate : /etc/istio/ingressgateway-certs/tls.crt The InferenceService request routes to the Istio Ingress Gateway by matching the host and port from the URL. By default HTTP is configured, and you can configure HTTPS with TLS certificates . 2. KServe Istio virtual service to route for predictor, transformer, explainer. 
\u00b6 kubectl get vs sklearn-iris -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris namespace : default gateways : - knative-serving/knative-local-gateway - knative-serving/knative-ingress-gateway hosts : - sklearn-iris.default.svc.cluster.local - sklearn-iris.default.example.com http : - headers : request : set : Host : sklearn-iris-predictor-default.default.svc.cluster.local match : - authority : regex : ^sklearn-iris\\.default(\\.svc(\\.cluster\\.local)?)?(?::\\d{1,5})?$ gateways : - knative-serving/knative-local-gateway - authority : regex : ^sklearn-iris\\.default\\.example\\.com(?::\\d{1,5})?$ gateways : - knative-serving/knative-ingress-gateway route : - destination : host : knative-local-gateway.istio-system.svc.cluster.local port : number : 80 weight : 100 KServe creates the routing rule which by default routes to Predictor if you only have Predictor specified on InferenceService . When Transformer and Explainer are specified on InferenceService the routing rule configures the traffic to route to Transformer or Explainer based on the verb. The request then routes to the second level Knative created virtual service via local gateway with the matching host header. 3. Knative Istio virtual service to route the inference request to the latest ready revision. 
\u00b6 kubectl get vs sklearn-iris-predictor-default-ingress -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris-predictor-default-mesh namespace : default spec : gateways : - knative-serving/knative-ingress-gateway - knative-serving/knative-local-gateway hosts : - sklearn-iris-predictor-default.default - sklearn-iris-predictor-default.default.example.com - sklearn-iris-predictor-default.default.svc - sklearn-iris-predictor-default.default.svc.cluster.local http : - match : - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default.svc gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway retries : {} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 - match : - authority : prefix : sklearn-iris-predictor-default.default.example.com gateways : - knative-serving/knative-ingress-gateway retries : {} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 The destination here is the k8s Service for the latest ready Knative Revision and it is reconciled by Knative every time user rolls out a new revision. When a new revision is rolled out and in ready state, the old revision is then scaled down, after configured revision GC time the revision resource is garbage collected if the revision no longer has traffic referenced. 4. 
Kubernetes Service routes the requests to the queue proxy sidecar of the inference service pod on port 8012 . \u00b6 kubectl get svc sklearn-iris-predictor-default-fhmjk-private -oyaml apiVersion : v1 kind : Service metadata : name : sklearn-iris-predictor-default-fhmjk-private namespace : default spec : clusterIP : 10.105.186.18 ports : - name : http port : 80 protocol : TCP targetPort : 8012 - name : queue-metrics port : 9090 protocol : TCP targetPort : queue-metrics - name : http-usermetric port : 9091 protocol : TCP targetPort : http-usermetric - name : http-queueadm port : 8022 protocol : TCP targetPort : 8022 selector : serving.knative.dev/revisionUID : a8f1eafc-3c64-4930-9a01-359f3235333a sessionAffinity : None type : ClusterIP 5. The queue proxy routes to kserve container with max concurrent requests configured with ContainerConcurrency . \u00b6 If the queue proxy has more requests than it can handle, the Knative Autoscaler creates more pods to handle additional requests. 6. Finally The queue proxy routes traffic to the kserve-container for processing the inference requests. \u00b6","title":"Debugging guide"},{"location":"developer/debug/#kserve-debugging-guide","text":"","title":"KServe Debugging Guide"},{"location":"developer/debug/#debug-kserve-inferenceservice-status","text":"You deployed an InferenceService to KServe, but it is not in ready state. Go through this step by step guide to understand what failed. kubectl get inferenceservices sklearn-iris NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE model-example False 1m","title":"Debug KServe InferenceService Status"},{"location":"developer/debug/#ingressnotconfigured","text":"If you see IngressNotConfigured error, this indicates Istio Ingress Gateway probes are failing. 
kubectl get ksvc NAME URL LATESTCREATED LATESTREADY READY REASON sklearn-iris-predictor-default http://sklearn-iris-predictor-default.default.example.com sklearn-iris-predictor-default-jk794 mnist-sample-predictor-default-jk794 Unknown IngressNotConfigured You can then check Knative networking-istio pod logs for more details. kubectl logs -l app = networking-istio -n knative-serving If you are seeing HTTP 403, then you may have Istio RBAC turned on which blocks the probes to your service. { \"level\" : \"error\" , \"ts\" : \"2020-03-26T19:12:00.749Z\" , \"logger\" : \"istiocontroller.ingress-controller.status-manager\" , \"caller\" : \"ingress/status.go:366\" , \"msg\" : \"Probing of http://flowers-sample-predictor-default.kubeflow-jeanarmel-luce.example.com:80/ failed, IP: 10.0.0.29:80, ready: false, error: unexpected status code: want [200], got 403 (depth: 0)\" , \"commit\" : \"6b0e5c6\" , \"knative.dev/controller\" : \"ingress-controller\" , \"stacktrace\" : \"knative.dev/serving/pkg/reconciler/ingress.(*StatusProber).processWorkItem\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:366\\nknative.dev/serving/pkg/reconciler/ingress.(*StatusProber).Start.func1\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:268\" }","title":"IngressNotConfigured"},{"location":"developer/debug/#revisionmissing-error","text":"If you see RevisionMissing error, then your service pods are not in ready state. 
A Knative Service creates a Knative Revision, which represents a snapshot of the InferenceService code and configuration.","title":"RevisionMissing Error"},{"location":"developer/debug/#storage-initializer-fails-to-download-model","text":"kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-csjpw sklearn-iris-predictor-default sklearn-iris-predictor-default-csjpw 2 Unknown Deploying If the READY status is Unknown , this usually indicates that the KServe Storage Initializer init container failed to download the model. Check the init container logs to see why it failed; note that the pod scales down after some time if the init container fails . kubectl get pod -l serving.kserve.io/inferenceservice = sklearn-iris NAME READY STATUS RESTARTS AGE sklearn-iris-predictor-default-29jks-deployment-5f7d4b9996hzrnc 0 /3 Init:Error 1 10s kubectl logs -l model = sklearn-iris -c storage-initializer [ I 200517 03 :56:19 initializer-entrypoint:13 ] Initializing, args: src_uri [ gs://kfserving-examples/models/sklearn/iris-1 ] dest_path [ [ /mnt/models ] [ I 200517 03 :56:19 storage:35 ] Copying contents of gs://kfserving-examples/models/sklearn/iris-1 to local Traceback ( most recent call last ) : File \"/storage-initializer/scripts/initializer-entrypoint\" , line 14 , in kserve.Storage.download ( src_uri, dest_path ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 48 , in download Storage._download_gcs ( uri, out_dir ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 116 , in _download_gcs The path or model %s does not exist. \" % (uri)) RuntimeError: Failed to fetch model. The path or model gs://kfserving-examples/models/sklearn/iris-1 does not exist. 
[I 200517 03:40:19 initializer-entrypoint:13] Initializing, args: src_uri [gs://kfserving-examples/models/sklearn/iris] dest_path[ [/mnt/models] [I 200517 03:40:19 storage:35] Copying contents of gs://kfserving-examples/models/sklearn/iris to local [I 200517 03:40:20 storage:111] Downloading: /mnt/models/model.joblib [I 200517 03:40:20 storage:60] Successfully copied gs://kfserving-examples/models/sklearn/iris to /mnt/models","title":"Storage Initializer fails to download model"},{"location":"developer/debug/#inference-service-in-oom-status","text":"If you see ExitCode137 from the revision status, this means the revision has failed and this usually happens when the inference service pod is out of memory. To address it, you might need to bump up the memory limit of the InferenceService . kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-84bzf sklearn-iris-predictor-default sklearn-iris-predictor-default-84bzf 8 False ExitCode137s","title":"Inference Service in OOM status"},{"location":"developer/debug/#inference-service-fails-to-start","text":"If you see other exit codes from the revision status you can further check the pod status. kubectl get pods -l serving.kserve.io/inferenceservice = sklearn-iris sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n 1 /3 CrashLoopBackOff 3 80s If you see the CrashLoopBackOff , then check the kserve-container log to see more details where it fails, the error log is usually propagated on revision container status also. 
kubectl logs sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n kserve-container [ I 200517 04 :58:21 storage:35 ] Copying contents of /mnt/models to local Traceback ( most recent call last ) : File \"/usr/local/lib/python3.7/runpy.py\" , line 193 , in _run_module_as_main \"__main__\" , mod_spec ) File \"/usr/local/lib/python3.7/runpy.py\" , line 85 , in _run_code exec ( code, run_globals ) File \"/sklearnserver/sklearnserver/__main__.py\" , line 33 , in model.load () File \"/sklearnserver/sklearnserver/model.py\" , line 36 , in load model_file = next ( path for path in paths if os.path.exists ( path )) StopIteration","title":"Inference Service fails to start"},{"location":"developer/debug/#inference-service-cannot-fetch-docker-images-from-aws-ecr","text":"If you don't see the inference service created at all for custom images from private registries (such as AWS ECR), it might be that the Knative Serving Controller fails to authenticate itself against the registry. failed to resolve image to digest: failed to fetch image information: unsupported status code 401 ; body: Not Authorized You can verify that this is actually the case by spinning up a pod that uses your image. The pod should be able to fetch it, if the correct IAM roles are attached, while Knative is not able to. To circumvent this issue you can either skip tag resolution or provide certificates for your registry as detailed in the official knative docs . kubectl -n knative-serving edit configmap config-deployment The resultant yaml will look like something below. 
apiVersion : v1 kind : ConfigMap metadata : name : config-deployment namespace : knative-serving data : # List of repositories for which tag to digest resolving should be skipped (for AWS ECR: {account_id}.dkr.ecr.{region}.amazonaws.com) registriesSkippingTagResolving : registry.example.com","title":"Inference Service cannot fetch docker images from AWS ECR"},{"location":"developer/debug/#debug-kserve-request-flow","text":"+----------------------+ +-----------------------+ +--------------------------+ |Istio Virtual Service | |Istio Virtual Service | | K8S Service | | | | | | | |sklearn-iris | |sklearn-iris-predictor | | sklearn-iris-predictor | | +------->|-default +----->| -default-$revision | | | | | | | |KServe Route | |Knative Route | | Knative Revision Service | +----------------------+ +-----------------------+ +------------+-------------+ Knative Ingress Gateway Knative Local Gateway Kube Proxy (Istio gateway) (Istio gateway) | | | +-------------------------------------------------------+ | | Knative Revision Pod | | | | | | +-------------------+ +-----------------+ | | | | | | | | | | |kserve-container |<-----+ Queue Proxy | |<------------------+ | | | | | | | +-------------------+ +--------------^--+ | | | | +-----------------------^-------------------------------+ | scale deployment | +--------+--------+ | pull metrics | Knative | | | Autoscaler |----------- | KPA/HPA | +-----------------+","title":"Debug KServe Request flow"},{"location":"developer/debug/#1traffic-arrives-through-knative-ingresslocal-gateway-for-externalinternal-traffic","text":"Istio Gateway resource describes the edge of the mesh receiving incoming or outgoing HTTP/TCP connections. The specification describes a set of ports that should be exposed and the type of protocol to use. 
If you are using Standalone mode, the Gateway is installed in the knative-serving namespace; if you are using Kubeflow KServe (KServe installed with Kubeflow), it is installed in the kubeflow namespace. For example, on GCP the gateway is protected behind IAP with an Istio authentication policy . kubectl get gateway knative-ingress-gateway -n knative-serving -oyaml kind : Gateway metadata : labels : networking.knative.dev/ingress-provider : istio serving.knative.dev/release : v0.12.1 name : knative-ingress-gateway namespace : knative-serving spec : selector : istio : ingressgateway servers : - hosts : - '*' port : name : http number : 80 protocol : HTTP - hosts : - '*' port : name : https number : 443 protocol : HTTPS tls : mode : SIMPLE privateKey : /etc/istio/ingressgateway-certs/tls.key serverCertificate : /etc/istio/ingressgateway-certs/tls.crt The InferenceService request routes to the Istio Ingress Gateway by matching the host and port from the URL. By default HTTP is configured, and you can configure HTTPS with TLS certificates .","title":"1.Traffic arrives through Knative Ingress/Local Gateway for external/internal traffic"},{"location":"developer/debug/#2-kserve-istio-virtual-service-to-route-for-predictor-transformer-explainer","text":"kubectl get vs sklearn-iris -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris namespace : default gateways : - knative-serving/knative-local-gateway - knative-serving/knative-ingress-gateway hosts : - sklearn-iris.default.svc.cluster.local - sklearn-iris.default.example.com http : - headers : request : set : Host : sklearn-iris-predictor-default.default.svc.cluster.local match : - authority : regex : ^sklearn-iris\\.default(\\.svc(\\.cluster\\.local)?)?(?::\\d{1,5})?$ gateways : - knative-serving/knative-local-gateway - authority : regex : ^sklearn-iris\\.default\\.example\\.com(?::\\d{1,5})?$ gateways : - knative-serving/knative-ingress-gateway route : - destination : host : 
knative-local-gateway.istio-system.svc.cluster.local port : number : 80 weight : 100 KServe creates the routing rule which by default routes to Predictor if you only have Predictor specified on InferenceService . When Transformer and Explainer are specified on InferenceService the routing rule configures the traffic to route to Transformer or Explainer based on the verb. The request then routes to the second level Knative created virtual service via local gateway with the matching host header.","title":"2. KServe Istio virtual service to route for predictor, transformer, explainer."},{"location":"developer/debug/#3-knative-istio-virtual-service-to-route-the-inference-request-to-the-latest-ready-revision","text":"kubectl get vs sklearn-iris-predictor-default-ingress -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris-predictor-default-mesh namespace : default spec : gateways : - knative-serving/knative-ingress-gateway - knative-serving/knative-local-gateway hosts : - sklearn-iris-predictor-default.default - sklearn-iris-predictor-default.default.example.com - sklearn-iris-predictor-default.default.svc - sklearn-iris-predictor-default.default.svc.cluster.local http : - match : - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default.svc gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway retries : {} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 - match : - authority : prefix : sklearn-iris-predictor-default.default.example.com gateways : - knative-serving/knative-ingress-gateway retries : 
{} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 The destination here is the k8s Service for the latest ready Knative Revision and it is reconciled by Knative every time user rolls out a new revision. When a new revision is rolled out and in ready state, the old revision is then scaled down, after configured revision GC time the revision resource is garbage collected if the revision no longer has traffic referenced.","title":"3. Knative Istio virtual service to route the inference request to the latest ready revision."},{"location":"developer/debug/#4-kubernetes-service-routes-the-requests-to-the-queue-proxy-sidecar-of-the-inference-service-pod-on-port-8012","text":"kubectl get svc sklearn-iris-predictor-default-fhmjk-private -oyaml apiVersion : v1 kind : Service metadata : name : sklearn-iris-predictor-default-fhmjk-private namespace : default spec : clusterIP : 10.105.186.18 ports : - name : http port : 80 protocol : TCP targetPort : 8012 - name : queue-metrics port : 9090 protocol : TCP targetPort : queue-metrics - name : http-usermetric port : 9091 protocol : TCP targetPort : http-usermetric - name : http-queueadm port : 8022 protocol : TCP targetPort : 8022 selector : serving.knative.dev/revisionUID : a8f1eafc-3c64-4930-9a01-359f3235333a sessionAffinity : None type : ClusterIP","title":"4. Kubernetes Service routes the requests to the queue proxy sidecar of the inference service pod on port 8012."},{"location":"developer/debug/#5-the-queue-proxy-routes-to-kserve-container-with-max-concurrent-requests-configured-with-containerconcurrency","text":"If the queue proxy has more requests than it can handle, the Knative Autoscaler creates more pods to handle additional requests.","title":"5. 
The queue proxy routes to kserve container with max concurrent requests configured with ContainerConcurrency."},{"location":"developer/debug/#6-finally-the-queue-proxy-routes-traffic-to-the-kserve-container-for-processing-the-inference-requests","text":"","title":"6. Finally The queue proxy routes traffic to the kserve-container for processing the inference requests."},{"location":"developer/developer/","text":"Development \u00b6 This doc explains how to setup a development environment so you can get started contributing . Prerequisites \u00b6 Follow the instructions below to set up your development environment. Once you meet these requirements, you can make changes and deploy your own version of kserve ! Before submitting a PR, see also CONTRIBUTING.md . Install requirements \u00b6 You must install these tools: go : KServe controller is written in Go and requires Go 1.20.0+. git : For source control. Go Module : Go's new dependency management system. ko : For development. kubectl : For managing development environments. kustomize To customize YAMLs for different environments, requires v5.0.0+. yq yq is used in the project makefiles to parse and display YAML output, requires yq 4.* . Install Knative on a Kubernetes cluster \u00b6 KServe currently requires Knative Serving for auto-scaling, canary rollout, Istio for traffic routing and ingress. To install Knative components on your Kubernetes cluster, follow the installation guide or alternatively, use the Knative Operators to manage your installation. Observability, tracing and logging are optional but are often very valuable tools for troubleshooting difficult issues, they can be installed via the directions here . If you start from scratch, KServe requires Kubernetes 1.25+, Knative 1.7+, Istio 1.15+. If you already have Istio or Knative (e.g. from a Kubeflow install) then you don't need to install them explicitly, as long as version dependencies are satisfied. 
Note On a local environment, when using minikube or kind as Kubernetes cluster, there has been a reported issue that knative quickstart bootstrap does not work as expected. It is recommended to follow the installation manual from knative using yaml or using knative operator for a better result. Setup your environment \u00b6 To start your environment you'll need to set these environment variables (we recommend adding them to your .bashrc ): GOPATH : If you don't have one, simply pick a directory and add export GOPATH=... $GOPATH/bin on PATH : This is so that tooling installed via go get will work properly. KO_DEFAULTPLATFORMS : If you are using M1 Mac book the value is linux/arm64 . KO_DOCKER_REPO : The docker repository to which developer images should be pushed (e.g. docker.io/ ). Note : Set up a docker repository for pushing images. You can use any container image registry by adjusting the authentication methods and repository paths mentioned in the sections below. Google Container Registry quickstart Docker Hub quickstart Azure Container Registry quickstart Note if you are using docker hub to store your images your KO_DOCKER_REPO variable should be docker.io/ . Currently Docker Hub doesn't let you create subdirs under your username. .bashrc example: export GOPATH = \" $HOME /go\" export PATH = \" ${ PATH } : ${ GOPATH } /bin\" export KO_DOCKER_REPO = 'docker.io/' Checkout your fork \u00b6 The Go tools require that you clone the repository to the src/github.com/kserve/kserve directory in your GOPATH . To check out this repository: Create your own fork of this repo Clone it to your machine: mkdir -p ${ GOPATH } /src/github.com/kserve cd ${ GOPATH } /src/github.com/kserve git clone git@github.com: ${ YOUR_GITHUB_USERNAME } /kserve.git cd kserve git remote add upstream git@github.com:kserve/kserve.git git remote set-url --push upstream no_push Adding the upstream remote sets you up nicely for regularly syncing your fork . 
Once you reach this point you are ready to do a full build and deploy as described below. Deploy KServe \u00b6 Check Knative Serving installation \u00b6 Once you've set up your development environment , you can verify the installation with the following: Success $ kubectl -n knative-serving get pods NAME READY STATUS RESTARTS AGE activator-77784645fc-t2pjf 1 /1 Running 0 11d autoscaler-6fddf74d5-z2fzf 1 /1 Running 0 11d autoscaler-hpa-5bf4476cc5-tsbw6 1 /1 Running 0 11d controller-7b8cd7f95c-6jxxj 1 /1 Running 0 11d istio-webhook-866c5bc7f8-t5ztb 1 /1 Running 0 11d networking-istio-54fb8b5d4b-xznwd 1 /1 Running 0 11d webhook-5f5f7bd9b4-cv27c 1 /1 Running 0 11d $ kubectl get gateway -n knative-serving NAME AGE knative-ingress-gateway 11d knative-local-gateway 11d $ kubectl get svc -n istio-system NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .101.196.89 X.X.X.X 15021 :31101/TCP,80:31781/TCP,443:30372/TCP,15443:31067/TCP 11d istiod ClusterIP 10 .101.116.203 15010 /TCP,15012/TCP,443/TCP,15014/TCP,853/TCP 11d Deploy KServe from master branch \u00b6 We suggest using cert manager for provisioning the certificates for the webhook server. Other solutions should also work as long as they put the certificates in the desired location. You can follow the cert manager documentation to install it. If you don't want to install cert manager, you can set the KSERVE_ENABLE_SELF_SIGNED_CA environment variable to true. KSERVE_ENABLE_SELF_SIGNED_CA will execute a script to create a self-signed CA and patch it to the webhook config. export KSERVE_ENABLE_SELF_SIGNED_CA = true After that, run the following command to deploy KServe ; you can skip the step above if cert manager is already installed. make deploy Optionally, you can change the CPU and memory limits when deploying KServe . 
export KSERVE_CONTROLLER_CPU_LIMIT = export KSERVE_CONTROLLER_MEMORY_LIMIT = make deploy Expected Output $ kubectl get pods -n kserve -l control-plane = kserve-controller-manager NAME READY STATUS RESTARTS AGE kserve-controller-manager-0 2/2 Running 0 13m Note By default it installs to the kserve namespace with the published controller manager image from the master branch. Deploy KServe with your own version \u00b6 Run the following command to deploy the KServe controller and model agent with your local change. make deploy-dev Note deploy-dev builds the image from your local code, publishes it to KO_DOCKER_REPO and deploys the kserve-controller-manager and model agent with the image digest to your cluster for testing. Please also ensure you are logged in to KO_DOCKER_REPO from your client machine. Run the following command to deploy the model server with your local change. make deploy-dev-sklearn make deploy-dev-xgb Run the following command to deploy the explainer with your local change. make deploy-dev-alibi Run the following command to deploy the storage initializer with your local change. make deploy-dev-storageInitializer Warning The deploy command publishes the image to KO_DOCKER_REPO with the version latest and changes the InferenceService configmap to point to the newly built image sha. The built image is only for development and testing purposes. The current limitation is that it changes the impacted image and resets all other images, including the kserve-controller-manager , to use the default ones. Smoke test after deployment \u00b6 Run the following command to smoke test the deployment: kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/docs/samples/v1beta1/tensorflow/tensorflow.yaml You should see the model serving deployment running under default or your specified namespace. 
$ kubectl get pods -n default -l serving.kserve.io/inferenceservice=flower-sample Expected Output NAME READY STATUS RESTARTS AGE flower-sample-default-htz8r-deployment-8fd979f9b-w2qbv 3/3 Running 0 10s Running unit/integration tests \u00b6 kserve-controller-manager has a few integration tests which require a mock apiserver and etcd; they get installed along with kubebuilder . To run all unit/integration tests: make test Run e2e tests locally \u00b6 To set up from local code, do: ./hack/quick_install.sh make undeploy make deploy-dev Go to python/kserve and install kserve python sdk deps pip3 install -e . [ test ] Then go to test/e2e . Run kubectl create namespace kserve-ci-e2e-test For KIND/minikube: Run export KSERVE_INGRESS_HOST_PORT=localhost:8080 In a different window run kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80 Note that not all tests will pass, as the pytorch test requires a GPU. These will show as pending pods at the end, or you can add a marker to skip the test. Run pytest > testresults.txt Tests may not clean up. To re-run, first do kubectl delete namespace kserve-ci-e2e-test , recreate the namespace and run again. Iterating \u00b6 As you make changes to the code-base, there are two special cases to be aware of: If you change an input to generated code , then you must run make manifests . Inputs include: API type definitions in apis/serving Manifests or kustomize patches stored in config . To generate the KServe python/go clients, you should run make generate . If you want to add new dependencies , then you add the imports and the specific version of the dependency module in go.mod . When it encounters an import of a package not provided by any module in go.mod , the go command automatically looks up the module containing the package and adds it to go.mod using the latest version. 
If you want to upgrade the dependency , then you run the go get command, e.g. go get golang.org/x/text to upgrade to the latest version, go get golang.org/x/text@v0.3.0 to upgrade to a specific version. make deploy-dev Contribute to the code \u00b6 See the guidelines for contributing a feature contributing to an existing issue Releases \u00b6 Please check out the documentation here to understand the release schedule and process. Feedback \u00b6 The best place to provide feedback about the KServe code is via a GitHub issue. See creating a GitHub issue for guidelines on submitting bugs and feature requests.","title":"How to contribute"},{"location":"developer/developer/#development","text":"This doc explains how to set up a development environment so you can get started contributing .","title":"Development"},{"location":"developer/developer/#prerequisites","text":"Follow the instructions below to set up your development environment. Once you meet these requirements, you can make changes and deploy your own version of kserve ! Before submitting a PR, see also CONTRIBUTING.md .","title":"Prerequisites"},{"location":"developer/developer/#install-requirements","text":"You must install these tools: go : KServe controller is written in Go and requires Go 1.20.0+. git : For source control. Go Module : Go's new dependency management system. ko : For development. kubectl : For managing development environments. kustomize To customize YAMLs for different environments, requires v5.0.0+. yq yq is used in the project makefiles to parse and display YAML output, requires yq 4.* .","title":"Install requirements"},{"location":"developer/developer/#install-knative-on-a-kubernetes-cluster","text":"KServe currently requires Knative Serving for auto-scaling and canary rollout, and Istio for traffic routing and ingress. To install Knative components on your Kubernetes cluster, follow the installation guide or alternatively, use the Knative Operators to manage your installation. 
Observability, tracing and logging are optional but are often very valuable tools for troubleshooting difficult issues, they can be installed via the directions here . If you start from scratch, KServe requires Kubernetes 1.25+, Knative 1.7+, Istio 1.15+. If you already have Istio or Knative (e.g. from a Kubeflow install) then you don't need to install them explicitly, as long as version dependencies are satisfied. Note On a local environment, when using minikube or kind as Kubernetes cluster, there has been a reported issue that knative quickstart bootstrap does not work as expected. It is recommended to follow the installation manual from knative using yaml or using knative operator for a better result.","title":"Install Knative on a Kubernetes cluster"},{"location":"developer/developer/#setup-your-environment","text":"To start your environment you'll need to set these environment variables (we recommend adding them to your .bashrc ): GOPATH : If you don't have one, simply pick a directory and add export GOPATH=... $GOPATH/bin on PATH : This is so that tooling installed via go get will work properly. KO_DEFAULTPLATFORMS : If you are using M1 Mac book the value is linux/arm64 . KO_DOCKER_REPO : The docker repository to which developer images should be pushed (e.g. docker.io/ ). Note : Set up a docker repository for pushing images. You can use any container image registry by adjusting the authentication methods and repository paths mentioned in the sections below. Google Container Registry quickstart Docker Hub quickstart Azure Container Registry quickstart Note if you are using docker hub to store your images your KO_DOCKER_REPO variable should be docker.io/ . Currently Docker Hub doesn't let you create subdirs under your username. 
.bashrc example: export GOPATH = \" $HOME /go\" export PATH = \" ${ PATH } : ${ GOPATH } /bin\" export KO_DOCKER_REPO = 'docker.io/'","title":"Setup your environment"},{"location":"developer/developer/#checkout-your-fork","text":"The Go tools require that you clone the repository to the src/github.com/kserve/kserve directory in your GOPATH . To check out this repository: Create your own fork of this repo Clone it to your machine: mkdir -p ${ GOPATH } /src/github.com/kserve cd ${ GOPATH } /src/github.com/kserve git clone git@github.com: ${ YOUR_GITHUB_USERNAME } /kserve.git cd kserve git remote add upstream git@github.com:kserve/kserve.git git remote set-url --push upstream no_push Adding the upstream remote sets you up nicely for regularly syncing your fork . Once you reach this point you are ready to do a full build and deploy as described below.","title":"Checkout your fork"},{"location":"developer/developer/#deploy-kserve","text":"","title":"Deploy KServe"},{"location":"developer/developer/#check-knative-serving-installation","text":"Once you've setup your development environment , you can verify the installation with following: Success $ kubectl -n knative-serving get pods NAME READY STATUS RESTARTS AGE activator-77784645fc-t2pjf 1 /1 Running 0 11d autoscaler-6fddf74d5-z2fzf 1 /1 Running 0 11d autoscaler-hpa-5bf4476cc5-tsbw6 1 /1 Running 0 11d controller-7b8cd7f95c-6jxxj 1 /1 Running 0 11d istio-webhook-866c5bc7f8-t5ztb 1 /1 Running 0 11d networking-istio-54fb8b5d4b-xznwd 1 /1 Running 0 11d webhook-5f5f7bd9b4-cv27c 1 /1 Running 0 11d $ kubectl get gateway -n knative-serving NAME AGE knative-ingress-gateway 11d knative-local-gateway 11d $ kubectl get svc -n istio-system NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .101.196.89 X.X.X.X 15021 :31101/TCP,80:31781/TCP,443:30372/TCP,15443:31067/TCP 11d istiod ClusterIP 10 .101.116.203 15010 /TCP,15012/TCP,443/TCP,15014/TCP,853/TCP 11d","title":"Check Knative Serving 
installation"},{"location":"developer/developer/#deploy-kserve-from-master-branch","text":"We suggest using cert manager for provisioning the certificates for the webhook server. Other solutions should also work as long as they put the certificates in the desired location. You can follow the cert manager documentation to install it. If you don't want to install cert manager, you can set the KSERVE_ENABLE_SELF_SIGNED_CA environment variable to true. KSERVE_ENABLE_SELF_SIGNED_CA will execute a script to create a self-signed CA and patch it to the webhook config. export KSERVE_ENABLE_SELF_SIGNED_CA = true After that you can run following command to deploy KServe , you can skip above step if cert manager is already installed. make deploy Optional you can change CPU and memory limits when deploying KServe . export KSERVE_CONTROLLER_CPU_LIMIT = export KSERVE_CONTROLLER_MEMORY_LIMIT = make deploy Expected Output $ kubectl get pods -n kserve -l control-plane = kserve-controller-manager NAME READY STATUS RESTARTS AGE kserve-controller-manager-0 2/2 Running 0 13m Note By default it installs to kserve namespace with the published controller manager image from master branch.","title":"Deploy KServe from master branch"},{"location":"developer/developer/#deploy-kserve-with-your-own-version","text":"Run the following command to deploy KServe controller and model agent with your local change. make deploy-dev Note deploy-dev builds the image from your local code, publishes to KO_DOCKER_REPO and deploys the kserve-controller-manager and model agent with the image digest to your cluster for testing. Please also ensure you are logged in to KO_DOCKER_REPO from your client machine. Run the following command to deploy model server with your local change. make deploy-dev-sklearn make deploy-dev-xgb Run the following command to deploy explainer with your local change. make deploy-dev-alibi Run the following command to deploy storage initializer with your local change. 
make deploy-dev-storageInitializer Warning The deploy command publishes the image to KO_DOCKER_REPO with the version latest , it changes the InferenceService configmap to point to the newly built image sha. The built image is only for development and testing purpose, the current limitation is that it changes the image impacted and reset all other images including the kserver-controller-manager to use the default ones.","title":"Deploy KServe with your own version"},{"location":"developer/developer/#smoke-test-after-deployment","text":"Run the following command to smoke test the deployment kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/docs/samples/v1beta1/tensorflow/tensorflow.yaml You should see model serving deployment running under default or your specified namespace. $ kubectl get pods -n default -l serving.kserve.io/inferenceservice=flower-sample Expected Output NAME READY STATUS RESTARTS AGE flower-sample-default-htz8r-deployment-8fd979f9b-w2qbv 3/3 Running 0 10s","title":"Smoke test after deployment"},{"location":"developer/developer/#running-unitintegration-tests","text":"kserver-controller-manager has a few integration tests which requires mock apiserver and etcd, they get installed along with kubebuilder . To run all unit/integration tests: make test","title":"Running unit/integration tests"},{"location":"developer/developer/#run-e2e-tests-locally","text":"To setup from local code, do: ./hack/quick_install.sh make undeploy make deploy-dev Go to python/kserve and install kserve python sdk deps pip3 install -e . [ test ] Then go to test/e2e . Run kubectl create namespace kserve-ci-e2e-test For KIND/minikube: Run export KSERVE_INGRESS_HOST_PORT=localhost:8080 In a different window run kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80 Note that not all tests will pass as the pytorch test requires gpu. These will show as pending pods at the end or you can add marker to skip the test. 
Run pytest > testresults.txt Tests may not clean up. To re-run, first do kubectl delete namespace kserve-ci-e2e-test , recreate the namespace and run again.","title":"Run e2e tests locally"},{"location":"developer/developer/#iterating","text":"As you make changes to the code-base, there are two special cases to be aware of: If you change an input to generated code , then you must run make manifests . Inputs include: API type definitions in apis/serving Manifests or kustomize patches stored in config . To generate the KServe python/go clients, you should run make generate . If you want to add new dependencies , then you add the imports and the specific version of the dependency module in go.mod . When it encounters an import of a package not provided by any module in go.mod , the go command automatically looks up the module containing the package and adds it to go.mod using the latest version. If you want to upgrade the dependency , then you run the go get command, e.g. go get golang.org/x/text to upgrade to the latest version, go get golang.org/x/text@v0.3.0 to upgrade to a specific version. make deploy-dev","title":"Iterating"},{"location":"developer/developer/#contribute-to-the-code","text":"See the guidelines for contributing a feature contributing to an existing issue","title":"Contribute to the code"},{"location":"developer/developer/#releases","text":"Please check out the documentation here to understand the release schedule and process.","title":"Releases"},{"location":"developer/developer/#feedback","text":"The best place to provide feedback about the KServe code is via a GitHub issue. See creating a GitHub issue for guidelines on submitting bugs and feature requests.","title":"Feedback"},{"location":"get_started/","text":"Getting Started with KServe \u00b6 Before you begin \u00b6 Warning KServe Quickstart Environments are for experimentation use only. 
For production installation, see our Administrator's Guide . Before you can get started with a KServe Quickstart deployment you must install kind and the Kubernetes CLI. Install Kind (Kubernetes in Docker) \u00b6 You can use kind (Kubernetes in Docker) to run a local Kubernetes cluster with Docker container nodes. Install the Kubernetes CLI \u00b6 The Kubernetes CLI ( kubectl ) allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. Install the KServe \"Quickstart\" environment \u00b6 After installing kind, create a kind cluster with: kind create cluster Then run: kubectl config get-contexts It should list your available contexts, one of which should be kind-kind . Then run: kubectl config use-context kind-kind to use this context. You can then get started with a local deployment of KServe by using the KServe Quick installation script on Kind : curl -s \"https://raw.githubusercontent.com/kserve/kserve/release-0.12/hack/quick_install.sh\" | bash or install via our published Helm Charts: helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.12.0 helm install kserve oci://ghcr.io/kserve/charts/kserve --version v0.12.0","title":"KServe Quickstart"},{"location":"get_started/#getting-started-with-kserve","text":"","title":"Getting Started with KServe"},{"location":"get_started/#before-you-begin","text":"Warning KServe Quickstart Environments are for experimentation use only. 
For production installation, see our Administrator's Guide . Before you can get started with a KServe Quickstart deployment you must install kind and the Kubernetes CLI.","title":"Before you begin"},{"location":"get_started/#install-kind-kubernetes-in-docker","text":"You can use kind (Kubernetes in Docker) to run a local Kubernetes cluster with Docker container nodes.","title":"Install Kind (Kubernetes in Docker)"},{"location":"get_started/#install-the-kubernetes-cli","text":"The Kubernetes CLI ( kubectl ) allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.","title":"Install the Kubernetes CLI"},{"location":"get_started/#install-the-kserve-quickstart-environment","text":"After installing kind, create a kind cluster with: kind create cluster Then run: kubectl config get-contexts It should list your available contexts, one of which should be kind-kind . Then run: kubectl config use-context kind-kind to use this context. You can then get started with a local deployment of KServe by using the KServe Quick installation script on Kind : curl -s \"https://raw.githubusercontent.com/kserve/kserve/release-0.12/hack/quick_install.sh\" | bash or install via our published Helm Charts: helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.12.0 helm install kserve oci://ghcr.io/kserve/charts/kserve --version v0.12.0","title":"Install the KServe \"Quickstart\" environment"},{"location":"get_started/first_isvc/","text":"Run your first InferenceService \u00b6 In this tutorial, you will deploy an InferenceService with a predictor that will load a scikit-learn model trained with the iris dataset. This dataset has three output classes: Iris Setosa, Iris Versicolour, and Iris Virginica. You will then send an inference request to your deployed model in order to get a prediction for the class of iris plant your request corresponds to. 
Since your model is being deployed as an InferenceService, not a raw Kubernetes Service, you just need to provide the storage location of the model and it gets some super powers out of the box . 1. Create a namespace \u00b6 First, create a namespace to use for deploying KServe resources: kubectl create namespace kserve-test 2. Create an InferenceService \u00b6 Next, define a new InferenceService YAML for the model and apply it to the cluster. A new predictor schema was introduced in v0.8.0 . New InferenceServices should be deployed using the new schema. The old schema is provided as reference. New Schema Old Schema kubectl apply -n kserve-test -f - < \"./iris-input.json\" { \"instances\": [ [6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6] ] } EOF Depending on your setup, use one of the following commands to curl the InferenceService : Real DNS Magic DNS From Ingress gateway with HOST Header From local cluster gateway If you have configured the DNS, you can directly curl the InferenceService with the URL obtained from the status print. e.g curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json If you don't want to go through the trouble to get a real domain, you can instead use \"magic\" dns xip.io . The key is to get the external IP for your cluster. kubectl get svc istio-ingressgateway --namespace istio-system Look for the EXTERNAL-IP column's value(in this case 35.237.217.209) NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .51.253.94 35 .237.217.209 Next step is to setting up the custom domain: kubectl edit cm config-domain --namespace knative-serving Now in your editor, change example.com to {{external-ip}}.xip.io (make sure to replace {{external-ip}} with the IP you found earlier). 
With the change applied you can now directly curl the URL curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test.35.237.217.209.xip.io/v1/models/sklearn-iris:predict -d @./iris-input.json If you do not have DNS, you can still curl with the ingress gateway external IP using the HOST Header. SERVICE_HOSTNAME = $( kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/sklearn-iris:predict\" -d @./iris-input.json If you are calling from in cluster you can curl with the internal url with host {{InferenceServiceName}}.{{namespace}} curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test/v1/models/sklearn-iris:predict -d @./iris-input.json You should see two predictions returned (i.e. {\"predictions\": [1, 1]} ). Both sets of data points sent for inference correspond to the flower with index 1 . In this case, the model predicts that both flowers are \"Iris Versicolour\". 6. 
Run performance test (optional) \u00b6 If you want to load test the deployed model, try deploying the following Kubernetes Job to drive load to the model: # use kubectl create instead of apply because the job template is using generateName which doesn't work with kubectl apply kubectl create -f https://raw.githubusercontent.com/kserve/kserve/release-0.11/docs/samples/v1beta1/sklearn/v1/perf.yaml -n kserve-test Execute the following command to view output: kubectl logs load-test8b58n-rgfxr -n kserve-test Expected Output Requests [ total, rate, throughput ] 30000 , 500 .02, 499 .99 Duration [ total, attack, wait ] 1m0s, 59 .998s, 3 .336ms Latencies [ min, mean, 50 , 90 , 95 , 99 , max ] 1 .743ms, 2 .748ms, 2 .494ms, 3 .363ms, 4 .091ms, 7 .749ms, 46 .354ms Bytes In [ total, mean ] 690000 , 23 .00 Bytes Out [ total, mean ] 2460000 , 82 .00 Success [ ratio ] 100 .00% Status Codes [ code:count ] 200 :30000 Error Set:","title":"First InferenceService"},{"location":"get_started/first_isvc/#run-your-first-inferenceservice","text":"In this tutorial, you will deploy an InferenceService with a predictor that will load a scikit-learn model trained with the iris dataset. This dataset has three output class: Iris Setosa, Iris Versicolour, and Iris Virginica. You will then send an inference request to your deployed model in order to get a prediction for the class of iris plant your request corresponds to. Since your model is being deployed as an InferenceService, not a raw Kubernetes Service, you just need to provide the storage location of the model and it gets some super powers out of the box .","title":"Run your first InferenceService"},{"location":"get_started/first_isvc/#1-create-a-namespace","text":"First, create a namespace to use for deploying KServe resources: kubectl create namespace kserve-test","title":"1. 
Create a namespace"},{"location":"get_started/first_isvc/#2-create-an-inferenceservice","text":"Next, define a new InferenceService YAML for the model and apply it to the cluster. A new predictor schema was introduced in v0.8.0 . New InferenceServices should be deployed using the new schema. The old schema is provided as reference. New Schema Old Schema kubectl apply -n kserve-test -f - < \"./iris-input.json\" { \"instances\": [ [6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6] ] } EOF Depending on your setup, use one of the following commands to curl the InferenceService : Real DNS Magic DNS From Ingress gateway with HOST Header From local cluster gateway If you have configured the DNS, you can directly curl the InferenceService with the URL obtained from the status print. e.g curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json If you don't want to go through the trouble to get a real domain, you can instead use \"magic\" dns xip.io . The key is to get the external IP for your cluster. kubectl get svc istio-ingressgateway --namespace istio-system Look for the EXTERNAL-IP column's value(in this case 35.237.217.209) NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .51.253.94 35 .237.217.209 Next step is to setting up the custom domain: kubectl edit cm config-domain --namespace knative-serving Now in your editor, change example.com to {{external-ip}}.xip.io (make sure to replace {{external-ip}} with the IP you found earlier). With the change applied you can now directly curl the URL curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test.35.237.217.209.xip.io/v1/models/sklearn-iris:predict -d @./iris-input.json If you do not have DNS, you can still curl with the ingress gateway external IP using the HOST Header. 
SERVICE_HOSTNAME = $( kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/sklearn-iris:predict\" -d @./iris-input.json If you are calling from in cluster you can curl with the internal url with host {{InferenceServiceName}}.{{namespace}} curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test/v1/models/sklearn-iris:predict -d @./iris-input.json You should see two predictions returned (i.e. {\"predictions\": [1, 1]} ). Both sets of data points sent for inference correspond to the flower with index 1 . In this case, the model predicts that both flowers are \"Iris Versicolour\".","title":"5. Perform inference"},{"location":"get_started/first_isvc/#6-run-performance-test-optional","text":"If you want to load test the deployed model, try deploying the following Kubernetes Job to drive load to the model: # use kubectl create instead of apply because the job template is using generateName which doesn't work with kubectl apply kubectl create -f https://raw.githubusercontent.com/kserve/kserve/release-0.11/docs/samples/v1beta1/sklearn/v1/perf.yaml -n kserve-test Execute the following command to view output: kubectl logs load-test8b58n-rgfxr -n kserve-test Expected Output Requests [ total, rate, throughput ] 30000 , 500 .02, 499 .99 Duration [ total, attack, wait ] 1m0s, 59 .998s, 3 .336ms Latencies [ min, mean, 50 , 90 , 95 , 99 , max ] 1 .743ms, 2 .748ms, 2 .494ms, 3 .363ms, 4 .091ms, 7 .749ms, 46 .354ms Bytes In [ total, mean ] 690000 , 23 .00 Bytes Out [ total, mean ] 2460000 , 82 .00 Success [ ratio ] 100 .00% Status Codes [ code:count ] 200 :30000 Error Set:","title":"6. 
Run performance test (optional)"},{"location":"get_started/swagger_ui/","text":"InferenceService Swagger UI \u00b6 KServe ModelServer is built on top of FastAPI , which brings out-of-box support for OpenAPI specification and Swagger UI . Swagger UI allows visualizing and interacting with the KServe InferenceService API directly in the browser , making it easy for exploring the endpoints and validating the outputs without using any command-line tool. Enable Swagger UI \u00b6 Warning Be careful when enabling this for your production InferenceService deployments since the endpoint does not require authentication at this time. Currently, POST request only work for v2 endpoints in the UI. To enable, simply add an extra argument to the InferenceService YAML example from First Inference chapter: kubectl apply -n kserve-test -f - <.github.io/docs/ Where is your Github handle. After a few moments, your changes should be available for public preview at the link provided by MkDocs! This means you can rapidly prototype and share your changes before making a PR! Navigation \u00b6 Navigation in MkDocs uses the \"mkdocs.yml\" file (found in the /docs directory) to organize navigation. For more in-depth information on Navigation, see: https://www.mkdocs.org/user-guide/writing-your-docs/#configure-pages-and-navigation and https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/ Content Tabs \u00b6 Content tabs are handy way to organize lots of information in a visually pleasing way. Some documentation from https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage is reproduced here: Grouping Code blocks Grouping other content Code blocks are one of the primary targets to be grouped, and can be considered a special case of content tabs, as tabs with a single code block are always rendered without horizontal spacing. 
Example: === \"C\" ``` c #include int main(void) { printf(\"Hello world!\\n\"); return 0; } ``` === \"C++\" ``` c++ #include int main(void) { std::cout << \"Hello world!\" << std::endl; return 0; } ``` Result: C C++ #include int main ( void ) { printf ( \"Hello world! \\n \" ); return 0 ; } #include int main ( void ) { std :: cout << \"Hello world!\" << std :: endl ; return 0 ; } When a content tab contains more than one code block, it is rendered with horizontal spacing. Vertical spacing is never added, but can be achieved by nesting tabs in other blocks. Example: === \"Unordered list\" * Sed sagittis eleifend rutrum * Donec vitae suscipit est * Nulla tempor lobortis orci === \"Ordered list\" 1. Sed sagittis eleifend rutrum 2. Donec vitae suscipit est 3. Nulla tempor lobortis orci Result: Unordered list Ordered list Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci For more information, see: https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage File Includes (Content Reuse) \u00b6 KServe strives to reduce duplicative effort by reusing commonly used bits of information, see the docs/snippet directory for some examples. Snippets does not require a specific extension, and as long as a valid file name is specified, it will attempt to process it. Snippets can handle recursive file inclusion. And if Snippets encounters the same file in the current stack, it will avoid re-processing it in order to avoid an infinite loop (or crash on hitting max recursion depth). For more info, see: https://facelessuser.github.io/pymdown-extensions/extensions/snippets/ Admonitions \u00b6 We use the following admonition boxes only. Use admonitions sparingly; too many admonitions can be distracting. Admonitions Formatting Note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. 
If the information suggests an action to take, use a tip instead. Tip A Tip suggests a helpful, but not mandatory, action to take. Warning A Warning draws attention to potential trouble. !!! note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. If the information suggests an action to take, use a tip instead. !!! tip A Tip suggests a helpful, but not mandatory, action to take. !!! warning A Warning draws attention to potential trouble. Icons and Emojis \u00b6 Material for MkDocs supports using Material Icons and Emojis using easy shortcodes. Emojis Formatting :taco: To search a database of Icons and Emojis (all of which can be used on kserve.io), as well as usage information, see: https://squidfunk.github.io/mkdocs-material/reference/icons-emojis/#search Redirects \u00b6 The KServe site uses mkdocs-redirects to \"redirect\" users from a page that may no longer exist (or has been moved) to their desired location. Adding redirects to the KServe site is done in one centralized place, docs/config/redirects.yml . The format is shown here: plugins: redirects: redirect_maps: ... path_to_old_or_moved_URL : path_to_new_URL","title":"MkDocs Contributions"},{"location":"help/contributor/mkdocs-contributor-guide/#mkdocs-contributions","text":"This is a temporary home for contribution guidelines for the MkDocs branch. When MkDocs becomes \"main\" this will be moved to the appropriate place on the website.","title":"MkDocs Contributions"},{"location":"help/contributor/mkdocs-contributor-guide/#install-material-for-mkdocs","text":"kserve.io uses Material for MkDocs to render documentation. Material for MkDocs is Python-based and uses pip to install most of its required packages as well as optional add-ons (which we use). You can choose to install MkDocs locally or using a Docker image. 
pip actually comes pre-installed with Python so it is included in many operating systems (like MacOSx or Ubuntu) but if you don\u2019t have Python, you can install it here: https://www.python.org For some (e.g. folks using RHEL), you may have to use pip3. pip pip3 pip install mkdocs-material mike More detailed instructions can be found here: https://squidfunk.github.io/mkdocs-material/getting-started/#installation pip3 install mkdocs-material mike More detailed instructions can be found here: https://squidfunk.github.io/mkdocs-material/getting-started/#installation","title":"Install Material for MkDocs"},{"location":"help/contributor/mkdocs-contributor-guide/#install-kserve-specific-extensions","text":"KServe uses a number of extensions to MkDocs which can also be installed using pip. If you used pip to install, run the following: pip pip3 pip install mkdocs-material-extensions mkdocs-macros-plugin mkdocs-exclude mkdocs-awesome-pages-plugin mkdocs-redirects pip3 install mkdocs-material-extensions mkdocs-macros-plugin mkdocs-exclude mkdocs-awesome-pages-plugin mkdocs-redirects","title":"Install KServe-Specific Extensions"},{"location":"help/contributor/mkdocs-contributor-guide/#install-dependencies-in-requirementstxt-file","text":"Navigate to root folder and run below command to install required packages and libraries specified in the requirements.txt file. pip pip3 pip install -r requirements.txt pip3 install -r requirements.txt","title":"Install Dependencies in Requirements.txt file"},{"location":"help/contributor/mkdocs-contributor-guide/#setting-up-local-preview","text":"Once you have installed Material for MkDocs and all of the extensions, head over to and clone the repo. In your terminal, find your way over to the location of the cloned repo. Once you are in the main folder and run: Local Preview Local Preview w/ Dirty Reload Local Preview including Blog and Community Site mkdocs serve If you\u2019re only changing a single page in the /docs/ folder (i.e. 
not the homepage or mkdocs.yml) adding the flag --dirtyreload will make the site rebuild much faster. mkdocs serve --dirtyreload First, install the necessary extensions: npm install -g postcss postcss-cli autoprefixer http-server Once you have those npm packages installed, run: ./hack/build-with-blog.sh serve Note Unfortunately, there aren\u2019t live previews for this version of the local preview. After a while, your terminal should print: INFO - Documentation built in 13 .54 seconds [ I 210519 10 :47:10 server:335 ] Serving on http://127.0.0.1:8000 [ I 210519 10 :47:10 handlers:62 ] Start watching changes [ I 210519 10 :47:10 handlers:64 ] Start detecting changes Now access http://127.0.0.1:8000 and you should see the site is built! \ud83c\udf89 Anytime you change any file in your /docs/ repo and hit save, the site will automatically rebuild itself to reflect your changes!","title":"Setting Up Local Preview"},{"location":"help/contributor/mkdocs-contributor-guide/#setting-up-public-preview","text":"If, for whatever reason, you want to share your work before submitting a PR (where Netlify would generate a preview for you), you can deploy your changes as a GitHub Page easily using the following command: mkdocs gh-deploy --force INFO - Documentation built in 14 .29 seconds WARNING - Version check skipped: No version specified in previous deployment. INFO - Your documentation should shortly be available at: https://.github.io/docs/ Where is your GitHub handle. After a few moments, your changes should be available for public preview at the link provided by MkDocs! This means you can rapidly prototype and share your changes before making a PR!","title":"Setting Up \"Public\" Preview"},{"location":"help/contributor/mkdocs-contributor-guide/#navigation","text":"Navigation in MkDocs uses the \"mkdocs.yml\" file (found in the /docs directory) to organize navigation. 
For more in-depth information on Navigation, see: https://www.mkdocs.org/user-guide/writing-your-docs/#configure-pages-and-navigation and https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/","title":"Navigation"},{"location":"help/contributor/mkdocs-contributor-guide/#content-tabs","text":"Content tabs are a handy way to organize lots of information in a visually pleasing way. Some documentation from https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage is reproduced here: Grouping Code blocks Grouping other content Code blocks are one of the primary targets to be grouped, and can be considered a special case of content tabs, as tabs with a single code block are always rendered without horizontal spacing. Example: === \"C\" ``` c #include int main(void) { printf(\"Hello world!\\n\"); return 0; } ``` === \"C++\" ``` c++ #include int main(void) { std::cout << \"Hello world!\" << std::endl; return 0; } ``` Result: C C++ #include int main ( void ) { printf ( \"Hello world! \\n \" ); return 0 ; } #include int main ( void ) { std :: cout << \"Hello world!\" << std :: endl ; return 0 ; } When a content tab contains more than one code block, it is rendered with horizontal spacing. Vertical spacing is never added, but can be achieved by nesting tabs in other blocks. Example: === \"Unordered list\" * Sed sagittis eleifend rutrum * Donec vitae suscipit est * Nulla tempor lobortis orci === \"Ordered list\" 1. Sed sagittis eleifend rutrum 2. Donec vitae suscipit est 3. 
Nulla tempor lobortis orci Result: Unordered list Ordered list Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci For more information, see: https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage","title":"Content Tabs"},{"location":"help/contributor/mkdocs-contributor-guide/#file-includes-content-reuse","text":"KServe strives to reduce duplicative effort by reusing commonly used bits of information; see the docs/snippet directory for some examples. Snippets does not require a specific extension, and as long as a valid file name is specified, it will attempt to process it. Snippets can handle recursive file inclusion. And if Snippets encounters the same file in the current stack, it will avoid re-processing it in order to avoid an infinite loop (or crash on hitting max recursion depth). For more info, see: https://facelessuser.github.io/pymdown-extensions/extensions/snippets/","title":"File Includes (Content Reuse)"},{"location":"help/contributor/mkdocs-contributor-guide/#admonitions","text":"We use the following admonition boxes only. Use admonitions sparingly; too many admonitions can be distracting. Admonitions Formatting Note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. If the information suggests an action to take, use a tip instead. Tip A Tip suggests a helpful, but not mandatory, action to take. Warning A Warning draws attention to potential trouble. !!! note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. If the information suggests an action to take, use a tip instead. !!! tip A Tip suggests a helpful, but not mandatory, action to take. !!! 
warning A Warning draws attention to potential trouble.","title":"Admonitions"},{"location":"help/contributor/mkdocs-contributor-guide/#icons-and-emojis","text":"Material for MkDocs supports using Material Icons and Emojis using easy shortcodes. Emojis Formatting :taco: To search a database of Icons and Emojis (all of which can be used on kserve.io), as well as usage information, see: https://squidfunk.github.io/mkdocs-material/reference/icons-emojis/#search","title":"Icons and Emojis"},{"location":"help/contributor/mkdocs-contributor-guide/#redirects","text":"The KServe site uses mkdocs-redirects to \"redirect\" users from a page that may no longer exist (or has been moved) to their desired location. Adding redirects to the KServe site is done in one centralized place, docs/config/redirects.yml . The format is shown here: plugins: redirects: redirect_maps: ... path_to_old_or_moved_URL : path_to_new_URL","title":"Redirects"},{"location":"help/contributor/templates/template-blog/","text":"Blog template instructions \u00b6 An example template with best-practices that you can use to start drafting an entry to post on the KServe blog. 
Copy a version of this template without the instructions Include a commented-out table with tracking info about reviews and approvals: | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | --> Blog content body \u00b6 Example step/section 1: \u00b6 Example step/section 2: \u00b6 Example step/section 3: \u00b6 Example section about results \u00b6 Further reading \u00b6 About the author \u00b6 Copy the template \u00b6 | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | --> # ## Blog content body ### Example step/section 1: ### Example step/section 2: ### Example step/section 3: ### Example section about results ## Further reading ## About the author ","title":"Blog template instructions"},{"location":"help/contributor/templates/template-blog/#blog-template-instructions","text":"An example template with best-practices that you can use to start drafting an entry to post on the KServe blog. Copy a version of this template without the instructions Include a commented-out table with tracking info about reviews and approvals: | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | -->","title":"Blog template instructions"},{"location":"help/contributor/templates/template-blog/#blog-content-body","text":" ","title":"Blog content body"},{"location":"help/contributor/templates/template-blog/#example-stepsection-1","text":"","title":"Example step/section 1:"},{"location":"help/contributor/templates/template-blog/#example-stepsection-2","text":"","title":"Example step/section 2:"},{"location":"help/contributor/templates/template-blog/#example-stepsection-3","text":"","title":"Example step/section 3:"},{"location":"help/contributor/templates/template-blog/#example-section-about-results","text":"","title":"Example section about results"},{"location":"help/contributor/templates/template-blog/#further-reading","text":"","title":"Further 
reading"},{"location":"help/contributor/templates/template-blog/#about-the-author","text":"","title":"About the author"},{"location":"help/contributor/templates/template-blog/#copy-the-template","text":" | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | --> # ## Blog content body ### Example step/section 1: ### Example step/section 2: ### Example step/section 3: ### Example section about results ## Further reading ## About the author ","title":"Copy the template"},{"location":"help/contributor/templates/template-concept/","text":"Concept Template \u00b6 Use this template when writing conceptual topics. Conceptual topics explain how things work or what things mean. They provide helpful context to readers. They do not include procedures. Template \u00b6 The following template includes the standard sections that should appear in conceptual topics, including a topic introduction sentence, an overview, and placeholders for additional sections and subsections. Copy and paste the markdown from the template to use it in your topic. This topic describes... Write a sentence or two that describes the topic itself, not the subject of the topic. The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic describes what KServe is and how it works.\" ## Overview Write a few sentences describing the subject of the topic. ## Section Title Write a sentence or two to describe the content in this section. Create more sections as necessary. Optionally, add two or more subsections to each section. Do not skip header levels: H2 >> H3, not H2 >> H4. ### Subsection Title Write a sentence or two to describe the content in this section. ### Subsection Title Write a sentence or two to describe the content in this section. Conceptual Content Samples \u00b6 This section provides common content types that appear in conceptual topics. Copy and paste the markdown to use it in your topic. 
Table \u00b6 Introduce the table with a sentence. For example, \u201cThe following table lists which features are available to a KServe supported ML framework.\u201d Markdown Table Template \u00b6 Header 1 Header 2 Data1 Data2 Data3 Data4 Ordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. Optionally, include bold lead-ins before each list item. Markdown Ordered List Templates \u00b6 Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3 Unordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item. Markdown Unordered List Template \u00b6 List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item Note \u00b6 Ensure the text beneath the note is indented as much as note is. Note This is a note. Warning \u00b6 If the note regards an issue that could lead to data loss, the note should be a warning. Warning This is a warning.","title":"Concept Template"},{"location":"help/contributor/templates/template-concept/#concept-template","text":"Use this template when writing conceptual topics. Conceptual topics explain how things work or what things mean. They provide helpful context to readers. They do not include procedures.","title":"Concept Template"},{"location":"help/contributor/templates/template-concept/#template","text":"The following template includes the standard sections that should appear in conceptual topics, including a topic introduction sentence, an overview, and placeholders for additional sections and subsections. Copy and paste the markdown from the template to use it in your topic. This topic describes... 
Write a sentence or two that describes the topic itself, not the subject of the topic. The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic describes what KServe is and how it works.\" ## Overview Write a few sentences describing the subject of the topic. ## Section Title Write a sentence or two to describe the content in this section. Create more sections as necessary. Optionally, add two or more subsections to each section. Do not skip header levels: H2 >> H3, not H2 >> H4. ### Subsection Title Write a sentence or two to describe the content in this section. ### Subsection Title Write a sentence or two to describe the content in this section.","title":"Template"},{"location":"help/contributor/templates/template-concept/#conceptual-content-samples","text":"This section provides common content types that appear in conceptual topics. Copy and paste the markdown to use it in your topic.","title":"Conceptual Content Samples"},{"location":"help/contributor/templates/template-concept/#table","text":"Introduce the table with a sentence. For example, \u201cThe following table lists which features are available to a KServe supported ML framework.\u201d","title":"Table"},{"location":"help/contributor/templates/template-concept/#markdown-table-template","text":"Header 1 Header 2 Data1 Data2 Data3 Data4","title":"Markdown Table Template"},{"location":"help/contributor/templates/template-concept/#ordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. 
Optionally, include bold lead-ins before each list item.","title":"Ordered List"},{"location":"help/contributor/templates/template-concept/#markdown-ordered-list-templates","text":"Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3","title":"Markdown Ordered List Templates"},{"location":"help/contributor/templates/template-concept/#unordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item.","title":"Unordered List"},{"location":"help/contributor/templates/template-concept/#markdown-unordered-list-template","text":"List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item","title":"Markdown Unordered List Template"},{"location":"help/contributor/templates/template-concept/#note","text":"Ensure the text beneath the note is indented as much as note is. Note This is a note.","title":"Note"},{"location":"help/contributor/templates/template-concept/#warning","text":"If the note regards an issue that could lead to data loss, the note should be a warning. Warning This is a warning.","title":"Warning"},{"location":"help/contributor/templates/template-procedure/","text":"Procedure template \u00b6 Use this template when writing procedural (how-to) topics. Procedural topics include detailed steps to perform a task as well as some context about the task. Template \u00b6 The following template includes the standard sections that should appear in procedural topics, including a topic sentence, an overview section, and sections for each task within the procedure. Copy and paste the markdown from the template to use it in your topic. This topic describes... Write a sentence or two that describes the topic itself, not the subject of the topic. 
The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic instructs how to serve a TensorFlow model.\" ## Overview Write a few sentences to describe the subject of the topic, if useful. For example, if the topic is about configuring a broker, you might provide some useful context about brokers. If there are multiple tasks in the procedure and they must be completed in order, create an ordered list that contains each task in the topic. Use bullets for sub-tasks. Include anchor links to the headings for each task. To [task]: 1. [Name of Task 1 (for example, Apply default configuration)](#task-1) 1. [Optional: Name of Task 2](#task-2) !!! note Unless the number of tasks in the procedure is particularly high, do not use numbered lead-ins in the task headings. For example, instead of \"Task 1: Apply default configuration\", use \"Apply default configuration\". ## Prerequisites Use one of the following formats for the Prerequisites section. ### Formatting for two or more prerequisites If there are two or more prerequisites, use the following format. Include links for more information, if necessary. Before you [task], you must have/do: * Prerequisite. See [Link](). * Prerequisite. See [Link](). For example: Before you deploy PyTorch model, you must have: * KServe. See [Installing the KServe](link-to-that-topic). * An Apache Kafka cluster. See [Link to Instructions to Download](link-to-that-topic). ### Format for one prerequisite If there is one prerequisite, use the following format. Include a link for more information, if necessary. Before you [task], you must have/do [prerequisite]. See [Link](link). For example: Before you create the `InferenceService`, you must have a Kubernetes cluster with KServe installed and DNS configured. See the [installation instructions](../../../install/README.md) if you need to create one. ## Task 1 Write a few sentences to describe the task and provide additional context on the task. 
!!! note When writing a single-step procedure, write the step in one sentence and make it a bullet. The signposting is important given readers are strongly inclined to look for numbered steps and bullet points when searching for instructions. If possible, expand the procedure to include at least one more step. Few procedures truly require a single step. [Task]: 1. Step 1 1. Step 2 ## Optional: Task 2 If the task is optional, put \"Optional:\" in the heading. Write a few sentences to describe the task and provide additional context on the task. [Task]: 1. Step 1 2. Step 2 Procedure Content Samples \u00b6 This section provides common content types that appear in procedural topics. Copy and paste the markdown to use it in your topic. \u201cFill-in-the-Fields\u201d Table \u00b6 Where the reader must enter many values in, for example, a YAML file, use a table within the procedure as follows: Open the YAML file. Key1 : Value1 Key2 : Value2 metadata : annotations : # case-sensitive Key3 : Value3 Key4 : Value4 Key5 : Value5 spec : # Configuration specific to this broker. config : Key6 : Value6 Change the relevant values to your needs, using the following table as a guide. Key Value Type Description Key1 String Description Key2 Integer Description Key3 String Description Key4 String Description Key5 Float Description Key6 String Description Table \u00b6 Introduce the table with a sentence. For example, \u201cThe following table lists which features are available to a KServe supported ML framework. Markdown Table Template \u00b6 Header 1 Header 2 Data1 Data2 Data3 Data4 Ordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. Optionally, include bold lead-ins before each list item. 
Markdown Ordered List Templates \u00b6 Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3 Unordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item. Markdown Unordered List Template \u00b6 List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item Note \u00b6 Ensure the text beneath the note is indented as much as note is. Note This is a note. Warning \u00b6 If the note regards an issue that could lead to data loss, the note should be a warning. Warning This is a warning. Markdown Embedded Image \u00b6 The following is an embedded image reference in markdown. Tabs \u00b6 Place multiple versions of the same procedure (such as a CLI procedure vs a YAML procedure) within tabs. Indent the opening tabs tags 3 spaces to make the tabs display properly. == \"tab1 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step. == \"tab2 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step. Documenting Code and Code Snippets \u00b6 For instructions on how to format code and code snippets, see the Style Guide.","title":"Procedure template"},{"location":"help/contributor/templates/template-procedure/#procedure-template","text":"Use this template when writing procedural (how-to) topics. Procedural topics include detailed steps to perform a task as well as some context about the task.","title":"Procedure template"},{"location":"help/contributor/templates/template-procedure/#template","text":"The following template includes the standard sections that should appear in procedural topics, including a topic sentence, an overview section, and sections for each task within the procedure. 
Copy and paste the markdown from the template to use it in your topic. This topic describes... Write a sentence or two that describes the topic itself, not the subject of the topic. The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic instructs how to serve a TensorFlow model.\" ## Overview Write a few sentences to describe the subject of the topic, if useful. For example, if the topic is about configuring a broker, you might provide some useful context about brokers. If there are multiple tasks in the procedure and they must be completed in order, create an ordered list that contains each task in the topic. Use bullets for sub-tasks. Include anchor links to the headings for each task. To [task]: 1. [Name of Task 1 (for example, Apply default configuration)](#task-1) 1. [Optional: Name of Task 2](#task-2) !!! note Unless the number of tasks in the procedure is particularly high, do not use numbered lead-ins in the task headings. For example, instead of \"Task 1: Apply default configuration\", use \"Apply default configuration\". ## Prerequisites Use one of the following formats for the Prerequisites section. ### Formatting for two or more prerequisites If there are two or more prerequisites, use the following format. Include links for more information, if necessary. Before you [task], you must have/do: * Prerequisite. See [Link](). * Prerequisite. See [Link](). For example: Before you deploy PyTorch model, you must have: * KServe. See [Installing the KServe](link-to-that-topic). * An Apache Kafka cluster. See [Link to Instructions to Download](link-to-that-topic). ### Format for one prerequisite If there is one prerequisite, use the following format. Include a link for more information, if necessary. Before you [task], you must have/do [prerequisite]. See [Link](link). For example: Before you create the `InferenceService`, you must have a Kubernetes cluster with KServe installed and DNS configured. 
See the [installation instructions](../../../install/README.md) if you need to create one. ## Task 1 Write a few sentences to describe the task and provide additional context on the task. !!! note When writing a single-step procedure, write the step in one sentence and make it a bullet. The signposting is important given readers are strongly inclined to look for numbered steps and bullet points when searching for instructions. If possible, expand the procedure to include at least one more step. Few procedures truly require a single step. [Task]: 1. Step 1 1. Step 2 ## Optional: Task 2 If the task is optional, put \"Optional:\" in the heading. Write a few sentences to describe the task and provide additional context on the task. [Task]: 1. Step 1 2. Step 2","title":"Template"},{"location":"help/contributor/templates/template-procedure/#procedure-content-samples","text":"This section provides common content types that appear in procedural topics. Copy and paste the markdown to use it in your topic.","title":"Procedure Content Samples"},{"location":"help/contributor/templates/template-procedure/#fill-in-the-fields-table","text":"Where the reader must enter many values in, for example, a YAML file, use a table within the procedure as follows: Open the YAML file. Key1 : Value1 Key2 : Value2 metadata : annotations : # case-sensitive Key3 : Value3 Key4 : Value4 Key5 : Value5 spec : # Configuration specific to this broker. config : Key6 : Value6 Change the relevant values to your needs, using the following table as a guide. Key Value Type Description Key1 String Description Key2 Integer Description Key3 String Description Key4 String Description Key5 Float Description Key6 String Description","title":"\u201cFill-in-the-Fields\u201d Table"},{"location":"help/contributor/templates/template-procedure/#table","text":"Introduce the table with a sentence. 
For example, \u201cThe following table lists which features are available to a KServe supported ML framework.\u201d","title":"Table"},{"location":"help/contributor/templates/template-procedure/#markdown-table-template","text":"Header 1 Header 2 Data1 Data2 Data3 Data4","title":"Markdown Table Template"},{"location":"help/contributor/templates/template-procedure/#ordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. Optionally, include bold lead-ins before each list item.","title":"Ordered List"},{"location":"help/contributor/templates/template-procedure/#markdown-ordered-list-templates","text":"Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3","title":"Markdown Ordered List Templates"},{"location":"help/contributor/templates/template-procedure/#unordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item.","title":"Unordered List"},{"location":"help/contributor/templates/template-procedure/#markdown-unordered-list-template","text":"List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item","title":"Markdown Unordered List Template"},{"location":"help/contributor/templates/template-procedure/#note","text":"Ensure the text beneath the note is indented as much as the note is. Note This is a note.","title":"Note"},{"location":"help/contributor/templates/template-procedure/#warning","text":"If the note regards an issue that could lead to data loss, the note should be a warning. 
Warning This is a warning.","title":"Warning"},{"location":"help/contributor/templates/template-procedure/#markdown-embedded-image","text":"The following is an embedded image reference in markdown.","title":"Markdown Embedded Image"},{"location":"help/contributor/templates/template-procedure/#tabs","text":"Place multiple versions of the same procedure (such as a CLI procedure vs a YAML procedure) within tabs. Indent the opening tabs tags 3 spaces to make the tabs display properly. == \"tab1 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step. == \"tab2 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step.","title":"Tabs"},{"location":"help/contributor/templates/template-procedure/#documenting-code-and-code-snippets","text":"For instructions on how to format code and code snippets, see the Style Guide.","title":"Documenting Code and Code Snippets"},{"location":"help/contributor/templates/template-troubleshooting/","text":"Troubleshooting template \u00b6 When writing guidance to help to troubleshoot specific errors, the error must include: Error Description: To describe the error very briefly so that users can search for it easily. Symptom: To describe the error in a way that helps users to diagnose their issue. Include error messages or anything else users might see if they encounter this error. Explanation (or cause): To inform users about why they are seeing this error. This can be omitted if the cause of the error is unknown. Solution: To inform the user about how to fix the error. Example Troubleshooting Table \u00b6 Troubleshooting \u00b6 | Error Description | |----------|------------| | Symptom | During the event something breaks. | | Cause | The thing is broken. | | Solution | To solve this issue, do the following: 1. This. 2. That. 
|","title":"Troubleshooting template"},{"location":"help/contributor/templates/template-troubleshooting/#troubleshooting-template","text":"When writing guidance to help to troubleshoot specific errors, the error must include: Error Description: To describe the error very briefly so that users can search for it easily. Symptom: To describe the error in a way that helps users to diagnose their issue. Include error messages or anything else users might see if they encounter this error. Explanation (or cause): To inform users about why they are seeing this error. This can be omitted if the cause of the error is unknown. Solution: To inform the user about how to fix the error.","title":"Troubleshooting template"},{"location":"help/contributor/templates/template-troubleshooting/#example-troubleshooting-table","text":"","title":"Example Troubleshooting Table"},{"location":"help/contributor/templates/template-troubleshooting/#troubleshooting","text":"| Error Description | |----------|------------| | Symptom | During the event something breaks. | | Cause | The thing is broken. | | Solution | To solve this issue, do the following: 1. This. 2. That. |","title":"Troubleshooting"},{"location":"help/style-guide/documenting-code/","text":"Documenting Code \u00b6 Words requiring code formatting \u00b6 Apply code formatting only to special-purpose text: Filenames Path names Fields and values from a YAML file Any text that goes into a CLI CLI names Specify the programming language \u00b6 Specify the language your code is in as part of the code block Specify non-language specific code, like CLI commands, with ```bash. See the following examples for formatting. Correct Incorrect Correct Formatting Incorrect Formatting package main import \"fmt\" func main () { fmt . 
Println ( \"hello world\" ) } package main import \"fmt\" func main () { fmt.Println ( \"hello world\" ) } ```go package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ``` ```bash package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ``` Documenting YAML \u00b6 When documenting YAML, use two steps. Use step 1 to create the YAML file, and step 2 to apply the YAML file. Use kubectl apply for files/objects that the user creates: it works for both \u201ccreate\u201d and \u201cupdate\u201d, and the source of truth is their local files. Use kubectl edit for files which are shipped as part of the KServe software, like the KServe ConfigMaps. Write ```yaml at the beginning of your code block if you are typing YAML code as part of a CLI command. Correct Incorrect Creating or updating a resource: Create a YAML file using the following template: # YAML FILE CONTENTS Apply the YAML file by running the command: kubectl apply -f .yaml Where is the name of the file you created in the previous step. Editing a ConfigMap: kubectl -n edit configmap Example 1: cat < is\u2026\" Single variable \u00b6 Correct Incorrect kubectl get isvc Where is the name of your InferenceService. kubectl get isvc { SERVICE_NAME } {SERVICE_NAME} = The name of your service Multiple variables \u00b6 Correct Incorrect kn create service --revision-name Where: is the name of your Knative Service. is the desired name of your revision. kn create service --revision-name Where is the name of your Knative Service. Where is the desired name of your revision. 
CLI output \u00b6 CLI Output should include the custom css \"{ .bash .no-copy }\" in place of \"bash\" which removes the \"Copy to clipboard button\" on the right side of the code block Correct Incorrect Correct Formatting Incorrect Formatting ```{ .bash .no-copy } ``` ```bash ```","title":"Documenting Code"},{"location":"help/style-guide/documenting-code/#documenting-code","text":"","title":"Documenting Code"},{"location":"help/style-guide/documenting-code/#words-requiring-code-formatting","text":"Apply code formatting only to special-purpose text: Filenames Path names Fields and values from a YAML file Any text that goes into a CLI CLI names","title":"Words requiring code formatting"},{"location":"help/style-guide/documenting-code/#specify-the-programming-language","text":"Specify the language your code is in as part of the code block Specify non-language specific code, like CLI commands, with ```bash. See the following examples for formatting. Correct Incorrect Correct Formatting Incorrect Formatting package main import \"fmt\" func main () { fmt . Println ( \"hello world\" ) } package main import \"fmt\" func main () { fmt.Println ( \"hello world\" ) } ```go package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ``` ```bash package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ```","title":"Specify the programming language"},{"location":"help/style-guide/documenting-code/#documenting-yaml","text":"When documenting YAML, use two steps. Use step 1 to create the YAML file, and step 2 to apply the YAML file. Use kubectl apply for files/objects that the user creates: it works for both \u201ccreate\u201d and \u201cupdate\u201d, and the source of truth is their local files. Use kubectl edit for files which are shipped as part of the KServe software, like the KServe ConfigMaps. Write ```yaml at the beginning of your code block if you are typing YAML code as part of a CLI command. 
Correct Incorrect Creating or updating a resource: Create a YAML file using the following template: # YAML FILE CONTENTS Apply the YAML file by running the command: kubectl apply -f .yaml Where is the name of the file you created in the previous step. Editing a ConfigMap: kubectl -n edit configmap Example 1: cat < is\u2026\"","title":"Referencing variables in code blocks"},{"location":"help/style-guide/documenting-code/#single-variable","text":"Correct Incorrect kubectl get isvc Where is the name of your InferenceService. kubectl get isvc { SERVICE_NAME } {SERVICE_NAME} = The name of your service","title":"Single variable"},{"location":"help/style-guide/documenting-code/#multiple-variables","text":"Correct Incorrect kn create service --revision-name Where: is the name of your Knative Service. is the desired name of your revision. kn create service --revision-name Where is the name of your Knative Service. Where is the desired name of your revision.","title":"Multiple variables"},{"location":"help/style-guide/documenting-code/#cli-output","text":"CLI Output should include the custom css \"{ .bash .no-copy }\" in place of \"bash\" which removes the \"Copy to clipboard button\" on the right side of the code block Correct Incorrect Correct Formatting Incorrect Formatting ```{ .bash .no-copy } ``` ```bash ```","title":"CLI output"},{"location":"help/style-guide/style-and-formatting/","text":"Formatting standards and conventions \u00b6 Titles and headings \u00b6 Use sentence case for titles and headings \u00b6 Only capitalize proper nouns, acronyms, and the first word of the heading. 
Correct Incorrect ## Configure the feature ## Configure the Feature ### Using feature ### Using Feature ### Using HTTPS ### Using https Do not use code formatting inside headings \u00b6 Correct Incorrect ## Configure the class annotation ## Configure the `class` annotation Use imperatives for headings of procedures \u00b6 For consistency, brevity, and to better signpost where action is expected of the reader, make procedure headings imperatives. Correct Incorrect ## Install KServe ## Installation of KServe ### Configure DNS ### Configuring DNS ## Verify the installation ## How to verify the installation Links \u00b6 Describe what the link targets \u00b6 Correct Incorrect For an explanation of what makes a good hyperlink, see this article . See this article here . Write links in Markdown, not HTML \u00b6 Correct Incorrect [Kafka Broker](../kafka-broker/README.md) Kafka Broker [Kafka Broker](../kafka-broker/README.md){target=_blank} Kafka Broker Include the .md extension in internal links \u00b6 Correct Incorrect [Setting up a custom domain](../serving/using-a-custom-domain.md) [Setting up a custom domain](../serving/using-a-custom-domain) Link to files, not folders \u00b6 Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/) Ensure the letter case is correct \u00b6 Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/readme.md) Formatting \u00b6 Use nonbreaking spaces in units of measurement other than percent \u00b6 For most units of measurement, when you specify a number with the unit, use a nonbreaking space between the number and the unit. Don't use spacing when the unit of measurement is percent. 
Correct Incorrect 3   GB 3 GB 4   CPUs 4 CPUs 14% 14   % Use bold for user interface elements \u00b6 Correct Incorrect Click Fork Click \"Fork\" Select Other Select \"Other\" Use tables for definition lists \u00b6 When listing terms and their definitions, use table formatting instead of definition list formatting. Correct Incorrect |Value |Description | |------|---------------------| |Value1|Description of Value1| |Value2|Description of Value2| Value1 : Description of Value1 Value2 : Description of Value2 General style \u00b6 Use upper camel case for KServe API objects \u00b6 Correct Incorrect Explainers explainers Transformer transformer InferenceService Inference Service Only use parentheses for acronym explanations \u00b6 Put an acronym inside parentheses after its explanation. Don\u2019t use parentheses for anything else. Parenthetical statements especially should be avoided because readers skip them. If something is important enough to be in the sentence, it should be fully part of that sentence. Correct Incorrect Custom Resource Definition (CRD) Check your CLI (you should see it there) Knative Serving creates a Revision Knative creates a Revision (a stateless, snapshot in time of your code and configuration) Use the international standard for punctuation inside quotes \u00b6 Correct Incorrect Events are recorded with an associated \"stage\". Events are recorded with an associated \"stage.\" The copy is called a \"fork\". The copy is called a \"fork.\"","title":"Formatting standards and conventions"},{"location":"help/style-guide/style-and-formatting/#formatting-standards-and-conventions","text":"","title":"Formatting standards and conventions"},{"location":"help/style-guide/style-and-formatting/#titles-and-headings","text":"","title":"Titles and headings"},{"location":"help/style-guide/style-and-formatting/#use-sentence-case-for-titles-and-headings","text":"Only capitalize proper nouns, acronyms, and the first word of the heading. 
Correct Incorrect ## Configure the feature ## Configure the Feature ### Using feature ### Using Feature ### Using HTTPS ### Using https","title":"Use sentence case for titles and headings"},{"location":"help/style-guide/style-and-formatting/#do-not-use-code-formatting-inside-headings","text":"Correct Incorrect ## Configure the class annotation ## Configure the `class` annotation","title":"Do not use code formatting inside headings"},{"location":"help/style-guide/style-and-formatting/#use-imperatives-for-headings-of-procedures","text":"For consistency, brevity, and to better signpost where action is expected of the reader, make procedure headings imperatives. Correct Incorrect ## Install KServe ## Installation of KServe ### Configure DNS ### Configuring DNS ## Verify the installation ## How to verify the installation","title":"Use imperatives for headings of procedures"},{"location":"help/style-guide/style-and-formatting/#links","text":"","title":"Links"},{"location":"help/style-guide/style-and-formatting/#describe-what-the-link-targets","text":"Correct Incorrect For an explanation of what makes a good hyperlink, see this article . 
See this article here .","title":"Describe what the link targets"},{"location":"help/style-guide/style-and-formatting/#write-links-in-markdown-not-html","text":"Correct Incorrect [Kafka Broker](../kafka-broker/README.md) Kafka Broker [Kafka Broker](../kafka-broker/README.md){target=_blank} Kafka Broker","title":"Write links in Markdown, not HTML"},{"location":"help/style-guide/style-and-formatting/#include-the-md-extension-in-internal-links","text":"Correct Incorrect [Setting up a custom domain](../serving/using-a-custom-domain.md) [Setting up a custom domain](../serving/using-a-custom-domain)","title":"Include the .md extension in internal links"},{"location":"help/style-guide/style-and-formatting/#link-to-files-not-folders","text":"Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/)","title":"Link to files, not folders"},{"location":"help/style-guide/style-and-formatting/#ensure-the-letter-case-is-correct","text":"Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/readme.md)","title":"Ensure the letter case is correct"},{"location":"help/style-guide/style-and-formatting/#formatting","text":"","title":"Formatting"},{"location":"help/style-guide/style-and-formatting/#use-nonbreaking-spaces-in-units-of-measurement-other-than-percent","text":"For most units of measurement, when you specify a number with the unit, use a nonbreaking space between the number and the unit. Don't use spacing when the unit of measurement is percent. 
Correct Incorrect 3   GB 3 GB 4   CPUs 4 CPUs 14% 14   %","title":"Use nonbreaking spaces in units of measurement other than percent"},{"location":"help/style-guide/style-and-formatting/#use-bold-for-user-interface-elements","text":"Correct Incorrect Click Fork Click \"Fork\" Select Other Select \"Other\"","title":"Use bold for user interface elements"},{"location":"help/style-guide/style-and-formatting/#use-tables-for-definition-lists","text":"When listing terms and their definitions, use table formatting instead of definition list formatting. Correct Incorrect |Value |Description | |------|---------------------| |Value1|Description of Value1| |Value2|Description of Value2| Value1 : Description of Value1 Value2 : Description of Value2","title":"Use tables for definition lists"},{"location":"help/style-guide/style-and-formatting/#general-style","text":"","title":"General style"},{"location":"help/style-guide/style-and-formatting/#use-upper-camel-case-for-kserve-api-objects","text":"Correct Incorrect Explainers explainers Transformer transformer InferenceService Inference Service","title":"Use upper camel case for KServe API objects"},{"location":"help/style-guide/style-and-formatting/#only-use-parentheses-for-acronym-explanations","text":"Put an acronym inside parentheses after its explanation. Don\u2019t use parentheses for anything else. Parenthetical statements especially should be avoided because readers skip them. If something is important enough to be in the sentence, it should be fully part of that sentence. 
Correct Incorrect Custom Resource Definition (CRD) Check your CLI (you should see it there) Knative Serving creates a Revision Knative creates a Revision (a stateless, snapshot in time of your code and configuration)","title":"Only use parentheses for acronym explanations"},{"location":"help/style-guide/style-and-formatting/#use-the-international-standard-for-punctuation-inside-quotes","text":"Correct Incorrect Events are recorded with an associated \"stage\". Events are recorded with an associated \"stage.\" The copy is called a \"fork\". The copy is called a \"fork.\"","title":"Use the international standard for punctuation inside quotes"},{"location":"help/style-guide/voice-and-language/","text":"Voice and language \u00b6 Use present tense \u00b6 Correct Incorrect This command starts a proxy. This command will start a proxy. Use active voice \u00b6 Correct Incorrect You can explore the API using a browser. The API can be explored using a browser. The YAML file specifies the replica count. The replica count is specified in the YAML file. Use simple and direct language \u00b6 Use simple and direct language. Avoid using unnecessary words, such as \"please\". Correct Incorrect To create a ReplicaSet , ... In order to create a ReplicaSet , ... See the configuration file. Please see the configuration file. View the Pods. With this next command, we'll view the Pods. Address the reader as \"you\", not \"we\" \u00b6 Correct Incorrect You can create a Deployment by ... We can create a Deployment by ... In the preceding output, you can see... In the preceding output, we can see ... This page teaches you how to use pods. In this page, we are going to learn about pods. Avoid jargon, idioms, and Latin \u00b6 Some readers speak English as a second language. Avoid jargon, idioms, and Latin to help make their understanding easier. Correct Incorrect Internally, ... Under the hood, ... Create a new cluster. Turn up a new cluster. Initially, ... Out of the box, ... For example, ... 
e.g., ... Enter through the gateway ... Enter via the gateway ... Avoid statements about the future \u00b6 Avoid making promises or giving hints about the future. If you need to talk about a feature in development, add a boilerplate under the front matter that identifies the information accordingly. Avoid statements that will soon be out of date \u00b6 Avoid using wording that becomes outdated quickly like \"currently\" and \"new\". A feature that is new today is not new for long. Correct Incorrect In version 1.4, ... In the current version, ... The Federation feature provides ... The new Federation feature provides ... Avoid words that assume a specific level of understanding \u00b6 Avoid words such as \"just\", \"simply\", \"easy\", \"easily\", or \"simple\". These words do not add value. Correct Incorrect Include one command in ... Include just one command in ... Run the container ... Simply run the container ... You can remove ... You can easily remove ... These steps ... These simple steps ...","title":"Voice and language"},{"location":"help/style-guide/voice-and-language/#voice-and-language","text":"","title":"Voice and language"},{"location":"help/style-guide/voice-and-language/#use-present-tense","text":"Correct Incorrect This command starts a proxy. This command will start a proxy.","title":"Use present tense"},{"location":"help/style-guide/voice-and-language/#use-active-voice","text":"Correct Incorrect You can explore the API using a browser. The API can be explored using a browser. The YAML file specifies the replica count. The replica count is specified in the YAML file.","title":"Use active voice"},{"location":"help/style-guide/voice-and-language/#use-simple-and-direct-language","text":"Use simple and direct language. Avoid using unnecessary words, such as \"please\". Correct Incorrect To create a ReplicaSet , ... In order to create a ReplicaSet , ... See the configuration file. Please see the configuration file. View the Pods. 
With this next command, we'll view the Pods.","title":"Use simple and direct language"},{"location":"help/style-guide/voice-and-language/#address-the-reader-as-you-not-we","text":"Correct Incorrect You can create a Deployment by ... We can create a Deployment by ... In the preceding output, you can see... In the preceding output, we can see ... This page teaches you how to use pods. In this page, we are going to learn about pods.","title":"Address the reader as \"you\", not \"we\""},{"location":"help/style-guide/voice-and-language/#avoid-jargon-idioms-and-latin","text":"Some readers speak English as a second language. Avoid jargon, idioms, and Latin to help make their understanding easier. Correct Incorrect Internally, ... Under the hood, ... Create a new cluster. Turn up a new cluster. Initially, ... Out of the box, ... For example, ... e.g., ... Enter through the gateway ... Enter via the gateway ...","title":"Avoid jargon, idioms, and Latin"},{"location":"help/style-guide/voice-and-language/#avoid-statements-about-the-future","text":"Avoid making promises or giving hints about the future. If you need to talk about a feature in development, add a boilerplate under the front matter that identifies the information accordingly.","title":"Avoid statements about the future"},{"location":"help/style-guide/voice-and-language/#avoid-statements-that-will-soon-be-out-of-date","text":"Avoid using wording that becomes outdated quickly like \"currently\" and \"new\". A feature that is new today is not new for long. Correct Incorrect In version 1.4, ... In the current version, ... The Federation feature provides ... The new Federation feature provides ...","title":"Avoid statements that will soon be out of date"},{"location":"help/style-guide/voice-and-language/#avoid-words-that-assume-a-specific-level-of-understanding","text":"Avoid words such as \"just\", \"simply\", \"easy\", \"easily\", or \"simple\". These words do not add value. 
Correct Incorrect Include one command in ... Include just one command in ... Run the container ... Simply run the container ... You can remove ... You can easily remove ... These steps ... These simple steps ...","title":"Avoid words that assume a specific level of understanding"},{"location":"modelserving/control_plane/","text":"Control Plane \u00b6 KServe Control Plane : Responsible for reconciling the InferenceService custom resources. It creates the Knative serverless deployment for predictor, transformer, explainer to enable autoscaling based on incoming request workload including scaling down to zero when no traffic is received. When raw deployment mode is enabled, control plane creates Kubernetes deployment, service, ingress, HPA. Control Plane Components \u00b6 KServe Controller : Responsible for creating service, ingress resources, model server container and model agent container for request/response logging , batching and model pulling. Ingress Gateway : Gateway for routing external or internal requests. In Serverless Mode: Knative Serving Controller : Responsible for service revision management, creating network routing resources, serverless container with queue proxy to expose traffic metrics and enforce concurrency limit. Knative Activator : Brings back scaled-to-zero pods and forwards requests. Knative Autoscaler(KPA) : Watches traffic flow to the application, and scales replicas up or down based on configured metrics.","title":"Model Serving Control Plane"},{"location":"modelserving/control_plane/#control-plane","text":"KServe Control Plane : Responsible for reconciling the InferenceService custom resources. It creates the Knative serverless deployment for predictor, transformer, explainer to enable autoscaling based on incoming request workload including scaling down to zero when no traffic is received. 
When raw deployment mode is enabled, control plane creates Kubernetes deployment, service, ingress, HPA.","title":"Control Plane"},{"location":"modelserving/control_plane/#control-plane-components","text":"KServe Controller : Responsible for creating service, ingress resources, model server container and model agent container for request/response logging , batching and model pulling. Ingress Gateway : Gateway for routing external or internal requests. In Serverless Mode: Knative Serving Controller : Responsible for service revision management, creating network routing resources, serverless container with queue proxy to expose traffic metrics and enforce concurrency limit. Knative Activator : Brings back scaled-to-zero pods and forwards requests. Knative Autoscaler(KPA) : Watches traffic flow to the application, and scales replicas up or down based on configured metrics.","title":"Control Plane Components"},{"location":"modelserving/servingruntimes/","text":"Serving Runtimes \u00b6 KServe makes use of two CRDs for defining model serving environments: ServingRuntimes and ClusterServingRuntimes The only difference between the two is that one is namespace-scoped and the other is cluster-scoped. A ServingRuntime defines the templates for Pods that can serve one or more particular model formats. Each ServingRuntime defines key information such as the container image of the runtime and a list of the model formats that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification. These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace. 
The following is an example of a ServingRuntime: apiVersion : serving.kserve.io/v1alpha1 kind : ServingRuntime metadata : name : example-runtime spec : supportedModelFormats : - name : example-format version : \"1\" autoSelect : true containers : - name : kserve-container image : examplemodelserver:latest args : - --model_dir=/mnt/models - --http_port=8080 Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can quickly deploy common model formats without having to define the runtimes themselves. Name Supported Model Formats kserve-lgbserver LightGBM kserve-mlserver SKLearn, XGBoost, LightGBM, MLflow kserve-paddleserver Paddle kserve-pmmlserver PMML kserve-sklearnserver SKLearn kserve-tensorflow-serving TensorFlow kserve-torchserve PyTorch kserve-tritonserver TensorFlow, ONNX, PyTorch, TensorRT kserve-xgbserver XGBoost In addition to these included runtimes, you can extend your KServe installation by adding custom runtimes. This is demonstrated in the example for the AMD Inference Server . Spec Attributes \u00b6 Available attributes in the ServingRuntime spec: Attribute Description multiModel Whether this ServingRuntime is ModelMesh-compatible and intended for multi-model usage (as opposed to KServe single-model serving). 
Defaults to false disabled Disables this runtime containers List of containers associated with the runtime containers[ ].image The container image for the current container containers[ ].command Executable command found in the provided image containers[ ].args List of command line arguments as strings containers[ ].resources Kubernetes limits or requests containers[ ].env List of environment variables to pass to the container containers[ ].imagePullPolicy The container image pull policy containers[ ].workingDir The working directory for current container containers[ ].livenessProbe Probe for checking container liveness containers[ ].readinessProbe Probe for checking container readiness supportedModelFormats List of model types supported by the current runtime supportedModelFormats[ ].name Name of the model format supportedModelFormats[ ].version Version of the model format. Used in validating that a predictor is supported by a runtime. It is recommended to include only the major version here, for example \"1\" rather than \"1.15.4\" supportedModelFormats[ ].autoselect Set to true to allow the ServingRuntime to be used for automatic model placement if this model format is specified with no explicit runtime. The default value is false. supportedModelFormats[ ].priority Priority of this serving runtime for auto selection. This is used to select the serving runtime if more than one serving runtime supports the same model format. The value should be greater than zero. The higher the value, the higher the priority. Priority is not considered if AutoSelect is either false or not specified. Priority can be overridden by specifying the runtime in the InferenceService. storageHelper.disabled Disables the storage helper nodeSelector Influence Kubernetes scheduling to assign pods to nodes affinity Influence Kubernetes scheduling to assign pods to nodes tolerations Allow pods to be scheduled onto nodes with matching taints ModelMesh leverages additional fields not listed here. 
More information here . Note: ServingRuntimes support the use of template variables of the form {{.Variable}} inside the container spec. These should map to fields inside an InferenceService's metadata object . The primary use of this is for passing in InferenceService-specific information, such as a name, to the runtime environment. Several of the out-of-the-box ClusterServingRuntimes make use of this by having --model_name={{.Name}} inside the runtime container args to ensure that when a user deploys an InferenceService, the name is passed to the server. Using ServingRuntimes \u00b6 ServingRuntimes can be used both explicitly and implicitly. Explicit: Specify a runtime \u00b6 When users define predictors in their InferenceServices, they can explicitly specify the name of a ClusterServingRuntime or ServingRuntime . For example: apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib runtime : kserve-mlserver Here, the runtime specified is kserve-mlserver , so the KServe controller will first search the namespace for a ServingRuntime with that name. If none exists, the controller will then search the list of ClusterServingRuntimes. If one is found, the controller will first verify that the modelFormat provided in the predictor is in the list of supportedModelFormats . If it is, then the container and pod information provided by the runtime will be used for model deployment. Implicit: Automatic selection \u00b6 In each entry of the supportedModelFormats list, autoSelect: true can optionally be specified to indicate that the given ServingRuntime can be considered for automatic selection for predictors with the corresponding model format if no runtime is explicitly specified. 
For example, the kserve-sklearnserver ClusterServingRuntime supports SKLearn version 1 and has autoSelect enabled: apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : kserve-sklearnserver spec : supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true ... When the following InferenceService is deployed with no runtime specified, the controller will look for a runtime that supports sklearn : apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib Since kserve-sklearnserver has an entry in its supportedModelFormats list with sklearn and autoSelect: true , this ClusterServingRuntime will be used for model deployment. If a version is also specified: ... spec : predictor : model : modelFormat : name : sklearn version : \"0\" ... Then the version of the supportedModelFormat must also match. In this example, kserve-sklearnserver would not be eligible for selection since it only lists support for sklearn version 1 . Priority \u00b6 If more than one serving runtime supports the same model format with the same version and the same protocolVersion, you can optionally specify a priority for the serving runtime. The runtime is then selected automatically based on the priority if no runtime is explicitly specified. Note that priority is valid only if autoSelect is true . A higher value means higher priority. For example, consider the serving runtimes mlserver and kserve-sklearnserver . Both serving runtimes support the sklearn model format with version 1 and both support the protocolVersion v2. Also note that autoSelect is enabled in both serving runtimes. 
apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : kserve-sklearnserver spec : protocolVersions : - v1 - v2 supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true priority : 1 ... apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : mlserver spec : protocolVersions : - v2 supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true priority : 2 ... When the following InferenceService is deployed with no runtime specified, the controller will look for a runtime that supports sklearn : apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : protocolVersion : v2 modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib The controller will find the two runtimes kserve-sklearnserver and mlserver as both have an entry in their supportedModelFormats list with sklearn and autoSelect: true . Because more than one supported runtime is available, the controller sorts the runtimes by priority. Since mlserver has the higher priority value, this ClusterServingRuntime will be used for model deployment. Constraints of priority A higher priority value means higher precedence. The value must be greater than 0. The priority is valid only if auto select is enabled; otherwise the priority is not considered. A serving runtime with a priority takes precedence over a serving runtime with no priority specified. Two model formats with the same name and the same model version cannot have the same priority. If more than one serving runtime supports the model format and none of them specifies a priority, there is no guarantee which runtime will be selected. If multiple versions of a modelFormat are supported by a serving runtime, then each version should have the same priority. For example, the serving runtime shown below supports two versions of sklearn, and both entries should have the same priority. 
apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : mlserver spec : protocolVersions : - v2 supportedModelFormats : - name : sklearn version : \"0\" autoSelect : true priority : 2 - name : sklearn version : \"1\" autoSelect : true priority : 2 ... Warning If multiple runtimes list the same format and/or version as auto-selectable and the priority is not specified, the runtime is selected based on the creationTimestamp, that is, the most recently created runtime is selected. As a result, there is no guarantee which runtime will be selected, so users and cluster administrators should enable autoSelect with care. Previous schema \u00b6 If you use the old schema for deploying predictors, where you specify a framework/format as a key, a KServe webhook automatically maps it to one of the out-of-the-box ClusterServingRuntimes . This is for backwards compatibility. For example: Previous Schema Equivalent New Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : sklearn : storageUri : s3://bucket/sklearn/mnist.joblib apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib runtime : kserve-sklearnserver The previous schema would mutate into the new schema where the kserve-sklearnserver ClusterServingRuntime is explicitly specified. Warning The old schema will eventually be removed in favor of the new Model spec, where a user can specify a model format and optionally a corresponding version. In previous versions of KServe, supported predictor formats and container images were defined in a ConfigMap in the control plane namespace. 
Existing InferenceServices upgraded from v0.7, v0.8, v0.9 need to be converted to the new model spec as the predictor configurations are phased out in v0.10.","title":"Serving Runtimes"},{"location":"modelserving/servingruntimes/#serving-runtimes","text":"KServe makes use of two CRDs for defining model serving environments: ServingRuntimes and ClusterServingRuntimes The only difference between the two is that one is namespace-scoped and the other is cluster-scoped. A ServingRuntime defines the templates for Pods that can serve one or more particular model formats. Each ServingRuntime defines key information such as the container image of the runtime and a list of the model formats that the runtime supports. Other configuration settings for the runtime can be conveyed through environment variables in the container specification. These CRDs allow for improved flexibility and extensibility, enabling users to quickly define or customize reusable runtimes without having to modify any controller code or any resources in the controller namespace. The following is an example of a ServingRuntime: apiVersion : serving.kserve.io/v1alpha1 kind : ServingRuntime metadata : name : example-runtime spec : supportedModelFormats : - name : example-format version : \"1\" autoSelect : true containers : - name : kserve-container image : examplemodelserver:latest args : - --model_dir=/mnt/models - --http_port=8080 Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can quickly deploy common model formats without having to define the runtimes themselves. 
Name Supported Model Formats kserve-lgbserver LightGBM kserve-mlserver SKLearn, XGBoost, LightGBM, MLflow kserve-paddleserver Paddle kserve-pmmlserver PMML kserve-sklearnserver SKLearn kserve-tensorflow-serving TensorFlow kserve-torchserve PyTorch kserve-tritonserver TensorFlow, ONNX, PyTorch, TensorRT kserve-xgbserver XGBoost In addition to these included runtimes, you can extend your KServe installation by adding custom runtimes. This is demonstrated in the example for the AMD Inference Server .","title":"Serving Runtimes"},{"location":"modelserving/servingruntimes/#spec-attributes","text":"Available attributes in the ServingRuntime spec: Attribute Description multiModel Whether this ServingRuntime is ModelMesh-compatible and intended for multi-model usage (as opposed to KServe single-model serving). Defaults to false disabled Disables this runtime containers List of containers associated with the runtime containers[ ].image The container image for the current container containers[ ].command Executable command found in the provided image containers[ ].args List of command line arguments as strings containers[ ].resources Kubernetes limits or requests containers[ ].env List of environment variables to pass to the container containers[ ].imagePullPolicy The container image pull policy containers[ ].workingDir The working directory for current container containers[ ].livenessProbe Probe for checking container liveness containers[ ].readinessProbe Probe for checking container readiness supportedModelFormats List of model types supported by the current runtime supportedModelFormats[ ].name Name of the model format supportedModelFormats[ ].version Version of the model format. Used in validating that a predictor is supported by a runtime. 
It is recommended to include only the major version here, for example \"1\" rather than \"1.15.4\" supportedModelFormats[ ].autoselect Set to true to allow the ServingRuntime to be used for automatic model placement if this model format is specified with no explicit runtime. The default value is false. supportedModelFormats[ ].priority Priority of this serving runtime for auto selection. This is used to select the serving runtime if more than one serving runtime supports the same model format. The value should be greater than zero. The higher the value, the higher the priority. Priority is not considered if AutoSelect is either false or not specified. Priority can be overridden by specifying the runtime in the InferenceService. storageHelper.disabled Disables the storage helper nodeSelector Influence Kubernetes scheduling to assign pods to nodes affinity Influence Kubernetes scheduling to assign pods to nodes tolerations Allow pods to be scheduled onto nodes with matching taints ModelMesh leverages additional fields not listed here. More information here . Note: ServingRuntimes support the use of template variables of the form {{.Variable}} inside the container spec. These should map to fields inside an InferenceService's metadata object . The primary use of this is for passing in InferenceService-specific information, such as a name, to the runtime environment. 
Several of the out-of-the-box ClusterServingRuntimes make use of this by having --model_name={{.Name}} inside the runtime container args to ensure that when a user deploys an InferenceService, the name is passed to the server.","title":"Spec Attributes"},{"location":"modelserving/servingruntimes/#using-servingruntimes","text":"ServingRuntimes can be used both explicitly and implicitly.","title":"Using ServingRuntimes"},{"location":"modelserving/servingruntimes/#explicit-specify-a-runtime","text":"When users define predictors in their InferenceServices, they can explicitly specify the name of a ClusterServingRuntime or ServingRuntime . For example: apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib runtime : kserve-mlserver Here, the runtime specified is kserve-mlserver , so the KServe controller will first search the namespace for a ServingRuntime with that name. If none exists, the controller will then search the list of ClusterServingRuntimes. If one is found, the controller will verify that the modelFormat provided in the predictor is in the list of supportedModelFormats . If it is, then the container and pod information provided by the runtime will be used for model deployment.","title":"Explicit: Specify a runtime"},{"location":"modelserving/servingruntimes/#implicit-automatic-selection","text":"In each entry of the supportedModelFormats list, autoSelect: true can optionally be specified to indicate that the given ServingRuntime can be considered for automatic selection for predictors with the corresponding model format if no runtime is explicitly specified. 
For example, the kserve-sklearnserver ClusterServingRuntime supports SKLearn version 1 and has autoSelect enabled: apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : kserve-sklearnserver spec : supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true ... When the following InferenceService is deployed with no runtime specified, the controller will look for a runtime that supports sklearn : apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib Since kserve-sklearnserver has an entry in its supportedModelFormats list with sklearn and autoSelect: true , this ClusterServingRuntime will be used for model deployment. If a version is also specified: ... spec : predictor : model : modelFormat : name : sklearn version : \"0\" ... Then the version of the supportedModelFormat must also match. In this example, kserve-sklearnserver would not be eligible for selection since it only lists support for sklearn version 1 .","title":"Implicit: Automatic selection"},{"location":"modelserving/servingruntimes/#priority","text":"If more than one serving runtime supports the same model format with the same version and the same protocolVersion, a priority can optionally be specified for each serving runtime. When no runtime is explicitly specified, the runtime is selected automatically based on this priority. Note that priority is valid only if autoSelect is true . A higher value means a higher priority. For example, consider the serving runtimes mlserver and kserve-sklearnserver . Both serving runtimes support the sklearn model format with version 1 and both support the protocolVersion v2. Also note that autoSelect is enabled in both serving runtimes. 
apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : kserve-sklearnserver spec : protocolVersions : - v1 - v2 supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true priority : 1 ... apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : mlserver spec : protocolVersions : - v2 supportedModelFormats : - name : sklearn version : \"1\" autoSelect : true priority : 2 ... When the following InferenceService is deployed with no runtime specified, the controller will look for a runtime that supports sklearn : apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : protocolVersion : v2 modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib The controller will find the two runtimes kserve-sklearnserver and mlserver , as both have an entry in their supportedModelFormats list with sklearn and autoSelect: true . Because more than one supported runtime is available, the controller sorts the runtimes by priority. Since mlserver has the higher priority value, this ClusterServingRuntime will be used for model deployment. Constraints of priority A higher priority value means higher precedence. The value must be greater than 0. The priority is valid only if autoSelect is enabled; otherwise the priority is not considered. A serving runtime with a priority takes precedence over a serving runtime with no priority specified. Two model formats with the same name and the same model version cannot have the same priority. If more than one serving runtime supports the model format and none of them specifies a priority, there is no guarantee which runtime will be selected. If multiple versions of a modelFormat are supported by a serving runtime, then all of its entries should have the same priority. For example, the serving runtime shown below supports two versions of sklearn, and both entries have the same priority. 
apiVersion : serving.kserve.io/v1alpha1 kind : ClusterServingRuntime metadata : name : mlserver spec : protocolVersions : - v2 supportedModelFormats : - name : sklearn version : \"0\" autoSelect : true priority : 2 - name : sklearn version : \"1\" autoSelect : true priority : 2 ... Warning If multiple runtimes list the same format and/or version as auto-selectable and the priority is not specified, the runtime is selected based on the creationTimestamp , i.e. the most recently created runtime is selected. Because the outcome then depends on creation order rather than on an explicit setting, users and cluster administrators should enable autoSelect with care.","title":"Priority"},{"location":"modelserving/servingruntimes/#previous-schema","text":"Currently, if a user deploys a predictor with the old schema, where a framework/format is specified as a key, a KServe webhook will automatically map it to one of the out-of-the-box ClusterServingRuntimes . This is for backwards compatibility. For example: Previous Schema Equivalent New Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : sklearn : storageUri : s3://bucket/sklearn/mnist.joblib apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-sklearn-isvc spec : predictor : model : modelFormat : name : sklearn storageUri : s3://bucket/sklearn/mnist.joblib runtime : kserve-sklearnserver The previous schema would mutate into the new schema where the kserve-sklearnserver ClusterServingRuntime is explicitly specified. Warning The old schema will eventually be removed in favor of the new Model spec, where a user can specify a model format and optionally a corresponding version. In previous versions of KServe, supported predictor formats and container images were defined in a ConfigMap in the control plane namespace. 
Existing InferenceServices upgraded from v0.7, v0.8, v0.9 need to be converted to the new model spec as the predictor configurations are phased out in v0.10.","title":"Previous schema"},{"location":"modelserving/autoscaling/autoscaling/","text":"Autoscale InferenceService with inference workload \u00b6 InferenceService with target concurrency \u00b6 Create InferenceService \u00b6 Apply the tensorflow example CR with scaling target set to 1. Annotation autoscaling.knative.dev/target is the soft limit rather than a strictly enforced limit, if there is sudden burst of the requests, this value can be exceeded. The scaleTarget and scaleMetric are introduced in version 0.9 of kserve and should be available in both new and old schema. This is the preferred way of defining autoscaling options. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale.yaml to create the Autoscale InferenceService. kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created Predict InferenceService with concurrent requests \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT Send traffic in 30 seconds spurts maintaining 5 in-flight requests. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0193 secs Slowest: 10 .1458 secs Fastest: 0 .0127 secs Average: 0 .0364 secs Requests/sec: 137 .4449 Total data: 1019122 bytes Size/request: 247 bytes Response time histogram: 0 .013 [ 1 ] | 1 .026 [ 4120 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .039 [ 0 ] | 3 .053 [ 0 ] | 4 .066 [ 0 ] | 5 .079 [ 0 ] | 6 .093 [ 0 ] | 7 .106 [ 0 ] | 8 .119 [ 0 ] | 9 .133 [ 0 ] | 10 .146 [ 5 ] | Latency distribution: 10 % in 0 .0178 secs 25 % in 0 .0188 secs 50 % in 0 .0199 secs 75 % in 0 .0210 secs 90 % in 0 .0231 secs 95 % in 0 .0328 secs 99 % in 0 .1501 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0002 secs, 0 .0127 secs, 10 .1458 secs DNS-lookup: 0 .0002 secs, 0 .0000 secs, 0 .1502 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0020 secs resp wait: 0 .0360 secs, 0 .0125 secs, 9 .9791 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4126 responses Check the number of running pods now, Kserve uses Knative Serving autoscaler which is based on the average number of in-flight requests per pod(concurrency). As the scaling target is set to 1 and we load the service with 5 concurrent requests, so the autoscaler tries scaling up to 5 pods. Notice that out of all the requests there are 5 requests on the histogram that take around 10s, that's the cold start time cost to initially spawn the pods and download model to be ready to serve. 
The cold start may take longer(to pull the serving image) if the image is not cached on the node that the pod is scheduled on. $ kubectl get pods NAME READY STATUS RESTARTS AGE flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd 3 /3 Running 0 42s flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9 3 /3 Running 0 64s flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d 3 /3 Running 0 42s Check Dashboard \u00b6 View the Knative Serving Scaling dashboards (if configured). kubectl kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000 InferenceService with target QPS \u00b6 Create the InferenceService \u00b6 Apply the same tensorflow example CR kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created Predict InferenceService with target QPS \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT Send 30 seconds of traffic maintaining 50 qps. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -q 50 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0264 secs Slowest: 10 .8113 secs Fastest: 0 .0145 secs Average: 0 .0731 secs Requests/sec: 683 .5644 Total data: 5069675 bytes Size/request: 247 bytes Response time histogram: 0 .014 [ 1 ] | 1 .094 [ 20474 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .174 [ 0 ] | 3 .254 [ 0 ] | 4 .333 [ 0 ] | 5 .413 [ 0 ] | 6 .493 [ 0 ] | 7 .572 [ 0 ] | 8 .652 [ 0 ] | 9 .732 [ 0 ] | 10 .811 [ 50 ] | Latency distribution: 10 % in 0 .0284 secs 25 % in 0 .0334 secs 50 % in 0 .0408 secs 75 % in 0 .0527 secs 90 % in 0 .0765 secs 95 % in 0 .0949 secs 99 % in 0 .1334 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0001 secs, 0 .0145 secs, 10 .8113 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0196 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs resp wait: 0 .0728 secs, 0 .0144 secs, 10 .7688 secs resp read: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs Status code distribution: [ 200 ] 20525 responses Check the number of running pods now, we are loading the service with 50 requests per second, and from the dashboard you can see that it hits the average concurrency 10 and autoscaler tries scaling up to 10 pods. Check Dashboard \u00b6 View the Knative Serving Scaling dashboards (if configured). 
kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000 The autoscaler calculates average concurrency over a 60-second stable window, so it takes about a minute to stabilize at the desired concurrency level. It also evaluates a 6-second panic window and enters panic mode if concurrency in that window reaches 2x the target concurrency. From the dashboard you can see that it enters panic mode, in which the autoscaler operates on the shorter, more sensitive window. Once the panic conditions are no longer met for 60 seconds, the autoscaler returns to the 60-second stable window. Autoscaling on GPU! \u00b6 Autoscaling based on GPU metrics is hard; however, thanks to Knative's concurrency-based autoscaler, scaling on GPU is easy and effective! Create the InferenceService with GPU resource \u00b6 Apply the tensorflow gpu example CR New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 Apply the autoscale-gpu.yaml . kubectl kubectl apply -f autoscale-gpu.yaml Predict InferenceService with concurrent requests \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT. Send 30 seconds of traffic maintaining 5 in-flight requests. 
MODEL_NAME = flowers-sample-gpu INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0152 secs Slowest: 9 .7581 secs Fastest: 0 .0142 secs Average: 0 .0350 secs Requests/sec: 142 .9942 Total data: 948532 bytes Size/request: 221 bytes Response time histogram: 0 .014 [ 1 ] | 0 .989 [ 4286 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 1 .963 [ 0 ] | 2 .937 [ 0 ] | 3 .912 [ 0 ] | 4 .886 [ 0 ] | 5 .861 [ 0 ] | 6 .835 [ 0 ] | 7 .809 [ 0 ] | 8 .784 [ 0 ] | 9 .758 [ 5 ] | Latency distribution: 10 % in 0 .0181 secs 25 % in 0 .0189 secs 50 % in 0 .0198 secs 75 % in 0 .0210 secs 90 % in 0 .0230 secs 95 % in 0 .0276 secs 99 % in 0 .0511 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0000 secs, 0 .0142 secs, 9 .7581 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0291 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0023 secs resp wait: 0 .0348 secs, 0 .0141 secs, 9 .7158 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4292 responses Autoscaling Customization \u00b6 Autoscaling with ContainerConcurrency \u00b6 ContainerConcurrency determines the number of simultaneous requests that can be processed by each replica of the InferenceService at any given time, it is a hard limit and if the concurrency reaches the hard limit surplus requests will be buffered and must wait until enough capacity is free to execute the requests. 
New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale-custom.yaml . kubectl kubectl apply -f autoscale-custom.yaml Enable scale down to zero \u00b6 KServe by default sets minReplicas to 1, if you want to enable scaling down to zero especially for use cases like serving on GPUs you can set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the scale-down-to-zero.yaml . kubectl kubectl apply -f scale-down-to-zero.yaml Autoscaling configuration at component level \u00b6 Autoscaling options can also be configured at the component level. This allows more flexibility in terms of the autoscaling configuration. In a typical deployment, transformers may require a different autoscaling configuration than a predictor. This feature allows the user to scale individual components as required. 
New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist Apply the autoscale-adv.yaml to create the Autoscale InferenceService. The default for scaleMetric is concurrency and possible values are concurrency , rps , cpu and memory .","title":"Inference Autoscaling"},{"location":"modelserving/autoscaling/autoscaling/#autoscale-inferenceservice-with-inference-workload","text":"","title":"Autoscale InferenceService with inference workload"},{"location":"modelserving/autoscaling/autoscaling/#inferenceservice-with-target-concurrency","text":"","title":"InferenceService with target concurrency"},{"location":"modelserving/autoscaling/autoscaling/#create-inferenceservice","text":"Apply the tensorflow example CR with scaling target set to 1. Annotation autoscaling.knative.dev/target is the soft limit rather than a strictly enforced limit, if there is sudden burst of the requests, this value can be exceeded. The scaleTarget and scaleMetric are introduced in version 0.9 of kserve and should be available in both new and old schema. This is the preferred way of defining autoscaling options. 
New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale.yaml to create the Autoscale InferenceService. kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created","title":"Create InferenceService"},{"location":"modelserving/autoscaling/autoscaling/#predict-inferenceservice-with-concurrent-requests","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT Send traffic in 30 seconds spurts maintaining 5 in-flight requests. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0193 secs Slowest: 10 .1458 secs Fastest: 0 .0127 secs Average: 0 .0364 secs Requests/sec: 137 .4449 Total data: 1019122 bytes Size/request: 247 bytes Response time histogram: 0 .013 [ 1 ] | 1 .026 [ 4120 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .039 [ 0 ] | 3 .053 [ 0 ] | 4 .066 [ 0 ] | 5 .079 [ 0 ] | 6 .093 [ 0 ] | 7 .106 [ 0 ] | 8 .119 [ 0 ] | 9 .133 [ 0 ] | 10 .146 [ 5 ] | Latency distribution: 10 % in 0 .0178 secs 25 % in 0 .0188 secs 50 % in 0 .0199 secs 75 % in 0 .0210 secs 90 % in 0 .0231 secs 95 % in 0 .0328 secs 99 % in 0 .1501 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0002 secs, 0 .0127 secs, 10 .1458 secs DNS-lookup: 0 .0002 secs, 0 .0000 secs, 0 .1502 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0020 secs resp wait: 0 .0360 secs, 0 .0125 secs, 9 .9791 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4126 responses Check the number of running pods now, Kserve uses Knative Serving autoscaler which is based on the average number of in-flight requests per pod(concurrency). As the scaling target is set to 1 and we load the service with 5 concurrent requests, so the autoscaler tries scaling up to 5 pods. Notice that out of all the requests there are 5 requests on the histogram that take around 10s, that's the cold start time cost to initially spawn the pods and download model to be ready to serve. 
The cold start may take longer(to pull the serving image) if the image is not cached on the node that the pod is scheduled on. $ kubectl get pods NAME READY STATUS RESTARTS AGE flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd 3 /3 Running 0 42s flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9 3 /3 Running 0 64s flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d 3 /3 Running 0 42s","title":"Predict InferenceService with concurrent requests"},{"location":"modelserving/autoscaling/autoscaling/#check-dashboard","text":"View the Knative Serving Scaling dashboards (if configured). kubectl kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000","title":"Check Dashboard"},{"location":"modelserving/autoscaling/autoscaling/#inferenceservice-with-target-qps","text":"","title":"InferenceService with target QPS"},{"location":"modelserving/autoscaling/autoscaling/#create-the-inferenceservice","text":"Apply the same tensorflow example CR kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created","title":"Create the InferenceService"},{"location":"modelserving/autoscaling/autoscaling/#predict-inferenceservice-with-target-qps","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT Send 30 seconds of traffic maintaining 50 qps. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -q 50 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0264 secs Slowest: 10 .8113 secs Fastest: 0 .0145 secs Average: 0 .0731 secs Requests/sec: 683 .5644 Total data: 5069675 bytes Size/request: 247 bytes Response time histogram: 0 .014 [ 1 ] | 1 .094 [ 20474 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .174 [ 0 ] | 3 .254 [ 0 ] | 4 .333 [ 0 ] | 5 .413 [ 0 ] | 6 .493 [ 0 ] | 7 .572 [ 0 ] | 8 .652 [ 0 ] | 9 .732 [ 0 ] | 10 .811 [ 50 ] | Latency distribution: 10 % in 0 .0284 secs 25 % in 0 .0334 secs 50 % in 0 .0408 secs 75 % in 0 .0527 secs 90 % in 0 .0765 secs 95 % in 0 .0949 secs 99 % in 0 .1334 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0001 secs, 0 .0145 secs, 10 .8113 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0196 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs resp wait: 0 .0728 secs, 0 .0144 secs, 10 .7688 secs resp read: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs Status code distribution: [ 200 ] 20525 responses Check the number of running pods now, we are loading the service with 50 requests per second, and from the dashboard you can see that it hits the average concurrency 10 and autoscaler tries scaling up to 10 pods.","title":"Predict InferenceService with target QPS"},{"location":"modelserving/autoscaling/autoscaling/#check-dashboard_1","text":"View the Knative Serving Scaling dashboards (if configured). 
kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000 The autoscaler calculates average concurrency over a 60-second stable window, so it takes about a minute to stabilize at the desired concurrency level. It also evaluates a 6-second panic window and enters panic mode if concurrency in that window reaches 2x the target concurrency. From the dashboard you can see that it enters panic mode, in which the autoscaler operates on the shorter, more sensitive window. Once the panic conditions are no longer met for 60 seconds, the autoscaler returns to the 60-second stable window.","title":"Check Dashboard"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-on-gpu","text":"Autoscaling based on GPU metrics is hard; however, thanks to Knative's concurrency-based autoscaler, scaling on GPU is easy and effective!","title":"Autoscaling on GPU!"},{"location":"modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource","text":"Apply the tensorflow gpu example CR New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 Apply the autoscale-gpu.yaml . 
kubectl kubectl apply -f autoscale-gpu.yaml","title":"Create the InferenceService with GPU resource"},{"location":"modelserving/autoscaling/autoscaling/#predict-inferenceservice-with-concurrent-requests_1","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT Send 30 seconds of traffic maintaining 5 in-flight requests. MODEL_NAME = flowers-sample-gpu INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0152 secs Slowest: 9 .7581 secs Fastest: 0 .0142 secs Average: 0 .0350 secs Requests/sec: 142 .9942 Total data: 948532 bytes Size/request: 221 bytes Response time histogram: 0 .014 [ 1 ] | 0 .989 [ 4286 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 1 .963 [ 0 ] | 2 .937 [ 0 ] | 3 .912 [ 0 ] | 4 .886 [ 0 ] | 5 .861 [ 0 ] | 6 .835 [ 0 ] | 7 .809 [ 0 ] | 8 .784 [ 0 ] | 9 .758 [ 5 ] | Latency distribution: 10 % in 0 .0181 secs 25 % in 0 .0189 secs 50 % in 0 .0198 secs 75 % in 0 .0210 secs 90 % in 0 .0230 secs 95 % in 0 .0276 secs 99 % in 0 .0511 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0000 secs, 0 .0142 secs, 9 .7581 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0291 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0023 secs resp wait: 0 .0348 secs, 0 .0141 secs, 9 .7158 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4292 responses","title":"Predict InferenceService with concurrent requests"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-customization","text":"","title":"Autoscaling 
Customization"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-with-containerconcurrency","text":"ContainerConcurrency determines the number of simultaneous requests that can be processed by each replica of the InferenceService at any given time, it is a hard limit and if the concurrency reaches the hard limit surplus requests will be buffered and must wait until enough capacity is free to execute the requests. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale-custom.yaml . kubectl kubectl apply -f autoscale-custom.yaml","title":"Autoscaling with ContainerConcurrency"},{"location":"modelserving/autoscaling/autoscaling/#enable-scale-down-to-zero","text":"KServe by default sets minReplicas to 1, if you want to enable scaling down to zero especially for use cases like serving on GPUs you can set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the scale-down-to-zero.yaml . 
kubectl kubectl apply -f scale-down-to-zero.yaml","title":"Enable scale down to zero"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-configuration-at-component-level","text":"Autoscaling options can also be configured at the component level. This allows more flexibility in terms of the autoscaling configuration. In a typical deployment, transformers may require a different autoscaling configuration than a predictor. This feature allows the user to scale individual components as required. New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist Apply the autoscale-adv.yaml to create the Autoscale InferenceService. The default for scaleMetric is concurrency and possible values are concurrency , rps , cpu and memory .","title":"Autoscaling configuration at component level"},{"location":"modelserving/batcher/batcher/","text":"Inference Batcher \u00b6 This docs explains on how batch prediction for any ML frameworks (TensorFlow, PyTorch, ...) without decreasing the performance. 
This batcher is implemented in the KServe model agent sidecar, so the requests first hit the agent sidecar, when a batch prediction is triggered the request is then sent to the model server container for inference. We use webhook to inject the model agent container in the InferenceService pod to do the batching when batcher is enabled. We use go channels to transfer data between http request handler and batcher go routines. Currently we only implemented batching with KServe v1 HTTP protocol, gRPC is not supported yet. When the number of instances (For example, the number of pictures) reaches the maxBatchSize or the latency meets the maxLatency , a batch prediction will be triggered. Example \u00b6 We first create a pytorch predictor with a batcher. The maxLatency is set to a big value (500 milliseconds) to make us be able to observe the batching process. New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 maxBatchSize : the max batch size for triggering a prediction. maxLatency : the max latency for triggering a prediction (In milliseconds). timeout : timeout of calling predictor service (In seconds). All of the bellowing fields have default values in the code. You can config them or not as you wish. maxBatchSize : 32. maxLatency : 500. timeout : 60. kubectl kubectl create -f pytorch-batcher.yaml We can now send requests to the pytorch model using hey. 
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = mnist INPUT_PATH = @./input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice torchserve -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 10s -c 5 -m POST -host \" ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" -D ./input.json \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict\" The request will go to the model agent container first, the batcher in sidecar container batches the requests and send the inference request to the predictor container. Note If the interval of sending the two requests is less than maxLatency , the returned batchId will be the same. Expected Output Summary: Total: 10 .5361 secs Slowest: 0 .5759 secs Fastest: 0 .4983 secs Average: 0 .5265 secs Requests/sec: 9 .4912 Total data: 24100 bytes Size/request: 241 bytes Response time histogram: 0 .498 [ 1 ] | \u25a0 0 .506 [ 0 ] | 0 .514 [ 44 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .522 [ 21 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .529 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .537 [ 5 ] | \u25a0\u25a0\u25a0\u25a0\u25a0 0 .545 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .553 [ 0 ] | 0 .560 [ 7 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .568 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .576 [ 10 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 Latency distribution: 10 % in 0 .5100 secs 25 % in 0 .5118 secs 50 % in 0 .5149 secs 75 % in 0 .5406 secs 90 % in 0 .5706 secs 95 % in 0 .5733 secs 99 % in 0 .5759 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0004 secs, 0 .4983 secs, 0 .5759 secs DNS-lookup: 0 .0001 secs, 0 .0000 secs, 0 .0015 secs req write: 0 
.0002 secs, 0 .0000 secs, 0 .0076 secs resp wait: 0 .5257 secs, 0 .4981 secs, 0 .5749 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0009 secs Status code distribution: [ 200 ] 100 responses","title":"Inference Batcher"},{"location":"modelserving/batcher/batcher/#inference-batcher","text":"This doc explains how to use batch prediction for any ML framework (TensorFlow, PyTorch, ...) without decreasing performance. The batcher is implemented in the KServe model agent sidecar, so requests first hit the agent sidecar; when a batch prediction is triggered, the request is then sent to the model server container for inference. A webhook injects the model agent container into the InferenceService pod to do the batching when the batcher is enabled. Go channels transfer data between the HTTP request handler and the batcher goroutines. Currently batching is only implemented for the KServe v1 HTTP protocol; gRPC is not supported yet. When the number of instances (for example, the number of pictures) reaches maxBatchSize or the latency reaches maxLatency , a batch prediction is triggered.","title":"Inference Batcher"},{"location":"modelserving/batcher/batcher/#example","text":"We first create a PyTorch predictor with a batcher. The maxLatency is set to a large value (500 milliseconds) so that we can observe the batching process. 
New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 maxBatchSize : the max batch size for triggering a prediction. maxLatency : the max latency for triggering a prediction (in milliseconds). timeout : timeout of calling the predictor service (in seconds). All of the following fields have default values in the code, so setting them is optional. maxBatchSize : 32. maxLatency : 500. timeout : 60. kubectl kubectl create -f pytorch-batcher.yaml We can now send requests to the PyTorch model using hey. The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = mnist INPUT_PATH = @./input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice torchserve -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 10s -c 5 -m POST -host \" ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" -D ./input.json \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict\" The request goes to the model agent container first; the batcher in the sidecar container batches the requests and sends the inference request to the predictor container. Note If the interval between two requests is less than maxLatency , the returned batchId will be the same. 
Expected Output Summary: Total: 10 .5361 secs Slowest: 0 .5759 secs Fastest: 0 .4983 secs Average: 0 .5265 secs Requests/sec: 9 .4912 Total data: 24100 bytes Size/request: 241 bytes Response time histogram: 0 .498 [ 1 ] | \u25a0 0 .506 [ 0 ] | 0 .514 [ 44 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .522 [ 21 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .529 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .537 [ 5 ] | \u25a0\u25a0\u25a0\u25a0\u25a0 0 .545 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .553 [ 0 ] | 0 .560 [ 7 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .568 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .576 [ 10 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 Latency distribution: 10 % in 0 .5100 secs 25 % in 0 .5118 secs 50 % in 0 .5149 secs 75 % in 0 .5406 secs 90 % in 0 .5706 secs 95 % in 0 .5733 secs 99 % in 0 .5759 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0004 secs, 0 .4983 secs, 0 .5759 secs DNS-lookup: 0 .0001 secs, 0 .0000 secs, 0 .0015 secs req write: 0 .0002 secs, 0 .0000 secs, 0 .0076 secs resp wait: 0 .5257 secs, 0 .4981 secs, 0 .5749 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0009 secs Status code distribution: [ 200 ] 100 responses","title":"Example"},{"location":"modelserving/certificate/kserve/","text":"KServe with Self Signed Certificate Model Registry \u00b6 If you are using a model registry with a self-signed certificate, you must either skip ssl verify or apply the appropriate CA bundle to the storage-initializer to create a connection with the registry. 
This document explains three methods that can be used in KServe, described below: Configure CA bundle for storage-initializer Global configuration Namespace scope configuration(Using storage-config Secret) json annotation Skip SSL Verification (NOTE) This is only available for RawDeployment and ServerlessDeployment . For ModelMesh, you should add the CA bundle content to the certificate parameter in storage-config Configure CA bundle for storage-initializer \u00b6 Global Configuration \u00b6 KServe uses the inferenceservice-config ConfigMap for default configuration. If you want to add a cabundle cert for every inference service, you can set caBundleConfigMapName in the ConfigMap. Before updating the ConfigMap, you have to create a ConfigMap for the CA bundle certificate in the namespace where the KServe controller is running, and the data key in the ConfigMap must be cabundle.crt . Create CA ConfigMap with the CA bundle cert kubectl create configmap cabundle --from-file=/path/to/cabundle.crt kubectl get configmap cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: cabundle namespace: kserve Update inferenceservice-config ConfigMap storageInitializer: |- { ... \"caBundleConfigMapName\": \"cabundle\", ... } After you update this configuration, please restart the KServe controller pod to pick up the change. When you create an inference service, the CA bundle will be copied to your user namespace and attached to the storage-initializer container. Using storage-config Secret \u00b6 If you want to apply the cabundle only to a specific inferenceservice, you can use a specific annotation or variable( cabundle_configmap ) on the storage-config Secret used by the inferenceservice. In this case, you have to create the cabundle ConfigMap in the user namespace before you create the inferenceservice. 
Create a ConfigMap with the cabundle cert kubectl create configmap local-cabundle --from-file=/path/to/cabundle.crt kubectl get configmap local-cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: local-cabundle namespace: kserve-demo Add an annotation serving.kserve.io/s3-cabundle-configmap to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-cabundle-configmap: local-cabundle ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable cabundle_configmap to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", .... \"cabundle_configmap\": \"local-cabundle\" } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque Skip SSL Verification \u00b6 For testing purposes or when there is no cabundle, you can easily create an SSL connection by disabling SSL verification. This can also be done by adding an annotation or setting a variable in the storage-config Secret. Add an annotation( serving.kserve.io/s3-verifyssl ) to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-verifyssl: \"0\" # 1 is true, 0 is false ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable ( verify_ssl ) to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", ... 
\"verify_ssl\": \"0\" # 1 is true, 0 is false (You can set True/true/False/false too) } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque Full Demo Scripts","title":"CA Certificate"},{"location":"modelserving/certificate/kserve/#kserve-with-self-signed-certificate-model-registry","text":"If you are using a model registry with a self-signed certificate, you must either skip ssl verify or apply the appropriate CA bundle to the storage-initializer to create a connection with the registry. This document explains three methods that can be used in KServe, described below: Configure CA bundle for storage-initializer Global configuration Namespace scope configuration(Using storage-config Secret) json annotation Skip SSL Verification (NOTE) This is only available for RawDeployment and ServerlessDeployment . For modelmesh, you should add ca bundle content into certificate parameter in storage-config","title":"KServe with Self Signed Certificate Model Registry"},{"location":"modelserving/certificate/kserve/#configure-ca-bundle-for-storage-initializer","text":"","title":"Configure CA bundle for storage-initializer"},{"location":"modelserving/certificate/kserve/#global-configuration","text":"KServe use inferenceservice-config ConfigMap for default configuration. If you want to add cabundle cert for every inference service, you can set caBundleConfigMapName in the ConfigMap. Before updating the ConfigMap, you have to create a ConfigMap for CA bundle certificate in the namespace that KServe controller is running and the data key in the ConfigMap must be cabundle.crt . Create CA ConfigMap with the CA bundle cert kubectl create configmap cabundle --from-file=/path/to/cabundle.crt kubectl get configmap cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: cabundle namespace: kserve Update inferenceservice-config ConfigMap storageInitializer: |- { ... \"caBundleConfigMapName\": \"cabundle\", ... 
} After you update this configuration, please restart the KServe controller pod to pick up the change. When you create an inference service, the CA bundle will be copied to your user namespace and attached to the storage-initializer container.","title":"Global Configuration"},{"location":"modelserving/certificate/kserve/#using-storage-config-secret","text":"If you want to apply the cabundle only to a specific inferenceservice, you can use a specific annotation or variable( cabundle_configmap ) on the storage-config Secret used by the inferenceservice. In this case, you have to create the cabundle ConfigMap in the user namespace before you create the inferenceservice. Create a ConfigMap with the cabundle cert kubectl create configmap local-cabundle --from-file=/path/to/cabundle.crt kubectl get configmap local-cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: local-cabundle namespace: kserve-demo Add an annotation serving.kserve.io/s3-cabundle-configmap to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-cabundle-configmap: local-cabundle ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable cabundle_configmap to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", .... \"cabundle_configmap\": \"local-cabundle\" } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque","title":"Using storage-config Secret"},{"location":"modelserving/certificate/kserve/#skip-ssl-verification","text":"For testing purposes or when there is no cabundle, you can easily create an SSL connection by disabling SSL verification. This can also be done by adding an annotation or setting a variable in the storage-config Secret. 
Add an annotation( serving.kserve.io/s3-verifyssl ) to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-verifyssl: \"0\" # 1 is true, 0 is false ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable ( verify_ssl ) to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", ... \"verify_ssl\": \"0\" # 1 is true, 0 is false (You can set True/true/False/false too) } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque Full Demo Scripts","title":"Skip SSL Verification"},{"location":"modelserving/data_plane/data_plane/","text":"Data Plane \u00b6 The InferenceService Data Plane architecture consists of a static graph of components which coordinate requests for a single model. Advanced features such as Ensembling, A/B testing, and Multi-Arm-Bandits should compose InferenceServices together. Introduction \u00b6 KServe's data plane protocol introduces an inference API that is independent of any specific ML/DL framework and model server. This allows for quick iterations and consistency across Inference Services and supports both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by operating seamlessly on platforms that have standardized around this API. Kserve's inference protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and TorchServe. Note: Protocol V2 uses /infer instead of :predict Concepts \u00b6 Component : Each endpoint is composed of multiple components: \"predictor\", \"explainer\", and \"transformer\". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection. 
Predictor : The predictor is the workhorse of the InferenceService. It is simply a model and a model server that makes it available at a network endpoint. Explainer : The explainer enables an optional alternate data plane that provides model explanations in addition to predictions. Users may define their own explanation container, which configures with relevant environment variables like prediction endpoint. For common use cases, KServe provides out-of-the-box explainers like Alibi. Transformer : The transformer enables users to define a pre and post processing step before the prediction and explanation workflows. Like the explainer, it is configured with relevant environment variables too. For common use cases, KServe provides out-of-the-box transformers like Feast. Data Plane V1 & V2 \u00b6 KServe supports two versions of its data plane, V1 and V2. V1 protocol offers a standard prediction workflow with HTTP/REST. The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. Protocol V2 expands the capabilities of V1 by adding gRPC APIs. 
Main changes \u00b6 V2 does not currently support the explain endpoint V2 added Server Readiness/Liveness/Metadata endpoints V2 endpoint paths contain / instead of : V2 renamed :predict endpoint to /infer V2 allows for model versions in the request path (optional) V1 APIs \u00b6 API Verb Path List Models GET /v1/models Model Ready GET /v1/models/ Predict POST /v1/models/:predict Explain POST /v1/models/:explain V2 APIs \u00b6 API Verb Path Inference POST v2/models/[/versions/]/infer Model Metadata GET v2/models/[/versions/] Server Readiness GET v2/health/ready Server Liveness GET v2/health/live Server Metadata GET v2 Model Readiness GET v2/models/[/versions/ ]/ready ** path contents in [] are optional Please see V1 Protocol and V2 Protocol documentation for more information.","title":"Model Serving Data Plane"},{"location":"modelserving/data_plane/data_plane/#data-plane","text":"The InferenceService Data Plane architecture consists of a static graph of components which coordinate requests for a single model. Advanced features such as Ensembling, A/B testing, and Multi-Arm-Bandits should compose InferenceServices together.","title":"Data Plane"},{"location":"modelserving/data_plane/data_plane/#introduction","text":"KServe's data plane protocol introduces an inference API that is independent of any specific ML/DL framework and model server. This allows for quick iterations and consistency across Inference Services and supports both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by operating seamlessly on platforms that have standardized around this API. Kserve's inference protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and TorchServe. 
Note: Protocol V2 uses /infer instead of :predict","title":"Introduction"},{"location":"modelserving/data_plane/data_plane/#concepts","text":"Component : Each endpoint is composed of multiple components: \"predictor\", \"explainer\", and \"transformer\". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection. Predictor : The predictor is the workhorse of the InferenceService. It is simply a model and a model server that makes it available at a network endpoint. Explainer : The explainer enables an optional alternate data plane that provides model explanations in addition to predictions. Users may define their own explanation container, which configures with relevant environment variables like prediction endpoint. For common use cases, KServe provides out-of-the-box explainers like Alibi. Transformer : The transformer enables users to define a pre and post processing step before the prediction and explanation workflows. Like the explainer, it is configured with relevant environment variables too. For common use cases, KServe provides out-of-the-box transformers like Feast.","title":"Concepts"},{"location":"modelserving/data_plane/data_plane/#data-plane-v1-v2","text":"KServe supports two versions of its data plane, V1 and V2. V1 protocol offers a standard prediction workflow with HTTP/REST. The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. 
Protocol V2 expands the capabilities of V1 by adding gRPC APIs.","title":"Data Plane V1 & V2"},{"location":"modelserving/data_plane/data_plane/#main-changes","text":"V2 does not currently support the explain endpoint V2 added Server Readiness/Liveness/Metadata endpoints V2 endpoint paths contain / instead of : V2 renamed :predict endpoint to /infer V2 allows for model versions in the request path (optional)","title":"Main changes"},{"location":"modelserving/data_plane/data_plane/#v1-apis","text":"API Verb Path List Models GET /v1/models Model Ready GET /v1/models/ Predict POST /v1/models/:predict Explain POST /v1/models/:explain","title":"V1 APIs"},{"location":"modelserving/data_plane/data_plane/#v2-apis","text":"API Verb Path Inference POST v2/models/[/versions/]/infer Model Metadata GET v2/models/[/versions/] Server Readiness GET v2/health/ready Server Liveness GET v2/health/live Server Metadata GET v2 Model Readiness GET v2/models/[/versions/ ]/ready ** path contents in [] are optional Please see V1 Protocol and V2 Protocol documentation for more information.","title":"V2 APIs"},{"location":"modelserving/data_plane/v1_protocol/","text":"Data Plane (V1) \u00b6 KServe's V1 protocol offers a standardized prediction workflow across all model frameworks. This protocol version is still supported, but it is recommended that users migrate to the V2 protocol for better performance and standardization among serving runtimes. However, if a use case requires a more flexible schema than protocol v2 provides, v1 protocol is still an option. API Verb Path Request Payload Response Payload List Models GET /v1/models {\"models\": []} Model Ready GET /v1/models/ {\"name\": ,\"ready\": $bool} Predict POST /v1/models/:predict {\"instances\": []} ** {\"predictions\": []} Explain POST /v1/models/:explain {\"instances\": []} ** {\"predictions\": [], \"explanations\": []} ** = payload is optional Note: The response payload in V1 protocol is not strictly enforced. 
A custom server can define and return its own response payload. We encourage using the KServe defined response payload for consistency. API Definitions \u00b6 API Definition Predict The \"predict\" API performs inference on a model. The response is the prediction result. All InferenceServices speak the Tensorflow V1 HTTP API . Explain The \"explain\" API is an optional component that provides model explanations in addition to predictions. The standardized explainer interface is identical to the Tensorflow V1 HTTP API with the addition of an \":explain\" verb. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. If the model(s) is downloaded and ready to serve requests, the model ready endpoint returns the list of accessible (s). List Models The \"models\" API exposes a list of models in the model registry.","title":"V1 Inference Protocol"},{"location":"modelserving/data_plane/v1_protocol/#data-plane-v1","text":"KServe's V1 protocol offers a standardized prediction workflow across all model frameworks. This protocol version is still supported, but it is recommended that users migrate to the V2 protocol for better performance and standardization among serving runtimes. However, if a use case requires a more flexible schema than protocol v2 provides, v1 protocol is still an option. API Verb Path Request Payload Response Payload List Models GET /v1/models {\"models\": []} Model Ready GET /v1/models/ {\"name\": ,\"ready\": $bool} Predict POST /v1/models/:predict {\"instances\": []} ** {\"predictions\": []} Explain POST /v1/models/:explain {\"instances\": []} ** {\"predictions\": [], \"explanations\": []} ** = payload is optional Note: The response payload in V1 protocol is not strictly enforced. A custom server can define and return its own response payload. 
We encourage using the KServe defined response payload for consistency.","title":"Data Plane (V1)"},{"location":"modelserving/data_plane/v1_protocol/#api-definitions","text":"API Definition Predict The \"predict\" API performs inference on a model. The response is the prediction result. All InferenceServices speak the Tensorflow V1 HTTP API . Explain The \"explain\" API is an optional component that provides model explanations in addition to predictions. The standardized explainer interface is identical to the Tensorflow V1 HTTP API with the addition of an \":explain\" verb. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. If the model(s) is downloaded and ready to serve requests, the model ready endpoint returns the list of accessible (s). List Models The \"models\" API exposes a list of models in the model registry.","title":"API Definitions"},{"location":"modelserving/data_plane/v2_protocol/","text":"Open Inference Protocol (V2 Inference Protocol) \u00b6 For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs . Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the HTTP/REST API and/or the GRPC API . Check the model serving runtime table / the protocolVersion field in the runtime YAML to ensure V2 protocol is supported for model serving runtime that you are using. Note: For all API descriptions on this page, all strings in all contexts are case-sensitive. The V2 protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately. Note on changes between V1 & V2 \u00b6 V2 protocol does not currently support the explain endpoint like V1 protocol does. If this is a feature you wish to have in the V2 protocol, please submit a github issue . 
HTTP/REST \u00b6 The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field. See also: The HTTP/REST endpoints are defined in rest_predict_v2.yaml API Verb Path Request Payload Response Payload Inference POST v2/models/ [/versions/]/infer $inference_request $inference_response Model Metadata GET v2/models/[/versions/] $metadata_model_response Server Ready GET v2/health/ready $ready_server_response Server Live GET v2/health/live $live_server_response Server Metadata GET v2 $metadata_server_response Model Ready GET v2/models/[/versions/ ]/ready $ready_model_response ** path contents in [] are optional For more information regarding payload contents, see Payload Contents . The versions portion of the Path URLs (in [] ) is shown as optional to allow implementations that don\u2019t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies). For example, if a model does not implement a version, the Model Metadata request path could look like v2/models/my_model . If the model has been configured to implement a version, the request path could look something like v2/models/my_model/versions/v10 , where the version of the model is v10. API Definitions \u00b6 API Definition Inference The /infer endpoint performs inference on a model. The response is the prediction result. Model Metadata The \"model metadata\" API is a per-model endpoint that returns details about the model passed in the path. Server Ready The \u201cserver ready\u201d health API indicates if all the models are ready for inferencing.
The \u201cserver ready\u201d health API can be used directly to implement the Kubernetes readinessProbe. Server Live The \u201cserver live\u201d health API indicates if the inference server is able to receive and respond to metadata and inference requests. The \u201cserver live\u201d API can be used directly to implement the Kubernetes livenessProbe. Server Metadata The \"server metadata\" API returns details describing the server. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. Health/Readiness/Liveness Probes \u00b6 The Model Readiness probe answers the question \"Did the model download and is it able to serve requests?\" and responds with the available model name(s). The Server Readiness/Liveness probes answer the question \"Is my service and its infrastructure running, healthy, and able to receive and process requests?\" To read more about liveness and readiness probe concepts, visit the Configure Liveness, Readiness and Startup Probes Kubernetes documentation. Payload Contents \u00b6 Model Ready \u00b6 The model ready endpoint returns the readiness probe response for the server along with the name of the model. Model Ready Response JSON Object \u00b6 $ready_model_response = { \"name\" : $string, \"ready\": $bool } Server Ready \u00b6 The server ready endpoint returns the readiness probe response for the server. Server Ready Response JSON Object \u00b6 $ready_server_response = { \"live\" : $bool, } Server Live \u00b6 The server live endpoint returns the liveness probe response for the server. Server Live Response JSON Objet \u00b6 $live_server_response = { \"live\" : $bool, } Server Metadata \u00b6 The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint.
In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object . Server Metadata Response JSON Object \u00b6 A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response , is returned in the HTTP body. $metadata_server_response = { \"name\" : $string, \"version\" : $string, \"extensions\" : [ $string, ... ] } \u201cname\u201d : A descriptive name for the server. \"version\" : The server version. \u201cextensions\u201d : The extensions supported by the server. Currently, no standard extensions are defined. Individual inference servers may define and document their own extensions. Server Metadata Response JSON Error Object \u00b6 A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object. $metadata_server_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error. Model Metadata \u00b6 The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object . The model name and (optionally) version must be available in the URL.
If a version is not provided the server may choose a version based on its own policies or return an error. Model Metadata Response JSON Object \u00b6 A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response , is returned in the HTTP body for every successful model metadata request. $metadata_model_response = { \"name\" : $string, \"versions\" : [ $string, ... ] #optional, \"platform\" : $string, \"inputs\" : [ $metadata_tensor, ... ], \"outputs\" : [ $metadata_tensor, ... ] } \u201cname\u201d : The name of the model. \"versions\" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don\u2019t support versions. Optional for models that don\u2019t allow a version to be explicitly requested. \u201cplatform\u201d : The framework/backend for the model. See Platforms . \u201cinputs\u201d : The inputs required by the model. \u201coutputs\u201d : The outputs produced by the model. Each model input and output tensors\u2019 metadata is described with a $metadata_tensor object . $metadata_tensor = { \"name\" : $string, \"datatype\" : $string, \"shape\" : [ $number, ... ] } \u201cname\u201d : The name of the tensor. \"datatype\" : The data-type of the tensor elements as defined in Tensor Data Types . \"shape\" : The shape of the tensor. Variable-size dimensions are specified as -1. Model Metadata Response JSON Error Object \u00b6 A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object. $metadata_model_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error. Inference \u00b6 An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object . 
In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object . See Inference Request Examples for some example HTTP/REST requests and responses. Inference Request JSON Object \u00b6 The inference request object, identified as $inference_request , is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. $inference_request = { \"id\" : $string #optional, \"parameters\" : $parameters #optional, \"inputs\" : [ $request_input, ... ], \"outputs\" : [ $request_output, ... ] #optional } \"id\" : An identifier for this request. Optional, but if specified this identifier must be returned in the response. \"parameters\" : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information. \"inputs\" : The input tensors. Each input is described using the $request_input schema defined in Request Input . \"outputs\" : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output . Optional, if not specified all outputs produced by the model will be returned using default $request_output settings. Request Input \u00b6 The $inference_request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch. $inference_request_input = { \"name\" : $string, \"shape\" : [ $number, ... ], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the input tensor. \"shape\" : The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. 
\"datatype\" : The data-type of the input tensor elements as defined in Tensor Data Types . \"parameters\" : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. See Tensor Data for more information. Request Output \u00b6 The $request_output JSON is used to request which output tensors should be returned from the model. $inference_request_output = { \"name\" : $string, \"parameters\" : $parameters #optional, } \"name\" : The name of the output tensor. \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information. Inference Response JSON Object \u00b6 A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response , is returned in the HTTP body. $inference_response = { \"model_name\" : $string, \"model_version\" : $string #optional, \"id\" : $string, \"parameters\" : $parameters #optional, \"outputs\" : [ $response_output, ... ] } \"model_name\" : The name of the model used for inference. \"model_version\" : The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response. \"id\" : The \"id\" identifier given in the request, if any. \"parameters\" : An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information. \"outputs\" : The output tensors. Each output is described using the $response_output schema defined in Response Output . Response Output \u00b6 The $response_output JSON describes an output from the model. If the output is batched, the shape and data represents the full shape of the entire batch. $response_output = { \"name\" : $string, \"shape\" : [ $number, ... 
], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the output tensor. \"shape\" : The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. \"datatype\" : The data-type of the output tensor elements as defined in Tensor Data Types . \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. See Tensor Data for more information. Inference Response JSON Error Object \u00b6 A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object. $inference_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error. Parameters \u00b6 The $parameters JSON describes zero or more \u201cname\u201d/\u201dvalue\u201d pairs, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a $string, $number, or $boolean. $parameters = { $parameter, ... } $parameter = $string : $string | $number | $boolean Currently no parameters are defined. As required, a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. Tensor Data \u00b6 Tensor data must be presented in row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation. Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value.
The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 and bf16 are problematic to communicate explicitly since there is not a standard fp16/bf16 representation across backends nor typically the programmatic support to create the fp16/bf16 representation for a JSON number. For example, the 2-dimensional matrix: [ 1 2 4 5 ] Can be represented in its natural format as: \"data\" : [ [ 1, 2 ], [ 4, 5 ] ] Or in a flattened one-dimensional representation: \"data\" : [ 1, 2, 4, 5 ] Tensor Data Types \u00b6 Tensor data types are shown in the following table along with the size of each type, in bytes. Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2^32) --- Inference Request Examples \u00b6 The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object. POST /v2/models/mymodel/infer HTTP/1.1 Host: localhost:8000 Content-Type: application/json Content-Length: { \"id\" : \"42\", \"inputs\" : [ { \"name\" : \"input0\", \"shape\" : [ 2, 2 ], \"datatype\" : \"UINT32\", \"data\" : [ 1, 2, 3, 4 ] }, { \"name\" : \"input1\", \"shape\" : [ 3 ], \"datatype\" : \"BOOL\", \"data\" : [ true ] } ], \"outputs\" : [ { \"name\" : \"output0\" } ] } For the above request the inference server must return the \u201coutput0\u201d output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned. HTTP/1.1 200 OK Content-Type: application/json Content-Length: { \"id\" : \"42\", \"outputs\" : [ { \"name\" : \"output0\", \"shape\" : [ 3, 2 ], \"datatype\" : \"FP32\", \"data\" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ] } ] } gRPC \u00b6 The GRPC API closely follows the concepts defined in the HTTP/REST API.
A compliant server must implement the health, metadata, and inference APIs described in this section. API rpc Endpoint Request Message Response Message Inference ModelInfer ModelInferRequest ModelInferResponse Model Ready ModelReady [ModelReadyRequest] ModelReadyResponse Model Metadata ModelMetadata ModelMetadataRequest ModelMetadataResponse Server Ready ServerReady ServerReadyRequest ServerReadyResponse Server Live ServerLive ServerLiveRequest ServerLiveResponse For more detailed information on each endpoint and its contents, see API Definitions and Message Contents . See also: The gRPC endpoints, request/response messages and contents are defined in grpc_predict_v2.proto API Definitions \u00b6 The GRPC definition of the service is: // // Inference Server GRPC endpoints. // service GRPCInferenceService { // Check liveness of the inference server. rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {} // Check readiness of the inference server. rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {} // Check readiness of a model in the inference server. rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {} // Get server metadata. rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {} // Get model metadata. rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {} // Perform inference using a specific model. rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {} } Message Contents \u00b6 Health \u00b6 A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. Server Live \u00b6 The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. 
The request and response messages for ServerLive are: message ServerLiveRequest {} message ServerLiveResponse { // True if the inference server is live, false if not live. bool live = 1; } Server Ready \u00b6 The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are: message ServerReadyRequest {} message ServerReadyResponse { // True if the inference server is ready, false if not ready. bool ready = 1; } Model Ready \u00b6 The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are: message ModelReadyRequest { // The name of the model to check for readiness. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. string version = 2; } message ModelReadyResponse { // True if the model is ready, false if not ready. bool ready = 1; } Metadata \u00b6 Server Metadata \u00b6 The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are: message ServerMetadataRequest {} message ServerMetadataResponse { // The server name. string name = 1; // The server version. string version = 2; // The extensions supported by the server. repeated string extensions = 3; } Model Metadata \u00b6 The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are: message ModelMetadataRequest { // The name of the model. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. 
string version = 2; } message ModelMetadataResponse { // Metadata for a tensor. message TensorMetadata { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. A variable-size dimension is represented // by a -1 value. repeated int64 shape = 3; } // The model name. string name = 1; // The versions of the model available on the server. repeated string versions = 2; // The model's platform. See Platforms. string platform = 3; // The model's inputs. repeated TensorMetadata inputs = 4; // The model's outputs. repeated TensorMetadata outputs = 5; } Platforms \u00b6 A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is informational only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format \u201c _ \u201d. The following platform names are allowed: tensorrt_plan : A TensorRT model encoded as a serialized engine or \u201cplan\u201d. tensorflow_graphdef : A TensorFlow model encoded as a GraphDef. tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel. onnx_onnxv1 : An ONNX model encoded for ONNX Runtime. pytorch_torchscript : A PyTorch model encoded as TorchScript. mxnet_mxnet: An MXNet model caffe2_netdef : A Caffe2 model encoded as a NetDef. Inference \u00b6 The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are: message ModelInferRequest { // An input tensor for an inference request. message InferInputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional inference input tensor parameters.
map<string, InferParameter> parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference request. InferTensorContents contents = 5; } // An output tensor requested for an inference request. message InferRequestedOutputTensor { // The tensor name. string name = 1; // Optional requested output tensor parameters. map<string, InferParameter> parameters = 2; } // The name of the model to use for inferencing. string model_name = 1; // The version of the model to use for inference. If not given the // server will choose a version based on the model and internal policy. string model_version = 2; // Optional identifier for the request. If specified will be // returned in the response. string id = 3; // Optional inference parameters. map<string, InferParameter> parameters = 4; // The input tensors for the inference. repeated InferInputTensor inputs = 5; // The requested output tensors for the inference. Optional, if not // specified all outputs produced by the model will be returned. repeated InferRequestedOutputTensor outputs = 6; // The data contained in an input tensor can be represented in \"raw\" // bytes form or in the repeated type that matches the tensor's data // type. To use the raw representation 'raw_input_contents' must be // initialized with data for each tensor in the same order as // 'inputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferInputTensor::contents must // not be specified for any input tensor. repeated bytes raw_input_contents = 7; } message ModelInferResponse { // An output tensor returned for an inference request.
message InferOutputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional output tensor parameters. map<string, InferParameter> parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference response. InferTensorContents contents = 5; } // The name of the model used for inference. string model_name = 1; // The version of the model used for inference. string model_version = 2; // The id of the inference request if one was specified. string id = 3; // Optional inference response parameters. map<string, InferParameter> parameters = 4; // The output tensors holding inference results. repeated InferOutputTensor outputs = 5; // The data contained in an output tensor can be represented in // \"raw\" bytes form or in the repeated type that matches the // tensor's data type. To use the raw representation 'raw_output_contents' // must be initialized with data for each tensor in the same order as // 'outputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferOutputTensor::contents must // not be specified for any output tensor. repeated bytes raw_output_contents = 6; } Parameters \u00b6 The Parameters message describes a \u201cname\u201d/\u201dvalue\u201d pair, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a boolean, integer, or string corresponding to the parameter. Currently, no parameters are defined.
As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. // // An inference parameter value. // message InferParameter { // The parameter value can be a string, an int64, a boolean // or a message specific to a predefined parameter. oneof parameter_choice { // A boolean parameter value. bool bool_param = 1; // An int64 parameter value. int64 int64_param = 2; // A string parameter value. string string_param = 3; } } Tensor Data \u00b6 In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. Using a \"raw\" representation of tensors with ModelInferRequest::raw_input_contents and ModelInferResponse::raw_output_contents will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231. An alternative to the \"raw\" representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor's data type. // // The data contained in a tensor represented by the repeated type // that matches the tensor's data type. Protobuf oneof is not used // because oneofs cannot contain repeated fields. // message InferTensorContents { // Representation for BOOL data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bool bool_contents = 1; // Representation for INT8, INT16, and INT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. 
repeated int32 int_contents = 2; // Representation for INT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated int64 int64_contents = 3; // Representation for UINT8, UINT16, and UINT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. repeated uint32 uint_contents = 4; // Representation for UINT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated uint64 uint64_contents = 5; // Representation for FP32 data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated float fp32_contents = 6; // Representation for FP64 data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated double fp64_contents = 7; // Representation for BYTES data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bytes bytes_contents = 8; } Tensor Data Types \u00b6 Tensor data types are shown in the following table along with the size of each type, in bytes. 
Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2^32)","title":"Open Inference Protocol (V2 Inference Protocol)"},{"location":"modelserving/data_plane/v2_protocol/#open-inference-protocol-v2-inference-protocol","text":"For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs . Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the HTTP/REST API and/or the GRPC API . Check the model serving runtime table / the protocolVersion field in the runtime YAML to ensure V2 protocol is supported for model serving runtime that you are using. Note: For all API descriptions on this page, all strings in all contexts are case-sensitive. The V2 protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.","title":"Open Inference Protocol (V2 Inference Protocol)"},{"location":"modelserving/data_plane/v2_protocol/#note-on-changes-between-v1-v2","text":"V2 protocol does not currently support the explain endpoint like V1 protocol does. If this is a feature you wish to have in the V2 protocol, please submit a github issue .","title":"Note on changes between V1 & V2"},{"location":"modelserving/data_plane/v2_protocol/#httprest","text":"The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.
See also: The HTTP/REST endpoints are defined in rest_predict_v2.yaml API Verb Path Request Payload Response Payload Inference POST v2/models/ [/versions/]/infer $inference_request $inference_response Model Metadata GET v2/models/[/versions/] $metadata_model_response Server Ready GET v2/health/ready $ready_server_response Server Live GET v2/health/live $live_server_response Server Metadata GET v2 $metadata_server_response Model Ready GET v2/models/[/versions/ ]/ready $ready_model_response ** path contents in [] are optional For more information regarding payload contents, see Payload Contents . The versions portion of the Path URLs (in [] ) is shown as optional to allow implementations that don\u2019t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies). For example, if a model does not implement a version, the Model Metadata request path could look like v2/models/my_model . If the model has been configured to implement a version, the request path could look something like v2/models/my_model/versions/v10 , where the version of the model is v10.","title":"HTTP/REST"},{"location":"modelserving/data_plane/v2_protocol/#api-definitions","text":"API Definition Inference The /infer endpoint performs inference on a model. The response is the prediction result. Model Metadata The \"model metadata\" API is a per-model endpoint that returns details about the model passed in the path. Server Ready The \u201cserver ready\u201d health API indicates if all the models are ready for inferencing. The \u201cserver ready\u201d health API can be used directly to implement the Kubernetes readinessProbe. Server Live The \u201cserver live\u201d health API indicates if the inference server is able to receive and respond to metadata and inference requests. The \u201cserver live\u201d API can be used directly to implement the Kubernetes livenessProbe.
Server Metadata The \"server metadata\" API returns details describing the server. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL.","title":"API Definitions"},{"location":"modelserving/data_plane/v2_protocol/#healthreadinessliveness-probes","text":"The Model Readiness probe answers the question \"Did the model download and is it able to serve requests?\" and responds with the available model name(s). The Server Readiness/Liveness probes answer the question \"Is my service and its infrastructure running, healthy, and able to receive and process requests?\" To read more about liveness and readiness probe concepts, visit the Configure Liveness, Readiness and Startup Probes Kubernetes documentation.","title":"Health/Readiness/Liveness Probes"},{"location":"modelserving/data_plane/v2_protocol/#payload-contents","text":"","title":"Payload Contents"},{"location":"modelserving/data_plane/v2_protocol/#model-ready","text":"The model ready endpoint returns the readiness probe response for the server along with the name of the model.","title":"Model Ready"},{"location":"modelserving/data_plane/v2_protocol/#model-ready-response-json-object","text":"$ready_model_response = { \"name\" : $string, \"ready\": $bool }","title":"Model Ready Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-ready","text":"The server ready endpoint returns the readiness probe response for the server.","title":"Server Ready"},{"location":"modelserving/data_plane/v2_protocol/#server-ready-response-json-object","text":"$ready_server_response = { \"live\" : $bool, }","title":"Server Ready Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-live","text":"The server live endpoint returns the liveness probe response for the server.","title":"Server 
Live"},{"location":"modelserving/data_plane/v2_protocol/#server-live-response-json-objet","text":"$live_server_response = { \"live\" : $bool, }","title":"Server Live Response JSON Objet"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata","text":"The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object .","title":"Server Metadata"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata-response-json-object","text":"A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response , is returned in the HTTP body. $metadata_server_response = { \"name\" : $string, \"version\" : $string, \"extensions\" : [ $string, ... ] } \u201cname\u201d : A descriptive name for the server. \"version\" : The server version. \u201cextensions\u201d : The extensions supported by the server. Currently, no standard extensions are defined. Individual inference servers may define and document their own extensions.","title":"Server Metadata Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata-response-json-error-object","text":"A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object. $metadata_server_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error. The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object . 
The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.","title":"Server Metadata Response JSON Error Object"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata","text":"The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object . The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.","title":"Model Metadata"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata-response-json-object","text":"A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response , is returned in the HTTP body for every successful model metadata request. $metadata_model_response = { \"name\" : $string, \"versions\" : [ $string, ... ] #optional, \"platform\" : $string, \"inputs\" : [ $metadata_tensor, ... ], \"outputs\" : [ $metadata_tensor, ... ] } \u201cname\u201d : The name of the model. \"versions\" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don\u2019t support versions. Optional for models that don\u2019t allow a version to be explicitly requested. \u201cplatform\u201d : The framework/backend for the model. See Platforms . \u201cinputs\u201d : The inputs required by the model. \u201coutputs\u201d : The outputs produced by the model. Each model input and output tensors\u2019 metadata is described with a $metadata_tensor object . $metadata_tensor = { \"name\" : $string, \"datatype\" : $string, \"shape\" : [ $number, ... 
] } \u201cname\u201d : The name of the tensor. \"datatype\" : The data-type of the tensor elements as defined in Tensor Data Types . \"shape\" : The shape of the tensor. Variable-size dimensions are specified as -1.","title":"Model Metadata Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata-response-json-error-object","text":"A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object. $metadata_model_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error.","title":"Model Metadata Response JSON Error Object"},{"location":"modelserving/data_plane/v2_protocol/#inference","text":"An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object . In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object . See Inference Request Examples for some example HTTP/REST requests and responses.","title":"Inference"},{"location":"modelserving/data_plane/v2_protocol/#inference-request-json-object","text":"The inference request object, identified as $inference_request , is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. $inference_request = { \"id\" : $string #optional, \"parameters\" : $parameters #optional, \"inputs\" : [ $request_input, ... ], \"outputs\" : [ $request_output, ... ] #optional } \"id\" : An identifier for this request. Optional, but if specified this identifier must be returned in the response. \"parameters\" : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information. 
\"inputs\" : The input tensors. Each input is described using the $request_input schema defined in Request Input . \"outputs\" : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output . Optional, if not specified all outputs produced by the model will be returned using default $request_output settings.","title":"Inference Request JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#request-input","text":"The $inference_request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch. $inference_request_input = { \"name\" : $string, \"shape\" : [ $number, ... ], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the input tensor. \"shape\" : The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. \"datatype\" : The data-type of the input tensor elements as defined in Tensor Data Types . \"parameters\" : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. See Tensor Data for more information.","title":"Request Input"},{"location":"modelserving/data_plane/v2_protocol/#request-output","text":"The $request_output JSON is used to request which output tensors should be returned from the model. $inference_request_output = { \"name\" : $string, \"parameters\" : $parameters #optional, } \"name\" : The name of the output tensor. \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. 
See Parameters for more information.","title":"Request Output"},{"location":"modelserving/data_plane/v2_protocol/#inference-response-json-object","text":"A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response , is returned in the HTTP body. $inference_response = { \"model_name\" : $string, \"model_version\" : $string #optional, \"id\" : $string, \"parameters\" : $parameters #optional, \"outputs\" : [ $response_output, ... ] } \"model_name\" : The name of the model used for inference. \"model_version\" : The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response. \"id\" : The \"id\" identifier given in the request, if any. \"parameters\" : An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information. \"outputs\" : The output tensors. Each output is described using the $response_output schema defined in Response Output .","title":"Inference Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#response-output","text":"The $response_output JSON describes an output from the model. If the output is batched, the shape and data represents the full shape of the entire batch. $response_output = { \"name\" : $string, \"shape\" : [ $number, ... ], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the output tensor. \"shape\" : The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. \"datatype\" : The data-type of the output tensor elements as defined in Tensor Data Types . \"parameters\" : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. 
See Tensor Data for more information.","title":"Response Output"},{"location":"modelserving/data_plane/v2_protocol/#inference-response-json-error-object","text":"A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object. $inference_error_response = { \"error\": } \u201cerror\u201d : The descriptive message for the error.","title":"Inference Response JSON Error Object"},{"location":"modelserving/data_plane/v2_protocol/#parameters","text":"The $parameters JSON describes zero or more \u201cname\u201d/\u201dvalue\u201d pairs, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a $string, $number, or $boolean. $parameters = { $parameter, ... } $parameter = $string : $string | $number | $boolean Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.","title":"Parameters"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data","text":"Tensor data must be presented in row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation. Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 and bf16 are problematic to communicate explicitly since there is not a standard fp16/bf16 representation across backends nor typically the programmatic support to create the fp16/bf16 representation for a JSON number. 
For example, the 2-dimensional matrix: [ 1 2 4 5 ] Can be represented in its natural format as: \"data\" : [ [ 1, 2 ], [ 4, 5 ] ] Or in a flattened one-dimensional representation: \"data\" : [ 1, 2, 4, 5 ]","title":"Tensor Data"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data-types","text":"Tensor data types are shown in the following table along with the size of each type, in bytes. Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2 32 ) ---","title":"Tensor Data Types"},{"location":"modelserving/data_plane/v2_protocol/#inference-request-examples","text":"The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object. POST /v2/models/mymodel/infer HTTP/1.1 Host: localhost:8000 Content-Type: application/json Content-Length: { \"id\" : \"42\", \"inputs\" : [ { \"name\" : \"input0\", \"shape\" : [ 2, 2 ], \"datatype\" : \"UINT32\", \"data\" : [ 1, 2, 3, 4 ] }, { \"name\" : \"input1\", \"shape\" : [ 3 ], \"datatype\" : \"BOOL\", \"data\" : [ true ] } ], \"outputs\" : [ { \"name\" : \"output0\" } ] } For the above request the inference server must return the \u201coutput0\u201d output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned. HTTP/1.1 200 OK Content-Type: application/json Content-Length: { \"id\" : \"42\", \"outputs\" : [ { \"name\" : \"output0\", \"shape\" : [ 3, 2 ], \"datatype\" : \"FP32\", \"data\" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ] } ] }","title":"Inference Request Examples"},{"location":"modelserving/data_plane/v2_protocol/#grpc","text":"The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section. 
API rpc Endpoint Request Message Response Message Inference ModelInfer ModelInferRequest ModelInferResponse Model Ready ModelReady [ModelReadyRequest] ModelReadyResponse Model Metadata ModelMetadata ModelMetadataRequest ModelMetadataResponse Server Ready ServerReady ServerReadyRequest ServerReadyResponse Server Live ServerLive ServerLiveRequest ServerLiveResponse For more detailed information on each endpoint and its contents, see API Definitions and Message Contents . See also: The gRPC endpoints, request/response messages and contents are defined in grpc_predict_v2.proto","title":"gRPC"},{"location":"modelserving/data_plane/v2_protocol/#api-definitions_1","text":"The GRPC definition of the service is: // // Inference Server GRPC endpoints. // service GRPCInferenceService { // Check liveness of the inference server. rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {} // Check readiness of the inference server. rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {} // Check readiness of a model in the inference server. rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {} // Get server metadata. rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {} // Get model metadata. rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {} // Perform inference using a specific model. rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {} }","title":"API Definitions"},{"location":"modelserving/data_plane/v2_protocol/#message-contents","text":"","title":"Message Contents"},{"location":"modelserving/data_plane/v2_protocol/#health","text":"A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. 
The OK code indicates success and other codes indicate failure.","title":"Health"},{"location":"modelserving/data_plane/v2_protocol/#server-live_1","text":"The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are: message ServerLiveRequest {} message ServerLiveResponse { // True if the inference server is live, false if not live. bool live = 1; }","title":"Server Live"},{"location":"modelserving/data_plane/v2_protocol/#server-ready_1","text":"The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are: message ServerReadyRequest {} message ServerReadyResponse { // True if the inference server is ready, false if not ready. bool ready = 1; }","title":"Server Ready"},{"location":"modelserving/data_plane/v2_protocol/#model-ready_1","text":"The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are: message ModelReadyRequest { // The name of the model to check for readiness. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. string version = 2; } message ModelReadyResponse { // True if the model is ready, false if not ready. bool ready = 1; }","title":"Model Ready"},{"location":"modelserving/data_plane/v2_protocol/#metadata","text":"","title":"Metadata"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata_1","text":"The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are: message ServerMetadataRequest {} message ServerMetadataResponse { // The server name. string name = 1; // The server version. 
string version = 2; // The extensions supported by the server. repeated string extensions = 3; }","title":"Server Metadata"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata_1","text":"The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are: message ModelMetadataRequest { // The name of the model. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. string version = 2; } message ModelMetadataResponse { // Metadata for a tensor. message TensorMetadata { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. A variable-size dimension is represented // by a -1 value. repeated int64 shape = 3; } // The model name. string name = 1; // The versions of the model available on the server. repeated string versions = 2; // The model's platform. See Platforms. string platform = 3; // The model's inputs. repeated TensorMetadata inputs = 4; // The model's outputs. repeated TensorMetadata outputs = 5; }","title":"Model Metadata"},{"location":"modelserving/data_plane/v2_protocol/#platforms","text":"A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is information only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format \u201c _ \u201d. The following platform names are allowed: tensorrt_plan : A TensorRT model encoded as a serialized engine or \u201cplan\u201d. tensorflow_graphdef : A TensorFlow model encoded as a GraphDef. 
tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel. onnx_onnxv1 : An ONNX model encoded for ONNX Runtime. pytorch_torchscript : A PyTorch model encoded as TorchScript. mxnet_mxnet: An MXNet model caffe2_netdef : A Caffe2 model encoded as a NetDef.","title":"Platforms"},{"location":"modelserving/data_plane/v2_protocol/#inference_1","text":"The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are: message ModelInferRequest { // An input tensor for an inference request. message InferInputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional inference input tensor parameters. map parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference request. InferTensorContents contents = 5; } // An output tensor requested for an inference request. message InferRequestedOutputTensor { // The tensor name. string name = 1; // Optional requested output tensor parameters. map parameters = 2; } // The name of the model to use for inferencing. string model_name = 1; // The version of the model to use for inference. If not given the // server will choose a version based on the model and internal policy. string model_version = 2; // Optional identifier for the request. If specified will be // returned in the response. string id = 3; // Optional inference parameters. map parameters = 4; // The input tensors for the inference. repeated InferInputTensor inputs = 5; // The requested output tensors for the inference. Optional, if not // specified all outputs produced by the model will be returned. 
repeated InferRequestedOutputTensor outputs = 6; // The data contained in an input tensor can be represented in \"raw\" // bytes form or in the repeated type that matches the tensor's data // type. To use the raw representation 'raw_input_contents' must be // initialized with data for each tensor in the same order as // 'inputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferInputTensor::contents must // not be specified for any input tensor. repeated bytes raw_input_contents = 7; } message ModelInferResponse { // An output tensor returned for an inference request. message InferOutputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional output tensor parameters. map parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference response. InferTensorContents contents = 5; } // The name of the model used for inference. string model_name = 1; // The version of the model used for inference. string model_version = 2; // The id of the inference request if one was specified. string id = 3; // Optional inference response parameters. map parameters = 4; // The output tensors holding inference results. repeated InferOutputTensor outputs = 5; // The data contained in an output tensor can be represented in // \"raw\" bytes form or in the repeated type that matches the // tensor's data type. 
To use the raw representation 'raw_output_contents' // must be initialized with data for each tensor in the same order as // 'outputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferOutputTensor::contents must // not be specified for any output tensor. repeated bytes raw_output_contents = 6; }","title":"Inference"},{"location":"modelserving/data_plane/v2_protocol/#parameters_1","text":"The Parameters message describes a \u201cname\u201d/\u201dvalue\u201d pair, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a boolean, integer, or string corresponding to the parameter. Currently, no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. // // An inference parameter value. // message InferParameter { // The parameter value can be a string, an int64, a boolean // or a message specific to a predefined parameter. oneof parameter_choice { // A boolean parameter value. bool bool_param = 1; // An int64 parameter value. int64 int64_param = 2; // A string parameter value. string string_param = 3; } }","title":"Parameters"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data_1","text":"In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. 
Using a \"raw\" representation of tensors with ModelInferRequest::raw_input_contents and ModelInferResponse::raw_output_contents will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231. An alternative to the \"raw\" representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor's data type. // // The data contained in a tensor represented by the repeated type // that matches the tensor's data type. Protobuf oneof is not used // because oneofs cannot contain repeated fields. // message InferTensorContents { // Representation for BOOL data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bool bool_contents = 1; // Representation for INT8, INT16, and INT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. repeated int32 int_contents = 2; // Representation for INT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated int64 int64_contents = 3; // Representation for UINT8, UINT16, and UINT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. repeated uint32 uint_contents = 4; // Representation for UINT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated uint64 uint64_contents = 5; // Representation for FP32 data type. The size must match what is // expected by the tensor's shape. 
The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated float fp32_contents = 6; // Representation for FP64 data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated double fp64_contents = 7; // Representation for BYTES data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bytes bytes_contents = 8; }","title":"Tensor Data"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data-types_1","text":"Tensor data types are shown in the following table along with the size of each type, in bytes. Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2 32 )","title":"Tensor Data Types"},{"location":"modelserving/detect/aif/germancredit/","text":"Bias detection on an InferenceService using AIF360 \u00b6 This is an example of how to get bias metrics using AI Fairness 360 (AIF360) on KServe. AI Fairness 360, an LF AI incubation project, is an extensible open source toolkit that can help users examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. We will be using the German Credit dataset maintained by the UC Irvine Machine Learning Repository . The German Credit dataset is a dataset that contains data as to whether or not a creditor gave a loan applicant access to a loan along with data about the applicant. The data includes relevant data on an applicant's credit history, savings, and employment as well as some data on the applicant's demographic such as age, sex, and marital status. 
Data like credit history, savings, and employment can be used by creditors to accurately predict the probability that an applicant will repay their loans, however, data such as age and sex should not be used to decide whether an applicant should be given a loan. We would like to be able to check if these \"protected classes\" are being used in a model's predictions. In this example we will feed the model some inputs and calculate metrics based off of the predictions the model makes. We will be using the KServe payload logging capability to collect the metrics. These metrics will give insight as to whether or not the model is biased for or against any protected classes. In this example we will look at the bias our deployed model has on those of age > 25 vs. those of age <= 25 and see if creditors are treating either unfairly. Sample resources for deploying the example can be found here Create the InferenceService \u00b6 Apply the CRD kubectl kubectl apply -f bias.yaml Expected Output $ inferenceservice.serving.kserve.io/german-credit created Deploy the message dumper (sample backend receiver for payload logs) \u00b6 Apply the message-dumper CRD which will collect the logs that are created when running predictions on the inferenceservice. In a production setup, Kafka can be used instead of message-dumper to receive payload logs kubectl kubectl apply -f message-dumper.yaml Expected Output service.serving.knative.dev/message-dumper created Run a prediction \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = german-credit SERVICE_HOSTNAME = $( kubectl get inferenceservice ${ MODEL_NAME } -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) python simulate_predicts.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict ${ SERVICE_HOSTNAME } Process payload logs for metrics calculation \u00b6 Run json_from_logs.py which will craft a payload that AIF can interpret. 
First, the events logs are taken from the message-dumper and then those logs are parsed to match inputs with outputs. Then the input/outputs pairs are all combined into a list of inputs and a list of outputs for AIF to interpret. A data.json file should have been created in this folder which contains the json payload. python json_from_logs.py Run an explanation \u00b6 Finally, now that we have collected a number of our model's predictions and their corresponding inputs we will send these to the AIF server to calculate the bias metrics. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json Interpreting the results \u00b6 Now let's look at one of the metrics. In this example disparate impact represents the ratio of the probability of applicants of the unprivileged class (age <= 25) getting a loan to the probability of applicants of the privileged class (age > 25) getting a loan, P(Y=1|D=unprivileged)/P(Y=1|D=privileged) . Since, in the sample output below, the disparate impact is less than 1, the probability that an applicant whose age is greater than 25 gets a loan is significantly higher than the probability that an applicant whose age is less than or equal to 25 gets a loan. This in and of itself is not proof that the model is biased, but does hint that there may be some bias and a deeper look may be needed. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json Expected Output Sending bias query... TIME TAKEN: 0 .21137404441833496 base_rate : 0 .9329608938547486 consistency : [ 0 .982122905027933 ] disparate_impact : 0 .52 num_instances : 179 .0 num_negatives : 12 .0 num_positives : 167 .0 statistical_parity_difference : -0.48 Dataset \u00b6 The dataset used in this example is the German Credit dataset maintained by the UC Irvine Machine Learning Repository . Dua, D. and Graff, C. (2019). 
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.","title":"AIF Bias Detector"},{"location":"modelserving/detect/aif/germancredit/#bias-detection-on-an-inferenceservice-using-aif360","text":"This is an example of how to get bias metrics using AI Fairness 360 (AIF360) on KServe. AI Fairness 360, an LF AI incubation project, is an extensible open source toolkit that can help users examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. We will be using the German Credit dataset maintained by the UC Irvine Machine Learning Repository . The German Credit dataset is a dataset that contains data as to whether or not a creditor gave a loan applicant access to a loan along with data about the applicant. The data includes relevant data on an applicant's credit history, savings, and employment as well as some data on the applicant's demographic such as age, sex, and marital status. Data like credit history, savings, and employment can be used by creditors to accurately predict the probability that an applicant will repay their loans; however, data such as age and sex should not be used to decide whether an applicant should be given a loan. We would like to be able to check if these \"protected classes\" are being used in a model's predictions. In this example we will send the model some prediction requests and calculate metrics based on the predictions the model makes. We will be using KServe payload logging capability to collect the metrics. These metrics will give insight as to whether or not the model is biased for or against any protected classes. In this example we will look at the bias our deployed model has on those of age > 25 vs. those of age <= 25 and see whether either group is treated unfairly. 
Sample resources for deploying the example can be found here","title":"Bias detection on an InferenceService using AIF360"},{"location":"modelserving/detect/aif/germancredit/#create-the-inferenceservice","text":"Apply the CRD kubectl kubectl apply -f bias.yaml Expected Output $ inferenceservice.serving.kserve.io/german-credit created","title":"Create the InferenceService"},{"location":"modelserving/detect/aif/germancredit/#deploy-the-message-dumper-sample-backend-receiver-for-payload-logs","text":"Apply the message-dumper CRD which will collect the logs that are created when running predictions on the inferenceservice. In production setup, instead of message-dumper Kafka can be used to receive payload logs kubectl kubectl apply -f message-dumper.yaml Expected Output service.serving.knative.dev/message-dumper created","title":"Deploy the message dumper (sample backend receiver for payload logs)"},{"location":"modelserving/detect/aif/germancredit/#run-a-prediction","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = german-credit SERVICE_HOSTNAME = $( kubectl get inferenceservice ${ MODEL_NAME } -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) python simulate_predicts.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict ${ SERVICE_HOSTNAME }","title":"Run a prediction"},{"location":"modelserving/detect/aif/germancredit/#process-payload-logs-for-metrics-calculation","text":"Run json_from_logs.py which will craft a payload that AIF can interpret. First, the events logs are taken from the message-dumper and then those logs are parsed to match inputs with outputs. Then the input/outputs pairs are all combined into a list of inputs and a list of outputs for AIF to interpret. A data.json file should have been created in this folder which contains the json payload. 
python json_from_logs.py","title":"Process payload logs for metrics calculation"},{"location":"modelserving/detect/aif/germancredit/#run-an-explanation","text":"Finally, now that we have collected a number of our model's predictions and their corresponding inputs we will send these to the AIF server to calculate the bias metrics. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json","title":"Run an explanation"},{"location":"modelserving/detect/aif/germancredit/#interpreting-the-results","text":"Now let's look at one of the metrics. In this example disparate impact represents the ratio of the probability of applicants of the unprivileged class (age <= 25) getting a loan to the probability of applicants of the privileged class (age > 25) getting a loan, P(Y=1|D=unprivileged)/P(Y=1|D=privileged) . Since, in the sample output below, the disparate impact is less than 1, the probability that an applicant whose age is greater than 25 gets a loan is significantly higher than the probability that an applicant whose age is less than or equal to 25 gets a loan. This in and of itself is not proof that the model is biased, but does hint that there may be some bias and a deeper look may be needed. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json Expected Output Sending bias query... TIME TAKEN: 0 .21137404441833496 base_rate : 0 .9329608938547486 consistency : [ 0 .982122905027933 ] disparate_impact : 0 .52 num_instances : 179 .0 num_negatives : 12 .0 num_positives : 167 .0 statistical_parity_difference : -0.48","title":"Interpreting the results"},{"location":"modelserving/detect/aif/germancredit/#dataset","text":"The dataset used in this example is the German Credit dataset maintained by the UC Irvine Machine Learning Repository . Dua, D. and Graff, C. (2019). 
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.","title":"Dataset"},{"location":"modelserving/detect/aif/germancredit/server/","text":"Logistic Regression Model on the German Credit dataset \u00b6 Build a development docker image \u00b6 To build a development image first download these files and move them into the server/ folder - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc First build your docker image by changing directory to kserve/python and replacing dockeruser with your docker username in the snippet below (running this will take some time). docker build -t dockeruser/aifserver:latest -f aiffairness.Dockerfile . Then push your docker image to your dockerhub repo (this will take some time) docker push dockeruser/aifserver:latest Once your docker image is pushed you can pull the image from dockeruser/aifserver:latest when deploying an inferenceservice by specifying the image in the yaml file.","title":"Logistic Regression Model on the German Credit dataset"},{"location":"modelserving/detect/aif/germancredit/server/#logistic-regression-model-on-the-german-credit-dataset","text":"","title":"Logistic Regression Model on the German Credit dataset"},{"location":"modelserving/detect/aif/germancredit/server/#build-a-development-docker-image","text":"To build a development image first download these files and move them into the server/ folder - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc First build your docker image by changing directory to kserve/python and replacing dockeruser with your docker username in the snippet below (running this will take some time). 
docker build -t dockeruser/aifserver:latest -f aiffairness.Dockerfile . Then push your docker image to your dockerhub repo (this will take some time) docker push dockeruser/aifserver:latest Once your docker image is pushed you can pull the image from dockeruser/aifserver:latest when deploying an inferenceservice by specifying the image in the yaml file.","title":"Build a development docker image"},{"location":"modelserving/detect/alibi_detect/alibi_detect/","text":"Deploy InferenceService with Alibi Outlier/Drift Detector \u00b6 In order to trust and reliably act on model predictions, it is crucial to monitor the distribution of the incoming requests with various types of detectors. KServe integrates Alibi Detect with the following components: Drift detector checks when the distribution of incoming requests is diverging from a reference distribution such as that of the training data. Outlier detector flags single instances which do not follow the training distribution. The architecture used is shown below and links the payload logging available within KServe with asynchronous processing of those payloads in KNative to detect outliers. CIFAR10 Outlier Detector \u00b6 A CIFAR10 Outlier Detector. Run the notebook demo to test. The notebook requires KNative Eventing >= 0.18. CIFAR10 Drift Detector \u00b6 A CIFAR10 Drift Detector. Run the notebook demo to test. The notebook requires KNative Eventing >= 0.18.","title":"Alibi Detector"},{"location":"modelserving/detect/alibi_detect/alibi_detect/#deploy-inferenceservice-with-alibi-outlierdrift-detector","text":"In order to trust and reliably act on model predictions, it is crucial to monitor the distribution of the incoming requests with various types of detectors. KServe integrates Alibi Detect with the following components: Drift detector checks when the distribution of incoming requests is diverging from a reference distribution such as that of the training data. 
Outlier detector flags single instances which do not follow the training distribution. The architecture used is shown below and links the payload logging available within KServe with asynchronous processing of those payloads in KNative to detect outliers.","title":"Deploy InferenceService with Alibi Outlier/Drift Detector"},{"location":"modelserving/detect/alibi_detect/alibi_detect/#cifar10-outlier-detector","text":"A CIFAR10 Outlier Detector. Run the notebook demo to test. The notebook requires KNative Eventing >= 0.18.","title":"CIFAR10 Outlier Detector"},{"location":"modelserving/detect/alibi_detect/alibi_detect/#cifar10-drift-detector","text":"A CIFAR10 Drift Detector. Run the notebook demo to test. The notebook requires KNative Eventing >= 0.18.","title":"CIFAR10 Drift Detector"},{"location":"modelserving/detect/art/mnist/","text":"Using ART to get adversarial examples for MNIST classifications \u00b6 This is an example to show how adversarially modified inputs can trick models to predict incorrectly to highlight model vulnerability to adversarial attacks. It is using the Adversarial Robustness Toolbox (ART) on KServe. ART provides tools that enable developers to evaluate, defend, and verify ML models and applications against adversarial threats. Apart from giving capabilities to craft adversarial attacks , it also provides algorithms to defend against them. We will be using the MNIST dataset which is a dataset of handwritten digits and find adversarial examples which can make the model predict a classification incorrectly, thereby showing the vulnerability of the model against adversarial attacks. 
Sample resources for deploying the example can be found here To deploy the inferenceservice with v1beta1 API kubectl apply -f art.yaml Then find the url kubectl get inferenceservice NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE artserver http://artserver.somecluster/v1/models/artserver True 100 40m Explanation \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = artserver SERVICE_HOSTNAME = $( kubectl get inferenceservice ${ MODEL_NAME } -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) python query_explain.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ ${ MODEL_NAME } :explain ${ SERVICE_HOSTNAME } After some time you should see a pop up containing the explanation, similar to the image below. If a pop up does not display and the message \"Unable to find an adversarial example.\" appears then an adversarial example could not be found for the image given in a timely manner. If a pop up does display then the image on the left is the original image and the image on the right is the adversarial example. The labels above both images represent what classification the model made for each individual image. The Square Attack method used in this example creates a random update at each iteration and adds this update to the adversarial input if it makes a misclassification more likely (more specifically, if it improves the objective function). Once enough random updates are added together and the model misclassifies then the resulting adversarial input will be returned and displayed. To try a different MNIST example add an integer to the end of the query between 0-9,999. The integer chosen will be the index of the image to be chosen in the MNIST dataset. Or to try a file with custom data add the file path to the end. Keep in mind that the data format must be {\"instances\": [,