diff --git a/00_notebooks/00_index.ipynb b/00_notebooks/00_index.ipynb
index 8dc4ba0d..46fe372a 100644
--- a/00_notebooks/00_index.ipynb
+++ b/00_notebooks/00_index.ipynb
@@ -91,6 +91,7 @@
     "* [*Labelbox* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/labelbox-integration/)\n",
     "* [*Kafka* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/kafka/)\n",
     "* [*Flink* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/flink/)\n",
+    "* [*Red Hat OpenShift AI* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/red-hat-openshift-ai/)\n",
     "* [How to **migrate or clone** a repo](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/migrate-or-clone-repo/)"
    ]
   },
diff --git a/01_standalone_examples/red-hat-openshift-ai/README.md b/01_standalone_examples/red-hat-openshift-ai/README.md
new file mode 100644
index 00000000..7b57b11f
--- /dev/null
+++ b/01_standalone_examples/red-hat-openshift-ai/README.md
@@ -0,0 +1,209 @@
+# Overview
+
+[lakeFS](https://lakefs.io/) is a data versioning application that brings Git-like versioning to object storage. It can interface with many object storage applications on the backend and provides an S3 API gateway for object storage clients to connect to. In this demo, we'll configure OpenShift AI to connect over the S3 interface to lakeFS, which will version the data in a backend [MinIO](https://min.io/docs/minio/kubernetes/openshift/index.html) instance.
+
+![lakefs](img/lakefsv3.png)
+
+# lakeFS with OpenShift AI Demo
+
+Follow the steps below to run the [Fraud Detection demo](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2-latest/html/openshift_ai_tutorial_-_fraud_detection_example/index) on OpenShift AI, with lakeFS used for object storage management.
+
+## Prerequisites
+
+1. Bring up an [OpenShift cluster](https://docs.redhat.com/en/documentation/openshift_container_platform/4.17#Install)
+2. Install [OpenShift Service Mesh](https://docs.openshift.com/container-platform/4.16/service_mesh/v2x/installing-ossm.html#ossm-install-ossm-operator_installing-ossm), [OpenShift Serverless](https://docs.openshift.com/serverless/1.34/install/install-serverless-operator.html) and [OpenShift Pipelines](https://docs.openshift.com/pipelines/1.16/install_config/installing-pipelines.html) on the OpenShift cluster
+3. Install [OpenShift AI](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.13/html/installing_and_uninstalling_openshift_ai_self-managed/index) on the OpenShift cluster
+4. Install the `oc` OpenShift [CLI client](https://docs.openshift.com/container-platform/4.16/cli_reference/openshift_cli/getting-started-cli.html) on a machine that has access to the cluster
+
+## Deploy and Configure the Environment
+From the client machine, authenticate the `oc` client.
+
+```
+oc login -u kubeadmin -p <password>
+```
+
+### Create a `lakefs` project in OpenShift
+
+```
+oc new-project lakefs
+```
+
+### Clone the lakeFS samples repo
+Clone the [lakeFS-samples.git](https://github.com/treeverse/lakeFS-samples.git) repository and change into the newly created directory.
+
+```
+git clone https://github.com/treeverse/lakeFS-samples.git
+
+cd lakeFS-samples/01_standalone_examples/red-hat-openshift-ai/cluster-configuration
+```
+
+### Deploy MinIO
+Deploy MinIO in the `lakefs` project using the `minio-via-lakefs.yaml` file.
+
+```
+oc apply -f minio-via-lakefs.yaml
+```
+A random MinIO root user and password will be generated, stored in a `secret`, and used to populate MinIO with three storage buckets:
+* **my-storage**
+* **pipeline-artifacts**
+* **quickstart**
+
+### Deploy lakeFS
+Deploy lakeFS in the **lakefs** project using the `lakefs-minio.yaml` file. This YAML not only deploys lakeFS but also:
+* connects it to the MinIO buckets created earlier
+* creates two lakeFS repositories:
+  * **quickstart**: a sample data repo
+  * **my-storage**: connected to the backend `my-storage` S3 bucket created earlier
+
+```
+oc apply -f lakefs-minio.yaml
+```
+
+### Access lakeFS UI
+You can now log into the OpenShift cluster's web console as a regular user (e.g. developer). Follow the arrows in the screenshot below to find the lakeFS `route`, which provides external access to lakeFS. Use the lakeFS route to access the lakeFS UI.
+
+For this demo, you will use the following credentials (defined in `lakefs-minio.yaml`) to access the lakeFS UI.
+
+* **Access Key**: something
+* **Secret Access Key**: simple
+
+  ![lakefs](img/lakefs-route.png)
+
+NOTES:
+- You can also follow the above steps, but click on MinIO in the topology, to find the `route` for MinIO's console or S3 interface. MinIO access credentials can be found in the `minio-root-user` secret within the OpenShift web console when logged in as an admin user (e.g. kubeadmin).
+
+  - Switch to the **Administrator** persona using the drop-down at the top left
+  - Expand the **Workloads** navigation
+  - Click on **Secrets**
+  - Filter for the name 'minio'
+  - Click on the **minio-root-user** secret
+  - Scroll down and click on **Reveal values** to see the MinIO root user and password
+
+- If you don't see the visual layout shown in the screenshot, click on the icon highlighted below to change the view.
+
+  ![lakefs](img/topology.png)
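+
+### Verify the lakeFS deployment (optional)
+
+Before moving on, you can confirm from any machine with Python that lakeFS is healthy and that the setup job created the two repositories. This is a minimal sketch, not part of the tutorial: the URL below is a placeholder for your cluster-specific route host, the credentials are the demo values above, and `verify=False` is only there because the route uses an edge-terminated, cluster-generated certificate.
+
+```
+import requests
+
+# Placeholder: replace with the lakeFS route host found in the previous step
+LAKEFS_URL = "https://lakefs-route-lakefs.apps.example.com"
+AUTH = ("something", "simple")  # demo credentials from lakefs-minio.yaml
+
+# Health endpoint, the same one the deployment probes and setup job poll
+resp = requests.get(f"{LAKEFS_URL}/_health", verify=False)
+print("lakeFS health:", resp.status_code)
+
+# List repositories; expect "quickstart" and "my-storage"
+resp = requests.get(f"{LAKEFS_URL}/api/v1/repositories", auth=AUTH, verify=False)
+resp.raise_for_status()
+for repo in resp.json()["results"]:
+    print(repo["id"], "->", repo["storage_namespace"])
+```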
+
+### Access OpenShift AI Console
+From the OpenShift web console, you can now open the OpenShift AI web console as shown below.
+
+![lakefs](img/oai-console.png)
+
+## Fraud Detection Demo
+
+You may now run through the [Fraud Detection demo](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2-latest/html/openshift_ai_tutorial_-_fraud_detection_example/index) in the new **lakefs** data science project. Refer to the following notes for the corresponding sections of the demo:
+
+2.2. Setting up your data science project:
+* Use the `lakefs` data science project for the demo. You do not need to create a new project.
+
+2.3. Storing data with data connections:
+* When going through the demo, follow the steps to manually configure the storage data connections. **Do not** follow the steps that use a script to automate the MinIO storage deployment, configuration, and data connections.
+
+2.3.1. Creating data connections to your own S3-compatible object storage:
+* When creating the "My Storage" data connection, use the lakeFS access key ("something"), secret key ("simple"), endpoint ("http://my-lakefs"), region ("us-east-1") and bucket ("my-storage") instead of the MinIO access key and endpoint (a sketch of how a workbench consumes these values follows below):
+
+  ![My Storage data connection](img/data-connection-my-storage.png)
+
+* When creating the "Pipeline Artifacts" data connection, use the MinIO access key, secret key, endpoint (the route to access MinIO's S3 interface), region ("us-east-1") and bucket ("pipeline-artifacts"):
+
+  ![Pipeline Artifacts data connection](img/data-connection-pipeline-artifacts.png)
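+
+For reference, a workbench or pipeline pod consumes the "My Storage" data connection as plain S3 settings, which is why the lakeFS endpoint and credentials go there: every read and write then passes through the lakeFS S3 gateway, where the "bucket" is the `my-storage` repository and object keys are prefixed with a branch name. A minimal sketch, assuming the data connection is exposed to the pod as the usual `AWS_*` environment variables (the same keys the `aws-connection-*` secrets in this example define):
+
+```
+import os
+import boto3
+
+# Assumption: the "My Storage" data connection injects these AWS_* variables
+s3 = boto3.client(
+    "s3",
+    endpoint_url=os.environ["AWS_S3_ENDPOINT"],                 # http://my-lakefs (lakeFS S3 gateway)
+    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],          # "something"
+    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],  # "simple"
+    region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
+)
+
+# With lakeFS, the "bucket" is the repository and keys start with a branch name
+bucket = os.environ["AWS_S3_BUCKET"]  # "my-storage" repository
+resp = s3.list_objects_v2(Bucket=bucket, Prefix="main/")
+for obj in resp.get("Contents", []):
+    print(obj["Key"])
+```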
+
+3.1. Creating a workbench and selecting a notebook image:
+* While creating the workbench, add the following environment variables so the notebooks can access lakeFS (the sketch after this list shows how they are used):
+  * LAKECTL_SERVER_ENDPOINT_URL = http://my-lakefs
+  * LAKEFS_REPO_NAME = my-storage
+  * LAKEFS_DEFAULT_REGION = us-east-1
+  * LAKECTL_CREDENTIALS_ACCESS_KEY_ID = something
+  * LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY = simple
+
+  ![Workbench lakeFS Environment Variables](img/workbench-lakefs-env-variables.png)
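+
+The lakeFS tutorial notebooks read these variables to open the repository and to stream data through the lakeFS S3 gateway. A rough sketch of that usage, mirroring `1_experiment_train_lakefs.ipynb` and assuming the `lakefs` and `s3fs` packages that the notebook installs:
+
+```
+import os
+import lakefs          # high-level lakeFS SDK, pinned to 0.7.1 in the notebooks
+import pandas as pd    # s3:// reads below rely on s3fs being installed
+
+# The lakefs SDK picks up LAKECTL_SERVER_ENDPOINT_URL and the LAKECTL_CREDENTIALS_*
+# variables from the environment; the remaining variables are read explicitly.
+repo = lakefs.Repository(os.environ["LAKEFS_REPO_NAME"])
+branch = repo.branch("train01").create(source_reference="main", exist_ok=True)
+
+# Reading data through the lakeFS S3 gateway with pandas/s3fs uses the same variables
+storage_options = {
+    "key": os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"],
+    "secret": os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"],
+    "client_kwargs": {"endpoint_url": os.environ["LAKECTL_SERVER_ENDPOINT_URL"]},
+}
+df = pd.read_csv(
+    f"s3://{os.environ['LAKEFS_REPO_NAME']}/train01/data/train.csv",
+    storage_options=storage_options,
+)
+```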
+
+3.2. Importing the tutorial files into the Jupyter environment:
+* After cloning and selecting the latest branch of the Fraud Detection tutorial repository (https://github.com/rh-aiservices-bu/fraud-detection.git), double-click the newly created `fraud-detection` folder in the file browser and click the "Upload Files" icon:
+
+  ![Fraud Detection Tutorial fraud-detection folder](img/fraud-detection-tutorial-image1.png)
+
+* Select and upload the tutorial notebooks changed for the lakeFS tutorial (ending with lakeFS), which are saved in the `lakeFS-samples/red-hat-openshift-ai/fraud-detection` folder of the `lakeFS-samples` repo (https://github.com/treeverse/lakeFS-samples.git):
+
+  ![Fraud Detection Tutorial upload lakeFS Notebooks](img/fraud-detection-tutorial-image2.png)
+
+* Double-click the `ray-scripts` subfolder inside the `fraud-detection` folder in the file browser and click the "Upload Files" icon:
+
+  ![Fraud Detection Tutorial ray-scripts subfolder](img/fraud-detection-tutorial-image3.png)
+
+* Select and upload `train_tf_cpu_lakefs.py`, changed for the lakeFS tutorial, which is saved in the `lakeFS-samples/red-hat-openshift-ai/fraud-detection/ray-scripts` folder of the `lakeFS-samples` repo:
+
+  ![Fraud Detection Tutorial upload ray script](img/fraud-detection-tutorial-image4.png)
+
+* After uploading the `train_tf_cpu_lakefs.py` file, the file browser will show two Python programs:
+
+  ![Fraud Detection Tutorial ray-scripts subfolder after uploading script](img/fraud-detection-tutorial-image5.png)
+
+* Double-click the `pipeline` subfolder inside the `fraud-detection` folder in the file browser and click the "Upload Files" icon:
+
+  ![Fraud Detection Tutorial pipeline subfolder](img/fraud-detection-tutorial-image11.png)
+
+* Select and upload `7_get_data_train_upload_lakefs.py` and `build_lakefs.sh`, changed for the lakeFS tutorial, which are saved in the `lakeFS-samples/red-hat-openshift-ai/fraud-detection/pipeline` folder of the `lakeFS-samples` repo:
+
+  ![Fraud Detection Tutorial upload pipeline](img/fraud-detection-tutorial-image12.png)
+
+3.4. Training a model:
+* In your notebook environment, open the `1_experiment_train_lakefs.ipynb` file instead of `1_experiment_train.ipynb` and follow the instructions directly in the notebook. The instructions guide you through some simple data exploration, experimentation, and model training tasks.
+
+4.1. Preparing a model for deployment:
+* In your notebook environment, open the `2_save_model_lakefs.ipynb` file instead of `2_save_model.ipynb` and follow the instructions directly in the notebook.
+
+4.2. Deploying a model:
+* Use the lakeFS branch name in the path that leads to the version folder containing your model file: `train01/models/fraud`:
+
+  ![Fraud Detection Tutorial Deploy Model](img/fraud-detection-tutorial-image6.png)
+
+4.3. Testing the model API:
+* In your notebook environment, open the `3_rest_requests_multi_model_lakefs.ipynb` file instead of `3_rest_requests_multi_model.ipynb` and follow the instructions directly in the notebook.
+* In your notebook environment, open the `4_grpc_requests_multi_model_lakefs.ipynb` file instead of `4_grpc_requests_multi_model.ipynb` and follow the instructions directly in the notebook.
+* In your notebook environment, open the `5_rest_requests_single_model_lakefs.ipynb` file instead of `5_rest_requests_single_model.ipynb` and follow the instructions directly in the notebook.
+
+5.1. Automating workflows with data science pipelines:
+* Instead of creating the Red Hat OpenShift AI pipeline from scratch, you can run the already-created pipeline called `6 Train Save lakefs.pipeline`. In your notebook environment, open `6 Train Save lakefs.pipeline` and click the play button in the toolbar of the pipeline editor to run the pipeline. If you want to create the pipeline from scratch, follow the tutorial instructions but make the following changes in section 5.1.5:
+
+5.1.5. Configure the data connection to the S3 storage bucket:
+* Under Kubernetes Secrets, use the secret name of the `pipeline-artifacts` data connection for the following environment variables in **both nodes** of the pipeline:
+  * AWS_ACCESS_KEY_ID
+  * AWS_SECRET_ACCESS_KEY
+  * AWS_S3_ENDPOINT
+  * AWS_DEFAULT_REGION
+  * AWS_S3_BUCKET
+
+  ![Fraud Detection Tutorial Pipeline Kubernetes Secrets 1](img/fraud-detection-tutorial-image7.png)
+
+  ![Fraud Detection Tutorial Pipeline Kubernetes Secrets 2](img/fraud-detection-tutorial-image8.png)
+
+* Under Kubernetes Secrets, use the secret name of the `my-storage` data connection when adding the following lakeFS environment variables in **both nodes** of the pipeline:
+  * LAKECTL_SERVER_ENDPOINT_URL = AWS_S3_ENDPOINT
+  * LAKECTL_CREDENTIALS_ACCESS_KEY_ID = AWS_ACCESS_KEY_ID
+  * LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY = AWS_SECRET_ACCESS_KEY
+  * LAKEFS_REPO_NAME = AWS_S3_BUCKET
+  * LAKEFS_DEFAULT_REGION = AWS_DEFAULT_REGION
+
+  ![Fraud Detection Tutorial Pipeline Kubernetes Secrets 3](img/fraud-detection-tutorial-image9.png)
+
+  ![Fraud Detection Tutorial Pipeline Kubernetes Secrets 4](img/fraud-detection-tutorial-image10.png)
+
+5.2. Running a data science pipeline generated from Python code:
+* Use `7_get_data_train_upload_lakefs.yaml` instead of `7_get_data_train_upload.yaml` when importing the pipeline in OpenShift AI.
+
+6.1. Distributing training jobs with Ray:
+* In your notebook environment, open the `8_distributed_training_lakefs.ipynb` file instead of `8_distributed_training.ipynb`. Change the MinIO access and secret keys in the second code cell of the notebook and run the notebook.
+
+  Optionally, if you want to view the Python code for this section, you can find it in the `ray-scripts/train_tf_cpu_lakefs.py` file.
+
+See the [lakeFS documentation](https://docs.lakefs.io/) and the [MinIO documentation for OpenShift](https://min.io/docs/minio/kubernetes/openshift/index.html) for details.
+
+# File Descriptions
+
+- [lakefs-local.yaml](./lakefs-local.yaml): Bring up lakeFS using local object storage. This is useful for a quick demo that does not include MinIO.
+- [lakefs-minio.yaml](./lakefs-minio.yaml): Bring up lakeFS configured to use MinIO as the backend object storage. This is used in the lakeFS demo.
+- [minio-direct.yaml](./minio-direct.yaml): This file is only used when lakeFS is not in the picture and OpenShift AI communicates directly with MinIO. It brings up MinIO as in the default Fraud Detection demo, complete with configuring the MinIO storage buckets and the OpenShift AI data connections. It may prove useful when debugging an issue.
+- [minio-via-lakefs.yaml](./minio-via-lakefs.yaml): Bring up MinIO for the modified Fraud Detection demo that includes lakeFS, complete with configuring MinIO storage buckets, but do NOT configure the OpenShift AI data connections. This will be used in the lakeFS demo. diff --git a/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/.gitkeep b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/lakefs-local.yaml b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/lakefs-local.yaml new file mode 100644 index 00000000..3a2628cc --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/lakefs-local.yaml @@ -0,0 +1,173 @@ +--- +kind: ConfigMap +apiVersion: v1 +metadata: + name: my-lakefs + namespace: lakefs + labels: + app.kubernetes.io/managed-by: Helm + annotations: + meta.helm.sh/release-name: my-lakefs + meta.helm.sh/release-namespace: lakefs +data: + config.yaml: | + database: + type: local + blockstore: + type: local +--- +kind: Deployment +apiVersion: apps/v1 +metadata: + annotations: + deployment.kubernetes.io/revision: '2' + meta.helm.sh/release-name: my-lakefs + meta.helm.sh/release-namespace: lakefs + resourceVersion: '102204' + name: my-lakefs + namespace: lakefs + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: lakefs + app.kubernetes.io/version: 1.38.0 + helm.sh/chart: lakefs-1.3.14 +spec: + replicas: 1 + selector: + matchLabels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/name: lakefs + template: + metadata: + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/name: lakefs + annotations: + checksum/config: 2dde95d5a2b50bddc89371d1692db1005db9407701085531ea77ce14b56c6ec1 + spec: + restartPolicy: Always + serviceAccountName: default + schedulerName: default-scheduler + terminationGracePeriodSeconds: 30 + securityContext: {} + containers: + - resources: {} + readinessProbe: + httpGet: + path: /_health + port: http + scheme: HTTP + timeoutSeconds: 1 + periodSeconds: 10 + successThreshold: 1 + failureThreshold: 3 + terminationMessagePath: /dev/termination-log + name: lakefs + livenessProbe: + httpGet: + path: /_health + port: http + scheme: HTTP + timeoutSeconds: 1 + periodSeconds: 10 + successThreshold: 1 + failureThreshold: 3 + env: + - name: LAKEFS_AUTH_ENCRYPT_SECRET_KEY + value: asdjfhjaskdhuioaweyuiorasdsjbaskcbkj + ports: + - name: http + containerPort: 8000 + protocol: TCP + imagePullPolicy: IfNotPresent + volumeMounts: + - name: config-volume + mountPath: /etc/lakefs + - name: lakefs-volume + mountPath: /lakefs + terminationMessagePolicy: File + image: 'treeverse/lakefs:1.38.0' + args: + - run + - '--config' + - /etc/lakefs/config.yaml + serviceAccount: default + volumes: + - name: config-volume + configMap: + name: my-lakefs + items: + - key: config.yaml + path: config.yaml + defaultMode: 420 + - name: lakefs-volume + emptyDir: + sizeLimit: 100Mi + dnsPolicy: ClusterFirst + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% + maxSurge: 25% + revisionHistoryLimit: 10 + progressDeadlineSeconds: 600 +--- +kind: Service +apiVersion: v1 +metadata: + name: my-lakefs + namespace: lakefs + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: lakefs + 
app.kubernetes.io/version: 1.38.0 + helm.sh/chart: lakefs-1.3.14 + annotations: + meta.helm.sh/release-name: my-lakefs + meta.helm.sh/release-namespace: lakefs +spec: + ipFamilies: + - IPv4 + ports: + - name: http + protocol: TCP + port: 80 + targetPort: http + internalTrafficPolicy: Cluster + type: ClusterIP + ipFamilyPolicy: SingleStack + sessionAffinity: None + selector: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/name: lakefs +--- +kind: Route +apiVersion: route.openshift.io/v1 +metadata: + name: lakefs-route + namespace: lakefs + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: lakefs + app.kubernetes.io/version: 1.38.0 + helm.sh/chart: lakefs-1.3.14 + annotations: + openshift.io/host.generated: 'true' +spec: + host: lakefs-route-lakefs.apps-crc.testing + to: + kind: Service + name: my-lakefs + weight: 100 + port: + targetPort: http + wildcardPolicy: None \ No newline at end of file diff --git a/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/lakefs-minio.yaml b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/lakefs-minio.yaml new file mode 100644 index 00000000..9cc04e99 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/lakefs-minio.yaml @@ -0,0 +1,268 @@ +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: create-cfg-map-lakefs +spec: + selector: {} + template: + spec: + containers: + - args: + - -ec + - |- + echo -n 'Waiting for minio root user secret' + while ! oc get secret minio-root-user 2>/dev/null | grep -qF minio-root-user; do + echo -n . + sleep 5 + done; echo + + cat << EOF | oc apply -f- + kind: ConfigMap + apiVersion: v1 + metadata: + name: my-lakefs + labels: + app.kubernetes.io/managed-by: Helm + annotations: + meta.helm.sh/release-name: my-lakefs + meta.helm.sh/release-namespace: lakefs + data: + config.yaml: | + logging: + format: json + level: WARN + output: "-" + database: + type: "local" + auth: + encrypt: + secret_key: "10a718b3f285d89c36e9864494cdd1507f3bc85b342df24736ea81f9a1134bcc" + blockstore: + type: s3 + s3: + force_path_style: true + endpoint: http://minio:9000 + discover_bucket_region: false + credentials: + access_key_id: $(MINIO_ROOT_USER) + secret_access_key: $(MINIO_ROOT_PASSWORD) + installation: + user_name: admin + access_key_id: something + secret_access_key: simple + EOF + env: + - name: MINIO_ROOT_USER + valueFrom: + secretKeyRef: + name: minio-root-user + key: MINIO_ROOT_USER + - name: MINIO_ROOT_PASSWORD + valueFrom: + secretKeyRef: + name: minio-root-user + key: MINIO_ROOT_PASSWORD + command: + - /bin/bash + image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest + imagePullPolicy: IfNotPresent + name: create-ds-connections + restartPolicy: Never + serviceAccountName: demo-setup +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: create-lakefs-repo +spec: + template: + spec: + containers: + - name: create-repo + image: curlimages/curl:latest + command: + - /bin/sh + - -c + - | + # Wait for lakeFS to be ready + echo "Waiting for lakeFS service to be ready..." + while ! curl -s http://my-lakefs.lakefs.svc.cluster.local:80/_health; do + echo -n "." + sleep 5 + done + echo "lakeFS is ready!" + echo + echo + # Create repository using lakeFS API + echo "Creating quickstart repository..." 
+ curl -u "something:simple" \ + -X POST \ + -H "Content-Type: application/json" \ + -d '{"name": "quickstart", "storage_namespace": "s3://quickstart/", "default_branch": "main", "sample_data": true}' \ + http://my-lakefs.lakefs.svc.cluster.local/api/v1/repositories || true + echo "quickstart repository created!" + echo + echo + echo "Creating my-storage repository..." + curl -u "something:simple" \ + -X POST \ + -H "Content-Type: application/json" \ + -d '{"name": "my-storage", "storage_namespace": "s3://my-storage/", "default_branch": "main"}' \ + http://my-lakefs.lakefs.svc.cluster.local/api/v1/repositories || true + echo "my-storage repository created!" + restartPolicy: Never + serviceAccountName: demo-setup +--- +kind: Deployment +apiVersion: apps/v1 +metadata: + annotations: + deployment.kubernetes.io/revision: '2' + meta.helm.sh/release-name: my-lakefs + meta.helm.sh/release-namespace: lakefs + name: my-lakefs + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: lakefs + app.kubernetes.io/version: 1.38.0 + helm.sh/chart: lakefs-1.3.14 +spec: + replicas: 1 + selector: + matchLabels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/name: lakefs + template: + metadata: + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/name: lakefs + annotations: + checksum/config: 2dde95d5a2b50bddc89371d1692db1005db9407701085531ea77ce14b56c6ec1 + spec: + restartPolicy: Always + serviceAccountName: default + schedulerName: default-scheduler + terminationGracePeriodSeconds: 30 + securityContext: {} + containers: + - resources: {} + readinessProbe: + httpGet: + path: /_health + port: http + scheme: HTTP + timeoutSeconds: 1 + periodSeconds: 10 + successThreshold: 1 + failureThreshold: 3 + terminationMessagePath: /dev/termination-log + name: lakefs + livenessProbe: + httpGet: + path: /_health + port: http + scheme: HTTP + timeoutSeconds: 1 + periodSeconds: 10 + successThreshold: 1 + failureThreshold: 3 + env: + - name: LAKEFS_AUTH_ENCRYPT_SECRET_KEY + value: asdjfhjaskdhuioaweyuiorasdsjbaskcbkj + ports: + - name: http + containerPort: 8000 + protocol: TCP + imagePullPolicy: IfNotPresent + volumeMounts: + - name: config-volume + mountPath: /etc/lakefs + - name: lakefs-volume + mountPath: /lakefs + terminationMessagePolicy: File + image: 'treeverse/lakefs:1.38.0' + args: + - run + - '--config' + - /etc/lakefs/config.yaml + volumes: + - name: config-volume + configMap: + name: my-lakefs + items: + - key: config.yaml + path: config.yaml + defaultMode: 420 + - name: lakefs-volume + emptyDir: + sizeLimit: 100Mi + dnsPolicy: ClusterFirst + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% + maxSurge: 25% + revisionHistoryLimit: 10 + progressDeadlineSeconds: 600 +--- +kind: Service +apiVersion: v1 +metadata: + name: my-lakefs + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: lakefs + app.kubernetes.io/version: 1.38.0 + helm.sh/chart: lakefs-1.3.14 + annotations: + meta.helm.sh/release-name: my-lakefs + meta.helm.sh/release-namespace: lakefs +spec: + ipFamilies: + - IPv4 + ports: + - name: http + protocol: TCP + port: 80 + targetPort: http + internalTrafficPolicy: Cluster + type: ClusterIP + ipFamilyPolicy: SingleStack + sessionAffinity: None + selector: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/name: lakefs +--- +kind: Route +apiVersion: 
route.openshift.io/v1 +metadata: + name: lakefs-route + labels: + app: lakefs + app.kubernetes.io/instance: my-lakefs + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: lakefs + app.kubernetes.io/version: 1.38.0 + helm.sh/chart: lakefs-1.3.14 + annotations: + openshift.io/host.generated: 'true' +spec: + to: + kind: Service + name: my-lakefs + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None \ No newline at end of file diff --git a/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/minio-direct.yaml b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/minio-direct.yaml new file mode 100644 index 00000000..01010eaf --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/minio-direct.yaml @@ -0,0 +1,368 @@ +--- +apiVersion: v1 +kind: Service +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio +spec: + ports: + - name: api + port: 9000 + targetPort: api + - name: console + port: 9090 + targetPort: 9090 + selector: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + sessionAffinity: None + type: ClusterIP +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio +spec: + replicas: 1 + selector: + matchLabels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + strategy: + type: Recreate + template: + metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + spec: + containers: + - args: + - minio server /data --console-address :9090 + command: + - /bin/bash + - -c + envFrom: + - secretRef: + name: minio-root-user + image: quay.io/minio/minio:latest + name: minio + ports: + - containerPort: 9000 + name: api + protocol: TCP + - containerPort: 9090 + name: console + protocol: TCP + resources: + limits: + cpu: "2" + memory: 2Gi + requests: + cpu: 200m + memory: 1Gi + volumeMounts: + - mountPath: /data + name: minio + volumes: + - name: minio + persistentVolumeClaim: + claimName: minio + - emptyDir: {} + name: empty +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: create-ds-connections +spec: + selector: {} + template: + spec: + containers: + - args: + - -ec + - |- + echo -n 'Waiting for minio route' + while ! oc get route minio-s3 2>/dev/null | grep -qF minio-s3; do + echo -n . + sleep 5 + done; echo + + echo -n 'Waiting for minio root user secret' + while ! oc get secret minio-root-user 2>/dev/null | grep -qF minio-root-user; do + echo -n . 
+ sleep 5 + done; echo + + MINIO_ROOT_USER=$(oc get secret minio-root-user -o template --template '{{.data.MINIO_ROOT_USER}}') + MINIO_ROOT_PASSWORD=$(oc get secret minio-root-user -o template --template '{{.data.MINIO_ROOT_PASSWORD}}') + MINIO_HOST=https://$(oc get route minio-s3 -o template --template '{{.spec.host}}') + + cat << EOF | oc apply -f- + apiVersion: v1 + kind: Secret + metadata: + annotations: + opendatahub.io/connection-type: s3 + openshift.io/display-name: My Storage + labels: + opendatahub.io/dashboard: "true" + opendatahub.io/managed: "true" + name: aws-connection-my-storage + data: + AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER} + AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD} + stringData: + AWS_DEFAULT_REGION: us-east-1 + AWS_S3_BUCKET: my-storage + AWS_S3_ENDPOINT: ${MINIO_HOST} + type: Opaque + EOF + cat << EOF | oc apply -f- + apiVersion: v1 + kind: Secret + metadata: + annotations: + opendatahub.io/connection-type: s3 + openshift.io/display-name: Pipeline Artifacts + labels: + opendatahub.io/dashboard: "true" + opendatahub.io/managed: "true" + name: aws-connection-pipeline-artifacts + data: + AWS_ACCESS_KEY_ID: ${MINIO_ROOT_USER} + AWS_SECRET_ACCESS_KEY: ${MINIO_ROOT_PASSWORD} + stringData: + AWS_DEFAULT_REGION: us-east-1 + AWS_S3_BUCKET: pipeline-artifacts + AWS_S3_ENDPOINT: ${MINIO_HOST} + type: Opaque + EOF + command: + - /bin/bash + image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest + imagePullPolicy: IfNotPresent + name: create-ds-connections + restartPolicy: Never + serviceAccount: demo-setup + serviceAccountName: demo-setup +--- +apiVersion: batch/v1 +kind: Job +metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: create-minio-buckets +spec: + selector: {} + template: + metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + spec: + containers: + - args: + - -ec + - |- + oc get secret minio-root-user + env | grep MINIO + cat << 'EOF' | python3 + import boto3, os + + s3 = boto3.client("s3", + endpoint_url="http://minio:9000", + aws_access_key_id=os.getenv("MINIO_ROOT_USER"), + aws_secret_access_key=os.getenv("MINIO_ROOT_PASSWORD")) + bucket = 'pipeline-artifacts' + print('creating pipeline-artifacts bucket') + if bucket not in [bu["Name"] for bu in s3.list_buckets()["Buckets"]]: + s3.create_bucket(Bucket=bucket) + bucket = 'my-storage' + print('creating my-storage bucket') + if bucket not in [bu["Name"] for bu in s3.list_buckets()["Buckets"]]: + s3.create_bucket(Bucket=bucket) + EOF + command: + - /bin/bash + envFrom: + - secretRef: + name: minio-root-user + image: image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-generic-data-science-notebook:2024.1 + imagePullPolicy: IfNotPresent + name: create-buckets + initContainers: + - args: + - -ec + - |- + echo -n 'Waiting for minio root user secret' + while ! oc get secret minio-root-user 2>/dev/null | grep -qF minio-root-user; do + echo -n . + sleep 5 + done; echo + + echo -n 'Waiting for minio deployment' + while ! oc get deployment minio 2>/dev/null | grep -qF minio; do + echo -n . 
+ sleep 5 + done; echo + oc wait --for=condition=available --timeout=60s deployment/minio + sleep 10 + command: + - /bin/bash + image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest + imagePullPolicy: IfNotPresent + name: wait-for-minio + restartPolicy: Never + serviceAccount: demo-setup + serviceAccountName: demo-setup +--- +apiVersion: batch/v1 +kind: Job +metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: create-minio-root-user +spec: + backoffLimit: 4 + template: + metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + spec: + containers: + - args: + - -ec + - |- + if [ -n "$(oc get secret minio-root-user -oname 2>/dev/null)" ]; then + echo "Secret already exists. Skipping." >&2 + exit 0 + fi + genpass() { + < /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c"${1:-32}" + } + id=$(genpass 16) + secret=$(genpass) + cat << EOF | oc apply -f- + apiVersion: v1 + kind: Secret + metadata: + name: minio-root-user + type: Opaque + stringData: + MINIO_ROOT_USER: ${id} + MINIO_ROOT_PASSWORD: ${secret} + EOF + command: + - /bin/bash + image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest + imagePullPolicy: IfNotPresent + name: create-minio-root-user + restartPolicy: Never + serviceAccount: demo-setup + serviceAccountName: demo-setup +--- +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio-console +spec: + port: + targetPort: console + tls: + insecureEdgeTerminationPolicy: Redirect + termination: edge + to: + kind: Service + name: minio + weight: 100 + wildcardPolicy: None +--- +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio-s3 +spec: + port: + targetPort: api + tls: + insecureEdgeTerminationPolicy: Redirect + termination: edge + to: + kind: Service + name: minio + weight: 100 + wildcardPolicy: None \ No newline at end of file diff --git a/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/minio-via-lakefs.yaml b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/minio-via-lakefs.yaml new file mode 100644 index 00000000..e646536e --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/cluster-configuration/minio-via-lakefs.yaml @@ -0,0 +1,310 @@ +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: demo-setup +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: demo-setup-edit +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: edit +subjects: +- kind: ServiceAccount + name: demo-setup +--- +apiVersion: v1 +kind: Service +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio +spec: + ports: + - name: api + port: 9000 + targetPort: api + - name: console + port: 9090 + targetPort: 9090 + selector: + app: minio + app.kubernetes.io/component: minio + 
app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + sessionAffinity: None + type: ClusterIP +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio +spec: + replicas: 1 + selector: + matchLabels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + strategy: + type: Recreate + template: + metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + spec: + containers: + - args: + - minio server /data --console-address :9090 + command: + - /bin/bash + - -c + envFrom: + - secretRef: + name: minio-root-user + image: quay.io/minio/minio:latest + name: minio + ports: + - containerPort: 9000 + name: api + protocol: TCP + - containerPort: 9090 + name: console + protocol: TCP + resources: + limits: + cpu: "2" + memory: 2Gi + requests: + cpu: 200m + memory: 1Gi + volumeMounts: + - mountPath: /data + name: minio + volumes: + - name: minio + persistentVolumeClaim: + claimName: minio + - emptyDir: {} + name: empty +--- +apiVersion: batch/v1 +kind: Job +metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: create-minio-buckets +spec: + selector: {} + template: + metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + spec: + containers: + - args: + - -ec + - |- + oc get secret minio-root-user + env | grep MINIO + cat << 'EOF' | python3 + import boto3, os + + s3 = boto3.client("s3", + endpoint_url="http://minio:9000", + aws_access_key_id=os.getenv("MINIO_ROOT_USER"), + aws_secret_access_key=os.getenv("MINIO_ROOT_PASSWORD")) + bucket = 'pipeline-artifacts' + print('creating pipeline-artifacts bucket') + if bucket not in [bu["Name"] for bu in s3.list_buckets()["Buckets"]]: + s3.create_bucket(Bucket=bucket) + bucket = 'my-storage' + print('creating my-storage bucket') + if bucket not in [bu["Name"] for bu in s3.list_buckets()["Buckets"]]: + s3.create_bucket(Bucket=bucket) + bucket = 'quickstart' + print('creating quickstart bucket') + if bucket not in [bu["Name"] for bu in s3.list_buckets()["Buckets"]]: + s3.create_bucket(Bucket=bucket) + EOF + command: + - /bin/bash + envFrom: + - secretRef: + name: minio-root-user + image: image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-generic-data-science-notebook:2024.1 + imagePullPolicy: IfNotPresent + name: create-buckets + initContainers: + - args: + - -ec + - |- + echo -n 'Waiting for minio root user secret' + while ! oc get secret minio-root-user 2>/dev/null | grep -qF minio-root-user; do + echo -n . 
+ sleep 5 + done; echo + + echo -n 'Waiting for minio deployment' + while ! oc get deployment minio 2>/dev/null | grep -qF minio; do + echo -n . + sleep 5 + done; echo + oc wait --for=condition=available --timeout=60s deployment/minio + sleep 10 + command: + - /bin/bash + image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest + imagePullPolicy: IfNotPresent + name: wait-for-minio + restartPolicy: Never + serviceAccountName: demo-setup +--- +apiVersion: batch/v1 +kind: Job +metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: create-minio-root-user +spec: + backoffLimit: 4 + template: + metadata: + labels: + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + spec: + containers: + - args: + - -ec + - |- + if [ -n "$(oc get secret minio-root-user -oname 2>/dev/null)" ]; then + echo "Secret already exists. Skipping." >&2 + exit 0 + fi + genpass() { + < /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c"${1:-32}" + } + id=$(genpass 16) + secret=$(genpass) + cat << EOF | oc apply -f- + apiVersion: v1 + kind: Secret + metadata: + name: minio-root-user + type: Opaque + stringData: + MINIO_ROOT_USER: ${id} + MINIO_ROOT_PASSWORD: ${secret} + EOF + command: + - /bin/bash + image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest + imagePullPolicy: IfNotPresent + name: create-minio-root-user + restartPolicy: Never + serviceAccountName: demo-setup +--- +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio-console +spec: + port: + targetPort: console + tls: + insecureEdgeTerminationPolicy: Redirect + termination: edge + to: + kind: Service + name: minio + weight: 100 + wildcardPolicy: None +--- +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + labels: + app: minio + app.kubernetes.io/component: minio + app.kubernetes.io/instance: minio + app.kubernetes.io/name: minio + app.kubernetes.io/part-of: minio + component: minio + name: minio-s3 +spec: + port: + targetPort: api + tls: + insecureEdgeTerminationPolicy: Redirect + termination: edge + to: + kind: Service + name: minio + weight: 100 + wildcardPolicy: None \ No newline at end of file diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/.gitkeep b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/1_experiment_train_lakefs.ipynb b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/1_experiment_train_lakefs.ipynb new file mode 100644 index 00000000..574a4f60 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/1_experiment_train_lakefs.ipynb @@ -0,0 +1,573 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "# Experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Install Python dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:05.830869Z", + "start_time": "2024-08-19T15:45:04.819700Z" + }, + 
"is_executing": true, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install onnx onnxruntime tf2onnx lakefs==0.7.1 s3fs==2024.10.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Import the dependencies for the model training code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:08.983925Z", + "start_time": "2024-08-19T15:45:05.835311Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import datetime\n", + "from keras.models import Sequential\n", + "from keras.layers import Dense, Dropout, BatchNormalization, Activation\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.utils import class_weight\n", + "import tf2onnx\n", + "import onnx\n", + "import pickle\n", + "from pathlib import Path\n", + "import lakefs\n", + "import os\n", + "import s3fs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output might show TensorFlow messages, such as a \"Could not find TensorRT\" warning. You can ignore these messages.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define lakeFS Storage and Repository information" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "lakefs_storage_options={\n", + " \"key\": os.environ.get('LAKECTL_CREDENTIALS_ACCESS_KEY_ID'),\n", + " \"secret\": os.environ.get('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY'),\n", + " \"client_kwargs\": {\n", + " \"endpoint_url\": os.environ.get('LAKECTL_SERVER_ENDPOINT_URL')\n", + " }\n", + "}\n", + "\n", + "repo_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "mainBranch = \"main\"\n", + "trainingBranch = \"train01\"\n", + "\n", + "repo = lakefs.Repository(repo_name)\n", + "print(repo)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create Training branch in lakeFS and load the CSV data to the training branch in lakeFS" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "branchTraining = repo.branch(trainingBranch).create(source_reference=mainBranch, exist_ok=True)\n", + "\n", + "obj = branchTraining.object(path='data/train.csv')\n", + "with open('data/train.csv', mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'}) as writer:\n", + " writer.write(reader.read())\n", + "\n", + "obj = branchTraining.object(path='data/validate.csv')\n", + "with open('data/validate.csv', mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'}) as writer:\n", + " writer.write(reader.read())\n", + "\n", + "obj = branchTraining.object(path='data/test.csv')\n", + "with open('data/test.csv', mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'}) as writer:\n", + " writer.write(reader.read())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the CSV data from the training branch in lakeFS\n", + "\n", + "The CSV data that you use to train the model contains the following fields:\n", + "\n", + "* **distancefromhome** - The distance from home where the transaction happened.\n", + "* **distancefromlast_transaction** - The distance 
from the last transaction that happened.\n", + "* **ratiotomedianpurchaseprice** - The ratio of purchased price compared to median purchase price.\n", + "* **repeat_retailer** - If it's from a retailer that already has been purchased from before.\n", + "* **used_chip** - If the credit card chip was used.\n", + "* **usedpinnumber** - If the PIN number was used.\n", + "* **online_order** - If it was an online order.\n", + "* **fraud** - If the transaction is fraudulent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:09.394745Z", + "start_time": "2024-08-19T15:45:09.051361Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# Set the input (X) and output (Y) data. \n", + "# The only output data is whether it's fraudulent. All other fields are inputs to the model.\n", + "\n", + "feature_indexes = [\n", + " 1, # distance_from_last_transaction\n", + " 2, # ratio_to_median_purchase_price\n", + " 4, # used_chip\n", + " 5, # used_pin_number\n", + " 6, # online_order\n", + "]\n", + "\n", + "label_indexes = [\n", + " 7 # fraud\n", + "]\n", + "\n", + "df = pd.read_csv(f\"s3://{repo_name}/{trainingBranch}/data/train.csv\", storage_options=lakefs_storage_options)\n", + "X_train = df.iloc[:, feature_indexes].values\n", + "y_train = df.iloc[:, label_indexes].values\n", + "\n", + "df = pd.read_csv(f\"s3://{repo_name}/{trainingBranch}/data/validate.csv\", storage_options=lakefs_storage_options)\n", + "X_val = df.iloc[:, feature_indexes].values\n", + "y_val = df.iloc[:, label_indexes].values\n", + "\n", + "df = pd.read_csv(f\"s3://{repo_name}/{trainingBranch}/data/test.csv\", storage_options=lakefs_storage_options)\n", + "X_test = df.iloc[:, feature_indexes].values\n", + "y_test = df.iloc[:, label_indexes].values\n", + "\n", + "\n", + "# Scale the data to remove mean and have unit variance. The data will be between -1 and 1, which makes it a lot easier for the model to learn than random (and potentially large) values.\n", + "# It is important to only fit the scaler to the training data, otherwise you are leaking information about the global distribution of variables (which is influenced by the test set) into the training set.\n", + "\n", + "scaler = StandardScaler()\n", + "\n", + "X_train = scaler.fit_transform(X_train)\n", + "X_val = scaler.transform(X_val)\n", + "X_test = scaler.transform(X_test)\n", + "\n", + "obj = branchTraining.object(path='artifact/test_data.pkl')\n", + "with obj.writer(\"wb\") as handle:\n", + " pickle.dump((X_test, y_test), handle)\n", + "obj = branchTraining.object(path='artifact/scaler.pkl')\n", + "with obj.writer(\"wb\") as handle:\n", + " pickle.dump(scaler, handle)\n", + "\n", + "# Since the dataset is unbalanced (it has many more non-fraud transactions than fraudulent ones), set a class weight to weight the few fraudulent transactions higher than the many non-fraud transactions.\n", + "class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train.ravel())\n", + "class_weights = {i : class_weights[i] for i in range(len(class_weights))}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build the model\n", + "\n", + "The model is a simple, fully-connected, deep neural network, containing three hidden layers and one output layer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:09.489856Z", + "start_time": "2024-08-19T15:45:09.419813Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "model = Sequential()\n", + "model.add(Dense(32, activation='relu', input_dim=len(feature_indexes)))\n", + "model.add(Dropout(0.2))\n", + "model.add(Dense(32))\n", + "model.add(BatchNormalization())\n", + "model.add(Activation('relu'))\n", + "model.add(Dropout(0.2))\n", + "model.add(Dense(32))\n", + "model.add(BatchNormalization())\n", + "model.add(Activation('relu'))\n", + "model.add(Dropout(0.2))\n", + "model.add(Dense(1, activation='sigmoid'))\n", + "\n", + "model.compile(\n", + " optimizer='adam',\n", + " loss='binary_crossentropy',\n", + " metrics=['accuracy']\n", + ")\n", + "\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train the model\n", + "\n", + "Training a model is often the most time-consuming part of the machine learning process. Large models can take multiple GPUs for days. Expect the training on CPU for this very simple model to take a minute or more." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:29.664796Z", + "start_time": "2024-08-19T15:45:09.496686Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# Train the model and get performance\n", + "import os\n", + "import time\n", + "\n", + "start = time.time()\n", + "epochs = 2\n", + "history = model.fit(\n", + " X_train,\n", + " y_train,\n", + " epochs=epochs,\n", + " validation_data=(X_val, y_val),\n", + " verbose=True,\n", + " class_weight=class_weights\n", + ")\n", + "end = time.time()\n", + "print(f\"Training of model is complete. Took {end-start} seconds\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Save the model file" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:29.845680Z", + "start_time": "2024-08-19T15:45:29.674230Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "\n", + "# Normally we use tf2.onnx.convert.from_keras.\n", + "# workaround for tf2onnx bug https://github.com/onnx/tensorflow-onnx/issues/2348\n", + "\n", + "# Wrap the model in a `tf.function`\n", + "@tf.function(input_signature=[tf.TensorSpec([None, X_train.shape[1]], tf.float32, name='dense_input')])\n", + "def model_fn(x):\n", + " return model(x)\n", + "\n", + "# Convert the Keras model to ONNX\n", + "model_proto, _ = tf2onnx.convert.from_function(\n", + " model_fn,\n", + " input_signature=[tf.TensorSpec([None, X_train.shape[1]], tf.float32, name='dense_input')]\n", + ")\n", + "\n", + "# Save the model as ONNX for easy use of ModelMesh\n", + "os.makedirs(\"models/fraud/1\", exist_ok=True)\n", + "onnx.save(model_proto, \"models/fraud/1/model.onnx\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output might include TensorFlow messages related to GPUs. You can ignore these messages." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Confirm the model file was created successfully\n", + "\n", + "The output should include the model name, size, and date. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:30.012353Z", + "start_time": "2024-08-19T15:45:29.856416Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "! ls -alRh ./models/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test the model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:30.047040Z", + "start_time": "2024-08-19T15:45:30.029773Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", + "import numpy as np\n", + "import pickle\n", + "import onnxruntime as rt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load the test data and scaler from the training branch in lakeFS:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:30.062713Z", + "start_time": "2024-08-19T15:45:30.058023Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "obj = branchTraining.object(path='artifact/scaler.pkl')\n", + "with obj.reader('rb') as handle:\n", + " scaler = pickle.load(handle)\n", + "obj = branchTraining.object(path='artifact/test_data.pkl')\n", + "with obj.reader('rb') as handle:\n", + " (X_test, y_test) = pickle.load(handle)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create an ONNX inference runtime session and predict values for all test inputs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:30.210272Z", + "start_time": "2024-08-19T15:45:30.073900Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "sess = rt.InferenceSession(\"models/fraud/1/model.onnx\", providers=rt.get_available_providers())\n", + "input_name = sess.get_inputs()[0].name\n", + "output_name = sess.get_outputs()[0].name\n", + "y_pred_temp = sess.run([output_name], {input_name: X_test.astype(np.float32)}) \n", + "y_pred_temp = np.asarray(np.squeeze(y_pred_temp[0]))\n", + "threshold = 0.95\n", + "y_pred = np.where(y_pred_temp > threshold, 1, 0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Show the results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:30.644142Z", + "start_time": "2024-08-19T15:45:30.221686Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.metrics import precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay\n", + "import numpy as np\n", + "\n", + "y_test_arr = y_test.squeeze()\n", + "correct = np.equal(y_pred, y_test_arr).sum().item()\n", + "acc = (correct / len(y_pred)) * 100\n", + "precision = precision_score(y_test_arr, np.round(y_pred))\n", + "recall = recall_score(y_test_arr, np.round(y_pred))\n", + "\n", + "print(f\"Eval Metrics: \\n Accuracy: {acc:>0.1f}%, \"\n", + " f\"Precision: {precision:.4f}, Recall: {recall:.4f} \\n\")\n", + "\n", + "c_matrix = confusion_matrix(y_test_arr, y_pred)\n", + "ConfusionMatrixDisplay(c_matrix).plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example: Is Sally's transaction likely to be fraudulent?\n", + "\n", + "Here is the order of the fields from Sally's transaction details:\n", + "* distance_from_last_transaction\n", + "* ratio_to_median_price\n", + "* used_chip \n", + 
"* used_pin_number\n", + "* online_order " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-19T15:45:30.679688Z", + "start_time": "2024-08-19T15:45:30.669086Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "sally_transaction_details = [\n", + " [0.3111400080477545,\n", + " 1.9459399775518593,\n", + " 1.0,\n", + " 0.0,\n", + " 0.0]\n", + " ]\n", + "\n", + "prediction = sess.run([output_name], {input_name: scaler.transform(sally_transaction_details).astype(np.float32)})\n", + "\n", + "print(\"Is Sally's transaction predicted to be fraudulent? (true = YES, false = NO) \")\n", + "print(np.squeeze(prediction) > threshold)\n", + "\n", + "print(\"How likely was Sally's transaction to be fraudulent? \")\n", + "print(\"{:.5f}\".format(100 * np.squeeze(prediction)) + \"%\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + }, + "vscode": { + "interpreter": { + "hash": "63462a1f26ab486248b2a0fd058a0d9f9a6566a80083a3e1eb8f35617f2381b2" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/2_save_model_lakefs.ipynb b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/2_save_model_lakefs.ipynb new file mode 100644 index 00000000..66312835 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/2_save_model_lakefs.ipynb @@ -0,0 +1,246 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Save the Model\n", + "\n", + "To save this model so that you can use it from various locations, including other notebooks or the model server, upload it to s3-compatible storage." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Install the required packages and define a function for the upload" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install boto3 botocore lakefs==0.7.1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define lakeFS Repository" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import lakefs\n", + "repo_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "\n", + "mainBranch = \"main\"\n", + "trainingBranch = \"train01\"\n", + "\n", + "repo = lakefs.Repository(repo_name)\n", + "print(repo)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import boto3\n", + "import botocore\n", + "\n", + "aws_access_key_id = os.environ.get('LAKECTL_CREDENTIALS_ACCESS_KEY_ID')\n", + "aws_secret_access_key = os.environ.get('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY')\n", + "endpoint_url = os.environ.get('LAKECTL_SERVER_ENDPOINT_URL')\n", + "region_name = os.environ.get('LAKEFS_DEFAULT_REGION')\n", + "bucket_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "\n", + "if not all([aws_access_key_id, aws_secret_access_key, endpoint_url, region_name, bucket_name]):\n", + " raise ValueError(\"One or data connection variables are empty. \"\n", + " \"Please check your data connection to an S3 bucket.\")\n", + "\n", + "session = boto3.session.Session(aws_access_key_id=aws_access_key_id,\n", + " aws_secret_access_key=aws_secret_access_key)\n", + "\n", + "s3_resource = session.resource(\n", + " 's3',\n", + " config=botocore.client.Config(signature_version='s3v4'),\n", + " endpoint_url=endpoint_url,\n", + " region_name=region_name)\n", + "\n", + "bucket = s3_resource.Bucket(bucket_name)\n", + "\n", + "\n", + "def upload_directory_to_s3(local_directory, s3_prefix):\n", + " num_files = 0\n", + " for root, dirs, files in os.walk(local_directory):\n", + " for filename in files:\n", + " file_path = os.path.join(root, filename)\n", + " relative_path = os.path.relpath(file_path, local_directory)\n", + " s3_key = os.path.join(s3_prefix, relative_path)\n", + " print(f\"{file_path} -> {s3_key}\")\n", + " bucket.upload_file(file_path, s3_key)\n", + " num_files += 1\n", + " return num_files\n", + "\n", + "\n", + "def list_objects(prefix):\n", + " filter = bucket.objects.filter(Prefix=prefix)\n", + " for obj in filter.all():\n", + " print(obj.key)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Verify the upload\n", + "\n", + "In your S3 bucket, under the `models` upload prefix, run the `list_object` command. As best practice, to avoid mixing up model files, keep only one model and its required files in a given prefix or directory. This practice allows you to download and serve a directory with all the files that a model requires. 
\n", + "\n", + "If this is the first time running the code, this cell will have no output.\n", + "\n", + "If you've already uploaded your model, you should see this output: `models/fraud/1/model.onnx`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "list_objects(f\"{trainingBranch}/models\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Upload the model to the training branch in lakeFS and check again" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the function to upload the `models` folder in a rescursive fashion:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "local_models_directory = \"models\"\n", + "\n", + "if not os.path.isdir(local_models_directory):\n", + " raise ValueError(f\"The directory '{local_models_directory}' does not exist. \"\n", + " \"Did you finish training the model in the previous notebook?\")\n", + "\n", + "num_files = upload_directory_to_s3(\"models\", f\"{trainingBranch}/models\")\n", + "\n", + "if num_files == 0:\n", + " raise ValueError(\"No files uploaded. Did you finish training and \"\n", + " \"saving the model to the \\\"models\\\" directory? \"\n", + " \"Check for \\\"models/fraud/1/model.onnx\\\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "To confirm this worked, run the `list_objects` function again:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "list_objects(f\"{trainingBranch}/models\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Commit changes in lakeFS repository" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "branchTraining = repo.branch(trainingBranch)\n", + "ref = branchTraining.commit(message='Uploaded data, artifacts and model')\n", + "print(ref.get_commit())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Next Step\n", + "\n", + "Now that you've saved the model to s3 storage, you can refer to the model by using the same data connection to serve the model as an API.\n" + ] + } + ], + "metadata": { + "celltoolbar": "Raw Cell Format", + "kernelspec": { + "display_name": "Python 3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/3_rest_requests_multi_model_lakefs.ipynb b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/3_rest_requests_multi_model_lakefs.ipynb new file mode 100644 index 00000000..d92aacc0 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/3_rest_requests_multi_model_lakefs.ipynb @@ -0,0 +1,244 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f73046ff", + "metadata": {}, + "source": [ + "# REST Inference" + ] + }, + { + "cell_type": "markdown", + "id": "2e7eb303-2ded-41b5-91db-8b8a8860d2ac", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Verify that following variable 
settings match your deployed model's resource name and rest URL. The following code assumes that the kube service is in the same namespace, but you could refer to it in full with the namespace, for example: `http://modelmesh-serving.project-name.svc.cluster.local:8008/v2/models/fraud/infer`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27c7c3a9-e5a6-42f9-b9e6-9bce1782c266", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "deployed_model_name = \"fraud\"\n", + "rest_url = \"http://modelmesh-serving.lakefs:8008\"\n", + "infer_url = f\"{rest_url}/v2/models/{deployed_model_name}/infer\"" + ] + }, + { + "cell_type": "markdown", + "id": "c5835f05-7fe2-4d51-bed5-3aa9f2b380fe", + "metadata": {}, + "source": [ + "## Request Function\n", + "\n", + "Build and submit the REST request. \n", + "\n", + "Note: You submit the data in the same format that you used for an ONNX inference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67c1d001-ff99-414a-95d4-5729d5849298", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "\n", + "def rest_request(data):\n", + " json_data = {\n", + " \"inputs\": [\n", + " {\n", + " \"name\": \"dense_input\",\n", + " \"shape\": [1, 5],\n", + " \"datatype\": \"FP32\",\n", + " \"data\": data\n", + " }\n", + " ]\n", + " }\n", + "\n", + " response = requests.post(infer_url, json=json_data)\n", + " response_dict = response.json()\n", + " return response_dict['outputs'][0]['data']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "021e98ad-867c-4477-a717-72403c2f0610", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Define lakeFS Repository\n", + "import os\n", + "import lakefs\n", + "\n", + "repo_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "\n", + "mainBranch = \"main\"\n", + "trainingBranch = \"train01\"\n", + "\n", + "repo = lakefs.Repository(repo_name)\n", + "branchTraining = repo.branch(trainingBranch)\n", + "print(repo)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cdbe0b1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#Load the scaler from the training branch in lakeFS\n", + "import pickle\n", + "obj = branchTraining.object(path='artifact/scaler.pkl')\n", + "with obj.reader(\"rb\") as handle:\n", + " scaler = pickle.load(handle)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0a68b67-b109-4a2f-b097-092f4a4d25ce", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [0.3111400080477545, 1.9459399775518593, 1.0, 0.0, 0.0]\n", + "prediction = rest_request(scaler.transform([data]).tolist()[0])\n", + "prediction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e54617f-0c9e-4220-b66c-93885d847050", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('fraud')\n", + "else:\n", + " print('not fraud')" + ] + }, + { + "cell_type": "markdown", + "id": "5697c2ff", + "metadata": {}, + "source": [ + "## Example 1: user buys a coffee\n", + "\n", + "In this example, the user is buying a coffee. 
The parameters given to the model are:\n", + "* same location as the last transaction (distance=0)\n", + "* same median price as the last transaction (ratio_to_median=1)\n", + "* using a pin number (pin=1)\n", + "* using the credit card chip (chip=1)\n", + "* not an online transaction (online=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0393a5a7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [0.0, 1.0, 1.0, 1.0, 0.0]\n", + "prediction = rest_request(scaler.transform([data]).tolist()[0])\n", + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('The model predicts that this is fraud')\n", + "else:\n", + " print('The model predicts that this is not fraud')" + ] + }, + { + "cell_type": "markdown", + "id": "e889cdd6", + "metadata": {}, + "source": [ + "## Example 2: fraudulent transaction\n", + "\n", + "In this example, someone stole the user's credit card and is buying something online. The parameters given to the model are:\n", + "* very far away from the last transaction (distance=100)\n", + "* median price similar to the last transaction (ratio_to_median=1.2)\n", + "* not using a pin number (pin=0)\n", + "* not using the credit card chip (chip=0)\n", + "* is an online transaction (online=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5deba1d5", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [100, 1.2, 0.0, 0.0, 1.0]\n", + "prediction = rest_request(scaler.transform([data]).tolist()[0])\n", + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('The model predicts that this is fraud')\n", + "else:\n", + " print('The model predicts that this is not fraud')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1eb1daf-153b-449c-9bce-22516e45119f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/4_grpc_requests_multi_model_lakefs.ipynb b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/4_grpc_requests_multi_model_lakefs.ipynb new file mode 100644 index 00000000..cb3987dc --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/4_grpc_requests_multi_model_lakefs.ipynb @@ -0,0 +1,303 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f73046ff", + "metadata": {}, + "source": [ + "# GRPC Inference" + ] + }, + { + "cell_type": "markdown", + "id": "443e7e73-24cb-4f03-9491-a6edcc24f0cc", + "metadata": { + "tags": [] + }, + "source": [ + "## Setup\n", + "\n", + "Verify that following variable settings match your deployed model's resource name and grpc URL. 
The following code assumes that the Kubernetes service is in the same namespace, but you could refer to it in full with the namespace, for example: `modelmesh-serving.project-name.svc.cluster.local:8033`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "db9df000-a171-4652-8160-272f81e49612",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install grpcio grpcio-tools"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cdc5ab35-2b2e-480b-b322-cfc6549cbd2f",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "grpc_host = 'modelmesh-serving.lakefs'\n",
+ "grpc_port = 8033\n",
+ "model_name = 'fraud'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4269da9e-5683-4531-9a3f-a1cdad42e3af",
+ "metadata": {},
+ "source": [
+ "## Inspect the gRPC Endpoint\n",
+ "\n",
+ "Check the gRPC endpoint's model metadata:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "545aa5f4-356f-4e70-b7e6-cd352a68927a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "sys.path.append('./utils')\n",
+ "\n",
+ "# grpc_predict_v2_pb2 and grpc_predict_v2_pb2_grpc were created from grpc_predict_v2.proto using protoc\n",
+ "import grpc\n",
+ "import utils.grpc_predict_v2_pb2 as grpc_predict_v2_pb2\n",
+ "import utils.grpc_predict_v2_pb2_grpc as grpc_predict_v2_pb2_grpc\n",
+ "\n",
+ "\n",
+ "channel = grpc.insecure_channel(f\"{grpc_host}:{grpc_port}\")\n",
+ "stub = grpc_predict_v2_pb2_grpc.GRPCInferenceServiceStub(channel)\n",
+ "\n",
+ "request = grpc_predict_v2_pb2.ModelMetadataRequest(name=model_name)\n",
+ "response = stub.ModelMetadata(request)\n",
+ "response"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fd5affbf-36c3-4e17-9788-5fc0904de143",
+ "metadata": {},
+ "source": [
+ "### Request Function\n",
+ "\n",
+ "Build and submit the gRPC request. 
\n", + "\n", + "Note: You submit the data in the same format that you used for an ONNX inference.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67c1d001-ff99-414a-95d4-5729d5849298", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "def grpc_request(data):\n", + " # request content building\n", + " inputs = []\n", + " inputs.append(grpc_predict_v2_pb2.ModelInferRequest().InferInputTensor())\n", + " inputs[0].name = \"dense_input\"\n", + " inputs[0].datatype = \"FP32\"\n", + " inputs[0].shape.extend([1, 5])\n", + " inputs[0].contents.fp32_contents.extend(data)\n", + "\n", + " # request building\n", + " request = grpc_predict_v2_pb2.ModelInferRequest()\n", + " request.model_name = model_name\n", + " request.inputs.extend(inputs)\n", + "\n", + " response = stub.ModelInfer(request)\n", + " result_arr = np.frombuffer(response.raw_output_contents[0], dtype=np.float32)\n", + " return result_arr" + ] + }, + { + "cell_type": "markdown", + "id": "911b1015-28b0-4d60-bc17-7b30326b97bc", + "metadata": {}, + "source": [ + "### Run the Request" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7573e9a-213c-4f49-8759-8fbff30f3a83", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Define lakeFS Repository\n", + "import os\n", + "import lakefs\n", + "\n", + "repo_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "\n", + "mainBranch = \"main\"\n", + "trainingBranch = \"train01\"\n", + "\n", + "repo = lakefs.Repository(repo_name)\n", + "branchTraining = repo.branch(trainingBranch)\n", + "print(repo)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fc549f6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#Load the scaler from the training branch in lakeFS\n", + "import pickle\n", + "obj = branchTraining.object(path='artifact/scaler.pkl')\n", + "with obj.reader(\"rb\") as handle:\n", + " scaler = pickle.load(handle)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12947866-e0f5-4c72-ba9a-04229b1af990", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [0.3111400080477545, 1.9459399775518593, 1.0, 0.0, 0.0]\n", + "prediction = grpc_request(scaler.transform([data]).tolist()[0])\n", + "prediction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "946f9f1d-b24a-4aa6-b839-f0e8013ef84d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('fraud')\n", + "else:\n", + " print('not fraud')" + ] + }, + { + "cell_type": "markdown", + "id": "1d7f6b51", + "metadata": {}, + "source": [ + "## Example 1: user buys a coffee\n", + "\n", + "In this example, the user is buying a coffee. 
The parameters given to the model are:\n", + "* same location as the last transaction (distance=0)\n", + "* same median price as the last transaction (ratio_to_median=1)\n", + "* using a pin number (pin=1)\n", + "* using the credit card chip (chip=1)\n", + "* not an online transaction (online=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0a68b67-b109-4a2f-b097-092f4a4d25ce", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [0.0, 1.0, 1.0, 1.0, 0.0]\n", + "prediction = grpc_request(scaler.transform([data]).tolist()[0])\n", + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('The model predicts that this is fraud')\n", + "else:\n", + " print('The model predicts that this is not fraud')" + ] + }, + { + "cell_type": "markdown", + "id": "1dd27d88", + "metadata": {}, + "source": [ + "## Example 2: fraudulent transaction\n", + "\n", + "In this example, someone stole the user's credit card and is buying something online. The parameters given to the model are:\n", + "* very far away from the last transaction (distance=100)\n", + "* median price similar to the last transaction (ratio_to_median=1.2)\n", + "* not using a pin number (pin=0)\n", + "* not using the credit card chip (chip=0)\n", + "* is an online transaction (online=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a736a21", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [100, 1.2, 0.0, 0.0, 1.0]\n", + "prediction = grpc_request(scaler.transform([data]).tolist()[0])\n", + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('The model predicts that this is fraud')\n", + "else:\n", + " print('The model predicts that this is not fraud')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4331a17c-5d95-4c4c-a1a5-d26cfa439387", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/5_rest_requests_single_model_lakefs.ipynb b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/5_rest_requests_single_model_lakefs.ipynb new file mode 100644 index 00000000..c287b352 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/5_rest_requests_single_model_lakefs.ipynb @@ -0,0 +1,245 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "55c8afde-9b18-4b6a-9ee5-33924bdb4f16", + "metadata": {}, + "outputs": [], + "source": [ + "# REST Inference" + ] + }, + { + "cell_type": "markdown", + "id": "2c004acc-13cd-4917-8480-592c7c2d623b", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Change the following variable settings to match your deployed model's *Inference endpoint*. 
for example: \n", + "\n", + "```\n", + "deployed_model_name = \"fraud\"\n", + "infer_endpoint = \"https://fraud-predictor-userx-workshop.apps.clusterx.sandboxx.opentlc.com\"\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c20c09a-fcef-42ab-b050-f6e76a8e1cdc", + "metadata": {}, + "outputs": [], + "source": [ + "deployed_model_name = \"fraud\"\n", + "infer_endpoint = \"https://fraud.lakefs.svc.cluster.local\"\n", + "infer_url = f\"{infer_endpoint}/v2/models/{deployed_model_name}/infer\"" + ] + }, + { + "cell_type": "markdown", + "id": "d94f9ece-e9cf-44e2-a8a2-73160186aee8", + "metadata": {}, + "source": [ + "## Request Function\n", + "\n", + "Build and submit the REST request. \n", + "\n", + "Note: You submit the data in the same format that you used for an ONNX inference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54b9386f-683a-4880-b780-c40bec3ab9f8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "\n", + "def rest_request(data):\n", + " json_data = {\n", + " \"inputs\": [\n", + " {\n", + " \"name\": \"dense_input\",\n", + " \"shape\": [1, 5],\n", + " \"datatype\": \"FP32\",\n", + " \"data\": data\n", + " }\n", + " ]\n", + " }\n", + "\n", + " response = requests.post(infer_url, json=json_data, verify=False)\n", + " response_dict = response.json()\n", + " return response_dict['outputs'][0]['data']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04074987-296f-482a-b04a-09e748a4ced9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Define lakeFS Repository\n", + "import os\n", + "import lakefs\n", + "\n", + "repo_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "\n", + "mainBranch = \"main\"\n", + "trainingBranch = \"train01\"\n", + "\n", + "repo = lakefs.Repository(repo_name)\n", + "branchTraining = repo.branch(trainingBranch)\n", + "print(repo)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f871f12", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#Load the scaler from the training branch in lakeFS\n", + "import pickle\n", + "obj = branchTraining.object(path='artifact/scaler.pkl')\n", + "with obj.reader(\"rb\") as handle:\n", + " scaler = pickle.load(handle)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45ad16ac-23da-48bd-9796-f8e4cacae981", + "metadata": {}, + "outputs": [], + "source": [ + "data = [0.3111400080477545, 1.9459399775518593, 1.0, 0.0, 0.0]\n", + "prediction = rest_request(scaler.transform([data]).tolist()[0])\n", + "prediction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d66e0f7-4d4e-4879-bdf1-36b712432fd9", + "metadata": {}, + "outputs": [], + "source": [ + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('fraud')\n", + "else:\n", + " print('not fraud')" + ] + }, + { + "cell_type": "markdown", + "id": "5f7b17c0", + "metadata": {}, + "source": [ + "## Example 1: user buys a coffee\n", + "\n", + "In this example, the user is buying a coffee. 
The parameters given to the model are:\n", + "* same location as the last transaction (distance=0)\n", + "* same median price as the last transaction (ratio_to_median=1)\n", + "* using a pin number (pin=1)\n", + "* using the credit card chip (chip=1)\n", + "* not an online transaction (online=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0a68b67-b109-4a2f-b097-092f4a4d25ce", + "metadata": {}, + "outputs": [], + "source": [ + "data = [0.0, 1.0, 1.0, 1.0, 0.0]\n", + "prediction = rest_request(scaler.transform([data]).tolist()[0])\n", + "prediction\n", + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('The model predicts that this is fraud')\n", + "else:\n", + " print('The model predicts that this is not fraud')" + ] + }, + { + "cell_type": "markdown", + "id": "db10b280", + "metadata": {}, + "source": [ + "## Example 2: fraudulent transaction\n", + "\n", + "In this example, someone stole the user's credit card and is buying something online. The parameters given to the model are:\n", + "* very far away from the last transaction (distance=100)\n", + "* median price similar to the last transaction (ratio_to_median=1.2)\n", + "* not using a pin number (pin=0)\n", + "* not using the credit card chip (chip=0)\n", + "* is an online transaction (online=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "219b8927", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "data = [100, 1.2, 0.0, 0.0, 1.0]\n", + "prediction = rest_request(scaler.transform([data]).tolist()[0])\n", + "prediction\n", + "threshhold = 0.95\n", + "\n", + "if (prediction[0] > threshhold):\n", + " print('The model predicts that this is fraud')\n", + "else:\n", + " print('The model predicts that this is not fraud')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "446f216b-9c8a-4acd-9a41-1c52a73e6593", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/6 Train Save lakefs.pipeline b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/6 Train Save lakefs.pipeline new file mode 100644 index 00000000..7c7cb8c7 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/6 Train Save lakefs.pipeline @@ -0,0 +1,264 @@ +{ + "doc_type": "pipeline", + "version": "3.0", + "json_schema": "http://api.dataplatform.ibm.com/schemas/common-pipeline/pipeline-flow/pipeline-flow-v3-schema.json", + "id": "elyra-auto-generated-pipeline", + "primary_pipeline": "primary", + "pipelines": [ + { + "id": "primary", + "nodes": [ + { + "id": "17500566-dd3e-463d-901a-da339d2d5459", + "type": "execution_node", + "op": "execute-notebook-node", + "app_data": { + "component_parameters": { + "dependencies": [ + "data/*.csv" + ], + "include_subdirectories": true, + "outputs": [ + "models/fraud/1/model.onnx" + ], + "env_vars": [], + "kubernetes_pod_annotations": [], + "kubernetes_pod_labels": [], + "kubernetes_secrets": [ + { + "env_var": "AWS_ACCESS_KEY_ID", + "name": "pipeline-artifacts", + "key": "AWS_ACCESS_KEY_ID" + }, + { 
+ "env_var": "AWS_SECRET_ACCESS_KEY", + "name": "pipeline-artifacts", + "key": "AWS_SECRET_ACCESS_KEY" + }, + { + "env_var": "AWS_S3_ENDPOINT", + "name": "pipeline-artifacts", + "key": "AWS_S3_ENDPOINT" + }, + { + "env_var": "AWS_DEFAULT_REGION", + "name": "pipeline-artifacts", + "key": "AWS_DEFAULT_REGION" + }, + { + "env_var": "AWS_S3_BUCKET", + "name": "pipeline-artifacts", + "key": "AWS_S3_BUCKET" + }, + { + "env_var": "LAKECTL_SERVER_ENDPOINT_URL", + "name": "my-storage", + "key": "AWS_S3_ENDPOINT" + }, + { + "env_var": "LAKECTL_CREDENTIALS_ACCESS_KEY_ID", + "name": "my-storage", + "key": "AWS_ACCESS_KEY_ID" + }, + { + "env_var": "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY", + "name": "my-storage", + "key": "AWS_SECRET_ACCESS_KEY" + }, + { + "env_var": "LAKEFS_REPO_NAME", + "name": "my-storage", + "key": "AWS_S3_BUCKET" + }, + { + "env_var": "LAKEFS_DEFAULT_REGION", + "name": "my-storage", + "key": "AWS_DEFAULT_REGION" + } + ], + "kubernetes_shared_mem_size": {}, + "kubernetes_tolerations": [], + "mounted_volumes": [], + "filename": "1_experiment_train_lakefs.ipynb" + }, + "label": "", + "ui_data": { + "label": "1_experiment_train_lakefs.ipynb", + "image": "/notebook/lakefs/fraud-detection/static/elyra/notebook.svg", + "x_pos": 168, + "y_pos": 272, + "description": "Run notebook file" + } + }, + "inputs": [ + { + "id": "inPort", + "app_data": { + "ui_data": { + "cardinality": { + "min": 0, + "max": -1 + }, + "label": "Input Port" + } + } + } + ], + "outputs": [ + { + "id": "outPort", + "app_data": { + "ui_data": { + "cardinality": { + "min": 0, + "max": -1 + }, + "label": "Output Port" + } + } + } + ] + }, + { + "id": "f1aacb23-a028-46ac-8f47-0c262062b78c", + "type": "execution_node", + "op": "execute-notebook-node", + "app_data": { + "component_parameters": { + "dependencies": [], + "include_subdirectories": false, + "outputs": [ + "models/fraud/1/model.onnx" + ], + "env_vars": [], + "kubernetes_pod_annotations": [], + "kubernetes_pod_labels": [], + "kubernetes_secrets": [ + { + "env_var": "AWS_ACCESS_KEY_ID", + "name": "pipeline-artifacts", + "key": "AWS_ACCESS_KEY_ID" + }, + { + "env_var": "AWS_SECRET_ACCESS_KEY", + "name": "pipeline-artifacts", + "key": "AWS_SECRET_ACCESS_KEY" + }, + { + "env_var": "AWS_S3_ENDPOINT", + "name": "pipeline-artifacts", + "key": "AWS_S3_ENDPOINT" + }, + { + "env_var": "AWS_DEFAULT_REGION", + "name": "pipeline-artifacts", + "key": "AWS_DEFAULT_REGION" + }, + { + "env_var": "AWS_S3_BUCKET", + "name": "pipeline-artifacts", + "key": "AWS_S3_BUCKET" + }, + { + "env_var": "LAKECTL_SERVER_ENDPOINT_URL", + "name": "my-storage", + "key": "AWS_S3_ENDPOINT" + }, + { + "env_var": "LAKECTL_CREDENTIALS_ACCESS_KEY_ID", + "name": "my-storage", + "key": "AWS_ACCESS_KEY_ID" + }, + { + "env_var": "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY", + "name": "my-storage", + "key": "AWS_SECRET_ACCESS_KEY" + }, + { + "env_var": "LAKEFS_REPO_NAME", + "name": "my-storage", + "key": "AWS_S3_BUCKET" + }, + { + "env_var": "LAKEFS_DEFAULT_REGION", + "name": "my-storage", + "key": "AWS_DEFAULT_REGION" + } + ], + "kubernetes_shared_mem_size": {}, + "kubernetes_tolerations": [], + "mounted_volumes": [], + "filename": "2_save_model_lakefs.ipynb" + }, + "label": "", + "ui_data": { + "label": "2_save_model_lakefs.ipynb", + "image": "/notebook/lakefs/fraud-detection/static/elyra/notebook.svg", + "x_pos": 559, + "y_pos": 269, + "description": "Run notebook file" + } + }, + "inputs": [ + { + "id": "inPort", + "app_data": { + "ui_data": { + "cardinality": { + "min": 0, + "max": -1 + }, + "label": "Input 
Port" + } + }, + "links": [ + { + "id": "4da62827-7de2-473c-9712-33df151b7e5c", + "node_id_ref": "17500566-dd3e-463d-901a-da339d2d5459", + "port_id_ref": "outPort" + } + ] + } + ], + "outputs": [ + { + "id": "outPort", + "app_data": { + "ui_data": { + "cardinality": { + "min": 0, + "max": -1 + }, + "label": "Output Port" + } + } + } + ] + } + ], + "app_data": { + "ui_data": { + "comments": [] + }, + "version": 8, + "runtime_type": "KUBEFLOW_PIPELINES", + "properties": { + "name": "6 Train Save lakefs", + "runtime": "Data Science Pipelines", + "pipeline_defaults": { + "kubernetes_pod_annotations": [], + "kubernetes_pod_labels": [], + "mounted_volumes": [], + "kubernetes_tolerations": [], + "kubernetes_shared_mem_size": {}, + "env_vars": [], + "kubernetes_secrets": [], + "runtime_image": "quay.io/modh/runtime-images@sha256:1186ac6c9026d1091f707fe8cedfcc1ea12d1ec46edd9e8d56bb4b12ba048630" + } + } + }, + "runtime_ref": "" + } + ], + "schemas": [] +} \ No newline at end of file diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/7_get_data_train_upload_lakefs.yaml b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/7_get_data_train_upload_lakefs.yaml new file mode 100644 index 00000000..c35b7e88 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/7_get_data_train_upload_lakefs.yaml @@ -0,0 +1,304 @@ +# PIPELINE DEFINITION +# Name: 7-get-data-train-upload-lakefs +components: + comp-get-data: + executorLabel: exec-get-data + outputDefinitions: + artifacts: + train_data_output_path: + artifactType: + schemaTitle: system.Artifact + schemaVersion: 0.0.1 + validate_data_output_path: + artifactType: + schemaTitle: system.Artifact + schemaVersion: 0.0.1 + comp-train-model: + executorLabel: exec-train-model + inputDefinitions: + artifacts: + train_data_input_path: + artifactType: + schemaTitle: system.Artifact + schemaVersion: 0.0.1 + validate_data_input_path: + artifactType: + schemaTitle: system.Artifact + schemaVersion: 0.0.1 + outputDefinitions: + artifacts: + model_output_path: + artifactType: + schemaTitle: system.Artifact + schemaVersion: 0.0.1 + comp-upload-model: + executorLabel: exec-upload-model + inputDefinitions: + artifacts: + input_model_path: + artifactType: + schemaTitle: system.Artifact + schemaVersion: 0.0.1 +deploymentSpec: + executors: + exec-get-data: + container: + args: + - --executor_input + - '{{$}}' + - --function_to_execute + - get_data + command: + - sh + - -c + - "\nif ! 
[ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\ + \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\ + \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.5.0'\ + \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\ + $0\" \"$@\"\n" + - sh + - -ec + - 'program_path=$(mktemp -d) + + + printf "%s" "$0" > "$program_path/ephemeral_component.py" + + _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" + + ' + - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\ + \ *\n\ndef get_data(train_data_output_path: OutputPath(), validate_data_output_path:\ + \ OutputPath()):\n import urllib.request\n print(\"starting download...\"\ + )\n print(\"downloading training data\")\n url = \"https://raw.githubusercontent.com/cfchase/fraud-detection/main/data/train.csv\"\ + \n urllib.request.urlretrieve(url, train_data_output_path)\n print(\"\ + train data downloaded\")\n print(\"downloading validation data\")\n \ + \ url = \"https://raw.githubusercontent.com/cfchase/fraud-detection/main/data/validate.csv\"\ + \n urllib.request.urlretrieve(url, validate_data_output_path)\n print(\"\ + validation data downloaded\")\n\n" + image: quay.io/modh/runtime-images:runtime-cuda-tensorflow-ubi9-python-3.9-2024a-20240523 + exec-train-model: + container: + args: + - --executor_input + - '{{$}}' + - --function_to_execute + - train_model + command: + - sh + - -c + - "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\ + \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\ + \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.5.0'\ + \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' &&\ + \ python3 -m pip install --quiet --no-warn-script-location 'onnx' 'onnxruntime'\ + \ 'tf2onnx' 'lakefs==0.7.1' 's3fs==2024.10.0' && \"$0\" \"$@\"\n" + - sh + - -ec + - 'program_path=$(mktemp -d) + + + printf "%s" "$0" > "$program_path/ephemeral_component.py" + + _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" + + ' + - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\ + \ *\n\ndef train_model(train_data_input_path: InputPath(), validate_data_input_path:\ + \ InputPath(), model_output_path: OutputPath()):\n import numpy as np\n\ + \ import pandas as pd\n from keras.models import Sequential\n from\ + \ keras.layers import Dense, Dropout, BatchNormalization, Activation\n \ + \ from sklearn.model_selection import train_test_split\n from sklearn.preprocessing\ + \ import StandardScaler\n from sklearn.utils import class_weight\n \ + \ import tf2onnx\n import onnx\n import pickle\n from pathlib\ + \ import Path\n import lakefs\n import os\n import s3fs\n\n \ + \ # Define lakeFS Storage and Repository information\n lakefs_storage_options={\n\ + \ \"key\": os.environ.get('LAKECTL_CREDENTIALS_ACCESS_KEY_ID'),\n\ + \ \"secret\": os.environ.get('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY'),\n\ + \ \"client_kwargs\": {\n \"endpoint_url\": os.environ.get('LAKECTL_SERVER_ENDPOINT_URL')\n\ + \ }\n }\n\n repo_name = os.environ.get('LAKEFS_REPO_NAME')\n\ + \ mainBranch = \"main\"\n trainingBranch = \"train01\"\n\n repo\ + \ = lakefs.Repository(repo_name)\n print(repo)\n\n # Create Training\ + \ branch in lakeFS and load the CSV data to the training branch in lakeFS\n\ + \ 
branchTraining = repo.branch(trainingBranch).create(source_reference=mainBranch,\ + \ exist_ok=True)\n\n obj = branchTraining.object(path='data/train.csv')\n\ + \ with open(train_data_input_path, mode='rb') as reader, obj.writer(mode='wb',\ + \ metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'})\ + \ as writer:\n writer.write(reader.read())\n\n obj = branchTraining.object(path='data/validate.csv')\n\ + \ with open(validate_data_input_path, mode='rb') as reader, obj.writer(mode='wb',\ + \ metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'})\ + \ as writer:\n writer.write(reader.read())\n\n # Load the CSV\ + \ data which we will use to train the model.\n # It contains the following\ + \ fields:\n # distancefromhome - The distance from home where the transaction\ + \ happened.\n # distancefromlast_transaction - The distance from last\ + \ transaction happened.\n # ratiotomedianpurchaseprice - Ratio of purchased\ + \ price compared to median purchase price.\n # repeat_retailer - If\ + \ it's from a retailer that already has been purchased from before.\n \ + \ # used_chip - If the (credit card) chip was used.\n # usedpinnumber\ + \ - If the PIN number was used.\n # online_order - If it was an online\ + \ order.\n # fraud - If the transaction is fraudulent.\n\n\n feature_indexes\ + \ = [\n 1, # distance_from_last_transaction\n 2, # ratio_to_median_purchase_price\n\ + \ 4, # used_chip\n 5, # used_pin_number\n 6, # online_order\n\ + \ ]\n\n label_indexes = [\n 7 # fraud\n ]\n\n X_train\ + \ = pd.read_csv(f\"s3://{repo_name}/{trainingBranch}/data/train.csv\", storage_options=lakefs_storage_options)\n\ + \ y_train = X_train.iloc[:, label_indexes]\n X_train = X_train.iloc[:,\ + \ feature_indexes]\n\n X_val = pd.read_csv(f\"s3://{repo_name}/{trainingBranch}/data/validate.csv\"\ + , storage_options=lakefs_storage_options)\n y_val = X_val.iloc[:, label_indexes]\n\ + \ X_val = X_val.iloc[:, feature_indexes]\n\n # Scale the data to remove\ + \ mean and have unit variance. 
The data will be between -1 and 1, which\ + \ makes it a lot easier for the model to learn than random (and potentially\ + \ large) values.\n # It is important to only fit the scaler to the training\ + \ data, otherwise you are leaking information about the global distribution\ + \ of variables (which is influenced by the test set) into the training set.\n\ + \n scaler = StandardScaler()\n\n X_train = scaler.fit_transform(X_train.values)\n\ + \n obj = branchTraining.object(path='artifact/scaler.pkl')\n with\ + \ obj.writer(\"wb\") as handle:\n pickle.dump(scaler, handle)\n\n\ + \ # Since the dataset is unbalanced (it has many more non-fraud transactions\ + \ than fraudulent ones), set a class weight to weight the few fraudulent\ + \ transactions higher than the many non-fraud transactions.\n class_weights\ + \ = class_weight.compute_class_weight('balanced', classes=np.unique(y_train),\ + \ y=y_train.values.ravel())\n class_weights = {i: class_weights[i] for\ + \ i in range(len(class_weights))}\n\n # Build the model, the model we\ + \ build here is a simple fully connected deep neural network, containing\ + \ 3 hidden layers and one output layer.\n\n model = Sequential()\n \ + \ model.add(Dense(32, activation='relu', input_dim=len(feature_indexes)))\n\ + \ model.add(Dropout(0.2))\n model.add(Dense(32))\n model.add(BatchNormalization())\n\ + \ model.add(Activation('relu'))\n model.add(Dropout(0.2))\n model.add(Dense(32))\n\ + \ model.add(BatchNormalization())\n model.add(Activation('relu'))\n\ + \ model.add(Dropout(0.2))\n model.add(Dense(1, activation='sigmoid'))\n\ + \ model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\n\ + \ model.summary()\n\n # Train the model and get performance\n\n \ + \ epochs = 2\n history = model.fit(X_train, y_train, epochs=epochs,\n\ + \ validation_data=(scaler.transform(X_val.values),\ + \ y_val),\n verbose=True, class_weight=class_weights)\n\ + \n # Save the model as ONNX for easy use of ModelMesh\n model_proto,\ + \ _ = tf2onnx.convert.from_keras(model)\n print(model_output_path)\n\ + \ onnx.save(model_proto, model_output_path)\n\n" + image: quay.io/modh/runtime-images:runtime-cuda-tensorflow-ubi9-python-3.9-2024a-20240523 + exec-upload-model: + container: + args: + - --executor_input + - '{{$}}' + - --function_to_execute + - upload_model + command: + - sh + - -c + - "\nif ! 
[ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\ + \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\ + \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.5.0'\ + \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' &&\ + \ python3 -m pip install --quiet --no-warn-script-location 'boto3' 'botocore'\ + \ 'lakefs==0.7.1' && \"$0\" \"$@\"\n" + - sh + - -ec + - 'program_path=$(mktemp -d) + + + printf "%s" "$0" > "$program_path/ephemeral_component.py" + + _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" + + ' + - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\ + \ *\n\ndef upload_model(input_model_path: InputPath()):\n import os\n\ + \ import boto3\n import botocore\n\n # Define lakeFS Repository\n\ + \ import lakefs\n repo_name = os.environ.get('LAKEFS_REPO_NAME')\n\ + \n mainBranch = \"main\"\n trainingBranch = \"train01\"\n\n repo\ + \ = lakefs.Repository(repo_name)\n print(repo)\n\n aws_access_key_id\ + \ = os.environ.get('LAKECTL_CREDENTIALS_ACCESS_KEY_ID')\n aws_secret_access_key\ + \ = os.environ.get('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY')\n endpoint_url\ + \ = os.environ.get('LAKECTL_SERVER_ENDPOINT_URL')\n region_name = os.environ.get('LAKEFS_DEFAULT_REGION')\n\ + \ bucket_name = os.environ.get('LAKEFS_REPO_NAME')\n\n s3_key = os.environ.get(\"\ + S3_KEY\")\n\n session = boto3.session.Session(aws_access_key_id=aws_access_key_id,\n\ + \ aws_secret_access_key=aws_secret_access_key)\n\ + \n s3_resource = session.resource(\n 's3',\n config=botocore.client.Config(signature_version='s3v4'),\n\ + \ endpoint_url=endpoint_url,\n region_name=region_name)\n\n\ + \ bucket = s3_resource.Bucket(bucket_name)\n\n print(f\"Uploading\ + \ {trainingBranch}/{s3_key}\")\n bucket.upload_file(input_model_path,\ + \ f\"{trainingBranch}/{s3_key}\")\n\n" + env: + - name: S3_KEY + value: models/fraud/1/model.onnx + image: quay.io/modh/runtime-images:runtime-cuda-tensorflow-ubi9-python-3.9-2024a-20240523 +pipelineInfo: + name: 7-get-data-train-upload-lakefs +root: + dag: + tasks: + get-data: + cachingOptions: + enableCache: true + componentRef: + name: comp-get-data + taskInfo: + name: get-data + train-model: + cachingOptions: + enableCache: true + componentRef: + name: comp-train-model + dependentTasks: + - get-data + inputs: + artifacts: + train_data_input_path: + taskOutputArtifact: + outputArtifactKey: train_data_output_path + producerTask: get-data + validate_data_input_path: + taskOutputArtifact: + outputArtifactKey: validate_data_output_path + producerTask: get-data + taskInfo: + name: train-model + upload-model: + cachingOptions: + enableCache: true + componentRef: + name: comp-upload-model + dependentTasks: + - train-model + inputs: + artifacts: + input_model_path: + taskOutputArtifact: + outputArtifactKey: model_output_path + producerTask: train-model + taskInfo: + name: upload-model +schemaVersion: 2.1.0 +sdkVersion: kfp-2.5.0 +--- +platforms: + kubernetes: + deploymentSpec: + executors: + exec-train-model: + secretAsEnv: + - keyToEnv: + - envVar: LAKECTL_CREDENTIALS_ACCESS_KEY_ID + secretKey: AWS_ACCESS_KEY_ID + - envVar: LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY + secretKey: AWS_SECRET_ACCESS_KEY + - envVar: LAKEFS_DEFAULT_REGION + secretKey: AWS_DEFAULT_REGION + - envVar: LAKEFS_REPO_NAME + secretKey: AWS_S3_BUCKET + - envVar: LAKECTL_SERVER_ENDPOINT_URL + secretKey: AWS_S3_ENDPOINT + secretName: my-storage + 
exec-upload-model: + secretAsEnv: + - keyToEnv: + - envVar: LAKECTL_CREDENTIALS_ACCESS_KEY_ID + secretKey: AWS_ACCESS_KEY_ID + - envVar: LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY + secretKey: AWS_SECRET_ACCESS_KEY + - envVar: LAKEFS_DEFAULT_REGION + secretKey: AWS_DEFAULT_REGION + - envVar: LAKEFS_REPO_NAME + secretKey: AWS_S3_BUCKET + - envVar: LAKECTL_SERVER_ENDPOINT_URL + secretKey: AWS_S3_ENDPOINT + secretName: my-storage diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/8_distributed_training_lakefs.ipynb b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/8_distributed_training_lakefs.ipynb new file mode 100644 index 00000000..bb3c43eb --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/8_distributed_training_lakefs.ipynb @@ -0,0 +1,562 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Training the Fraud Detection model with Ray by using Codeflare" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example fraud detection model is very small and quickly trained. However, for many large models, training requires multiple GPUs and often multiple machines. In this notebook, you learn how to train a model by using Ray on OpenShift AI to scale out the model training. You use the Codeflare SDK to create the cluster and submit the job. You can find detailed documentation for the SDK [here](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation/).\n", + "\n", + "For this procedure, you need to use codeflare-sdk 0.19.1 (or later). Begin by installing the SDK if it's not already installed or up to date:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install --upgrade codeflare-sdk==0.19.1 lakefs==0.7.1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define lakeFS Repository and create Training branch in lakeFS\n", + "### Change MinIO Access and Secret keys" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import lakefs\n", + "\n", + "repo_name = os.environ.get('LAKEFS_REPO_NAME')\n", + "\n", + "mainBranch = \"main\"\n", + "trainingBranch = \"train01\"\n", + "\n", + "os.environ[\"PIPELINE_ARTIFACTS_ENDPOINT_URL\"] = \"http://minio:9000\"\n", + "os.environ[\"PIPELINE_ARTIFACTS_ACCESS_KEY_ID\"] = \"MinIO Access Key\"\n", + "os.environ[\"PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY\"] = \"MinIO Secret Key\"\n", + "os.environ[\"PIPELINE_ARTIFACTS_S3_BUCKET\"] = \"pipeline-artifacts\"\n", + "\n", + "repo = lakefs.Repository(repo_name)\n", + "branchMain = repo.branch(mainBranch)\n", + "print(repo)\n", + "\n", + "branchTraining = repo.branch(trainingBranch).create(source_reference=mainBranch, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Preparing the data\n", + "\n", + "Normally, the training data for your model would be available in a shared location. For this example, the data is local. You must upload it to your object storage so that you can see how data loading from a shared data source works. After you upload the data, you can work with it by using Ray Data so that it is properly shared across the worker machines." 
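+ "\n",
+ "\n",
+ "The next cell relies on a `utils/s3.py` helper module that is not included in this change set. It presumably mirrors the `upload_directory_to_s3` and `list_objects` helpers from the save-model notebook; a minimal sketch of what it is assumed to provide, driven by the same data connection environment variables:\n",
+ "\n",
+ "```python\n",
+ "# Sketch of the assumed utils/s3.py module (not part of this diff).\n",
+ "import os\n",
+ "import boto3\n",
+ "import botocore\n",
+ "\n",
+ "_session = boto3.session.Session(\n",
+ "    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),\n",
+ "    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'))\n",
+ "_bucket = _session.resource(\n",
+ "    's3',\n",
+ "    config=botocore.client.Config(signature_version='s3v4'),\n",
+ "    endpoint_url=os.environ.get('AWS_S3_ENDPOINT'),\n",
+ "    region_name=os.environ.get('AWS_DEFAULT_REGION')).Bucket(os.environ.get('AWS_S3_BUCKET'))\n",
+ "\n",
+ "def upload_directory_to_s3(local_directory, s3_prefix):\n",
+ "    # Walk the local directory and upload every file under the given key prefix.\n",
+ "    for root, _, files in os.walk(local_directory):\n",
+ "        for filename in files:\n",
+ "            file_path = os.path.join(root, filename)\n",
+ "            s3_key = os.path.join(s3_prefix, os.path.relpath(file_path, local_directory))\n",
+ "            print(f'{file_path} -> {s3_key}')\n",
+ "            _bucket.upload_file(file_path, s3_key)\n",
+ "\n",
+ "def list_objects(prefix):\n",
+ "    # Print every object key under the given prefix.\n",
+ "    for obj in _bucket.objects.filter(Prefix=prefix):\n",
+ "        print(obj.key)\n",
+ "```"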
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import sys\n", + "sys.path.append('./utils')\n", + "\n", + "import utils.s3\n", + "\n", + "utils.s3.upload_directory_to_s3(\"data\", f\"{trainingBranch}/data\")\n", + "print(\"---\")\n", + "utils.s3.list_objects(f\"{trainingBranch}/data\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Authenticate to the cluster by using the OpenShift console login\n", + "\n", + "You must create the Kubernetes objects for Ray Clusters using the Codeflare SDK. In order to do so, you need access permission for your own namespace. The easiest way to set up access is by using the OpenShift CLI `oc` client. \n", + "\n", + "From the OpenShift web console, you can generate an `oc login` command that includes your token and server information. You can use the command to log in to the OpenShift CLI. \n", + "\n", + "1. To generate the command, select **Copy login command** from the username drop-down menu at the top right of the web console.\n", + "\n", + "
\n", + " \"copy\n", + "
\n", + "\n", + "2. Click **Display token**.\n", + "\n", + "3. Below **Log in with this token**, take note of the parameters for token and server.\n", + " For example:\n", + " ```\n", + " oc login --token=sha256~LongString --server=https://api.your-cluster.domain.com:6443\n", + " ``` \n", + " - token: `sha256~LongString`\n", + " - server: `https://api.your-cluster.domain.com:6443`\n", + " \n", + "4. In the following code cell, in the TokenAuthentication object, replace the token and server values with the values that you noted in Step 3.\n", + " For example:\n", + " ```\n", + " auth = TokenAuthentication(\n", + " token = \"sha256~LongString\",\n", + " server = \"https://api.your-cluster.domain.com:6443\",\n", + " skip_tls=False\n", + " )\n", + " auth.login()\n", + " ```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from codeflare_sdk import TokenAuthentication\n", + "# Create authentication object for user permissions\n", + "# IF unused, SDK will automatically check for default kubeconfig, then in-cluster config\n", + "# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually\n", + "auth = TokenAuthentication(\n", + " token = \"sha256~XXXX\",\n", + " server = \"https://XXXX\",\n", + " skip_tls=False\n", + ")\n", + "auth.login()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Create a Ray cluster" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure a Ray cluster\n", + "\n", + "CodeFlare allows you to specify parameters, such as number of workers, image, and kueue local queue name. A full list of parameters is available [here](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation/cluster/config.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from codeflare_sdk import Cluster, ClusterConfiguration\n", + "\n", + "cluster = Cluster(ClusterConfiguration(\n", + " name=\"raycluster-cpu\",\n", + " head_extended_resource_requests={'nvidia.com/gpu': 0},\n", + " worker_extended_resource_requests={'nvidia.com/gpu': 0},\n", + " num_workers=2,\n", + " worker_cpu_requests=1,\n", + " worker_cpu_limits=4,\n", + " worker_memory_requests=2,\n", + " worker_memory_limits=4,\n", + " image=\"quay.io/modh/ray:2.35.0-py39-cu121\"\n", + "))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Start the cluster\n", + "\n", + "If you have a running cluster that you want to connect to, skip to the next cell.\n", + "\n", + "To start a cluster, run the following cell to create the necessary Kubernetes objects to run the Ray cluster. This step might take a few minutes to complete." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "cluster.up()\n", + "cluster.wait_ready()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to a running cluster\n", + "\n", + "If you've already created a cluster, but you've restarted the Python kernel, closed the notebook, or are working in a different notebook, and you want to connect to the existing cluster, uncomment the code in the following cell and then run it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# from codeflare_sdk import get_cluster\n", + "# name=\"raycluster-cpu\"\n", + "# namespace=\"lakefs\"\n", + "# cluster = get_cluster(name, namespace=namespace)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can view information about the cluster, including a link to the Ray dashboard. In the Ray dashboard, you can inspect the running jobs and logs, and see the resources being used.\n", + "
\n", + " \"codeflare\n", + "
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "cluster.details()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The link to the Ray dashboard is available in the cluster details provided as a result of running the previous cell. It should look something like this:\n", + "\n", + "
\n", + " \"ray\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Ray job submission" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Initialize the Job Submission Client\n", + "\n", + "If you want to submit jobs, connect to the running Ray cluster by initializing the job client that has the proper authentication and connection information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "client = cluster.job_client" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After you connect to the Ray cluster, you can query the cluster to determine whether there are any existing jobs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "client.list_jobs()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Create a runtime environment\n", + "\n", + "Now you can configure the [runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for the job. This step includes specifying the working directory, files to exclude, dependencies, and environment variables.\n", + "\n", + "```python\n", + "runtime_env={\n", + " \"working_dir\": \"./\", # relative path to files uploaded to the job\n", + " \"excludes\": [\"local_data/\"], # directories and files to exclude from being uploaded to the job\n", + " \"pip\": [\"boto3\", \"botocore\"], # can also be a string path to a requirements.txt file\n", + " \"env_vars\": {\n", + " \"MY_ENV_VAR\": \"MY_ENV_VAR_VALUE\",\n", + " \"MY_ENV_VAR_2\": os.environ.get(\"MY_ENV_VAR_2\"),\n", + " },\n", + "}\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# script = \"test_data_loader.py\"\n", + "script = \"train_tf_cpu_lakefs.py\"\n", + "runtime_env = {\n", + " \"working_dir\": \"./ray-scripts\",\n", + " \"excludes\": [],\n", + " \"pip\": \"./ray-scripts/requirements.txt\",\n", + " \"env_vars\": {\n", + " \"AWS_ACCESS_KEY_ID\": os.environ.get(\"AWS_ACCESS_KEY_ID\"),\n", + " \"AWS_SECRET_ACCESS_KEY\": os.environ.get(\"AWS_SECRET_ACCESS_KEY\"),\n", + " \"AWS_S3_ENDPOINT\": os.environ.get(\"AWS_S3_ENDPOINT\"),\n", + " \"AWS_DEFAULT_REGION\": os.environ.get(\"AWS_DEFAULT_REGION\"),\n", + " \"AWS_S3_BUCKET\": os.environ.get(\"AWS_S3_BUCKET\"),\n", + " \"PIPELINE_ARTIFACTS_ENDPOINT_URL\": os.environ.get(\"PIPELINE_ARTIFACTS_ENDPOINT_URL\"),\n", + " \"PIPELINE_ARTIFACTS_ACCESS_KEY_ID\": os.environ.get(\"PIPELINE_ARTIFACTS_ACCESS_KEY_ID\"),\n", + " \"PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY\": os.environ.get(\"PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY\"),\n", + " \"PIPELINE_ARTIFACTS_S3_BUCKET\": os.environ.get(\"PIPELINE_ARTIFACTS_S3_BUCKET\"),\n", + " \"NUM_WORKERS\": \"1\",\n", + " \"TRAIN_DATA\": f\"{trainingBranch}/data/train.csv\",\n", + " \"VALIDATE_DATA\": f\"{trainingBranch}/data/validate.csv\",\n", + " \"MODEL_OUTPUT_PREFIX\": f\"{trainingBranch}/models/fraud/1/\",\n", + " },\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Submit the configured job\n", + "\n", + "Now you can submit the job to the cluster. This step creates the necessary Kubernetes objects to run the job. The job runs the script with the specified runtime environment. 
The script for this example is located in [ray-scripts/train_tf_cpu_lakefs.py](./ray-scripts/train_tf_cpu_lakefs.py). The script follows the official [Ray TensorFlow example](https://docs.ray.io/en/latest/train/distributed-tensorflow-keras.html) fairly closely. This example uses TensorFlow; note that the [Ray site](https://docs.ray.io/en/latest/train/train.html) also provides examples for PyTorch and other frameworks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "submission_id = client.submit_job(\n",
+    "    entrypoint=f\"python {script}\",\n",
+    "    runtime_env=runtime_env,\n",
+    ")\n",
+    "\n",
+    "print(submission_id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Query important job information"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Get the job's status\n",
+    "print(client.get_job_status(submission_id), \"\\n\")\n",
+    "\n",
+    "# Get job-related info\n",
+    "print(client.get_job_info(submission_id), \"\\n\")\n",
+    "\n",
+    "# Get the job's logs\n",
+    "print(client.get_job_logs(submission_id))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also tail the job logs to watch the progress of the job."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Iterate through the logs of a job\n",
+    "async for lines in client.tail_job_logs(submission_id):\n",
+    "    print(lines, end=\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### List jobs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "client.list_jobs()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Stop jobs\n",
+    "\n",
+    "If you want to stop a job, call `stop_job` with its submission ID. The following cell lists all the jobs and stops each of them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "for job_details in client.list_jobs():\n",
+    "    print(f\"stopping {job_details.submission_id}\")\n",
+    "    client.stop_job(job_details.submission_id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Delete jobs\n",
+    "\n",
+    "You can also delete the jobs once they have stopped."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "for job_details in client.list_jobs():\n",
+    "    print(f\"deleting {job_details.submission_id}\")\n",
+    "    client.delete_job(job_details.submission_id)\n",
+    "\n",
+    "client.list_jobs()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Delete the cluster\n",
+    "\n",
+    "After you complete training, you can delete the cluster. Deleting the cluster removes its Kubernetes objects and frees up the resources."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "cluster.down()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/pipeline/7_get_data_train_upload_lakefs.py b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/pipeline/7_get_data_train_upload_lakefs.py new file mode 100644 index 00000000..4d33ee65 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/pipeline/7_get_data_train_upload_lakefs.py @@ -0,0 +1,231 @@ +import os + +from kfp import compiler +from kfp import dsl +from kfp.dsl import InputPath, OutputPath + +from kfp import kubernetes + + +@dsl.component(base_image="quay.io/modh/runtime-images:runtime-cuda-tensorflow-ubi9-python-3.9-2024a-20240523") +def get_data(train_data_output_path: OutputPath(), validate_data_output_path: OutputPath()): + import urllib.request + print("starting download...") + print("downloading training data") + url = "https://raw.githubusercontent.com/cfchase/fraud-detection/main/data/train.csv" + urllib.request.urlretrieve(url, train_data_output_path) + print("train data downloaded") + print("downloading validation data") + url = "https://raw.githubusercontent.com/cfchase/fraud-detection/main/data/validate.csv" + urllib.request.urlretrieve(url, validate_data_output_path) + print("validation data downloaded") + + +@dsl.component( + base_image="quay.io/modh/runtime-images:runtime-cuda-tensorflow-ubi9-python-3.9-2024a-20240523", + packages_to_install=["onnx", "onnxruntime", "tf2onnx", "lakefs==0.7.1", "s3fs==2024.10.0"], +) +def train_model(train_data_input_path: InputPath(), validate_data_input_path: InputPath(), model_output_path: OutputPath()): + import numpy as np + import pandas as pd + from keras.models import Sequential + from keras.layers import Dense, Dropout, BatchNormalization, Activation + from sklearn.model_selection import train_test_split + from sklearn.preprocessing import StandardScaler + from sklearn.utils import class_weight + import tf2onnx + import onnx + import pickle + from pathlib import Path + import lakefs + import os + import s3fs + + # Define lakeFS Storage and Repository information + lakefs_storage_options={ + "key": os.environ.get('LAKECTL_CREDENTIALS_ACCESS_KEY_ID'), + "secret": os.environ.get('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY'), + "client_kwargs": { + "endpoint_url": os.environ.get('LAKECTL_SERVER_ENDPOINT_URL') + } + } + + repo_name = os.environ.get('LAKEFS_REPO_NAME') + mainBranch = "main" + trainingBranch = "train01" + + repo = lakefs.Repository(repo_name) + print(repo) + + # Create Training branch in lakeFS and load the CSV data to the training branch in lakeFS + branchTraining = repo.branch(trainingBranch).create(source_reference=mainBranch, exist_ok=True) + + obj = branchTraining.object(path='data/train.csv') + with open(train_data_input_path, mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'}) as writer: + 
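+        # Copy the downloaded training CSV into the lakeFS training branch:
+        # the lakeFS SDK object writer streams the local file's bytes to
+        # data/train.csv on that branch and records the metadata key/value
+        # pairs with the uploaded object.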
writer.write(reader.read())
+
+    obj = branchTraining.object(path='data/validate.csv')
+    with open(validate_data_input_path, mode='rb') as reader, obj.writer(mode='wb', metadata={'using': 'python_wrapper', 'source':'Fraud Detection Demo'}) as writer:
+        writer.write(reader.read())
+
+    # Load the CSV data that we will use to train the model.
+    # It contains the following fields:
+    #   distance_from_home - The distance from home where the transaction happened.
+    #   distance_from_last_transaction - The distance from the location of the last transaction.
+    #   ratio_to_median_purchase_price - Ratio of the purchase price to the median purchase price.
+    #   repeat_retailer - If the transaction was made at a retailer that has been purchased from before.
+    #   used_chip - If the (credit card) chip was used.
+    #   used_pin_number - If the PIN number was used.
+    #   online_order - If it was an online order.
+    #   fraud - If the transaction is fraudulent.
+
+
+    feature_indexes = [
+        1,  # distance_from_last_transaction
+        2,  # ratio_to_median_purchase_price
+        4,  # used_chip
+        5,  # used_pin_number
+        6,  # online_order
+    ]
+
+    label_indexes = [
+        7  # fraud
+    ]
+
+    X_train = pd.read_csv(f"s3://{repo_name}/{trainingBranch}/data/train.csv", storage_options=lakefs_storage_options)
+    y_train = X_train.iloc[:, label_indexes]
+    X_train = X_train.iloc[:, feature_indexes]
+
+    X_val = pd.read_csv(f"s3://{repo_name}/{trainingBranch}/data/validate.csv", storage_options=lakefs_storage_options)
+    y_val = X_val.iloc[:, label_indexes]
+    X_val = X_val.iloc[:, feature_indexes]
+
+    # Scale the data to zero mean and unit variance. The scaled values end up roughly on the order of -1 to 1, which is much easier for the model to learn from than raw (and potentially large) values.
+    # It is important to fit the scaler only on the training data; otherwise you leak information about the global distribution of variables (which is influenced by the test set) into the training set.
+
+    scaler = StandardScaler()
+
+    X_train = scaler.fit_transform(X_train.values)
+
+    obj = branchTraining.object(path='artifact/scaler.pkl')
+    with obj.writer("wb") as handle:
+        pickle.dump(scaler, handle)
+
+    # Since the dataset is unbalanced (it has many more non-fraud transactions than fraudulent ones), set class weights so that the few fraudulent transactions are weighted higher than the many non-fraudulent ones.
+    class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train.values.ravel())
+    class_weights = {i: class_weights[i] for i in range(len(class_weights))}
+
+    # Build the model. The model here is a simple fully connected deep neural network with 3 hidden layers and one output layer.
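+    # Architecture: three hidden Dense(32) layers, each with ReLU activation and
+    # 20% dropout (the second and third also apply batch normalization), followed
+    # by a single sigmoid output unit that scores how likely a transaction is fraudulent.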
+ + model = Sequential() + model.add(Dense(32, activation='relu', input_dim=len(feature_indexes))) + model.add(Dropout(0.2)) + model.add(Dense(32)) + model.add(BatchNormalization()) + model.add(Activation('relu')) + model.add(Dropout(0.2)) + model.add(Dense(32)) + model.add(BatchNormalization()) + model.add(Activation('relu')) + model.add(Dropout(0.2)) + model.add(Dense(1, activation='sigmoid')) + model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) + model.summary() + + # Train the model and get performance + + epochs = 2 + history = model.fit(X_train, y_train, epochs=epochs, + validation_data=(scaler.transform(X_val.values), y_val), + verbose=True, class_weight=class_weights) + + # Save the model as ONNX for easy use of ModelMesh + model_proto, _ = tf2onnx.convert.from_keras(model) + print(model_output_path) + onnx.save(model_proto, model_output_path) + + +@dsl.component( + base_image="quay.io/modh/runtime-images:runtime-cuda-tensorflow-ubi9-python-3.9-2024a-20240523", + packages_to_install=["boto3", "botocore", "lakefs==0.7.1"] +) +def upload_model(input_model_path: InputPath()): + import os + import boto3 + import botocore + + # Define lakeFS Repository + import lakefs + repo_name = os.environ.get('LAKEFS_REPO_NAME') + + mainBranch = "main" + trainingBranch = "train01" + + repo = lakefs.Repository(repo_name) + print(repo) + + aws_access_key_id = os.environ.get('LAKECTL_CREDENTIALS_ACCESS_KEY_ID') + aws_secret_access_key = os.environ.get('LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY') + endpoint_url = os.environ.get('LAKECTL_SERVER_ENDPOINT_URL') + region_name = os.environ.get('LAKEFS_DEFAULT_REGION') + bucket_name = os.environ.get('LAKEFS_REPO_NAME') + + s3_key = os.environ.get("S3_KEY") + + session = boto3.session.Session(aws_access_key_id=aws_access_key_id, + aws_secret_access_key=aws_secret_access_key) + + s3_resource = session.resource( + 's3', + config=botocore.client.Config(signature_version='s3v4'), + endpoint_url=endpoint_url, + region_name=region_name) + + bucket = s3_resource.Bucket(bucket_name) + + print(f"Uploading {trainingBranch}/{s3_key}") + bucket.upload_file(input_model_path, f"{trainingBranch}/{s3_key}") + + +@dsl.pipeline(name=os.path.basename(__file__).replace('.py', '')) +def pipeline(): + get_data_task = get_data() + train_data_csv_file = get_data_task.outputs["train_data_output_path"] + validate_data_csv_file = get_data_task.outputs["validate_data_output_path"] + + train_model_task = train_model(train_data_input_path=train_data_csv_file, + validate_data_input_path=validate_data_csv_file) + onnx_file = train_model_task.outputs["model_output_path"] + + upload_model_task = upload_model(input_model_path=onnx_file) + + upload_model_task.set_env_variable(name="S3_KEY", value="models/fraud/1/model.onnx") + + kubernetes.use_secret_as_env( + task=train_model_task, + secret_name='my-storage', + secret_key_to_env={ + 'AWS_ACCESS_KEY_ID': 'LAKECTL_CREDENTIALS_ACCESS_KEY_ID', + 'AWS_SECRET_ACCESS_KEY': 'LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', + 'AWS_DEFAULT_REGION': 'LAKEFS_DEFAULT_REGION', + 'AWS_S3_BUCKET': 'LAKEFS_REPO_NAME', + 'AWS_S3_ENDPOINT': 'LAKECTL_SERVER_ENDPOINT_URL', + }) + + kubernetes.use_secret_as_env( + task=upload_model_task, + secret_name='my-storage', + secret_key_to_env={ + 'AWS_ACCESS_KEY_ID': 'LAKECTL_CREDENTIALS_ACCESS_KEY_ID', + 'AWS_SECRET_ACCESS_KEY': 'LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY', + 'AWS_DEFAULT_REGION': 'LAKEFS_DEFAULT_REGION', + 'AWS_S3_BUCKET': 'LAKEFS_REPO_NAME', + 'AWS_S3_ENDPOINT': 
'LAKECTL_SERVER_ENDPOINT_URL', + }) + + +if __name__ == '__main__': + compiler.Compiler().compile( + pipeline_func=pipeline, + package_path=__file__.replace('.py', '.yaml') + ) diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/pipeline/build_lakefs.sh b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/pipeline/build_lakefs.sh new file mode 100644 index 00000000..794bd74a --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/pipeline/build_lakefs.sh @@ -0,0 +1,4 @@ +#!/bin/bash + +pip install kfp kfp-kubernetes +python 7_get_data_train_upload_lakefs.py diff --git a/01_standalone_examples/red-hat-openshift-ai/fraud-detection/ray-scripts/train_tf_cpu_lakefs.py b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/ray-scripts/train_tf_cpu_lakefs.py new file mode 100644 index 00000000..327e2011 --- /dev/null +++ b/01_standalone_examples/red-hat-openshift-ai/fraud-detection/ray-scripts/train_tf_cpu_lakefs.py @@ -0,0 +1,258 @@ +import os +import pickle +import boto3 +import botocore + +import pyarrow +import pyarrow.fs +import pyarrow.csv + +import sklearn +import numpy as np + +import tensorflow as tf +import onnx +import tf2onnx +from keras.models import Sequential +from keras.layers import Dense, Dropout, BatchNormalization, Activation + +import ray +from ray import train +from ray.train import RunConfig, ScalingConfig +from ray.train.tensorflow import TensorflowTrainer +from ray.train.tensorflow.keras import ReportCheckpointCallback +from ray.data.preprocessors import Concatenator, StandardScaler + +use_gpu = os.environ.get("USE_GPU", "False").lower() == "true" +num_workers = int(os.environ.get("NUM_WORKERS", "1")) +num_epochs = int(os.environ.get("NUM_EPOCHS", "2")) +batch_size = int(os.environ.get("BATCH_SIZE", "64")) +learning_rate = 1e-3 +output_column_name = "features" + +feature_columns = [ + "distance_from_last_transaction", + "ratio_to_median_purchase_price", + "used_chip", + "used_pin_number", + "online_order", +] + +label_columns = [ + "fraud", +] + +aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID") +aws_secret_access_key = os.environ.get("AWS_SECRET_ACCESS_KEY") +endpoint_url = os.environ.get("AWS_S3_ENDPOINT") +region_name = os.environ.get("AWS_DEFAULT_REGION") +bucket_name = os.environ.get("AWS_S3_BUCKET") + +pipeline_artifacts_access_key_id = os.environ.get("PIPELINE_ARTIFACTS_ACCESS_KEY_ID") +pipeline_artifacts_secret_access_key = os.environ.get("PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY") +pipeline_artifacts_endpoint_url = os.environ.get("PIPELINE_ARTIFACTS_ENDPOINT_URL") +pipeline_artifacts_bucket_name = os.environ.get("PIPELINE_ARTIFACTS_S3_BUCKET") + +trainingBranch = "train01" +train_data = os.environ.get("TRAIN_DATA", f"{trainingBranch}/data/train.csv") + +keras_model_filename = "model.keras" +model_output_prefix = os.environ.get("MODEL_OUTPUT", f"{trainingBranch}/models/fraud/1/") +model_output_filename = os.environ.get("MODEL_OUTPUT_FILENAME", "model.onnx") +scaler_output = model_output_prefix + "scaler.pkl" +model_output = model_output_prefix + model_output_filename + + +def get_pyarrow_fs(): + return pyarrow.fs.S3FileSystem( + access_key=aws_access_key_id, + secret_key=aws_secret_access_key, + region=region_name, + endpoint_override=endpoint_url) + +def get_pipeline_artifacts_fs(): + return pyarrow.fs.S3FileSystem( + access_key=pipeline_artifacts_access_key_id, + secret_key=pipeline_artifacts_secret_access_key, + region=region_name, + endpoint_override=pipeline_artifacts_endpoint_url) + +def 
get_s3_resource(): + session = boto3.session.Session( + aws_access_key_id=aws_access_key_id, + aws_secret_access_key=aws_secret_access_key) + + s3_resource = session.resource( + 's3', + config=botocore.client.Config(signature_version='s3v4'), + endpoint_url=endpoint_url, + region_name=region_name) + + return s3_resource + +def get_pipeline_artifacts_s3_resource(): + session = boto3.session.Session( + aws_access_key_id=pipeline_artifacts_access_key_id, + aws_secret_access_key=pipeline_artifacts_secret_access_key) + + s3_resource = session.resource( + 's3', + config=botocore.client.Config(signature_version='s3v4'), + endpoint_url=pipeline_artifacts_endpoint_url, + region_name=region_name) + + return s3_resource + +def get_class_weights(pyarrow_fs): + with pyarrow_fs.open_input_file(f"{bucket_name}/{train_data}") as file: + training_table = pyarrow.csv.read_csv(file) + + y_train = training_table.to_pandas() + y_train = y_train.loc[:, label_columns] + # Since the dataset is unbalanced (it has many more non-fraud transactions than fraudulent ones), set a class weight to weight the few fraudulent transactions higher than the many non-fraud transactions. + class_weights = sklearn.utils.class_weight.compute_class_weight( + 'balanced', + classes=np.unique(y_train), + y=y_train.values.ravel()) + class_weights = {i : class_weights[i] for i in range(len(class_weights))} + + return class_weights + + +def build_model() -> tf.keras.Model: + model = Sequential() + model.add(Dense(32, activation='relu', input_dim=len(feature_columns))) + model.add(Dropout(0.2)) + model.add(Dense(32)) + model.add(BatchNormalization()) + model.add(Activation('relu')) + model.add(Dropout(0.2)) + model.add(Dense(32)) + model.add(BatchNormalization()) + model.add(Activation('relu')) + model.add(Dropout(0.2)) + model.add(Dense(1, activation='sigmoid')) + return model + + +def train_func(config: dict): + batch_size = config.get("batch_size", 64) + epochs = config.get("epochs", 3) + cw = config.get("class_weight", 3) + + strategy = tf.distribute.MultiWorkerMirroredStrategy() + with strategy.scope(): + multi_worker_model = build_model() + multi_worker_model.compile( + optimizer="adam", + loss="binary_crossentropy", + metrics=["accuracy"], + ) + + dataset = train.get_dataset_shard("train") + results = [] + + for epoch in range(epochs): + print(f"Epoch: {epoch}") + tf_dataset = dataset.to_tf( + feature_columns=output_column_name, + label_columns=label_columns[0], + batch_size=batch_size + ) + history = multi_worker_model.fit( + tf_dataset, + class_weight=cw, + callbacks=[ReportCheckpointCallback()] + ) + results.append(history.history) + + return results + + +def create_sklearn_standard_scaler(scaler): + sk_scaler = sklearn.preprocessing.StandardScaler() + mean = [] + std = [] + + for column in feature_columns: + mean.append(scaler.stats_[f"mean({column})"]) + std.append(scaler.stats_[f"std({column})"]) + + sk_scaler.mean_ = np.array(mean) + sk_scaler.scale_ = np.array(std) + sk_scaler.var_ = sk_scaler.scale_ ** 2 + + return sk_scaler + + +def save_scalar(scaler): + s3_resource = get_s3_resource() + bucket = s3_resource.Bucket(bucket_name) + sklearn_scaler = create_sklearn_standard_scaler(scaler) + + sk_scaler_filename = "/tmp/scaler.pkl" + with open(sk_scaler_filename, "wb") as f: + pickle.dump(sklearn_scaler, f) + + print(f"Uploading scaler from {sk_scaler_filename} to {scaler_output}") + bucket.upload_file(sk_scaler_filename, scaler_output) + + +def save_onnx_model(checkpoint_path): + s3_resource = get_s3_resource() + bucket = 
s3_resource.Bucket(bucket_name) + + pipeline_artifacts_s3_resource = get_pipeline_artifacts_s3_resource() + pipeline_artifacts_bucket = pipeline_artifacts_s3_resource.Bucket(pipeline_artifacts_bucket_name) + + cp_s3_key = checkpoint_path.removeprefix(f"{pipeline_artifacts_bucket_name}/") + "/" + keras_model_filename + keras_model_local = f"/tmp/{keras_model_filename}" + + print(f"Downloading model state_dict from {cp_s3_key} to {keras_model_local}") + pipeline_artifacts_bucket.download_file(cp_s3_key, keras_model_local) + keras_model = tf.keras.models.load_model(keras_model_local) + onnx_model_local = f"/tmp/model.onnx" + onnx_model, _ = tf2onnx.convert.from_keras(keras_model) + onnx.save(onnx_model, onnx_model_local) + + print(f"Uploading model from {onnx_model_local} to {model_output}") + bucket.upload_file(onnx_model_local, model_output) + + +pyarrow_fs = get_pyarrow_fs() +pipeline_artifacts_fs = get_pipeline_artifacts_fs() +class_weights = get_class_weights(pyarrow_fs) + +config = {"lr": learning_rate, "batch_size": batch_size, "epochs": num_epochs, "class_weight":class_weights} + +train_dataset = ray.data.read_csv( + filesystem=pyarrow_fs, + paths=f"s3://{bucket_name}/{train_data}") +scaler = StandardScaler(columns=feature_columns) +concatenator = Concatenator(include=feature_columns, output_column_name=output_column_name) +train_dataset = scaler.fit_transform(train_dataset) +train_dataset = concatenator.fit_transform(train_dataset) + +print(scaler.stats_) + +scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu) + +trainer = TensorflowTrainer( + train_loop_per_worker=train_func, + train_loop_config=config, + run_config=RunConfig( + storage_filesystem=pipeline_artifacts_fs, + storage_path=f"{pipeline_artifacts_bucket_name}/ray/", + name="fraud-training", + ), + scaling_config=scaling_config, + datasets={"train": train_dataset}, + metadata={"preprocessor_pkl": scaler.serialize()}, +) +result = trainer.fit() +metadata = result.checkpoint.get_metadata() +print(metadata) +print(StandardScaler.deserialize(metadata["preprocessor_pkl"])) + +save_scalar(scaler) +save_onnx_model(result.checkpoint.path) diff --git a/01_standalone_examples/red-hat-openshift-ai/img/.gitkeep b/01_standalone_examples/red-hat-openshift-ai/img/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/01_standalone_examples/red-hat-openshift-ai/img/data-connection-my-storage.png b/01_standalone_examples/red-hat-openshift-ai/img/data-connection-my-storage.png new file mode 100644 index 00000000..06e92092 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/data-connection-my-storage.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/data-connection-pipeline-artifacts-delete.png b/01_standalone_examples/red-hat-openshift-ai/img/data-connection-pipeline-artifacts-delete.png new file mode 100644 index 00000000..02a7685c Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/data-connection-pipeline-artifacts-delete.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/data-connection-pipeline-artifacts.png b/01_standalone_examples/red-hat-openshift-ai/img/data-connection-pipeline-artifacts.png new file mode 100644 index 00000000..909278ee Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/data-connection-pipeline-artifacts.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/data.png b/01_standalone_examples/red-hat-openshift-ai/img/data.png new file mode 100644 
index 00000000..1993be0c Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/data.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image1.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image1.png new file mode 100644 index 00000000..25930d66 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image1.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image10.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image10.png new file mode 100644 index 00000000..4da8fbb0 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image10.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image11.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image11.png new file mode 100644 index 00000000..e57d2297 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image11.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image12.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image12.png new file mode 100644 index 00000000..bb59e0c1 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image12.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image2.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image2.png new file mode 100644 index 00000000..c858dde0 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image2.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image3.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image3.png new file mode 100644 index 00000000..b416e61c Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image3.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image4.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image4.png new file mode 100644 index 00000000..b021345d Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image4.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image5.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image5.png new file mode 100644 index 00000000..eb06900e Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image5.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image6.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image6.png new file mode 100644 index 00000000..a1cd5580 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image6.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image7.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image7.png new file mode 100644 index 00000000..4d307cc3 Binary files /dev/null and 
b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image7.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image8.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image8.png new file mode 100644 index 00000000..f653be6e Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image8.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image9.png b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image9.png new file mode 100644 index 00000000..3a9b642b Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/fraud-detection-tutorial-image9.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/lakefs-route.png b/01_standalone_examples/red-hat-openshift-ai/img/lakefs-route.png new file mode 100644 index 00000000..09e249d3 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/lakefs-route.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/lakefsv3.png b/01_standalone_examples/red-hat-openshift-ai/img/lakefsv3.png new file mode 100644 index 00000000..49b7a8fe Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/lakefsv3.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/oai-console.png b/01_standalone_examples/red-hat-openshift-ai/img/oai-console.png new file mode 100644 index 00000000..eaef4736 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/oai-console.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/topology.png b/01_standalone_examples/red-hat-openshift-ai/img/topology.png new file mode 100644 index 00000000..0bf29717 Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/topology.png differ diff --git a/01_standalone_examples/red-hat-openshift-ai/img/workbench-lakefs-env-variables.png b/01_standalone_examples/red-hat-openshift-ai/img/workbench-lakefs-env-variables.png new file mode 100644 index 00000000..37e6774e Binary files /dev/null and b/01_standalone_examples/red-hat-openshift-ai/img/workbench-lakefs-env-variables.png differ diff --git a/README.md b/README.md index 15dbaf01..95e9f34f 100644 --- a/README.md +++ b/README.md @@ -95,6 +95,7 @@ Under the [standalone_examples](./01_standalone_examples/) folder are a set of e * [Labelbox integration](./01_standalone_examples/labelbox-integration/) * [Kafka integration](./01_standalone_examples/kafka/) * [Flink integration](./01_standalone_examples/flink/) +* [Red Hat OpenShift AI integration](./01_standalone_examples/red-hat-openshift-ai/) * [How to migrate or clone a repo](./01_standalone_examples/migrate-or-clone-repo/) * [Running lakeFS with PostgreSQL as K/V store](./01_standalone_examples/docker-compose-with-postgres/)