feat: Use Amazon S3 as Stable Diffusion Model Storage #616

Closed
wants to merge 59 commits
Changes from 53 commits
Commits
59 commits
8228f0f
feat: run GPU node with BR and EBS snapshot with container image cache
lindarr915 Aug 1, 2024
efaabb9
refactor: remove kubectl_manifest of karpenter custom resources
lindarr915 Aug 1, 2024
a3dcd32
feat: locust file fo load testing
lindarr915 Aug 1, 2024
67e29bb
feat: End-to-end deployment of Bottlerocket nodes with container imag…
lindarr915 Aug 1, 2024
08bef10
Merge branch 'awslabs:main' into bottlerocket-cache-container-image
lindarr915 Aug 2, 2024
1ec10ce
feat: bump ray version to 2.24
lindarr915 Aug 13, 2024
521553f
feat: add s3 mountpoint csi driver
lindarr915 Aug 13, 2024
ecc8ba6
feat: add loading SD models from s3 bucket
lindarr915 Aug 13, 2024
e75f6bf
feat: download stable diffusion from huggingface and upload to s3
lindarr915 Aug 13, 2024
d43d962
feat: update the preloaded images to bottlerocket node ami and snapshot
lindarr915 Aug 13, 2024
af40b3c
fix: add pycache path in .gitignore
lindarr915 Aug 13, 2024
42d9c04
Merge branch 'bottlerocket-cache-container-image' of https://github.c…
lindarr915 Aug 13, 2024
a3b4004
fix: fix docs path
lindarr915 Aug 19, 2024
e2ccf55
feat: add notebook of model downloading
lindarr915 Aug 19, 2024
2fa6bc9
fix: add additional IAM policy statements for karpenter to launch fro…
lindarr915 Aug 19, 2024
0c5cab0
feat: add s3 as model cache stroage
lindarr915 Aug 19, 2024
eda4c68
feat: model download script from hugging face
lindarr915 Aug 19, 2024
f458e90
Merge branch 'main' into bottlerocket-cache-container-image
lindarr915 Aug 19, 2024
252fee8
chore: bump ray version to 2.33
lindarr915 Aug 19, 2024
7100836
fix: update required ver of data_addons module to 1.33
lindarr915 Aug 20, 2024
097f858
feat: add s3 bucket for model storage
lindarr915 Aug 20, 2024
f55ce33
feat: using self-managed s3 csi driver to tolerate all taints
lindarr915 Aug 20, 2024
b9b03b8
fix: spilt notebook code blocks
lindarr915 Aug 20, 2024
4c8c5ee
Merge branch 'awslabs:main' into cache-model-from-s3
lindarr915 Aug 20, 2024
e30c0a2
feat: add a variable to make creating s3 bucket optional
lindarr915 Aug 20, 2024
50ecb5d
chore: cleanup comments
lindarr915 Aug 20, 2024
4549a71
fix: Dockerfile base image
lindarr915 Aug 20, 2024
1bfb117
Merge branch 'cache-model-from-s3' of https://github.com/lindarr915/d…
lindarr915 Aug 20, 2024
342fea9
docs: add cache model to s3 draft
lindarr915 Aug 20, 2024
dac490a
Use EKS addon for S3 Mountpoint CSI
lindarr915 Sep 4, 2024
8876cc3
Merge S3 Bucket from s3.tf to addon.tf
lindarr915 Sep 4, 2024
c6574bb
Clear output of download_models.ipynb
lindarr915 Sep 4, 2024
7f3cdbf
Use data-on-eks ECR public repo
lindarr915 Sep 4, 2024
361d077
Add placeholder for S3 Bucket Name
lindarr915 Sep 4, 2024
2f3a131
Update docs to download SD models and upload to S3
lindarr915 Sep 4, 2024
f705fc7
Merge PVC and RayService into a single YAML
lindarr915 Sep 4, 2024
6e45d62
Merge PVC and RayService into a single YAML
lindarr915 Sep 4, 2024
c70394f
Revert "Merge branch 'cache-model-from-s3' of https://github.com/lind…
lindarr915 Sep 6, 2024
770d69a
Remove Comments
lindarr915 Sep 6, 2024
12a1f83
neat: remove used code
lindarr915 Sep 6, 2024
eba5e58
fix docs
lindarr915 Sep 6, 2024
15437fa
Merge branch 'main' into cache-model-from-s3
lindarr915 Sep 6, 2024
a0352f9
revert nim diagram
lindarr915 Sep 6, 2024
922bfe5
fix: remove duplicated lines
lindarr915 Sep 6, 2024
213d3dc
fix: remove whitespaces
lindarr915 Sep 6, 2024
2ae7fb5
fix: Change imagePullPolicy to Always
lindarr915 Sep 6, 2024
c307539
Create a Job to save models to Amazon S3
lindarr915 Sep 20, 2024
9de005d
feat: Add job to download and save models to Amazon S3
lindarr915 Sep 20, 2024
4488eca
feat: support save ML models to S3
lindarr915 Sep 20, 2024
06c678d
feat: add S3 Gateway VPCe
lindarr915 Sep 20, 2024
f19986d
chore: improve logging
lindarr915 Sep 20, 2024
22f4d3d
fix: remove whitespaces
lindarr915 Sep 20, 2024
81c7075
fix: resolve conflict
lindarr915 Sep 20, 2024
b29458c
docs: add comments for volume mounts
lindarr915 Sep 25, 2024
2c1e053
docs: Update page title
lindarr915 Nov 11, 2024
3b6d683
fix: remove unnecessary code
lindarr915 Nov 11, 2024
63a9a34
feat: bump diffuser and transformers versions in Dockerfile
lindarr915 Nov 11, 2024
11c87e3
fix: remove pip section
lindarr915 Nov 11, 2024
ba6c45c
fix: add namespace for downloader job
lindarr915 Nov 12, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -60,5 +60,6 @@ site
# node modules
node_modules
gen-ai/inference/stable-diffusion-rayserve-gpu/locust/__pycache__/*
gen-ai/inference/stable-diffusion-rayserve-gpu/stable-diffusion-2/*
website/package-lock.json
website/package.json
37 changes: 37 additions & 0 deletions ai-ml/jark-stack/terraform/addons.tf
@@ -82,8 +82,18 @@ module "eks_blueprints_addons" {
vpc-cni = {
preserve = true
}

aws-mountpoint-s3-csi-driver = {
service_account_role_arn = module.s3_csi_driver_irsa.iam_role_arn
configuration_values = <<-EOF
node:
tolerateAllTaints: true
EOF
}
}



#---------------------------------------
# AWS Load Balancer Controller Add-on
#---------------------------------------
@@ -354,6 +364,28 @@ module "data_addons" {
]
}

#---------------------------------------------------------------
# IRSA for Mountpoint for Amazon S3 CSI Driver
#---------------------------------------------------------------
module "s3_csi_driver_irsa" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "~> 5.34"
role_name_prefix = format("%s-%s-", local.name, "s3-csi-driver")
role_policy_arns = {
# WARNING: Demo purpose only. Bring your own IAM policy with least privileges
s3_csi_driver = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["kube-system:s3-csi-driver-sa"]
}
}
tags = local.tags
}




#---------------------------------------------------------------
# Additional Resources
@@ -399,3 +431,8 @@ data "aws_iam_policy_document" "karpenter_controller_policy" {
sid = "KarpenterControllerAdditionalPolicy"
}
}

resource "aws_s3_bucket" "model_storage" {
count = var.create_s3_bucket ? 1 : 0
bucket_prefix = "model-storage-"
}
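
The inline warning above recommends replacing AmazonS3FullAccess with a least-privilege policy. A minimal sketch of what that could look like, assuming the CSI driver only needs the model_storage bucket created above (the action list follows Mountpoint for Amazon S3 permission guidance; the policy and resource names here are illustrative and not part of this change):

# Illustrative least-privilege alternative to AmazonS3FullAccess for the
# Mountpoint for Amazon S3 CSI driver, scoped to the model storage bucket.
data "aws_iam_policy_document" "s3_csi_driver" {
  statement {
    sid       = "MountpointListBucket"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.model_storage[0].arn]
  }
  statement {
    sid = "MountpointObjectAccess"
    actions = [
      "s3:GetObject",
      "s3:PutObject",
      "s3:AbortMultipartUpload",
      "s3:DeleteObject",
    ]
    resources = ["${aws_s3_bucket.model_storage[0].arn}/*"]
  }
}

resource "aws_iam_policy" "s3_csi_driver" {
  name_prefix = "s3-csi-driver-"
  policy      = data.aws_iam_policy_document.s3_csi_driver.json
}

The s3_csi_driver_irsa module's role_policy_arns would then reference aws_iam_policy.s3_csi_driver.arn instead of the AWS managed policy.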
8 changes: 8 additions & 0 deletions ai-ml/jark-stack/terraform/outputs.tf
@@ -2,3 +2,11 @@ output "configure_kubectl" {
description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig"
value = "aws eks --region ${var.region} update-kubeconfig --name ${var.name}"
}
output "grafana_secret_name" {
description = "The name of the secret containing the Grafana admin password."
value = aws_secretsmanager_secret.grafana.name
}
output "model_s3_bucket" {
description = "The S3 bucket name for storing ML models"
value = length(aws_s3_bucket.model_storage) > 0 ? aws_s3_bucket.model_storage[0].bucket : ""
}
5 changes: 5 additions & 0 deletions ai-ml/jark-stack/terraform/variables.tf
@@ -56,5 +56,10 @@ variable "bottlerocket_data_disk_snpashot_id" {
description = "Bottlerocket Data Disk Snapshot ID"
type = string
default = ""
}

variable "create_s3_bucket" {
description = "Create S3 Bucket for Model Storage"
default = false
type = bool
}
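
Because create_s3_bucket defaults to false, the bucket (and the model_s3_bucket output) is opt-in. An illustrative way to enable it, assuming a terraform.tfvars file (not part of this change):

# terraform.tfvars -- opt in to creating the model storage bucket
create_s3_bucket = true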
27 changes: 27 additions & 0 deletions ai-ml/jark-stack/terraform/vpc.tf
@@ -51,3 +51,30 @@ module "vpc" {

tags = local.tags
}


module "vpc_endpoints" {
source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
version = "~> 5.0"

# create = var.enable_vpc_endpoints

create = true

vpc_id = module.vpc.vpc_id
security_group_ids = [module.vpc_endpoints_sg.security_group_id]

endpoints = merge({
s3 = {
service = "s3"
service_type = "Gateway"
route_table_ids = module.vpc.private_route_table_ids
tags = {
Name = "${local.name}-s3"
}
}
}
)

tags = local.tags
}
9 changes: 5 additions & 4 deletions gen-ai/inference/stable-diffusion-rayserve-gpu/Dockerfile
@@ -1,8 +1,8 @@
# https://hub.docker.com/layers/rayproject/ray-ml/2.10.0-py310-gpu/images/sha256-4181ed53b0b25a758b155312ca6ab29a65cb78cd57296d42cfbe4806a2b77df4?context=explore
# docker buildx build --platform=linux/amd64 -t ray2.10.0-py310-gpu-stablediffusion:v1.0 -f Dockerfile .
# docker buildx build --platform=linux/amd64 -t ray2.24.0-py310-gpu-stablediffusion:v1.0 -f Dockerfile .

# Use Ray base image
FROM rayproject/ray-ml:2.10.0-py310-gpu
FROM docker.io/rayproject/ray:2.33.0-py311-gpu

# Maintainer label
LABEL maintainer="DoEKS"
@@ -14,13 +14,14 @@ ENV DEBIAN_FRONTEND=non-interactive
USER $USER

# Install Ray Serve and other Python packages with specific versions
RUN pip install --no-cache-dir requests torch "diffusers==0.12.1" "transformers==4.25.1"
RUN pip install --no-cache-dir requests torch "diffusers==0.29.2" "transformers==4.42.4" "accelerate==0.30.1"

# Set a working directory
WORKDIR /serve_app

# Copy your Ray Serve script into the container
COPY ray_serve_stablediffusion.py /serve_app/ray_serve_stablediffusion.py
# The serving script can be moved to the ConfigMap, or copied into the container image
# COPY ray_serve_sd.py /serve_app/ray_serve_sd.py

# Set the PYTHONPATH environment variable
ENV PYTHONPATH=/serve_app:$PYTHONPATH
@@ -0,0 +1,119 @@
apiVersion: v1
kind: Namespace
metadata:
name: stablediffusion
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: s3-pv-model-storage
spec:
capacity:
storage: 1200Gi # ignored, required
accessModes:
- ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
mountOptions:
- allow-overwrite
- allow-delete
- region us-west-2
- prefix stable-diffusion-2/
csi:
driver: s3.csi.aws.com # required
volumeHandle: s3-csi-driver-volume
volumeAttributes:
bucketName: <YOUR_BUCKET_NAME> # Replace bucketName with S3 bucket name created from `terraform output`
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: s3-model-storage-claim
namespace: stablediffusion
spec:
accessModes:
- ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
storageClassName: "" # required for static provisioning
resources:
requests:
storage: 1200Gi # ignored, required
volumeName: s3-pv-model-storage
---
apiVersion: v1
kind: ConfigMap
metadata:
name: shell-script-configmap
data:
download_models.py: |
from diffusers import StableDiffusionPipeline
import torch

# Set the model you want to download
model_name = "stabilityai/stable-diffusion-2"
model_directory = "/serve_app/temp-stable-diffusion-2"

# Load the model
pipe = StableDiffusionPipeline.from_pretrained(model_name, torch_dtype=torch.float16, cache_dir=model_directory)

# Save the model to the local directory
pipe.save_pretrained(model_directory)

print(f"Model saved to {model_directory}")

import shutil, logging

# Source directory (the one you want to copy)
src_dir = model_directory

# Destination directory (where you want to copy to)
dst_dir = "/serve_app/stable-diffusion-2/"

shutil.rmtree(f"{model_directory}/.locks")
shutil.rmtree(f"{model_directory}/models--stabilityai--stable-diffusion-2")

# Copy the directory recursively
try:
shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
logging.info("Directory copied successfully")
except shutil.Error as e:
logging.error(f"Error copying directory: {e}")
except OSError as e:
logging.error(f"OS error occurred: {e}")
except Exception as e:
logging.error(f"Unexpected error occurred: {e}")

import logging

notice_message = """
NOTICE: shutil.copytree() may generate errors when used with Amazon S3.
This is because it calls copystat(src, dst) at:
/home/ray/anaconda3/lib/python3.11/shutil.py, line 527, in _copytree
Amazon S3 does not support modifying metadata, which causes these errors.
"""

logging.warning(notice_message.strip())
---
apiVersion: batch/v1
kind: Job
metadata:
name: shell-script-job
spec:
template:
spec:
containers:
- name: shell-script-container
image: public.ecr.aws/data-on-eks/ray-serve-gpu-stablediffusion:2.33.0-py311-gpu
command: ["python"]
args: ["/scripts/download_models.py"]
volumeMounts:
- name: script-volume
mountPath: /scripts
- mountPath: /serve_app/stable-diffusion-2
name: cache-dir
volumes:
- name: cache-dir
persistentVolumeClaim:
claimName: s3-model-storage-claim
- name: script-volume
configMap:
name: shell-script-configmap
restartPolicy: Never
backoffLimit: 4