feat: Use Amazon S3 as Stable Diffusion Model Storage #616

Open
wants to merge 59 commits into base: main

Conversation

@lindarr915 (Contributor) commented Aug 19, 2024

What does this PR do?

Implements #448.

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

Motivation

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

lindarr915 and others added 29 commits August 1, 2024 11:23
@ratnopamc ratnopamc requested a review from askulkarni2 August 20, 2024 22:49
@lindarr915 lindarr915 marked this pull request as ready for review September 6, 2024 03:30
@askulkarni2 (Collaborator) left a comment

@lindarr915 thanks so much for the PR! Please address the comments. Also make sure you run pre-commit run -a and commit the changes made.

- image: public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest
- imagePullPolicy: IfNotPresent # Ensure the image is always pulled when updated
+ image: public.ecr.aws/data-on-eks/ray-serve-gpu-stablediffusion:2.33.0-py311-gpu
+ imagePullPolicy: Always # Ensure the image is always pulled when updated
Collaborator

Why Always?

Contributor Author

With Always, the kubelet checks the registry each time it starts a container. If the digest behind public.ecr.aws/data-on-eks/ray-serve-gpu-stablediffusion:2.33.0-py311-gpu differs from the locally cached one, the kubelet pulls the image from ECR. Otherwise, it uses the cached image.

https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy

Always
every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.

Contributor Author

Thus, if I push an updated image under the same tag, new containers will pick up the latest version. If the SHA256 digest is unchanged, the kubelet uses the image cached on disk.
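
To make the difference between the two policies concrete, here is a minimal Python sketch of the pull decision as the quoted Kubernetes documentation describes it. It is a simplified model for illustration, not kubelet code: the `needs_pull` function, the dictionary used as a node cache, and the `sha256:aaa` / `sha256:bbb` digests are all hypothetical.

```python
from enum import Enum


class PullPolicy(Enum):
    ALWAYS = "Always"
    IF_NOT_PRESENT = "IfNotPresent"
    NEVER = "Never"


def needs_pull(policy, image, local_cache, resolve_digest):
    """Return True if the image would be pulled before the container starts.

    local_cache: dict mapping image reference (name:tag) to the digest cached
        on the node (a hypothetical stand-in for the node's image store).
    resolve_digest: callable that asks the registry which digest the
        reference currently points to.
    """
    if policy is PullPolicy.NEVER:
        # The registry is never contacted; only local images are used.
        return False
    if policy is PullPolicy.IF_NOT_PRESENT:
        # Pull only when no image with this reference is cached locally.
        return image not in local_cache
    # Always: resolve the tag to a digest on every start, then pull only
    # if that exact digest is not already cached.
    remote_digest = resolve_digest(image)
    return remote_digest not in local_cache.values()


if __name__ == "__main__":
    image = "public.ecr.aws/data-on-eks/ray-serve-gpu-stablediffusion:2.33.0-py311-gpu"
    cache = {image: "sha256:aaa"}      # digest cached on the node (made up)
    registry = {image: "sha256:bbb"}   # tag was re-pushed with new content (made up)

    print(needs_pull(PullPolicy.ALWAYS, image, cache, registry.get))          # True
    print(needs_pull(PullPolicy.IF_NOT_PRESENT, image, cache, registry.get))  # False
```

In this model, a tag that is re-pushed with new content is only noticed under Always; with IfNotPresent the node keeps using the image it already has.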

ai-ml/jark-stack/terraform/addons.tf (resolved)

## End-to-end Example

An end-to-end deployment example can be found in [Stable Diffusion on GPU](../gen-ai/inference/GPUs/stablediffusion-gpus).
Collaborator

Wrong architecture diagram.

Contributor Author

In the page there is no diagram. Would you let me know which architecture diagram you are referring to?

- Added a new job to download and save models to Amazon S3.
- Created a YAML file for the job configuration.
- Defined a persistent volume and persistent volume claim for model storage.
- Created a config map for the shell script used in the job.
- Configured the job container with the necessary image and command.
- Mounted the script and model storage directories in the container.
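
The commit message does not show the script itself. As a rough sketch of the "save models" step only, the Python below copies already-downloaded model files into a mounted model-storage directory. It assumes the persistent volume is backed by the S3 bucket (for example through a CSI driver), which is not stated here; the function name `copy_model_files` and both directory arguments are hypothetical. The PR uses a shell script, so Python is used purely for illustration.

```python
import shutil
from pathlib import Path


def copy_model_files(source_dir, mount_dir):
    """Copy every file under source_dir into mount_dir, keeping relative paths.

    Returns the list of relative paths (POSIX style) that were written.
    """
    source = Path(source_dir)
    target = Path(mount_dir)
    written = []
    for path in source.rglob("*"):
        if not path.is_file():
            continue  # directories are created on demand below
        relative = path.relative_to(source)
        destination = target / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        # copyfile copies file contents only, without permissions or timestamps.
        shutil.copyfile(path, destination)
        written.append(relative.as_posix())
    return written
```

A job container could call `copy_model_files` with the local download directory and the volume's mount path; both paths depend on the job YAML, which is not reproduced in this conversation.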
@lindarr915 (Contributor Author)

@askulkarni2 Please review the new commits for the PR.

Contributor

This PR has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this PR will be closed in 10 days

@github-actions github-actions bot added the stale label Nov 11, 2024
@lindarr915 (Contributor Author)

I am still working on this PR.

@github-actions github-actions bot removed the stale label Nov 12, 2024
fix: ray service yaml filename
Contributor

This PR has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this PR will be closed in 10 days

@github-actions github-actions bot added the stale label Dec 13, 2024
3 participants