Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Release Base Images to #1607

Closed
wants to merge 4 commits into from
Closed

Change Release Base Images to #1607

wants to merge 4 commits into from

Conversation

KuuCi
Copy link
Contributor

@KuuCi KuuCi commented Oct 22, 2024

This is to speed up release process by making us not need to reinstall all of transformer_engine. Docker caching doesn't work here because of

ADD https://raw.githubusercontent.com/mosaicml/llm-foundry/$BRANCH_NAME/setup.py setup.py
, where if the branch name is different than main (and it will be since we expect it to be whatever branch is associated with the release), cache will be invaliadted.

Testing:
image

@KuuCi KuuCi requested a review from a team as a code owner October 22, 2024 20:23
@@ -93,7 +99,7 @@ jobs:
${{ env.AWS_DOCKER_TAG }}
${{ env.AWS_LATEST_TAG }}
build-args: |
BASE_IMAGE=mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04-aws
BASE_IMAGE=mosaicml/llm-foundry:2.4.0_cu124-latest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is base image the same image ur building...?

Copy link
Contributor Author

@KuuCi KuuCi Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're different images, the images built here are
mosaicml/llm-foundry:release_${TAG_NAME}" and mosaicml/llm-foundry:release_latest"
to keep up with foundry releases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait yeah, how are we building off the foundry images? that should be the result of this right

Copy link
Contributor Author

@KuuCi KuuCi Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh. Huh, I guess the behavior is slightly weird.

  1. https://github.com/mosaicml/llm-foundry/blob/main/.github/workflows/docker.yaml is what builds mosaicml/llm-foundry:2.4.0_cu124-latest (which is largely cached) -- not necessarily the same pr (could be the previous pr)
  2. https://github.com/mosaicml/llm-foundry/blob/main/.github/workflows/release.yaml builds from that image. We don't cache here because it's almost guarenteed cache miss when we clone from our specific branch + pip install, but that doesn't matter bc the dependencies are mostly installed from the mosaicml/llm-foundry:2.4.0_cu124-latest image, and any that aren't install will be in dockerfile so we get net speed up
  3. We keep the foundry repo

but like isn't that fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But currently the release workflow builds on the pytorch image, not the foundry image...i guess the foundry image does exist though...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think its worth optimizing release workflow to add this complication. This feels a bit weird to me to build on top of the foundry build image...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ agree

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can have some unintended complications -- for example when doing foundry release, suppose I create the release branch but don't publish the release and kick off the release workflow yet. Then, in between, someone else merges some breaking PR. Then I launch this new release workflow which builds on the cached latest foundry image that contains the bad commit....right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds fair, will close pr

@@ -93,7 +99,7 @@ jobs:
${{ env.AWS_DOCKER_TAG }}
${{ env.AWS_LATEST_TAG }}
build-args: |
BASE_IMAGE=mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04-aws
BASE_IMAGE=mosaicml/llm-foundry:2.4.0_cu124-latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait yeah, how are we building off the foundry images? that should be the result of this right

.github/workflows/release.yaml Show resolved Hide resolved
@KuuCi KuuCi requested a review from snarayan21 October 22, 2024 20:51
@KuuCi KuuCi closed this Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants