Change Release Base Images to #1607
@@ -93,7 +99,7 @@ jobs:
   ${{ env.AWS_DOCKER_TAG }}
   ${{ env.AWS_LATEST_TAG }}
 build-args: |
-  BASE_IMAGE=mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04-aws
+  BASE_IMAGE=mosaicml/llm-foundry:2.4.0_cu124-latest
How is the base image the same image you're building...?
They're different images. The images built here are mosaicml/llm-foundry:release_${TAG_NAME} and mosaicml/llm-foundry:release_latest, to keep up with foundry releases.
wait yeah, how are we building off the foundry images? that should be the result of this right
Ohh. Huh, I guess the behavior is slightly weird.
- https://github.com/mosaicml/llm-foundry/blob/main/.github/workflows/docker.yaml is what builds mosaicml/llm-foundry:2.4.0_cu124-latest (which is largely cached). It's not necessarily from the same PR (could be the previous PR).
- https://github.com/mosaicml/llm-foundry/blob/main/.github/workflows/release.yaml builds from that image. We don't cache here because it's an almost guaranteed cache miss when we clone from our specific branch + pip install, but that doesn't matter because the dependencies are mostly installed from the mosaicml/llm-foundry:2.4.0_cu124-latest image, and any that aren't installed will be in the Dockerfile, so we get a net speedup.
- We keep the foundry repo
But like, isn't that fine?
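For reference, the setup described above might look roughly like this in the release workflow. This is only a sketch: the job name, step name, runner, and the use of docker/build-push-action@v5 are assumptions for illustration, not copied from release.yaml. Only the two tag variables and the BASE_IMAGE value come from the diff in this PR.

```yaml
# Hypothetical fragment of a release workflow (no `on:` trigger shown).
jobs:
  build-release-image:          # assumed job name
    runs-on: ubuntu-latest      # assumed runner
    steps:
      - name: Build and push release image   # assumed step name
        uses: docker/build-push-action@v5     # assumed action and version
        with:
          context: .
          file: Dockerfile
          push: true
          # No cache-from/cache-to here: the thread says caching is skipped
          # because cloning a specific branch almost always misses the cache.
          tags: |
            ${{ env.AWS_DOCKER_TAG }}
            ${{ env.AWS_LATEST_TAG }}
          # Proposed change: start from the prebuilt foundry image, which
          # already has most dependencies installed, instead of the pytorch image.
          build-args: |
            BASE_IMAGE=mosaicml/llm-foundry:2.4.0_cu124-latest
```

The speedup would come entirely from the base image already containing the heavy dependencies. The cost is that the release build then depends on whatever commit the 2.4.0_cu124-latest tag was last built from.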
But currently the release workflow builds on the pytorch image, not the foundry image... I guess the foundry image does exist though...
I don't think it's worth optimizing the release workflow to add this complication. It feels a bit weird to me to build on top of the foundry build image...
^ agree
This can have some unintended complications. For example, when doing a foundry release, suppose I create the release branch but don't publish the release and kick off the release workflow yet. Then, in between, someone else merges a breaking PR. Then I launch this new release workflow, which builds on the cached latest foundry image that contains the bad commit... right?
Ok, sounds fair, will close pr
@@ -93,7 +99,7 @@ jobs:
   ${{ env.AWS_DOCKER_TAG }}
   ${{ env.AWS_LATEST_TAG }}
 build-args: |
-  BASE_IMAGE=mosaicml/pytorch:2.4.0_cu124-python3.11-ubuntu20.04-aws
+  BASE_IMAGE=mosaicml/llm-foundry:2.4.0_cu124-latest
wait yeah, how are we building off the foundry images? that should be the result of this right
This is to speed up the release process by removing the need to reinstall all of transformer_engine. Docker caching doesn't work here because of llm-foundry/Dockerfile, line 15 in 8e78eb5.
Testing: