feat(llm-finetuner): Migrate LLM finetuner image from kubernetes-cloud
#21
LLM Finetuner Container
This re-homes the container for coreweave/kubernetes-cloud's LLM finetuner by copying over its Dockerfile and compiler wrapper as they appeared in commit 6c10019 under the directory finetuner-workflow/finetuner in that repository, with some updates for cross-repository downloading added to the build.
Neat Things About the Build
The coreweave/kubernetes-cloud repository is absolutely massive. At the time of writing, git clone https://github.com/coreweave/kubernetes-cloud thwacks you with a 607 MiB download, primarily comprising nearly 400 MiB of image files under /docs and an almost 200 MiB .git directory.
This is a bit over-the-top to download just a handful of files, so this container's build is configured to do sparse checkouts that reduce the download size roughly 1000x, to a bit under 600 KiB, which is further reduced to just a few dozen kilobytes by deleting the .git directory at the end of the download step.
It's a nice improvement that could be integrated into this repository's sd-finetuner container build as well, which currently leaves that full 600+ MiB repository in its final image.
Weird Things About the Build
Building from a Branch
Branch names can but probably should not be used as commit identifiers for these builds, because Docker may cache the download by the branch's name, which isn't good if the branch has received updates and is expected to be re-downloaded in an updated state. The hash of the latest commit should be used instead.
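For illustration, here is a minimal sketch of resolving a branch to its current commit hash before kicking off a build. The helper name resolve_commit and the COMMIT build argument are assumptions made for this example and are not taken from the actual Dockerfile.

```shell
#!/bin/sh
# Hypothetical helper (not part of the actual build): print the commit hash
# that a branch currently points at, so the hash rather than the branch name
# can be handed to the image build.
resolve_commit() {
    repo_url="$1"   # repository URL or local path
    branch="$2"     # branch name to resolve
    # `git ls-remote` prints "<hash><TAB>refs/heads/<branch>"; keep the hash.
    git ls-remote "$repo_url" "refs/heads/$branch" | cut -f1
}

# Possible usage (COMMIT is an assumed build-arg name, some-branch a placeholder):
#   docker build \
#     --build-arg COMMIT="$(resolve_commit https://github.com/coreweave/kubernetes-cloud some-branch)" .
```

Because the resolved hash changes whenever the branch moves, the value passed to the build changes with it, so a stale cached download keyed on the branch name is no longer a concern.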
Coupling
There is currently no default commit defined for the build, and accordingly, no rule to automatically rebuild the image on updates pushed here. The list of files copied during the build process is very specific and doesn't adapt very well between versions of the source.
This could be alleviated a bit by copying over the entire finetuner-workflow/finetuner directory into the final image, but I still see this potentially becoming very annoying to manage between many possible concurrent branches in kubernetes-cloud that could each require distinct build instructions over here, and tracking down corresponding historical changes across the two repositories seems painful.
To make that better, we could work on making the build instructions very generic, like including a version-controlled install.sh (or something) over in kubernetes-cloud and running most of the work in there. Alternatively, the LLM finetuner could have its own repository with this container published in it.
As another option, this entire Dockerfile could be left in kubernetes-cloud, versioned with the rest of the source, and we could dynamically download it and build it here in ml-containers from any given commit entirely through a workflow, without any corresponding directory here (or maybe one with only a README). This would cut down on the headache of managing the source in multiple disconnected places while still keeping the container in the central ml-containers repository.
I'd welcome some thoughts on this point.
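As a starting point for discussing the workflow-based option, here is a rough sketch of a download step that pulls a single directory from a given commit using a shallow, blob-filtered fetch plus a sparse checkout. The helper name fetch_dir_at_commit is hypothetical, and the exact git invocations used in this PR's build may differ; this only shows the general technique.

```shell
#!/bin/sh
# Hypothetical helper: fetch one directory of a repository at one commit,
# without cloning the full history or unrelated files.
fetch_dir_at_commit() {
    repo_url="$1"   # repository URL or local path
    commit="$2"     # full commit hash (abbreviated hashes generally cannot be fetched)
    subdir="$3"     # directory to keep, e.g. finetuner-workflow/finetuner
    dest="$4"       # output directory (created by git init)

    git init -q "$dest" &&
    git -C "$dest" remote add origin "$repo_url" &&
    # Restrict the working tree to $subdir. In cone mode, files sitting
    # directly in its parent directories are kept as well.
    git -C "$dest" sparse-checkout set "$subdir" &&
    # Fetch only the requested commit, without history and without file
    # contents; blobs needed by the checkout are downloaded on demand.
    git -C "$dest" fetch -q --depth 1 --filter=blob:none origin "$commit" &&
    git -C "$dest" checkout -q FETCH_HEAD &&
    # Drop the repository metadata once the files are in place.
    rm -rf "$dest/.git"
}

# Possible usage in a workflow step (COMMIT is an assumed variable holding a full hash):
#   fetch_dir_at_commit https://github.com/coreweave/kubernetes-cloud "$COMMIT" \
#       finetuner-workflow/finetuner ./src
#   docker build ./src/finetuner-workflow/finetuner
```

If the server does not support partial-clone filters, git warns and falls back to fetching all file contents for that commit, so the result is the same but the download is larger.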