Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove TE from dockerfile and instead add as optional dependency #1605

Merged
merged 5 commits into from
Oct 21, 2024

Conversation

snarayan21
Copy link
Contributor

@snarayan21 snarayan21 commented Oct 21, 2024

See title

@dakinggg in terms of not breaking workflows, should we also add this to the gpu dep group?

Installed correctly in docker image:

mcli interactive --max-duration 23 --cluster r15z4p1 --image mosaicml/ci-staging:2.4.0_cu124-563786d --gpus 8
✔  Run interactive-J7lSGd has started. Preparing your interactive session...
root@6d9a4d64-787b-465b-8e2c-bb27a14b8622-0:/# python
Python 3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformer_engine.pytorch as te
>>> x = te.Linear(128,128)
<annoying Flash attn warning>
>>> x
Linear()
>>> exit()

@snarayan21 snarayan21 requested review from a team as code owners October 21, 2024 15:48
Copy link
Contributor

@b-chu b-chu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@KuuCi
Copy link
Contributor

KuuCi commented Oct 21, 2024

Might be careful about this. Changing how the Dockerfile in Foundry is built will almost definitely result in cache misses for how images are built. You can see here in the Docker job is taking 21 minutes instead of the standard 41 seconds.

Edit: might have been wrong about reason why docker job took 21 minutes, could be because of this pr -- lgtm pending that TE version is correct

Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that the version now pinned is the same/works equivalently to the commit we had pinned?

@KuuCi
Copy link
Contributor

KuuCi commented Oct 21, 2024

Are we sure that the version now pinned is the same/works equivalently to the commit we had pinned?

:p also how did image only take 21 minutes to install, in DLE TE takes ~25 min to install alone

@snarayan21
Copy link
Contributor Author

@dakinggg it should work the same, but is an upgrade from what we had pinned before. Do you want me to run the fp8 regression tests off of the image here?

@KuuCi yeah i'd expect this to result in cache misses because I'm taking out the TE install from the docker image and it's instead getting installed as a foundry dependency. I think with the pypi release they're prebuilding some parts of the package which makes it faster, but there's also some additional things they do to get around build isolation & supporting multiple frameworks

@dakinggg
Copy link
Collaborator

@snarayan21 yes please

@snarayan21
Copy link
Contributor Author

@dakinggg all 3 regression tests are deterministically the same with new TE (loss points are overlapping in these graphs below) --

Screenshot 2024-10-21 at 3 35 44 PM Screenshot 2024-10-21 at 3 36 52 PM Screenshot 2024-10-21 at 3 37 08 PM

@snarayan21 snarayan21 requested a review from dakinggg October 21, 2024 19:40
@snarayan21 snarayan21 merged commit 6448e4e into main Oct 21, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants