Remove TE from dockerfile and instead add as optional dependency #1605

snarayan21 · 2024-10-21T15:48:35Z

See title

@dakinggg in terms of not breaking workflows, should we also add this to the gpu dep group?

Installed correctly in docker image:

mcli interactive --max-duration 23 --cluster r15z4p1 --image mosaicml/ci-staging:2.4.0_cu124-563786d --gpus 8
✔  Run interactive-J7lSGd has started. Preparing your interactive session...
root@6d9a4d64-787b-465b-8e2c-bb27a14b8622-0:/# python
Python 3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformer_engine.pytorch as te
>>> x = te.Linear(128,128)
<annoying Flash attn warning>
>>> x
Linear()
>>> exit()
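Since TE becomes an optional dependency, code that uses it may want to guard the import. The following is a minimal sketch of one way to do that, not the approach taken in this PR: `is_te_available` and `build_linear` are hypothetical helper names, and only `transformer_engine.pytorch.Linear` comes from the session above.

```python
import importlib.util


def is_te_available() -> bool:
    """Return True if the transformer_engine package can be located."""
    return importlib.util.find_spec("transformer_engine") is not None


def build_linear(in_features: int, out_features: int):
    """Build a TE Linear layer, or fail with a clear message if TE is missing.

    Hypothetical helper for illustration; not part of llm-foundry.
    """
    if not is_te_available():
        raise ImportError(
            "transformer_engine is not installed; install the optional TE "
            "dependency to use this code path."
        )
    # Import lazily so that environments without TE can still import this module.
    import transformer_engine.pytorch as te

    return te.Linear(in_features, out_features)
```

With TE installed, `build_linear(128, 128)` mirrors the check in the interactive session; without it, the caller gets an `ImportError` that names the missing package.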

b-chu

Thanks!

KuuCi · 2024-10-21T17:41:23Z

We might want to be careful about this. Changing how the Dockerfile in Foundry is built will almost definitely result in cache misses when images are built. You can see here that the Docker job is taking 21 minutes instead of the standard 41 seconds.

Edit: I might have been wrong about why the Docker job took 21 minutes; it could be because of this PR. LGTM pending that the TE version is correct.

dakinggg

Are we sure that the version now pinned is the same/works equivalently to the commit we had pinned?

KuuCi · 2024-10-21T17:48:50Z

Are we sure that the version now pinned is the same/works equivalently to the commit we had pinned?

:p Also, how did the image only take 21 minutes to install? In DLE, TE alone takes ~25 min to install.

snarayan21 · 2024-10-21T18:48:32Z

@dakinggg it should work the same, but is an upgrade from what we had pinned before. Do you want me to run the fp8 regression tests off of the image here?

@KuuCi yeah, I'd expect this to result in cache misses because I'm taking the TE install out of the Docker image, and it's instead getting installed as a Foundry dependency. I think with the PyPI release they're prebuilding some parts of the package, which makes it faster, but there are also some additional things they do to get around build isolation and to support multiple frameworks.
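As a rough illustration of what "installed as a Foundry dependency" could look like, here is a sketch of an extras mapping of the kind passed to `setuptools.setup(extras_require=...)`. The extra name `te`, the helper `build_extras`, and the `transformer-engine[pytorch]` requirement string are assumptions for illustration; the actual version pin is not shown in this thread, so it is left as a parameter.

```python
def build_extras(te_spec: str, gpu_deps: list[str]) -> dict[str, list[str]]:
    """Assemble optional dependency groups (hypothetical layout).

    te_spec:  version specifier for TE, e.g. a pin chosen by the maintainers.
    gpu_deps: whatever the existing GPU dependency group already contains.
    """
    extras = {
        "gpu": list(gpu_deps),
        # Assumed requirement string; the real pin is not given in this thread.
        "te": [f"transformer-engine[pytorch]{te_spec}"],
    }
    # An "all" group that unions the optional groups, deduplicated and sorted.
    extras["all"] = sorted({dep for deps in extras.values() for dep in deps})
    return extras
```

Under this layout, users who need TE would opt in by installing the package with the `te` extra, while a default install would skip the TE build entirely. Whether TE should also be listed in the `gpu` group is the open question raised at the top of the thread.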

dakinggg · 2024-10-21T18:51:13Z

@snarayan21 yes please

snarayan21 · 2024-10-21T19:40:19Z

@dakinggg all 3 regression tests are deterministically the same with the new TE (the loss points overlap in the graphs below).

snarayan21 added 5 commits September 21, 2024 21:22

yo

68f91dd

Merge branch 'main' of https://github.com/mosaicml/llm-foundry

36cc16a

merging

Merge branch 'main' of https://github.com/mosaicml/llm-foundry

3c55e39

merging.

Merge branch 'main' of https://github.com/mosaicml/llm-foundry

70b46f6

ayo

yo

b287c19

snarayan21 requested review from a team as code owners October 21, 2024 15:48

snarayan21 requested review from irenedea, dakinggg, j316chuck and b-chu October 21, 2024 16:10

b-chu approved these changes Oct 21, 2024

dakinggg reviewed Oct 21, 2024

KuuCi mentioned this pull request Oct 21, 2024

Add cache-from in foundry release #1603

Closed

snarayan21 requested a review from dakinggg October 21, 2024 19:40

dakinggg approved these changes Oct 21, 2024

snarayan21 merged commit 6448e4e into main Oct 21, 2024
11 checks passed
