-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge Amd64 and Arm64 pax dockerfile #291
base: main
Are you sure you want to change the base?
Conversation
install-flax.sh --defer | ||
install-te.sh --defer | ||
|
||
if [[ -f /opt/requirements-defer.txt ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole thing about requirements-defer.txt
isn't carried over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the --defer
mechanism runs the risk of nightlies breaking due to sequential invocations of pip install
installing breaking versions of transitive libraries previous invocations required. Until we have a safer way to install these deps, I'd vote for keeping it for now.
FWIW, you can run pip check
on your built image to see if anything is in conflict. Returning "No broken requirements found" doesn't tell us that it's okay to not install everything together, but just that nothing is broken at the moment. Oddly, my experience has been pip doesn't just tell you that what you're about to install will break transitive dependencies 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't working on ARM.
I don't have time to debug this now. I would prefer to have both container the same now and bring it back later.
I'll try again quickly just in case is now works. But I can't debug it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this defer things break the Dockerfile spirit and made developing/fixing the dockerfile more complex and much slower.
Any idea how to not loose this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting this error on ARM:
#20 717.0 File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 457, in resolve
#20 717.0 raise ResolutionTooDeep(max_rounds)
#20 717.0 pip._vendor.resolvelib.resolvers.ResolutionTooDeep: 200000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re:
#20 717.0 File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 457, in resolve
#20 717.0 raise ResolutionTooDeep(max_rounds)
#20 717.0 pip._vendor.resolvelib.resolvers.ResolutionTooDeep: 200000
I saw this earlier with installing t5x + TE, but it seemed to have resolved itself yesterday.
Actually if the resolver is having trouble finding a working version (for me it was seqio), there's likely a conflict that you'll introduce by installing sequentially. If it works for you, we can rel-note; but just FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any idea how to investigate this?
If the CI pass, I'm ok with this. The current Arm version doesn't defer too.
I'm not sure what to write as a relnote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it works for you, we can rel-note
I was thinking if you run pip check
and saw something like "protobuf is in conflict", that could be the relnote unless the package in conflict is inconsequential
Any idea how to investigate this?
You could try installing 2 of the packages with --defer, then do the third one normally, then run pip check
after that to see what pip says is in conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I have:
2.4.2 has requirement typing-extensions>=4.6.1, but you have typing-extensions 4.5.0.
pydantic-core 2.10.1 has requirement typing-extensions!=4.7.0,>=4.6.0, but you have typing-extensions 4.5.0.
TE ask for it, but doesn't specify a version. Otherwise we don't see it under /opt.
I think we can ignore it and not relnote it. Do you agree with that?
If so, any other change needed before we can merge this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds related.... #117
Have you run a test with TE? When this happened last, the test failed only under the TE code paths, and not the vanilla. There's actually another issue here is that the pre-submit only tests upstream pax tests and not the ones with TE. @ashors1 has a PR to add TE tests which sound like they'd help validate this dep conflict: #292, perhaps you can cherry pick it in a ephemeral branch to test this.
Another option is to add a constraint file pip install -c constraints.txt -r requirements-defer.txt
and add a pydantic-core version that is known to work. I imagine you'd have to downgrade to find it. I looked at the version in the t5x images and I found 2.0.3, so maybe try populating a constraints.txt
file with
pydantic==2.0.3
and see if that builds? with all three deps deferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we will need a major refactor to adapt the pax dockerfile for pip-compile.
@terrykong shall we merge this PR or wait for you and @nouiz to figure out the deferred install issue? |
I think it's fine to not fix the defer, but @nouiz , can you run the rosetta-pax image from this change thru https://github.com/NVIDIA/JAX-Toolbox/actions/workflows/nightly-rosetta-pax-build.yaml ? I'm worried about this breaking TE @ashors1 Just added some TE tests (in #292 ), which can make sure this won't break functionality (This isn't part of the pre-submit CI yet, so it needs to be dispatched manually thru the web UI) |
How to run it manually? I won't be able to do that today. |
336776e
to
3e7d56e
Compare
…he dockerfile vs install script.
Co-authored-by: Mehdi Amini <[email protected]>
3e7d56e
to
a0917d1
Compare
I rebased and tryed to add the recent change. |
Testing build for 2.14: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6511142264 |
It looks like the build failed. |
@nouiz Shall we revisit this work on top of the new Dockerfiles that use pip-compile? Also, did any of the dependencies (lingvo, tf-text, etc.) now have usable public pip wheels for arm64? |
This limit the difference.
The file install-lingvo.sh doesn't work on x86.
There is an issue with auditwheel. So I don't use it on x86.