Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge Amd64 and Arm64 pax dockerfile #291

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

nouiz
Copy link
Collaborator

@nouiz nouiz commented Oct 9, 2023

This limit the difference.
The file install-lingvo.sh doesn't work on x86.
There is an issue with auditwheel. So I don't use it on x86.

install-flax.sh --defer
install-te.sh --defer

if [[ -f /opt/requirements-defer.txt ]]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole thing about requirements-defer.txt isn't carried over?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the --defer mechanism runs the risk of nightlies breaking due to sequential invocations of pip install installing breaking versions of transitive libraries previous invocations required. Until we have a safer way to install these deps, I'd vote for keeping it for now.

FWIW, you can run pip check on your built image to see if anything is in conflict. Returning "No broken requirements found" doesn't tell us that it's okay to not install everything together, but just that nothing is broken at the moment. Oddly, my experience has been pip doesn't just tell you that what you're about to install will break transitive dependencies 😞

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't working on ARM.
I don't have time to debug this now. I would prefer to have both container the same now and bring it back later.
I'll try again quickly just in case is now works. But I can't debug it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this defer things break the Dockerfile spirit and made developing/fixing the dockerfile more complex and much slower.
Any idea how to not loose this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting this error on ARM:

#20 717.0   File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 457, in resolve
#20 717.0     raise ResolutionTooDeep(max_rounds)
#20 717.0 pip._vendor.resolvelib.resolvers.ResolutionTooDeep: 200000

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re:

#20 717.0   File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 457, in resolve
#20 717.0     raise ResolutionTooDeep(max_rounds)
#20 717.0 pip._vendor.resolvelib.resolvers.ResolutionTooDeep: 200000

I saw this earlier with installing t5x + TE, but it seemed to have resolved itself yesterday.

Actually if the resolver is having trouble finding a working version (for me it was seqio), there's likely a conflict that you'll introduce by installing sequentially. If it works for you, we can rel-note; but just FYI

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea how to investigate this?
If the CI pass, I'm ok with this. The current Arm version doesn't defer too.
I'm not sure what to write as a relnote.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it works for you, we can rel-note

I was thinking if you run pip check and saw something like "protobuf is in conflict", that could be the relnote unless the package in conflict is inconsequential

Any idea how to investigate this?

You could try installing 2 of the packages with --defer, then do the third one normally, then run pip check after that to see what pip says is in conflict

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I have:

2.4.2 has requirement typing-extensions>=4.6.1, but you have typing-extensions 4.5.0.
pydantic-core 2.10.1 has requirement typing-extensions!=4.7.0,>=4.6.0, but you have typing-extensions 4.5.0.

TE ask for it, but doesn't specify a version. Otherwise we don't see it under /opt.
I think we can ignore it and not relnote it. Do you agree with that?

If so, any other change needed before we can merge this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds related.... #117

Have you run a test with TE? When this happened last, the test failed only under the TE code paths, and not the vanilla. There's actually another issue here is that the pre-submit only tests upstream pax tests and not the ones with TE. @ashors1 has a PR to add TE tests which sound like they'd help validate this dep conflict: #292, perhaps you can cherry pick it in a ephemeral branch to test this.

Another option is to add a constraint file pip install -c constraints.txt -r requirements-defer.txt and add a pydantic-core version that is known to work. I imagine you'd have to downgrade to find it. I looked at the version in the t5x images and I found 2.0.3, so maybe try populating a constraints.txt file with

pydantic==2.0.3

and see if that builds? with all three deps deferred?

yhtang
yhtang previously approved these changes Oct 11, 2023
Copy link
Collaborator

@yhtang yhtang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we will need a major refactor to adapt the pax dockerfile for pip-compile.

@yhtang
Copy link
Collaborator

yhtang commented Oct 11, 2023

@terrykong shall we merge this PR or wait for you and @nouiz to figure out the deferred install issue?

@terrykong
Copy link
Contributor

terrykong commented Oct 11, 2023

I think it's fine to not fix the defer, but @nouiz , can you run the rosetta-pax image from this change thru https://github.com/NVIDIA/JAX-Toolbox/actions/workflows/nightly-rosetta-pax-build.yaml ? I'm worried about this breaking TE

@ashors1 Just added some TE tests (in #292 ), which can make sure this won't break functionality (This isn't part of the pre-submit CI yet, so it needs to be dispatched manually thru the web UI)

@nouiz
Copy link
Collaborator Author

nouiz commented Oct 11, 2023

How to run it manually? I won't be able to do that today.

@nouiz nouiz force-pushed the amd64_arm64_common_pax branch from 3e7d56e to a0917d1 Compare October 12, 2023 18:55
@nouiz
Copy link
Collaborator Author

nouiz commented Oct 12, 2023

I rebased and tryed to add the recent change.

@terrykong
Copy link
Contributor

@yhtang
Copy link
Collaborator

yhtang commented Oct 19, 2023

It looks like the build failed.

@yhtang
Copy link
Collaborator

yhtang commented Dec 4, 2023

@nouiz Shall we revisit this work on top of the new Dockerfiles that use pip-compile? Also, did any of the dependencies (lingvo, tf-text, etc.) now have usable public pip wheels for arm64?

@yhtang yhtang added the stale label Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants