
Pants test timeout includes venv creation overhead causing timeouts #19681

Open
JoostvDoorn opened this issue Aug 28, 2023 · 28 comments
Labels: backend: Python (Python backend-related issues), bug

Comments

JoostvDoorn commented Aug 28, 2023

Describe the bug
When a python_tests target has heavy dependencies (e.g. pytorch), a test can time out on a "cold" execution, because the first execution performs additional work to create the venv. This overhead can be significant: we observe a 'test time' of >30s for a test that takes 3s with a warm cache.

One solution would be to exclude this venv creation time from the test timeout and only measure the actual execution time of the python process. This would make the timeout argument for python_tests more useful when large, complex dependencies are used.

I am not sure whether there is a performance component that can be solved; if so, that should probably become an issue in the pex repository.

Pants version
2.16.0 and 2.17.0rc5

OS
MacOS + Linux

Additional info
See gist for example:
https://gist.github.com/JoostvDoorn/9c0f63ed5198544a36b477502eeac4fb

To test:

rm -rf ~/.cache/pants/named_caches/
pants test :: --keep-sandboxes=on_failure --test-force
✓ //test_file.py:root succeeded in 23.09s.

Second execution using pants test :: --keep-sandboxes=on_failure --test-force is significantly faster:

✓ //test_file.py:root succeeded in 3.12s.

The bug here is that the test execution timing should reflect the actual time it takes to run the test, not the time spent creating the virtual environment.

For the gist example, the expectation is that the following would always work regardless of the state of the cache:

python_tests(
    name="root",
    timeout=15,
)
benjyw (Contributor) commented Aug 28, 2023

The timeout (as set globally or on individual python_tests targets) applies only to the pytest process itself, not to any other processes (see here).

So is this extra overhead happening within that process?

benjyw (Contributor) commented Aug 28, 2023

Hmm, but I can reproduce what you're talking about, with your gist.

benjyw (Contributor) commented Aug 28, 2023

OK, so this is the cost of seeding the venv from the pex, and does indeed happen in the pytest process. So there is no easy way to avoid this, short of pre-seeding the venv in a separate process.

benjyw (Contributor) commented Aug 29, 2023

I think the problem is wider than just timeouts though. If each test constructs its own bespoke venv, but they all include pytorch, each test will incur this penalty on first run. And in CI first run is the only run.

One solution, when there are many tests, is to run tests in batches: https://www.pantsbuild.org/docs/python-test-goal#batching-and-parallelism

This doesn't help the single test cold start time, but it amortizes that cold start over many tests.
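
For illustration, here is a minimal BUILD sketch of opting tests into batching (the tag value "torch-suite" is an arbitrary example, not anything prescribed by Pants); python_tests targets sharing a batch_compatibility_tag and otherwise compatible settings may be merged into one pytest run, so the venv seeding cost is paid once per batch rather than once per test:

```
# BUILD — a sketch; "torch-suite" is just an example tag value.
python_tests(
    name="tests",
    # Tests sharing this tag (and otherwise compatible settings) may be run
    # together in one batched pytest process, so the venv is seeded once per batch.
    batch_compatibility_tag="torch-suite",
)
```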

JoostvDoorn (Author) commented:

> I think the problem is wider than just timeouts though. If each test constructs its own bespoke venv, but they all include pytorch, each test will incur this penalty on first run. And in CI first run is the only run.

You're completely right; on the user side we can take some steps to alleviate this. We often apply the batch_compatibility_tag as well to reduce the overhead for many of our tests, and it might be a good idea for us to adjust that usage to better address this concern for pytorch (and others). In addition, we can raise minimum timeouts to a relatively high number, and we could rely on https://pypi.org/project/pytest-timeout/ as an alternative way to control test timeouts.
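
As a sketch of that pytest-timeout alternative (assuming the plugin is installed and registered with pytest; the test file and names here are hypothetical): its clock only starts once pytest itself is running, so venv seeding is not counted against the limit, unlike the Pants timeout field.

```
# test_example.py — hypothetical test using pytest-timeout's marker.
import time

import pytest


@pytest.mark.timeout(15)  # fail this test if it runs longer than 15s inside pytest
def test_heavy_import_is_fast_enough():
    time.sleep(1)  # stand-in for real work
    assert True
```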

> OK, so this is the cost of seeding the venv from the pex, and does indeed happen in the pytest process. So there is no easy way to avoid this, short of pre-seeding the venv in a separate process.

Maybe @jsirois has some idea? I wonder whether the venv construction for large library dependencies can somehow be optimized.

jsirois (Contributor) commented Aug 29, 2023

Have you tried setting this to True?: https://www.pantsbuild.org/docs/reference-pex#venv_use_symlinks

JoostvDoorn (Author) commented Aug 29, 2023

This doesn't seem to make a difference in this case.

> rm -rf /Users/joostvandoorn/.cache/pants/named_caches/
> pants test :: --keep-sandboxes=on_failure --test-force --pex-venv-use-symlinks
11:56:42.56 [INFO] Completed: Run Pytest - //test_file.py:root - succeeded.

✓ //test_file.py:root succeeded in 23.41s.

jsirois (Contributor) commented Aug 29, 2023

Ok, that makes sense from a clean Pex cache. I'm not sure there is any way to speed up venv creation from a cold start, save for populating the venv site-packages in parallel; that step is currently serial. I can give that a quick try in the morning and report timing differences for the requirement set in your gist.
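
For context, a rough sketch (not Pex's actual internals; the function name and paths are hypothetical) of what parallel site-packages population could look like - copying each cached installed-wheel directory into the venv concurrently instead of serially:

```
# Hypothetical sketch of parallel venv seeding; not Pex code.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def seed_site_packages(installed_wheel_dirs: list[Path], site_packages: Path, jobs: int = 8) -> None:
    """Copy each cached installed-wheel directory's contents into site-packages, in parallel."""

    def copy_one(wheel_dir: Path) -> None:
        # dirs_exist_ok=True merges each wheel's files into the shared site-packages tree.
        shutil.copytree(wheel_dir, site_packages, dirs_exist_ok=True)

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Materialize the iterator so any exception raised by a copy surfaces here.
        list(pool.map(copy_one, installed_wheel_dirs))
```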

benjyw (Contributor) commented Aug 29, 2023

PS to get the gist to work, I had to add

[python]
interpreter_constraints = ["==3.10.*"]

2.16.x still has broad default ICs, which in my case captured a 3.7 interpreter, which is incompatible with the fsspec>=2023.6.0 requirement.

jsirois (Contributor) commented Aug 29, 2023

Ok, thanks. I'm going to cut out Pants in my testing, but the IC will still be relevant.

benjyw (Contributor) commented Aug 29, 2023

> --pex-venv-use-symlinks

Not that this helps in CI, but you should see a difference if your nuking command is rm -rf ~/.cache/pants/named_caches/pex_root/venvs/, because then the symlink farm can still be efficiently recreated, as the symlink targets in ~/.cache/pants/named_caches/pex_root/installed_wheels/ still exist.

That should give you a sense of what's going on.

jsirois (Contributor) commented Aug 29, 2023

I mean, stopping real fast here, just looking at the parallel downloads from the lock file:

$ pex3 lock create -r https://gist.githubusercontent.com/JoostvDoorn/9c0f63ed5198544a36b477502eeac4fb/raw/f4aec15dd827c6480790099526861b4dc4d6b1ee/requirements.txt --style universal --pip-version latest --resolver-version pip-2020-resolver --target-system linux --target-system mac --interpreter-constraint "==3.10.*" --indent 2 -o lock.json
$ rm -rf ~/.pex && time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex: Downloading FileArtifact(url='https://files.pythonhosted.org/packages/a8/cc/c36f3439f5d47c3b13833ce6687b43a040cc7638c502ac46b41e2d4f3d6f/scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', fingerprint=Fingerprint(algorithm='sha256', hash='d690e1ca993c8f7ede6d22e5637541217fc6a4d3f78b3672a6fe454dbb7eb9a7'), verified=False, filename='scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl'): 9482.3ms
pex: Downloading FileArtifact(url='https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl', fingerprint=Fingerprint(algorithm='sha256', hash='8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a'), verified=False, filename='torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl'): 54308.9ms
...
pex:         Installing 58 distributions: 11351.9ms
Saving PEX file to example.pex

real    2m7.001s
user    2m11.010s
sys     0m12.343s

Now, this is on my home network, but ~54s (1/2 the total time) just to download torch. As a sanity check I find:

$ time curl -sSL https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl -O

real    0m42.055s
user    0m0.387s
sys     0m0.627s
$ time curl -sSL https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl -O

real    0m42.124s
user    0m0.485s
sys     0m0.636s

So the contended (parallel) Pip download that Pex does is slower than a straight curl of that one wheel, but not by much. Pex absolutely cannot speed that up in any way besides caching, and this use case rules out a cache.

With a cache (re-running the last command):

$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 1586.7ms
Saving PEX file to example.pex

real    0m2.347s
user    0m1.810s
sys     0m0.350s

... more analysis to come re the remaining ~60s.

benjyw (Contributor) commented Aug 29, 2023

We can, however, pre-seed the venv, so that at least the test timeout only applies to the pytest run, and not the venv seeding. This would solve the narrow problem the OP started with, and can be done entirely in Pants.

jsirois (Contributor) commented Aug 29, 2023

Sure - yes. That saves no time, but it avoids the timeout. I'll continue to dig into the raw performance on the Pex side to see what can be juiced.

JoostvDoorn (Author) commented Aug 29, 2023

> Now, this is on my home network, but ~54s (1/2 the total time) just to download torch. Pex absolutely cannot speed that up in any way besides caching, and this use case rules out a cache.

Fair enough, but we have a lot of code that depends on torch, and we only need to download torch once per CI run for our lockfile, whereas multiple different pex files will be generated and (unzipped?) as part of their first pex execution. This can be limited with batch_compatibility_tag, but there could still be 5-15 batches running that include torch as a dependency, each having to perform fairly similar bootstrapping of the venv.

jsirois (Contributor) commented Aug 29, 2023

Ok, the lion's share (~90%) of the remaining time is creating the (compressed) --layout packed wheel zips.

  1. Avoid re-packing:
$ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs}
$ du -shc ~/.pex/*
709M    /home/jsirois/.pex/downloads
709M    /home/jsirois/.pex/packed_wheels
743M    /home/jsirois/.pex/pip
2.2G    total
$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 10896.7ms
Saving PEX file to example.pex
Previous binary unexpectedly exists, cleaning: example.pex

real    0m12.915s
user    0m42.100s
sys     0m5.047s
  2. Force re-packing with compression (the default - Pants does this):
$ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs,packed_wheels}
$ du -shc ~/.pex/*
709M    /home/jsirois/.pex/downloads
743M    /home/jsirois/.pex/pip
1.5G    total
$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 10871.3ms
Saving PEX file to example.pex

real    1m8.286s
user    1m35.345s
sys     0m6.424s
  3. Force re-packing without compression:
$ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs,packed_wheels}
$ du -shc ~/.pex/*
709M    /home/jsirois/.pex/downloads
743M    /home/jsirois/.pex/pip
1.5G    total
$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex --no-compress
...
pex:         Installing 58 distributions: 10971.2ms
Saving PEX file to example.pex

real    0m16.502s
user    0m45.362s
sys     0m6.502s

So the overall time is dominated by two long poles: the download of torch (within a parallel download of all needed wheels) and the zip compression of the packed wheel zips. The latter is serial, so it could be parallelized - but stepping back and trying a native zip of just the torch installed wheel shows:

$ time zip -r --quiet torch-packed.zip ~/.pex/installed_wheels/8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl/

real    0m47.460s
user    0m46.773s
sys     0m0.680s

So there is not much to be gained there in this case.
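
To make the compression trade-off above concrete, here is a small standalone sketch (paths are hypothetical; point it at any large directory) that times deflated vs. stored zipping with the stdlib zipfile module:

```
# Hypothetical timing sketch for compressed vs. uncompressed ("stored") zips.
import time
import zipfile
from pathlib import Path


def zip_tree(src: Path, dst: Path, compression: int) -> float:
    start = time.monotonic()
    with zipfile.ZipFile(dst, "w", compression=compression) as zf:
        for path in sorted(src.rglob("*")):
            zf.write(path, str(path.relative_to(src)))
    return time.monotonic() - start


src = Path.home() / ".pex" / "installed_wheels"  # any large directory works for the comparison
print("deflated:", zip_tree(src, Path("packed-deflated.zip"), zipfile.ZIP_DEFLATED))
print("stored:  ", zip_tree(src, Path("packed-stored.zip"), zipfile.ZIP_STORED))
```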

jsirois (Contributor) commented Aug 29, 2023

@JoostvDoorn I'll let you continue with @benjyw down the path of not speeding things up in reality, but avoiding the test timeout with accounting tricks. To meaningfully speed venv creation up for this example, you definitely need to allow for caching something.

benjyw (Contributor) commented Aug 29, 2023

> Fair enough, but we have a lot of code that depends on torch, and we only need to download torch once per CI run for our lockfile, whereas multiple different pex files will be generated and (unzipped?) as part of their first pex execution. This can be limited with batch_compatibility_tag, but there could still be 5-15 batches running that include torch as a dependency, each having to perform fairly similar bootstrapping of the venv.

You can control the batch size, to trade off concurrency vs this overhead.

benjyw (Contributor) commented Aug 29, 2023

This is where @jriddy's work on "provided dependencies" might be valuable. If your CI image could provide torch preinstalled in site-packages...

jsirois (Contributor) commented Aug 29, 2023

I think that work brings us full circle: pex-tool/pex#2097
@jriddy will have to grapple with that to make "provided dependencies" work for the Pants Python stack.

jsirois (Contributor) commented Aug 29, 2023

I think what this really says is that Pants probably needs to stop playing whack-a-mole and instead have someone fully embrace being a torch user, look at the shortcomings holistically, and actually solve them all - from lock issues to speed issues.

huonw added the "backend: Python" (Python backend-related issues) label on Aug 29, 2023
jriddy (Contributor) commented Aug 30, 2023

> I think that work brings us full circle: pex-tool/pex#2097

Yeah, I'm aware of that one. I want to get to it; I'm just trying to find the time.

> instead have someone fully embrace being a torch user, look at the shortcomings holistically, and actually solve them all

I'm not fully sure what you mean by this, but in my experience the overlap between the actual users of torch and people who have any interest in deterministic builds is pretty small. Often ML teams have a handful of enabler engineers, with far less background in ML, who are trying to build tooling and pipelines and generally hold the whole dev-shop side of things together. So it will take that rare someone who is both a genuine user of torch and willing to wade through the nitty-gritty of weird native code + python packaging to understand the problem holistically.

I think "provided dependencies" + environments is probably a better way to handle this longer term, because torch and tensorflow aren't the only things like this. So giving some way to provide something resembling determinism while acquiescing to the reality of how some packages/ecosystems/organizations work seems to be a reasonable way out for these cases that we can never really solve perfectly through the pip ecosystem.

jsirois (Contributor) commented Aug 30, 2023

> I'm not fully sure what you mean by this, but in my experience the overlap between the actual users of torch and people who have any interest in deterministic builds is pretty small. Often ML teams have a handful of enabler engineers, with far less background in ML, who are trying to build tooling and pipelines and generally hold the whole dev-shop side of things together. So it will take that rare someone who is both a genuine user of torch and willing to wade through the nitty-gritty of weird native code + python packaging to understand the problem holistically.

I mean just that - Pants has historically been composed of committers who basically have no clue about the things they're integrating. For example, I learned Python packaging on the job over the course of ~10 years; I had no clue going in. We have had, FWICT, no Pants committer dig in and learn / live ML to produce a useful experience. I agree this is hard; I'm just pointing out it's a huge problem and has been for a very long time. Pants has Pants experts and next to no experts in the problem domains of Pants use cases. It's very hard being a middleman tool without having the middleman expertise, but that is the position Pants is in. Much better would be if there were a way to get out of the middle somehow; then the ecosystem expert could know little about Pants but leverage their ecosystem. Right now it's a bit of the worst of both worlds - you have to know both worlds to do good work.

jriddy (Contributor) commented Aug 30, 2023 via email

benjyw (Contributor) commented Aug 30, 2023

> I mean just that - Pants has historically been composed of committers who basically have no clue about the things they're integrating. For example, I learned Python packaging on the job over the course of ~10 years; I had no clue going in. We have had, FWICT, no Pants committer dig in and learn / live ML to produce a useful experience. I agree this is hard; I'm just pointing out it's a huge problem and has been for a very long time. Pants has Pants experts and next to no experts in the problem domains of Pants use cases. It's very hard being a middleman tool without having the middleman expertise, but that is the position Pants is in. Much better would be if there were a way to get out of the middle somehow; then the ecosystem expert could know little about Pants but leverage their ecosystem. Right now it's a bit of the worst of both worlds - you have to know both worlds to do good work.

100%. This has been one of the biggest challenges of expanding Pants into use-cases the current maintainers don't understand. To make a useful plugin for X you need to know X and Pants, and that intersection is small to empty for most X.

benjyw (Contributor) commented Aug 30, 2023

Specifically for the initial problem of the test timeout being consumed by the venv creation overhead: @JoostvDoorn , do you want to take a crack at that?

The idea is: right before running pytest we run a process in the same venv (so same params as the pytest process) that does enough work to create the venv, but doesn't actually run the test.

Actually, now that I think of it, this is similar to #19430. Maybe we wait for that one, then unify into one "force venv to materialize" rule.
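
A conceptual sketch of that pre-seeding step (not the actual Pants rule API; the pex filename and test path are hypothetical): run the test-runner PEX once as a bare interpreter so its venv gets materialized, then run pytest with the timeout covering only the second step.

```
# Hypothetical sketch: pre-seed the venv, then time only the pytest run.
import os
import subprocess

PYTEST_RUNNER_PEX = "./pytest_runner.pex"  # hypothetical sandbox path

# Step 1: any execution of a --venv PEX first seeds its venv; PEX_INTERPRETER=1
# makes the PEX behave like a plain interpreter, so `-c pass` does no other work.
subprocess.run(
    [PYTEST_RUNNER_PEX, "-c", "pass"],
    env={**os.environ, "PEX_INTERPRETER": "1"},
    check=True,
)

# Step 2: the real test run; the timeout now measures only pytest itself.
subprocess.run([PYTEST_RUNNER_PEX, "test_file.py"], check=True, timeout=15)
```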

JoostvDoorn (Author) commented:

> Specifically for the initial problem of the test timeout being consumed by the venv creation overhead: @JoostvDoorn , do you want to take a crack at that?

> Actually, now that I think of it, this is similar to #19430. Maybe we wait for that one, then unify into one "force venv to materialize" rule.

Seems doable from my side. I will have a look at this sometime after #19430 is merged.

benjyw (Contributor) commented Oct 2, 2023

OK! So #19430 is now merged, and I think we want to generalize it to a "force a venv to exist" rule that we don't cache. See the conversation in that PR for more context.

huonw added a commit that referenced this issue Jan 9, 2024
All changes:

- https://github.com/pantsbuild/pex/releases/tag/v2.1.153
- https://github.com/pantsbuild/pex/releases/tag/v2.1.154
- https://github.com/pantsbuild/pex/releases/tag/v2.1.155

Highlights:

- `--no-pre-install-wheels` (and `--max-install-jobs`) that likely helps with:
  - #15062
  - (the root cause of) #20227
  - _maybe_ arguably #18293, #18965, #19681
- improved shebang selection, helping with #19514, but probably not the full solution (#19925)
- performance improvements
huonw added a commit that referenced this issue Jan 11, 2024
https://github.com/pantsbuild/pex/releases/tag/v2.1.156

Continuing from #20347, this brings additional performance optimisations, particularly for large wheels like PyTorch, and so may help with #18293, #18965, #19681
cburroughs added a commit that referenced this issue Jul 23, 2024
Highlighted changelogs:
 * https://github.com/pex-tool/pex/releases/tag/v2.4.0
 * https://github.com/pex-tool/pex/releases/tag/v2.4.1
 * https://github.com/pex-tool/pex/releases/tag/v2.5.0
 * https://github.com/pex-tool/pex/releases/tag/v2.6.0
 * https://github.com/pex-tool/pex/releases/tag/v2.7.0
 * https://github.com/pex-tool/pex/releases/tag/v2.8.0
 * https://github.com/pex-tool/pex/releases/tag/v2.8.1
 * https://github.com/pex-tool/pex/releases/tag/v2.9.0
 * https://github.com/pex-tool/pex/releases/tag/v2.10.0
 * https://github.com/pex-tool/pex/releases/tag/v2.10.1
 * https://github.com/pex-tool/pex/releases/tag/v2.11.0 (!)

```
Lockfile diff: 3rdparty/python/user_reqs.lock [python-default]

==                    Upgraded dependencies                     ==

  certifi                        2024.6.2     -->   2024.7.4
  exceptiongroup                 1.2.1        -->   1.2.2
  pex                            2.3.3        -->   2.11.0
  pydantic                       1.10.16      -->   1.10.17
```

NOTE: This updates the scrupulously backwards compatible Pex, but does
not use any new features yet. The minimum version is unchanged.

Possibly relevant for: #15704 #21103 #20852 #15454 #21100 #11324 #19256 #19681