
Pants test timeout includes venv creation overhead causing timeouts #19681

Open
JoostvDoorn opened this issue Aug 28, 2023 · 28 comments
Labels: backend: Python (Python backend-related issues), bug

Comments

JoostvDoorn commented Aug 28, 2023

Describe the bug
When a python_tests target has heavy dependencies (e.g. pytorch), a test can time out on a "cold" execution, because the first execution performs additional work to create the venv. This overhead can be significant: we observe a 'test time' of >30s for a test that takes 3s with a warm cache.

One solution would be to exclude this venv creation time from the test timeout and only measure the actual execution time of the python process. This would make the timeout argument for python_tests more useful when large, complex dependencies are used.

I am not sure whether there is a performance component that can be solved; if so, that should probably become an issue in the pex repository.

Pants version
2.16.0 and 2.17.0rc5

OS
MacOS + Linux

Additional info
See gist for example:
https://gist.github.com/JoostvDoorn/9c0f63ed5198544a36b477502eeac4fb

To test:

rm -rf ~/.cache/pants/named_caches/
pants test :: --keep-sandboxes=on_failure --test-force
✓ //test_file.py:root succeeded in 23.09s.

Second execution using pants test :: --keep-sandboxes=on_failure --test-force is significantly faster:

✓ //test_file.py:root succeeded in 3.12s.

The bug here is that the test execution timing should reflect the actual time it takes to run the test, not the time spent creating the virtual environment.

For the gist example, the expectation is that the following would always work regardless of the state of the cache:

python_tests(
    name="root",
    timeout=15,
)
benjyw (Contributor) commented Aug 28, 2023

The timeout (as set globally or on individual python_tests targets) applies only to the pytest process itself, not to any other processes (see here).

So is this extra overhead happening within that process?

benjyw (Contributor) commented Aug 28, 2023

Hmm, but I can reproduce what you're talking about, with your gist.

benjyw (Contributor) commented Aug 28, 2023

OK, so this is the cost of seeding the venv from the pex, and does indeed happen in the pytest process. So there is no easy way to avoid this, short of pre-seeding the venv in a separate process.

benjyw (Contributor) commented Aug 29, 2023

I think the problem is wider than just timeouts though. If each test constructs its own bespoke venv, but they all include pytorch, each test will incur this penalty on first run. And in CI first run is the only run.

One solution, when there are many tests, is to run tests in batches: https://www.pantsbuild.org/docs/python-test-goal#batching-and-parallelism

This doesn't help the single test cold start time, but it amortizes that cold start over many tests.
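
For illustration, here is a minimal BUILD sketch of opting tests into batching (the tag value "torch-suite" is an arbitrary example, not anything prescribed by Pants); python_tests targets sharing a batch_compatibility_tag and otherwise compatible settings may be merged into one pytest run, so the venv seeding cost is paid once per batch rather than once per test:

```
# BUILD — a sketch; "torch-suite" is just an example tag value.
python_tests(
    name="tests",
    # Tests sharing this tag (and otherwise compatible settings) may be run
    # together in one batched pytest process, so the venv is seeded once per batch.
    batch_compatibility_tag="torch-suite",
)
```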

JoostvDoorn (Author) commented:

> I think the problem is wider than just timeouts though. If each test constructs its own bespoke venv, but they all include pytorch, each test will incur this penalty on first run. And in CI first run is the only run.

You're completely right; on the user side we can take some steps to alleviate this. We often apply the batch_compatibility_tag as well to reduce the overhead for many of our tests, and it might be a good idea for us to adjust that usage to better address this concern for pytorch (and others). In addition, we can raise minimum timeouts to a relatively high number, and we could rely on https://pypi.org/project/pytest-timeout/ as an alternative way to control test timeouts.
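
As a sketch of that pytest-timeout alternative (assuming the plugin is installed and registered with pytest; the test file and names here are hypothetical): its clock only starts once pytest itself is running, so venv seeding is not counted against the limit, unlike the Pants timeout field.

```
# test_example.py — hypothetical test using pytest-timeout's marker.
import time

import pytest


@pytest.mark.timeout(15)  # fail this test if it runs longer than 15s inside pytest
def test_heavy_import_is_fast_enough():
    time.sleep(1)  # stand-in for real work
    assert True
```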

> OK, so this is the cost of seeding the venv from the pex, and does indeed happen in the pytest process. So there is no easy way to avoid this, short of pre-seeding the venv in a separate process.

Maybe @jsirois has some idea? I wonder whether the venv construction for large library dependencies can somehow be optimized.

jsirois (Contributor) commented Aug 29, 2023

Have you tried setting this to True?: https://www.pantsbuild.org/docs/reference-pex#venv_use_symlinks

JoostvDoorn (Author) commented Aug 29, 2023

This doesn't seem to make a difference in this case.

> rm -rf /Users/joostvandoorn/.cache/pants/named_caches/
> pants test :: --keep-sandboxes=on_failure --test-force --pex-venv-use-symlinks
11:56:42.56 [INFO] Completed: Run Pytest - //test_file.py:root - succeeded.

✓ //test_file.py:root succeeded in 23.41s.

jsirois (Contributor) commented Aug 29, 2023

Ok, that makes sense from a clean Pex cache. I'm not sure there is any way to speed up venv creation from a cold start, save for populating the venv site-packages in parallel; that step is currently serial. I can give that a quick try in the morning and report timing differences for the requirement set in your gist.
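
For context, a rough sketch (not Pex's actual internals; the function name and paths are hypothetical) of what parallel site-packages population could look like - copying each cached installed-wheel directory into the venv concurrently instead of serially:

```
# Hypothetical sketch of parallel venv seeding; not Pex code.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def seed_site_packages(installed_wheel_dirs: list[Path], site_packages: Path, jobs: int = 8) -> None:
    """Copy each cached installed-wheel directory's contents into site-packages, in parallel."""

    def copy_one(wheel_dir: Path) -> None:
        # dirs_exist_ok=True merges each wheel's files into the shared site-packages tree.
        shutil.copytree(wheel_dir, site_packages, dirs_exist_ok=True)

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Materialize the iterator so any exception raised by a copy surfaces here.
        list(pool.map(copy_one, installed_wheel_dirs))
```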

benjyw (Contributor) commented Aug 29, 2023

PS to get the gist to work, I had to add

[python]
interpreter_constraints = ["==3.10.*"]

2.16.x still has broad default ICs, which in my case captured a 3.7 interpreter, which is incompatible with the fsspec>=2023.6.0 requirement.

jsirois (Contributor) commented Aug 29, 2023

Ok, thanks. I'm going to cut out Pants in my testing, but the IC will still be relevant.

benjyw (Contributor) commented Aug 29, 2023

> --pex-venv-use-symlinks

Not that this helps in CI, but you should see a difference if your nuking command is rm -rf ~/.cache/pants/named_caches/pex_root/venvs/, because then the symlink farm can still be efficiently recreated, as the symlink targets in ~/.cache/pants/named_caches/pex_root/installed_wheels/ still exist.

That should give you a sense of what's going on.

jsirois (Contributor) commented Aug 29, 2023

I mean, stopping real fast here, just looking at the parallel downloads from the lock file:

$ pex3 lock create -r https://gist.githubusercontent.com/JoostvDoorn/9c0f63ed5198544a36b477502eeac4fb/raw/f4aec15dd827c6480790099526861b4dc4d6b1ee/requirements.txt --style universal --pip-version latest --resolver-version pip-2020-resolver --target-system linux --target-system mac --interpreter-constraint "==3.10.*" --indent 2 -o lock.json
$ rm -rf ~/.pex && time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex: Downloading FileArtifact(url='https://files.pythonhosted.org/packages/a8/cc/c36f3439f5d47c3b13833ce6687b43a040cc7638c502ac46b41e2d4f3d6f/scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', fingerprint=Fingerprint(algorithm='sha256', hash='d690e1ca993c8f7ede6d22e5637541217fc6a4d3f78b3672a6fe454dbb7eb9a7'), verified=False, filename='scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl'): 9482.3ms
pex: Downloading FileArtifact(url='https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl', fingerprint=Fingerprint(algorithm='sha256', hash='8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a'), verified=False, filename='torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl'): 54308.9ms
...
pex:         Installing 58 distributions: 11351.9ms
Saving PEX file to example.pex

real    2m7.001s
user    2m11.010s
sys     0m12.343s

Now, this is on my home network, but ~54s (1/2 the total time) just to download torch. As a sanity check I find:

$ time curl -sSL https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl -O

real    0m42.055s
user    0m0.387s
sys     0m0.627s
$ time curl -sSL https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl -O

real    0m42.124s
user    0m0.485s
sys     0m0.636s

So the contended (parallel) Pip download that Pex does is slower than a straight curl of that one wheel, but not by much. Pex absolutely cannot speed that up in any way besides caching, and this use case rules out a cache.

With a cache (re-running the last command):

$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 1586.7ms
Saving PEX file to example.pex

real    0m2.347s
user    0m1.810s
sys     0m0.350s

... more analysis to come re the remaining ~60s.

benjyw (Contributor) commented Aug 29, 2023

We can, however, pre-seed the venv, so that at least the test timeout only applies to the pytest run, and not the venv seeding. This would solve the narrow problem the OP started with, and can be done entirely in Pants.

jsirois (Contributor) commented Aug 29, 2023

Sure - yes. That saves no time, but it avoids the timeout. I'll continue to dig into the raw performance on the Pex side to see what can be juiced.

JoostvDoorn (Author) commented Aug 29, 2023

> Now, this is on my home network, but ~54s (1/2 the total time) just to download torch. Pex absolutely cannot speed that up in any way besides caching, and this use case rules out a cache.

Fair enough, but we have a lot of code that depends on torch, and we only need to download torch once per CI run for our lockfile, whereas multiple different pex files will be generated and (unzipped?) as part of their first pex execution. This can be limited with batch_compatibility_tag, but there could still be 5-15 batches running that include torch as a dependency, each having to perform fairly similar bootstrapping of the venv.

jsirois (Contributor) commented Aug 29, 2023

Ok, the lion's share (~90%) of the remaining time is creating the (compressed) --layout packed wheel zips.

  1. Avoid re-packing:
$ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs}
$ du -shc ~/.pex/*
709M    /home/jsirois/.pex/downloads
709M    /home/jsirois/.pex/packed_wheels
743M    /home/jsirois/.pex/pip
2.2G    total
$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 10896.7ms
Saving PEX file to example.pex
Previous binary unexpectedly exists, cleaning: example.pex

real    0m12.915s
user    0m42.100s
sys     0m5.047s
  2. Force re-packing with compression (the default - Pants does this):
$ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs,packed_wheels}
$ du -shc ~/.pex/*
709M    /home/jsirois/.pex/downloads
743M    /home/jsirois/.pex/pip
1.5G    total
$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 10871.3ms
Saving PEX file to example.pex

real    1m8.286s
user    1m35.345s
sys     0m6.424s
  3. Force re-packing without compression:
$ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs,packed_wheels}
$ du -shc ~/.pex/*
709M    /home/jsirois/.pex/downloads
743M    /home/jsirois/.pex/pip
1.5G    total
$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex --no-compress
...
pex:         Installing 58 distributions: 10971.2ms
Saving PEX file to example.pex

real    0m16.502s
user    0m45.362s
sys     0m6.502s

So the overall time is dominated by two long poles: the download of torch (within a parallel download of all needed wheels) and the zip compression of the packed wheel zips. The latter is serial, so it could be parallelized - but stepping back and trying a native zip of just the torch installed wheel shows:

$ time zip -r --quiet torch-packed.zip ~/.pex/installed_wheels/8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl/

real    0m47.460s
user    0m46.773s
sys     0m0.680s

So there is not much to be gained there in this case.
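
To make the compression trade-off above concrete, here is a small standalone sketch (paths are hypothetical; point it at any large directory) that times deflated vs. stored zipping with the stdlib zipfile module:

```
# Hypothetical timing sketch for compressed vs. uncompressed ("stored") zips.
import time
import zipfile
from pathlib import Path


def zip_tree(src: Path, dst: Path, compression: int) -> float:
    start = time.monotonic()
    with zipfile.ZipFile(dst, "w", compression=compression) as zf:
        for path in sorted(src.rglob("*")):
            zf.write(path, str(path.relative_to(src)))
    return time.monotonic() - start


src = Path.home() / ".pex" / "installed_wheels"  # any large directory works for the comparison
print("deflated:", zip_tree(src, Path("packed-deflated.zip"), zipfile.ZIP_DEFLATED))
print("stored:  ", zip_tree(src, Path("packed-stored.zip"), zipfile.ZIP_STORED))
```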

jsirois (Contributor) commented Aug 29, 2023

@JoostvDoorn I'll let you continue with @benjyw down the path of not speeding things up in reality, but avoiding the test timeout with accounting tricks. To meaningfully speed venv creation up for this example, you definitely need to allow for caching something.

benjyw (Contributor) commented Aug 29, 2023

> Fair enough, but we have a lot of code that depends on torch, and we only need to download torch once per CI run for our lockfile, whereas multiple different pex files will be generated and (unzipped?) as part of their first pex execution. This can be limited with batch_compatibility_tag, but there could still be 5-15 batches running that include torch as a dependency, each having to perform fairly similar bootstrapping of the venv.

You can control the batch size, to trade off concurrency vs this overhead.

benjyw (Contributor) commented Aug 29, 2023

This is where @jriddy's work on "provided dependencies" might be valuable. If your CI image could provide torch preinstalled in site-packages...

jsirois (Contributor) commented Aug 29, 2023

I think that work brings us full circle: pex-tool/pex#2097
@jriddy will have to grapple with that to make "provided dependencies" work for the Pants Python stack.

jsirois (Contributor) commented Aug 29, 2023

I think what this really says is that Pants probably needs to stop playing whack-a-mole and instead have someone fully embrace being a torch user, look at the shortcomings holistically, and actually solve them all - from lock issues to speed issues.

huonw added the "backend: Python" (Python backend-related issues) label on Aug 29, 2023
jriddy (Contributor) commented Aug 30, 2023

> I think that work brings us full circle: pex-tool/pex#2097

Yeah, I'm aware of that one. I want to get to it; I'm just trying to find the time.

> instead have someone fully embrace being a torch user, look at the shortcomings holistically, and actually solve them all

I'm not fully sure what you mean by this, but in my experience the overlap between the actual users of torch and people who have any interest in deterministic builds is pretty small. Often ML teams have a handful of enabler engineers, with far less background in ML, who are trying to build tooling and pipelines and generally hold the whole dev-shop side of things together. So it will take that rare someone who is both a genuine user of torch and willing to wade through the nitty-gritty of weird native code + python packaging to understand the problem holistically.

I think "provided dependencies" + environments is probably a better way to handle this longer term, because torch and tensorflow aren't the only things like this. So giving some way to provide something resembling determinism while acquiescing to the reality of how some packages/ecosystems/organizations work seems to be a reasonable way out for these cases that we can never really solve perfectly through the pip ecosystem.

jsirois (Contributor) commented Aug 30, 2023

> I'm not fully sure what you mean by this, but in my experience the overlap between the actual users of torch and people who have any interest in deterministic builds is pretty small. Often ML teams have a handful of enabler engineers, with far less background in ML, who are trying to build tooling and pipelines and generally hold the whole dev-shop side of things together. So it will take that rare someone who is both a genuine user of torch and willing to wade through the nitty-gritty of weird native code + python packaging to understand the problem holistically.

I mean just that - Pants has historically been composed of committers who basically have no clue about the things they're integrating. For example, I learned Python packaging on the job over the course of ~10 years; I had no clue going in. We have had, FWICT, no Pants committer dig in and learn / live ML to produce a useful experience. I agree this is hard; I'm just pointing out it's a huge problem and has been for a very long time. Pants has Pants experts and next to no experts in the problem domains of Pants use cases. It's very hard being a middleman tool without having the middleman expertise, but that is the position Pants is in. Much better would be if there were a way to get out of the middle somehow; then the ecosystem expert could know little about Pants but leverage their ecosystem. Right now it's a bit of the worst of both worlds - you have to know both worlds to do good work.

jriddy (Contributor) commented Aug 30, 2023 via email

benjyw (Contributor) commented Aug 30, 2023

> I mean just that - Pants has historically been composed of committers who basically have no clue about the things they're integrating. For example, I learned Python packaging on the job over the course of ~10 years; I had no clue going in. We have had, FWICT, no Pants committer dig in and learn / live ML to produce a useful experience. I agree this is hard; I'm just pointing out it's a huge problem and has been for a very long time. Pants has Pants experts and next to no experts in the problem domains of Pants use cases. It's very hard being a middleman tool without having the middleman expertise, but that is the position Pants is in. Much better would be if there were a way to get out of the middle somehow; then the ecosystem expert could know little about Pants but leverage their ecosystem. Right now it's a bit of the worst of both worlds - you have to know both worlds to do good work.

100%. This has been one of the biggest challenges of expanding Pants into use-cases the current maintainers don't understand. To make a useful plugin for X you need to know X and Pants, and that intersection is small to empty for most X.

benjyw (Contributor) commented Aug 30, 2023

Specifically for the initial problem of the test timeout being consumed by the venv creation overhead: @JoostvDoorn , do you want to take a crack at that?

The idea is: right before running pytest we run a process in the same venv (so same params as the pytest process) that does enough work to create the venv, but doesn't actually run the test.

Actually, now that I think of it, this is similar to #19430. Maybe we wait for that one, then unify into one "force venv to materialize" rule.
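
A conceptual sketch of that pre-seeding step (not the actual Pants rule API; the pex filename and test path are hypothetical): run the test-runner PEX once as a bare interpreter so its venv gets materialized, then run pytest with the timeout covering only the second step.

```
# Hypothetical sketch: pre-seed the venv, then time only the pytest run.
import os
import subprocess

PYTEST_RUNNER_PEX = "./pytest_runner.pex"  # hypothetical sandbox path

# Step 1: any execution of a --venv PEX first seeds its venv; PEX_INTERPRETER=1
# makes the PEX behave like a plain interpreter, so `-c pass` does no other work.
subprocess.run(
    [PYTEST_RUNNER_PEX, "-c", "pass"],
    env={**os.environ, "PEX_INTERPRETER": "1"},
    check=True,
)

# Step 2: the real test run; the timeout now measures only pytest itself.
subprocess.run([PYTEST_RUNNER_PEX, "test_file.py"], check=True, timeout=15)
```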

JoostvDoorn (Author) commented:

> Specifically for the initial problem of the test timeout being consumed by the venv creation overhead: @JoostvDoorn , do you want to take a crack at that?

> Actually, now that I think of it, this is similar to #19430. Maybe we wait for that one, then unify into one "force venv to materialize" rule.

Seems doable from my side. I will have a look at this sometime after #19430 is merged.

benjyw (Contributor) commented Oct 2, 2023

OK! So #19430 is now merged, and I think we want to generalize it to a "force a venv to exist" rule that we don't cache. See the conversation in that PR for more context.

huonw added a commit that referenced this issue Jan 9, 2024
All changes:

- https://github.com/pantsbuild/pex/releases/tag/v2.1.153
- https://github.com/pantsbuild/pex/releases/tag/v2.1.154
- https://github.com/pantsbuild/pex/releases/tag/v2.1.155

Highlights:

- `--no-pre-install-wheels` (and `--max-install-jobs`) that likely helps with:
  - #15062
  - (the root cause of) #20227
  - _maybe_ arguably #18293, #18965, #19681
- improved shebang selection, helping with #19514, but probably not the full solution (#19925)
- performance improvements
huonw added a commit that referenced this issue Jan 11, 2024
https://github.com/pantsbuild/pex/releases/tag/v2.1.156

Continuing from #20347, this brings additional performance optimisations, particularly for large wheels like PyTorch, and so may help with #18293, #18965, #19681
cburroughs added a commit that referenced this issue Jul 23, 2024
Highlighted changelogs:
 * https://github.com/pex-tool/pex/releases/tag/v2.4.0
 * https://github.com/pex-tool/pex/releases/tag/v2.4.1
 * https://github.com/pex-tool/pex/releases/tag/v2.5.0
 * https://github.com/pex-tool/pex/releases/tag/v2.6.0
 * https://github.com/pex-tool/pex/releases/tag/v2.7.0
 * https://github.com/pex-tool/pex/releases/tag/v2.8.0
 * https://github.com/pex-tool/pex/releases/tag/v2.8.1
 * https://github.com/pex-tool/pex/releases/tag/v2.9.0
 * https://github.com/pex-tool/pex/releases/tag/v2.10.0
 * https://github.com/pex-tool/pex/releases/tag/v2.10.1
 * https://github.com/pex-tool/pex/releases/tag/v2.11.0 (!)

```
Lockfile diff: 3rdparty/python/user_reqs.lock [python-default]

==                    Upgraded dependencies                     ==

  certifi                        2024.6.2     -->   2024.7.4
  exceptiongroup                 1.2.1        -->   1.2.2
  pex                            2.3.3        -->   2.11.0
  pydantic                       1.10.16      -->   1.10.17
```

NOTE: This updates the scrupulously backwards compatible Pex, but does
not use any new features yet. The minimum version is unchanged.

Possibly relevant for: #15704 #21103 #20852 #15454 #21100 #11324 #19256 #19681