
Optimise Scenario #1434

Closed

PietroPasotti opened this issue Oct 11, 2024 · 10 comments
Labels: 25.04 · testing (Related to ops.testing)

PietroPasotti (Contributor) commented Oct 11, 2024

The unit test suite for traefik is probably the largest Scenario test battery around.
Today I ran it and realized it took over a minute and a half to complete, so I decided to profile it.

This is the result:
[profiling output image omitted]

To produce the profile, run with this tox env:

```ini
[testenv:profile-scenario-tests]
description = Profile scenario unit test battery
deps =
    pytest
    pytest-profiling
    ops[testing]>=2.17
    -r{toxinidir}/requirements.txt
commands =
    pytest -v --tb native {[vars]tst_path}/scenario --profile --profile-svg --log-cli-level=INFO -s {posargs}
```
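(With this env in the charm's tox.ini, `tox -e profile-scenario-tests` runs the suite under the profiler; pytest-profiling writes its output, including the combined SVG call graph, to a `prof/` directory by default.)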

There are some obvious candidates for optimization:

  • using an in-memory mock for the sqlite db instead of using the real thing could shave off a good chunk of time spent in pointless IO
  • A ridiculous amount of time is spent in State.__new__. Can we do something about that?
  • how come mocking juju-log takes so long?
  • A single list comprehension in state.py:143 takes 2 seconds of our time: can we lazify some of the code perhaps?

Profiling Scenario's own test suite yields a very similar picture.
[profiling output image omitted]

PietroPasotti changed the title from “optimize scenario tests” to “optimize scenario” on Oct 11, 2024
tonyandrewmeyer (Contributor) commented:

> using an in-memory mock for the sqlite db instead of using the real thing could shave off a good chunk of time spent in pointless IO

This indeed seems like an obvious win. Presumably we can just pass in `:memory:` instead of the unit state filename: trivial to do, and since we're not testing sqlite there are no downsides.
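As a minimal illustration of the idea (a sketch only, not the actual Scenario internals, where the path is plumbed through the storage layer):

```python
import sqlite3

# Backing the unit state with a file means every commit hits the disk:
disk_conn = sqlite3.connect("/tmp/unit-state.db")

# sqlite's special ":memory:" path keeps the whole database in RAM,
# so there is no file IO at all - ideal for throwaway test state:
mem_conn = sqlite3.connect(":memory:")
```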

> A ridiculous amount of time is spent in State.__new__. Can we do something about that?

> A single list comprehension in state.py:143 takes 2 seconds of our time: can we lazify some of the code perhaps?

Both of these are the positional/keyword argument processing. I wonder if we can find a tidy way to have newer Python use the built-in support (`dataclasses.KW_ONLY`) and only fall back to this code on old Python.
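Roughly this shape, perhaps (a sketch only; `State` here is a stand-in, not the real class):

```python
import dataclasses
import sys

if sys.version_info >= (3, 10):

    @dataclasses.dataclass(frozen=True)
    class State:
        # Fields after the KW_ONLY sentinel must be passed by keyword,
        # enforced by the generated __init__ at no extra runtime cost.
        _: dataclasses.KW_ONLY
        config: dict = dataclasses.field(default_factory=dict)
        leader: bool = False

else:
    # On 3.8/3.9, fall back to the existing _max_posargs machinery.
    ...
```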

> how come mocking juju-log takes so long?

I wonder if this is the same issue: lots of log lines, and each one involves creating a `JujuLogLine` object.

benhoyt (Collaborator) commented Oct 13, 2024

> Presumably we can just pass in `:memory:` instead of the unit state filename - trivial to do, and we're not testing sqlite so no downsides.

Yeah, this is probably the single biggest win. I made this change locally when running `tox -e scenario` on the traefik-k8s-operator tests. My times are different, but I get about 35s without the change and 20s with it, so a nice improvement! FWIW, it looks like Harness already uses `:memory:`.

Happy for us to spend a few hours of rainy-day time looking at the other bottlenecks too. (Let's just be careful not to get too carried away on the long tail...)

tonyandrewmeyer changed the title from “optimize scenario” to “Optimise Scenario” on Oct 13, 2024
benhoyt added the “testing” (Related to ops.testing) label on Oct 16, 2024
james-garner-canonical (Contributor) commented:

Alex Batisse mentioned that Scenario tests take much longer and the culprit is YAML?
Maybe they just need the compiled extension (cyaml?)

Batalex (Contributor) commented Oct 31, 2024

> Alex Batisse mentioned that Scenario tests take much longer and the culprit is YAML? Maybe they just need the compiled extension (cyaml?)

I checked, and I do have libyaml bundled with/used by pyyaml. When I profiled my tests, I noticed that they were spending a long time serializing YAML in `scenario/runtime:Runtime._virtual_charm_root`.

Mind you, this was just a quick-and-dirty experiment, but here is what I tried:

  • I added a new attribute to Context holding the serialized meta/config/actions charm spec, and passed it all the way down to the runtime.py method mentioned above, where I removed the yaml.safe_dump calls and used the new attribute instead.
  • Since I use a fixture to set up the Context in my tests, I widened the fixture scope from "function" to "module" (see the sketch below).

This yielded a ~15% speed increase for my test file.
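The fixture change is roughly this (names illustrative; note that a shared Context also accumulates run history across tests, hence "quick and dirty"):

```python
import ops
import pytest
from ops import testing


class MyCharm(ops.CharmBase):  # stand-in charm for illustration
    pass


# Widened from the default "function" scope: the Context (and the charm
# spec handling behind it) is built once per module, not once per test.
@pytest.fixture(scope="module")
def ctx():
    return testing.Context(MyCharm, meta={"name": "my-charm"})


def test_start(ctx):
    # Each test reuses the module-level Context instead of building its own.
    state_out = ctx.run(ctx.on.start(), testing.State())
    assert isinstance(state_out, testing.State)
```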

tonyandrewmeyer (Contributor) commented:

Some initial times (best of 3), running the traefik-k8s suite (branch scenario-7-migraine 😂) with `tox -e scenario`:

  • 105.373s: base (operator and operator/testing main@HEAD)
  • 181.183s: `:memory:` unit state location (I must be doing something wrong here, but I'm not sure what)
  • 103.840s: avoid YAML dump of metadata by linking files
  • 103.196s: avoid YAML dump of metadata by linking files, and cache spec objects to avoid loading YAML
  • 72.423s: `_max_posargs(n)` is `object` (as a proxy for somehow simplifying or avoiding all that code with recent Python)

tonyandrewmeyer (Contributor) commented:

> I checked, and I do have libyaml bundled with/used by pyyaml. When I profiled my tests, I noticed that the tests were spending a long time serializing yaml in scenario/runtime:Runtime._virtual_charm_root.

@Batalex, what repo (and branch, if appropriate) are you testing against? I'll use it as a second benchmark, so that we're checking against multiple types of tests.

Batalex (Contributor) commented Nov 12, 2024

> @Batalex what repo (and branch, if appropriate) are you testing against? I'll use it as a second benchmark so that we're checking against multiple types of tests.

I think I was using the latest release (7.0.5), because I directly edited the files in the virtual env between two sessions during the sprint. I don't have that venv around any more, but the project itself has ops-scenario 7.0.5 in its dependency lock file.

tonyandrewmeyer (Contributor) commented:

> I think I was using the latest release (7.0.5)

Hey @Batalex, sorry for being unclear. I meant: which test suite are you running (one of the data platform repos, I assume)? Then I can run it with my changes too, to make sure they do improve things.

Batalex (Contributor) commented Nov 13, 2024

Ah, sorry, I was quite slow-witted yesterday. I was testing on https://github.com/canonical/kafka-k8s-operator (branch feat/dpe-5591-status, which should be merged into main by today). IIRC, I was specifically running `tox -e unit -- -s tests/unit/test_charm.py`.

tonyandrewmeyer added a commit that referenced this issue on Dec 11, 2024: … (#1494)

This PR changes Scenario to always use `:memory:` as the location for
the unit state database, to speed up running tests.

In most cases, the charm root location is a `TemporaryDirectory`, so the
unit state database only exists during the event emission anyway. The
database could be accessed if a specific root directory was provided (or
inside the context manager context), but the correct way to interact with
the database is through the State API (deferred events, saved state
objects), so we don't want to leave that open for users.

Timing (best of 3):

|Suite|main|branch|
|-|-|-|
|operator unit tests (no concurrency)|4m11.923s|3m41.354s|
|traefik scenario tests|3m49.479s|1m56.795s|
|kafka-k8s-operator scenario tests|0m13.463s|0m13.236s|

Note that this [is already the case for
Harness](https://github.com/canonical/operator/blob/0ad731a6a9118dfac240e84681f5c41810c4f3f2/ops/_private/harness.py#L309).

Refs #1434
tonyandrewmeyer added a commit that referenced this issue on Dec 11, 2024: …ine (#1495)

The Scenario code that limits the number of positional arguments (needed
in Python 3.8 since the dataclasses there do not have `KW_ONLY`) is
quite expensive. For the case of `JujuLogLine`, it's not actually
providing any value, because both of the arguments can be positional.
`JujuLogLine` objects are created frequently (one per logging call), so
the cost of creation matters a lot for the test performance.

This PR changes `JujuLogLine` to be a regular frozen dataclass. There
are other performance improvements possible in the max-positional-args
code (most significant of which would be only using it for Python 3.8
and 3.9) but this one-line change provides most of the benefits, so
seems worth doing before any more significant work.
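
The change is essentially this shape (a sketch; the real definition lives in Scenario's state module):

```python
import dataclasses


# Previously JujuLogLine inherited the positional-argument-limiting
# machinery, paying for inspect-based checking on every instantiation.
# As a plain frozen dataclass, the generated __init__ is cheap, and
# both fields can still be passed positionally.
@dataclasses.dataclass(frozen=True)
class JujuLogLine:
    level: str
    message: str
```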

Timing (best of 3):

|Suite|main|branch|
|-|-|-|
|operator unit tests (no concurrency)|4m11.923s|3m20.175s|
|traefik scenario tests|3m49.479s|2m18.024s|
|kafka-k8s-operator scenario tests|0m13.463s|0m9.973s|

Refs #1434
tonyandrewmeyer added a commit that referenced this issue on Dec 17, 2024:
The Scenario code that limits the number of positional arguments (needed
in Python 3.8 since the dataclasses there do not have `KW_ONLY`) is
quite expensive. We have a method of providing this functionality in
3.8, but it is unfortunately expensive, and therefore a regression from
older versions of Scenario.

I considered 4 options here:
* Manually write out the dozen or so `__init__` signatures (as we had at
one point in the [original
PR](canonical/ops-scenario#137)). I'd rather not
for the same reasons we made the decision at that time.
* Figure out a way to have a syntax that allows using regular
`dataclasses.KW_ONLY` with modern Python and only use our custom code on
older Python. I spent some time looking into this, and couldn't figure
out a way to do it where the code still seemed readable and maintainable
enough.
* Cache the most expensive work (what's in this PR).
* Don't bother doing anything now, and eagerly wait for the ability to
drop 3.8 support. A roughly 5% performance regression felt worth fixing
as long as the change is fairly straightforward (although this change
only gets about half of that back).

The most expensive parts of the `__new__` method are the ones that work
with `inspect`: `inspect.signature` in particular, but also getting the
default value of the parameters. If we assume that no-one is run-time
altering the signature of the class (I believe no-one should be) then
the values of these never actually change, but we are currently
calculating them every time an instance of the class is created. This PR
changes that to cache those three values the first time they are needed.
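
Sketched, the caching looks something like this (illustrative, not the exact branch code):

```python
import dataclasses
import inspect


class _MaxPositionalArgs:
    """Illustrative stand-in for Scenario's positional-argument guard."""

    _max_positional_args = 1

    def __new__(cls, *args, **kwargs):
        # inspect.signature is the dominant cost; compute it once per
        # class and stash it in the class dict, so later instantiations
        # reuse it. Checking cls.__dict__ (not hasattr) ensures each
        # subclass gets its own cache entry. (The real code also uses
        # the cached signature to map positional args onto field names.)
        if "_signature" not in cls.__dict__:
            cls._signature = inspect.signature(cls)
        if len(args) > cls._max_positional_args:
            raise TypeError(
                f"{cls.__name__} takes {cls._max_positional_args} "
                f"positional argument(s) but {len(args)} were given"
            )
        return super().__new__(cls)


@dataclasses.dataclass(frozen=True)
class Example(_MaxPositionalArgs):
    name: str = ""
    value: int = 0


Example("a", value=1)   # fine: one positional argument
# Example("a", 1)       # would raise TypeError: too many positional args
```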

There's one additional performance tweak in the branch that doesn't make
a significant difference, but is trivial to do: when checking if the
YAML files exist, skip the filesystem `exists()` call if we just created
an empty temporary directory a few lines above, since we know that it
will never exist.

A drive-by: I happened to notice while working on this branch Harness
referring to `options.yaml`, which does not exist (any more?) as far as
I know, so a docs tweak to address that.

Timing (best of 3):

|Suite|main|branch|
|-|-|-|
|operator unit tests (no concurrency)|165.25s|161.89s|
|traefik scenario tests|45.49s|44.30s|
|kafka-k8s-operator scenario tests|4.48s|4.38s|

Refs #1434
tonyandrewmeyer (Contributor) commented:

Some improvements have been made; I believe these should address the regressions since Scenario 6.x, and hopefully a bit more. More can be done, of course, but I think this is sufficient for now. #1498 could be moved up in priority if there are test suites that are long (at least 30s) where that's the issue. I would definitely like to hear from anyone who either has performance regressions from Scenario 6.x that aren't fixed now, or whose conversion from Harness to Scenario results in slower tests.

  • kafka-k8s-operator (`-e unit -- -s tests/unit/test_charm.py`): 4.44s, down from 7.47s (1.7x)
  • internal Scenario unit tests: 5.00s, down from 26.32s (5.3x)
  • traefik-k8s-operator (branch scenario-7-migraine, `-e scenario`): 39.29s, down from 98.52s (2.5x) (with main using Scenario 6, I get 110.40s)
