Measuring things #127
Replies: 2 comments
-
opcode time analysis
-
A few other things which might be worth collecting.

Another item in this direction worth investigating is, for a given operation, what "Mach 1" is, i.e. the fastest you can go without breaking any rules. Hotspots are not necessarily a problem if they correspond to code which is well optimised and already getting close to ideal performance. A good example here is dictionaries and strings, which I remember benchmarking several years back.

On the interpreter side a lot is known about how fast one can make an interpreter go short of JITting. Darek Mihocka has some excellent articles on the subject (mostly within the context of emulating CPU architectures, but in many regards this is more challenging than Python bytecode, since CPU instructions typically do far less real work than Python opcodes). A fun test here is to get the interpreter to rattle through a few billion NOPs and count how many CPU cycles it burns. Combined with how many opcodes are needed to execute a 'real' benchmark, this can give a weak upper bound on performance. If this upper bound is not sufficient, it indicates that one may want to give the interpreter a prod.

For opcodes some care is needed when interpreting results due to operator overloading and arbitrary precision integers. Spending a lot of time adding numbers is a problem, unless you're in a benchmark which does arbitrary precision calculations, where such a result is expected. The same goes for subscript operations.
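As a rough sketch of the NOP-counting idea (my own illustration, not from the comment above): time a loop whose body does nothing, as a stand-in for a stream of cheap opcodes, and convert the result to cycles using a hand-supplied clock frequency. The opcode count per iteration and the CPU frequency are assumptions to be checked against the printed disassembly and your machine.

```python
# Rough sketch: estimate the interpreter's best-case cost per opcode by timing
# a loop whose body does nothing, then dividing elapsed time by an assumed
# opcode count per iteration. ops_per_iter and cpu_ghz are hand-supplied
# assumptions, not measured values.
import dis
import time

def empty_loop(n):
    for _ in range(n):
        pass

def estimate(n=50_000_000, ops_per_iter=3, cpu_ghz=3.0):
    dis.dis(empty_loop)          # eyeball how many opcodes one iteration costs
    t0 = time.perf_counter()
    empty_loop(n)
    elapsed = time.perf_counter() - t0
    ns_per_op = elapsed * 1e9 / (n * ops_per_iter)
    print(f"~{ns_per_op:.2f} ns per opcode, "
          f"~{ns_per_op * cpu_ghz:.1f} cycles per opcode at {cpu_ghz} GHz")

if __name__ == "__main__":
    estimate()
```

Combined with a dynamic opcode count for a real benchmark, the resulting cycles-per-opcode figure gives the kind of weak upper bound described above.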
-
This is not really a new issue, more of a rant. Everybody loves to work on a cool new feature. But too often we don't know how to design something because we don't have enough data. Examples: Where does startup time go? How much overhead does bytecode instruction dispatch cost? Which are the hottest C functions in our code base? What are the most common opcode pairs?
The result is that often we design purely based on intuition, or based on old data or hearsay (e.g. a table of opcode pair frequencies published by Instagram five years ago), or based on some proxy (e.g. static opcode frequency instead of dynamic opcode frequency).
When we do collect data we may use hackish tooling (maybe a few lines of sed/grep/etc. pipelines that are not written down anywhere except in our shell history) that cannot be reproduced by others or collected systematically over time.
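One way to make such a pipeline reproducible is to commit it as a small script. As a hedged illustration (not one of the existing scripts in faster-cpython/tools), here is what counting the static opcode pairs mentioned above could look like; dynamic frequencies would instead need interpreter instrumentation or tracing.

```python
# Minimal sketch: count *static* opcode pairs in a source file, recursing
# into nested code objects (function bodies, comprehensions, classes).
import collections
import dis
import sys
import types

def iter_code_objects(code):
    """Yield a code object and all code objects nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)

def static_opcode_pairs(path):
    with open(path, encoding="utf-8") as f:
        top = compile(f.read(), path, "exec")
    pairs = collections.Counter()
    for code in iter_code_objects(top):
        ops = [ins.opname for ins in dis.get_instructions(code)]
        pairs.update(zip(ops, ops[1:]))
    return pairs

if __name__ == "__main__":
    for (a, b), n in static_opcode_pairs(sys.argv[1]).most_common(20):
        print(f"{n:6d}  {a:<25s} -> {b}")
```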
So I think we should formulate our needs for data and then design and build some tooling to collect that data. A (relatively) good example is speed.python.org and PyPerformance -- this solves two data needs, comparing benchmark performance over time and across Python versions, with the ability in the UI to drill down on individual benchmarks or alternative configurations.
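For the "across versions" comparison specifically, pyperformance stores results in pyperf's JSON format, so a minimal comparison script can lean on pyperf's Python API. A sketch, assuming pyperf is installed and that the two files were produced by `pyperformance run -o <file>`; the calls relied on here are `BenchmarkSuite.load`, `get_benchmarks`, `get_benchmark`, and `Benchmark.mean`:

```python
# Hedged sketch, not an official tool: compare mean timings between two
# pyperformance result files using pyperf's Python API.
import sys
import pyperf

def compare(base_path, new_path):
    base = pyperf.BenchmarkSuite.load(base_path)
    new = pyperf.BenchmarkSuite.load(new_path)
    for bench in base.get_benchmarks():
        name = bench.get_name()
        try:
            other = new.get_benchmark(name)
        except KeyError:
            continue  # benchmark missing from the new run
        print(f"{name:30s} {bench.mean():8.4f}s -> {other.mean():8.4f}s "
              f"({bench.mean() / other.mean():.2f}x)")

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```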
We have a few other tools (mostly related to counting opcodes in various ways) collected in https://github.com/faster-cpython/tools/, but we also have huge blind spots (we don't know anything about the timing of individual opcodes or the distribution of types), and we have no infrastructure for collecting and publishing various types of profiling data in a repeatable way (a la speed.python.org). For example, we have a flamegraph of where startup time goes (on Linux), but it would be nice to be able to produce a similar graph at the press of a button now that we have implemented freezing and deep-freezing.
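On the "distribution of types" gap, one pure-Python stand-in (all names below are mine, for illustration only; real data would come from interpreter-level instrumentation) is to hook sys.setprofile and record the types of positional arguments at each Python-level call while a workload runs:

```python
# Hedged sketch: record the types of positional arguments seen at each
# Python-level call. This observes call sites, not individual opcodes.
import collections
import sys

TYPE_COUNTS = collections.Counter()

def _profile(frame, event, arg):
    if event == "call":
        code = frame.f_code
        for name in code.co_varnames[:code.co_argcount]:
            if name in frame.f_locals:
                key = (code.co_name, name, type(frame.f_locals[name]).__name__)
                TYPE_COUNTS[key] += 1

def record_types(func, *args, **kwargs):
    sys.setprofile(_profile)
    try:
        return func(*args, **kwargs)
    finally:
        sys.setprofile(None)

def report(top=20):
    for (func, param, tp), count in TYPE_COUNTS.most_common(top):
        print(f"{count:8d}  {func}({param}: {tp})")
```

Something like `record_types(workload)` followed by `report()` would print the most common (function, parameter, type) triples seen during the run.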