Trace-based optimizer #375
-
Method-vs-tracing aside (though for the record I am pessimistic about tracing jits for Python) -- You might already be planning this, but it'd be really cool if this was developed as a consumer of a JIT/execution API to help drive the design of the API
-
Why are you pessimistic? What data or experience do you have? Please share.
It will use an API, but I suspect that the resulting API won't be much use to Cinder or Pyston.
-
All I have are some unsubstantiated hunches, but I'm happy to share them -- I haven't worked on a tracing jit so my knowledge is limited. The first thing is my understanding that the JavaScript community invested a large amount of effort into tracing jits, only to give them up for method jits; I also believe that tracing jits are worse for Python than for JS (which I'll describe later). My main reason for pessimism is this train of thought (again, not fully substantiated):
The theories I have as to their performance are as follows (again, these are just speculation since I don't know much about the internal details of PyPy):
I think pyperformance's focus on tight numeric loops is quite misleading as to what most Python users want. I don't doubt that a tracing architecture could do extremely well on pyperformance, but I disagree that that's what should be optimized for.
Finally, we have multiple examples of method jits for Python that (again, unsubstantiated) I believe already eliminate a majority of the overhead that can be avoided with these sorts of optimizations. So if the performance is similar (such as by being limited by the same performance ceiling), I believe that method-at-a-time techniques have the benefit of being simpler, which means easier to optimize and quicker to develop. These are the reasons why, when we on the Pyston team were faced with this decision, we went with a method jit, and why I'm still happy with that decision. I'd love to hear, though, from people who have actually worked on tracing architectures, particularly PyPy.
That said, I am very much in support of more experimentation with different techniques, even ones that I am personally pessimistic about. I am a bit uneasy, however, about the idea of committing CPython to a particular architecture before it's clear which one is better, and I personally don't think that question will be settled via discussion; it will require actual implementations that can be compared. (This is where my question about using-an-API came from.)
-
I think what @markshannon is proposing is a very interesting direction, looking forward to results! I don't think we can easily know in advance how well this approach will work, but HotPy certainly is an existence proof that it can. PyPy is a bit of a different case, due to its meta-tracing approach, which has a bunch of advantages as well as disadvantages. I don't think I can usefully discuss these in a nuanced way right this minute. However, I would like to say something about this part:
PyPy differs from CPython, Pyston and Cinder in many, many ways: different implementation language, different garbage collection approach, different interpreter, and yes, a (meta-)tracing JIT. I don't think it's obvious which one of these differences is ultimately the reason for its poorer performance on some benchmarks. Probably all of them together? Re TraceMonkey: Python's situation is different from Javascript's for several reasons:
Anyway, I am happy to discuss things in more detail if there is interest in anything specific, let me know.
-
One way to test the effectiveness of the PyPy JIT is to compare PyPy with just the interpreter against full PyPy, and compare both to the baseline (CPython). In the example above, if the speed of the jitted code is X (relative to CPython), then the fraction of time spent in the jitted code can be estimated from those three numbers. This is all somewhat approximate, but it seems better than just arm-waving about loops, Javascript and C extensions.
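Concretely, here is one way to make that estimate. This is a back-of-the-envelope model of my own, with made-up example numbers; the function and its parameterization are illustrative, not taken from this thread. It treats the measured overall speed as a time-weighted average of the interpreter speed and the jitted-code speed, and solves for the weight.

```python
# Back-of-the-envelope model (an assumption for illustration, not an exact
# formula from the thread).  All speeds are relative to CPython = 1.0.

def jitted_time_fraction(interp_speed: float, jit_speed: float,
                         overall_speed: float) -> float:
    """Estimate the fraction of wall-clock time a JIT-enabled run spends
    in jitted code, given:
      interp_speed  - speed of the interpreter-only build (e.g. PyPy, JIT off)
      jit_speed     - assumed speed of the jitted code itself (the X above)
      overall_speed - measured speed of the full JIT-enabled run
    Model: overall_speed = f * jit_speed + (1 - f) * interp_speed,
    where f is the fraction of time spent in jitted code.
    """
    return (overall_speed - interp_speed) / (jit_speed - interp_speed)

# Hypothetical numbers: interpreter-only at 0.8x CPython, jitted code at 5x,
# overall run at 2x CPython -> roughly 29% of the time is in jitted code.
print(jitted_time_fraction(0.8, 5.0, 2.0))
```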
-
I think it would help clarify things if I split the approaches to region selection into three categories, rather than just the two:
The two forms of traces are very similar in terms of optimization, but quite different in how they determine a region.
-
In terms of the most effective JIT approach, I think one thing to keep in mind is how much the workload has changed. Python programs today are very different from Python programs of 10 years ago, with Python following a somewhat similar trajectory to JS in many regards. Consider a Flask web application using SQLAlchemy. Here, it is not uncommon for there to be 100 packages in requirements.txt and tens of megabytes of bytecode. Moreover, in these types of applications there are virtually no hot spots: everything is lukewarm. There are no simple loops, just mountains of method calls from quadruply-decorated methods. These applications are what Londoners would refer to as fatbergs. For sure, it is not as extreme as the JS community (where 1000+ dependencies is commonplace), but it is very different from what was commonplace a decade ago, and likely quite different to what we see with languages such as Lua, where to the best of my knowledge people are not routinely rolling up with 20 megabytes of code.
Indeed, to a first-order approximation these are probably the only benchmarks which should be considered, simply because they are so hard to get a win on and represent a worst-case scenario for a JIT. We know JITs can get a win on simpler, more loopy problems (PyPy's Speed Center demonstrates this nicely), but the fact we are all here, and that PyPy has not seen widespread adoption, provides a strong indication that these types of benchmarks are not representative. (I believe one of the above links from @kmod shows PyPy underperforming CPython for a Flask application.) In terms of what this means for tracing vs method-at-a-time, my conclusion is that for non-loopy applications with huge footprints, method-at-a-time is probably the best means of getting a win.
PS It may also be good to include Psyco in the discussion, which to this day is probably the most successful Python JIT project, and it did so as a module. Indeed, I know several teams which stuck with 32-bit Python builds for many years past their due date on account of the fact that Psyco was 32-bit only and was delivering a real performance boost (albeit at the cost of massive memory usage). Although this was before the Cambrian explosion of PyPI.
-
I think we are all aware that some programs are large, and that some programs have a flatter execution profile than others. You characterize region selection as "tracing vs method at a time". Did you read the previous post? Merely stating that "method at a time is probably the best means of getting a win" is not helpful. The things we want to maximize are:
The things we want to minimize are:
There are clear tensions here: specialization can lead to replication; larger regions make both replication and optimization of cold code more likely.
-
Before we start drawing far-reaching conclusions from a single benchmark, I should point out that pypy does eventually reach a performance level close to Pyston. However, it warms up very slowly (~15k iterations).
-
@markshannon can I ask what you're hoping to see from this thread? I don't think it's possible for anyone to produce hard enough data to prove that you shouldn't use a tracing approach; without having an implementation to measure I think the discussion is necessarily conjectural. Personally I think the burden of justification is on tracing, since it is the less-common and more-difficult approach, and in the only direct matchup I know of between tracing and method-at-a-time, method-at-a-time won convincingly.
-
Data. Experiential reports are also useful. Anything that might be useful in making an informed engineering decision on the best way to select regions for our second tier optimizer (the first tier being PEP 659) would also be helpful.
There is no burden on anyone, this isn't a legal case. PyPy is slower than Pyston in some cases, and faster in others. A lot faster in some cases. You've been working on Pyston for 8 years. You must have some relevant data.
-
I apologize if I made it seem like I'm trying to say Pyston is better -- I do believe that method-at-a-time techniques are better, and that's why Pyston uses them, but I'm not trying to claim that Pyston has proven this thesis or that it's "better" than PyPy. (I am unsure what you are referring to when you quote that word.) I provided the benchmark because it is a tangible piece of data, as is being asked for. I'm not claiming that this one data point is conclusive, and I continue to concede that tracing architectures, and projects that use them, have other advantages. I am unsure what else I can provide, since this is a hard piece of data that compares projects using the architectures under consideration, that is empirically investigable, and that has a good chance of being attributable to the topic being discussed. I also provided a conjecture about what's going on (that PyPy spends less time running optimized code than Pyston does on this particular benchmark, and in a way attributable to its architecture) which should be empirically verifiable without too much work.
I do think it would be helpful if you said which views of yours are open for reconsideration and what it would take to convince you of something, since the people you are calling "not helpful" are investing a decent amount of time into things that you apparently do not find compelling.
-
No need to apologize. Pyston is your baby, of course you think it is better 🙂
-
@rlamy's comment that PyPy reaches the performance of Pyston eventually (emphasis mine) suggests that PyPy is weak on that one benchmark because its interpreter and compiler are slow, even if the optimized code it produces is at least as fast as Pyston's. So, what I'm beginning to think is this:
Which suggests to me that the region of optimization doesn't actually matter that much.
-
All of them. But I need solid data. That one benchmark is tangible data, thanks.
-
I believe that this question can be made quantitative by assembling a benchmark suite that you think models the code you want to optimize. I think I've made my opinion pretty clear at this point that I think it should be macrobenchmarks and not the synthetic benchmarks in the pyperformance suite, but then again our visibility into real Python usage is one of Pyston's strategic advantages and we don't mind maintaining that for longer.
-
The execution of bytecodes is the most important part of a fast VM, and the most complex (unless we want a super-fancy GC).
Ultimately we will need a JIT compiler, but it is clear from the relatively poor performance of Jython and IronPython that simply compiling to machine code is not the magic pixie dust that many seem to think it is.
Since not doing something at all is always preferable to doing it faster, we want to optimize by removing overheads (type checks, boxing and unboxing, reference-counting operations, object allocation) before translating to machine code.
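As a purely illustrative sketch (my own, not CPython code) of what "removing overheads" means inside a trace: specialization lets a guard established once at trace entry stand in for the per-operation checks, so the checks, boxing and refcount traffic can be dropped before any machine code is generated.

```python
# Illustrative only; hypothetical names, not CPython's implementation.

class Deoptimize(Exception):
    """Hypothetical side exit: control returns to the ordinary interpreter."""

def generic_add(a, b):
    # Unspecialized BINARY_OP: dynamic dispatch on every execution, including
    # type checks, __add__/__radd__ lookup, and a freshly allocated boxed result.
    return a + b

def traced_int_add(a, b):
    # In a specialized trace, the guard runs once at trace entry...
    if type(a) is not int or type(b) is not int:
        raise Deoptimize
    # ...and the rest of the trace can assume machine integers, so the type
    # checks, boxing/unboxing and refcounting can all be elided before the
    # trace is translated to machine code.
    return a + b
```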
Region selection
There are two broad categories of regions in the literature and in industry:
- Whole functions (methods)
- Traces (linear paths through the code, selected dynamically)
There are variants of both that make the division less clear-cut, though; e.g. trace trees can end up looking similar to methods with lazy compilation of cold branches.
Compiling whole functions is conceptually simpler; the unit of compilation is more obvious and the boundaries clearer.
Traces are more flexible and naturally handle hot paths regardless of their shape.
In general, larger regions provide more optimization potential.
Trace-based approaches form larger regions by stitching together traces, or even re-optimizing several traces at once as a larger region (see HHVM).
Function based approaches use inlining to form larger regions (e.g. the JVM).
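As a rough sketch of the stitching idea mentioned above (hypothetical data structures, just to show the shape of it): when a side exit of one compiled trace targets code for which another trace already exists, the exit can be patched to jump straight into that trace, and the linked traces can later be re-optimized together as one larger region.

```python
# Hypothetical sketch of trace stitching; none of these names are real APIs.

class CompiledTrace:
    def __init__(self, entry_offset: int):
        self.entry_offset = entry_offset
        # exit id -> target trace, or None meaning "fall back to the interpreter"
        self.exits: dict[int, "CompiledTrace | None"] = {}

compiled: dict[int, CompiledTrace] = {}   # bytecode offset -> compiled trace

def stitch_hot_exit(trace: CompiledTrace, exit_id: int, target_offset: int) -> None:
    """If the exit's target already has a trace, link the two; the linked traces
    then form a larger region, much as inlining does in a function-at-a-time JIT."""
    target = compiled.get(target_offset)
    if target is not None:
        trace.exits[exit_id] = target
```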
A trace-based approach seems best for a number of reasons:
Trace selection
There are two main approaches to choosing traces:
- Long traces, recorded by running a tracing interpreter until the trace closes a loop (as in PyPy or TraceMonkey).
- Short traces, projected from profiling data without recording execution (as in HHVM or YJIT).
Long traces have the advantage that larger regions enable better optimization, and may include a loop which allows loop optimizations.
Short traces always complete, and do not need an extra tracing interpreter.
We are greedy and want the advantages of both!
We will project traces, like HHVM or YJIT, and use the statistical type information gathered by the adaptive interpreter to project past some branch points. Once our confidence that execution will stay on the trace drops below some threshold (~40%?), or we hit a back-edge, we will stop the trace.
This should give us all the advantages of short traces, with some of the advantages of long traces.
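A minimal sketch of that projection loop (the instruction model, the stats table, the maximum length and the exact 0.4 threshold are all assumptions made for illustration, not CPython data structures):

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_LIMIT = 0.4     # stop projecting once we expect to leave the trace
MAX_TRACE_LENGTH = 200     # safety cap, as real tracers have

@dataclass
class Instr:
    opcode: str
    next: Optional[int] = None      # fall-through successor offset
    is_branch: bool = False
    is_back_edge: bool = False

def project_trace(code: dict[int, Instr], start: int,
                  branch_stats: dict[int, tuple[int, float]]) -> list[Instr]:
    """branch_stats[offset] -> (likely successor offset, probability),
    as gathered by the tier-1 (PEP 659) adaptive interpreter."""
    trace, confidence, offset = [], 1.0, start
    while len(trace) < MAX_TRACE_LENGTH:
        instr = code[offset]
        trace.append(instr)
        if instr.is_back_edge:                 # hit a back-edge: close the trace
            break
        if instr.is_branch:
            offset, probability = branch_stats[offset]
            confidence *= probability          # project past the likely branch
            if confidence < CONFIDENCE_LIMIT:  # too likely to leave the trace
                break
        else:
            if instr.next is None:             # e.g. a return: nothing to follow
                break
            offset = instr.next
    return trace
```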
Trace optimization
There are three key optimization phases needed to execute traces as fast as possible:
The best data we have for the relative and combined effectiveness of these approaches is from my PhD thesis.
The VM (HotPy) was quite different from CPython, and there were no optimizations other than trace-based ones, so the numbers should not be considered indicative of speedups we might achieve.
However, they do show how the optimizations interact.
Each number shows the relative speedup from adding one or more phases, relative to the base interpreter.
Key:
A few things to note: