Trace-based optimizer #375
-
Method-vs-tracing aside (though for the record I am pessimistic about tracing jits for Python) -- You might already be planning this, but it'd be really cool if this was developed as a consumer of a JIT/execution API to help drive the design of the API
-
Why are you pessimistic? What data or experience do you have? Please share.
It will use an API, but I suspect that the resulting API won't be much use to Cinder or Pyston.
-
All I have are some unsubstantiated hunches, but I'm happy to share them -- I haven't worked on a tracing jit so my knowledge is limited. The first thing is my understanding that the JavaScript community invested a large amount of effort into tracing jits, only to give them up for method jits; I also believe that tracing jits are worse for Python than for JS (which I'll describe later). My main reason for pessimism is this train of thought (again, not fully substantiated):
The theories I have as to their performance are as follows (again, these are just speculation since I don't know much about the internal details of PyPy):
I think pyperformance's focus on tight numeric loops is quite misleading as to what most Python users want. I don't doubt that a tracing architecture could do extremely well on pyperformance, but I disagree that that's what should be optimized for.
Finally, we have multiple examples of method jits for Python that (again, unsubstantiated) I believe already eliminate a majority of the overhead that can be avoided with these sorts of optimizations. So if the performance is similar (such as by being limited by the same performance ceiling), I believe that method-at-a-time techniques have the benefit of being simpler, which means easier to optimize and quicker to develop. These are the reasons why, when we on the Pyston team were faced with this decision, we went with a method jit, and why I'm still happy with that decision. I'd love to hear, though, from people who have actually worked on tracing architectures, particularly PyPy.
That said, I am very much in support of more experimentation with different techniques, even ones that I am personally pessimistic about. I am a bit uneasy, however, about the idea of committing CPython to a particular architecture before it's clear which one is better, and I personally don't think that question will be settled via discussion; it will require actual implementations that can be compared. (This is where my question about using-an-API came from.)
-
I think what @markshannon is proposing is a very interesting direction, looking forward to results! I don't think we can easily know in advance how well this approach will work, but HotPy certainly is an existence proof that it can. PyPy is a bit of a different case, due to its meta-tracing approach, which has a bunch of advantages as well as disadvantages. I don't think I can usefully discuss these in a nuanced way right this minute. However, I would like to say something about this part:
PyPy differs from CPython, Pyston and Cinder in many, many ways: different implementation language, different garbage collection approach, different interpreter, and yes, a (meta-)tracing JIT. I don't think it's obvious which one of these differences is ultimately the reason for its poorer performance on some benchmarks. Probably all of them together? Re TraceMonkey: Python's situation is different from Javascript's for several reasons:
Anyway, I am happy to discuss things in more detail if there is interest in anything specific, let me know.
-
One way to test the effectiveness of the PyPy JIT is to compare PyPy with just the interpreter against full PyPy, and compare both to the baseline (CPython). In the example above, if the speed of the jitted code is X (relative to CPython), then the fraction of time spent in the jitted code can be estimated from those three numbers. This is all somewhat approximate, but it seems better than just arm-waving about loops, Javascript and C extensions.
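Concretely, here is one way to make that estimate. This is a back-of-the-envelope model of my own, with made-up example numbers; the function and its parameterization are illustrative, not taken from this thread. It treats the measured overall speed as a time-weighted average of the interpreter speed and the jitted-code speed, and solves for the weight.

```python
# Back-of-the-envelope model (an assumption for illustration, not an exact
# formula from the thread).  All speeds are relative to CPython = 1.0.

def jitted_time_fraction(interp_speed: float, jit_speed: float,
                         overall_speed: float) -> float:
    """Estimate the fraction of wall-clock time a JIT-enabled run spends
    in jitted code, given:
      interp_speed  - speed of the interpreter-only build (e.g. PyPy, JIT off)
      jit_speed     - assumed speed of the jitted code itself (the X above)
      overall_speed - measured speed of the full JIT-enabled run
    Model: overall_speed = f * jit_speed + (1 - f) * interp_speed,
    where f is the fraction of time spent in jitted code.
    """
    return (overall_speed - interp_speed) / (jit_speed - interp_speed)

# Hypothetical numbers: interpreter-only at 0.8x CPython, jitted code at 5x,
# overall run at 2x CPython -> roughly 29% of the time is in jitted code.
print(jitted_time_fraction(0.8, 5.0, 2.0))
```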
-
I think it would help clarify things if I split the approaches to region selection into three categories, rather than just the two:
The two forms of traces are very similar in terms of optimization, but quite different in how they determine a region.
-
In terms of the most effective JIT approach, I think one thing to keep in mind is how much the workload has changed. Python programs today are very different from Python programs of 10 years ago, with Python following a somewhat similar trajectory to JS in many regards. Consider a Flask web application using SQLAlchemy. Here, it is not uncommon for there to be 100 packages in requirements.txt and tens of megabytes of bytecode. Moreover, in these types of applications there are virtually no hot spots: everything is lukewarm. There are no simple loops, just mountains of method calls from quadruply-decorated methods. These applications are what Londoners would refer to as fatbergs. For sure, it is not as extreme as the JS community (where 1000+ dependencies is commonplace), but it is very different from what was commonplace a decade ago, and likely quite different to what we see with languages such as Lua, where to the best of my knowledge people are not routinely rolling up with 20 megabytes of code.
Indeed, to a first-order approximation these are probably the only benchmarks which should be considered, simply because they are so hard to get a win on and represent a worst-case scenario for a JIT. We know JITs can get a win on simpler, more loopy problems (PyPy's Speed Center demonstrates this nicely), but the fact we are all here, and that PyPy has not seen widespread adoption, provides a strong indication that these types of benchmarks are not representative. (I believe one of the above links from @kmod shows PyPy underperforming CPython for a Flask application.) In terms of what this means for tracing vs method-at-a-time, my conclusion is that for non-loopy applications with huge footprints, method-at-a-time is probably the best means of getting a win.
PS It may also be good to include Psyco in the discussion, which to this day is probably the most successful Python JIT project, and it did so as a module. Indeed, I know several teams which stuck with 32-bit Python builds for many years past their due date on account of the fact that Psyco was 32-bit only and was delivering a real performance boost (albeit at the cost of massive memory usage). Although this was before the Cambrian explosion of PyPI.
-
I think we are all aware that some programs are large, and that some programs have a flatter execution profile than others. You characterize region selection as "tracing vs method at a time". Did you read the previous post? Merely stating that "method at a time is probably the best means of getting a win" is not helpful. The things we want to maximize are:
The things we want to minimize are:
There are clear tensions here: specialization can lead to replication; larger regions make both replication and optimization of cold code more likely.
-
Before we start drawing far-reaching conclusions from a single benchmark, I should point out that pypy does eventually reach a performance level close to Pyston. However, it warms up very slowly (~15k iterations).
-
@markshannon can I ask what you're hoping to see from this thread? I don't think it's possible for anyone to produce hard enough data to prove that you shouldn't use a tracing approach; without having an implementation to measure I think the discussion is necessarily conjectural. Personally I think the burden of justification is on tracing, since it is the less-common and more-difficult approach, and in the only direct matchup I know of between tracing and method-at-a-time, method-at-a-time won convincingly.
-
Data. Experiential reports are also useful. Anything that might be useful in making an informed engineering decision on the best way to select regions for our second tier optimizer (the first tier being PEP 659) would also be helpful.
There is no burden on anyone, this isn't a legal case. PyPy is slower than Pyston in some cases, and faster in others. A lot faster in some cases. You've been working on Pyston for 8 years. You must have some relevant data.
-
I apologize if I made it seem like I'm trying to say Pyston is better -- I do believe that method-at-a-time techniques are better, and that's why Pyston uses them, but I'm not trying to claim that Pyston has proven this thesis or that it's "better" than PyPy. (I am unsure what you are referring to when you quote that word.) I provided the benchmark because it is a tangible piece of data, as is being asked for. I'm not claiming that this one data point is conclusive, and I continue to concede that tracing architectures, and projects that use them, have other advantages. I am unsure what else I can provide, since this is a hard piece of data that compares projects using the architectures under consideration, that is empirically investigable, and that has a good chance of being attributable to the topic being discussed. I also provided a conjecture about what's going on (that PyPy spends less time running optimized code than Pyston does on this particular benchmark, and in a way attributable to its architecture) which should be empirically verifiable without too much work.
I do think it would be helpful if you said which views of yours are open for reconsideration and what it would take to convince you of something, since the people you are calling "not helpful" are investing a decent amount of time into things that you apparently do not find compelling.
-
No need to apologize. Pyston is your baby, of course you think it is better 🙂
-
@rlamy's comment that PyPy reaches the performance of Pyston eventually (emphasis mine) suggests that PyPy is weak on that one benchmark because its interpreter and compiler are slow, even if the optimized code it produces is at least as fast as Pyston's. So, what I'm beginning to think is this:
Which suggests to me that the region of optimization doesn't actually matter that much.
-
All of them. But I need solid data. That one benchmark is tangible data, thanks.
-
I believe that this question can be made quantitative by assembling a benchmark suite that you think models the code you want to optimize. I think I've made my opinion pretty clear at this point that I think it should be macrobenchmarks and not the synthetic benchmarks in the pyperformance suite, but then again our visibility into real Python usage is one of Pyston's strategic advantages and we don't mind maintaining that for longer.
-
The execution of bytecodes is the most important part of a fast VM, and the most complex (unless we want a super-fancy GC).
Ultimately we will need a JIT compiler, but it is clear from the relatively poor performance of Jython and IronPython that simply compiling to machine code is not the magic pixie dust that many seem to think it is.
Since not doing something at all is always preferable to doing it faster, we want to optimize by removing overheads (type checks, boxing and unboxing, reference-counting operations, object allocation) before translating to machine code.
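As a purely illustrative sketch (my own, not CPython code) of what "removing overheads" means inside a trace: specialization lets a guard established once at trace entry stand in for the per-operation checks, so the checks, boxing and refcount traffic can be dropped before any machine code is generated.

```python
# Illustrative only; hypothetical names, not CPython's implementation.

class Deoptimize(Exception):
    """Hypothetical side exit: control returns to the ordinary interpreter."""

def generic_add(a, b):
    # Unspecialized BINARY_OP: dynamic dispatch on every execution, including
    # type checks, __add__/__radd__ lookup, and a freshly allocated boxed result.
    return a + b

def traced_int_add(a, b):
    # In a specialized trace, the guard runs once at trace entry...
    if type(a) is not int or type(b) is not int:
        raise Deoptimize
    # ...and the rest of the trace can assume machine integers, so the type
    # checks, boxing/unboxing and refcounting can all be elided before the
    # trace is translated to machine code.
    return a + b
```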
Region selection
There are two broad categories of regions in the literature and in industry:
- Whole functions (methods)
- Traces (linear paths through the code, selected dynamically)
There are variants of both that make the division less clear-cut, though; e.g. trace trees can end up looking similar to methods with lazy compilation of cold branches.
Compiling whole functions is conceptually simpler; the unit of compilation is more obvious and the boundaries clearer.
Traces are more flexible and naturally handle hot paths regardless of their shape.
In general, larger regions provide more optimization potential.
Trace-based approaches form larger regions by stitching together traces, or even re-optimizing several traces at once as a larger region (see HHVM).
Function based approaches use inlining to form larger regions (e.g. the JVM).
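As a rough sketch of the stitching idea mentioned above (hypothetical data structures, just to show the shape of it): when a side exit of one compiled trace targets code for which another trace already exists, the exit can be patched to jump straight into that trace, and the linked traces can later be re-optimized together as one larger region.

```python
# Hypothetical sketch of trace stitching; none of these names are real APIs.

class CompiledTrace:
    def __init__(self, entry_offset: int):
        self.entry_offset = entry_offset
        # exit id -> target trace, or None meaning "fall back to the interpreter"
        self.exits: dict[int, "CompiledTrace | None"] = {}

compiled: dict[int, CompiledTrace] = {}   # bytecode offset -> compiled trace

def stitch_hot_exit(trace: CompiledTrace, exit_id: int, target_offset: int) -> None:
    """If the exit's target already has a trace, link the two; the linked traces
    then form a larger region, much as inlining does in a function-at-a-time JIT."""
    target = compiled.get(target_offset)
    if target is not None:
        trace.exits[exit_id] = target
```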
A trace-based approach seems best for a number of reasons:
Trace selection
There are two main approaches to choosing traces:
- Long traces, recorded by running a tracing interpreter until the trace closes a loop (as in PyPy or TraceMonkey).
- Short traces, projected from profiling data without recording execution (as in HHVM or YJIT).
Long traces have the advantage that larger regions enable better optimization, and may include a loop which allows loop optimizations.
Short traces always complete, and do not need an extra tracing interpreter.
We are greedy and want the advantages of both!
We will project traces, like HHVM or YJIT, and use the statistical type information gathered by the adaptive interpreter to project past some branch points. Once our confidence that execution will stay on the trace drops below some threshold (~40%?), or we hit a back-edge, we will stop the trace.
This should give us all the advantages of short traces, with some of the advantages of long traces.
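A minimal sketch of that projection loop (the instruction model, the stats table, the maximum length and the exact 0.4 threshold are all assumptions made for illustration, not CPython data structures):

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_LIMIT = 0.4     # stop projecting once we expect to leave the trace
MAX_TRACE_LENGTH = 200     # safety cap, as real tracers have

@dataclass
class Instr:
    opcode: str
    next: Optional[int] = None      # fall-through successor offset
    is_branch: bool = False
    is_back_edge: bool = False

def project_trace(code: dict[int, Instr], start: int,
                  branch_stats: dict[int, tuple[int, float]]) -> list[Instr]:
    """branch_stats[offset] -> (likely successor offset, probability),
    as gathered by the tier-1 (PEP 659) adaptive interpreter."""
    trace, confidence, offset = [], 1.0, start
    while len(trace) < MAX_TRACE_LENGTH:
        instr = code[offset]
        trace.append(instr)
        if instr.is_back_edge:                 # hit a back-edge: close the trace
            break
        if instr.is_branch:
            offset, probability = branch_stats[offset]
            confidence *= probability          # project past the likely branch
            if confidence < CONFIDENCE_LIMIT:  # too likely to leave the trace
                break
        else:
            if instr.next is None:             # e.g. a return: nothing to follow
                break
            offset = instr.next
    return trace
```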
Trace optimization
There are three key optimization phases needed to execute traces as fast as possible:
The best data we have for the relative and combined effectiveness of these approaches is from my PhD thesis.
The VM (HotPy) was quite different from CPython, and there were no optimizations other than trace-based ones, so the numbers should not be considered indicative of speedups we might achieve.
However, they do show how the optimizations interact.
Each number shows the relative speedup from adding one or more phases, relative to the base interpreter.
Key:
A few things to note: