Register-based VM #119
Replies: 9 comments
-
Note that Victor has also worked on such a thing: see https://mail.python.org/archives/list/[email protected]/thread/X72OYMPH2HLTY4SIGVPKSTIRWL2XFY7G/
(It's actually faster!) |
Beta Was this translation helpful? Give feedback.
-
As an indication of how complex this transition would be, both Skip and Victor chose to keep the existing bytecode compiler and tack on a separate translation phase, written in Python! |
Beta Was this translation helpful? Give feedback.
-
CPython is significantly faster than it was when Victor did those experiments. So I wouldn't trust those numbers.
That might just be a comment on how difficult the compiler is to work with (but that's a separate topic). |
Beta Was this translation helpful? Give feedback.
-
I'm 👎 on this. Not that it is a bad idea, just that the effort and disruption outweigh the marginal benefits. The important issue here is not "register" or "stack", but is a wider instruction with explicit operands faster than the narrower instruction with implicit operand(s) that we currently have? Maybe, but is it faster than the judicious use of super-instructions (such as LOAD_FAST_LOAD_FAST) and static specializations Experimental evidence[Virtual Machine Showdown: Stack Versus Registers] (https://www.usenix.org/legacy/events%2Fvee05%2Ffull_papers/p153-yunhe.pdf) shows a 47% reduction in VM instructions executed. However, for Lua, the change from stack machine in Lua 4 to register machine in Lua 5 showed a 35% reduction. Since Python is even more dynamic and higher level than Lua, I would expect the reduction to be less than that, maybe in the 25-30% range. In both cases considerable effort was put into bytecode optimization for the register machine, but not the stack machine. Another thing to consider is that branch predictors have improved a lot since 2005, so per-instruction overhead is reduced. Comparison with super-instructionsSuper-instructions are much simpler to implement. Does it help or hinder the important optimizations?An infinite stack is a lot easier to reason about and optimize than a fixed-size array. Consequently a "register" machine may ultimately slow things down by making specialization and compilation harder. |
Beta Was this translation helpful? Give feedback.
-
Agreed. One idea I had was to just go systematically over all the opcodes that currently have no arguments and where appropriate add a variant (or variants, like ADD_INT, ADD_CONST, ADD_FAST) that takes an extra operand. This is nearly free (the code can use a goto) and saves instruction count (the oparg field is otherwise unused) and decoding time. We can either analyze output from runs with dxpairs or analysis of code objects to determine which opcode pairs are common. |
Beta Was this translation helpful? Give feedback.
-
A downside of adding ADD_CONST, ADD_FAST, etc. is that if we want to specialize BINARY_ADD we need to duplicate those specializations. |
Beta Was this translation helpful? Give feedback.
-
Stack caching is an alternative that results in similar benefits and introduces much less code churn. |
Beta Was this translation helpful? Give feedback.
-
I'm pretty curious about this. I think it would mean making (say) 3 copies of the whole dispatch loop, somewhat like below. It would severely hurt code size/locality, but I think it would also significantly decrease the number of memory reads/writes into the stack on most instructions that get executed. I don't think the bytecode compiler would have to change at all. PyObject *first_register, *second_register; // Sometimes use these instead of touching the stack
dispatch_zero: // zero of the registers in use
switch (opcode) {
...
case TARGET0(LOAD_FAST): {
first_register = GETLOCAL(oparg);
if (first_register == NULL) {
...
}
Py_INCREF(first_register);
DISPATCH_ONE();
}
case TARGET0(STORE_FAST): {
PyObject *value = BASIC_POP();
SETLOCAL(oparg, value);
DISPATCH_ZERO();
}
case TARGET0(POP_TOP): {
PyObject *value = BASIC_POP();
Py_DECREF(value);
DISPATCH_ZERO();
}
case TARGET0(BINARY_SUBTRACT): {
PyObject *right = stack_pointer[-1];
PyObject *left = stack_pointer[-2];
stack_pointer -= 2;
first_register = PyNumber_Subtract(left, right);
Py_DECREF(right);
Py_DECREF(left);
if (first_register == NULL)
goto error0;
DISPATCH_ONE();
}
...
}
dispatch_one: // one of the registers in use
switch (opcode) {
...
case TARGET1(LOAD_FAST): {
second_register = GETLOCAL(oparg);
if (second_register == NULL) {
...
}
Py_INCREF(second_register);
DISPATCH_TWO();
}
case TARGET1(STORE_FAST): {
SETLOCAL(oparg, first_register);
DISPATCH_ZERO();
}
case TARGET1(POP_TOP): {
Py_DECREF(first_register);
DISPATCH_ZERO();
}
case TARGET1(BINARY_SUBTRACT): {
PyObject *right = first_register;
PyObject *left = stack_pointer[-1];
stack_pointer -= 1;
first_register = PyNumber_Subtract(left, right);
Py_DECREF(right);
Py_DECREF(left);
if (first_register == NULL)
goto error1;
DISPATCH_ONE();
}
...
}
dispatch_two: // two of the registers in use
switch (opcode) {
...
case TARGET2(LOAD_FAST): {
BASIC_PUSH(first_register);
first_register = second_register;
second_register = GETLOCAL(oparg);
if (second_register == NULL) {
...
}
Py_INCREF(second_register);
DISPATCH_TWO();
}
case TARGET2(STORE_FAST): {
SETLOCAL(oparg, second_register);
DISPATCH_ONE();
}
case TARGET2(POP_TOP): {
Py_DECREF(second_register);
DISPATCH_ONE();
}
case TARGET2(BINARY_SUBTRACT): {
PyObject *left = first_register;
first_register = PyNumber_Subtract(left, second_register);
Py_DECREF(second_register);
Py_DECREF(left);
if (first_register == NULL)
goto error2;
DISPATCH_ONE();
}
...
} |
Beta Was this translation helpful? Give feedback.
-
I wonder if, assuming Sam Gross' nogil work is eventually integrated, we'll end up with a (mostly) register-based VM anyways. |
Beta Was this translation helpful? Give feedback.
-
Skip Montanaro has been keeping this idea alive. See his latest review on python-dev.
Beta Was this translation helpful? Give feedback.
All reactions