Hi sendVTable bytecode :) #11

Draft · wants to merge 11 commits into base: pharo-12
Conversation

@PalumboN (Collaborator) commented Sep 10, 2024

Goal

We want to replace megamorphic calls with v-table sends to see if that is faster.

We can identify the megamorphic selectors and compile all message sends of those selectors (or maybe just the methods that contain megamorphic calls?) to v-table sends.
For this, every class in the system must have a v-table filled according to a fixed, selector-based layout (for example, every class would have the method #size as its first entry...).

Then, instead of filling a PIC and ending up in a megamorphic call, the method can be found via the v-table (this is 2 LOADs more than a PIC: one for the class and one for the v-table).
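Conceptually, the dispatch described above looks like the following C sketch. All names (`Class`, `send_vtable`, the 10-slot layout) are illustrative, not the VM's actual representation; the point is the shape of the fast path: two dependent loads plus an indirect call, with no selector comparison at the call site.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Object Object;
typedef long (*Method)(Object *self);

/* Fixed layout: every class keeps the same selector in the same slot,
   e.g. slot 0 is always #size. */
enum { SLOT_SIZE = 0, VTABLE_SLOTS = 10 };

typedef struct {
    Method vtable[VTABLE_SLOTS];
} Class;

struct Object {
    Class *klass;   /* every object points to its class */
    long   elems;   /* toy payload standing in for a collection's size */
};

static long array_size(Object *self) { return self->elems; }

/* A v-table send: load the class, load the method from the slot, call it. */
static long send_vtable(Object *receiver, int slot) {
    Class  *k = receiver->klass;   /* load 1: the receiver's class */
    Method  m = k->vtable[slot];   /* load 2: the extra one vs. a PIC hit */
    return m(receiver);            /* indirect call, no selector compare */
}
```

Because the slot index is fixed per selector, the call site never compares classes or selectors; the trade-off is that every class must eagerly fill every shared slot.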

So, the plan is:

  1. Profile the megamorphic calls and make a list of common selectors.
  2. Add a v-table to all classes and fill all of them with the methods corresponding to the list of common selectors.
  3. Add a new bytecode to perform the v-table call.
  4. Recompile all methods (or specific ones) to use the new bytecode for the message sends in the list of common selectors.
  5. Run benchmarks 🤓

What we have

  • We can define a method using the <opalBytecodeMethod> pragma:

```smalltalk
vTableSizeOf: col
	<opalBytecodeMethod>
	^ IRBuilder new
		numArgs: 1;
		addTemps: #(col);
		pushTemp: #col;
		sendVTable: 0 args: 0;
		returnTop;
		ir
```
  • We ran some benchmarks (using the selector #size) in interpreter mode:

```smalltalk
"Fill all v-tables"
Smalltalk allClassesDo: [ :class | | vTable |
	vTable := Array new: 10.
	vTable at: 1 put: (class lookupSelector: #size).
	class vTable: vTable ].

"Recompile all methods sending #size"
#size senders do: #recompile.

"Micro benchmark"
obj := { 1 . 3 . 5 }.
[ Collection vTableSizeOf: obj ] bench.
"INT: 101,106,127 iterations in 5 seconds 1 millisecond. 20217181.964 per second"
[ Collection polySizeOf: obj ] bench.
"INT: 186,123,581 iterations in 5 seconds 1 millisecond. 37217272.745 per second"
```

We learnt that the global lookup cache is faster than the v-table (fewer memory accesses).
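To see why, here is a C sketch of a direct-mapped global lookup cache keyed by (class, selector), in the spirit of a VM's first-level method lookup cache. The names, hash, and table size are illustrative, not the Pharo VM's: a hit costs one indexed load plus two compares, whereas the v-table path always pays the class load, the v-table load, and the slot load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per hash bucket, keyed by (class, selector). */
typedef struct {
    uintptr_t klass;
    uintptr_t selector;
    void     *method;
} CacheEntry;

enum { CACHE_SIZE = 256 };
static CacheEntry cache[CACHE_SIZE];

static unsigned cache_hash(uintptr_t klass, uintptr_t selector) {
    return (unsigned)((klass ^ selector) % CACHE_SIZE);
}

/* Hit: one indexed load plus two compares. Miss: caller does the full
   lookup and then calls cache_fill. */
static void *cache_lookup(uintptr_t klass, uintptr_t selector) {
    CacheEntry *e = &cache[cache_hash(klass, selector)];
    if (e->klass == klass && e->selector == selector) return e->method;
    return NULL;
}

static void cache_fill(uintptr_t klass, uintptr_t selector, void *method) {
    CacheEntry *e = &cache[cache_hash(klass, selector)];
    e->klass = klass;
    e->selector = selector;
    e->method = method;
}
```

The cache wins in interpreter mode because a warm entry is a single probe into one table, while the v-table needs a chain of dependent loads through the class object.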

  • Then we implemented a version of the bytecode for the JIT compiler:

```smalltalk
StackToRegisterMappingCogit >> genSendVTableMessage [
	| methodIndex argumentCount jumpInterpret mergeJump |
	methodIndex := (byte1 >> 3) + (extA << 5).
	extA := 0.
	argumentCount := (byte1 bitAnd: 7) + (extB << 3).
	extB := 0.
	numExtB := 0.
	"Allocate registers"
	self marshallSendArguments: argumentCount.
	"Fetch the method from the v-table"
	objectRepresentation genGetClassObjectOf: ReceiverResultReg into: ClassReg scratchReg: TempReg instRegIsReceiver: true.
	self MoveMw: (VTableIndex + 1) * objectMemory wordSize r: ClassReg R: TempReg.
	self MoveMw: (methodIndex + 1) * objectMemory wordSize r: TempReg R: TempReg.
	"If the method is compiled, jump to it; if not, jump to the interpreter."
	objectRepresentation genLoadSlot: HeaderIndex sourceReg: TempReg destReg: ClassReg.
	jumpInterpret := objectRepresentation genJumpImmediate: ClassReg.
	"Jump to the method's unchecked entry point."
	self AddCq: cmNoCheckEntryOffset R: ClassReg.
	self CallR: ClassReg. "Receiver and args travel in registers"
	mergeJump := self Jump: 0.
	"Call the trampoline to continue execution in the interpreter"
	jumpInterpret jmpTarget: self Label.
	self PushR: ReceiverResultReg.
	self CallRT: vTableSend. "Receiver and args travel on the stack"
	mergeJump jmpTarget: self Label.
	self annotateBytecode: self Label.
	self voidReceiverOptStatus.
	self ssPushRegister: ReceiverResultReg.
	^ 0
]
```
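The generated code follows a classic fast-path/slow-path shape: test whether the fetched method is already machine code, call its unchecked entry point if so, otherwise trampoline into the interpreter, then merge. A minimal C sketch of that control flow (the `VMethod` struct and function names are invented for illustration; they do not exist in the VM):

```c
#include <assert.h>
#include <stddef.h>

typedef long (*Entry)(long receiver);

typedef struct {
    Entry compiled;    /* non-NULL once the JIT has produced machine code */
    Entry interpret;   /* always-available slow path */
} VMethod;

/* Stand-ins for a jitted method body and its interpreted equivalent. */
static long jit_double(long r)       { return r * 2; }
static long interp_increment(long r) { return r + 1; }

/* Mirrors the emitted shape: the genJumpImmediate-style check picks the
   fast path (direct call) or the slow path (interpreter trampoline). */
static long dispatch(VMethod *m, long receiver) {
    if (m->compiled != NULL)
        return m->compiled(receiver);   /* fast path: unchecked entry point */
    return m->interpret(receiver);      /* slow path: back to the interpreter */
}
```

In the real Cogit code the "check" is a tag test on the loaded header slot, and the merge point is where both paths push `ReceiverResultReg` as the send's result.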

We ran the same micro-benchmark as before, but with more receiver types (the code was lost).
We learnt that, speed-wise, the v-table send is equivalent to a polymorphic call:

mono > poly = vTable > mega

Next steps

We want to replace almost all megamorphic calls in the image with vTable sends.

For that we need:

  • Identify all megamorphic calls -> selectors and sites
  • Fill the vTables based on those selectors
  • Recompile the methods containing mega-calls so they use vTable sends at those sites
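The first two bullets amount to building a receiver-type profile per call site. A minimal C sketch of the classification step, assuming a PIC capacity of 6 cases (the commonly cited limit of a Cog closed PIC; everything else here is illustrative):

```c
#include <assert.h>

/* Track distinct receiver classes observed at one call site. */
enum { MAX_TRACKED = 16, MAX_PIC_CASES = 6 };

typedef struct {
    int classes[MAX_TRACKED];
    int nclasses;
} CallSite;

static void record_receiver(CallSite *site, int classId) {
    for (int i = 0; i < site->nclasses && i < MAX_TRACKED; i++)
        if (site->classes[i] == classId) return;   /* already counted */
    if (site->nclasses < MAX_TRACKED)
        site->classes[site->nclasses] = classId;
    site->nclasses++;
}

/* mono: one class; poly: fits in a PIC; mega: overflows the PIC,
   so it is a candidate for the v-table treatment. */
static const char *classify(const CallSite *site) {
    if (site->nclasses <= 1) return "mono";
    if (site->nclasses <= MAX_PIC_CASES) return "poly";
    return "mega";
}
```

Sites classified as "mega" give the selectors to put in the v-tables and the methods to recompile.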

Most of the work needed is in pharo-project/pharo#17088.
I leave here a possibly useful workspace snippet:

```smalltalk
selectors := OpalCompiler megamorphicCalls
	collect: [ :tuple | tuple key third ]
	thenSelect: [ :sel | sel numArgs = 0 ].

(selectors asSet asArray copyWithoutAll: { #size . #shallowCopy . #basicNew })
	do: [ :sel | sel implementors do: [ :method |
		self assert: (method isPrimitive not or: [ method isQuick ]) ] ].

OpalCompiler fillAllVTables.

CompilationContext default optionOptimiseVTable.

methods := OpalCompiler megamorphicCalls collect: [ :tuple | tuple value ].
methods asSet asArray do: #recompile.
```

Then we should have a stable image with fewer megamorphic calls than the normal one.

And we can measure them in Benchy 📈

@PalumboN PalumboN marked this pull request as draft September 10, 2024 15:10