Hi sendVTable bytecode :) #11

Draft · wants to merge 11 commits into base: pharo-12
Conversation

@PalumboN (Collaborator) commented Sep 10, 2024

Goal

We want to replace megamorphic calls with v-table sends to see if that is faster.

We can identify the megamorphic selectors and compile all message sends of those selectors (or maybe just the methods that contain megamorphic calls?) to v-table sends.
For this, every class in the system must have a v-table filled according to a fixed, selector-based layout (for example, every class would have the method #size as its first entry...).

Then, instead of filling a PIC and ending up in a megamorphic call, the method can be found via the v-table (this is 2 LOADs more than a PIC: one for the class and one for the v-table).
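Conceptually, the dispatch described above looks like the following C sketch. All names (`Class`, `send_vtable`, the 10-slot layout) are illustrative, not the VM's actual representation; the point is the shape of the fast path: two dependent loads plus an indirect call, with no selector comparison at the call site.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Object Object;
typedef long (*Method)(Object *self);

/* Fixed layout: every class keeps the same selector in the same slot,
   e.g. slot 0 is always #size. */
enum { SLOT_SIZE = 0, VTABLE_SLOTS = 10 };

typedef struct {
    Method vtable[VTABLE_SLOTS];
} Class;

struct Object {
    Class *klass;   /* every object points to its class */
    long   elems;   /* toy payload standing in for a collection's size */
};

static long array_size(Object *self) { return self->elems; }

/* A v-table send: load the class, load the method from the slot, call it. */
static long send_vtable(Object *receiver, int slot) {
    Class  *k = receiver->klass;   /* load 1: the receiver's class */
    Method  m = k->vtable[slot];   /* load 2: the extra one vs. a PIC hit */
    return m(receiver);            /* indirect call, no selector compare */
}
```

Because the slot index is fixed per selector, the call site never compares classes or selectors; the trade-off is that every class must eagerly fill every shared slot.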

So, the plan is:

  1. Profile the megamorphic calls and make a list of common selectors.
  2. Add a v-table to all classes and fill all of them with the methods corresponding to the list of common selectors.
  3. Add a new bytecode to perform the v-table call.
  4. Recompile all methods (or specific ones) to use the new bytecode for the message sends in the list of common selectors.
  5. Run benchmarks 🤓

What we have

  • We can define a method using the <opalBytecodeMethod> pragma:

```smalltalk
vTableSizeOf: col
	<opalBytecodeMethod>
	^ IRBuilder new
		numArgs: 1;
		addTemps: #(col);
		pushTemp: #col;
		sendVTable: 0 args: 0;
		returnTop;
		ir
```
  • We ran some benchmarks (using the selector #size) in interpreter mode:

```smalltalk
"Fill all v-tables"
Smalltalk allClassesDo: [ :class | | vTable |
	vTable := Array new: 10.
	vTable at: 1 put: (class lookupSelector: #size).
	class vTable: vTable ].

"Recompile all methods sending #size"
#size senders do: #recompile.

"Micro benchmark"
obj := { 1 . 3 . 5 }.
[ Collection vTableSizeOf: obj ] bench.
"INT: 101,106,127 iterations in 5 seconds 1 millisecond. 20217181.964 per second"
[ Collection polySizeOf: obj ] bench.
"INT: 186,123,581 iterations in 5 seconds 1 millisecond. 37217272.745 per second"
```

We learnt that the global lookup cache is faster than the v-table (fewer memory accesses).
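To see why, here is a C sketch of a direct-mapped global lookup cache keyed by (class, selector), in the spirit of a VM's first-level method lookup cache. The names, hash, and table size are illustrative, not the Pharo VM's: a hit costs one indexed load plus two compares, whereas the v-table path always pays the class load, the v-table load, and the slot load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per hash bucket, keyed by (class, selector). */
typedef struct {
    uintptr_t klass;
    uintptr_t selector;
    void     *method;
} CacheEntry;

enum { CACHE_SIZE = 256 };
static CacheEntry cache[CACHE_SIZE];

static unsigned cache_hash(uintptr_t klass, uintptr_t selector) {
    return (unsigned)((klass ^ selector) % CACHE_SIZE);
}

/* Hit: one indexed load plus two compares. Miss: caller does the full
   lookup and then calls cache_fill. */
static void *cache_lookup(uintptr_t klass, uintptr_t selector) {
    CacheEntry *e = &cache[cache_hash(klass, selector)];
    if (e->klass == klass && e->selector == selector) return e->method;
    return NULL;
}

static void cache_fill(uintptr_t klass, uintptr_t selector, void *method) {
    CacheEntry *e = &cache[cache_hash(klass, selector)];
    e->klass = klass;
    e->selector = selector;
    e->method = method;
}
```

The cache wins in interpreter mode because a warm entry is a single probe into one table, while the v-table needs a chain of dependent loads through the class object.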

  • Then we implemented a version of the bytecode for the JIT compiler:

```smalltalk
StackToRegisterMappingCogit >> genSendVTableMessage [
	| methodIndex argumentCount jumpInterpret mergeJump |
	methodIndex := (byte1 >> 3) + (extA << 5).
	extA := 0.
	argumentCount := (byte1 bitAnd: 7) + (extB << 3).
	extB := 0.
	numExtB := 0.
	"Allocate registers"
	self marshallSendArguments: argumentCount.
	"Fetch the method from the v-table"
	objectRepresentation genGetClassObjectOf: ReceiverResultReg into: ClassReg scratchReg: TempReg instRegIsReceiver: true.
	self MoveMw: (VTableIndex + 1) * objectMemory wordSize r: ClassReg R: TempReg.
	self MoveMw: (methodIndex + 1) * objectMemory wordSize r: TempReg R: TempReg.
	"If the method is compiled, jump to it; if not, jump to the interpreter."
	objectRepresentation genLoadSlot: HeaderIndex sourceReg: TempReg destReg: ClassReg.
	jumpInterpret := objectRepresentation genJumpImmediate: ClassReg.
	"Jump to the method's unchecked entry point."
	self AddCq: cmNoCheckEntryOffset R: ClassReg.
	self CallR: ClassReg. "Receiver and args travel in registers"
	mergeJump := self Jump: 0.
	"Call the trampoline to continue execution in the interpreter"
	jumpInterpret jmpTarget: self Label.
	self PushR: ReceiverResultReg.
	self CallRT: vTableSend. "Receiver and args travel on the stack"
	mergeJump jmpTarget: self Label.
	self annotateBytecode: self Label.
	self voidReceiverOptStatus.
	self ssPushRegister: ReceiverResultReg.
	^ 0
]
```
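The generated code follows a classic fast-path/slow-path shape: test whether the fetched method is already machine code, call its unchecked entry point if so, otherwise trampoline into the interpreter, then merge. A minimal C sketch of that control flow (the `VMethod` struct and function names are invented for illustration; they do not exist in the VM):

```c
#include <assert.h>
#include <stddef.h>

typedef long (*Entry)(long receiver);

typedef struct {
    Entry compiled;    /* non-NULL once the JIT has produced machine code */
    Entry interpret;   /* always-available slow path */
} VMethod;

/* Stand-ins for a jitted method body and its interpreted equivalent. */
static long jit_double(long r)       { return r * 2; }
static long interp_increment(long r) { return r + 1; }

/* Mirrors the emitted shape: the genJumpImmediate-style check picks the
   fast path (direct call) or the slow path (interpreter trampoline). */
static long dispatch(VMethod *m, long receiver) {
    if (m->compiled != NULL)
        return m->compiled(receiver);   /* fast path: unchecked entry point */
    return m->interpret(receiver);      /* slow path: back to the interpreter */
}
```

In the real Cogit code the "check" is a tag test on the loaded header slot, and the merge point is where both paths push `ReceiverResultReg` as the send's result.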

We ran the same micro-benchmark as before, but with more receiver types (the code was lost).
We learnt that, speed-wise, the v-table send is equivalent to a polymorphic call:

mono > poly = vTable > mega

Next steps

We want to replace almost all megamorphic calls in the image with vTable sends.

For that we need:

  • Identify all megamorphic calls -> selectors and sites
  • Fill the vTables based on those selectors
  • Recompile the methods containing mega-calls so they use vTable sends at those sites
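The first two bullets amount to building a receiver-type profile per call site. A minimal C sketch of the classification step, assuming a PIC capacity of 6 cases (the commonly cited limit of a Cog closed PIC; everything else here is illustrative):

```c
#include <assert.h>

/* Track distinct receiver classes observed at one call site. */
enum { MAX_TRACKED = 16, MAX_PIC_CASES = 6 };

typedef struct {
    int classes[MAX_TRACKED];
    int nclasses;
} CallSite;

static void record_receiver(CallSite *site, int classId) {
    for (int i = 0; i < site->nclasses && i < MAX_TRACKED; i++)
        if (site->classes[i] == classId) return;   /* already counted */
    if (site->nclasses < MAX_TRACKED)
        site->classes[site->nclasses] = classId;
    site->nclasses++;
}

/* mono: one class; poly: fits in a PIC; mega: overflows the PIC,
   so it is a candidate for the v-table treatment. */
static const char *classify(const CallSite *site) {
    if (site->nclasses <= 1) return "mono";
    if (site->nclasses <= MAX_PIC_CASES) return "poly";
    return "mega";
}
```

Sites classified as "mega" give the selectors to put in the v-tables and the methods to recompile.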

Most of the work needed is in pharo-project/pharo#17088.
I leave here a possibly useful workspace snippet:

```smalltalk
selectors := OpalCompiler megamorphicCalls
	collect: [ :tuple | tuple key third ]
	thenSelect: [ :sel | sel numArgs = 0 ].

(selectors asSet asArray copyWithoutAll: { #size . #shallowCopy . #basicNew })
	do: [ :sel | sel implementors do: [ :method |
		self assert: (method isPrimitive not or: [ method isQuick ]) ] ].

OpalCompiler fillAllVTables.

CompilationContext default optionOptimiseVTable.

methods := OpalCompiler megamorphicCalls collect: [ :tuple | tuple value ].
methods asSet asArray do: #recompile.
```

Then we should have a stable image with fewer megamorphic calls than the normal one.

And we can measure them in Benchy 📈

@PalumboN PalumboN marked this pull request as draft September 10, 2024 15:10