Replies: 27 comments 8 replies
-
Code generation is a topic I've of course been aware of (also when looking at LWJGL3) and have been thinking about for the last two years or so. So far I've concluded that, given the amount of work needed to cover all such concerns (with JavaDoc documentation being by far the most complicated to handle), code generation is simply not worth it. So far, I've had less trouble getting issues filed (missing method overloads, wrong documentation, a method implementation wrong in Vector3f but not in Vector3d) and fixing them within a few minutes than I would have coming up with a good codegen solution.
-
I understand codegen is not trivial! However, I don't think the Javadoc generation should be that tricky. For example, changing parameter types in the documentation from one precision to another is largely mechanical. I agree that codegen would bring up many other possibilities -- including developing a small linear algebra DSL that could be used to turn MATLAB-like expressions into optimal Java code, while creating as few intermediate objects as possible.
-
I think there is some cool meta-programming you can do with the Java annotation processor if you want to optimize stuff at compile time. Dagger and Micronaut are good examples of this, and the performance benefits are huge. Still, it's more work fiddling with an intermediary templating language than just copying a method for two additional cases. TeraMath uses StringTemplate to generate code for its math library, but it's kind of a headache to work with. Just wondering what you have in mind? Most of the changes are pretty trivial, so codegen just seems to add more complexity to an already trivial process? https://github.com/MovingBlocks/TeraMath/blob/develop/src/generator/resources/Vector.st
-
I had in mind something like those StringTemplate templates, yes -- although I can see how this could be a pain to deal with. It would be better to represent each data type abstractly rather than spelling it out in the template. The idea would be to implement the actual operator logic just a single time, and type-ify the logic for each precision (float, double).
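To make the "implement the logic once, type-ify per precision" idea concrete, here is a minimal sketch. It is purely illustrative -- the template string, the `$S` placeholder, and the `expand` method are all hypothetical, not anything JOML or TeraMath actually uses:

```java
import java.util.List;

// Hypothetical sketch: one operator template, expanded once per precision suffix.
public class PrecisionExpand {
    static final String OP_TEMPLATE =
        "public Vector3$S add(Vector3$S other) {\n"
      + "    this.x += other.x; this.y += other.y; this.z += other.z;\n"
      + "    return this;\n"
      + "}";

    // Substitute the precision suffix ($S -> "f" or "d") into the template.
    public static String expand(String suffix) {
        return OP_TEMPLATE.replace("$S", suffix);
    }

    public static void main(String[] args) {
        for (String s : List.of("f", "d"))
            System.out.println(expand(s));
    }
}
```

A real solution would also have to substitute the element type (`float`/`double`) inside method bodies and Javadoc, which is where string templating starts to hurt and an AST-based approach looks more attractive.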
-
@lukehutch Exactly! That's what I was thinking of as well. I've also looked into https://github.com/javaparser/javaparser so that the actual logic of an operator is implemented in Java and then semantic AST transformations are done on it to produce the final Java source code. However, things get even more complicated when we think about the Panama Vector API. You would then not formulate vector operators on the individual x, y, z fields (like now) but only ever use vector types, and flatten the operations for non-Panama-Vector-API code generation targets.
-
You might be over-complicating things if you want to write the operators in Java, and use Javaparser to produce the AST. I suggest simply building the AST manually, using nested calls to AST node constructors. It will be more awkward to manually construct the AST this way than to implement the same operators in Java code; however, manually constructing the AST will allow you to represent operators in a much more generic (e.g. type-agnostic) way.
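A tiny sketch of what "nested calls to AST node constructors" could look like. The node names (`Field`, `BinOp`) and the `gen` method are hypothetical, meant only to show how one type-agnostic node can serve every precision:

```java
// Minimal, hypothetical AST sketch: nodes are type-agnostic; the precision is
// only chosen when code is generated from the node.
public class AstSketch {
    public interface Expr { String gen(String type); }

    public record Field(String owner, String name) implements Expr {
        public String gen(String type) { return owner + "." + name; }
    }

    public record BinOp(String op, Expr l, Expr r) implements Expr {
        public String gen(String type) {
            return "(" + l.gen(type) + " " + op + " " + r.gen(type) + ")";
        }
    }

    public static void main(String[] args) {
        // this.x + other.x, built by nesting node constructors
        Expr add = new BinOp("+", new Field("this", "x"), new Field("other", "x"));
        // The same node works unchanged for float or double targets.
        System.out.println(add.gen("float"));
    }
}
```

The awkwardness mentioned above is visible even here: nested constructor calls get deep quickly, which is exactly why a small DSL in front of the node API starts to look appealing.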
-
Hey @lukehutch I've begun working on a codegen solution now - it's going to be epic! :) This template:

class Vector {
static void add(Context ctx) {
Value other = loadparam(selfType().andCoalescables());
VectorValue result = vectorvalue();
for (int i = 0; i < selfType().dimensions(); i++)
result.set(i, binop(ADD, extractelement(thiz(), i), extractelement(other, i)));
defineResult(result);
}
}

can generate this:

public Vector3f add(Vector3f other) {
float rx = this.x + other.x;
float ry = this.y + other.y;
float rz = this.z + other.z;
this.x = rx;
this.y = ry;
this.z = rz;
  return this;
}
public Vector3f add(Vector3f other, Vector3f dest) {
float rx = this.x + other.x;
float ry = this.y + other.y;
float rz = this.z + other.z;
dest.x = rx;
dest.y = ry;
dest.z = rz;
  return dest;
}

I should've done this ages ago. :)
-
@httpdigest any clue when you would start publishing some of the code? I should be able to help with testing the library and making sure things work correctly.
-
@pollend thanks for the offer! Though, before I have an actual usable solution, it's definitely going to take some time. Right now I'm exploring a few options/ways forward. My first very hacky approach was: define a very simple AST for a "value graph" (much like Graal's nodes API), which will be how I am going to define the methods (templates) of JOML's classes. But then there is the whole problem of how to lower this AST to actual Java code. You really don't want to do StringBuilder.append("Java-code-snippet") or string-based templating. We should use a proper solution with a solid AST for the Java language, which is also what https://github.com/javaparser/javaparser can do. I am also exploring ways to use LLVM or Graal to do the actual register/variable allocation and AST optimization, and then maybe generate Java code from that. But that might turn out to be infeasible, since their ASTs are too low-level (i.e. there is no local variable assignment node in Graal's nodes API). Basically, lowering the value graph back into readable Java code is the main challenge right now.
So, I'm right in the midst of researching how others (most notably Graal, LLVM and emscripten/relooper) do things - though they are geared towards lowering code, not towards generating high-level language code again from an IR. Still, I would like to reuse those mature infrastructures as much as possible.
-
Yeah man :-D I am quite amazed at how many variants of the same code are maintained by hand. It must have been a ton of work to keep the API consistent and in sync with itself. I was going to start on this myself, and offer it as a prototype to kick off a larger implementation effort, but I have been completely swamped over the last year, and it won't let up next year, I think. However, maybe I can offer some of my ideas here.

You probably want unary and ternary operators too, e.g. for negation and fused multiply-add.

Don't worry about unnecessary assignments -- the compiler should optimize those out. In fact, in the code generator I would extract all fields of all passed-in objects into local variables, and do all computation on local variables. I wouldn't worry about register allocation either -- you can create as many local variables as you want, and either the JIT engine (if using ASM) or the compiler (if compiling to Java source) will take care of register or local variable optimization.

AST nodes should be fully composable, and they should be able to be assembled into either a tree or a DAG (a DAG could be used to avoid re-computing earlier expressions). You could probably also optimize away re-computation of shared expressions by searching for common subgraphs of the AST and coalescing the AST nodes into a DAG automatically.

One of the biggest opportunities here is to avoid the allocation of intermediate objects, by fully expanding a composed AST -- including function calls into other ASTs -- into flat method code that only uses local variables. You also have opportunities to optimize away JRE library method calls; in some JOML code I wrote recently, something like 60% of the time was taken by calls into the JRE math library.

It's worth it to make a list of all the cross-products you want to generate API calls for: precision, dimension, in-place vs. dest, and so on.
When you consider all the variants that can be generated using codegen, and the amount of work this will save you in the long run, the codegen approach really makes sense! There are other things that sort of form part of the cross-product, but probably need a different implementation in each case, so they may need to be written out as separate methods.
Compilation from AST nodes could be used in several ways, including ahead-of-time generation of Java source and runtime bytecode generation.
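To get a feel for the size of that cross-product, here is a sketch that enumerates method signatures from a single template. The type names and the specific axes (precision, dimension, in-place vs. dest) are illustrative assumptions, not JOML's actual generator:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate the variant cross-product
// (precision x dimension x in-place/dest) from one "add" template.
public class VariantCrossProduct {
    public static List<String> signatures() {
        List<String> sigs = new ArrayList<>();
        for (String prec : List.of("f", "d"))                 // float / double
            for (int dim : List.of(2, 3, 4))                  // Vector2..Vector4
                for (boolean dest : List.of(false, true)) {   // in-place vs. dest parameter
                    String t = "Vector" + dim + prec;
                    sigs.add(t + " add(" + t + " other"
                            + (dest ? ", " + t + " dest" : "") + ")");
                }
        return sigs;
    }

    public static void main(String[] args) {
        // 2 precisions * 3 dimensions * 2 dest-variants = 12 methods from one template.
        signatures().forEach(System.out::println);
    }
}
```

Adding the mixed-precision and prefix/postfix axes mentioned above multiplies this further, which is exactly why maintaining the variants by hand is so costly.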
-
Yep, that's what I also briefly had in mind: using an interface for the vector and matrix classes, with a generated class that basically only contains invokedynamic call sites with a bootstrap method for all methods, so that calling any particular method links to a MethodHandle of a generated static method in a dynamically generated class. And generating JVM bytecode is in most parts easier than generating Java code.
I'd agree with you if there wasn't the bytecode length threshold in HotSpot that keeps methods from being inlined into their callers. There was actually one commit where I basically spent all my time optimizing the effective bytecode length of all JOML methods. That's what https://github.com/JOML-CI/JOML/blob/main/buildhelper/InlineAdvisor.java is for. So, generating optimal (as in fewest bytecodes possible) methods is also a goal. :)
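For readers unfamiliar with the linkage idea above, here is a simplified sketch of the MethodHandle half of it. A real invokedynamic bootstrap method returns a `CallSite` to the JVM; this sketch only shows the lookup-and-invoke step, and the class and method names are hypothetical stand-ins for generated code:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Simplified sketch of linking a call to a generated static method via a
// MethodHandle, the way an invokedynamic bootstrap would resolve it.
public class LinkSketch {
    // Stand-in for a generated static implementation method.
    public static float add(float a, float b) { return a + b; }

    public static float linkAndCall(float a, float b) {
        try {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                LinkSketch.class, "add",
                MethodType.methodType(float.class, float.class, float.class));
            // invokeExact requires the call signature to match exactly.
            return (float) mh.invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(linkAndCall(1.5f, 2.5f));
    }
}
```

In the real scheme, the lookup would target a freshly defined dynamic class rather than a pre-existing one, and the linkage would happen once per call site via the bootstrap method.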
-
Well, I sort of listed those four implementation methods in order of difficulty. Personally I would start with (1), and just try to create a "better JOML". But I think you'll find out pretty quickly that it's worth it to write a script-language-to-AST converter, i.e. (2), since building an AST by hand using a bunch of nested constructor calls (or static factory methods, like the grammar-building methods in https://github.com/lukehutch/pikaparser#grammar-description-file-arithmeticgrammar) gets unwieldy quickly.

PS: the combinatorial explosion only occurs for compilation methods (1) and (2). Don't worry about combinatorial explosion in the API. It's much better to have all possible functionality available in the API, and be overwhelmed with the number of options that pop up in an IDE, than it is to have to manually write boilerplate because the API is missing something basic and commonly needed. You can just create some good documentation that shows the high-level API calls without all the combinatorial versions, so that a user can look through the available methods much more quickly than by scanning the long list that pops up in their IDE. And you can carefully consider how to name the methods, so that the user can "drill down" to what they need, from the highest-level to the lowest-level concept that specifies the functionality they're looking for.
-
Honestly, I would worry far more about having the code generator inline every nested call within a method's AST, including inlining all the other JOML methods it calls -- so that every JOML method is "flat code" that makes no method calls except to JRE libraries -- than I would worry about JOML methods getting inlined into their callers. I suspect you will get far more bang for your buck that way. You have had to do a lot of this in your mind already while writing JOML, e.g. wherever you manually flattened the math into a method instead of calling other JOML methods.

Also remember that premature optimization is the root of all evil. It's worth getting this working first, and then figuring out where the code generator is generating suboptimal code, and fixing the code generator, so that the fix applies to all methods. Ultimately, if you can't fit an optimization into the byte limit, you can't fit it.

As always with optimization, it's not worth making assumptions about fine-grained optimizations until you have profiled the heck out of the code.
-
@lukehutch Thanks for your input, I really appreciate it!
-
You're welcome. By the way, I agree with you that debuggability is hugely important. For this reason, I think doing AOT codegen to Java source is inherently going to prove the most valuable. I took a look to see what sorts of Java code generator libraries are out there, and there are a few, like https://github.com/square/javapoet (and there are probably better libraries than this; I didn't look very hard). But to be honest, the internals of JOML methods are really so simple that I don't think you should take on the complexity of using a library. You just need to be able to declare methods, assign to variables and fields, and evaluate basic arithmetic (including possibly accelerating arithmetic using Math.fma).
-
Another idea: if you write an interpreter for the scripting language (or really for the AST structure), then you could automatically generate test-case code, supplying random values to all method calls, and making sure the compiled code and the interpreter come up with the same results. This would save you a ton of time writing test cases. (Specifically, this tests that codegen is working as expected.)
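The differential-testing idea above can be sketched in a few lines. Here the "interpreter" and the "generated code" are both reduced to a single trivial add operation (hypothetical stand-ins), but the shape of the harness is the point: same random inputs into both paths, compare results:

```java
import java.util.Random;

// Sketch of differential testing: run the same operation through an
// interpreter path and a generated-code path on random inputs and compare.
public class DifferentialTest {
    // Interpreter path: would walk the AST; reduced here to one add node.
    static float interpretAdd(float a, float b) { return a + b; }

    // Stand-in for the flat generated code for the same AST.
    static float generatedAdd(float a, float b) { return a + b; }

    public static boolean agree(long seed, int trials) {
        Random rnd = new Random(seed); // fixed seed keeps failures reproducible
        for (int i = 0; i < trials; i++) {
            float a = rnd.nextFloat() * 200f - 100f;
            float b = rnd.nextFloat() * 200f - 100f;
            if (interpretAdd(a, b) != generatedAdd(a, b)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree(42L, 10_000));
    }
}
```

For floating-point operations whose generated form reassociates or contracts (e.g. via `Math.fma`), an exact `!=` comparison is too strict; an ULP-based tolerance would be needed instead.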
-
@lukehutch You're right, it is really no big deal to generate Java source myself without using a library, and many pieces are already falling into place. I've got a concrete syntax tree representation for all the Java syntax elements (class, fields, constructors, methods, statements, expressions) that I need, and I can simply toString() them to generate semi-formatted Java source, which I then run through Google's google-java-format library to get properly formatted Java sources.
-
Exciting stuff. It's a super-cool project, so I'm glad you have caught the vision for it! I wish I had time to contribute to this, because I think it will be a really awesome project to work on, but to be honest it's probably better for one person to get the initial prototype working anyway.

In the scripting language, you probably want to define low-level operators (element access, scalar arithmetic) and build the vector operations from those. That is, you have options, with increasing levels of complexity:

(1) Manually write code in the scripting language for each number of dimensions -- but not for each level of precision.
(2) Write the code once with an explicit loop over the dimensions. This will produce the same code as the above, but requires loop unrolling, and absorbing identities -- in this case the initial "base case for folding" (e.g. a zero accumulator that the first addition folds into).
(3) Implicit looping (probably best of all, from a flexibility point of view, and from the perspective of implementation simplicity compared to explicit looping): formulate operations on whole vectors and let the generator expand them per element.
As I mentioned previously, you should assume for the basic form of all binary operations that the input types have the same precision (both float, or both double). However, as I pointed out previously, there's a fourth dimension in the cross product, where the internal precision of the computation can be forced to double even when the operands are float.
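Option (2) above -- explicit loops that get unrolled, with the identity "base case" folded away -- can be illustrated with a tiny generator. Everything here (the `a.c0`/`b.c0` component naming, the `dot` helper) is a hypothetical stand-in:

```java
// Sketch of loop unrolling with identity folding: generate an unrolled
// dot-product expression from a dimension count, without emitting the
// "0.0f + ..." base case that a naive accumulator loop would produce.
public class UnrollSketch {
    public static String dot(int dim) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < dim; i++) {
            if (i > 0) sb.append(" + "); // identity 0 + x == x already absorbed
            sb.append("a.c").append(i).append(" * b.c").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dot(3)); // a.c0 * b.c0 + a.c1 * b.c1 + a.c2 * b.c2
    }
}
```

Here the folding is baked into the generator by hand; in a real system the generator would emit the naive accumulator form and a later simplification pass would absorb the zero, which is exactly the "absorbing identities" work mentioned in option (2).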
-
Well, I'm certainly glad to at least discuss these things here, which definitely helps me stay on track and motivated. So, keep the ideas coming. Now that I've got some ideas about how to generate the vector classes, I just looked into how to do that for the matrix classes... and it gets really involved there. The part that I don't like is that it's really like the vector types here: you can only formulate the operations from a very high-level view. Basically, the vector "add" operation now looks like:

@Operator(name = "add")
static Value vadd(@This Value thiz, @Selftyped Value v) {
return thiz.vadd(v); // <- build a dedicated "this is a vector add instruction" AST value
}

This is kind of okay when generating the vector classes, I guess, since codegen will depend on how vector operations are actually manifested in Java code (flattened to scalar operations, or using the Java Vector API). But it's quite the same for the matrix classes, because literally every operation (multiplication, componentwise multiplication, componentwise addition, determinant, ...) also depends on how codegen will generate it in the end, with options ranging from "flattening to scalar operations" (just like now) to "using the Vector API to optimize those operations" to "generating SSE/AVX native code when hopefully we have a fast FFI solution in Java with Panama". So, I fear that a matrix multiply template method would simply look like:

@Operator(name = "mul")
static Value mmul(@This Value thiz, @Selftyped Value v) {
return thiz.mmul(v); // <- build a dedicated "this is a matrix multiplication instruction" AST value
}

in order to keep the semantic information that "this is a matrix multiplication, so lower it to any code you want", instead of actually formulating the template method using scalar or vector operations. So, in the end there is not much use for template classes/methods, since everything is decided in codegen anyway. The only actual use I see now is for "combining" multiple methods (i.e. dependency graphs) in order to avoid allocating temp objects or reading/writing fields.

EDIT: Well... maybe I've painted too dark a picture regarding templating. We need matrices and multiplication as first-class citizens in the graph; however, it starts and ends right there. There are still many, many JOML matrix methods that just do more other stuff, like the various Matrix4f#rotateX/Y/Z... methods, which can still be expressed as a 3x3 matrix constructor node/value with some built-in function nodes like sin/cos that create a matrix, followed by a matrix multiplication intrinsic value. Something like:

@Method(name = "rotateX")
public static Value rotateX(@This Value thiz, @Elementtyped Value angle) {
Value s = builtin(SIN, angle), c = builtin(COS, angle);
return thiz.mmul(m3x3(one, zero, zero, zero, c, s, zero, s.negate(), c));
}

with one and zero being constant element values in the graph. The codegen can then take care of lowering that to scalar operations and folding away unnecessary multiplications/additions, based on whether a particular matrix element is the identity for the respective operation, while other codegen approaches with SSE/AVX or the Vector API would just carry out the full multiplication, which will in the end be faster.
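The identity-folding step described above can be shown on a miniature expression AST. The node types here are hypothetical, but the rewrite rules (0·x → 0, 1·x → x, 0+x → x) are exactly the ones that collapse a rotation matrix full of constant zeros and ones into the short scalar form:

```java
// Sketch of folding multiplications/additions whose operand is a constant
// identity element (0 for +, 1 for *), as happens when lowering a rotation
// matrix whose elements are largely constant zeros and ones.
public class FoldSketch {
    public interface Expr {}
    public record Const(double v) implements Expr {}
    public record Var(String name) implements Expr {}
    public record Mul(Expr l, Expr r) implements Expr {}
    public record Add(Expr l, Expr r) implements Expr {}

    public static Expr fold(Expr e) {
        if (e instanceof Mul m) {
            Expr l = fold(m.l()), r = fold(m.r());
            if (l instanceof Const c && c.v() == 0.0) return new Const(0.0); // 0 * x -> 0
            if (r instanceof Const c && c.v() == 0.0) return new Const(0.0); // x * 0 -> 0
            if (l instanceof Const c && c.v() == 1.0) return r;              // 1 * x -> x
            if (r instanceof Const c && c.v() == 1.0) return l;              // x * 1 -> x
            return new Mul(l, r);
        }
        if (e instanceof Add a) {
            Expr l = fold(a.l()), r = fold(a.r());
            if (l instanceof Const c && c.v() == 0.0) return r;              // 0 + x -> x
            if (r instanceof Const c && c.v() == 0.0) return l;              // x + 0 -> x
            return new Add(l, r);
        }
        return e;
    }

    public static void main(String[] args) {
        // 1*m00 + 0*m10 collapses to just m00
        Expr e = new Add(new Mul(new Const(1), new Var("m00")),
                         new Mul(new Const(0), new Var("m10")));
        System.out.println(fold(e));
    }
}
```

Note that 0·x → 0 is only safe in a codegen setting if x is known to be finite (it changes NaN/Infinity behavior), which is a reasonable assumption for matrix elements but worth stating explicitly in the generator.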
-
OK, here's a wild idea... I mentioned partial evaluation in a previous comment. If you don't know what this is yet, prepare for your mind to be blown (mine was, when I first learned about it, and still is, every time I think about it). With partial evaluation, you turn your program into a DAG of operations, then inside the compiler you pre-compute every part of that DAG that depends only upon constant values. This includes partially evaluating even individual binary operations, e.g. folding a multiplication by a constant 0 or 1.

Now here's the crazy part -- read about Futamura projections here: https://en.wikipedia.org/wiki/Partial_evaluation

You may have heard that GraalVM employs partial evaluation in a very deep way to optimize code from any one of the many languages that it can compile to JVM bytecodes (or even lower to native code). I cannot believe they got this working so well even for one language, let alone for so many diverse languages, including dynamically-typed languages -- but it works and it's magical. You may be able to tie into the GraalVM APIs to do all the heavy lifting for you. For example, you can probably feed the code in form (2) from my previous comment into GraalVM, and have it generate the IR for form (1), then turn the IR back into Java source. In other words, partial evaluation should be able to unroll loops and remove identity operations for you, as well as handling function composition and inlining, so almost all the problems you need to solve are already solved if you simply use a robust partial evaluation system. This may amount to only a few API calls to the GraalVM compiler API (I think that's the Truffle module), although I haven't looked at the GraalVM API or source, so I don't know for sure whether you can use it this way, without full compilation.
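A classic toy example of partial evaluation, independent of GraalVM, is specializing `pow(x, n)` for a known constant exponent: the loop over `n` is "run" at generation time, leaving straight-line code with no loop at all. The generator below is a hypothetical illustration of that idea:

```java
// Toy partial evaluation: the exponent n is a compile-time constant, so the
// loop is executed at generation time, and only the multiplications survive
// into the emitted code.
public class PartialEval {
    public static String specializePow(int n) {
        String expr = (n == 0) ? "1.0" : "x"; // fold away the 1.0 base case when possible
        for (int i = 1; i < n; i++) expr += " * x";
        return "double pow" + n + "(double x) { return " + expr + "; }";
    }

    public static void main(String[] args) {
        System.out.println(specializePow(3));
        // double pow3(double x) { return x * x * x; }
    }
}
```

This is the same flavor of transformation that would turn an explicit-loop vector template (form (2)) into the unrolled, identity-free code of form (1), just applied by a general-purpose partial evaluator instead of a hand-written one.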
-
Yep, the biggest problem with this whole codegen stuff is projecting a solution sufficiently far into the future, anticipating what's to come and what to leverage. Like... I am also thinking about outputting TypeScript and even AssemblyScript for JavaScript server runtimes and the web.
-
As a bonus, you can write the scripting language itself as a Truffle module, so that your simple vector algebra language is a fully-supported GraalVM language, callable from any other GraalVM language! JOML can become a polyglot linear algebra solution. And since you mention Panama, a huge (huge) advantage of doing this all with GraalVM is that you can generate AOT-compiled native code versions from the same exact source code, with almost zero extra effort. In the long run, that may be a total game-changer.
-
Right now I roll my own AST and arithmetic-expression optimizer, with a bit of integrated symbolic simplification: discovering trigonometric identities in the AST (which can happen quite frequently with all those Matrix.rotate() methods using sin/cos when combining multiple methods) and rewriting them to simpler expressions, without actually having to numerically evaluate the expression.
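One concrete trig rewrite of the kind described above is the Pythagorean identity sin²(a) + cos²(a) → 1, which shows up when composed rotate() calls multiply out. The AST shapes below are hypothetical, not JOML's internal representation; only the rewrite rule itself is the point:

```java
// Sketch of one symbolic simplification: sin(a)*sin(a) + cos(a)*cos(a) -> 1,
// matched purely structurally on the AST, with no numeric evaluation.
public class TrigRewrite {
    public interface Expr {}
    public record Var(String name) implements Expr {}
    public record Const(double v) implements Expr {}
    public record Sin(Expr arg) implements Expr {}
    public record Cos(Expr arg) implements Expr {}
    public record Mul(Expr l, Expr r) implements Expr {}
    public record Add(Expr l, Expr r) implements Expr {}

    public static Expr simplify(Expr e) {
        // Match sin(a)*sin(a) + cos(a)*cos(a) with the same argument throughout.
        if (e instanceof Add add
                && add.l() instanceof Mul l && add.r() instanceof Mul r
                && l.l() instanceof Sin s1 && l.r() instanceof Sin s2
                && r.l() instanceof Cos c1 && r.r() instanceof Cos c2
                && s1.arg().equals(s2.arg())
                && c1.arg().equals(c2.arg())
                && s1.arg().equals(c1.arg()))
            return new Const(1.0); // Pythagorean identity
        return e;
    }

    public static void main(String[] args) {
        Expr angle = new Var("angle");
        Expr e = new Add(new Mul(new Sin(angle), new Sin(angle)),
                         new Mul(new Cos(angle), new Cos(angle)));
        System.out.println(simplify(e));
    }
}
```

A real optimizer would of course normalize operand order and commuted forms first (cos²+sin², sin·sin appearing as a squared node, etc.) so that one canonical pattern suffices.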
-
I've reached a new milestone. Using this pretty and concise template for Matrix4f.rotateX:

@Method
public static MatrixValue rotateX(@This MatrixValue thiz, @ElementTyped ScalarValue angle) {
ScalarValue s = new ScalarSinFunctionValue(angle), c = new ScalarCosFunctionValue(angle);
return thiz.mul(m3x3(one, zero, zero, zero, c, s, zero, s.negate(), c));
}

I can generate this Java code:

public Matrix4f rotateX(float angle) {
float v0 = java.lang.Math.sin(angle);
float v1 = java.lang.Math.cos(angle);
float rm10 = Math.fma(this.m20, v0, this.m10 * v1);
float rm11 = Math.fma(this.m21, v0, this.m11 * v1);
float rm12 = Math.fma(this.m22, v0, this.m12 * v1);
float rm13 = Math.fma(this.m23, v0, this.m13 * v1);
float rm20 = Math.fma(this.m20, v1, this.m10 * -v0);
float rm21 = Math.fma(this.m21, v1, this.m11 * -v0);
float rm22 = Math.fma(this.m22, v1, this.m12 * -v0);
float rm23 = Math.fma(this.m23, v1, this.m13 * -v0);
this.m10 = rm10;
this.m11 = rm11;
this.m12 = rm12;
this.m13 = rm13;
this.m20 = rm20;
this.m21 = rm21;
this.m22 = rm22;
this.m23 = rm23;
return this;
}

There is a lot in play here (folding multiplications/additions against the constant 0/1 matrix elements, hoisting sin/cos into locals, contracting multiply-adds into Math.fma).
Now, what makes this scheme so good is that we can use actual clean algorithms (albeit currently implemented via an internal DSL / AST API) to formulate operations, like Gauss-Jordan for matrix inversion, and then shake them down to optimized scalar operations (or vector operations if we target a backend with vector types/operations). So, the algorithm implemented in each matrix AST node only needs to be written once, at the level of whole matrices and vectors.
-
Alright, the next milestone will be: how to determine all the possible/valid parameter types (such as precision and dimension) for all the overloads of a method. Let's say we start with a template for matrix-vector multiplication like the following:

@Method
public static VectorValue mul(@This MatrixValue thiz, @Receiver VectorValue v) {
return thiz.mul(v);
}

(@Receiver tells the code generator which argument will receive the final result in case we don't generate a dest-parameter variant.)

So, with the above template, the question is: given a specific specialization (precision + dimension) of the this matrix, which specializations are possible/valid for the vector parameter? So, what I was going to do is a recursive AST pass which flows and constrains the possible types of any particular visited operation, yielding (for any formal parameter) on recursive ascent the actual possible specializations for that parameter (given the target language's rules for implicit primitive type coalescing/casting - possibly also with support for explicit narrowing when wanted).
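The core of that constraint-flow pass, stripped down to a single rule, might look like the sketch below. The enum, the method name, and the widening rule chosen here are illustrative assumptions about how such a pass could report valid specializations:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the type-flow idea: given the precision already
// fixed for "this", report which operand precisions an operation accepts,
// following Java's implicit widening (float -> double, never the reverse).
public class TypeFlow {
    public enum Precision { FLOAT, DOUBLE }

    public static Set<Precision> validOperandPrecisions(Precision thisPrecision) {
        return thisPrecision == Precision.DOUBLE
            ? EnumSet.of(Precision.FLOAT, Precision.DOUBLE) // float widens into a double context
            : EnumSet.of(Precision.FLOAT);                   // no implicit narrowing
    }

    public static void main(String[] args) {
        System.out.println(validOperandPrecisions(Precision.DOUBLE));
        System.out.println(validOperandPrecisions(Precision.FLOAT));
    }
}
```

In the full pass, each AST node would combine constraints like this from its children on the way back up, so the set of legal overloads falls out of the template itself rather than being listed by hand.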
-
I think that the general principle should be that the operand precision, the internal computation precision, and the result precision all match by default.
That said, it's also useful in the cross product of all method attributes to support higher and/or lower precision for at least the two operands (e.g. a float vector combined with a double operand). The trouble is that Java method overloading provides no way to select on the return type of a function, only on the parameter types. To support multiple different result types for the same parameter types and method name, you have to add a suffix or something to the method names.

There is probably still a case to be made for precision-narrowing operators though, since one huge goal for this codegen solution should be to avoid unnecessary object allocations like the plague. Having to fit a double-precision result into a float-precision receiver will create an extra intermediate object and require some boilerplate code if there isn't a method for it. Therefore, I think every method with a dest parameter should also get a variant whose dest (and therefore result type) has lower precision than the operands.

As I mentioned before though, there should also be one extra high-precision version of any method where both operands are float but the internal computation and result are double.
And by the way, you can then figure out which variant to use for codegen when composing AST DAGs -- the intermediate-value precision of the outer DAG becomes the parameter precision for the inner DAG.
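The return-type limitation mentioned above is easy to demonstrate: two methods may not differ only in return type, so a higher-precision result needs its own name. The class and the `mulDouble` suffix below are hypothetical illustrations of that workaround, not JOML API:

```java
// Illustration of why result-precision variants need a name suffix: Java
// cannot overload on return type alone, so mul(float) and a double-result
// version of the same operation must have different names.
public class SuffixNaming {
    public static class Vec3f {
        public final float x, y, z;
        public Vec3f(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

        // Default: same-precision operands, computation, and result.
        public Vec3f mul(float s) { return new Vec3f(x * s, y * s, z * s); }

        // High-precision variant: float operands, double-precision internal
        // computation and result. A distinct name is mandatory here.
        public double[] mulDouble(float s) {
            return new double[] { (double) x * s, (double) y * s, (double) z * s };
        }
    }

    public static void main(String[] args) {
        Vec3f v = new Vec3f(1f, 2f, 3f);
        System.out.println(v.mul(2f).y);
        System.out.println(v.mulDouble(2f)[2]);
    }
}
```

A codegen solution gets to pick the suffix convention once and apply it uniformly across the whole API, which is exactly the kind of consistency that is hard to maintain by hand.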
-
@httpdigest @lukehutch Hey guys, finally had some time to go through this thread, great stuff! I would like to point out that JOML would benefit tremendously not just from Panama, but also from Valhalla. It would unlock a very different set of API design and implementation choices. We don't know when it's going to be ready (JOML 2.0? 3.0?), but you may want to think about how it would affect the solution you're working on right now. Also, love the work on optimizing sin/cos away (a la "finish your derivations, please"). I wonder if this could be more actively encouraged at the API level even.
-
JOML has a large number of methods, mostly brought about by the cross-product of many different concerns, e.g. precision (float vs. double, as well as mixed precision, e.g. Vector3d.add(Vector3f) and vice versa), normalization (Unit methods vs. regular), in-place operations vs. operations with a dest parameter, postfix vs. prefix application of binary operators (mul vs. premul), etc.

A large amount of this is simply boilerplate, and could be automated with some sort of codegen solution. As a benefit, not only would the work needed to maintain JOML go down (once the codegen is done, at least), but the error rate would go down (fixing a bug in one place would fix it in all variants), and API consistency would increase.
I don't know of an off-the-shelf solution for this, so it would probably need to be custom code. Maybe this could be the basis for the next-gen JOML.