Replies: 27 comments 8 replies
-
Code generation is a topic I've of course been aware of (also when looking at LWJGL3) and have been thinking about for the last two years or so. So far I've concluded that, given the amount of work needed to cover all such concerns (with JavaDoc documentation being by far the most complicated to handle), code generation is simply not worth it. So far, I've had less trouble getting issues filed (missing method overloads, wrong documentation, a method implementation wrong in Vector3f but not in Vector3d) and fixing them within a few minutes than I would have coming up with a good codegen solution.
-
I understand codegen is not trivial! However, I don't think the Javadoc generation should be that tricky. For example, changing parameter types in the documentation from one precision to another is largely mechanical. I agree that codegen would bring up many other possibilities -- including developing a small linear algebra DSL that could be used to turn MATLAB-like expressions into optimal Java code, while creating as few intermediate objects as possible.
-
I think there is some cool meta-programming you can do with the Java annotation processor if you want to optimize stuff at compile time. Dagger and Micronaut are good examples of this, and the performance benefits are huge. Still, it's more work fiddling with an intermediary templating language than just copying a method for two additional cases. TeraMath uses StringTemplate to generate code for its math library, but it's kind of a headache to work with. Just wondering what you have in mind? Most of the changes are pretty trivial, so codegen just seems to add more complexity to an already trivial process? https://github.com/MovingBlocks/TeraMath/blob/develop/src/generator/resources/Vector.st
-
I had in mind something like those StringTemplate templates, yes -- although I can see how this could be a pain to deal with. It would be better to represent each data type abstractly rather than spelling it out in the template. The idea would be to implement the actual operator logic just a single time, and type-ify the logic for each precision (float, double).
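To make the "implement the logic once, type-ify per precision" idea concrete, here is a minimal sketch. It is purely illustrative -- the template string, the `$S` placeholder, and the `expand` method are all hypothetical, not anything JOML or TeraMath actually uses:

```java
import java.util.List;

// Hypothetical sketch: one operator template, expanded once per precision suffix.
public class PrecisionExpand {
    static final String OP_TEMPLATE =
        "public Vector3$S add(Vector3$S other) {\n"
      + "    this.x += other.x; this.y += other.y; this.z += other.z;\n"
      + "    return this;\n"
      + "}";

    // Substitute the precision suffix ($S -> "f" or "d") into the template.
    public static String expand(String suffix) {
        return OP_TEMPLATE.replace("$S", suffix);
    }

    public static void main(String[] args) {
        for (String s : List.of("f", "d"))
            System.out.println(expand(s));
    }
}
```

A real solution would also have to substitute the element type (`float`/`double`) inside method bodies and Javadoc, which is where string templating starts to hurt and an AST-based approach looks more attractive.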
-
@lukehutch Exactly! That's what I was thinking of as well. I've also looked into https://github.com/javaparser/javaparser so that the actual logic of an operator is implemented in Java and then semantic AST transformations are done on it to produce the final Java source code. However, things get even more complicated when we think about the Panama Vector API. You would then not formulate vector operators on the individual x, y, z fields (like now) but only ever use vector types, and flatten the operations for non-Panama-Vector-API code generation targets.
-
You might be over-complicating things if you want to write the operators in Java, and use Javaparser to produce the AST. I suggest simply building the AST manually, using nested calls to AST node constructors. It will be more awkward to manually construct the AST this way than to implement the same operators in Java code; however, manually constructing the AST will allow you to represent operators in a much more generic (e.g. type-agnostic) way.
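A tiny sketch of what "nested calls to AST node constructors" could look like. The node names (`Field`, `BinOp`) and the `gen` method are hypothetical, meant only to show how one type-agnostic node can serve every precision:

```java
// Minimal, hypothetical AST sketch: nodes are type-agnostic; the precision is
// only chosen when code is generated from the node.
public class AstSketch {
    public interface Expr { String gen(String type); }

    public record Field(String owner, String name) implements Expr {
        public String gen(String type) { return owner + "." + name; }
    }

    public record BinOp(String op, Expr l, Expr r) implements Expr {
        public String gen(String type) {
            return "(" + l.gen(type) + " " + op + " " + r.gen(type) + ")";
        }
    }

    public static void main(String[] args) {
        // this.x + other.x, built by nesting node constructors
        Expr add = new BinOp("+", new Field("this", "x"), new Field("other", "x"));
        // The same node works unchanged for float or double targets.
        System.out.println(add.gen("float"));
    }
}
```

The awkwardness mentioned above is visible even here: nested constructor calls get deep quickly, which is exactly why a small DSL in front of the node API starts to look appealing.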
-
Hey @lukehutch I've begun working on a codegen solution now - it's going to be epic! :) This template:

class Vector {
static void add(Context ctx) {
Value other = loadparam(selfType().andCoalescables());
VectorValue result = vectorvalue();
for (int i = 0; i < selfType().dimensions(); i++)
result.set(i, binop(ADD, extractelement(thiz(), i), extractelement(other, i)));
defineResult(result);
}
}

can generate this:

public Vector3f add(Vector3f other) {
float rx = this.x + other.x;
float ry = this.y + other.y;
float rz = this.z + other.z;
this.x = rx;
this.y = ry;
this.z = rz;
  return this;
}
public Vector3f add(Vector3f other, Vector3f dest) {
float rx = this.x + other.x;
float ry = this.y + other.y;
float rz = this.z + other.z;
dest.x = rx;
dest.y = ry;
dest.z = rz;
  return dest;
}

I should've done this ages ago. :)
-
@httpdigest any clue when you would start publishing some of the code? I should be able to help with testing the library and making sure things work correctly.
-
@pollend thanks for the offer! Though, before I have an actual usable solution, it's definitely going to take some time. Right now I'm exploring a few options/ways forward. My first very hacky approach was: define a very simple AST for a "value graph" (much like Graal's nodes API), which will be how I am going to define the methods (templates) of JOML's classes. But then there is the whole problem of how to lower this AST to actual Java code. You really don't want to do StringBuilder.append("Java-code-snippet") or string-based templating. We should use a proper solution with a solid AST for the Java language, which is also what https://github.com/javaparser/javaparser can do. I am also exploring ways to use LLVM or Graal to do the actual register/variable allocation and AST optimization, and then maybe generate Java code from that. But that might turn out to be infeasible, since their ASTs are too low-level (i.e. there is no local variable assignment node in Graal's nodes API). Basically, lowering the value graph back into readable Java code is the main challenge right now.
So, I'm right in the midst of researching how others (most notably Graal, LLVM and emscripten/relooper) do things - though they are geared towards lowering code, not towards generating high-level language code again from an IR. Still, I would like to reuse those mature infrastructures as much as possible.
-
Yeah man :-D I am quite amazed at how many variants of the same code are maintained by hand. It must have been a ton of work to keep the API consistent and in sync with itself. I was going to start on this myself, and offer it as a prototype to kick off a larger implementation effort, but I have been completely swamped over the last year, and it won't let up next year, I think. However, maybe I can offer some of my ideas here.

You probably want unary and ternary operators too, e.g. for negation and fused multiply-add.

Don't worry about unnecessary assignments -- the compiler should optimize those out. In fact, in the code generator I would extract all fields of all passed-in objects into local variables, and do all computation on local variables. I wouldn't worry about register allocation either -- you can create as many local variables as you want, and either the JIT engine (if using ASM) or the compiler (if compiling to Java source) will take care of register or local variable optimization.

AST nodes should be fully composable, and they should be able to be assembled into either a tree or a DAG (a DAG could be used to avoid re-computing earlier expressions). You could probably also optimize away re-computation of shared expressions by searching for common subgraphs of the AST and coalescing the AST nodes into a DAG automatically.

One of the biggest opportunities here is to avoid the allocation of intermediate objects, by fully expanding a composed AST -- including function calls into other ASTs -- into flat method code that only uses local variables. You also have opportunities to optimize away JRE library method calls; in some JOML code I wrote recently, something like 60% of the time was taken by calls into the JRE math library.

It's worth it to make a list of all the cross-products you want to generate API calls for: precision, dimension, in-place vs. dest, and so on.
When you consider all the variants that can be generated using codegen, and the amount of work this will save you in the long run, the codegen approach really makes sense! There are other things that sort of form part of the cross-product, but probably need a different implementation in each case, so they may need to be written out as separate methods.
Compilation from AST nodes could be used in several ways, including ahead-of-time generation of Java source and runtime bytecode generation.
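To get a feel for the size of that cross-product, here is a sketch that enumerates method signatures from a single template. The type names and the specific axes (precision, dimension, in-place vs. dest) are illustrative assumptions, not JOML's actual generator:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate the variant cross-product
// (precision x dimension x in-place/dest) from one "add" template.
public class VariantCrossProduct {
    public static List<String> signatures() {
        List<String> sigs = new ArrayList<>();
        for (String prec : List.of("f", "d"))                 // float / double
            for (int dim : List.of(2, 3, 4))                  // Vector2..Vector4
                for (boolean dest : List.of(false, true)) {   // in-place vs. dest parameter
                    String t = "Vector" + dim + prec;
                    sigs.add(t + " add(" + t + " other"
                            + (dest ? ", " + t + " dest" : "") + ")");
                }
        return sigs;
    }

    public static void main(String[] args) {
        // 2 precisions * 3 dimensions * 2 dest-variants = 12 methods from one template.
        signatures().forEach(System.out::println);
    }
}
```

Adding the mixed-precision and prefix/postfix axes mentioned above multiplies this further, which is exactly why maintaining the variants by hand is so costly.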
-
Yep, that's what I also briefly had in mind: using an interface for the vector and matrix classes, with a generated class that basically only contains invokedynamic call sites with a bootstrap method for all methods, so that calling any particular method links to a MethodHandle of a generated static method in a dynamically generated class. And generating JVM bytecode is in most parts easier than generating Java code.
I'd agree with you if there wasn't the bytecode length threshold in HotSpot that keeps methods from being inlined into their callers. There was actually one commit where I basically spent all my time optimizing the effective bytecode length of all JOML methods. That's what https://github.com/JOML-CI/JOML/blob/main/buildhelper/InlineAdvisor.java is for. So, generating optimal (as in fewest bytecodes possible) methods is also a goal. :)
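For readers unfamiliar with the linkage idea above, here is a simplified sketch of the MethodHandle half of it. A real invokedynamic bootstrap method returns a `CallSite` to the JVM; this sketch only shows the lookup-and-invoke step, and the class and method names are hypothetical stand-ins for generated code:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Simplified sketch of linking a call to a generated static method via a
// MethodHandle, the way an invokedynamic bootstrap would resolve it.
public class LinkSketch {
    // Stand-in for a generated static implementation method.
    public static float add(float a, float b) { return a + b; }

    public static float linkAndCall(float a, float b) {
        try {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                LinkSketch.class, "add",
                MethodType.methodType(float.class, float.class, float.class));
            // invokeExact requires the call signature to match exactly.
            return (float) mh.invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(linkAndCall(1.5f, 2.5f));
    }
}
```

In the real scheme, the lookup would target a freshly defined dynamic class rather than a pre-existing one, and the linkage would happen once per call site via the bootstrap method.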
-
Well, I sort of listed those four implementation methods in order of difficulty. Personally I would start with (1), and just try to create a "better JOML". But I think you'll find out pretty quickly that it's worth it to write a script-language-to-AST converter, i.e. (2), since building an AST by hand using a bunch of nested constructor calls (or static factory methods, like the grammar-building methods in https://github.com/lukehutch/pikaparser#grammar-description-file-arithmeticgrammar) gets unwieldy quickly.

PS: the combinatorial explosion only occurs for compilation methods (1) and (2). Don't worry about combinatorial explosion in the API. It's much better to have all possible functionality available in the API, and be overwhelmed with the number of options that pop up in an IDE, than it is to have to manually write boilerplate because the API is missing something basic and commonly needed. You can just create some good documentation that shows the high-level API calls without all the combinatorial versions, so that a user can look through the available methods much more quickly than by scanning the long list that pops up in their IDE. And you can carefully consider how to name the methods, so that the user can "drill down" to what they need, from the highest-level to the lowest-level concept that specifies the functionality they're looking for.
-
Honestly, I would worry far more about having the code generator inline every nested call within a method's AST, including inlining all the other JOML methods it calls -- so that every JOML method is "flat code" that makes no method calls except to JRE libraries -- than I would worry about JOML methods getting inlined into their callers. I suspect you will get far more bang for your buck that way. You have had to do a lot of this in your mind already while writing JOML, e.g. wherever you manually flattened the math into a method instead of calling other JOML methods.

Also remember that premature optimization is the root of all evil. It's worth getting this working first, and then figuring out where the code generator is generating suboptimal code, and fixing the code generator, so that the fix applies to all methods. Ultimately, if you can't fit an optimization into the byte limit, you can't fit it.

As always with optimization, it's not worth making assumptions about fine-grained optimizations until you have profiled the heck out of the code.
-
@lukehutch Thanks for your input, I really appreciate it!
-
You're welcome. By the way, I agree with you that debuggability is hugely important. For this reason, I think doing AOT codegen to Java source is inherently going to prove the most valuable. I took a look to see what sorts of Java code generator libraries are out there, and there are a few, like https://github.com/square/javapoet (and there are probably better libraries than this; I didn't look very hard). But to be honest, the internals of JOML methods are really so simple that I don't think you should take on the complexity of using a library. You just need to be able to declare methods, assign to variables and fields, and evaluate basic arithmetic (including possibly accelerating arithmetic using Math.fma).
-
Another idea: if you write an interpreter for the scripting language (or really for the AST structure), then you could automatically generate test-case code, supplying random values to all method calls, and making sure the compiled code and the interpreter come up with the same results. This would save you a ton of time writing test cases. (Specifically, this tests that codegen is working as expected.)
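The differential-testing idea above can be sketched in a few lines. Here the "interpreter" and the "generated code" are both reduced to a single trivial add operation (hypothetical stand-ins), but the shape of the harness is the point: same random inputs into both paths, compare results:

```java
import java.util.Random;

// Sketch of differential testing: run the same operation through an
// interpreter path and a generated-code path on random inputs and compare.
public class DifferentialTest {
    // Interpreter path: would walk the AST; reduced here to one add node.
    static float interpretAdd(float a, float b) { return a + b; }

    // Stand-in for the flat generated code for the same AST.
    static float generatedAdd(float a, float b) { return a + b; }

    public static boolean agree(long seed, int trials) {
        Random rnd = new Random(seed); // fixed seed keeps failures reproducible
        for (int i = 0; i < trials; i++) {
            float a = rnd.nextFloat() * 200f - 100f;
            float b = rnd.nextFloat() * 200f - 100f;
            if (interpretAdd(a, b) != generatedAdd(a, b)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree(42L, 10_000));
    }
}
```

For floating-point operations whose generated form reassociates or contracts (e.g. via `Math.fma`), an exact `!=` comparison is too strict; an ULP-based tolerance would be needed instead.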
-
@lukehutch You're right, it is really no big deal to generate Java source myself without using a library, and many pieces are already falling into place. I've got a concrete syntax tree representation for all the Java syntax elements (class, fields, constructors, methods, statements, expressions) that I need, and I can simply toString() them to generate semi-formatted Java source, which I then run through Google's google-java-format library to get properly formatted Java sources.
-
Exciting stuff. It's a super-cool project, so I'm glad you have caught the vision for it! I wish I had time to contribute to this, because I think it will be a really awesome project to work on, but to be honest it's probably better for one person to get the initial prototype working anyway.

In the scripting language, you probably want to define low-level operators (element access, scalar arithmetic) and build the vector operations from those. That is, you have options, with increasing levels of complexity:

(1) Manually write code in the scripting language for each number of dimensions -- but not for each level of precision.
(2) Write the code once with an explicit loop over the dimensions. This will produce the same code as the above, but requires loop unrolling, and absorbing identities -- in this case the initial "base case for folding" (e.g. a zero accumulator that the first addition folds into).
(3) Implicit looping (probably best of all, from a flexibility point of view, and from the perspective of implementation simplicity compared to explicit looping): formulate operations on whole vectors and let the generator expand them per element.
As I mentioned previously, you should assume for the basic form of all binary operations that the input types have the same precision (both float, or both double). However, as I pointed out previously, there's a fourth dimension in the cross product, where the internal precision of the computation can be forced to double even when the operands are float.
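Option (2) above -- explicit loops that get unrolled, with the identity "base case" folded away -- can be illustrated with a tiny generator. Everything here (the `a.c0`/`b.c0` component naming, the `dot` helper) is a hypothetical stand-in:

```java
// Sketch of loop unrolling with identity folding: generate an unrolled
// dot-product expression from a dimension count, without emitting the
// "0.0f + ..." base case that a naive accumulator loop would produce.
public class UnrollSketch {
    public static String dot(int dim) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < dim; i++) {
            if (i > 0) sb.append(" + "); // identity 0 + x == x already absorbed
            sb.append("a.c").append(i).append(" * b.c").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dot(3)); // a.c0 * b.c0 + a.c1 * b.c1 + a.c2 * b.c2
    }
}
```

Here the folding is baked into the generator by hand; in a real system the generator would emit the naive accumulator form and a later simplification pass would absorb the zero, which is exactly the "absorbing identities" work mentioned in option (2).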
-
Well, I'm certainly glad to at least discuss these things here, which definitely helps me stay on track and motivated. So, keep the ideas coming. Now that I've got some ideas about how to generate the vector classes, I just looked into how to do that for the matrix classes... and it gets really involved there. The part that I don't like is that it's really like the vector types here: you can only formulate the operations from a very high-level view. Basically, the vector "add" operation now looks like:

@Operator(name = "add")
static Value vadd(@This Value thiz, @Selftyped Value v) {
return thiz.vadd(v); // <- build a dedicated "this is a vector add instruction" AST value
}

This is kind of okay when generating the vector classes, I guess, since codegen will depend on how vector operations are actually manifested in Java code (flattened to scalar operations, or using the Java Vector API). But it's quite the same for the matrix classes, because literally every operation (multiplication, componentwise multiplication, componentwise addition, determinant, ...) also depends on how codegen will generate it in the end, with options ranging from "flattening to scalar operations" (just like now) to "using the Vector API to optimize those operations" to "generating SSE/AVX native code when hopefully we have a fast FFI solution in Java with Panama". So, I fear that a matrix multiply template method would simply look like:

@Operator(name = "mul")
static Value mmul(@This Value thiz, @Selftyped Value v) {
return thiz.mmul(v); // <- build a dedicated "this is a matrix multiplication instruction" AST value
}

in order to keep the semantic information that "this is a matrix multiplication, so lower it to any code you want", instead of actually formulating the template method using scalar or vector operations. So, in the end there is not much use for template classes/methods, since everything is decided in codegen anyway. The only actual use I see now is for "combining" multiple methods (i.e. dependency graphs) in order to avoid allocating temp objects or reading/writing fields.

EDIT: Well... maybe I've painted too dark a picture regarding templating. We need matrices and multiplication as first-class citizens in the graph; however, it starts and ends right there. There are still many, many JOML matrix methods that just do more other stuff, like the various Matrix4f#rotateX/Y/Z... methods, which can still be expressed as a 3x3 matrix constructor node/value with some built-in function nodes like sin/cos that create a matrix, followed by a matrix multiplication intrinsic value. Something like:

@Method(name = "rotateX")
public static Value rotateX(@This Value thiz, @Elementtyped Value angle) {
Value s = builtin(SIN, angle), c = builtin(COS, angle);
return thiz.mmul(m3x3(one, zero, zero, zero, c, s, zero, s.negate(), c));
}

with one and zero being constant element values in the graph. The codegen can then take care of lowering that to scalar operations and folding away unnecessary multiplications/additions, based on whether a particular matrix element is the identity for the respective operation, while other codegen approaches with SSE/AVX or the Vector API would just carry out the full multiplication, which will in the end be faster.
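The identity-folding step described above can be shown on a miniature expression AST. The node types here are hypothetical, but the rewrite rules (0·x → 0, 1·x → x, 0+x → x) are exactly the ones that collapse a rotation matrix full of constant zeros and ones into the short scalar form:

```java
// Sketch of folding multiplications/additions whose operand is a constant
// identity element (0 for +, 1 for *), as happens when lowering a rotation
// matrix whose elements are largely constant zeros and ones.
public class FoldSketch {
    public interface Expr {}
    public record Const(double v) implements Expr {}
    public record Var(String name) implements Expr {}
    public record Mul(Expr l, Expr r) implements Expr {}
    public record Add(Expr l, Expr r) implements Expr {}

    public static Expr fold(Expr e) {
        if (e instanceof Mul m) {
            Expr l = fold(m.l()), r = fold(m.r());
            if (l instanceof Const c && c.v() == 0.0) return new Const(0.0); // 0 * x -> 0
            if (r instanceof Const c && c.v() == 0.0) return new Const(0.0); // x * 0 -> 0
            if (l instanceof Const c && c.v() == 1.0) return r;              // 1 * x -> x
            if (r instanceof Const c && c.v() == 1.0) return l;              // x * 1 -> x
            return new Mul(l, r);
        }
        if (e instanceof Add a) {
            Expr l = fold(a.l()), r = fold(a.r());
            if (l instanceof Const c && c.v() == 0.0) return r;              // 0 + x -> x
            if (r instanceof Const c && c.v() == 0.0) return l;              // x + 0 -> x
            return new Add(l, r);
        }
        return e;
    }

    public static void main(String[] args) {
        // 1*m00 + 0*m10 collapses to just m00
        Expr e = new Add(new Mul(new Const(1), new Var("m00")),
                         new Mul(new Const(0), new Var("m10")));
        System.out.println(fold(e));
    }
}
```

Note that 0·x → 0 is only safe in a codegen setting if x is known to be finite (it changes NaN/Infinity behavior), which is a reasonable assumption for matrix elements but worth stating explicitly in the generator.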
-
OK, here's a wild idea... I mentioned partial evaluation in a previous comment. If you don't know what this is yet, prepare for your mind to be blown (mine was, when I first learned about it, and still is, every time I think about it). With partial evaluation, you turn your program into a DAG of operations, then inside the compiler you pre-compute every part of that DAG that depends only upon constant values. This includes partially evaluating even individual binary operations, e.g. folding a multiplication by a constant 0 or 1.

Now here's the crazy part -- read about Futamura projections here: https://en.wikipedia.org/wiki/Partial_evaluation

You may have heard that GraalVM employs partial evaluation in a very deep way to optimize code from any one of the many languages that it can compile to JVM bytecodes (or even lower to native code). I cannot believe they got this working so well even for one language, let alone for so many diverse languages, including dynamically-typed languages -- but it works and it's magical. You may be able to tie into the GraalVM APIs to do all the heavy lifting for you. For example, you can probably feed the code in form (2) from my previous comment into GraalVM, and have it generate the IR for form (1), then turn the IR back into Java source. In other words, partial evaluation should be able to unroll loops and remove identity operations for you, as well as handling function composition and inlining, so almost all the problems you need to solve are already solved if you simply use a robust partial evaluation system. This may amount to only a few API calls to the GraalVM compiler API (I think that's the Truffle module), although I haven't looked at the GraalVM API or source, so I don't know for sure whether you can use it this way, without full compilation.
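A classic toy example of partial evaluation, independent of GraalVM, is specializing `pow(x, n)` for a known constant exponent: the loop over `n` is "run" at generation time, leaving straight-line code with no loop at all. The generator below is a hypothetical illustration of that idea:

```java
// Toy partial evaluation: the exponent n is a compile-time constant, so the
// loop is executed at generation time, and only the multiplications survive
// into the emitted code.
public class PartialEval {
    public static String specializePow(int n) {
        String expr = (n == 0) ? "1.0" : "x"; // fold away the 1.0 base case when possible
        for (int i = 1; i < n; i++) expr += " * x";
        return "double pow" + n + "(double x) { return " + expr + "; }";
    }

    public static void main(String[] args) {
        System.out.println(specializePow(3));
        // double pow3(double x) { return x * x * x; }
    }
}
```

This is the same flavor of transformation that would turn an explicit-loop vector template (form (2)) into the unrolled, identity-free code of form (1), just applied by a general-purpose partial evaluator instead of a hand-written one.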
-
Yep, the biggest problem with this whole codegen stuff is projecting a solution sufficiently far into the future, anticipating what's to come and what to leverage. Like... I am also thinking about outputting TypeScript and even AssemblyScript for JavaScript server runtimes and the web.
-
As a bonus, you can write the scripting language itself as a Truffle module, so that your simple vector algebra language is a fully-supported GraalVM language, callable from any other GraalVM language! JOML can become a polyglot linear algebra solution. And since you mention Panama, a huge (huge) advantage of doing this all with GraalVM is that you can generate AOT-compiled native code versions from the same exact source code, with almost zero extra effort. In the long run, that may be a total game-changer.
-
Right now I roll my own AST and arithmetic-expression optimizer, with a bit of integrated symbolic simplification: discovering trigonometric identities in the AST (which can happen quite frequently with all those Matrix.rotate() methods using sin/cos when combining multiple methods) and rewriting them to simpler expressions, without actually having to numerically evaluate the expression.
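One concrete trig rewrite of the kind described above is the Pythagorean identity sin²(a) + cos²(a) → 1, which shows up when composed rotate() calls multiply out. The AST shapes below are hypothetical, not JOML's internal representation; only the rewrite rule itself is the point:

```java
// Sketch of one symbolic simplification: sin(a)*sin(a) + cos(a)*cos(a) -> 1,
// matched purely structurally on the AST, with no numeric evaluation.
public class TrigRewrite {
    public interface Expr {}
    public record Var(String name) implements Expr {}
    public record Const(double v) implements Expr {}
    public record Sin(Expr arg) implements Expr {}
    public record Cos(Expr arg) implements Expr {}
    public record Mul(Expr l, Expr r) implements Expr {}
    public record Add(Expr l, Expr r) implements Expr {}

    public static Expr simplify(Expr e) {
        // Match sin(a)*sin(a) + cos(a)*cos(a) with the same argument throughout.
        if (e instanceof Add add
                && add.l() instanceof Mul l && add.r() instanceof Mul r
                && l.l() instanceof Sin s1 && l.r() instanceof Sin s2
                && r.l() instanceof Cos c1 && r.r() instanceof Cos c2
                && s1.arg().equals(s2.arg())
                && c1.arg().equals(c2.arg())
                && s1.arg().equals(c1.arg()))
            return new Const(1.0); // Pythagorean identity
        return e;
    }

    public static void main(String[] args) {
        Expr angle = new Var("angle");
        Expr e = new Add(new Mul(new Sin(angle), new Sin(angle)),
                         new Mul(new Cos(angle), new Cos(angle)));
        System.out.println(simplify(e));
    }
}
```

A real optimizer would of course normalize operand order and commuted forms first (cos²+sin², sin·sin appearing as a squared node, etc.) so that one canonical pattern suffices.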
-
I've reached a new milestone. Using this pretty and concise template for Matrix4f.rotateX:

@Method
public static MatrixValue rotateX(@This MatrixValue thiz, @ElementTyped ScalarValue angle) {
ScalarValue s = new ScalarSinFunctionValue(angle), c = new ScalarCosFunctionValue(angle);
return thiz.mul(m3x3(one, zero, zero, zero, c, s, zero, s.negate(), c));
}

I can generate this Java code:

public Matrix4f rotateX(float angle) {
float v0 = java.lang.Math.sin(angle);
float v1 = java.lang.Math.cos(angle);
float rm10 = Math.fma(this.m20, v0, this.m10 * v1);
float rm11 = Math.fma(this.m21, v0, this.m11 * v1);
float rm12 = Math.fma(this.m22, v0, this.m12 * v1);
float rm13 = Math.fma(this.m23, v0, this.m13 * v1);
float rm20 = Math.fma(this.m20, v1, this.m10 * -v0);
float rm21 = Math.fma(this.m21, v1, this.m11 * -v0);
float rm22 = Math.fma(this.m22, v1, this.m12 * -v0);
float rm23 = Math.fma(this.m23, v1, this.m13 * -v0);
this.m10 = rm10;
this.m11 = rm11;
this.m12 = rm12;
this.m13 = rm13;
this.m20 = rm20;
this.m21 = rm21;
this.m22 = rm22;
this.m23 = rm23;
return this;
}

There is a lot in play here (folding multiplications/additions against the constant 0/1 matrix elements, hoisting sin/cos into locals, contracting multiply-adds into Math.fma).
Now, what makes this scheme so good is that we can use actual clean algorithms (albeit currently implemented via an internal DSL / AST API) to formulate operations, like Gauss-Jordan for matrix inversion, and then shake them down to optimized scalar operations (or vector operations if we target a backend with vector types/operations). So, the algorithm implemented in each matrix AST node only needs to be written once, at the level of whole matrices and vectors.
-
Alright, the next milestone will be: how to determine all the possible/valid parameter types (such as precision and dimension) for all the overloads of a method. Let's say we start with a template for matrix-vector multiplication like the following:

@Method
public static VectorValue mul(@This MatrixValue thiz, @Receiver VectorValue v) {
return thiz.mul(v);
}

(@Receiver tells the code generator which argument will receive the final result in case we don't generate a dest-parameter variant.)

So, with the above template, the question is: given a specific specialization (precision + dimension) of the this matrix, which specializations are possible/valid for the vector parameter? So, what I was going to do is a recursive AST pass which flows and constrains the possible types of any particular visited operation, yielding (for any formal parameter) on recursive ascent the actual possible specializations for that parameter (given the target language's rules for implicit primitive type coalescing/casting - possibly also with support for explicit narrowing when wanted).
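The core of that constraint-flow pass, stripped down to a single rule, might look like the sketch below. The enum, the method name, and the widening rule chosen here are illustrative assumptions about how such a pass could report valid specializations:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the type-flow idea: given the precision already
// fixed for "this", report which operand precisions an operation accepts,
// following Java's implicit widening (float -> double, never the reverse).
public class TypeFlow {
    public enum Precision { FLOAT, DOUBLE }

    public static Set<Precision> validOperandPrecisions(Precision thisPrecision) {
        return thisPrecision == Precision.DOUBLE
            ? EnumSet.of(Precision.FLOAT, Precision.DOUBLE) // float widens into a double context
            : EnumSet.of(Precision.FLOAT);                   // no implicit narrowing
    }

    public static void main(String[] args) {
        System.out.println(validOperandPrecisions(Precision.DOUBLE));
        System.out.println(validOperandPrecisions(Precision.FLOAT));
    }
}
```

In the full pass, each AST node would combine constraints like this from its children on the way back up, so the set of legal overloads falls out of the template itself rather than being listed by hand.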
-
I think that the general principle should be that the operand precision, the internal computation precision, and the result precision all match by default.
That said, it's also useful in the cross product of all method attributes to support higher and/or lower precision for at least the two operands (e.g. a float vector combined with a double operand). The trouble is that Java method overloading provides no way to select on the return type of a function, only on the parameter types. To support multiple different result types for the same parameter types and method name, you have to add a suffix or something to the method names.

There is probably still a case to be made for precision-narrowing operators though, since one huge goal for this codegen solution should be to avoid unnecessary object allocations like the plague. Having to fit a double-precision result into a float-precision receiver will create an extra intermediate object and require some boilerplate code if there isn't a method for it. Therefore, I think every method with a dest parameter should also get a variant whose dest (and therefore result type) has lower precision than the operands.

As I mentioned before though, there should also be one extra high-precision version of any method where both operands are float but the internal computation and result are double.
And by the way, you can then figure out which variant to use for codegen when composing AST DAGs -- the intermediate-value precision of the outer DAG becomes the parameter precision for the inner DAG.
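The return-type limitation mentioned above is easy to demonstrate: two methods may not differ only in return type, so a higher-precision result needs its own name. The class and the `mulDouble` suffix below are hypothetical illustrations of that workaround, not JOML API:

```java
// Illustration of why result-precision variants need a name suffix: Java
// cannot overload on return type alone, so mul(float) and a double-result
// version of the same operation must have different names.
public class SuffixNaming {
    public static class Vec3f {
        public final float x, y, z;
        public Vec3f(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

        // Default: same-precision operands, computation, and result.
        public Vec3f mul(float s) { return new Vec3f(x * s, y * s, z * s); }

        // High-precision variant: float operands, double-precision internal
        // computation and result. A distinct name is mandatory here.
        public double[] mulDouble(float s) {
            return new double[] { (double) x * s, (double) y * s, (double) z * s };
        }
    }

    public static void main(String[] args) {
        Vec3f v = new Vec3f(1f, 2f, 3f);
        System.out.println(v.mul(2f).y);
        System.out.println(v.mulDouble(2f)[2]);
    }
}
```

A codegen solution gets to pick the suffix convention once and apply it uniformly across the whole API, which is exactly the kind of consistency that is hard to maintain by hand.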
-
@httpdigest @lukehutch Hey guys, finally had some time to go through this thread, great stuff! I would like to point out that JOML would benefit tremendously not just from Panama, but also from Valhalla. It would unlock a very different set of API design and implementation choices. We don't know when it's going to be ready (JOML 2.0? 3.0?), but you may want to think about how it would affect the solution you're working on right now. Also, love the work on optimizing sin/cos away (a la "finish your derivations, please"). I wonder if this could be more actively encouraged at the API level even.
-
JOML has a large number of methods, mostly brought about by the cross-product of many different concerns, e.g. precision (float vs. double, as well as mixed precision, e.g. Vector3d.add(Vector3f) and vice versa), normalization (Unit methods vs. regular), in-place operations vs. operations with a dest parameter, postfix vs. prefix application of binary operators (mul vs. premul), etc.

A large amount of this is simply boilerplate, and could be automated with some sort of codegen solution. As a benefit, not only would the work needed to maintain JOML go down (once the codegen is done, at least), but the error rate would go down (fixing a bug in one place would fix it in all variants), and API consistency would increase.
I don't know of an off-the-shelf solution for this, so it would probably need to be custom code. Maybe this could be the basis for the next-gen JOML.