More triangles/vertices per meshlet #15023
Conversation
nit: 255v/128t (the code is correct, the PR description is not)

Thanks for catching that, updated the PR description.
```wgsl
let world_from_local = affine3_to_square(instance_uniform.world_from_local);

// Load and project 1 vertex per thread, and then again if there are more than 128 vertices in the meshlet
for (var i = 0u; i <= 128u; i += 128u) {
```
nit: up to you, but it feels like `i < 255u` would be more clear; in general, this loop is `for (i = 0; i < max_vertices; i += workgroup_size)`.
Ah, not quite the way I thought of it. My thought process was "we want to do this twice, with the second time using an offset of 128".
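To make the two-pass structure explicit, here is a Rust restatement of the indexing (for illustration only; the real code is WGSL, and `vertices_loaded_by_thread` and its parameters are hypothetical names, not identifiers from the shader):

```rust
// With a 128-thread workgroup, each thread handles vertex `thread_id` on the
// first pass and vertex `thread_id + 128` on the second, covering the new
// 255-vertex maximum. The `offset <= 128` condition mirrors the `i <= 128u`
// loop in the shader, so the body runs exactly twice.
fn vertices_loaded_by_thread(thread_id: u32, meshlet_vertex_count: u32) -> Vec<u32> {
    let mut loaded = Vec::new();
    let mut offset = 0u32;
    while offset <= 128 {
        let vertex = offset + thread_id;
        if vertex < meshlet_vertex_count {
            loaded.push(vertex);
        }
        offset += 128;
    }
    loaded
}
```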
```rust
// Simplify the group to ~50% triangle count
// TODO: Simplify using vertex attributes
let mut error = 0.0;
let simplified_group_indices = simplify(
    &group_indices,
    vertices,
    group_indices.len() / 2,
    target_error,
    f32::MAX,
```
After this change, is there a scenario where both of the following happen:
- simplification fails to reach the target triangle count,
- there are still leftover vertices that are not shared with other meshlets?
For Nanite simplification, all vertices mentioned in point 2 could probably be removed. While this might introduce artefacts, the error metric should show more detailed meshlets in such cases.
I suspect that to produce an optimal DAG ("Batched Multi Triangulation" section 3.2) you should always remove as many triangles as possible to reach 256 triangles (2 meshlets). The only constraint is Nanite's immutable shared vertices between meshlets. Any problems caused by simplification are shoved into the error metric.
See #14998 (comment) for context.
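A minimal sketch (in Rust, with hypothetical names; not code from the PR) of the alternative target suggested here, i.e. always asking the simplifier to get down to two full meshlets' worth of triangles and letting the error metric absorb any quality loss:

```rust
const MAX_TRIANGLES_PER_MESHLET: usize = 128;

// Instead of `group_indices.len() / 2` (a fixed 50% reduction), aim for a
// fixed two-meshlet budget (256 triangles) per group.
fn target_index_count_for_group(group_indices: &[u32]) -> usize {
    // Two meshlets' worth of triangles, expressed as an index count (3 indices per triangle).
    let two_meshlets_of_indices = 2 * MAX_TRIANGLES_PER_MESHLET * 3;
    // Never ask the simplifier for more indices than the group already has.
    two_meshlets_of_indices.min(group_indices.len())
}
```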
I've noticed you have the following line:

Is the 65% a heuristic? You could also try 75%. If you had 4 meshlets, a 25% reduction ends up with 3 meshlets. In the next iteration METIS might "redistribute" these meshlets. In my experience this is a mixed bag. This change should reduce the number of roots at the cost of less uniform meshlet sizes and a bigger overall DAG. It's also interesting to see what happens if you remove this code path (or just comment it out).
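For reference, the acceptance check being debated boils down to something like the sketch below (hypothetical names; `threshold` is the 0.65 used today, with 0.75 or higher as the alternatives discussed):

```rust
// Treat the simplification as a success only if the simplified index count
// dropped below `threshold` times the original count; otherwise the group
// becomes a root in the DAG.
fn simplification_is_acceptable(
    original_index_count: usize,
    simplified_index_count: usize,
    threshold: f32,
) -> bool {
    (simplified_index_count as f32) < (original_index_count as f32) * threshold
}
```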
```rust
// Simplify the group to ~50% triangle count
// TODO: Simplify using vertex attributes
let mut error = 0.0;
let simplified_group_indices = simplify(
    &group_indices,
    vertices,
    group_indices.len() / 2,
    target_error,
    f32::MAX,
    SimplifyOptions::LockBorder | SimplifyOptions::Sparse | SimplifyOptions::ErrorAbsolute,
```
Context: #14998 (comment), section "What is a border?".

I think that meshoptimizer's definition of a border vertex can be relaxed for Nanite. It includes vertices that are not shared with other meshlets (see the image in the post linked above). You could provide locked vertices (the `const unsigned char* vertex_lock` parameter) to e.g. meshopt_simplifyWithAttributes(). This might lead to visual artefacts, but wouldn't that be an issue for the error metric instead?
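A sketch of the relaxed locking idea in Rust, under the assumption that the resulting array is handed to a simplify variant that accepts per-vertex locks (the C API's meshopt_simplifyWithAttributes takes `const unsigned char* vertex_lock`); `groups` is a hypothetical list of index buffers, one per meshlet group, not a name from the PR:

```rust
// Lock only vertices referenced by more than one group; vertices on an open
// border that no other group shares stay free to collapse.
fn build_vertex_locks(vertex_count: usize, groups: &[Vec<u32>]) -> Vec<u8> {
    let mut groups_referencing = vec![0u32; vertex_count];
    for group_indices in groups {
        let mut seen_in_this_group = vec![false; vertex_count];
        for &index in group_indices {
            let v = index as usize;
            if !seen_in_this_group[v] {
                seen_in_this_group[v] = true;
                groups_referencing[v] += 1;
            }
        }
    }
    // 1 = locked (shared between groups), 0 = free to move/collapse.
    groups_referencing.iter().map(|&n| u8::from(n > 1)).collect()
}
```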
I'm not sure what "open edge" means in this context (Google doesn't show anything). But check the comment I left in that thread - I'm not sure that we're on the same page about which vertices need locking.
I ran the mesh converter on the "Huge Icelandic Lava Cliff" mesh from Quixel Megascans, with various percentages. Here's the data: https://pastebin.com/raw/pvQPfbZG. It's structured as key:value pairs where the key is a triangle count in [0, 128], and the value is the number of meshlets in the current LOD level that have that many triangles. I'm not sure what you would say is "best". 50% has less variance, but also fewer LOD levels compared to higher percentages. I think ideally we want log2(lod_0_meshlet_count) levels, so ~14 total here? So maybe a higher percentage is better? Perhaps 99%, or basically accept as long as we simplified something (maybe one meshlet's worth?).
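For context, the key:value data linked above could be produced with something like the sketch below (assuming a simple slice of per-meshlet triangle counts for one LOD level; names are illustrative):

```rust
use std::collections::BTreeMap;

// Bucket every meshlet in a LOD level by its triangle count:
// key = triangle count in [0, 128], value = number of meshlets with that count.
fn triangle_count_histogram(meshlet_triangle_counts: &[u32]) -> BTreeMap<u32, usize> {
    let mut histogram = BTreeMap::new();
    for &triangle_count in meshlet_triangle_counts {
        *histogram.entry(triangle_count).or_insert(0) += 1;
    }
    histogram
}
```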
Btw, here's the git patch (on top of the previous commit) that I use for testing: https://pastebin.com/raw/mFKrCtYr. Run it with
I think I want to ship these improvements for now. I'll continue to experiment on my own, and open another PR if I end up finding further improvements.
I'm happy with the level of review from the other reviewers, and there's nothing concerning in here.
### Builder changes
- Increased meshlet max vertices/triangles from 64v/64t to 255v/128t (meshoptimizer won't allow 256v, sadly). This gives us a much greater percentage of meshlets with the max triangle count (128). Still not perfect; we still end up with some tiny <=10 triangle meshlets that never really get simplified, but it's progress.
- Removed the error target limit. Now we allow meshoptimizer to simplify as much as possible. There's no reason to cap this, as the cluster culling code will choose a good LOD level anyway. Again, this leads to higher quality LOD trees.
- After some discussion and consulting the Nanite slides again, changed meshlet group error from _adding_ the max child's error to the group error, to doing `group_error = max(group_error, max_child_error)`. Error is already cumulative between LODs, as the edges we're collapsing during simplification get longer each time.
- Bumped the 65% simplification threshold to allow up to 95% of the original geometry (i.e. accept simplification as valid even if we only simplified 5% of the triangles). This gives us closer to log2(initial_meshlet_count) LOD levels, and fewer meshlet roots in the DAG. Still more work to be done here in the future: maybe trying METIS for meshlet building instead of meshoptimizer. Using ~8 clusters per group instead of ~4 might also make a big difference. The Nanite slides say that they have 8-32 meshlets per group, suggesting some kind of heuristic. Unfortunately meshopt's compute_cluster_bounds won't work with large groups atm (zeux/meshoptimizer#750 (comment)), so it's hard to test.

Based on discussion from bevyengine#14998, zeux/meshoptimizer#750, and Discord.

### Runtime changes
- cluster:triangle packed IDs are now stored as 25:7 bits instead of 26:6 bits, as max triangles per cluster is now 128 instead of 64 (see the sketch after this list).
- Hardware raster now spawns 128 * 3 vertices instead of 64 * 3 vertices to account for the new max triangle limit.
- Hardware raster now outputs NaN triangles (0 / 0) instead of zero-positioned triangles for extra vertex invocations over the cluster triangle count. Shouldn't really make a difference, I think, but I did it anyway.
- Software raster now uses 128 threads per workgroup instead of 64. Each thread now loads, projects, and caches a vertex (vertices 0-127), and then if needed does so again (vertices 128-254). Each thread then rasterizes one of 128 triangles.
- Fixed a bug with `needs_dispatch_remap`. I had the condition backwards in my last PR; I probably committed it by accident after testing the non-default code path on my GPU.
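A minimal sketch of the 25:7 packing described above (previously 26:6), with illustrative names rather than the actual shader/Rust identifiers: the low 7 bits hold the triangle id (up to 128 triangles per cluster), and the remaining high bits hold the cluster id.

```rust
const TRIANGLE_ID_BITS: u32 = 7;
const TRIANGLE_ID_MASK: u32 = (1u32 << TRIANGLE_ID_BITS) - 1; // 0x7F

fn pack_cluster_triangle(cluster_id: u32, triangle_id: u32) -> u32 {
    // At most 128 triangles per cluster fit in 7 bits.
    debug_assert!(triangle_id <= TRIANGLE_ID_MASK);
    (cluster_id << TRIANGLE_ID_BITS) | triangle_id
}

fn unpack_cluster_triangle(packed: u32) -> (u32, u32) {
    (packed >> TRIANGLE_ID_BITS, packed & TRIANGLE_ID_MASK)
}
```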