Meshlet simplification might be a bottleneck in bevy's Nanite solution #14660
Comments
Hey, thanks for opening this! I haven't seen your project before - awesome to see more work in this space, especially with so many toggles for trying different things out. This is valuable data to have. I've only tested on the Stanford Bunny so far, nothing complex yet. The bunny is an easy simplification case: simple materials and pretty uniform geometry. Thanks for the links to more models, I'll have to test my code on them. I'll also definitely be checking out your code; I'm interested in seeing how you did occlusion culling and software raster. Occlusion culling has been bugged in my code (#14062), which I need to fix :P

To answer the things you've brought up here and in your project's readme:

I plan to stick with hierarchical meshlet LODs. The initial plan at the start of the project was to use discrete LODs broken up into meshlets, but I eventually switched, because you really want hierarchical LODs if the goal is "unlimited geometry". If part of a huge mountain is far away and part is close, you very much want to be able to render those at different levels of detail, and to stream out the unneeded sections.

From what I've seen, meshoptimizer does not actually perform that well, especially since it doesn't account for vertex attributes at the moment. I saw another Nanite-like project that switched from meshoptimizer to a custom implementation and got much better results. Iirc @zeux is also researching this area. In general, I haven't really worried about the offline preprocessing step of the project all that much - just an initial implementation that's "good enough" to test the runtime section, which is my main focus.

I'm quite skeptical that switching from e.g. 64 -> 128 triangles per meshlet would improve raster performance a lot, but I haven't experimented yet, as I already struggle to get >32 triangles/meshlet as-is. Definitely an area that Bevy needs to work on a lot more. As the Nanite slides say, this is the hardest part of Nanite to get right :)

In case you haven't seen it btw, the current PR for Bevy is software raster #14623 (I haven't seen large perf improvements over my HW method, potentially because of the low triangle fill rate per meshlet). After that, I have a roadmap up here: #11518, of which a large part is parity with Nanite.
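For readers following along: the 64 vs. 128 triangle choice discussed here is just a parameter of the meshlet builder. Below is a minimal sketch against meshoptimizer's C API (Bevy drives this through Rust bindings, iirc); the buffer sizing follows the meshoptimizer documentation, while the parameter values are illustrative assumptions rather than Bevy's actual settings.

```cpp
#include <meshoptimizer.h>
#include <vector>

// Build meshlets for an indexed mesh; max_triangles is the knob discussed above
// (e.g. 64 vs. 128). Values here are assumptions for illustration only.
std::vector<meshopt_Meshlet> build_meshlets(const std::vector<unsigned int>& indices,
                                            const std::vector<float>& positions, // xyz interleaved
                                            size_t max_vertices = 64,
                                            size_t max_triangles = 128)
{
    size_t max_meshlets = meshopt_buildMeshletsBound(indices.size(), max_vertices, max_triangles);
    std::vector<meshopt_Meshlet> meshlets(max_meshlets);
    std::vector<unsigned int> meshlet_vertices(max_meshlets * max_vertices);
    std::vector<unsigned char> meshlet_triangles(max_meshlets * max_triangles * 3);

    size_t count = meshopt_buildMeshlets(
        meshlets.data(), meshlet_vertices.data(), meshlet_triangles.data(),
        indices.data(), indices.size(),
        positions.data(), positions.size() / 3, sizeof(float) * 3,
        max_vertices, max_triangles,
        /*cone_weight=*/0.0f);

    meshlets.resize(count);
    return meshlets;
}
```

Raising max_triangles only raises the ceiling; as noted above, the observed fill rate per meshlet is a separate question.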
FWIW this would be more precisely formulated as "Bevy currently uses the version of the algorithm that doesn't take the value of vertex attributes into account". I'm also curious about "much better results" - unsure if you have references. While incremental work is planned to improve various aspects of simplification and clustering, I'm not aware of an existing alternative that is obviously superior. There are some caveats around being able to generate new vertices during simplification, which is a difficult tradeoff.
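For context, the attribute-aware variant being referred to is exposed as meshopt_simplifyWithAttributes. A minimal sketch follows; the attribute layout, weights, target error, and options are illustrative assumptions, and the signature assumes a recent meshoptimizer release.

```cpp
#include <meshoptimizer.h>
#include <vector>

// Sketch: simplify while also penalizing error in vertex attributes (normals here).
// Weights and target error are placeholder values, not a recommendation.
std::vector<unsigned int> simplify_with_normals(const std::vector<unsigned int>& indices,
                                                const std::vector<float>& positions, // 3 floats per vertex
                                                const std::vector<float>& normals,   // 3 floats per vertex
                                                size_t target_index_count)
{
    std::vector<unsigned int> result(indices.size());
    const float normal_weights[3] = {0.5f, 0.5f, 0.5f}; // weight of attribute error vs. position error
    float result_error = 0.0f;

    size_t new_count = meshopt_simplifyWithAttributes(
        result.data(), indices.data(), indices.size(),
        positions.data(), positions.size() / 3, sizeof(float) * 3,
        normals.data(), sizeof(float) * 3, normal_weights, 3,
        /*vertex_lock=*/nullptr,
        target_index_count, /*target_error=*/0.01f,
        /*options=*/meshopt_SimplifyLockBorder, &result_error);

    result.resize(new_count);
    return result;
}
```

Passing nullptr for vertex_lock keeps the default locking behaviour; the weights control how strongly normal discontinuities resist collapse relative to positional error.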
Wrt the mecha doll model, I'm not sure what exactly the limiting factor there is - it simplifies down to ~25k triangles in gltfpack (not using clustered hierarchical simplification, to be clear!). In general it feels like hierarchical simplification should, in principle, always be able to reach the same number of triangles as full-mesh simplification - while there are restrictions around meshlet connectivity, and the current clustering can definitely be suboptimal at times, you can always increase the number of meshlets you merge per step if you see the simplifier getting stuck.

For the full mesh, the debug visualization shows a particular region of the model that is entirely locked from the simplifier's perspective; this is probably unwelded vertices for whatever reason, which restricts the simplifier's freedom and takes budget away from the rest of the process. In general, simplification needs to be combined with some amount of mesh pre-processing that can get more aggressive at smaller LOD levels, and maybe that's what would be necessary here - I don't have time to analyze this model in detail atm.

edit: Ah, and there are a couple more regions like these; they all use faceted shading, so normals create attribute discontinuities that restrict simplification. Sufficiently aggressive welding fixes this; I also plan to look into providing an option to do this as a byproduct of simplification. This is a notable area where "custom" simplifiers can get things wrong if they simplify across discontinuities in an uncontrollable fashion - that improves the behavior for models like this one, but can break other cases by creating issues with UVs or skinning.
Response to JMS55's comment:
Oh, no! There are some bugs in my code too, and I don't know WGSL all that well. I was only interested in getting something "passable" - I don't even use two-pass occlusion culling! Without even the most barebones occlusion culling, the whole app... struggles.
I had the same approach. My last real commit is named "Holistic system overview", where I checked how the whole project works as a whole. That's when I saw that the meshlet hierarchy simplification does not work that well.
Maybe my hardware raster path is just crap :P. But you definitely want a scene that plays to the software rasterizer's strengths: tons of tiny triangles per pixel (!), tons of overdraw. Have a dense mesh up close, or a 3k-triangle model rendered into a 10x10 px square. The first link in my readme is 640m tris, the second is 1.7b; IIRC the opening of UE5's reveal had 1b. Ofc. in my app I can always cheat and use impostor billboards. But I also have a slider to check how the scene performs both hardware- and software-rasterized, and the change is noticeable.
Well, at least now you have more info about possible roadblocks! I consider this value added!
@zeux sorry, I wrote my response in a rush - I didn't mean to diss your work on meshoptimizer. I'm aware of the experimental vertex attribute-aware simplifier function, but haven't tested it yet. No hard data about the project that did better than meshoptimizer, but you can find the post/author in the DirectX discord: https://discord.com/channels/590611987420020747/1259493087395184680/1259493252151775233

I am really curious why I'm getting poor meshlet fill rate (at least on the Stanford bunny/dragon), though, independent of simplification; I originally decided to go down to 64 from 128 triangles because I could barely get 32. Something to investigate in the future - I'm still focused on the runtime atm and don't have a ton of free time, unfortunately.
Yeah, currently I allow iirc 1-20% deformation per meshlet as the LOD level increases, but it's very much just a guess - I haven't spent time tweaking it. Could be we need to be much more aggressive. I've heard from other authors that vertex welding helps a lot for faceted meshes (e.g. the glTF Flight Helmet), but I haven't tested that myself.
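For what it's worth, a hypothetical version of that 1-20% ramp might look like the sketch below. The linear interpolation and the exact endpoints are assumptions for illustration, not Bevy's actual heuristic.

```cpp
// Hypothetical error budget per LOD level: ramp linearly from 1% at level 1
// up to 20% at the deepest level. Numbers taken from the comment above;
// the real curve (and whether it should be linear at all) may differ.
float target_error_for_level(int level, int max_level)
{
    if (max_level <= 1)
        return 0.01f;
    float t = float(level - 1) / float(max_level - 1); // 0.0 at level 1, 1.0 at max_level
    return 0.01f + t * (0.20f - 0.01f);                // 1% .. 20%
}
```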
Yep, for sure - thanks for opening this! My test scene is currently a large grid of bunnies (3092); more details in https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/#frame-breakdown. I definitely need a more realistic scene to test on though, probably using the free Quixel Megascans.
Response to zeux's comment:
Unfortunately, it's the meshlet hierarchy that is responsible for the problem. I did zero investigation into where the actual problem is - I was hoping that someone else had run into a similar issue. I assume the proper approach would be to import the meshlets into Blender and see what is actually happening, but nanite-webgpu was just a throwaway side project for me. The issue is that the Nanite meshlet hierarchy is a specialised use case, and I admit I don't know much about simplification algorithms.

PS. The model itself is just a random mesh that I grabbed from Sketchfab. I assume any other high-poly model would have the same problem.

PS2. If you want, you can close this GitHub issue - there aren't many actionable things to do right now. (I've realized I should also respond to zeux's comment, so I've removed this postscriptum from my previous response.)
Yeah, I understand that - and if you have a restricted merge (e.g. only merging up to 4 meshlets), then it's possible that issues in clusterization compound through a long simplification process. This is on my list to investigate wrt meshopt's clustering; I'm just making the observation that it doesn't necessarily need to impact simplification if you remove restrictions on merges. An ideal outer flow of clustering + simplification should be able to reduce correctly down to a minimum number of meshlets regardless of defects in clustering. For example, if the recursive split/merge stops when a cluster can't be simplified substantially, it may be worthwhile to merge at a higher level (aka a larger group) and retry, as that reduces the number of locked edges.

There are still going to be cases where mesh topology prevents simplification, of course - I believe UE's Nanite simplification suffers from similar issues, as I don't think they allow completely unrestricted merges - so a completely arbitrary model from Sketchfab may still easily run into issues in collapse-based simplification workflows.
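The "merge a larger group and retry" flow could be sketched roughly as follows. `simplify_group` is a caller-supplied stand-in for the real group-simplification step, and the "stuck" threshold is an arbitrary assumption, so this only illustrates the control flow rather than any particular implementation.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Sketch of the outer flow described above: if a group got "stuck" (barely
// simplified), retry with a bigger group so fewer boundary edges are locked.
size_t simplify_with_retry(std::vector<int> group,
                           size_t input_triangles,
                           size_t target_triangles,
                           const std::vector<int>& extra_neighbors,
                           const std::function<size_t(const std::vector<int>&, size_t)>& simplify_group)
{
    size_t achieved = simplify_group(group, target_triangles);

    // "Stuck": we removed less than ~25% of the triangles while aiming for ~50%.
    // The threshold is an assumption for illustration.
    bool stuck = achieved > input_triangles * 3 / 4;
    if (stuck && !extra_neighbors.empty())
    {
        // Merge a larger group (fewer locked boundary edges) and try again.
        group.insert(group.end(), extra_neighbors.begin(), extra_neighbors.end());
        achieved = simplify_group(group, target_triangles);
    }
    return achieved;
}
```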
Thanks. Hard to evaluate without visual differences, and this topic is deep and complicated; just noting that attribute simplification is working fairly well now, and some more improvements to the simplifier in this and other areas will be coming later this year.

And yeah, "optimizing position placement" is the tradeoff I was referring to: UE5's Nanite does do this, and I have plans to implement it as an option, but crucially it means that every level of the meshlet hierarchy generates a lot of new vertices when you simplify. In aggregate, this is a significant increase in vertex memory. Nanite solves it with LOD streaming and highly efficient compression, which makes the total system work acceptably, but it's not a guaranteed win overall, as you can rebalance LOD levels in certain cases to accommodate that. There are also a ton of complications wrt attributes with this. Overall I think it will eventually be a useful option to add, and some visual issues in shading are difficult to fix without introducing new vertex data, but I don't think every use of simplification would benefit from it.
@zeux I've spent some more time on this project and come to the conclusion that the simplify-cluster step is a bottleneck and is producing low-quality meshlet DAGs. Besides trying out the new attribute-based simplification and tuning my cluster grouping heuristics for METIS, I'd like to know more about improving meshoptimizer for "optimizing position placement". Would it be OK to open an issue on the meshoptimizer GitHub so we can discuss this further? Maybe you can give me some pointers on how difficult this would be for me to implement, if you're open to adding it to the library.
Isn't the DAG quality a property of the clusterizer? I don't think attribute simplification will help if you are running into the graphs being suboptimal, and the same goes for position placement. But I'm not sure where the problem lies exactly; I'd recommend creating a discussion thread. |
I've already moved on to another project, but I still had some thoughts about this problem (readme updated). I think there might be another approach to simplification. Atm meshoptimizer lets you specify approximately how many triangles you want in the result. Let's introduce a guarantee/constraint: "You will always get the EXACT number of triangles you ask for". If you have "fully filled" meshlets (128 tris each), merge them, then simplify and split them into meshlets of exactly 128 tris each - you will ALWAYS get exactly 128 tris per meshlet. Due to the recursive/transitive process of tree building, this guarantees "fully filled" meshlets on all DAG levels. You also get an efficient log2 number of levels. The only remaining problem is to make the original/bottom-level meshlets "fully filled" as well.
I'm not sure how one would implement this. I'd guess some kind of greedy algorithm (I think this was also mentioned by Brian Karis, though I'm not sure it was in this context). But it inevitably also means adding new vertices (I think?). As I've said, I don't know much about simplification algorithms. This analysis also matches a postscript from Brian Karis' presentation.
Thank you, that's an interesting point. Something like the greedy algorithm from https://jcgt.org/published/0012/02/01/paper-lowres.pdf might perform better if we want to ensure we always get 128 triangles.
The greedy algorithm doesn't guarantee fully filled clusters either. Similarly, because simplification is constrained by the cluster boundary, you have no guarantee of reaching that number. I suspect there are always going to be cases where you simplify down to e.g. 80 triangles while targeting 64 due to a boundary restriction (which you have to have to avoid cracks) and end up with a group that has to be split into under-filled meshlets. I would be very surprised if Nanite has any guarantees about cluster fill in the general case; I think their split routine doesn't necessarily try to reach the maximum triangle count either, but I'm not 100% sure about that.
Yes, I agree - locking meshlet borders means it's not possible to have the meshlet's triangle count always be a multiple of 128.

I assume "the simplify-cluster step is a bottleneck and is producing low-quality meshlet DAGs" is a hypothesis? It would be quite hard to measure the impact, esp. without a better DAG ("Batched Multi Triangulation", section 3.2) to compare against. But e.g. it does not count things like meshlet fill rate, thin diagonal triangles, etc. I took a quick look at Bevy's simplification code and a few things stand out. If you want to experiment, here are a few ideas.

What is a border?

You give some triangles to meshoptimizer to simplify with […]

A fair simplification

For a quick check you can try the following fair simplification algorithm: […]

Edge collapsing a random vertex will introduce artefacts. This is something you can improve (as well as using heuristics, handling concave surfaces, etc.). See below for a discussion of other simplification constraints. But it might be good enough to at least test your hypothesis that "the simplify-cluster step is a bottleneck and produces low-quality meshlet DAGs". Can you combine this with meshoptimizer? E.g. optimize with meshoptimizer first, and (if the triangle count is still too high) edge collapse at random? The above algorithm might not improve much - it only changes the definition of a border vertex. However, it's implemented inside your Rust code, which allows for fast iteration on other ideas. TL;DR: every vertex that is not a border vertex (using the shared-meshlet-vertices definition) can potentially be removed.

Error metric DOF

The error metric is another degree of freedom. In your current implementation it is restricted by meshoptimizer's […]. Basically, you can try dropping all constraints from simplification. Instead, "bake" them into the value used for selecting meshlets at runtime (…).

Simplify to a single meshlet

You might be surprised, but Bevy's implementation already simplifies any mesh into a single meshlet. At least with a little bit of wordplay. You might think the problem is that your DAG has "many roots". If you cannot simplify more, you stop processing the meshlets. You add nothing to […]

Jokes aside, I described in my readme how this works in practice. Sometimes even a mesh with terrible simplification can end up as a single meshlet, at the price of many "useless" intermediate LOD levels.

Other stuff

[…]
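Going back to the "What is a border?" point and the TL;DR above: one way to experiment with the shared-meshlet-vertices definition of a border, without writing a simplifier from scratch, is meshoptimizer's per-vertex lock array. A hedged sketch follows; it assumes a recent meshoptimizer release where meshopt_simplifyWithAttributes accepts a vertex_lock array and zero attributes (otherwise, pass an attribute set as in the earlier sketch), the target error is a placeholder, and how the caller computes the locked flags is left open.

```cpp
#include <meshoptimizer.h>
#include <vector>

// Sketch: instead of meshopt_SimplifyLockBorder (which locks every topological
// border of the submitted geometry), lock only the vertices shared with
// meshlets outside the current group. `locked` holds one flag per vertex
// (1 = keep, 0 = free) and must be precomputed by the caller.
std::vector<unsigned int> simplify_group_with_custom_border(
    const std::vector<unsigned int>& group_indices,
    const std::vector<float>& positions,      // 3 floats per vertex
    const std::vector<unsigned char>& locked, // 1 per vertex: shared with another group?
    size_t target_index_count)
{
    std::vector<unsigned int> result(group_indices.size());
    float result_error = 0.0f;

    size_t new_count = meshopt_simplifyWithAttributes(
        result.data(), group_indices.data(), group_indices.size(),
        positions.data(), positions.size() / 3, sizeof(float) * 3,
        /*vertex_attributes=*/nullptr, /*vertex_attributes_stride=*/0,
        /*attribute_weights=*/nullptr, /*attribute_count=*/0,
        locked.data(),
        target_index_count, /*target_error=*/0.25f,
        /*options=*/0, &result_error);

    result.resize(new_count);
    return result;
}
```

The point is only that options no longer include meshopt_SimplifyLockBorder, so every vertex not explicitly marked as shared with a neighbouring group is free to be collapsed.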
Hi, I've recently finished my implementation of Nanite - WebGPU. It includes the meshlet hierarchy, a software rasterizer, impostors, etc. It's a webpage, with sliders and checkboxes for every setting imaginable. Then I noticed that the Bevy team also started working on this feature. We use the same algorithm to simplify meshlets (based on meshoptimizer). Unfortunately, it will not be enough. Examples (triangle count summed over the root/coarse-level meshlets):

- Jinx: ~3k triangles at the most coarse meshlet level
- robot: ~170k triangles at the most coarse meshlet level
3k triangles for Jinx's most coarse meshlet level is not amazing, but workable. OTOH, 170k triangles for the robot mesh is unusable with Nanite-like technology. With enough man-hours you could fix the algorithm (I've thrown out some ideas in my readme.md), but you don't have to.
Iterating on Nanite
The problem with the simplification is that it happens at the 4-merged-meshlets level. Dropping the meshlet hierarchy allows you to implement a large part of the culling solutions first. And a software rasterizer is a big win both for dense meshes and for objects that are far away (but have a bit more triangles than expected). Meshlets could then be based on old-school discrete LOD levels. The flow would be similar to the following presentations: […]
Another plus: without Nanite you don't have to research the error function that selects the correct meshlet level. From my own tests, e.g. normals are a pain.
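For readers unfamiliar with that error function: a common approach in Nanite-like renderers (not necessarily what Bevy or nanite-webgpu actually do) is to store each cluster's simplification error as a world-space bound, project it to screen space, and draw a cluster when its own projected error is below a pixel threshold while its parent's is not. A rough sketch, where the perspective-projection math is the only assumption:

```cpp
#include <cmath>

// Project a world-space error (treated as the radius of an error bound) into
// an approximate size in pixels, assuming a symmetric perspective projection.
// Generic sketch of the idea, not any particular engine's code.
float projected_error_px(float world_error, float distance_to_camera,
                         float fov_y_radians, float viewport_height_px)
{
    if (distance_to_camera <= 0.0f)
        return 1e9f; // camera inside the error bound: treat as "infinite" error
    float screen_fraction =
        world_error / (2.0f * distance_to_camera * std::tan(fov_y_radians * 0.5f));
    return screen_fraction * viewport_height_px;
}

// Render this cluster if its own error is small enough on screen
// while its parent's (coarser) error is not.
bool select_cluster(float self_error_px, float parent_error_px, float threshold_px = 1.0f)
{
    return self_error_px <= threshold_px && parent_error_px > threshold_px;
}
```

As noted above, position-only error bounds are the easy part; folding attribute error (normals, UVs) into that single value is where it gets painful.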
Ofc. nothing stops you from going all-in on Nanite from the start. However, the optimal implementation will require a lot of experimentation and heuristics.
Pinging @JMS55 as she is the main architect for this feature from Bevy's side.
Disclaimer
I don't know much about Bevy or the current plans for it, so I suppose you will want to do your own research. Still, knowing about the potential problems upfront should be beneficial. I don't assume you will drop your plans for Nanite, but having more data points might affect prioritization.