-
-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ideas For a High-Performance System Design #29
Comments
Something like this could be awesome and may open up the possibility of first- and even second-order speedup (N10x or N100x) factors using multicore , SIMD, or GPGPU, and maybe even future reentrant server-side execution for multi-player or cloud use. It would be best to do this on the implementation side without changing existing Body, Shape, Fixture, Joint classes so much as to alienate existing Farseer clients. For a maintainable component, Callbacks such as CollidesWith and other are not dependent on the nuances of being called from within a tight loop and that this previous frame->next frame state model sounds like a good approach. The old adage is "Restrain from premature optimizing", but this effort would not be like that as its going for big gains by reducing algorithmic complexity. GPU acceleration using OpenCL.Net or similar as is done by the BulletPhysics engine which has very rigid and incredibly fast piling allowing for impressive building falling simulations. With .net we value rapid prototyping, safe code, and steady APIs, and component software. Open source has the tendency to branch and explode in versions unless it is contained. I think for existential reasons Velcro physics should be positioned away from PhysX, Bullet in 2d constrained mode, Jitter-2d, and other physics engines. BulletPhysics piling performance has been shown to be much higher with GPU acceleration and there is a server component, but the cost of integration of any different physics engine to an existing client is high and the inclusion of 3d increases the system complexity by a whole dimension. From what I have seen, Piling of bullet objects performance in Farseer is not good, issues seen with only 20 items, on a phone, less. Not using CCD and allowing sticking, tunneling to happen can make physics games buggy enough to not be ship-able. As a physics engine client most of the heavy work I think I have to do is trying to work without CCD when possible, to flip the IsBullet. As a heavy api user of Farseer, I would forward that Body and Joint and Fixture classes not be overly changed for this if practical. The state could be copied from the vectors. But since we did not upgrade to the very latest version of Farseer because the changes where not compelling enough vs the difficulty of merging, a different data structure or API would not be a major deterrent. My team decided to put DataContract persistence on the properties of Body that are not derivatives of other state so if body were a struct then that fits with this model. If this is pursued I can produce some links of prior work if that might help. |
Great ideas @ilexp So far I've tried to be really close to Box2D and only provide some much-needed tools and APIs to make it easier for beginners to use a physics engine. For a clean Box2D, there used to be Box2DX. I will be going in another direction with Velcro Physics and your proposal sounds interesting.
In an unrelated project, I've done some research into RyuJIT64 as well as the CLR runtime and optimizations. Let's just say that I'm deeply troubled by my findings, as it seems even trivial optimizations are not performed. One example is |
@damian-666 I also tried to build a multi-threaded version of the engine which failed miserably. The engine restricts the number of points in polygons to 8, which keeps the workload low, so any optimization to the collision resolution code did not benefit in the inner loops. MT to the outer loops were also unsuccessful due read/write constraints. @ilexp's proposal to make a step read-only will help a lot with that though. The CCD has always been very expensive as it substeps inside each step. This is really an algorithm problem rather than an optimization problem. As for API backwards compatibility, I don't think I will have that with Velcro. Farseer Physics Engine is stable and if it does the job, you should stay with it. I might have to move shapes, fixtures etc. out of the API for safety reasons, and there is just no way I can keep it compatible with FPE's API while doing so. |
This proposal with require a redesign of most of the engine. Louis over at his fork has done a great job at simplifying the engine to accommodate refactorings like this, so I will be copying some of his changes first. |
There's the tiered JIT issue in coreclr. To me, it seems like the biggest blocker for more JIT optimizations is the requirement to be fast at compiling vs. fast at executing. A tiered JIT has the potential to work around this in a way. But in any case, that's kind of future / hypothetical stuff. The other part is extending C# to allow writing more efficient code in the first place. Anyway - really glad you think some of the above ideas are worth pursuing. I'm currently a bit short on time, but I'll take an occasional look at this issue (and project) to see how things develop, or whether there's opportunity for me to chime in with an occasional comment. And from my side, as a long-term Farseer user, feel free to break as much as you want with Velcro as long as it's for the better. 👍 |
Seems like if significant redesign is going to be considered, I see that the batches that must be organised are not the same as existing Islands, which were originally designed to be done in parallel, according to its original designer Erin Catto. But that does not give incredible single pile performance like bullet shows . From what I gather the key is to batch contacts that do not share a body as described here on slide 6 There are other tricks to parallelizing in there. That way no sync is needed since the position or velocity state of a body is not being pushed by another contact on a parallel thread, so no race condition or lock is needed. here they discuss the parallelizing issues around box2d / Farseer Islands , some of which may be relevant even in the many-batch-in-one-pile scheme: Looks like the box2d-MT has the box2d islands but its not super-impressive on the tests. Erin chimed in somewhere else on this that the Islands would need to have separate piles for testing. Using AABB in world coordinates from the last frame I am guessing. The reference code in C++ I think is in Bullet by Coumans he has made incredible huge piles of boxes and very stiff joints and contacts for robot and AI researchers who might also be interested in 2d for certain problems. If parallelizing with Parallel.For or something works then maybe GPU could using OpenCL.Net or something could be looked at. For 2d render without 3d shaders loaded, and for 2d physics the most GPUs would have quite alot of spare memory and could allow all the data to go into the GPU memory which is super important for that. If all of the bodies data, the lists of vertices in Fixtures, the AABBs in world ( i guess from the last frame), the tree, and the positions and velocities fit into GPU memory then they can be batched in there. I know this is a tall order but parallel physics is almost standard as even $50 phones have 4 cores now and cheap AMD processors now use 6 or 12. One must be wary of little-gain optimizations if they are not thread-safe or away from the functional programming methodology of passing in the parameters that might not be the same for all callers in a frame, as they may turn up as issues in making it threadsafe. Not to sweat the little stuff if 4, 12, or even 60-factor gains are possible |
Very interesting read in relation to performance: |
its interesting but seems like bounds-checking type optimization is never
going to give the possible performance boosts of 2x, 3x, 8x when
algorithmic complexity is pretty high and there is an opportunity for
parallel or batch processing. given 4 core is on budget phones and xbox
one is 8 core, and now AMD is selling 8 core CPUs for 118$, isnt there any
sort of parallel processing that can be pursued without doing a massive
overhaul, in the CCD or the collision detection and response?
Although we havent had much improvement with Parallel.For in spots here and
there, My team had wild success with running ray casts ( uses broad and
narrow phase) in parallel with farseer , with just one or two fixes in our
branch . With .net it was easy to find the race condition as it was
crashing in a static variable which was a negligible optimization, and just
making that a member viable was all that was needed to make our
pseudocollier's ray casts just about 4 x faster on a quad. I know that
collision and ccd is more complex that just collision detection.
according to Erin Cattos comment in one newsgroup, the existing box2d
islands were originally meant to be run in parallel and i believe iforce2d
gave it a try in c++ where threading conflicts are more mysterious to
diagnose. In my experience a functional-like programming ( passing in
the parameters you need) and elimination of static vars that are both read
and write is usually enough to fix the threading conflicts.
I am not sure that c# is great for xbox one or UWP development at the
moment since there are graphics performance issues, but allot of us,
including unity developers, own allot of c# code and might want to put our
physics games on xbox and not have 4 or even 6 idle cores
…On Thu, Jun 29, 2017 at 3:39 PM, Ian Qvist ***@***.***> wrote:
Very interesting read in relation to performance:
https://blogs.msdn.microsoft.com/dotnet/2017/06/29/performan
ce-improvements-in-ryujit-in-net-core-and-net-framework/
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#29 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AP_LmQAIYnbOkFbD0it9-s2Am79_wJeXks5sJCefgaJpZM4NZqKu>
.
|
this is a big job, and is certainly not a pullrequest. Maybe we can establish a Git Bounty then fork this to try and assemble a pair of engineers who have done this before, to try and do this using a "try an work through the runtime issues" approach A ton of work in the solid library ImageSharp has parrallel.For here and they just run it in parallel then fix the exceptions found. That is how my team did parallel Raycasts in Farseer. C# is great for parallel for as it allows easy "try and see" approaches for batching. box2d islands means disparate piles, and in real game use this is not going to be as useful as the other batch of body states, those" sets of contacts that do not share a body"
This way no two threads are fighting over any body state since no two batches share a body
And / or In parallel with this, try to run CCD in parallel, that is of equal or more importance. Another thing is to research parallel techniques for the iterative constraint solver. Again, this was all achieved in Bullet and with GPU, with two orders magnitude faster piling in the case GPU is used, in the bullet lib. Once CPU parallelisation is done, a GPU method is about memory management and might be explorer with Alea Parallel.For if this is done then a LiquidFun SPH type extension to Velcro and water with boats can be explorer from there. on 6 or 8 cores thats could be interesting and playable water. |
I've had the multi-threading discussion more times than I care to count :) First of all, It is certainly possible to multi-thread the engine in specific scenarios, but years of experiments have taught us that a general solution hurts the common case. The things that have made a difference was cache-coherency and modern CPU features such as prefetch and pipelining. Keep small structures in CPU cache and you are suddenly running 2x speed. If I add just a bit of complexity such as keeping track of locks and have state objects for threads, then performance drops. The most common case for this engine is not large games with incredibly advanced physics - it is small to medium simulations with 10-500 objects, and if I add on multithreading to speed up scenarios where you have 5000 objects, then we slow down the 10-500 objects case by x2. Not a great trade off. In the 5000 objects case, you can certainly find a hotspot like ray-casts and multi-thread it in order to juggle more. We just have to remember that more threads do not equal faster simulation, it equals more simultaneous objects. In my tests, I had a ~20% increase in throughput by multi-threading with 4 cores; a terrible waste of resources gone to locking, state copying etc. However, if you create a world and put that into a thread by itself, you suddenly have a 100% increase in throughput, and that scales linearly! So in larger games that actually need threading, it is much easier to copy state manually between the worlds on the macro level, than trying to squeeze anything out on the micro level. With regards to islands - I did try to multithread (MT) them but keeping track of state across islands again degraded performance for the common case, while increasing performance for large simulations. I ended up with a version with A LOT of compiler conditions in order to get around it, and I was about to make a "MT edition" and a single thread edition, but I figured most people would choose the MT and then disregard the engine due to its bad performance in PoCs. That being said, I'm very interested in what people can come up with. I'd love a multi-threaded version of the engine :) |
Thats an interesting idea running entirely separate world in parallels, it can help for some ideas such as the current concept of "Islands", which are not the same as the batches or islands used in a successful MT piling implementation. Two piles could be solved in parallel by putting the bodies all in separate farseer worlds as you mentioned and running the whole worlds in parallel. But for a single pile of boxes, it would all be in one island and so not taking advantage of the 4, 6 or 8 cores that are now in almost all devices. I wasn't proposing introducing any fancy synchronization, just something like presorting the workloads and batching them all without need for locks or sync inside the main loop(s), as mentioned by the thread originator. I haven't have much success with complex locks either, and I think @ilexp was proposing batching and running it all in parallel without sync except maybe a wait for all batch working threads to join, outside the outer loop(s). From what i gather, the batches needed for MT within a pile or other set of interacting bodies are sets of constraints that do not share a body as described here. https://www.slideshare.net/takahiroharada/solver-34909157 As i wrote earlier, I got a 3.9x faster performance for 4 cores when i simply ran cay casts in parallel over sets of bodies big or small, in a real game scenario, to implement a wind or explosion effect with shadowing. This was just divide bodies in one batch per core, run workers and then wait. ImageSharp uses Parallel.For throughout, I used a simple ThreadPool , run batches, and WaitAll for them to all finish, with zero locks or complex sync. For Raycasts, its an embarrassingly parallel case since the collision detection geometry fixtures are organized in a tree which does not change inside the loop, but I know the contact and constraint solver is another matter. For my simulation, the performance of Farseer is OK for now, except when I run CCD and things are in collision range, or when I use allot of constraint iterations ( setting.VelocityIterations and or positioniterations higher) or a small timestep. But for these cases, the batching before the loop would have to happen also. Better would be some more specific propositions based on the existing Farseer code and internal datatypes , how to sort them by "sets which do not share a body or fixture" and which loops to address specifically. I did look at the Box2d MT discussion, but that is about solving disparate "Island" in parallel but that approach is not how Bullet does begin to successfully stack 100,000 boxes. |
Ray-casts has always been the low-hanging fruit since they are pretty much stateless. I'm worried that if it is multi-threaded, it will result in thread starvation or too much overhead due to crappy schedulers on mobile devices etc. It could definitely be configurable with low overhead, but as you said, the solvers are a completely different matter and much more interesting. |
i looked deeper into the velcro code and have some promising things to try. But if this conjures up old failed or controversial attempts please direct me to those pre-Github branches if they are not lost. I have added info below on the the engine fix we put for parallel ray casts in for farseer client in section 1) , then a partial proposition for implementing parallel collision and constraint solving work in 2) . The parallel raycast is a very clear trouble-free 4x speedup we tested on quad MacOSX and PC that required at least two minor engine changes in the Farseer branch. For the client code that calls the raycast, the outer sync was just break the body list in to N sections, where using processorCount for N makes sense, and use ThreadPool.QueueUserWorkItem, then WaitHandle.WaitAll(waitHandles) after all the raycasts for all batches were done, then do the remainder. We didn't try Parallel.For but that is the new and simple to do it , and the way ImageSharp iterates nearly all of its batches of work. For my ray code the speedup is obvious but for the constraint loops in Velco, benchmarking timers around each Parellel.For loop would be needed. Then the power of managed .net is exploited that any runtime issues are easy to find than in c++. Now that i look at some of the heavy loops, with that i'm an optimist to the absurd, I think we can do speed up Velcro incrementally and practially without any noticable tradeoffs or ifdefs, use of the structs, copying, or even drastic changes to API or objects. To realize the full benefit of multicore , by using Parallel.For on lists of Bodies, we would next need to look at how Bullet sorts contacts and or constraints by -those batches that do not share a body-. Then, all the loops that apply changes to body.AngularVelocityInternal. and body.LinearVelocityInternal, and the positions, will not be having conflicting write to the same body, since -no two batches share a body-. This genius idea is described and illustrated by Takahiro Harada in many of his publications, but he goes much further to the GPU methods. I see that GPU in .net is something much harder to be deferred for future, and would likely need this constraint and contact sorting step done as a prerequisite. No #ifdef would be required for Parallel.For, but the BatchSize could be a runtime parameter for testing, setting to 1 means don't bother with MT loops. This sorting would be done on the bodyLists inside an Island. As for running the current box2d "Island"concept in parallel, I do not recommend to even bother with that, since with the Takahiro-described batching method, each island will be sped up and the cores will all be loaded. looking at Bullet where it might be implemented in the OpenCL version, it is hard to understand the batching code, I found some openCL code appearing to make batches but its really hard to read and undocumented. We could maybe use HashSets and put the bodies in there by keyed by ID, then try to organise the pairs such that pair.BodyA and pair.BodyB are all unique in each batch of pairs. Harada describes an interative method to make the batches (serial is slow, parallel is not straightforward, maybe more details in here: http://dl.acm.org/citation.cfm?id=2077406. There are some loops like SynchroniseFixtures that may not require batching, just a foreach. I'm not proposing GPU rigidbodies as in Bullet3, it is a massive problem to implement and deploy, or stacking of 100,000 boxes as seen in the demos, but just maybe taking the batching technique if it can translate from that baremetal openCL code to .net 1.To make raycast multithreaded there are several changes to the RayCast method implementation, and make it use the parameters input and less of the tree state. A crash was caused byissue with using the stack as a shared member in DynamicTree.RayCast method. rather than do try to pull request and then merge/ commit i am setting you the three important changes, i am using my own branch not on GitHub. but i'd be happy to commit them if someone will piont me to the unittest . The change to _stack was to allocate a new stack as box2d did. Reusing this tiny 256-item stack in the tree really is negligible impact in the context of nearly any usecase. This is an example of a small optimization making a big one crash. Then i have shown a client can call RayCasts in parallel, reusing a big or small dynamictree. then b) RayCastInput subInput;
then c) This might not crash but is faster and causes less GC AABB segmentAABB;
for example here is a heavy , expensive process that writes to body state, velocity and position. //add benchmark code here.. Parallel.Foreach batch... do this instead.. SynchroniseFixtures is all very expensive and trying that in parallel could be worth a try. |
I didn't read the entire conversation in detail since my last comment, but please be aware that replacing a If you're writing a game library, you need to be aware that one of the contexts in which it is used will be editor applications, which often have a special With this in mind and also regarding what @Genbox mentioned, it seems preferable to not do multithreading on a micro-level like individual for loops, but instead (continue to) provide the facilities to allow users of the library to do multithreading on a macro-level, e.g. a defined API that allows to perform certain operations in a thread-safe way. Give users the tools to multi-thread if they need or care to instead. Edit: Another thing I'd like to mention is that from the perspective of me as an engine developer, I would very much prefer to just put the entire physics simulation in a different thread, which I can do already. Multithreading on a micro-level as a library implementation detail of Velcro seems like it would not add much value at the cost of increasing complexity. |
Thanks for raising the possible issue of "This means that Parallel.For and other TPL classes may actually process window messages" . I have an editor and we put the physics outside of the UI thread, not sure if my setup would suffer issues like this . So in my system it is graphics and , UI in one thread, and physics on the other. We used means of trying to load the other cores that are normally there in a quadcore, but sometimes CCD or piles is the bottleneck. Any code proposed could use parallel for and batching only if proposed Settings.MaxBatches was >1.. , or some boolean to prevent it. so clients using UI thread also for physics could just set that to 1 for one thread, or it be the default. ImageSharp is a very solid general-purpose .net imaging engine with many professional level contributors around the clock. They use parallel.For throughout since raster processes are parallelizable , and I never saw mention that clients didn't want it or how to get around it. But to shut it off clients can set paralleloptions.maxdegreeofparallelism to 1 i believe, it will be passed to the .net method parallel.For Here is an example of a parallel.For inside a general purpose engine implementation: https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Processing/Processors/Convolution/Convolution2DProcessor.cs ImageSharp may well could be used in editors with UI, not sure if the window message issue would come up for that. For my level editor, in my setup, visible objects are drawn while physics updates bodies, then the body positions are copied to the visible objects , then repeat. btw batching as in Haradas paper is really non- trivial, so im not sure if this will happen. The batchin could be the existing "Islands" though, or maybe a parallel batching scheme testing the bodies in new constraints to those in existing batches, all in parallel. Again in MaxBatches was ==1 , then it would not need to incur this overhead |
I used the TumblerTest with 400 bodies to stress the engine and see if it's possible to multithread some methods. I found that |
SolvePositionConstrains() will be harder to optimize with threads, mainly because it removes Contacts from the list. I know that @Genbox plans to replace the List<> with a LinkedList<>, so here's an interesting article about LinkedLists and Multithreading: |
The proposition here in the original thread is to multithread without locks of sync inside the loops, only this approach can provide the performance multipliers per core that allow Bullet3 to stack 100,000 boxes ( but that is with GPU) . |
I prefer lock-free algorithm myself but in this case I don't think it's possible. |
Oh 3 or more times performance for 4 cores is a pretty good result, but when you have 3.95X for four core you know you are not wasting cycles in syncing that could be used for AI or render . However this might be the best that is practical. If you would share a github branch with this work it would be interesting, especially to try on a pile of boxes all marked bullet, with the threading and this type of lock in the Island.SolveTOI. But see, the compelling thing about the "upfront batching" is that it has been implemented that way and tested years ago for bullet3, albeit for GPU, and, there are a few loops that all could reuse the per frame body batches . Yes, the" graph coloring" batching is illustrated in the https://www.slideshare.net/takahiroharada/solver-34909157 with chess pieces , i am still trying to download the paper. As you note it should be , batching itself lends itself to be a parallel process with locks. Here might be the batching code if anyone can understand it, in openCL, its treats the static bodies as readonly and so batches can share those. this is just an idea to explore and might not be as practical on CPU as for GPU. If I google "A parallel constraint solver for a rigid body simulation" Takahiro Harada, I get So ,while bullet3 does stack 100,000 bodies in amazing videos, it is hard-to-deploy- gpu code and just a simpler multi -threaded stack solver of 2d "bulleted" boxes would be useful. Even stacking 50 items with CCD can get bogged... https://www.youtube.com/watch?v=7TkOi-6LCEU <--big stacks I see a couple OpenCL and WebCL implementation of box2d with c++ but for not for .net. maybe the https://github.com/wolfviking0/webcl-box2d is worth a look. Also Box2d MT that does islands in parallel, that might be another topic to explore that might be more practical for 4 , 6, or 8 threads in parallel, rather than for GPU, which wants to do something like 64 or more parallel batches |
Hi all. Hope my dropping in on this conversation is okay. I really appreciate @Genbox 's mention, back on May 14, of my C++ work with Box2D. Incidentally, I'm still working on it but I've shifted my work over to a new repository with a new project name of PlayRho in case anyone is interested in following it. As for speed optimizations, so far the thing I've seen in my experimenting that's most impressed me is what happened when I compared using various different data structures vs. the existing Box2D structure for detecting when a contact already existed in code for finding new contacts. I got inspired to look into the Add Pair Stress Test Testbed demo where related sources in Erin's Box2D code had commented about considering using a hash table (see In regards to multi-threading, I haven't seen any performance improvement with the multi-threading that I've tried so far, at least on a pyramid like stacking test. Given that a pyramid is all one island, I wouldn't expect solving separate islands in separate threads to speed up a test with just one island but it was discouraging that multi-threading the contact updating didn't help. Admittedly, I haven't played around with this as much as I think it needs and realize that there's probably ways I could optimize what I've done. OTOH, I'm worried that cache misses may have more to do with performance losses than what multi-threading can gain and so more of my focus has been on refactoring code towards becoming cache friendlier. In any case, this continues to be a very interesting problem to me. |
Hi @louis-langholtz - thanks for dropping by and writing about your experience. I suspect you are right regarding the cache coherence versus code running time. |
C# is soon going to have this: https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.2/readonly-ref.md I remember back 10 years ago when they argued that Java/C's const parameters were too much for most developers and it should be something handled by the JIT compiler. Well, they seem to have reversed that decision completely and are now going with const parameters. In any case, I expect the compiler to be able to inline much more aggressively and do a lot more tricks for arithmetic reductions. |
A quick update on SIMD: dotnet/coreclr#13576 |
As far as SIMD hardware acceleration goes, would it be possible to use System.Numerics.Vectors? Already out there since 2015, no need to wait for cutting edge CLR / FX changes to hit mainstream .NET. |
@ilexp sure, that's not really the news though :) I've implemented part of the engine solver using System.Numerics.Vector and ran some tests - the results were not really what I expected. Much of it had to do with boxing, conversions and memory copying. With this API, it does not seem to be an issue. |
Hey there,
I've recently done a bit of work on physics-related tasks and in the process spent some thoughts on how Farseer performance could be improved.
Performance Observations
First, some assumptions on certain performance aspects that hopefully are uncontroversial enough to agree on as a precondition:
List<T>
vs.T[]
when it comes to bulk-processing large amounts of data.Potential Performance Sinks
Next, here is a list of Farseer system design decisions that potentially drag performance below the theoretical maximum:
Design Decision Draft for Maximizing Efficiency
Finally, I'll just throw in some rather radical ideas on how to restructure the overall system to address the above issues:
Parallel.For
and similar.int
-based handles that represent the access index in their world.As a side effect to improved efficiency, less overhead and internal parallel execution, the above changes could also make serialization (see issue #6) a lot easier. It would also make pooling (see issue #5) obsolete.
On the downside, this would require users to adopt a new API and potentially be more careful due to less safety options being around. There will be a bigger need for good documentation so users are primed for certain aspects of the system. This might also generate a large amount of issues that need to be resolved in order to adapt to the new design, but if any of these changes should make it into Velcro, now would likely be a better opportunity to tackle them than later.
I'm totally aware that the above are quite radical suggestions and that I don't have any idea about most of the internal structures and design decisions of Farseer, so please read it as a collection of ideas by an interested observer :)
The text was updated successfully, but these errors were encountered: