Allow frame_queue_size=1 for reduced input lag #94898
Conversation
Cc @DarioSamo, can you remember why we ensured a minimum queue size of 2 in the project setting?
This is a nice time to summon https://darksylinc.github.io/vsync_simulator/. Assuming Godot internals aren't assuming the frame queue size is >= 2 (which seems to be the case, otherwise this PR would have just failed instead of working), this may make sense. As far as I can tell, if the game is simple enough and/or the hardware powerful enough that CPU and GPU times are short, no frame queuing (a frame "queue" size of 1) reduces latency a lot. However, if either the CPU or the GPU isn't so lightning fast, the framerate drops (as there are no queued frames to dampen the slowdown) without an improvement in latency. My take is that we may actually want to unlock this possibility for users, but maybe warn them somehow. We could warn when it's set to 1.
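For intuition, here is a rough back-of-the-envelope model of the effect being described (my own simplification for illustration, not the simulator's math and not Godot code): when the game is V-Sync bound and CPU/GPU work is cheap, each extra queued frame adds roughly one refresh interval of latency without improving throughput.

```cpp
// Back-of-the-envelope latency model for a V-Sync-bound frame loop.
// This is a simplification for intuition only -- the linked vsync_simulator
// models the real behaviour far more accurately.
#include <algorithm>
#include <cstdio>

double estimate_latency_ms(double cpu_ms, double gpu_ms, double refresh_ms,
                           int frame_queue_size) {
    // Time spent producing the frame itself...
    const double work = cpu_ms + gpu_ms;
    // ...plus roughly one refresh interval for every extra queued frame,
    // since input sampled now reaches the screen that many vblanks later.
    const double queue_penalty = (frame_queue_size - 1) * refresh_ms;
    // A V-Sync'd frame can never be presented sooner than the next vblank.
    return std::max(work, refresh_ms) + queue_penalty;
}

int main() {
    // Light workload on a 60 Hz display: the difference is roughly one frame.
    std::printf("frame_queue_size=1: ~%.1f ms\n", estimate_latency_ms(2.0, 3.0, 16.7, 1));
    std::printf("frame_queue_size=2: ~%.1f ms\n", estimate_latency_ms(2.0, 3.0, 16.7, 2));
    return 0;
}
```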
To me it feels like allowing users to shoot themselves in the foot, and letting knowledge spread about what is generally a bad practice. While it may look like it runs well for them, they can't guarantee that's the case for end users. I can easily see this leading to someone making a tutorial and going "Why is this not the default? It is clearly better!" without warning people about the costs or realizing they're shipping something that won't work as well on weaker machines.

I'll say you're probably looking at the wrong place to reduce input latency, depending on the platform. If we're talking about Windows, you'll immediately get a much bigger benefit out of switching to D3D12, thanks to DXGI, than doing something like this. The Compatibility renderer isn't inherently lower latency because of Forward+ adding a frame queue; it's because Vulkan surfaces are just generally awful on Windows and have inherent input latency you can't get rid of. I feel you should only unlock this setting if you're clearly aware of what you're doing, and I'm not sure even a few warnings would be enough to deter people from making a potential mistake.
They indeed don't fail anymore, as I've rewritten it so it works. There are cases where RenderingDevice uses a frame queue of 1, like when it's created to render lightmaps. The minimum project setting was established mostly out of concern for the reasons outlined above.
If you ask me, it looks like an option that advanced players may want to tweak, not something to be predetermined by the game developer.
I think that sounds reasonable as a setting for power users in a shipped game, and in fact, some games do expose this kind of behavior as a toggle for people with strong CPUs. The only problem is it requires a game restart at the moment, although there's a chance you could make it tweakable at runtime with the right function as long as it flushes all existing work safely.
I think this is all the more reason why we should allow this to be changed at runtime. Everyone's configuration is gonna be different. Heck, I believe KDE recently introduced triple buffering into their compositor. It automatically turns itself on for older Intel integrated GPUs and improves smoothness significantly.
I am already telling the nVidia drivers to layer OpenGL/Vulkan apps on DXGI. PresentMon reports "Hardware Composed: Independent Flip" for the game's process. Latency is the same in both windowed and fullscreen.
The option is already hidden behind the Advanced toggle in the UI, and there's plenty of other options where users could shoot themselves in the foot. I'd at least like the option to drop frame queue size to 1 because I know what I'm making is not a very intensive game (and if framerate does become a concern, the player can always disable V-Sync). In my eyes, this hardly feels like a mistake.

I've dealt long enough with games in all kinds of engines having issues with input lag. It's one of the reasons I sprung for a 240hz monitor. But not everyone has that kind of money, and there's also plenty of folks using non-gaming laptops, so I'd really like to improve the situation for those on 60hz displays.

I think informing the user of the potential downsides of this is a good idea, though. Currently the documentation for the setting doesn't mention anything about the tradeoffs you get by changing that value. I'm unsure if a warning would be appropriate here though - usually that stuff is reserved for incorrect configuration, and for many projects it's reasonable to set it to 1. I've also been thinking of updating the current documentation on jitter, stutter, and input lag to mention frame queue size and swapchain image count.
I've used the same setting and I've hardly seen better results compared to actually using D3D12. It behaves as more of a compatibility setting to make sure stuff actually works (e.g. screenshots) than actually offering lower latency in my experience. I very much recommend trying D3D12 to draw a better conclusion here than the driver's option. It is immediately apparent just from using the editor that the latency is significantly lower.
A warning seems appropriate considering it'll significantly impact performance and users might forget that they modified it in the long run. The more demanding the project is, the more disastrous the performance drop will be. But if you ask me, I still think this is one of those things where the minimum of 2 is reasonable because of the reasons I laid out. I will not go against lowering the minimum if it's what others want, but I think the possibility I stated is very real and will result in extra support and a worse perceived image of the engine's performance on games that ship with that setting modified to 1 at the project level.
I think this might be worth investigating. It's entirely possible that D3D12 and Vulkan implementations differ in how they're handled by the driver. From my limited understanding of synchronization, a higher frame queue size means the driver may choose to use the additional time for optimizations, but it doesn't have to. Driver settings might also affect this depending on the rendering API. For example, AMD advertises Radeon Anti-Lag as supporting DirectX 9, 11, and 12, but does not seem to mention OpenGL or Vulkan (though it looks like Vulkan support landed very recently). This reminds me of the whole glFinish situation from many years ago: some drivers needed it for lower latency, while others had problems with it enabled.
My main concern is that it might cause too much noise if Godot gets linting tools in the future (though I suppose they could be set to permit specific warnings if the user knows what they're doing).
I concur with Dario that disabling double buffering is a shotgun to the foot. However, the solution may be different. Ideally Godot should sleep to minimize latency. The concept is simple:
The concept is very simple; however, the reason I haven't attempted to implement it is that the devil is in the details:
When you enable "anti lag" in the NV & AMD control panels, this is pretty much what happens under the hood: the driver sleeps for us and pretends the GPU is taking more time, so that the app starts polling input as late as possible. TheForge integrated Swappy (and there is a PR in the works) to add such functionality, but only for Android, given that Swappy makes use of VK_GOOGLE_display_timing and Android's Choreographer.
This isn't disabling double buffering.
This sort of technique can very easily introduce jitter/stutter if the framerate fluctuates. It also doesn't address the existing latency caused by the swapchain or lazy CPU/GPU synchronization, which is safer to remove than delaying rendering.
My understanding of "anti lag" is that it changes the driver behavior to send the commands to the GPU as soon as it can. There might be an "Ultra" setting that does more, such as sleeping to render later, but that comes at the risk of causing stutter.
No. You could set swapchain_image_count to 10 and yet it will be single buffered. frame_queue_size = 2 means there is one frame the GPU is working on, and another frame the CPU is working on, a.k.a. double buffering. When you set frame_queue_size = 1, only the CPU can access the frame, or only the GPU.
From what I recall, double-buffering and triple-buffering refer to the number of images in the swapchain. Each one of these images is a buffer. With double-buffering, you have one buffer displayed on-screen while the GPU is updating the other. With triple-buffering, you have three buffers. This setting is controlled by swapchain_image_count. Single-buffering is more along the lines of old DOS programs that would update the screen directly as it was being displayed, meaning you would actually see partial updates as the application draws elements such as boxes and text. I am not trying to create single-buffering, just remove unnecessary latency when using double-buffering.
My understanding (correct me if I'm wrong) is that
This is wrong; darksylinc explained it correctly above. Hopefully this illustration will help you understand what is going on:
I think we're just referring to different things when we talk about single-buffering vs double-buffering. In this case you're referring to command buffering. Might be worth establishing some common definitions when referring to this stuff, but at this point I think we're just bikeshedding.
I am already getting 240 FPS. I'm not sure what I'd gain by relaxing the GPU command sync, and it's not like I'm making anything real intensive. Just because it might cause problems for other projects shouldn't mean I should be forbidden from using this in mine. There's a reason we have Project Settings, after all. (Also, having the CPU wait for V-Sync may be better for energy efficiency, since it can spend more time sleeping.)

Edit: I measured some frame times with my project, all with V-Sync disabled, driver DXGI enabled, and driver low-latency disabled. Having frame_queue_size at 2 seems to produce a slightly higher and more consistent framerate, but I think the 50% better input lag at 1 is worth more than the 4% better framerate. The disparity in framerate would probably be larger with a more complex project, though.
For D3D12 you should use GPUView to see what the GPU is actually doing. Here is a screenshot of a capture I made using a simple project with default settings using the mobile renderer: PresentMon reported 4 frames (66 ms) of latency (that may be excluding some of the CPU work). Although I cannot explain exactly what is going on here, it seems that part of the latency comes from Godot sending too much work to the GPU, as I believe there is work for two whole frames being blocked in the GPU queue. The flip queue shows two present frames in the queue, which is the expected amount for swapchain_image_count = 3.

With frame_queue_size = 1, PresentMon still reported 4 frames (63 ms) of latency, though I can't say whether I feel any actual difference (I would need to film the screen and count the frames). There is clearly less work queued on the GPU now, but something is still being blocked (it might be blocking on the Present call, which is expected when not using a waitable swapchain).

I used this project, spinning-cube_multiwindow.zip, to test. You can try comparing the latency with #94503, presenting OpenGL on DXGI. (For comparison, the following is a capture with OpenGL on DXGI using a waitable swapchain.)
As far as I'm aware, PresentMon only cares about how the final image is presented to the screen; I don't think it knows anything about the command buffers used to produce the image. So PresentMon giving a similar result for both settings is somewhat expected. I don't have D3D12 support set up in Godot right now, so I'm looking into alternative tools (I'm also mostly interested in the Vulkan results currently). I may see if nVidia FrameView or RenderDoc can get the information I need.
Fair enough. Single buffering vs double buffering refers to how many "frames in flight" are allowed by the execution model; they correspond to frame_queue_size values of 1 and 2 respectively. Importantly, there is no command buffering going on. The commands are submitted and executed in an appropriate order. The difference is at what point you synchronize the CPU and the GPU. In other words, the difference is how many frames you buffer. I'm not sure what you mean by command buffering. In Vulkan speak, command buffering is the process of recording commands before submitting them to a command queue.
Modern graphics APIs don't really have a built-in concept of buffering frames; it's something that exists totally on the application side now. In other words, we are the ones who control when to synchronize and when to buffer up frames. Also, to contrast this with OpenGL: in the OpenGL backend, we don't allow single buffering either. It always uses the equivalent of frame_queue_size = 2.
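As a purely conceptual sketch of the synchronization point described above (this is not Godot's actual RenderingDevice code; `Fence` and `submit_frame_to_gpu` are made-up stand-ins), the only thing the frame queue size changes is which previously submitted frame the CPU has to wait on before starting a new one:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Stand-in for a CPU<->GPU sync primitive (e.g. a fence); a no-op here.
struct Fence {
    void wait() const { /* block until the GPU has finished that frame */ }
};

// Stand-in for recording and submitting a frame's command buffers.
static Fence submit_frame_to_gpu(uint32_t slot) {
    std::printf("submitted frame using slot %u\n", slot);
    return Fence{};
}

// 1 = CPU and GPU take turns (lower latency), 2 = they overlap (default).
constexpr uint32_t FRAME_QUEUE_SIZE = 2;

int main() {
    std::array<Fence, FRAME_QUEUE_SIZE> in_flight{};
    uint32_t slot = 0;
    for (int frame = 0; frame < 8; ++frame) {
        // With FRAME_QUEUE_SIZE == 2 this waits on the frame submitted two
        // iterations ago, so the CPU can prepare frame N while the GPU is
        // still drawing frame N-1 (better throughput, ~1 extra frame of lag).
        // With FRAME_QUEUE_SIZE == 1 it waits on the frame just submitted,
        // so only one of CPU/GPU is busy at a time.
        in_flight[slot].wait();
        // ...poll input, update game state, record rendering commands...
        in_flight[slot] = submit_frame_to_gpu(slot);
        slot = (slot + 1) % FRAME_QUEUE_SIZE;
    }
    return 0;
}
```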
This is exactly why darksylinc calls this setting a "shotgun to the foot". The problem we are explicitly working to avoid is a developer with an expensive PC turning the setting on because, shrug, "it works on my PC", then shipping it to users and not understanding why their game runs horribly on all other devices. This way of thinking only makes sense when the only hardware you ship on is your own hardware.
All of the above discussion assumes that V-Sync is enabled. If you disable V-Sync, then there is no point in double buffering, as you will get tearing anyway.
You're just gonna get added input latency out of Vulkan on Windows unless you present to DXGI directly (which is something I attempted to hack together, but something about the synchronization primitives sadly fails). It is very apparent that the D3D12 driver doesn't suffer from this, and it's something I've experienced in other projects that aren't Godot as well. The driver option doesn't solve it, it just makes it not work as badly as regular Vulkan surfaces do on Windows when it comes to other capabilities.
Yes, this is what I meant by command buffering. Was looking for a term that describes "CPU communicating with the GPU" but now I realize there is some confusing overlap here.
My understanding is that frame queueing in OpenGL is highly dependent on the underlying driver implementation and when it decides to do CPU/GPU synchronization, not to mention the user's driver settings, which is one of the reasons that APIs like Vulkan exist now. So this comparison doesn't make a lot of sense.
I plan to have others test my game, but it's a 2D game rendering at 640x480 resolution and there are only about a hundred sprites (around 32x32 in size). I don't expect issues, because if players have an OpenGL 3-capable GPU, chances are the system will be fast enough to run the game well. If there are issues, I'll provide an easy setting for players to use. If extra support is a concern, document the potential issues with this setting and refer to that. Then the onus is on the game developer, not Godot.
Those times were meant to show that the efficiency gains from
I'm not sure why the focus is on DXGI here. I set
Are you certain that this isn't due to driver-level anti-lag that supports D3D12 and not Vulkan?
To be extremely clear, you are getting reduced latency as a side effect of doing something that can severely impact the playability of your game. The focus is on DXGI because the main source of latency is presenting a Vulkan swapchain, which will always be slow on Windows. You can reduce most of the latency just by switching to DX12, or to DXGI for swapchain presentation.

To put the reactions you have gotten here in perspective: using single buffering to decrease latency in a game engine is like removing the engine from a car to make it more lightweight. If being lightweight is your only goal, then sure, go ahead; maybe you only ever need to drive downhill, so you don't care about having an engine. The thing is, from the perspective of the engine it makes no sense: a car without an engine sucks for 99.99% of people. We make software for the 99.99% of people, not for the person who wants an engine-less car.

Ultimately, we have expressed that we don't want to give our users a foot gun. You have expressed that you don't care about shooting yourself in the foot and are very happy to do it because of your unique constraints. The ideal solution here is for you to ship your game with a custom build that allows single buffering. That way we don't have to worry about the support burden, and you can happily ship your game with reduced latency. You know best what trade-offs you are willing to make, so you can provide this option for yourself without us having to sacrifice the usability of the engine.
There are several assumptions being made here that are still not being addressed.
Can you provide some numbers or examples on how this could impact playability any more than the existing tweakables Godot provides in Advanced mode? This is my main point of contention.
With DXGI, it is using the hardware display planes. I can enable DXGI layering for Vulkan in the driver settings, and if I disable V-Sync, I see tearing even in windowed mode. I don't think you can get any more direct-to-screen than that. Is there something else I'm missing here? We also don't know exactly why D3D12 has less latency than Vulkan here. It could be due to the DXGI implementation, or due to driver-level anti-lag. This warrants further investigation.
Presently, developers can do things like set the 3D shadow resolution to 16k. This doesn't work well for 99.99% of users, but Godot still allows it. In fact, shadow resolution above 2k can cause significant performance issues with integrated AMD graphics (the default is 4k). But that's why it's a tweakable. At some point you just need to trust that the game developer knows what they're doing. If they don't, it's the developer's fault. Trying to take responsibility for every bad thing any developer could do is not a healthy relationship to have. Yes, try to prevent developers from making mistakes. But if the warning is there, and it's not default, and the developer chooses to do it anyway, just let them do it.
Because I'm not the only one making games. I'm sure many others want to make fast-paced games where low latency is a priority. If they didn't, they wouldn't be filing bug reports about input lag in the Forward+ renderer.
The first response to your PR pointed you to a simulator that illustrates the problem and lets you see the impact of this setting in all scenarios. You can literally see it for yourself: https://darksylinc.github.io/vsync_simulator/. It feels now like I've just been talking past you and none of what I have said has actually stuck, so I will just leave you with that.
I strongly recommend getting a high-speed camera and doing end-to-end measurements if you want to see what really makes a difference, as tools like PresentMon are known to give incorrect latency readouts. You can point a phone at your screen with your mouse/keyboard visible and use the camera's (super) slow motion mode, which is typically 240 FPS or 480 FPS. This gives you 4.2 ms or 2.1 ms precision, which should suffice as long as you take enough samples. https://github.com/GPUOpen-Tools/frame_latency_meter was recently released, but looking at its issues page, there are reports of its latency readouts not being very accurate either. I'd say nothing beats a hardware solution here.
We can expose it as a CLI argument like we did for
On the jargon aspect, anecdotally, I also have that legacy of the very old times when double-buffering was about what we call the swapchain today, so it's indeed a good thing the concepts are being cleared up so everyone is talking about the same thing.

On the latency aspect, it's worth noting that the frame timing simulator, which is agnostic to DXGI/Vulkan, shows latency is roughly halved with single buffering under conditions akin to the kind of project being talked about here. However, being practical, it's already been established that single buffering may be a good fit for some, so the discussion is now whether that should be exposed. Maybe we can find an approach that satisfies everyone. Collecting some of the ideas already stated, would this work?:
Sounds good to me 🙂
Okay, I tried getting some better traces. Intel's Graphics Trace Analyzer shows me exactly where the wait is happening. Honestly, I think these traces are pointless anyway, because all the waiting looks to be caused by the renderer trying to render too many frames, but also my test scene is too simplistic to really show anything.
I'm surprised this change is so controversial. Unity has supported this since at least 2015: QualitySettings.maxQueuedFrames
A value of
Your captures show exactly the problem I pointed out: Godot polls input immediately after VBLANK, starts rendering, then sleeps for 14 ms. Ideally it should be doing the opposite: sleep for 14 ms, poll input, start rendering.
There is a big problem with this idea: we do not know how long the next frame will take to render. Let's say that last frame took 4 ms to render, and V-blank happens every 16 ms. So we wait for 16 - 4 - 1 = 11 ms (using 1 ms as a cushion). Next we poll input and render, which takes 4 ms. We spent 15 ms, so we make it in time for V-blank. We saved 11 ms in latency. Great! But what if the player pans the camera around and we suddenly have to render more? We continue with the assumption that it will take 4 ms to render. We wait 11 ms again. We poll input and render, but instead it takes 6 ms. Now we've spent 17 ms and missed the V-blank window, resulting in stutter.

So now we realize that 1 ms of cushion isn't enough in this scenario, and we introduce a tweakable in the Project Settings to allow changing this value. Suddenly we have to tweak this value per project, per machine, in order to find a good value that won't cause stutter while still saving enough input lag to be worth using. There will be no "one size fits all" solution. I can foresee this being a far bigger support burden than just allowing frame_queue_size=1.

We also need to consider that the sleep function can be problematic, especially on Windows, where we're at the mercy of the current timer resolution. Performing a sleep may take as long as 15.6 ms, and in my experience I haven't been able to find a way to get the sleep resolution lower than about 2 ms. This means there is a very real risk of over-sleeping, resulting in stutter (and we aren't even taking pre-emption by the task scheduler into account). There's an article that discusses the issue here, including changes that were made in Windows 10 that affect timer stability and the ability to set it to a lower value. Mitigations for timing attacks in browsers and the OS may also be a problem. I'm not quite sure what the graphics driver is doing, but it seems to do a better job of waiting for V-blank than a manual sleep would.

Having said all that, the situation is probably much less dire. And regardless of frame queue size, we'd still be moving around the timing of the input reads, which some players might not like. It'd be an issue for fighting games, shooters, and rhythm games (especially rhythm games, where a static value for latency compensation is established during game onboarding). I'm not even sure if the OS or windowing system can guarantee that it will deliver the input events to the window when we expect to read them, so the delayed input read might not even fix latency at all depending on the system. It's an interesting idea, but I feel like there's too much that could go wrong here.

Anyway, I decided to do some more research on the topic of swapchains. Unfortunately, support is not ubiquitous. nVidia and Intel support this presentation mode, but AMD does not. Android seems to have support for it across all drivers, but it's been reported to cause battery drain due to rendering unused frames. In the macOS/iOS world, mailbox apparently doesn't exist at all.

Things also get weird on hybrid graphics laptops. I tried this setting on AMD integrated + nVidia discrete. When running on the nVidia graphics, it acted the same as V-Sync off (probably because the AMD chip was driving the display). When running on the AMD graphics, mailbox wasn't used at all and input lag was dreadful. So a different technique is needed to bridge the gap on these systems.
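For reference, here is a minimal sketch of the "sleep, then poll input, then render" heuristic being debated (my own illustration with made-up numbers, not a proposed implementation). It makes the trade-off explicit: all the saved latency comes out of the cushion you are willing to risk against a missed V-blank.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Illustrative sketch of the late-frame-start heuristic discussed above.
// The numbers are assumptions; real code would measure them every frame.
int main() {
    using namespace std::chrono;
    const auto vblank_interval = milliseconds(16); // ~60 Hz
    const auto predicted_cost  = milliseconds(4);  // last frame's CPU+GPU time
    const auto cushion         = milliseconds(1);  // safety margin -- the hard part to tune

    for (int frame = 0; frame < 3; ++frame) {
        // Start the frame as late as possible: 16 - 4 - 1 = 11 ms of sleep here.
        const auto sleep_time = vblank_interval - predicted_cost - cushion;
        if (sleep_time > milliseconds::zero())
            std::this_thread::sleep_for(sleep_time); // may oversleep on coarse OS timers

        // poll_input(); render();
        // If this frame suddenly costs more than predicted_cost + cushion,
        // the vblank is missed and the result is stutter instead of lower latency.
        std::printf("frame %d: slept %lld ms before polling input\n",
                    frame, static_cast<long long>(sleep_time.count()));
    }
    return 0;
}
```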
It seems that right now there are several ways to attack input lag, but none that are "one size fits all" unless the game is simple enough that framerate drops won't be an issue:
A silly thought that might not require specialized hardware: we could make a test program that moves the hardware cursor (since that's on its own hardware plane) while also displaying a circle where we told the cursor to move, and simply grab a fast-enough camera and take a picture.
This could be a useful escape hatch in the event that a developer hardcodes a frame queue size of 1, but players need to set it to 2, and the developer has stopped updating their game. Edit: If we allow games to set the frame queue size at runtime, the escape hatch would need to block this and enforce the value specified on the command line for this to be effective.
Give #94960 a try.
I just made a latency tester app in Godot. It should help us measure input lag in frames instead of having to rely on software tools or "eyeballing it": https://github.com/KeyboardDanni/godot-latency-tester

Summary of findings:
To compare with OpenGL, latency is 2 frames with V-Sync enabled on Layered DXGI, and 3 frames with Native Presentation. Which is weird, because Vulkan seems to have the opposite effect. Maybe nVidia made a mistake when implementing Layered DXGI in Vulkan? This is without any patches besides allowing frame_queue_size=1.
Wrote up a new proposal that tries to address some of the concerns raised previously: godotengine/godot-proposals#11200
Related: #75830
This PR adjusts the minimum value for rendering/rendering_device/vsync/frame_queue_size down to 1. The default value of 2 is not changed.

Setting frame_queue_size=1 improves input lag in Forward+ to be in line with the Compatibility renderer, at the expense of potentially reducing the framerate in complex scenes. While the framerate hit might be an issue for intensive 3D games, it's not a problem for less demanding applications, and it considerably improves the experience for 2D precision platformers and fast-paced shooters.

I haven't noticed any issues with it on my Windows 11/nVidia system (apart from a bit of stuttering on one run due to a known issue that went away when minimizing the editor).
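For anyone wanting to try this locally once the PR is applied, a sketch of the override in project.godot, using the setting path quoted above (the value 1 is only accepted with this change; with vanilla builds the minimum remains 2):

```
; project.godot -- frame_queue_size=1 is only accepted with this PR applied.
[rendering]

rendering_device/vsync/frame_queue_size=1
```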