Allow frame_queue_size=1 for reduced input lag #94898

Open
KeyboardDanni wants to merge 1 commit into master from min-frame-queue-size

Conversation

KeyboardDanni
Contributor

Related: #75830

This PR adjusts the minimum value for rendering/rendering_device/vsync/frame_queue_size down to 1. The default value of 2 is not changed.
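
For reference, a minimal sketch of what opting in could look like in a project's project.godot once the lower minimum is allowed (an assumption about the stored form; the key only appears in the file when changed from the default, and currently requires a restart to take effect):

    [rendering]

    rendering_device/vsync/frame_queue_size=1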

Setting frame_queue_size=1 improves input lag in Forward+ to be in line with the Compatibility renderer, at the expense of potentially reducing the framerate in complex scenes. While the framerate hit might be an issue for intensive 3D games, it's not a problem for less demanding applications, and considerably improves the experience for 2D precision platformers and fast-paced shooters.

I haven't noticed any issues with it on my Windows 11/nVidia system (apart from a bit of stuttering on one run due to a known issue that went away when minimizing the editor).

@KeyboardDanni KeyboardDanni requested review from a team as code owners July 29, 2024 03:40
@clayjohn
Member

Cc @DarioSamo can you remember why we ensured a minimum queue size of 2 in the project setting?

@RandomShaper
Member

This is a nice time to summon https://darksylinc.github.io/vsync_simulator/.

Assuming Godot internals aren't assuming the frame queue size is >= 2 (which seems to be the case, otherwise this PR would have just failed instead of working), this may make sense.

As far as I can tell, if the game is simple enough and/or the hardware powerful enough that CPU and GPU times are short, no frame queuing (frame "queue" size being 1) reduces latency a lot. However, if either the CPU or the GPU isn't so lightning fast, the framerate drops (as there are no queued frames to "dampen" the hit) without an improvement in latency.

My take is that we may actually want to unlock this possibility for users, but maybe warn them somehow. We could warn when it's set to 1, as well as state it clearly in the docs (if it's not already). Additionally, it may be interesting to have a runtime component that detects issues in this area.

@DarioSamo
Contributor

DarioSamo commented Jul 29, 2024

Cc @DarioSamo can you remember why we ensured a minimum queue size of 2 in the project setting?

It feels to me like allowing users to shoot themselves in the foot and letting knowledge spread about what is generally a bad practice. While it may look like it runs well for them, they can't guarantee that's the case for end users. I can easily see this leading to someone making a tutorial and going "Why is this not the default? It is clearly better!" without warning people about the costs or realizing they're shipping something that won't work as well on weaker machines.

I'll say you're probably looking in the wrong place to reduce input latency depending on the platform. If we're talking about Windows, you'll immediately get a much bigger benefit out of switching to D3D12 thanks to DXGI than doing something like this. The Compatibility renderer isn't inherently lower latency because of Forward+ adding a frame queue; it's because Vulkan surfaces are just generally awful on Windows and have inherent input latency you can't get rid of.

I feel you should only unlock this setting if you're clearly aware of what you're doing, and I'm not sure even just putting a few warnings would be enough to deter people from making a potential mistake.

Assuming Godot internals aren't assuming the frame queue size is >= 2 (which seems to be the case, otherwise this PR would have just failed instead of working), this may make sense.

They indeed don't fail anymore, as I've rewritten it so it works. There are cases where RenderingDevice uses a frame queue of 1, like when it's created to render lightmaps. The minimum project setting was established mostly out of concern for the reasons outlined above.

@alvinhochun
Contributor

If you ask me, it looks like an option that advanced players may want to tweak, not something to be predetermined by the game developer.

@DarioSamo
Contributor

DarioSamo commented Jul 29, 2024

If you ask me, it looks like an option that advanced players may want to tweak, not something to be predetermined by the game developer.

I think that sounds reasonable as a setting for power users in a shipped game, and in fact, some games do expose this kind of behavior as a toggle for people with strong CPUs. The only problem is it requires a game restart at the moment, although there's a chance you could make it tweakable at runtime with the right function as long as it flushes all existing work safely.

@KeyboardDanni
Contributor Author

Cc @DarioSamo can you remember why we ensured a minimum queue size of 2 in the project setting?

It feels to me like allowing users to shoot themselves in the foot and letting knowledge spread about what is generally a bad practice. While it may look like it runs well for them, they can't guarantee that's the case for end users. I can easily see this leading to someone making a tutorial and going "Why is this not the default? It is clearly better!" without warning people about the costs or realizing they're shipping something that won't work as well on weaker machines.

I think this is all the more reason why we should allow this to be changed at runtime. Everyone's configuration is gonna be different. Heck, I believe KDE recently introduced triple buffering into their compositor. It automatically turns itself on for older Intel integrated GPUs and improves smoothness significantly.

I'll say you're probably looking in the wrong place to reduce input latency depending on the platform. If we're talking about Windows, you'll immediately get a much bigger benefit out of switching to D3D12 thanks to DXGI than doing something like this. The Compatibility renderer isn't inherently lower latency because of Forward+ adding a frame queue; it's because Vulkan surfaces are just generally awful on Windows and have inherent input latency you can't get rid of.

I am already telling the nVidia drivers to layer OpenGL/Vulkan apps on DXGI. Presentmon reports "Hardware Composed: Independent Flip" for the game's process. Latency is the same in both windowed and fullscreen.

I feel you should only unlock this setting if you're clearly aware of what you're doing, and I'm not sure even just putting a few warnings would be enough to deter people from making a potential mistake.

The option is already hidden behind the Advanced toggle in the UI and there's plenty of other options where users could shoot themselves in the foot. I'd at least like the option to drop frame queue size to 1 because I know what I'm making is not a very intensive game (and if framerate does become a concern the player can always disable V-Sync).

In my eyes, this hardly feels like a mistake. I've dealt long enough with games in all kinds of engines having issues with input lag. It's one of the reasons I sprung for a 240hz monitor. But not everyone has that kind of money, and there's also plenty of folks using non-gaming laptops, so I'd really like to improve the situation for those on 60hz displays.

I think informing the user of the potential downsides of this is a good idea, though. Currently the documentation for the setting doesn't mention anything about the tradeoffs you get by changing that value. I'm unsure if a warning would be appropriate here though - usually that stuff is reserved for incorrect configuration, and for many projects it's reasonable to set it to 1.

I've also been thinking of updating the current documentation on jitter, stutter, and input lag to mention frame queue size and swapchain image count.

@KeyboardDanni KeyboardDanni force-pushed the min-frame-queue-size branch 2 times, most recently from da24f54 to c74095f on July 29, 2024 17:42
@KeyboardDanni KeyboardDanni requested a review from a team as a code owner July 29, 2024 17:42
@DarioSamo
Contributor

DarioSamo commented Jul 29, 2024

I am already telling the nVidia drivers to layer OpenGL/Vulkan apps on DXGI. Presentmon reports "Hardware Composed: Independent Flip" for the game's process. Latency is the same in both windowed and fullscreen.

I've used the same setting and I've hardly seen better results than when it's actually D3D12. It behaves as more of a compatibility setting to make sure stuff actually works (e.g. screenshots) than actually offering lower latency in my experience. I very much recommend trying D3D12 to draw a better conclusion here than the driver's option. It is immediately apparent just from using the editor the latency is significantly lower.

I'm unsure if a warning would be appropriate here though - usually that stuff is reserved for incorrect configuration, and for many projects it's reasonable to set it to 1.

A warning seems appropriate considering it'll significantly impact performance and users might forget that they modified it in the long run. The more demanding the project is, the more disastrous the performance drop will be. But if you ask me, I still think this is one of those things where the minimum of 2 is reasonable because of the reasons I laid out. I will not go against lowering the minimum if it's what others want, but I think the possibility I stated is very real and will result in extra support and a worse perceived image of the engine's performance on games that ship with that setting modified to 1 at the project level.

@KeyboardDanni
Contributor Author

KeyboardDanni commented Jul 29, 2024

I am already telling the nVidia drivers to layer OpenGL/Vulkan apps on DXGI. Presentmon reports "Hardware Composed: Independent Flip" for the game's process. Latency is the same in both windowed and fullscreen.

I've used the same setting and I've hardly seen better results than when it's actually D3D12. It behaves as more of a compatibility setting to make sure stuff actually works (e.g. screenshots) than actually offering lower latency in my experience. I very much recommend trying D3D12 to draw a better conclusion here than the driver's option. It is immediately apparent just from using the editor the latency is significantly lower.

I think this might be worth investigating. It's entirely possible that D3D12 and Vulkan implementations might differ in how they're handled by the driver. From my limited understanding of synchronization, a higher frame queue size means the driver may choose to use the additional time for optimizations, but it doesn't have to. Driver settings might also affect this depending on the rendering API. For example, AMD advertises Radeon Anti-Lag as supporting DirectX 9, 11, and 12, but does not seem to mention OpenGL or Vulkan (though it looks like Vulkan support landed very recently in the Linux driver and the Vulkan specification, with driver support hopefully soon to follow).

This reminds me of the whole glFinish situation from many years ago. Some drivers needed it for lower latency, while others had problems with it enabled.

I'm unsure if a warning would be appropriate here though - usually that stuff is reserved for incorrect configuration, and for many projects it's reasonable to set it to 1.

A warning seems appropriate considering it'll significantly impact performance and users might forget that they modified it in the long run. The more demanding the project is, the more disastrous the performance drop will be. But if you ask me, I still think this is one of those things where the minimum of 2 is reasonable because of the reasons I laid out. I will not go against lowering the minimum if it's what others want, but I think the possibility I stated is very real and will result in extra support and a worse perceived image of the engine's performance on games that ship with that setting modified to 1 at the project level.

My main concern is that it might cause too much noise if Godot gets linting tools in the future (though I suppose they could be set to permit specific warnings if the user knows what they're doing).

@darksylinc
Contributor

darksylinc commented Jul 29, 2024

I concur with Dario that disabling double buffering is a shotgun to the foot.

However the solution may be different.

Ideally Godot should sleep to minimize latency.

The concept is simple:

  1. Let's assume the monitor refresh interval is 16ms
  2. Rendering takes 3ms
  3. That means Godot should sleep for 13ms, then poll input, then take 3ms to render.
  4. As a result latency is 3ms instead of 16.

The concept is very simple, however the reason I haven't attempted to implement it is because the devil is in the details:

  1. We have no guarantee rendering takes 3ms. All we can do is predict based on past measurements
  2. Measuring rendering time without V-Sync is actually hard
  3. If we mispredict and miss V-Sync, we will now be off, always missing V-Sync because our sleeping always crosses the vblank. This results in the opposite: latency is maximized instead of minimized
  4. To recover from a misprediction, we need to query the API or DXGI to know if we've missed V-Sync and recalibrate our sleeping schedule
  5. To minimize mispredictions, it's good to pessimize rendering time by 100%, i.e. if rendering takes 3ms, let's assume it can take 6ms and sleep for 10ms (a rough sketch of this scheme follows below)
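
To make the idea concrete, here is a minimal, hypothetical C++ sketch of the sleep-then-poll-then-render loop with the pessimized prediction from point 5. None of this exists in Godot; poll_input() and render_and_present() are placeholders, and a real implementation would also need the vblank-miss feedback described in point 4:

    #include <chrono>
    #include <thread>

    using frame_clock = std::chrono::steady_clock;

    // One iteration of the hypothetical low-latency loop.
    void run_frame(frame_clock::time_point next_vblank,
                   std::chrono::microseconds &predicted_render_time) {
        // Pessimize by 100%: assume rendering can take twice as long as last time.
        const auto budget = predicted_render_time * 2;

        // Sleep until just before the predicted start-of-render deadline.
        const auto wake_time = next_vblank - budget;
        if (wake_time > frame_clock::now()) {
            std::this_thread::sleep_until(wake_time); // real code would spin for the last bit
        }

        const auto start = frame_clock::now();
        // poll_input();            // placeholder: sample input as late as possible
        // render_and_present();    // placeholder: CPU + GPU work for this frame
        const auto measured = std::chrono::duration_cast<std::chrono::microseconds>(
                frame_clock::now() - start);

        // Feed the measurement back into the prediction. A real implementation would
        // also detect missed vblanks (e.g. via DXGI frame statistics or
        // VK_GOOGLE_display_timing) and recalibrate, as point 4 describes.
        predicted_render_time = measured;
    }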

When you enable "anti lag" in NV & AMD control panels, this is pretty much what happens under the hood: the driver sleeps for us and pretends the GPU is taking more time, so that the app starts polling input as late as possible.

TheForge integrated Swappy (and there is a PR in the works), which adds such functionality but only for Android, given that Swappy makes use of VK_GOOGLE_display_timing and Android's Choreographer.

@KeyboardDanni
Contributor Author

KeyboardDanni commented Jul 29, 2024

I concur with Dario that disabling double buffering is a shotgun to the foot.

This isn't disabling double buffering. swapchain_image_count is still 2, it's just not waiting 2 frames on the CPU side before sending commands to the GPU.

However the solution may be different.

Ideally Godot should sleep to minimize latency.

The concept is simple:

  1. Let's assume the monitor refresh interval is 16ms
  2. Rendering takes 3ms
  3. That means Godot should sleep for 13ms, then poll input, then take 3ms to render.
  4. As a result latency is 3ms instead of 16.

This sort of technique can very easily introduce jitter/stutter if the framerate fluctuates. It also doesn't address the existing latency caused by the swapchain or lazy CPU/GPU synchronization, which is safer to remove than delaying rendering.

When you enable "anti lag" in NV & AMD control panels, this is pretty much what happens under the hood: the driver sleeps for us and pretends the GPU is taking more time, so that the app starts polling input as late as possible.

My understanding of "anti lag" is that it changes the driver behavior to send the commands to the GPU as soon as it can. There might be an "Ultra" setting that does more such as sleeping to render later, but those come at the risk of causing stutter.

@darksylinc
Contributor

This isn't disabling double buffering. swapchain_image_count is still 2, it's just not waiting 2 frames on the CPU side before sending commands to the GPU.

No. You could set swapchain_image_count to 10 and yet it will be single buffered.

frame_queue_size = 2 means there is one frame the GPU is working on, and another frame the CPU is working on. Aka double buffer.

When you set frame_queue_size = 1, only the CPU can access the frame, or only the GPU.
Which means full serialization, aka single buffer.

@KeyboardDanni
Contributor Author

KeyboardDanni commented Jul 29, 2024

No. You could set swapchain_image_count to 10 and yet it will be single buffered.

frame_queue_size = 2 means there is one frame the GPU is working on, and another frame the CPU is working on. Aka double buffer.

From what I recall, double-buffering and triple-buffering refer to the number of images in the swapchain. Each one of these images is a buffer. With double-buffering, you have one buffer displayed on-screen while the GPU is updating the other. With triple-buffering, you have three buffers. This setting is controlled by swapchain_image_count. I am not aware of a single Vulkan implementation that even allows the creation of a swapchain with only one buffer, so trying to do that wouldn't work anyway.

Single-buffering is more along the lines of old DOS programs that would update the screen directly as it was being displayed, meaning you would actually see partial updates as the application is drawing elements such as boxes and text. I am not trying to create single-buffering, just remove unnecessary latency when using double-buffering.

When you set frame_queue_size = 1, only the CPU can access the frame, or only the GPU. Which means full serialization, aka single buffer.

My understanding (correct me if I'm wrong) is that frame_queue_size deals with sending commands from the CPU to the GPU, i.e. how many frames to wait to send the commands. This is different from swapchain_image_count which is the number of images that are available to draw on. Delaying the submission of commands to the GPU allows parallelization, yes, but it's still adding a frame of lag for that parallelization you might not even need. If the CPU and GPU only take 1ms each (simple 2D graphics on a modern system), you're delaying everything by a frame for no good reason.

@clayjohn
Member

clayjohn commented Jul 29, 2024

My understanding (correct me if I'm wrong) is that frame_queue_size deals with sending commands from the CPU to the GPU, i.e. how many frames to wait to send the commands. This is different from swapchain_image_count which is the number of images that are available to draw on. Delaying the submission of commands to the GPU allows parallelization, yes, but it's still adding a frame of lag for that parallelization you might not even need. If the CPU and GPU only take 1ms each (simple 2D graphics on a modern system), you're delaying everything by a frame for no good reason.

This is wrong. darksylinc explained it correctly above. Setting frame_queue_size to 1 is essentially single buffering. When there is only one frame in the queue, all the GPU commands are submitted immediately, then the CPU will wait for the GPU to finish before continuing on to Frame n + 1. Which means the CPU waits for Vsync too. Doing this ensures that you can never overlap frames, so you leave a lot of performance on the table.

Hopefully this illustration will help you understand what is going on
frame_queue_size = 1
CPU | read input1 | | CPU work1 |-------------------------------------------------------------------------------| read input2| | CPU work2 |
GPU | ------------------------------------| do GPU work1 | | wait for vsync1 ......................................|---------------------------------------

frame_queue_size = 2
CPU | read input1 | | CPU work1 || read input2| | CPU work2 |----------------------------------| read input3| | CPU work3|
GPU | ------------------------------------| do GPU work1 | | wait for vsync ........................................|do GPU work2|...

frame_queue_size = 3
CPU | input1 | | CPU work1 || input2| | CPU work2 || input3| | CPU work3|--------------- |input4| | CPU work4|
GPU | ------------------------------------| do GPU work1 | | wait for vsync ........................................|do GPU work2|...
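
For readers following along, here is a hedged Vulkan-style sketch of the same timelines in code. It is illustrative only (Godot's RenderingDevice wraps this differently), and it assumes each per-frame fence was created in the signaled state:

    #include <vulkan/vulkan.h>
    #include <vector>

    struct FrameData {
        VkFence render_done = VK_NULL_HANDLE;  // assumed created with VK_FENCE_CREATE_SIGNALED_BIT
        VkCommandBuffer cmd = VK_NULL_HANDLE;
    };

    // frames.size() plays the role of frame_queue_size (1, 2 or 3).
    void render_loop(VkDevice device, VkQueue queue, std::vector<FrameData> &frames) {
        uint64_t frame_index = 0;
        for (;;) {
            FrameData &frame = frames[frame_index % frames.size()];

            // With frame_queue_size == 1 this is the fence we submitted on the
            // previous iteration, so the CPU fully serializes against the GPU
            // (the single-buffered timeline above). With 2 or 3, the CPU can
            // start frame N while the GPU is still working on frame N-1.
            vkWaitForFences(device, 1, &frame.render_done, VK_TRUE, UINT64_MAX);
            vkResetFences(device, 1, &frame.render_done);

            // ... poll input, record frame.cmd, acquire a swapchain image ...

            VkSubmitInfo submit = {};
            submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
            submit.commandBufferCount = 1;
            submit.pCommandBuffers = &frame.cmd;
            vkQueueSubmit(queue, 1, &submit, frame.render_done);

            // ... vkQueuePresentKHR ...
            frame_index++;
        }
    }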

@KeyboardDanni
Contributor Author

KeyboardDanni commented Jul 29, 2024

This is wrong. darksylinc explained it correctly above. Setting frame_queue_size to 1 is essentially single buffering. When there is only one frame in the queue, all the GPU commands are submitted immediately, then the CPU will wait for the GPU to finish before continuing on to Frame n + 1.

I think we're just referring to different things when we talk about single-buffering vs double-buffering. In this case you're referring to command buffering. Might be worth establishing some common definitions when referring to this stuff, but at this point I think we're just bikeshedding.

Which means the CPU waits for Vsync too. Doing this ensures that you can never overlap frames, so you leave a lot of performance on the table

I am already getting 240 FPS. I'm not sure what I'd gain by relaxing the GPU command sync. And it's not like I'm making anything real intensive. Just because it might cause problems for other projects, shouldn't mean I should be forbidden from using this in mine. There's a reason we have Project Settings after all.

(Also, having the CPU wait for V-Sync may be better for energy efficiency since it can spend more time sleeping)

Edit: I measured some times with my project. All times have V-Sync disabled, driver DXGI enabled, driver low-latency disabled, and swapchain_image_count=2.

With frame_queue_size=1, 240hz, G-Sync on: 1818-2070 FPS.
With frame_queue_size=2, 240hz, G-Sync on: 1962-2113 FPS.
With frame_queue_size=1, 60hz, G-Sync off: 1812-2066 FPS.
With frame_queue_size=2, 60hz, G-Sync off: 2014-2073 FPS.

Having it at 2 seems to produce a slightly higher and more consistent framerate, but I think the 50% better input lag is worth more than 4% better framerate. The disparity in framerate would probably be larger with a more complex project, though.

@alvinhochun
Contributor

alvinhochun commented Jul 29, 2024

For D3D12 you should use GPUView to see what the GPU is actually doing. Here is a screenshot of a capture I made using a simple project with default settings using the mobile renderer:

[GPUView screenshot]

PresentMon reported 4 frames (66ms) of latency (that may be excluding some of the CPU work). Although I cannot explain what exactly is going on here, it seems that part of the latency is from Godot sending too much work to the GPU, as I believe there is work for two whole frames being blocked in the GPU queue. The flip queue shows two present frames in the queue, which is the expected amount for swapchain_image_count = 3.

With frame_queue_size = 1, this is what I got:

[GPUView screenshot]

PresentMon still reported 4 frames (63ms) of latency though I can't say whether I feel any actual difference (would need to film the screen and count the frames). There is clearly less work queued on the GPU now, but something is still being blocked (it might be blocking on the Present call, which is expected when not using a waitable swap chain.)

I used this project spinning-cube_multiwindow.zip to test. You can try comparing the latency with #94503 presenting OpenGL on DXGI.

(For comparison, the following is a capture with OpenGL on DXGI using a waitable swap chain.)

[GPUView screenshot]

@KeyboardDanni
Contributor Author

PresentMon reported 4 frames (66ms) of latency (that may be excluding some of the CPU work). Although I cannot explain what exactly is going on here, it seems that part of the latency is from Godot sending too much work to the GPU, as I believe there is work for two whole frames being blocked in the GPU queue. The flip queue shows two present frames in the queue, which is the expected amount for swapchain_image_count = 3.

With frame_queue_size = 1, this is what I got:

PresentMon still reported 4 frames (63ms) of latency though I can't say whether I feel any actual difference (would need to film the screen and count the frames). There is clearly less work queued on the GPU now, but something is still being blocked (it might be blocking on the Present call, which is expected when not using a waitable swap chain.)

As far as I'm aware, PresentMon only cares about how the final image is presented to the screen. I don't think it knows anything about the command buffers used to produce the image. So PresentMon giving a similar result for both settings is somewhat expected.

I don't have D3D12 support set up in Godot right now, so I'm looking into alternative tools (I'm also mostly interested in the Vulkan results currently). May see if nVidia FrameView or Renderdoc can get the information I need.

@clayjohn
Member

clayjohn commented Jul 29, 2024

I think we're just referring to different things when we talk about single-buffering vs double-buffering. In this case you're referring to command buffering. Might be worth establishing some common definitions when referring to this stuff, but at this point I think we're just bikeshedding.

Fair enough.

Single buffering vs double buffering refers to how many "frames in flight" are allowed by the execution model.
Single buffering means "do all the work to draw a frame, present the frame, then start on the next frame".
Double buffering means "do all the work to draw a frame, then start on drawing the next frame. If you finish that before presenting frame one, then wait"
Triple buffering means "do all the work to draw a frame, then start on drawing the next frame. If you finish frame n + 1 before presenting frame 1, then start on frame n + 2. If you finish frame n + 2 before presenting frame 1, then wait."

They correspond to frame_queue_size 1, 2, and 3 respectively.

Importantly, there is no command buffering going on. The commands are submitted and executed in an appropriate order. The difference is at what point you synchronize the CPU and the GPU. In other words, the difference is how many frames you buffer.

I'm not sure what you mean by command buffering. In Vulkan speak, command buffering is the process of recording commands before submitting them to a command queue.

swapchain_image_count comes into play too as a limiting factor. If you only have one swapchain image, then you are limited to single buffering; 2 limits you to double buffering, and so on.

Modern graphics APIs don't really have a built-in concept of buffering frames; it's something that exists totally on the application side now. In other words, we are the ones who control when to synchronize and when to buffer up frames.

Also, to contrast this with OpenGL: in the OpenGL backend, we don't allow single buffering either. It always uses the equivalent of frame_queue_size = 2. The fact that you are seeing better latency there is a very strong indicator that we can address the source of the latency elsewhere. Like Dario mentions above, likely using DXGI for presentation (or using DX12 directly) will be a much better solution.

I am already getting 240 FPS. I'm not sure what I'd gain by relaxing the GPU command sync. And it's not like I'm making anything real intensive. Just because it might cause problems for other projects, shouldn't mean I should be forbidden from using this in mine. There's a reason we have Project Settings after all.

This is exactly why darksylinc calls this setting a "shotgun to the foot". The problem we are explicitly working to avoid is a developer with an expensive PC turning the setting on because shrug "it works on my PC", then shipping it to users and not understanding why their game runs horribly on all other devices.

This way of thinking only makes sense when the only hardware you ship on is your own hardware.

Edit: I measured some times with my project. All times have V-Sync disabled, driver DXGI enabled, driver low-latency disabled, and swapchain_image_count=2.

All of the above discussion assumes that Vsync is enabled. If you disable vsync then there is no point in double buffering as you will get tearing anyway

@DarioSamo
Contributor

I don't have D3D12 support set up in Godot right now, so I'm looking into alternative tools (I'm also mostly interested in the Vulkan results currently). May see if nVidia FrameView or Renderdoc can get the information I need.

You're just gonna get added input latency out of Vulkan on Windows unless you present to DXGI directly (which is something I attempted to hack together but something about the synchronization primitives fails sadly). It is very apparent that the D3D12 driver doesn't suffer from this, and it's something I've experienced in other projects that aren't Godot as well. The driver option doesn't solve it, it just makes it not work as badly as regular Vulkan surfaces do on Windows when it comes to other capabilities.

@KeyboardDanni
Contributor Author

KeyboardDanni commented Jul 29, 2024

Importantly, there is no command buffering going on. The commands are submitted and executed in an appropriate order. The difference is at what point you synchronize the CPU and the GPU. In other words, the difference is how many frames you buffer.

Yes, this is what I meant by command buffering. Was looking for a term that describes "CPU communicating with the GPU" but now I realize there is some confusing overlap here.

Modern graphics APIs don't really have a built-in concept of buffering frames; it's something that exists totally on the application side now. In other words, we are the ones who control when to synchronize and when to buffer up frames.

Also, to contrast this with OpenGL: in the OpenGL backend, we don't allow single buffering either. It always uses the equivalent of frame_queue_size = 2. The fact that you are seeing better latency there is a very strong indicator that we can address the source of the latency elsewhere. Like Dario mentions above, likely using DXGI for presentation (or using DX12 directly) will be a much better solution.

My understanding is that frame queueing in OpenGL is highly dependent on the underlying driver implementation and when it decides to do CPU/GPU synchronization, not to mention the user's driver settings, which is one of the reasons that APIs like Vulkan exist now. So this comparison doesn't make a lot of sense.

I am already getting 240 FPS. I'm not sure what I'd gain by relaxing the GPU command sync. And it's not like I'm making anything real intensive. Just because it might cause problems for other projects, shouldn't mean I should be forbidden from using this in mine. There's a reason we have Project Settings after all.

This is exactly why darksylinc calls this setting a "shotgun to the foot". The problem we are explicitly working to avoid is a developer with an expensive PC turning the setting on because shrug "it works on my PC", then shipping it to users and not understanding why their game runs horribly on all other devices.

This way of thinking only makes sense when the only hardware you ship on is your own hardware.

I plan to have others test my game, but it's a 2D game rendering at 640x480 resolution and there's only about a hundred sprites (around 32x32 in size). I don't expect issues, because if players have an OpenGL 3-capable GPU, chances are the system will be fast enough to run the game well. If there's issues I'll provide an easy setting for players to use frame_queue_size=2, but I don't think it's likely.

If extra support is a concern, document the potential issues with this setting and refer to that. Then the onus is on the game developer, not Godot.

Edit: I measured some times with my project. All times have V-Sync disabled, driver DXGI enabled, driver low-latency disabled, and swapchain_image_count=2.

All of the above discussion assumes that Vsync is enabled. If you disable vsync then there is no point in double buffering as you will get tearing anyway

Those times were meant to show that the efficiency gains from frame_queue_size=2 are minimal for my project. I can test again using V-Sync and frametimes if you'd like.

You're just gonna get added input latency out of Vulkan on Windows unless you present to DXGI directly

I'm not sure why the focus is on DXGI here. I set frame_queue_size=1 and input lag was much better. The assumption appears to be that setting frame_queue_size=1 doesn't work or is dangerous. But in my case it works great and doesn't seem to be all that dangerous. The worst thing I've seen so far is just a slightly lower framerate. I can also test this on AMD, as well as Linux with exclusive fullscreen mode and disabled compositing so we have more datapoints. I'd like to encourage others to provide datapoints as well.

It is very apparent that the D3D12 driver doesn't suffer from this, and it's something I've experienced in other projects that aren't Godot as well.

Are you certain that this isn't due to driver-level anti-lag that supports D3D12 and not Vulkan?

@clayjohn
Member

I'm not sure why the focus is on DXGI here. I set frame_queue_size=1 and input lag was much better. The assumption appears to be that setting frame_queue_size=1 doesn't work or is dangerous. But in my case it works great and doesn't seem to be all that dangerous. The worst thing I've seen so far is just a slightly lower framerate. I can also test this on AMD, as well as Linux with exclusive fullscreen mode and disabled compositing so we have more datapoints. I'd like to encourage others to provide datapoints as well.

To be extremely clear, you are getting reduced latency as a side effect of doing something that can severely impact the playability of your game. The focus is on DXGI because the main source of latency is presenting a Vulkan swapchain which will always be slow on Windows. You can reduce most of the latency just by switching to DX12 or DXGI for swapchain presentation.

To put the reactions you have gotten here in perspective: using single buffering to decrease latency in a game engine is like removing the engine from a car to make it more lightweight. If being lightweight is your only goal, then sure, go ahead; maybe you only ever need to drive downhill, so you don't care about having an engine. The thing is, from the perspective of the engine it makes no sense: a car without an engine sucks for 99.99% of people. We make software for the 99.99% of people, not for the person who wants an engine-less car.

Ultimately, we have expressed that we don't want to give our users a foot gun. You have expressed that you don't care about shooting yourself in the foot and are very happy to do it because of your unique constraints. The ideal solution here is for you to ship your game with a custom build that allows single-buffering. That way we don't have to worry about the support burden and you can happily ship your game with reduced latency. You know best what trade-offs you are willing to make, so you can provide this option for yourself without us having to sacrifice the usability of the engine.

@KeyboardDanni
Contributor Author

There are several assumptions being made here that are still not being addressed.

To be extremely clear, you are getting reduced latency as a side effect of doing something that can severely impact the playability of your game.

Can you provide some numbers or examples on how this could impact playability any more than the existing tweakables Godot provides in Advanced mode? This is my main point of contention.

The focus is on DXGI because the main source of latency is presenting a Vulkan swapchain which will always be slow on Windows. You can reduce most of the latency just by switching to DX12 or DXGI for swapchain presentation.

With DXGI, it is using the hardware display planes. I can enable DXGI layering for Vulkan in the driver settings and if I disable V-Sync, I see tearing even in windowed mode. I don't think you can get any more direct-to-screen than that. Is there something else I'm missing here?

We also don't know exactly why D3D12 has less latency than Vulkan here. It could be due to DXGI implementation. It could be due to driver-level anti-lag. This warrants further investigation.

To put the reactions you have gotten here in perspective: using single buffering to decrease latency in a game engine is like removing the engine from a car to make it more lightweight. If being lightweight is your only goal, then sure, go ahead; maybe you only ever need to drive downhill, so you don't care about having an engine. The thing is, from the perspective of the engine it makes no sense: a car without an engine sucks for 99.99% of people. We make software for the 99.99% of people, not for the person who wants an engine-less car.

Presently, developers can do things like set the 3D shadow resolution to 16k. This doesn't work well for 99.99% of users, but Godot still allows it. In fact, shadow resolution above 2k can cause significant performance issues with integrated AMD graphics (the default is 4k). But that's why it's a tweakable. At some point you just need to trust that the game developer knows what they're doing. If they don't, it's the developer's fault. Trying to take responsibility for every bad thing any developer could do is not a healthy relationship to have. Yes, try to prevent developers from making mistakes. But if the warning is there, and it's not default, and the developer chooses to do it anyway, just let them do it.

Ultimately, we have expressed that we don't want to give our users a foot gun. You have expressed that you don't care about shooting yourself in the foot and are very happy to do it because of your unique constraints. The ideal solution here is for you to ship your game with a custom build that allows single-buffering. That way we don't have to worry about the support burden and you can happily ship your game with reduced latency. You know best what trade-offs you are willing to make, so you can provide this option for yourself without us having to sacrifice the usability of the engine.

Because I'm not the only one making games. I'm sure many others want to make fast-paced games where low latency is a priority. If they didn't, they wouldn't be filing bug reports about input lag in the Forward+ renderer.

@clayjohn
Member

There are several assumptions being made here that are still not being addressed.

Can you provide some numbers or examples on how this could impact playability any more than the existing tweakables Godot provides in Advanced mode? This is my main point of contention.

The first response to your PR pointed you to a simulator that illustrates the problem and allows you to see the impact of this setting in all scenarios. You can literally see it for yourself. https://darksylinc.github.io/vsync_simulator/.

It feels now like I've just been talking past you and none of what I have said has actually stuck. So I will just leave you with that.

@Calinou
Member

Calinou commented Jul 30, 2024

I strongly recommend getting a high speed camera and doing end-to-end measurements if you want to be able to see what really makes a difference, as tools like PresentMon are known to give incorrect latency readouts. You can point a phone at your screen with your mouse/keyboard visible and use its camera's (super) slow motion mode, which is typically 240 FPS or 480 FPS. This gives you 4.2 ms or 2.1 ms precision, which should suffice as long as you perform enough samples.

https://github.com/GPUOpen-Tools/frame_latency_meter was recently released, but looking at its issues page, there are reports of the latency readouts not being very accurate either. I'd say nothing beats a hardware solution here.

I think that sounds reasonable as a setting for power users in a shipped game, and in fact, some games do expose this kind of behavior as a toggle for people with strong CPUs. The only problem is it requires a game restart at the moment, although there's a chance you could make it tweakable at runtime with the right function as long as it flushes all existing work safely.

We can expose it as a CLI argument like we did for --audio-output-latency. This way, it can be adjusted by power users on any game made with Godot without needing work on the developer's end. This doesn't require implementing runtime adjustments to be functional either.

@RandomShaper
Member

RandomShaper commented Jul 30, 2024

On the jargon aspect, anecdotally, I also have that legacy of the very old times where double-buffering was about what we call the swapchain today, so it's indeed a good thing to get the concepts cleared up so everyone is talking about the same thing.

On the latency aspect, it's worth noting that the frame timing simulator, which is irrespective of DXGI/Vulkan, shows latency is roughly halved with single-buffering under conditions akin to the kind of project being talked about here.

However, being practical, it's already been established that single-buffering may be a good fit for some, so the discussion is now whether that should be exposed. Maybe we can find an approach that satisfies everyone.

Collecting some of the ideas already stated, would this work?:

  • Project settings don't allow values below 2, to avoid the foot-shooting situation (I think there are many other ways to mess up anyway, but we have to draw the line somewhere).
  • Changing the queue size to 1 is allowed programmatically, so games can either tweak it automatically or give control to users.
  • The renderer is enhanced to allow a change of the frame queue size without restarting (a rough sketch follows after this list).
  • Complementarily, Godot does what it can to detect issues, or at least prints a warning at runtime as a reminder.
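
As a purely hypothetical sketch of the third point (the names and structure are invented for illustration, not an existing Godot API), the key requirement is flushing all in-flight GPU work before the per-frame resources are recreated:

    #include <cstdint>
    #include <vector>

    struct PerFrame { /* command pools, fences, staging buffers, ... */ };

    struct FrameQueue {
        std::vector<PerFrame> frames;

        void wait_gpu_idle() { /* e.g. vkDeviceWaitIdle() on Vulkan */ }

        // Hypothetical runtime setter; must not be called while recording a frame.
        void set_frame_queue_size(uint32_t p_size) {
            wait_gpu_idle();        // nothing may still reference the old per-frame data
            frames.clear();         // destroy the old per-frame resources
            frames.resize(p_size);  // recreate them with the new count
            // swapchain_image_count can stay as-is; only CPU-side buffering changes.
        }
    };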

@Calinou
Member

Calinou commented Jul 30, 2024

Collecting some of the ideas already stated, would this work?:

Sounds good to me 🙂

@alvinhochun
Contributor

Okay, I tried getting some better traces. With Intel's Graphics Trace Analyzer it shows me where exactly the wait is happening.

frame_queue_size=2:

[trace screenshot]

frame_queue_size=1:

[trace screenshot]

Zoomed in:

[trace screenshot]

[trace screenshot]

Honestly I think these traces are pointless anyway, because all the wait looks to be caused by the renderer trying to render too many frames, but also my test scene is too simplistic to really show anything.

@zooxy

zooxy commented Jul 30, 2024

I'm surprised this change is so controversial. Unity supports this since at least 2015: QualitySettings.maxQueuedFrames

@darksylinc
Contributor

I'm surprised this change is so controversial. Unity supports this since at least 2015: QualitySettings.maxQueuedFrames

A value of frame_queue_size = 1 is equivalent to Unity's maxQueuedFrames = 0, which is not supported by Unity.

@darksylinc
Contributor

darksylinc commented Jul 30, 2024

Okay, I tried getting some better traces. With Intel's Graphics Trace Analyzer it shows me where exactly the wait is happening.

Your captures show exactly the problem I pointed out: Godot polls input immediately after VBLANK, starts rendering, then sleeps for 14ms.

Ideally it should be doing the opposite: sleep for 14ms, poll input, start rendering.

@KeyboardDanni
Contributor Author

Ideally it should be doing the opposite: sleep for 14ms, poll input, start rendering.

There is a big problem with this idea: We do not know how long the next frame will take to render.

Let's say that last frame we take 4ms to render. V-blank happens every 16ms. So we wait for 16 - 4 - 1 = 11ms (using 1ms as a cushion). Next we poll input and render, which takes 4ms. We spent 15ms, so we make it in time for V-blank. We saved 11ms in latency. Great!

But what if the player pans the camera around and we suddenly have to render more? We continue with the assumption that it will take 4ms to render. We wait 11ms again. We poll input and render, but instead it takes 6ms. Now we've spent 17ms and missed the V-blank window, resulting in stutter.

So now we realize that 1ms of cushion isn't enough time in this scenario. Now we introduce a tweakable in the Project Settings to allow changing this value. Suddenly we have to tweak this value per-project, per-machine, in order to find a good value that won't cause stutter while still saving enough input lag to be worth using. There will be no "one size fits all" solution. I can foresee this being a far bigger support burden than just frame_queue_size=1. It's especially problematic given Godot's history of stutter issues that have resulted in multiple tweakables attempting to work around the problem (one of these, V-Sync via Compositor, had to be removed because it caused serious issues with random frame-halving at times).

We also need to consider that the sleep function can be problematic, especially on Windows, where we're at the mercy of the current timer resolution. Performing a sleep may take as long as 15.6ms, and in my experience I haven't been able to find a way to get the sleep resolution lower than about 2ms. This means there is a very real risk of over-sleeping, resulting in stutter (and we aren't even taking pre-emption by the task scheduler into account). There's an article that discusses the issue here, including changes that were made in Windows 10 that affect timer stability and the ability to set it to a lower value. Mitigations for timing attacks in browsers and the OS may also be a problem.
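
As an aside, the usual workaround for the timer-resolution problem (a sketch, not Godot code) is to request a finer resolution and then combine a coarse sleep with a short spin-wait, trading some CPU time to avoid oversleeping:

    #include <windows.h>
    #include <cstdint>
    #pragma comment(lib, "winmm.lib") // for timeBeginPeriod/timeEndPeriod

    // Wait until a target QueryPerformanceCounter timestamp: sleep while far
    // away, then spin for the final stretch so we don't overshoot the vblank.
    static void precise_wait_until(int64_t target_qpc, int64_t qpc_freq) {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        while (target_qpc - now.QuadPart > (qpc_freq * 2) / 1000) { // more than ~2ms away
            Sleep(1); // actual granularity depends on timeBeginPeriod and the OS version
            QueryPerformanceCounter(&now);
        }
        while (now.QuadPart < target_qpc) {
            YieldProcessor(); // busy-wait; trades CPU time for precision
            QueryPerformanceCounter(&now);
        }
    }

    int main() {
        timeBeginPeriod(1); // request 1ms resolution; may be ignored or virtualized on newer Windows
        LARGE_INTEGER freq, start;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        precise_wait_until(start.QuadPart + freq.QuadPart / 100, freq.QuadPart); // wait ~10ms
        timeEndPeriod(1);
        return 0;
    }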

I'm not quite sure what the graphics driver is doing, but it seems to do a better job of waiting for V-blank than the sleep functions typically available to userland programs. It could be that in kernel space the driver has access to more low-level scheduling and interrupts that allow it to do this. But generally I trust this wait more than I would a sleep trying to predict when rendering should start, and hoping that the OS will wake the program in time.

Having said all that, the situation is probably much less dire with frame_queue_size=2 since there's an extra frame for cushion. But the savings also won't be as great as with frame_queue_size=1 since we only save close to a whole frame when the game is doing almost no work.

And regardless of frame queue size, we'd still be moving around the timing of the input reads, which some players might not like. It'd be an issue for fighting games, shooters, and rhythm games (especially rhythm games, where a static value for latency compensation is established during game onboarding).

I'm not even sure if the OS or windowing system can guarantee that it will deliver the input events to the window when we expect to read them, so the delayed input read might not even fix latency at all depending on the system.

It's an interesting idea, but I feel like there's too much that could go wrong here.

Anyway, I decided to do some more research on the topic of swapchains, since frame_queue_size=1 can cause problems if both the CPU and GPU are doing too much work. In particular, the Vulkan presentation mode MAILBOX_KHR grabbed my attention. It's like standard FIFO V-Sync, except it always tries to use the latest finished frame when presenting. It is available today in Godot. I tried it on my desktop system with nVidia with frame_queue_size=2 and swapchain_image_count=3 and found that the input lag was very good even at 60hz. The best part is that we can still buffer frames in advance, it's just that the driver will select the best one to display.
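
For anyone wanting to experiment outside Godot, selecting this mode in raw Vulkan is just a matter of preferring it when the surface reports it and falling back to FIFO otherwise (a sketch only; inside Godot it is exposed through the V-Sync mode setting mentioned above):

    #include <vulkan/vulkan.h>
    #include <vector>

    // Prefer MAILBOX (latest-ready frame wins, no tearing) and fall back to
    // FIFO, which the Vulkan spec guarantees is always available.
    VkPresentModeKHR pick_present_mode(VkPhysicalDevice gpu, VkSurfaceKHR surface) {
        uint32_t count = 0;
        vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, nullptr);
        std::vector<VkPresentModeKHR> modes(count);
        vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, modes.data());

        for (VkPresentModeKHR mode : modes) {
            if (mode == VK_PRESENT_MODE_MAILBOX_KHR) {
                return mode;
            }
        }
        return VK_PRESENT_MODE_FIFO_KHR;
    }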

Unfortunately, support is not ubiquitous. nVidia and Intel support this presentation mode, but AMD does not. Android seems to have support for it across all drivers, but it's been reported to cause battery drain due to rendering unused frames. In the macOS/iOS world, mailbox apparently doesn't exist at all.

Things also get weird on hybrid graphics laptops. I tried this setting on AMD integrated + nVidia discrete. When running on the nVidia graphics, it acted the same as V-Sync off (probably because the AMD chip was driving the display). When running on the AMD graphics, mailbox wasn't used at all and input lag was dreadful. So a different technique is needed to bridge the gap on these systems.

It seems that right now there are several ways to attack input lag, but none that are "one size fits all" unless the game is simple enough that framerate drops won't be an issue:

  • MAILBOX_KHR presentation mode on nVidia and Intel graphics on Windows and Linux (dedicated only, not hybrid)
  • VK_AMD_anti_lag extension on AMD graphics when support lands in drivers
  • For all other configurations, frame_queue_size=1 if graphics aren't intensive

@KeyboardDanni
Contributor Author

KeyboardDanni commented Jul 30, 2024

I strongly recommend getting a high speed camera and doing end-to-end measurements if you want to be able to see what really makes a difference, as tools like PresentMon are known to give incorrect latency readouts.

A silly thought that might not require specialized hardware: we could make a test program that moves the hardware cursor (since that's on its own hardware plane) while also displaying a circle where we told the cursor to move, and simply grab a fast-enough camera and take a picture.

We can expose it as a CLI argument like we did for --audio-output-latency. This way, it can be adjusted by power users on any game made with Godot without needing work on the developer's end. This doesn't require implementing runtime adjustments to be functional either.

This could be a useful escape hatch in the event that a developer hardcodes a frame queue size of 1 but players need to set it to 2 and the developer has stopped updating their game.

Edit: If we allow games to set the frame queue size at runtime, the escape hatch would need to block this and enforce the value specified on the commandline for this to be effective.

@alvinhochun
Contributor

Give #94960 a try.

@KeyboardDanni
Contributor Author

I just made a latency tester app in Godot. It should help us measure input lag in frames instead of having to rely on software tools or "eyeballing it": https://github.com/KeyboardDanni/godot-latency-tester

Summary of findings:

  • Without DXGI, in windowed mode Vulkan latency is always 2 frames, though sometimes it's jittery so it's actually between 1-3 frames. With Layered DXGI, latency is higher but framepacing is consistent.
  • With Layered DXGI, there is an extra frame of latency when using traditional V-Sync. In all other situations, Layered DXGI is superior to Native Presentation either in latency or framepacing.
  • When configured correctly, mailbox can achieve just one frame of latency, while still providing V-Sync. "Configured correctly" means using DXGI and setting swapchain_image_count = 3.
  • If mailbox is not available, frame_queue_size = 1 is the only way to achieve 3 frames of latency with DXGI and V-Sync (without patches).

To compare with OpenGL, latency is 2 frames with V-Sync enabled on Layered DXGI, and 3 frames with Native Presentation. Which is weird, because Vulkan seems to have the opposite effect. Maybe nVidia made a mistake when implementing Layered DXGI in Vulkan?

This is without any patches besides allowing frame_queue_size = 1 (so no waitable swapchain, etc).

@KeyboardDanni
Contributor Author

Wrote up a new proposal that tries to address some of the concerns raised previously: godotengine/godot-proposals#11200
