
WSL2 gets laggy over time - suspend to disk (hibernate) restores normal operation #11885

Open

cvjjm opened this issue Aug 6, 2024 · 23 comments

cvjjm commented Aug 6, 2024

Windows Version

Microsoft Windows [Version 10.0.19045.4651]

WSL Version

2.2.4.0

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.153.1-microsoft-standard-WSL2

Distro Version

Ubuntu 24.04

Other Software

Docker Desktop 4.33.1 (161083)

Repro Steps

Start the computer, start using WSL; over a time scale of 30 min to 1 h, WSL2 gets increasingly unresponsive and laggy. This causes noticeable input delay when typing in any console connected to WSL2 (regardless of whether it is the Windows Console, a system PowerShell session directly into WSL2, or, e.g., the console built into Docker Desktop) and equally affects, e.g., graphical applications running inside a container on Docker Desktop, displayed through VcXsrv.

More objectively, the problem manifests itself in erratic ping times between the host and WSL. A typical ping after an hour of usage looks like this:

64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=2 ttl=63 time=0.914 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=3 ttl=63 time=1.14 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=4 ttl=63 time=1.58 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=5 ttl=63 time=1.08 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=6 ttl=63 time=1.25 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=7 ttl=63 time=4.31 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=8 ttl=63 time=0.817 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=9 ttl=63 time=0.800 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=10 ttl=63 time=1.01 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=11 ttl=63 time=20.7 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=12 ttl=63 time=0.572 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=13 ttl=63 time=1.02 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=14 ttl=63 time=1.15 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=15 ttl=63 time=0.794 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=16 ttl=63 time=1.49 ms
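(For reference, a minimal sketch of how such a measurement can be reproduced; the target below is an assumption taken from the output above and looks like the Docker Desktop host-gateway address, so substitute your own host/gateway IP if needed:)

# run from inside the WSL2 distro (or a Docker Desktop container)
ping -c 20 192.168.65.254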

Interestingly, the problem can be temporarily resolved by suspending the system to disk (hibernation), but not by a suspend to RAM. After a hibernate/wake cycle the ping looks like this, and then slowly degrades again along with the input lag:

64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=1 ttl=63 time=1.52 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=2 ttl=63 time=0.633 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=3 ttl=63 time=0.836 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=4 ttl=63 time=0.789 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=5 ttl=63 time=0.695 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=6 ttl=63 time=0.715 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=7 ttl=63 time=0.813 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=8 ttl=63 time=0.731 ms
64 bytes from 192.168.65.254 (192.168.65.254): icmp_seq=9 ttl=63 time=0.894 ms

Restarting Docker / Docker Desktop or the vEthernet adapters (Restart-NetAdapter -Name "vEthernet (WSL)") does not have the same positive effect. Completely shutting down and restarting WSL2 does temporarily solve the problem, just like a hibernation cycle.
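(For completeness, the full shutdown/restart cycle that temporarily clears the problem, run from a Windows prompt; the distro name below is an assumption:)

wsl --shutdown
# wait a few seconds for the utility VM to terminate, then start the distro again
wsl -d Ubuntu-24.04   # check the registered name with: wsl --list --verbose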

The speed and magnitude of the performance problem do not seem to be related to how intensely I use WSL2 (just idling vs. running multi-core workloads inside a container), and the problem occurs irrespective of whether the laptop is connected to a wired or wireless network and/or a VPN.

No suspiciously high resource usage of the Vmmem process (or any other process) is visible in Task Manager.

I have tried disabling NetBIOS and the "Large Send Offload" feature as suggested in other issues/blog posts, but nothing has made any difference.

I have experimented with the following settings in wsl.conf, but to no avail:

[boot]
systemd=true
[interop]
enabled=false
appendWindowsPath=false
[wsl2]
gpuSupport=false
[automount]
enabled=false

The lag also manifests itself in some Docker commands taking a very long time, e.g. Measure-Command {docker ps} takes between 1 and 5 seconds vs. less than 200 ms when everything is working. The problem does not exclusively affect Docker (Desktop), though, but all of WSL2.
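(A minimal bash sketch of quantifying the same latency from inside WSL, assuming the docker CLI is available in the distro:)

# time five consecutive "docker ps" calls in milliseconds
for i in {1..5}; do
  start=$(date +%s%N)
  docker ps > /dev/null
  echo "docker ps took $(( ($(date +%s%N) - start) / 1000000 )) ms"
done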

This is on a T14s with an AMD Ryzen 7 Pro processor.

Please let me know which other information/logs could be useful to diagnose the problem.

Expected Behavior

The performance and responsiveness of WSL should not degrade over time.

Actual Behavior

The performance and responsiveness of WSL do degrade over time.

Diagnostic Logs

No response


github-actions bot commented Aug 6, 2024

Logs are required for review from WSL team

If this is a feature request, please reply with '/feature'. If this is a question, reply with '/question'.
Otherwise, please attach logs by following the instructions below; your issue will not be reviewed unless they are added. These logs will help us understand what is going on on your machine.

How to collect WSL logs

Download and execute collect-wsl-logs.ps1 in an administrative PowerShell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The script will output the path of the log file once done.

If this is a networking issue, please use collect-networking-logs.ps1, following the instructions here

Once completed please upload the output files to this Github issue.

If you choose to email these logs instead of attaching to the bug, please send them to [email protected] with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Open similar issues:

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

cvjjm (Author) commented Aug 6, 2024


github-actions bot commented Aug 6, 2024

The log file doesn't contain any WSL traces. Please make sure that you reproduced the issue while the log collection was running.

Diagnostic information
Appx package is not installed
Found no WSL traces in the logs

cvjjm (Author) commented Aug 6, 2024

The issue cannot be "reproduced" and then captured in a log. WSL2 just slows down over time...

OneBlue (Collaborator) commented Aug 12, 2024

@cvjjm: When WSL gets laggy, can you run:

wsl --debug-shell

And inside that shell, run top and share a screenshot?

cvjjm (Author) commented Aug 14, 2024

When I try to run wsl --debug-shell in a normal PowerShell console, I get the error:

PS C:\Users\cvjjm> wsl --debug-shell
Running the debug shell requires running wsl.exe as Administrator.

When I try to run wsl --debug-shell in a PowerShell console started as administrator, I get the error:

PS C:\WINDOWS\system32> wsl --debug-shell
The system cannot find the file specified.
Error code: Wsl/DebugShell/ERROR_FILE_NOT_FOUND

top in a normal WSL shell looks like this when the system is laggy:

top - 10:49:20 up 10:09,  0 user,  load average: 0.26, 0.24, 0.17
Tasks:  14 total,   1 running,  13 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15348.3 total,  13218.9 free,   1422.9 used,   1035.3 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13925.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    1 root      20   0    2476   1716   1600 S   0.0   0.0   0:00.01 init(Ubuntu-24.
    4 root      20   0    2476      4      0 S   0.0   0.0   0:00.00 init
   10 root      20   0    2484    916    792 S   0.0   0.0   0:00.00 SessionLeader
   11 root      20   0    2496    120      0 S   0.0   0.0   0:00.00 Relay(12)
   12 cvjjm     20   0    2800   1076    980 S   0.0   0.0   0:00.00 sh
   81 root      20   0    2500    124      0 S   0.0   0.0   0:00.00 Relay(82)
   82 root      20   0 1247728  27544  15260 S   0.0   0.2   0:01.18 docker-desktop-
  212 root      20   0    2480    116      0 S   0.0   0.0   0:00.00 SessionLeader
  213 root      20   0    2496    120      0 S   0.0   0.0   0:00.02 Relay(214)
  214 cvjjm     20   0    6072   5308   3592 S   0.0   0.0   0:00.03 bash
  266 root      20   0    2480    116      0 S   0.0   0.0   0:00.00 SessionLeader
  267 root      20   0    2496    120      0 S   0.0   0.0   0:00.00 Relay(268)
  268 cvjjm     20   0    6072   5300   3584 S   0.0   0.0   0:00.01 bash
  316 cvjjm     20   0    9304   5224   3072 R   0.0   0.0   0:00.00 top

Because of the relation to hibernation, I tried to disable the virtual memory / page file in Win10, but this did not change anything.

OneBlue (Collaborator) commented Aug 14, 2024

@cvjjm: The debug shell only works when WSL is already running; was WSL running when you got that error?
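(For reference, whether a distro is currently running can be checked from a Windows prompt with:)

wsl --list --running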

cvjjm (Author) commented Aug 15, 2024

Yes, it was running in laggy state and I was actively using it.
[screenshot attached]

OneBlue (Collaborator) commented Aug 15, 2024

@cvjjm: Hmm, that's weird. Can you share the content of HKEY_LOCAL_MACHINE\SOFTWARE\Policies\WSL?

cvjjm (Author) commented Aug 16, 2024

I have no WSL folder under HKEY_LOCAL_MACHINE\SOFTWARE\Policies:
[screenshot attached]
This is a Win10 Laptop.

I see the same console output when WSL is operating normally without lag after a fresh boot or a hibernate/resume cycle.

OneBlue (Collaborator) commented Aug 16, 2024

Ok, thank you @cvjjm. Let's try something else: can you share the output of top in system --system -u root, and the output of dmesg once the machine is in that "laggy" state?

cvjjm (Author) commented Aug 19, 2024

I suppose you mean wsl --system -u root? This produces the following output:
[screenshot attached]
And dmesg yields:

[...]
[55314.066594] tee (251677): drop_caches: 3
[55316.816847] systemd-journald[60]: Time jumped backwards, rotating.
[55345.596413] systemd-journald[60]: Time jumped backwards, rotating.
[55345.943525] WSL (1) WARNING: /usr/share/zoneinfo/Europe/Berlin not found. Is the tzdata package installed?
[55345.948086] WSL (1) WARNING: /usr/share/zoneinfo/Europe/Berlin not found. Is the tzdata package installed?
[55358.027401] WSL (1) WARNING: /usr/share/zoneinfo/Europe/Berlin not found. Is the tzdata package installed?
[55358.027479] WSL (1) WARNING: /usr/share/zoneinfo/Europe/Berlin not found. Is the tzdata package installed?
[55358.090564] hv_utils: TimeSync IC version 4.0
[55370.457322] tee (252062): drop_caches: 3
[55381.516458] tee (252120): drop_caches: 3
[55420.502954] mini_init (200): drop_caches: 1
[55443.143455] tee (252406): drop_caches: 3
[55460.876140] systemd-journald[60]: Time jumped backwards, rotating.
[55501.045830] tee (252672): drop_caches: 3
[55561.009411] tee (252926): drop_caches: 3
[55620.952977] tee (253184): drop_caches: 3
[55637.413249] systemd-journald[60]: Time jumped backwards, rotating.
[55666.795488] systemd-journald[60]: Time jumped backwards, rotating.
[55682.845999] tee (253458): drop_caches: 3
[55725.870154] systemd-journald[60]: Time jumped backwards, rotating.
[55742.944554] tee (253721): drop_caches: 3
[55755.271848] systemd-journald[60]: Time jumped backwards, rotating.
[55801.820590] tee (253978): drop_caches: 3
[55862.223843] tee (254235): drop_caches: 3
[55903.190321] systemd-journald[60]: Time jumped backwards, rotating.
[55922.462530] tee (254494): drop_caches: 3

This is with the wsl.conf as given above. The drop_caches lines are from a "* * * * * sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" entry in crontab.

OneBlue (Collaborator) commented Aug 19, 2024

Thank you @cvjjm. I wonder if that drop cache script has an impact. Can you try to remove it and see if that changes the behavior?

Also FYI, WSL will essentially do that automatically since 2.1.3 (see AutoMemoryReclaim in https://learn.microsoft.com/en-us/windows/wsl/wsl-config).
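(For reference, a sketch of the corresponding setting; per the linked docs it goes into %UserProfile%\.wslconfig on the Windows side rather than into the distro's wsl.conf:)

[experimental]
autoMemoryReclaim=gradual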

cvjjm (Author) commented Aug 22, 2024

I have removed all crontab entries and instead put AutoMemoryReclaim="gradual" in wsl.conf, and the behavior stays the same. I will try next without the crontab entry and without auto memory reclaim.

One additional piece of information that I only noticed now: it seems that after a fresh restart WSL runs smoothly indefinitely (or at least for a full day). Only after the first sleep/hibernate/suspend-to-disk/RAM cycle does the lagginess start to increase.

cvjjm (Author) commented Aug 26, 2024

The behavior also stays the same with AutoMemoryReclaim disabled and no cron job running.

cvjjm (Author) commented Aug 27, 2024

I stumbled across carlfriedrich/wsl-kernel-build#1, which is clearly different from mine but also hibernate-related, so I have decided to try the custom kernel 5.15.153.1-microsoft-standard-WSL2 provided there and will report back.

cvjjm (Author) commented Aug 28, 2024

Unfortunately I cannot use Docker Desktop with the custom WSL kernel because of the problem described in docker/for-win#14282. It is thus hard for me to test whether the issue goes away with the patched custom kernel, because I would need to stop working for a day or so...

Baryczka commented:

Supplementary question to @cvjjm: so in your case, when WSL becomes slower and slower, does it never hang as a result? The first symptoms I observed in my case are, for example, that ctrl+r search using fzf is much slower and the general responsiveness of the system drops as well, to the point where I can't run any command on the WSL instance. The only thing that helps is wsl --shutdown or killing all WSL-related processes. Next time you notice the instance is slower, try running watch -n1 "cat /proc/interrupts" to observe how fast the HVS interrupts increase; it should be 1-3 per second, not more, unless the VM is heavily stressed.
I would place my bet on the related Hyper-V commit issue.
PS: I'm using W10, T14 Gen 1, Ryzen 7 Pro 4750U.
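(A rough bash sketch of turning the suggestion above into a per-second rate; the HVS line format is assumed to match the /proc/interrupts output shown in the comments below:)

# sum the per-CPU "Hyper-V stimer0" (HVS) interrupt counts and print the
# increase once per second; compare against the ~1-3 per second figure above
hvs_total() {
  grep HVS /proc/interrupts | awk '{s=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) s+=$i; print s}'
}
prev=$(hvs_total)
while sleep 1; do
  cur=$(hvs_total)
  echo "$(date +%T) HVS interrupts/s (all CPUs): $((cur - prev))"
  prev=$cur
done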

cvjjm (Author) commented Aug 29, 2024

Yes, it becomes slower and slower and after a few hours unbearably slow, but it never hangs completely.

We seem to have very similar hardware! I have a T14 with Ryzen 7 Pro 7840U.

I don't think I fully understand the thing with the interrupts, but when WSL is in the responsive state and essentially idling, I get the following output:

$ for i in {1..10}; do (sleep 1 && date && cat /proc/interrupts | grep HVS); done
Thu Aug 29 17:16:17 CEST 2024
HVS:     266009     148092     305913     149931     274330     109782     268455     132794     297263     131491     289115     134723     289611     125352     290011     166953   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:18 CEST 2024
HVS:     266022     148159     305928     149946     274350     109782     268470     132804     297300     131495     289155     134723     289702     125354     290018     166964   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:19 CEST 2024
HVS:     266025     148170     305933     149953     274363     109784     268477     132808     297304     131499     289182     134723     289830     125363     290030     166964   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:20 CEST 2024
HVS:     266028     148200     305947     149970     274391     109784     268486     132827     297341     131511     289282     134723     289961     125366     290040     166974   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:21 CEST 2024
HVS:     266031     148213     305953     149977     274402     109791     268499     132827     297349     131511     289306     134734     290088     125368     290052     166987   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:22 CEST 2024
HVS:     266039     148219     305965     149991     274468     109800     268510     132841     297377     131513     289357     134749     290347     125371     290060     167000   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:23 CEST 2024
HVS:     266076     148246     306052     150007     274545     109830     268545     132869     297435     131530     289481     134778     290526     125401     290276     167013   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:24 CEST 2024
HVS:     266354     148292     306192     150337     274725     109861     268606     132890     297576     131568     289590     134849     290767     125496     290467     167170   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:25 CEST 2024
HVS:     266395     148311     306208     150518     274753     109868     268624     132903     297594     131573     289617     134938     290849     125506     290488     167170   Hyper-V stimer0 interrupts
Thu Aug 29 17:16:26 CEST 2024
HVS:     266420     148324     306212     150522     274755     109874     268630     132903     297610     131575     289654     134942     291095     125511     290519     167179   Hyper-V stimer0 interrupts

I.e., 411 HVS interrupts in 10s on CPU0. I will report the number for the laggy state later...

cvjjm (Author) commented Aug 30, 2024

This is the output when WSL is in the laggy state:

$ for i in {1..10}; do (sleep 1 && date && cat /proc/interrupts | grep HVS); done
Fri Aug 30 09:43:59 CEST 2024
HVS:     520095     271465     565831     279236     523380     219622     513645     246503     541096     223164     519293     251069     532632     224885     542878     290457   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:01 CEST 2024
HVS:     520098     271465     565833     279238     523383     219624     513647     246505     541098     223166     519295     251069     532635     224885     542880     290459   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:02 CEST 2024
HVS:     520101     271467     565836     279241     523386     219628     513651     246507     541102     223170     519299     251069     532638     224887     542884     290463   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:03 CEST 2024
HVS:     520105     271467     565839     279245     523389     219630     513654     246511     541105     223171     519302     251069     532641     224888     542887     290466   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:04 CEST 2024
HVS:     520107     271469     565842     279247     523392     219633     513657     246513     541108     223173     519305     251071     532644     224890     542890     290469   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:05 CEST 2024
HVS:     520110     271472     565845     279250     523395     219636     513660     246516     541111     223175     519307     251074     532647     224893     542892     290471   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:06 CEST 2024
HVS:     520113     271475     565847     279253     523398     219636     513662     246518     541113     223177     519310     251076     532649     224896     542895     290474   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:07 CEST 2024
HVS:     520116     271477     565850     279256     523401     219636     513665     246521     541117     223178     519313     251079     532652     224899     542898     290477   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:08 CEST 2024
HVS:     520119     271480     565853     279259     523404     219638     513668     246524     541120     223181     519316     251080     532656     224902     542901     290480   Hyper-V stimer0 interrupts
Fri Aug 30 09:44:10 CEST 2024
HVS:     520122     271480     565856     279262     523407     219640     513671     246525     541122     223184     519319     251080     532658     224905     542904     290483   Hyper-V stimer0 interrupts

About 28 HVS interrupts in 10 s on CPU0.

@Baryczka, is this in line with your expectations? Fewer interrupts in the laggy state and more during the responsive state?

WanchaiSae commented:

I think you should try an older distro version (20 or 22); maybe that fixes this problem.

ericblade commented:

I would add that I also experience similar symptoms on my Mac Pro 7,1 running Win 11, with very heavy use of Docker. Originally the system was configured with a 24-core CPU and 48 GB of RAM, and I had 10 cores and 32 GB of RAM set for WSL. The Docker containers, or the operating system builds that I do in WSL, pretty much continuously have the CPU and RAM at maximum, and are constantly accessing the disk and network. Other than a significant lack of network speed with Docker (which seems capped at 100 Mbps, despite having 2.5G local Ethernet and a 1G link to the outside world), WSL is pretty much constantly consuming all resources available to it.

Typically after a day or two of this, the VM would grind to a halt. A wsl --shutdown would restore normal operation. I have not tried a hibernate, but now it's on my list of possible tools.

Upgrading the system memory to 112 GB and giving WSL access to 20 cores and 80 GB has significantly reduced the problem. I still notice it slowing down some after a couple of days, but in the week since that upgrade it has not ground to a halt or slowed to unacceptable speeds; when I do notice it slowing down, I restart WSL.

I'll drop back in in a few days and report whether hibernate brings it back, and maybe do some ping testing like the above.

cvjjm (Author) commented Nov 7, 2024

This issue still persists.
