WSL2 + JLab Network = phantom network timeouts #4

Closed
slominskir opened this issue Nov 20, 2023 · 14 comments

@slominskir
Member

Got a new Windows 11 PC with latest everything and the Wildfly bash setup scripts no longer work properly in WSL2. Specifically, network requests periodically time out.

Related:

This was working before on Windows 10 Enterprise (JLab) and Windows 11 Home (Personal), but likely with older Ubuntu distros and possibly older versions of WSL2 or at least possibly different install methods (app store version differs apparently?).

Fully patched Windows 11 Enterprise, version 22H2 build 22621.2506, installed on 11/14/2023.

Fully patched WSL2:

PS C:\Users\ryans> wsl.exe --status
Default Distribution: Ubuntu-22.04
Default Version: 2
PS C:\Users\ryans> wsl.exe --version
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2506

I ended up just letting the server-setup.sh script run all afternoon and it eventually did complete. The output just contains a bunch of seemingly random timeouts followed by retries (took an hour or two when it should have run in less than a minute):

ryans@SFTRYANS:/mnt/c/users/ryans/servers/setup$ ./server-setup.sh server.env config_provided
Loading environment server.env
------------------------
config_provided
------------------------
Using env file: server.env
Loading environment server.env
------------------------
add_modules
------------------------
local|org.apache.poi|https://repo1.maven.org/maven2/org/apache/poi/poi/5.2.3/poi-5.2.3.jar,https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml/5.2.3/poi-ooxml-5.2.3.jar,https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml-lite/5.2.3/poi-ooxml-lite-5.2.3.jar,https://repo1.maven.org/maven2/org/apache/xmlbeans/xmlbeans/5.1.1/xmlbeans-5.1.1.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.6.1/commons-math3-3.6.1.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.22/commons-compress-1.22.jar,https://repo1.maven.org/maven2/com/zaxxer/SparseBitSet/1.2/SparseBitSet-1.2.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-collections4/4.4/commons-collections4-4.4.jar|javaee.api,org.jboss.as.web,org.apache.commons.io,org.apache.commons.codec,org.apache.logging.log4j.api
SCOPE: local
DEP_NAME: org.apache.poi
RESOURCES_CSV: https://repo1.maven.org/maven2/org/apache/poi/poi/5.2.3/poi-5.2.3.jar,https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml/5.2.3/poi-ooxml-5.2.3.jar,https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml-lite/5.2.3/poi-ooxml-lite-5.2.3.jar,https://repo1.maven.org/maven2/org/apache/xmlbeans/xmlbeans/5.1.1/xmlbeans-5.1.1.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.6.1/commons-math3-3.6.1.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.22/commons-compress-1.22.jar,https://repo1.maven.org/maven2/com/zaxxer/SparseBitSet/1.2/SparseBitSet-1.2.jar,https://repo1.maven.org/maven2/org/apache/commons/commons-collections4/4.4/commons-collections4-4.4.jar
DEPENDENCIES_CSV: javaee.api,org.jboss.as.web,org.apache.commons.io,org.apache.commons.codec,org.apache.logging.log4j.api
add_module
> [https://repo1.maven.org/maven2/org/apache/poi/poi/5.2.3/poi-5.2.3.jar]
2023-11-20 13:22:42 URL:https://repo1.maven.org/maven2/org/apache/poi/poi/5.2.3/poi-5.2.3.jar [2964641/2964641] -> "poi-5.2.3.jar.3" [1]
done with wget
> [https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml/5.2.3/poi-ooxml-5.2.3.jar]
2023-11-20 13:22:45 URL:https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml/5.2.3/poi-ooxml-5.2.3.jar [2010497/2010497] -> "poi-ooxml-5.2.3.jar.1" [1]
done with wget
> [https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml-lite/5.2.3/poi-ooxml-lite-5.2.3.jar]
2023-11-20 13:22:56 URL:https://repo1.maven.org/maven2/org/apache/poi/poi-ooxml-lite/5.2.3/poi-ooxml-lite-5.2.3.jar [5898622/5898622] -> "poi-ooxml-lite-5.2.3.jar.1" [1]
done with wget
> [https://repo1.maven.org/maven2/org/apache/xmlbeans/xmlbeans/5.1.1/xmlbeans-5.1.1.jar]
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
2023-11-20 13:42:58 URL:https://repo1.maven.org/maven2/org/apache/xmlbeans/xmlbeans/5.1.1/xmlbeans-5.1.1.jar [2196526/2196526] -> "xmlbeans-5.1.1.jar" [1]
done with wget
> [https://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.6.1/commons-math3-3.6.1.jar]
2023-11-20 13:43:03 URL:https://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.6.1/commons-math3-3.6.1.jar [2213560/2213560] -> "commons-math3-3.6.1.jar" [1]
done with wget
> [https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.22/commons-compress-1.22.jar]
2023-11-20 13:43:04 URL:https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.22/commons-compress-1.22.jar [1039712/1039712] -> "commons-compress-1.22.jar" [1]
done with wget
> [https://repo1.maven.org/maven2/com/zaxxer/SparseBitSet/1.2/SparseBitSet-1.2.jar]
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
2023-11-20 14:03:01 URL:https://repo1.maven.org/maven2/com/zaxxer/SparseBitSet/1.2/SparseBitSet-1.2.jar [24510/24510] -> "SparseBitSet-1.2.jar" [1]
done with wget
> [https://repo1.maven.org/maven2/org/apache/commons/commons-collections4/4.4/commons-collections4-4.4.jar]
2023-11-20 14:03:02 URL:https://repo1.maven.org/maven2/org/apache/commons/commons-collections4/4.4/commons-collections4-4.4.jar [751914/751914] -> "commons-collections4-4.4.jar" [1]
done with wget
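[standalone@localhost:9990 /] module add --name=org.apache.poi --resource-delimiter=, --resources=/tmp/poi-5.2.3.jar,/tmp/poi-ooxml-5.2.3.jar,/tmp/poi-ooxml-lite-5.2.3.jar,/tmp/xmlbeans-5.1.1.jar,/tmp/commons-math3-3.6.1.jar,/tmp/commons-compress-1.22.jar,/tmp/SparseBitSet-1.2.jar,/tmp/commons-collections4-4.4.jar --dependencies=javaee.api,org.jboss.as.web,org.apache.commons.io,org.apache.commons.codec,org.apache.logging.log4j.api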
[standalone@localhost:9990 /] global|org.tuckey.urlrewritefilter|https://repo1.maven.org/maven2/org/tuckey/urlrewritefilter/4.0.4/urlrewritefilter-4.0.4.jar|javaee.api,org.jboss.as.web
SCOPE: global
DEP_NAME: org.tuckey.urlrewritefilter
RESOURCES_CSV: https://repo1.maven.org/maven2/org/tuckey/urlrewritefilter/4.0.4/urlrewritefilter-4.0.4.jar
DEPENDENCIES_CSV: javaee.api,org.jboss.as.web
add_module
> [https://repo1.maven.org/maven2/org/tuckey/urlrewritefilter/4.0.4/urlrewritefilter-4.0.4.jar]
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
2023-11-20 14:23:04 URL:https://repo1.maven.org/maven2/org/tuckey/urlrewritefilter/4.0.4/urlrewritefilter-4.0.4.jar [177474/177474] -> "urlrewritefilter-4.0.4.jar" [1]
done with wget
[standalone@localhost:9990 /] module add --name=org.tuckey.urlrewritefilter --resource-delimiter=, --resources=/tmp/urlrewritefilter-4.0.4.jar --dependencies=javaee.api,org.jboss.as.web
[standalone@localhost:9990 /] {"outcome" => "success"}
global|org.jlab.jlog|https://repo1.maven.org/maven2/org/jlab/jlog/5.0.0/jlog-5.0.0.jar|javaee.api,org.jboss.as.web
SCOPE: global
DEP_NAME: org.jlab.jlog
RESOURCES_CSV: https://repo1.maven.org/maven2/org/jlab/jlog/5.0.0/jlog-5.0.0.jar
DEPENDENCIES_CSV: javaee.api,org.jboss.as.web
add_module
> [https://repo1.maven.org/maven2/org/jlab/jlog/5.0.0/jlog-5.0.0.jar]
2023-11-20 14:23:11 URL:https://repo1.maven.org/maven2/org/jlab/jlog/5.0.0/jlog-5.0.0.jar [51354/51354] -> "jlog-5.0.0.jar" [1]
done with wget
[standalone@localhost:9990 /] module add --name=org.jlab.jlog --resource-delimiter=, --resources=/tmp/jlog-5.0.0.jar --dependencies=javaee.api,org.jboss.as.web
[standalone@localhost:9990 /] {"outcome" => "success"}
global|org.keycloak.admin-client|https://repo1.maven.org/maven2/org/keycloak/keycloak-admin-client/20.0.5/keycloak-admin-client-20.0.5.jar,https://repo1.maven.org/maven2/org/keycloak/keycloak-core/20.0.5/keycloak-core-20.0.5.jar,https://repo1.maven.org/maven2/org/keycloak/keycloak-common/20.0.5/keycloak-common-20.0.5.jar|org.jboss.ws.api,javax.ws.rs.api,org.jboss.logging,org.jboss.resteasy.resteasy-client,org.jboss.resteasy.resteasy-jackson2-provider,org.jboss.resteasy.resteasy-jaxb-provider,org.jboss.resteasy.resteasy-multipart-provider
SCOPE: global
DEP_NAME: org.keycloak.admin-client
RESOURCES_CSV: https://repo1.maven.org/maven2/org/keycloak/keycloak-admin-client/20.0.5/keycloak-admin-client-20.0.5.jar,https://repo1.maven.org/maven2/org/keycloak/keycloak-core/20.0.5/keycloak-core-20.0.5.jar,https://repo1.maven.org/maven2/org/keycloak/keycloak-common/20.0.5/keycloak-common-20.0.5.jar
DEPENDENCIES_CSV: org.jboss.ws.api,javax.ws.rs.api,org.jboss.logging,org.jboss.resteasy.resteasy-client,org.jboss.resteasy.resteasy-jackson2-provider,org.jboss.resteasy.resteasy-jaxb-provider,org.jboss.resteasy.resteasy-multipart-provider
add_module
> [https://repo1.maven.org/maven2/org/keycloak/keycloak-admin-client/20.0.5/keycloak-admin-client-20.0.5.jar]
2023-11-20 14:23:18 URL:https://repo1.maven.org/maven2/org/keycloak/keycloak-admin-client/20.0.5/keycloak-admin-client-20.0.5.jar [64674/64674] -> "keycloak-admin-client-20.0.5.jar" [1]
done with wget
> [https://repo1.maven.org/maven2/org/keycloak/keycloak-core/20.0.5/keycloak-core-20.0.5.jar]
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
failed: Connection timed out.
2023-11-20 14:43:13 URL:https://repo1.maven.org/maven2/org/keycloak/keycloak-core/20.0.5/keycloak-core-20.0.5.jar [330299/330299] -> "keycloak-core-20.0.5.jar" [1]
done with wget
> [https://repo1.maven.org/maven2/org/keycloak/keycloak-common/20.0.5/keycloak-common-20.0.5.jar]
2023-11-20 14:43:13 URL:https://repo1.maven.org/maven2/org/keycloak/keycloak-common/20.0.5/keycloak-common-20.0.5.jar [162927/162927] -> "keycloak-common-20.0.5.jar" [1]
done with wget
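[standalone@localhost:9990 /] module add --name=org.keycloak.admin-client --resource-delimiter=, --resources=/tmp/keycloak-admin-client-20.0.5.jar,/tmp/keycloak-core-20.0.5.jar,/tmp/keycloak-common-20.0.5.jar --dependencies=org.jboss.ws.api,javax.ws.rs.api,org.jboss.logging,org.jboss.resteasy.resteasy-client,org.jboss.resteasy.resteasy-jackson2-provider,org.jboss.resteasy.resteasy-jaxb-provider,org.jboss.resteasy.resteasy-multipart-provider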
[standalone@localhost:9990 /] {"outcome" => "success"}
@slominskir
Member Author

Worth pointing out that if I use PowerShell to run curl or wget to grab the files from maven central the files are downloaded nearly instantly. We do have an upstream intercepting TLS proxy server at JLab that may be interacting in odd ways with the WSL networking. In order for wget to work inside WSL I had to add our internal PKI cert plus enable legacy renegotiation:

sudo wget -O /usr/local/share/ca-certificates/customcert.crt http://pki.jlab.org/JLabCA.crt
sudo update-ca-certificates
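# cat would treat the string as a file name, and a plain >> redirect runs without sudo; use echo piped to sudo tee -a instead
echo "Options = UnsafeLegacyRenegotiation" | sudo tee -a /etc/ssl/openssl.cnf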

@slominskir
Member Author

slominskir commented Nov 20, 2023

Tried a few of the suggested fixes from the related issues mentioned above. Syncing the hardware clock didn't do anything. Disabling IPv6 on the Windows adapter, and giving IPv4 higher precedence inside WSL, didn't do anything. Rebooting the machine didn't fix it. The next thing I tried was overriding the DNS server. Many suggest using Google's 8.8.8.8, but that doesn't work, presumably because our intercepting proxy blocks it. Commenting out the WSL2-configured DNS server and setting it to one of our internal DNS servers appears to fix the issue:

ryans@SFTRYANS:/etc$ cat resolv.conf
# This file was automatically generated by WSL. To stop automatic generation of this file, add the following entry to /etc/wsl.conf:
# [network]
# generateResolvConf = false
#nameserver 192.168.208.1
nameserver 129.57.90.255

Note: Must add generateResolvConf=false to /etc/wsl.conf in order for resolv.conf changes to survive a reboot. Make sure resolv.conf is no longer a symlink too (unlink and create new)
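
The steps in the note above can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not the exact commands used on this machine: the function name `pin_dns` and its optional root-prefix argument (there only so the function can be tried against a scratch directory) are hypothetical, and it assumes /etc/wsl.conf does not already contain a [network] section. The nameserver in the usage note is the internal one from the resolv.conf shown above.

```shell
# Sketch: pin a DNS server for one WSL2 distro.
# pin_dns NAMESERVER [ROOT]   (hypothetical helper; ROOT defaults to "", i.e. the real /etc)
pin_dns() {
  local nameserver="$1"
  local root="${2:-}"
  local wsl_conf="$root/etc/wsl.conf"
  local resolv="$root/etc/resolv.conf"

  # Tell WSL to stop regenerating resolv.conf (skipped if the key is already there).
  if ! grep -qs '^generateResolvConf' "$wsl_conf"; then
    printf '[network]\ngenerateResolvConf = false\n' >> "$wsl_conf"
  fi

  # resolv.conf is normally a symlink to the WSL-generated file:
  # remove the link itself and write a regular file in its place.
  rm -f "$resolv"
  printf 'nameserver %s\n' "$nameserver" > "$resolv"
}
```

Run it as root inside the distro (for example `pin_dns 129.57.90.255` from a `sudo -s` shell), then run `wsl.exe --shutdown` from PowerShell so that wsl.conf is re-read on the next start.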

@slominskir
Member Author

slominskir commented Nov 27, 2023

Note: I still have no idea WHY explicitly defining the DNS server worked. Some notes to follow up with:

@slominskir
Member Author

Also worth pointing out that there are a few other firewall/anti-virus apps on the PC, besides the HyperV one mentioned above, that I have to work around as well (and they are probably stepping all over each other too):

  • CarbonBlack (I see the process running in TaskManager, but have no control of it)
  • CrowdStrike Falcon (I see the process running in TaskManager, but have no control of it)
  • Windows Defender (Has Public, Private, and Domain Profile settings and I have no control over Domain settings)
  • Symantec End Point Protection (while glancing at the config that is read-only to me I noticed "Smart DNS" is checked. Lots of other options too, all read-only to me)

The first two are new to the new machine so that could be a clue.

@slominskir
Member Author

slominskir commented Nov 28, 2023

There is a broader issue affecting Docker Desktop containers as well, and the resolv.conf fix mentioned above only fixes it for the WSL2 Ubuntu distribution, not for Docker Desktop containers. Specifically, after docker compose up on a compose file of containers that need to communicate with each other, the containers fail to connect to each other (compose example). So I'll re-open. A few more notes:

  • Confirmed that latest version of Docker Desktop on up-to-date Windows 11 Home edition with latest version of WSL2 located offsite (not on the JLab network) does not suffer this issue. Almost certainly it is a JLab environment issue.
  • It appears a reboot fixes the issue for one up/down compose cycle before the problem returns. So maybe a DNS cache issue. To re-create you may need to run up/down cycle once or twice to see the issue.

@slominskir slominskir reopened this Nov 28, 2023
@slominskir
Member Author

slominskir commented Nov 29, 2023

After a Docker Desktop crash followed by Factory Reset now compose appears to be working fine. Not very satisfying. So I tried to uninstall everything, reboot, and then re-install everything again, and reboot. This means uninstalling Docker Desktop, Ubuntu, and WSL2. Turns out there are two instances of WSL2 and on re-install 2 are put back. Weird. Screenshot:

ScreenshotA

I re-installed using the directions here: https://learn.microsoft.com/en-us/windows/wsl/install, which means simply running

wsl --install

It's confusing that the Microsoft Store could be used for this as well, possibly with a different outcome. Also, strangely, the store lists two different Ubuntu apps, and even the one with an implicit version actually uses the same version as the one with an explicit version (22.04.2). Weird. Screenshot:

ScreenshotE

Just to be safe I also unchecked the WSL "Windows Feature" during uninstall and confirmed it's re-checked (enabled) after re-installing with wsl --install command. The connection between app and feature is unclear too. Weird. Screenshot:

ScreenshotC

I can confirm that the previous behavior inside WSL2 Ubuntu returns, in that wget is unreliable again. Screenshot:

ScreenshotB ScreenshotF

Docker Desktop re-installed without a hitch and now works without a problem so far. I haven't found how to get it back into the odd state it was in before. I guess I just keep using it until it breaks again. Might break once I attempt to fix phantom network error in WSL.

@slominskir
Member Author

Ran a Wireshark packet capture with wget when it works vs. when it results in a timeout. It looks like when it works, an IPv6 address is selected from the DNS response:
good

When it doesn't work, IPv4 addresses are returned and, oddly, a .com root DNS server is selected (I'm not sure how to interpret this):
bad

@slominskir
Member Author

I think my initial interpretation of the packet capture data is misleading.

After doing some reading, it appears that in both the working and timeout scenarios a list of answers is returned for both IPv4 and IPv6, and the lists appear to be identical. The difference is the order in which the lists are returned. In the working case the IPv4 answers (A records) are returned first, whereas in the timeout case the IPv6 answers (AAAA records) are returned first. This ordering could be a coincidence and perhaps is not important. What really matters is that the IP chosen differs between the working and timeout cases. It isn't clear how the choice is made.

  • How does it know to use IPv4 vs IPv6?
  • Each list has multiple entries, how does it know which one in the list to use?
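
One way to probe these two questions from inside the distro: on Linux, wget resolves names through glibc's getaddrinfo(), which sorts the candidate addresses (the sorting can be tuned in /etc/gai.conf), and by default wget tries them in the order it gets them back. `getent ahosts` prints that same sorted list. The sketch below is only an aid for inspecting the order; `first_choice` is a hypothetical helper name, and it assumes the usual `getent ahosts` layout of address, socket type, and optional name.

```shell
# Sketch: report which address a getaddrinfo()-based client would try first.
# Reads `getent ahosts HOST` output on stdin. first_choice is a hypothetical helper.
first_choice() {
  awk '$2 == "STREAM" {
         # A colon in the address means IPv6, otherwise IPv4.
         family = ($1 ~ /:/) ? "IPv6" : "IPv4"
         print family, $1
         exit
       }'
}

# Example (needs working DNS):
#   getent ahosts repo1.maven.org | first_choice
```

Running the example several times, in both the working and the timeout state, would show whether the first choice changes between the two.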

@slominskir
Member Author

slominskir commented Nov 30, 2023

It's also interesting that if I repeat the Wireshark test on my personal Windows 11 Home PC, the DNS answers don't include root DNS servers:

Screenshot

The answers are only actual GitHub domain results as you'd expect. The fact that the onsite PC results include seemingly spurious root DNS answers appears to be the issue. The onsite test also shows inserting/mixing A records (the root DNS ones) in an AAAA response, which appears odd.

@slominskir
Member Author

slominskir commented Nov 30, 2023

I guess I'm seeing this issue: microsoft/WSL#5806

DNS lookup response is erroneously mixing AUTHORITY response in ANSWERS section:

ryans@SFTRYANS:/mnt/c/Users/ryans$ dig github.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.1-Ubuntu <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11848
;; flags: qr rd ad; QUERY: 1, ANSWER: 15, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;github.com.                    IN      A

;; ANSWER SECTION:
github.com.             0       IN      A       140.82.114.3
b.gtld-servers.net.     0       IN      A       192.33.14.30
e.gtld-servers.net.     0       IN      A       192.12.94.30
i.gtld-servers.net.     0       IN      A       192.43.172.30
k.gtld-servers.net.     0       IN      A       192.52.178.30
f.gtld-servers.net.     0       IN      A       192.35.51.30
h.gtld-servers.net.     0       IN      A       192.54.112.30
c.gtld-servers.net.     0       IN      A       192.26.92.30
a.gtld-servers.net.     0       IN      A       192.5.6.30
j.gtld-servers.net.     0       IN      A       192.48.79.30
g.gtld-servers.net.     0       IN      A       192.42.93.30
l.gtld-servers.net.     0       IN      A       192.41.162.30
d.gtld-servers.net.     0       IN      A       192.31.80.30
m.gtld-servers.net.     0       IN      A       192.55.83.30
b.gtld-servers.net.     0       IN      AAAA    2001:503:231d::2:30
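
A quick way to check whether a machine is affected is to compare each record in the ANSWER section against the name that was queried. The following is a minimal sketch, not a robust DNS validator: `spurious_answers` is a hypothetical helper, and because it only compares owner names it would also flag a legitimate CNAME chain.

```shell
# Sketch: list ANSWER records whose owner name is not the name that was asked for.
# Reads `dig +noall +answer NAME` output on stdin. spurious_answers is a hypothetical helper.
spurious_answers() {
  # Normalize the query to a fully qualified name with exactly one trailing dot.
  local query="${1%.}."
  awk -v q="$query" '
    /^;/ || NF < 5 { next }              # skip comment lines and blank lines
    tolower($1) != tolower(q) { print }  # owner name differs from the query
  '
}

# Example (needs working DNS):
#   dig +noall +answer github.com | spurious_answers github.com
```

On the output above this would print every gtld-servers.net line and nothing for the github.com record. No output at all would suggest the resolver is not mixing extra records into the answer.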

@slominskir
Member Author

This of course only creates more questions:

  • Why wasn't I seeing this bug before onsite with Windows 10? Home computer is safe since ISPs apparently strip off optional AUTHORITY section. Can rolling back to old distro fix this? Different upstream DNS?
  • What's the best workaround? Can I configure Windows 11 to use a JLab (corporate) DNS that doesn't respond with AUTHORITY section (non-recursive)? Microsoft apparently has known about this bug for years and isn't doing anything to fix it directly (I guess they lost source code or there's a licensing dispute). Within the last few weeks a DNS Tunnel feature was released so we can tunnel around the WSL DNS server, which apparently is an ICS DNS server from 1998. Manually modifying resolv.conf works too for a specific distro.
  • Is this problem related to the issues with Docker Desktop? This could be totally unrelated.

If/when Windows 12 comes out someone else can migrate over first!

@slominskir
Member Author

Moving forward with /etc/resolv.conf and /etc/wsl.conf config change as done before. Sounds like in the near future the experimental dnsTunneling feature will be the correct fix. Re-closing. I'll create a new issue if I'm able to pinpoint odd behavior with Docker Desktop - it seems that may be something unrelated and appears to be gone at the moment.
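
For reference, a sketch of opting in to DNS tunneling. The setting goes in the per-user .wslconfig file on the Windows side (%UserProfile%\.wslconfig), not in /etc/wsl.conf. In the WSL 2.0.x releases current at the time of this thread it was documented under the [experimental] section, while later releases document it under [wsl2], so the section name is a parameter here. `enable_dns_tunneling` is a hypothetical helper, and it assumes the file does not already contain that section.

```shell
# Sketch: append a dnsTunneling setting to a .wslconfig file.
# enable_dns_tunneling FILE [SECTION]   (hypothetical helper; SECTION defaults to "experimental")
enable_dns_tunneling() {
  local file="$1"
  local section="${2:-experimental}"

  # Do nothing if the key is already present, to avoid duplicate entries.
  if grep -qs '^dnsTunneling' "$file"; then
    return 0
  fi
  printf '[%s]\ndnsTunneling=true\n' "$section" >> "$file"
}

# Example from inside WSL (Windows user name as seen in the prompts above):
#   enable_dns_tunneling /mnt/c/Users/ryans/.wslconfig
```

As with wsl.conf changes, `wsl.exe --shutdown` is needed before the setting takes effect. A hand-written resolv.conf with generateResolvConf = false would presumably have to be reverted first, since otherwise the distro keeps using the pinned nameserver.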

@karolswdev

@slominskir I read through your updates and appreciate the level of detail that you included here. Have you had a chance to come back to this issue and explore the dnsTunneling solution you were referring to?

@slominskir
Member Author

@karolswdev - Nope, I'm still relying on explicit configuration / override to one of our corporate domain DNS servers. It does look like dnsTunneling is about to be the default mode of operation though as there is a pre-release stating as much, so presumably soon new users will never see the dark corner of WSL discussed here.
