
UDP checksum errors cause intermittent failures when the host accesses a SVC

Oilbeater edited this page Mar 31, 2021 · 4 revisions

Version

Kube-OVN 1.6.0+

Symptom

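The SVC's Endpoints are Pods on the container network. When the host accesses the SVC ClusterIP and port, requests intermittently hang and take about half a minute to return.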

Troubleshooting

Check the system log on the host that runs the Endpoint Pod. If dmesg shows entries like the following, the requests are failing because of the UDP checksum problem:

[ 8702.057455] UDP: bad checksum. From 192.168.16.44:13066 to 192.168.16.45:6081 ulen 98
[ 8702.097551] UDP: bad checksum. From 192.168.16.44:4234 to 192.168.16.45:6081 ulen 98
[ 8702.128824] UDP: bad checksum. From 192.168.16.44:11537 to 192.168.16.45:6081 ulen 98
[ 8702.385434] UDP: bad checksum. From 192.168.16.44:32102 to 192.168.16.45:6081 ulen 98
[ 8703.099713] UDP: bad checksum. From 192.168.16.44:4234 to 192.168.16.45:6081 ulen 98
[ 8703.388079] UDP: bad checksum. From 192.168.16.44:32102 to 192.168.16.45:6081 ulen 98
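
To make the check above repeatable, the kernel log can be filtered for bad-checksum messages addressed to the Geneve port (6081, as in the log lines above). This is a minimal sketch: the function name `count_geneve_bad_csum` is hypothetical, and the pattern assumes the message format shown above.

```shell
# Count "UDP: bad checksum" kernel messages whose destination port is 6081 (Geneve).
# Reads kernel log text on stdin, so it works with dmesg or with a saved log file.
count_geneve_bad_csum() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps that from being an error.
    grep -c 'UDP: bad checksum\..* to [0-9.]*:6081 ' || true
}

# Usage on a node (reading the kernel log may require root):
#   dmesg | count_geneve_bad_csum
```

A non-zero count that keeps growing while requests to the SVC hang points to the checksum problem described here.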

On Kylin V10, dmesg does not show these messages. Instead, run netstat -us and check whether the InCsumErrors counter keeps increasing:

# netstat -us
IcmpMsg:
    InType0: 22
    InType3: 24
    InType8: 117852
    OutType0: 117852
    OutType3: 29
    OutType8: 22
Udp:
    3040636 packets received
    0 packets to unknown port received.
    4 packet receive errors
    602 packets sent
    0 receive buffer errors
    0 send buffer errors
    InCsumErrors: 4
UdpLite:
IpExt:
    InBcastPkts: 10244
    InOctets: 4446320361
    OutOctets: 1496815600
    InBcastOctets: 3095950
    InNoECTPkts: 7683903
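
Whether the counter "keeps increasing" is easier to judge from two samples than from a single reading. The sketch below extracts InCsumErrors from the Udp section of the output format shown above. The function name `udp_csum_errors` and the 10-second interval are illustrative choices, and a missing counter line is assumed to mean 0.

```shell
# Print the InCsumErrors value from the "Udp:" section of `netstat -us` output (stdin).
udp_csum_errors() {
    awk '
        /^Udp:/       { in_udp = 1; next }   # entering the Udp section
        /^[A-Za-z]+:/ { in_udp = 0 }         # any other section header ends it
        in_udp && $1 == "InCsumErrors:" { print $2 }
    '
}

# Usage on a node: sample twice and compare.
#   before=$(netstat -us | udp_csum_errors)
#   sleep 10
#   after=$(netstat -us | udp_csum_errors)
#   echo "InCsumErrors grew by $(( ${after:-0} - ${before:-0} ))"
```

Tracking the section header matters because other sections (for example UdpLite) can carry a counter with the same name.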

Solution

Disable UDP checksum for Geneve: edit the startup arguments of the kube-system/kube-ovn-cni DaemonSet and set --encap-checksum=false

    spec:
      containers:
      - args:
        - --enable-mirror=false
        - --encap-checksum=false
        - --service-cluster-ip-range=10.96.0.0/12
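
After editing the DaemonSet, it can help to confirm that the flag is actually present in the container arguments. The following is a sketch under assumptions: the helper name `encap_checksum_disabled` is hypothetical, and it only checks that the text on stdin contains the flag set to false.

```shell
# Succeed (exit 0) if the argument list on stdin contains --encap-checksum=false.
encap_checksum_disabled() {
    # -e lets grep accept a pattern that starts with dashes.
    grep -q -e '--encap-checksum=false'
}

# Usage from a machine with cluster access:
#   kubectl -n kube-system edit ds kube-ovn-cni      # change the args as shown above
#   kubectl -n kube-system rollout status ds/kube-ovn-cni
#   kubectl -n kube-system get ds kube-ovn-cni \
#       -o jsonpath='{.spec.template.spec.containers[*].args}' \
#     | encap_checksum_disabled && echo "encap checksum disabled"
```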

Disable tx offloading on the Kube-OVN related NICs on every node

ethtool -K ovn0 tx off
ethtool -K genev_sys_6081 tx off
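
To verify the result on a node, the current offload state can be read back with `ethtool -k` (lowercase k shows settings, uppercase K changes them). This is a small sketch: the helper name `tx_csum_state` is hypothetical, and it assumes the usual `tx-checksumming: on|off` line in the output.

```shell
# Print the tx-checksumming state ("on" or "off") from `ethtool -k <dev>` output (stdin).
tx_csum_state() {
    awk '$1 == "tx-checksumming:" { print $2 }'
}

# Usage on a node, for the two interfaces named above:
#   for dev in ovn0 genev_sys_6081; do
#       echo "$dev: $(ethtool -k "$dev" | tx_csum_state)"
#   done
```

Both interfaces should report off once the commands above have been applied.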