[BUG] Lots of NoAvailableAddress errors #4775

cmdy · 2024-11-28T09:29:15Z

Kube-OVN Version

v1.12.29

Kubernetes Version

v1.28.11

Operation-system/Kernel Version

"CentOS Linux 7 (Core)" 5.10.0-228.2410.el7.bzl.x86_64

Description

When creating pods in batches, a large number of NoAvailableAddress errors occur:

E1128 10:02:11.291383       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1382': NoAvailableAddress, requeuing
E1128 10:02:11.298457       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.298474       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.298487       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1382': NoAvailableAddress, requeuing
E1128 10:02:11.308268       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.308282       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.308294       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1376': NoAvailableAddress, requeuing
E1128 10:02:11.318782       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.318798       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.318812       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1383': NoAvailableAddress, requeuing
E1128 10:02:11.324401       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.324417       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.324431       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1383': NoAvailableAddress, requeuing
E1128 10:02:11.327052       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.327068       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.327080       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1377': NoAvailableAddress, requeuing
E1128 10:02:11.342842       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.342856       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.342869       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1370': NoAvailableAddress, requeuing
E1128 10:02:11.355345       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.355363       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.355377       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1384': NoAvailableAddress, requeuing
E1128 10:02:11.365660       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.365674       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.365688       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1384': NoAvailableAddress, requeuing
E1128 10:02:11.376498       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.376512       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.376523       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1380': NoAvailableAddress, requeuing
E1128 10:02:11.377634       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.377648       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.377659       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1371': NoAvailableAddress, requeuing
E1128 10:02:11.392717       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.392735       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.392758       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1372': NoAvailableAddress, requeuing
E1128 10:02:11.410417       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.410430       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.410445       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1373': NoAvailableAddress, requeuing
E1128 10:02:11.475704       7 pod.go:1659] NoAvailableAddress

The namespace kf-partition has multiple subnets:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    ovn.kubernetes.io/logical_switch: arsenal-subnet-10,arsenal-subnet-3,arsenal-subnet-6,arsenal-subnet-7,arsenal-subnet-9,arsenal-subnet-1,arsenal-subnet-2,arsenal-subnet-4,arsenal-subnet-5,arsenal-subnet-8
  labels:
    kubernetes.io/metadata.name: kf-partition
  name: kf-partition
  resourceVersion: "1422896579"
  uid: 0a4aa8fb-4ae3-4590-aec7-41de64f654c3
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

Subnet usage

Steps To Reproduce

The first subnet bound to the namespace has only a few IPs remaining.
Create pods in batches.

Current Behavior

The NoAvailableAddress error causes pods to be requeued and retried.

Expected Behavior

When a namespace has multiple subnets and their IPs are not actually exhausted, pods should not be retried because of NoAvailableAddress errors.

dosubot · 2024-11-28T09:31:16Z

I found a similar closed issue that might be relevant to your situation:

[BUG] need to handle the failure during allocate multiple IPs for a single pod, or it will exhaust the whole IP Pool: This issue was closed with a pull request that addressed the problem by checking both the stateful set name and UID when handling pod deletion. The issue was resolved in version 1.10.6 ^[1].

In your case, the NoAvailableAddress error occurs when creating pods in batches, even though the namespace has multiple subnets. This error is triggered when Kube-OVN fails to allocate an IP address for a pod, indicating that there are no available IP addresses to allocate ^[2].

To address this, you might want to ensure that the IP allocation logic is correctly configured to handle multiple subnets within a namespace. The logic for selecting a subnet involves checking namespace bindings, label selectors, and default configurations ^[3]. Make sure that your configuration aligns with these criteria to prevent unnecessary NoAvailableAddress errors.

bobz965 · 2024-11-28T10:09:51Z

Please attach the pod YAML. Does the pod use a static IP?

cmdy · 2024-11-28T12:10:42Z

> please attach the pod yaml, does the pod use static IP?

Not using a static IP, just a normal pod.

I think it is because the code always selects the first subnet of the namespace first. When the number of pods in a batch is greater than the number of available IPs in the first subnet, the pods are retried because no IP is available.
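
The hypothesis above can be illustrated with a toy model. This is only a sketch of the suspected behavior, not Kube-OVN code: the `subnet` type, `allocateFirstOnly`, `countFailures`, and the available-IP counts are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoAvailableAddress reuses the error text seen in the logs; the variable itself is illustrative.
var errNoAvailableAddress = errors.New("NoAvailableAddress")

// subnet is a hypothetical, simplified stand-in for a subnet's IPAM state.
type subnet struct {
	name      string
	available int
}

// allocateFirstOnly models the suspected behavior: only the first subnet
// in the namespace's list is ever tried, even if later ones have free IPs.
func allocateFirstOnly(subnets []*subnet) (string, error) {
	if len(subnets) == 0 || subnets[0].available == 0 {
		return "", errNoAvailableAddress
	}
	subnets[0].available--
	return subnets[0].name, nil
}

// countFailures allocates one IP per pod and returns how many pods
// would hit NoAvailableAddress and be requeued.
func countFailures(subnets []*subnet, pods int) int {
	failures := 0
	for i := 0; i < pods; i++ {
		if _, err := allocateFirstOnly(subnets); err != nil {
			failures++
		}
	}
	return failures
}

func main() {
	// Assumed numbers: the first subnet is nearly full, the second is almost empty.
	subnets := []*subnet{
		{name: "arsenal-subnet-10", available: 3},
		{name: "arsenal-subnet-3", available: 2000},
	}
	fmt.Println(countFailures(subnets, 10)) // prints 7
}
```

In this model a batch of 10 pods produces 7 failures although the second subnet still has 2000 free addresses, which matches the symptom described in the issue.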

cmdy · 2024-11-28T12:13:23Z

I think the method getPodDefaultSubnet should choose the subnet with the most available IPs.
@bobz965
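
A minimal sketch of the proposed selection rule follows. It is not the actual `getPodDefaultSubnet` implementation; `subnetStatus` and `pickMostAvailable` are hypothetical names, and the available-IP counts are assumed to be known to the caller.

```go
package main

import "fmt"

// subnetStatus is a hypothetical view of a subnet: its name and free IP count.
type subnetStatus struct {
	Name         string
	AvailableIPs int
}

// pickMostAvailable returns the subnet with the most free IPs.
// Subnets with no free IPs are skipped; on a tie the earlier entry wins.
// The boolean is false when no subnet has a free IP.
func pickMostAvailable(subnets []subnetStatus) (string, bool) {
	best := -1
	for i, s := range subnets {
		if s.AvailableIPs <= 0 {
			continue
		}
		if best == -1 || s.AvailableIPs > subnets[best].AvailableIPs {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return subnets[best].Name, true
}

func main() {
	// Assumed counts, for illustration only.
	subnets := []subnetStatus{
		{Name: "arsenal-subnet-10", AvailableIPs: 3},
		{Name: "arsenal-subnet-3", AvailableIPs: 2000},
		{Name: "arsenal-subnet-6", AvailableIPs: 0},
	}
	name, ok := pickMostAvailable(subnets)
	fmt.Println(name, ok) // prints "arsenal-subnet-3 true"
}
```

Free-IP counts can be stale while many pods are allocated concurrently, so a real implementation would probably still need to fall through to the next subnet when an allocation fails.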

bobz965 · 2024-11-29T01:25:09Z

The ovn-default CIDR uses a /16 mask. Is that not enough?

cmdy · 2024-11-29T01:38:13Z

> the ovn-default cidr use mask /16, which is not enough?

We use VXLAN mode, and the subnet CIDRs only use a /21 mask.

cmdy · 2024-11-29T02:24:56Z

I think using multiple subnets and using small bit masks is common in normal business, especially for large-scale clusters.

bobz965 · 2024-11-29T07:56:45Z

> I think using multiple subnets and using small bit masks is common in normal business, especially for large-scale clusters.

You can use a pod annotation to select a subnet that has available IPs.

cmdy · 2024-11-29T08:28:38Z

> I think using multiple subnets and using small bit masks is common in normal business, especially for large-scale clusters.

This does not quite fit our business scenario or meet our expectations. Having the business query how many available IPs are left in each subnet before scheduling a pod is not very elegant.

cmdy · 2024-11-29T08:37:51Z

Our business scenario is scheduling pods for batch-processing tasks. Each batch may contain hundreds or thousands of pods.

bobz965 · 2024-11-29T09:10:38Z

Regarding "especially for large-scale clusters": in the VPC case, /8 and /16 CIDRs are very common.
If you use VLAN, it is better to use something smaller than /24.

cmdy · 2024-11-29T10:48:38Z

> especially for large-scale clusters : in VPC case, cidr /8 /16 is very common. if you use VLAN, it is better to use smaller than /24.

We use /21 because tunnel_key in VXLAN mode only supports this many bits at most.
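
For a sense of scale, the number of addresses in a prefix can be checked with a few lines of Go. This is just arithmetic on the prefix length; `addressCount` is a hypothetical helper and the CIDRs are made-up examples, not the subnets from this cluster.

```go
package main

import (
	"fmt"
	"net/netip"
)

// addressCount returns the total number of IPv4 addresses in a prefix,
// including the network and broadcast addresses.
func addressCount(cidr string) (uint64, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return 0, err
	}
	if !p.Addr().Is4() {
		return 0, fmt.Errorf("only IPv4 prefixes are handled: %s", cidr)
	}
	// A /n IPv4 prefix leaves 32-n host bits.
	return uint64(1) << (32 - p.Bits()), nil
}

func main() {
	for _, cidr := range []string{"10.0.0.0/21", "10.0.0.0/16"} {
		n, err := addressCount(cidr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(cidr, n) // 2048 for the /21, 65536 for the /16
	}
}
```

A /21 holds 2048 addresses, so a batch of a few thousand pods cannot fit in a single subnet of that size and has to spill over into the other subnets bound to the namespace.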

If a /8 or /16 subnet is used, will configuring ACLs for isolation between services result in a very large subnet? Will it be difficult to configure?

bobz965 · 2024-11-30T02:37:55Z

How about using Geneve?

cmdy · 2024-12-02T01:44:35Z

> how about using geneve?

Our company's IDC needs to use VXLAN.

zcq98 · 2024-12-12T10:18:17Z

Release v1.12.28 seems to have fixed this issue. Are you sure your version is v1.12.29?

cmdy added the bug label Nov 28, 2024

dosubot (bot) added the ipam label Nov 28, 2024
