[BUG] Lots of NoAvailableAddress errors #4775

Open
cmdy opened this issue Nov 28, 2024 · 15 comments
Labels
bug Something isn't working ipam

Comments

@cmdy
Contributor

cmdy commented Nov 28, 2024

Kube-OVN Version

v1.12.29

Kubernetes Version

v1.28.11

Operation-system/Kernel Version

"CentOS Linux 7 (Core)" 5.10.0-228.2410.el7.bzl.x86_64

Description

When creating pods in batches, a large number of NoAvailableAddress errors occur

E1128 10:02:11.291383       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1382': NoAvailableAddress, requeuing
E1128 10:02:11.298457       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.298474       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.298487       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1382': NoAvailableAddress, requeuing
E1128 10:02:11.308268       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.308282       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.308294       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1376': NoAvailableAddress, requeuing
E1128 10:02:11.318782       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.318798       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.318812       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1383': NoAvailableAddress, requeuing
E1128 10:02:11.324401       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.324417       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.324431       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1383': NoAvailableAddress, requeuing
E1128 10:02:11.327052       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.327068       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.327080       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1377': NoAvailableAddress, requeuing
E1128 10:02:11.342842       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.342856       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.342869       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1370': NoAvailableAddress, requeuing
E1128 10:02:11.355345       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.355363       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.355377       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1384': NoAvailableAddress, requeuing
E1128 10:02:11.365660       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.365674       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.365688       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1384': NoAvailableAddress, requeuing
E1128 10:02:11.376498       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.376512       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.376523       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1380': NoAvailableAddress, requeuing
E1128 10:02:11.377634       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.377648       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.377659       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1371': NoAvailableAddress, requeuing
E1128 10:02:11.392717       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.392735       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.392758       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1372': NoAvailableAddress, requeuing
E1128 10:02:11.410417       7 pod.go:1659] NoAvailableAddress
E1128 10:02:11.410430       7 pod.go:608] NoAvailableAddress
E1128 10:02:11.410445       7 pod.go:425] error syncing 'kf-partition/predict-data-2-hive-2024-11-20-8d364b9370675611-exec-1373': NoAvailableAddress, requeuing
E1128 10:02:11.475704       7 pod.go:1659] NoAvailableAddress

The namespace kf-partition has multiple subnets:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    ovn.kubernetes.io/logical_switch: arsenal-subnet-10,arsenal-subnet-3,arsenal-subnet-6,arsenal-subnet-7,arsenal-subnet-9,arsenal-subnet-1,arsenal-subnet-2,arsenal-subnet-4,arsenal-subnet-5,arsenal-subnet-8
  labels:
    kubernetes.io/metadata.name: kf-partition
  name: kf-partition
  resourceVersion: "1422896579"
  uid: 0a4aa8fb-4ae3-4590-aec7-41de64f654c3
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

Subnet usage

image

Steps To Reproduce

  1. The namespace's first subnet has only a few IPs remaining
  2. Create pods in batches

Current Behavior

The NoAvailableAddress error causes pods to be retried

Expected Behavior

With multiple subnets, when the subnets' IPs are not actually exhausted, pods should not be retried because of NoAvailableAddress errors

@cmdy cmdy added the bug Something isn't working label Nov 28, 2024

dosubot bot commented Nov 28, 2024

I found a similar closed issue that might be relevant to your situation:

  1. [BUG] need to handle the failure during allocate multiple IPs for a single pod, or it will exhaust the whole IP Pool: This issue was closed with a pull request that addressed the problem by checking both the stateful set name and UID when handling pod deletion. The issue was resolved in version 1.10.6 [1].

In your case, the NoAvailableAddress error occurs when creating pods in batches, even though the namespace has multiple subnets. This error is triggered when Kube-OVN fails to allocate an IP address for a pod, indicating that there are no available IP addresses to allocate [2].

To address this, you might want to ensure that the IP allocation logic is correctly configured to handle multiple subnets within a namespace. The logic for selecting a subnet involves checking namespace bindings, label selectors, and default configurations [3]. Make sure that your configuration aligns with these criteria to prevent unnecessary NoAvailableAddress errors.

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added the ipam label Nov 28, 2024
@bobz965
Collaborator

bobz965 commented Nov 28, 2024

Please attach the pod YAML. Does the pod use a static IP?

@cmdy
Contributor Author

cmdy commented Nov 28, 2024

Please attach the pod YAML. Does the pod use a static IP?

Not using a static IP, just a normal pod.

I think it is because the code always selects the first subnet of the namespace first. When the number of pods in a batch is greater than the number of available IPs in the first subnet, the pods are retried because there is no IP.
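
To make the hypothesis above concrete, here is a small model of it. This is only a sketch: `pool`, `allocateFirstOnly`, and `allocateWithFallback` are hypothetical names and do not come from the Kube-OVN code base, and the free-IP counts are made up. It compares an allocator that only ever tries the first subnet with one that falls through to the next subnet when the current one is exhausted.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoAvailableAddress mirrors the error text seen in the controller log.
var errNoAvailableAddress = errors.New("NoAvailableAddress")

// pool is a hypothetical, simplified subnet: a name and a free-IP counter.
type pool struct {
	name string
	free int
}

// allocateFirstOnly models the suspected behavior: only the first
// subnet bound to the namespace is tried.
func allocateFirstOnly(pools []*pool) (string, error) {
	if len(pools) == 0 || pools[0].free == 0 {
		return "", errNoAvailableAddress
	}
	pools[0].free--
	return pools[0].name, nil
}

// allocateWithFallback tries the subnets in order and only fails
// when every subnet is exhausted.
func allocateWithFallback(pools []*pool) (string, error) {
	for _, p := range pools {
		if p.free > 0 {
			p.free--
			return p.name, nil
		}
	}
	return "", errNoAvailableAddress
}

func main() {
	// Illustrative numbers only: the first subnet is nearly full.
	a := []*pool{{"subnet-a", 2}, {"subnet-b", 100}}
	b := []*pool{{"subnet-a", 2}, {"subnet-b", 100}}

	failedFirstOnly, failedFallback := 0, 0
	for i := 0; i < 5; i++ { // a "batch" of 5 pods
		if _, err := allocateFirstOnly(a); err != nil {
			failedFirstOnly++
		}
		if _, err := allocateWithFallback(b); err != nil {
			failedFallback++
		}
	}
	fmt.Println(failedFirstOnly, failedFallback) // prints "3 0"
}
```

In this model, every pod beyond the first subnet's remaining capacity fails and has to be requeued, even though the second subnet has plenty of free addresses.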

@cmdy
Contributor Author

cmdy commented Nov 28, 2024

image
I think the method getPodDefaultSubnet should choose the subnet with the most available IPs
@bobz965
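
A minimal sketch of that suggestion, assuming a simplified view of subnet state. `subnetUsage` and `pickSubnet` are hypothetical names used for illustration; they are not Kube-OVN types and this is not the real signature of `getPodDefaultSubnet`. The free-IP counts are invented.

```go
package main

import "fmt"

// subnetUsage is a hypothetical, simplified view of one subnet's IPAM state.
type subnetUsage struct {
	Name      string
	Available int // number of free IPs (illustrative)
}

// pickSubnet returns the subnet with the most free IPs.
// It reports false when no subnet has any free IP left.
func pickSubnet(subnets []subnetUsage) (string, bool) {
	best := -1
	for i, s := range subnets {
		if s.Available <= 0 {
			continue
		}
		if best == -1 || s.Available > subnets[best].Available {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return subnets[best].Name, true
}

func main() {
	subnets := []subnetUsage{
		{"arsenal-subnet-10", 3},
		{"arsenal-subnet-3", 1200},
		{"arsenal-subnet-6", 0},
	}
	name, ok := pickSubnet(subnets)
	fmt.Println(name, ok) // prints "arsenal-subnet-3 true"
}
```

One trade-off of this approach is that it spreads pods across subnets instead of filling them in annotation order, which may or may not be acceptable depending on how the subnets are used.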

@bobz965
Collaborator

bobz965 commented Nov 29, 2024

The ovn-default CIDR uses a /16 mask. Is that not enough?

@cmdy
Contributor Author

cmdy commented Nov 29, 2024

The ovn-default CIDR uses a /16 mask. Is that not enough?

We use VXLAN mode, and the subnet CIDRs only use a /21 mask.
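
For scale, a /21 holds far fewer addresses than a /16. The sketch below just does that arithmetic; `addressCount` is a hypothetical helper and the CIDRs are example values, not the reporter's actual subnets. The number of IPs usable by pods will be somewhat lower once the gateway and any excluded addresses are taken out.

```go
package main

import (
	"fmt"
	"net/netip"
)

// addressCount returns the total number of addresses covered by a CIDR.
// It does not subtract gateway or excluded addresses.
func addressCount(cidr string) (uint64, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return 0, err
	}
	hostBits := p.Addr().BitLen() - p.Bits()
	if hostBits >= 64 {
		return 0, fmt.Errorf("prefix %s is too large to count in a uint64", cidr)
	}
	return uint64(1) << hostBits, nil
}

func main() {
	for _, cidr := range []string{"10.0.0.0/21", "10.16.0.0/16"} {
		n, err := addressCount(cidr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %d addresses\n", cidr, n)
	}
	// prints:
	// 10.0.0.0/21 -> 2048 addresses
	// 10.16.0.0/16 -> 65536 addresses
}
```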

@cmdy
Contributor Author

cmdy commented Nov 29, 2024

I think using multiple subnets and using small bit masks is common in normal business, especially for large-scale clusters.

@bobz965
Collaborator

bobz965 commented Nov 29, 2024

I think using multiple subnets and using small bit masks is common in normal business, especially for large-scale clusters.

You can use a pod annotation to select a subnet that has available IPs.

@cmdy
Contributor Author

cmdy commented Nov 29, 2024

I think using multiple subnets and using small bit masks is common in normal business, especially for large-scale clusters.

This does not quite fit our business scenario or meet our expectations. Requiring the business to query how many available IPs are left in each allocated subnet before scheduling a pod is not very elegant.

@cmdy
Contributor Author

cmdy commented Nov 29, 2024

Our business scenario is scheduling pods for batch processing tasks. Each batch may contain hundreds or thousands of pods.

@bobz965
Collaborator

bobz965 commented Nov 29, 2024

especially for large-scale clusters: in the VPC case, /8 and /16 CIDRs are very common.
If you use VLAN, it is better to use subnets smaller than /24.

@cmdy
Contributor Author

cmdy commented Nov 29, 2024

especially for large-scale clusters: in the VPC case, /8 and /16 CIDRs are very common. If you use VLAN, it is better to use subnets smaller than /24.

We use /21 because the tunnel_key in VXLAN mode only supports this many bits at most.

If a /8 or /16 subnet is used, will configuring ACLs for isolation between services result in a very large subnet? Will it be difficult to configure?

@bobz965
Collaborator

bobz965 commented Nov 30, 2024

How about using Geneve?

@cmdy
Contributor Author

cmdy commented Dec 2, 2024

How about using Geneve?

Our company's IDC requires VXLAN.

@zcq98
Member

zcq98 commented Dec 12, 2024

image
Release v1.12.28 seems to have fixed this issue. Are you sure your version is v1.12.29?

3 participants