If the VM Template of a storage node is altered to use a custom Image (or a clone of the original image) as the second mounted drive, the VM never finishes its configuration and does not join the k8s cluster.
Some of the symptoms:
The resolv.conf differs from the one on a healthy node.
The RKE2 agent tries to query itself.
The RKE2 agent configuration also differs from the one on a healthy node (the server is set to: server: https://:9345).
Affected OneKE versions: both 1.29 and 1.27.
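The empty host in `server: https://:9345` can be detected mechanically. The following is a minimal sketch, not part of OneKE: the function name `server_host_missing` is hypothetical, and it only assumes the agent configuration contains a plain `server: <url>` line as quoted above.

```python
from urllib.parse import urlsplit


def server_host_missing(config_text: str) -> bool:
    """Return True if the `server:` line holds a URL with no host part."""
    for line in config_text.splitlines():
        # Split on the first colon only, so the URL itself stays intact.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "server":
            url = value.strip().strip("'\"")
            # urlsplit("https://:9345").hostname is None: port but no host.
            return urlsplit(url).hostname is None
    # No `server:` key at all: nothing to flag with this check.
    return False
```

On a node, the text could be read from the RKE2 agent configuration file (commonly `/etc/rancher/rke2/config.yaml`; the path on a OneKE node is an assumption here) and passed to the function.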
Important note
Please note that the behaviour differs depending on whether the second (private) network is isolated or not. The result is the same in both cases: the storage node is misconfigured and does not join the cluster!
If the Private Network is isolated:
The resolv.conf is missing on the damaged node, while it exists on a healthy node.
/var/log/one-appliance/configure.log contains errors about communicating with OneGate: Failed to open TCP connection to 172.16.100.1:5030 (Network is unreachable - connect(2) for "172.16.100.1" port 5030)
The rke2-agent systemd service is dead.
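The OneGate error above is a plain TCP connection failure, so it can be reproduced from inside the node without the appliance scripts. This is a minimal sketch using only the Python standard library; the helper name `can_reach` is hypothetical, and the address and port are the ones taken from the log line above.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on "Network is unreachable",
        # "Connection refused", timeouts, and similar failures.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_reach("172.16.100.1", 5030)` is expected to return `False` on a node whose private network is isolated in the way described here.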
If the Private Network is routable (the easiest way is to hook both networks to the same Vnet):
The resolv.conf points to the network defined by the Vnet.
There are no errors in /var/log/one-appliance/configure.log: I, [2024-10-25T13:52:33.674020 #1449] INFO -- : Join storage: oneke-ip-172-16-100-4
RKE2 agent errors: Oct 25 13:54:42 oneke-ip-172-16-100-4 rke2[1552]: time="2024-10-25T13:54:42Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51308->127.0.0.1:64>
Steps to reproduce:
Install miniONE.
Import the OneKE 1.29 service.
Clone the default image that is used as the second disk for storage role nodes (typically: Service OneKE 1.29-storage-2-*-1, 10G in size).
Change the Service OneKE 1.29-storage-2 VM Template to use the cloned Image as the second disk instead of the default one.
Instantiate the service using your preferred method and make sure that the k8s environment is running as expected (enable traefik, longhorn, dns, route, NAT).
Scale the storage role by changing its cardinality to 1.
Result:
The storage VM can't finish its configuration and thus is not added to the cluster.
The service is stuck in the Scaling state.
Workaround:
You can resize the disk after the VM is up and set the desired value.
I've done some additional research based on my guess that the problem is a bit wider and might affect more than just the OneKE service. It seems to be related to FSunstone. Here is some evidence:
I've simplified the change: I just set the hypervisor to KVM (it's the minimal change you can make, and it is also mandatory if you edit the template). Even such a small change results in the storage role not being scaled properly and the service getting stuck.
I tried the same process of setting the hypervisor type using RSunstone, and the service scaled without any issues.
Next I did a few cross-checks:
Took the service that contains the template edited using RSunstone, instantiated it in FSunstone, and then scaled the storage role. Result: the storage node was added without issues.
Took the service that contains the template edited in FSunstone, instantiated it using RSunstone, and then scaled the storage role. Result: the storage node was misconfigured and thus not added to the k8s cluster.
If we compare the two templates of the same origin, a large part of CONTEXT is missing from the one edited in FSunstone.
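The CONTEXT comparison can also be done programmatically. This is a rough sketch under stated assumptions: both templates are exported as JSON (for example with `onetemplate show <id> --json`), and the CONTEXT section sits under `VMTEMPLATE` → `TEMPLATE` → `CONTEXT` in that output. The function name `missing_context_keys` is hypothetical.

```python
def missing_context_keys(reference: dict, edited: dict) -> list:
    """List CONTEXT attributes present in `reference` but absent in `edited`."""

    def context_of(template: dict) -> dict:
        # Assumed layout of the exported JSON: VMTEMPLATE -> TEMPLATE -> CONTEXT.
        ctx = template.get("VMTEMPLATE", {}).get("TEMPLATE", {}).get("CONTEXT", {})
        return ctx if isinstance(ctx, dict) else {}

    ref_ctx = context_of(reference)
    edited_ctx = context_of(edited)
    return sorted(key for key in ref_ctx if key not in edited_ctx)
```

Loading the two exports with `json.load` and passing the template edited in RSunstone as `reference` and the one edited in FSunstone as `edited` would list the attributes that were dropped.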