OneKE - Storage nodes misconfigured if using custom or cloned Image #6809

alpeon opened this issue Oct 25, 2024 · 3 comments

alpeon commented Oct 25, 2024

Bug Info

Description:

If the VM Template of a storage node is altered to use a custom Image (or a clone of the original image) as the second mounted drive, the VM never finishes its configuration and does not join the k8s cluster.

```
 2/3 Configuration step is in progress...

 * * * * * * * *
 * PLEASE WAIT *
 * * * * * * * *
```

Some of the symptoms:

  • resolv.conf is different from the one on a healthy node
  • the RKE2 agent tries to query itself
  • the rke2-agent configuration also differs from a healthy node's: the server entry is missing its host (server: https://:9345) — see the inspection sketch below
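
A minimal way to inspect these symptoms on a suspect node (the paths are the stock RKE2 locations; the healthy-node address below is just an illustration, not a value from this issue):

```sh
# DNS configuration - compare against a healthy node
cat /etc/resolv.conf

# rke2-agent configuration - on the broken node the server URL has no host:
#   server: https://:9345
# while on a healthy node it points at a control-plane address, e.g.
#   server: https://172.16.100.2:9345
cat /etc/rancher/rke2/config.yaml
```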

Affected OneKE versions: both 1.27 and 1.29.

Important note

Please note that the behaviour differs depending on whether the second (private) network is isolated or not, but the end result is the same: the storage node is misconfigured and does not join the cluster!

If the Private Network is isolated:

  • resolv.conf is missing on the damaged node, while it exists on a healthy node
  • /var/log/one-appliance/configure.log contains OneGate communication errors: Failed to open TCP connection to 172.16.100.1:5030 (Network is unreachable - connect(2) for "172.16.100.1" port 5030)
  • the rke2-agent systemd service is dead (see the connectivity check below)
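
A quick way to confirm both symptoms from the storage node (a sketch; 172.16.100.1:5030 is the OneGate endpoint from the log above, so substitute your own):

```sh
# Check TCP reachability of the OneGate endpoint
nc -zv 172.16.100.1 5030

# Confirm the agent is dead and pull the OneGate errors from the log
systemctl status rke2-agent
grep -i onegate /var/log/one-appliance/configure.log
```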

If the Private Network is routable (easiest way: hook both networks to the same VNet):

  • resolv.conf points to the network defined by the VNet
  • no errors in /var/log/one-appliance/configure.log: I, [2024-10-25T13:52:33.674020 #1449] INFO -- : Join storage: oneke-ip-172-16-100-4
  • rke2-agent errors (log line truncated as captured; see the journal sketch below): Oct 25 13:54:42 oneke-ip-172-16-100-4 rke2[1552]: time="2024-10-25T13:54:42Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51308->127.0.0.1:64>
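
The agent errors above can be pulled straight from the journal (a sketch using the stock rke2-agent unit name):

```sh
# Show the repeating CA-cert fetch failures from the rke2-agent unit
journalctl -u rke2-agent --no-pager | grep "failed to get CA certs"
```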

Steps to reproduce:

  1. Install miniONE.
  2. Import the OneKE 1.29 service.
  3. Clone the default image used as the second disk for storage role nodes (typically named Service OneKE 1.29-storage-2-*-1, 10G in size).
  4. Change the Service OneKE 1.29-storage-2 VM Template to use the cloned Image as the second disk instead of the default one.
  5. Instantiate the service using your preferred method and make sure the k8s environment is running as expected (enable Traefik, Longhorn, DNS, route, NAT).
  6. Scale the storage role by changing its cardinality to 1 (a CLI sketch of steps 3, 4 and 6 follows this list).
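
Steps 3, 4 and 6 can also be done from the CLI; a rough sketch with placeholder IDs (the image ID, template ID, service ID and clone name are assumptions, not values from this issue):

```sh
# 3. Clone the default storage-disk image (10 is a placeholder image ID)
oneimage clone 10 "Service OneKE 1.29-storage-2-clone"

# 4. Edit the storage VM Template to reference the cloned image
#    (opens an editor; swap the IMAGE/IMAGE_ID in the second DISK section)
onetemplate update <storage_template_id>

# 6. Scale the storage role to cardinality 1
oneflow scale <service_id> storage 1
```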

Result:

The storage VM can't finish its configuration and is therefore not added to the cluster.
The service is stuck in the Scaling state.

Workaround:

You can resize the disk after the VM is up and set it to the desired value (see the sketch below).
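
A sketch of the workaround from the CLI, assuming the cloned image is attached as disk ID 1 on the affected VM (both IDs are placeholders; the size argument is in MB):

```sh
# Grow the second disk of the stuck storage VM back to 10G (10240 MB)
onevm disk-resize <vm_id> 1 10240
```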

rsmontero (Member) commented

This seems to be related to wrong handling of some vars in OneFlow. @alpeon, can you attach the associated OneFlow document?

alpeon (Author) commented Nov 1, 2024

Hi @rsmontero, please find the .json export attached.

fsun-3.json

I've conducted some additional research based on my guess that the problem is wider and might affect more than just the OneKE service, and it seems to be somehow related to FSunstone. Here is some evidence:

I've simplified the change: I just set the hypervisor to KVM (it's the minimal change you can make, and also a mandatory one when you edit the template). Even such a small change results in the storage role not being scaled properly and the service getting stuck.

I tried performing the same process of setting the hypervisor type using RSunstone, and the service scaled without any issues.

Next I did a few cross-checks:

  • Took the service containing the template edited in RSunstone, instantiated it in FSunstone, and then scaled the storage role. Result: the storage node was added without issues.
  • Took the service containing the template edited in FSunstone, instantiated it in RSunstone, and then scaled the storage role. Result: the storage node was misconfigured and thus not added to the k8s cluster.

If we compare the two templates of the same origin, a huge part of the CONTEXT section is missing from the FSunstone one (a diff sketch follows the attached files):

Template after the edit in FSunstone:
fsun-storage-tmpl.json

Template after edit in RSunstone:
ogsun-storage-tmpl.json

Original template:
defaults-storage-tmpl.json
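
A quick way to compare just the CONTEXT sections of the exports (a sketch over the attached files; jq is a generic JSON tool here, and the recursive descent avoids guessing the exact wrapper key of the export):

```sh
# Extract the CONTEXT section from each exported template and diff them
jq '.. | .CONTEXT? // empty' ogsun-storage-tmpl.json > rsun-context.json
jq '.. | .CONTEXT? // empty' fsun-storage-tmpl.json > fsun-context.json
diff rsun-context.json fsun-context.json
```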

If the VM Template is only cloned there are no issues; the problem appears only after editing and saving the template in FSunstone.

rsmontero (Member) commented

Thanks @alpeon, I'm moving this to the OpenNebula repo to look into the FSunstone issue there.

Thank you for the detailed description and log files ❤️
