Question/issue around Talos bootstrap with Cluster API & vSphere infrastructure (CAPV) #179

Open
julien-sugg opened this issue Oct 20, 2023 · 4 comments

@julien-sugg

Greetings,

We've been playing with Talos Linux and Cluster API to automate the management of our clusters, and are currently facing some questions/issues around the bootstrap process using the vSphere infrastructure provider.

Versions / Environment

  • Kubernetes: 1.27.5
  • Talos: 1.5.2 (OVA)
  • Cluster API Infrastructure: vSphere 1.8.1
  • Cluster API Bootstrap: Talos 0.6.2
  • Cluster API CP: Talos 0.5.3
  • VMware ESXi 7.0.3

Description

According to the Talos - VMware documentation, we have to install a custom talos-vmtools with some dedicated Talos config.

This makes total sense; however, my concern is the following:

To bootstrap the cluster via Cluster API, and especially via the CACPPT controller, the CAPV controller needs to retrieve the VM's IP address through the vCenter API. However, that IP is only reported once the VMTools are installed and configured. Unfortunately, installing the VMTools requires the Talos bootstrap to be done first, because they are deployed as a DaemonSet. This leaves us with a chicken-and-egg problem.

Our current workaround is to manually bootstrap the cluster via the IP addresses provided by the DHCP. However, this is quite a pain as we wish to automate everything via GitOps since we will manage quite a lot of permanent clusters, but also some ephemeral ones.
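
For readers who want to try the same manual workaround, here is a minimal sketch of what it might look like. It only builds the `talosctl` command for a node IP taken from the DHCP leases; the IP, the talosconfig path, and the `<cluster-name>-talosconfig` secret name mentioned in the comments are assumptions, not details confirmed in this issue.

```shell
#!/bin/sh
# Sketch of the manual bootstrap workaround (dry run: prints the command only).
# Assumption: the node IP comes from the DHCP server's lease table, since
# vCenter does not report it before VMTools is running.

# Print the talosctl invocation that would bootstrap etcd on one control plane node.
#   $1 = node IP (from DHCP), $2 = path to talosconfig (default: ./talosconfig)
bootstrap_cmd() {
  node_ip="$1"
  talosconfig="${2:-./talosconfig}"
  printf 'talosctl --talosconfig %s --endpoints %s --nodes %s bootstrap\n' \
    "$talosconfig" "$node_ip" "$node_ip"
}

# Usage (hypothetical values; the secret name is an assumption):
#   kubectl get secret -n cluster-api-system observability-cluster-poc-talosconfig \
#     -o jsonpath='{.data.talosconfig}' | base64 -d > talosconfig
#   eval "$(bootstrap_cmd 172.30.11.21 ./talosconfig)"
```

The bootstrap call should be issued against a single control plane node only.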

Do you have any insights or recommendations for achieving this within the VMware ecosystem?

Reproduce Steps

The following steps can be performed to easily reproduce the issue:

  1. Create a transient cluster that will be used to spawn the first permanent management cluster via Cluster API.

The cluster can either be created directly on vSphere or kind/k3d/...

  2. Initialize the Cluster API components on the transient cluster with clusterctl, installing CAPV, CABPT and CACPPT:
clusterctl init \
--infrastructure vsphere:v1.8.1 \
--bootstrap talos:v0.6.2 \
--control-plane talos:v0.5.3 \
--target-namespace cluster-api-system
  3. Create the permanent management cluster with the following minimal manifests:
---
apiVersion: v1
kind: Secret
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
stringData:
  password: REDACTED
  username: REDACTED
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  template:
    spec:
      configPatches:
      - op: add
        path: /machine/network
        value:
          interfaces:
          - dhcp: true
            dhcpOptions:
              routeMetric: 1
            interface: eth0
          - dhcp: true
            dhcpOptions:
              routeMetric: 10
            interface: eth1
      - op: add
        path: /machine/install
        value:
          extraKernelArgs:
          - net.ifnames=0
      - op: add
        path: /cluster/network/cni
        value:
          name: none
      - op: add
        path: /cluster/proxy
        value:
          disabled: true
      - op: add
        path: /machine/features/kubePrism
        value:
          enabled: true
          port: 7445
      - op: replace
        path: /cluster/controlPlane
        value:
          endpoint: https://172.30.11.10:6443
      - op: add
        path: /machine/certSANs
        value:
        - 172.30.11.10
      - op: add
        path: /machine/time
        value:
          disabled: false
          servers:
          - 172.30.110.1
      - op: replace
        path: /cluster/extraManifests
        value:
        - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
      - op: add
        path: /machine/kubelet/extraArgs
        value:
          cloud-provider: external
      generateType: worker
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: observability-cluster-poc
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: observability-cluster-poc
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  clusterName: observability-cluster-poc
  replicas: 3
  selector:
    matchLabels: {}
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: observability-cluster-poc
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: observability-cluster-poc-md-0
      clusterName: observability-cluster-poc
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: observability-cluster-poc-worker
      version: v1.27.5
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneConfig:
    controlplane:
      configPatches:
      - op: add
        path: /machine/network
        value:
          interfaces:
          - dhcp: true
            dhcpOptions:
              routeMetric: 1
            interface: eth0
            vip:
              ip: 172.30.11.10
          - dhcp: true
            dhcpOptions:
              routeMetric: 10
            interface: eth1
      - op: add
        path: /machine/install
        value:
          extraKernelArgs:
          - net.ifnames=0
      - op: add
        path: /cluster/network/cni
        value:
          name: none
      - op: add
        path: /cluster/proxy
        value:
          disabled: true
      - op: add
        path: /machine/features/kubePrism
        value:
          enabled: true
          port: 7445
      - op: replace
        path: /cluster/controlPlane
        value:
          endpoint: https://172.30.11.10:6443
      - op: add
        path: /machine/certSANs
        value:
        - 172.30.11.10
      - op: add
        path: /cluster/coreDNS
        value:
          disabled: true
      - op: add
        path: /machine/time
        value:
          disabled: false
          servers:
          - 172.30.110.1
      - op: replace
        path: /cluster/extraManifests
        value:
        - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
      - op: add
        path: /machine/kubelet/extraArgs
        value:
          cloud-provider: external
      generateType: controlplane
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: observability-cluster-poc
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      maxSurge: 1
    type: RollingUpdate
  version: v1.27.6
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneEndpoint:
    host: 172.30.11.10
    port: 6443
  identityRef:
    kind: Secret
    name: observability-cluster-poc
  server: REDACTED
  thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
        - dhcp4: true
          dhcp4Overrides:
            routeMetric: 1
          networkName: PLATFORM-PRODUCTION-OBSERVABILITY
        - dhcp4: true
          dhcp4Overrides:
            routeMetric: 10
          networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      tagIDs:
      - urn:vmomi:InventoryServiceTag:0fe8eb41-7a8f-47b3-a9fe-0d288ec787dd:GLOBAL
      - urn:vmomi:InventoryServiceTag:4495a9ce-727a-4814-b067-682b52130cad:GLOBAL
      template: talos-linux-1.5.2
      thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc-worker
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
        - dhcp4: true
          dhcp4Overrides:
            routeMetric: 1
          networkName: PLATFORM-PRODUCTION-OBSERVABILITY
        - dhcp4: true
          dhcp4Overrides:
            routeMetric: 10
          networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      tagIDs:
      - urn:vmomi:InventoryServiceTag:0fe8eb41-7a8f-47b3-a9fe-0d288ec787dd:GLOBAL
      - urn:vmomi:InventoryServiceTag:4495a9ce-727a-4814-b067-682b52130cad:GLOBAL
      template: talos-linux-1.5.2
      thumbprint: REDACTED
  4. Once the VMs are created, confirm that the bootstrap cannot occur: VMTools cannot be installed before the bootstrap, and the bootstrap cannot proceed because the controller cannot reach the VMs, which have no IP addresses at the vCenter level.
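
One way to confirm the last step from the transient cluster is to list the Cluster API Machines that have no addresses yet. This is a rough sketch, assuming `kubectl` points at the transient cluster and the Machines live in the `cluster-api-system` namespace used above; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: list Cluster API Machines whose status carries no addresses.
# Assumption: kubectl's current context is the transient (management) cluster.

# Print the name of every Machine in namespace $1 with an empty address list.
machines_without_addresses() {
  ns="$1"
  kubectl get machines -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].address}{"\n"}{end}' |
    awk -F'\t' '$2 == "" { print $1 }'
}

# Usage:
#   machines_without_addresses cluster-api-system
```

While the problem is present, the control plane Machines stay in this list, which matches the "no addresses were found for node" error in the CACPPT logs below.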

Useful outputs/content

Talos console:

image

vSphere machine (no IP due to VMtools not being installable at this point in time):

image

CACPPT logs:

2023-10-20T06:56:47Z    INFO    reconcile TalosControlPlane     {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "be96027a-b052-4819-bd53-8215a326733f", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:47Z    INFO    controllers.TalosControlPlane   bootstrap failed, retrying in 20 seconds        {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-bzpgr\""}
2023-10-20T06:56:47Z    INFO    controllers.TalosControlPlane   attempting to set control plane status
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   failed to get kubeconfig for the cluster        {"error": "failed to create cluster accessor: error creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "errorVerbose": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\nerror creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:396\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:299\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:273\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/[email 
protected]/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:275\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/[email 
protected]/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   successfully updated control plane status       {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:57Z    INFO    reconcile TalosControlPlane     {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "2bb6e4b1-8a51-4c48-b463-eb6b0a915de8", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   bootstrap failed, retrying in 20 seconds        {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-bzpgr\""}
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   attempting to set control plane status

Thanks in advance for your help and insights.

@smira
Member

smira commented Oct 20, 2023

It was discussed in community Slack, but it didn't quite go that far.

VMware users would need to reimplement vmtoolsd as a Talos system extension (and an extension service); that way it always runs with the machine.

Another option is to make Talos itself report IPs, if we can do that without pulling in all the VMware libraries.

@sempex

sempex commented May 3, 2024

Hi everyone, I'm facing the same problem right now. Are there any updates, or instructions to follow to work around this?

@amaol-vestas

I'm also interested in seeing a fix for this issue, thanks.

@amaol-vestas

I found a way to deploy: create a Talos image with vmtoolsd installed by default using the Talos Image Factory, then use that image as the baseline template for the deployment. Please check https://factory.talos.dev/.
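
For anyone following that route, a minimal sketch of the Image Factory steps is below. The extension name `siderolabs/vmtoolsd-guest-agent`, the `vmware-amd64.ova` image name, and the example Talos version are assumptions that should be checked against the factory UI; the helper function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: request a Talos image with the VMware guest agent baked in.
# Assumptions: extension name, image file name, and the Talos version used
# in the usage notes are not confirmed in this thread.

# Write an Image Factory schematic that asks for the guest-agent extension.
#   $1 = output file
write_schematic() {
  cat > "$1" <<'EOF'
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/vmtoolsd-guest-agent
EOF
}

# Compose the OVA download URL for a schematic ID and a Talos version.
#   $1 = schematic ID, $2 = Talos version (e.g. v1.7.0)
ova_url() {
  printf 'https://factory.talos.dev/image/%s/%s/vmware-amd64.ova\n' "$1" "$2"
}

# Usage (needs network access; not executed here):
#   write_schematic schematic.yaml
#   curl -sS -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics
#   # the response should contain the schematic ID; then:
#   curl -LO "$(ova_url <schematic-id> v1.7.0)"
```

The resulting OVA can then be imported into vSphere and referenced by the `template` field of the VSphereMachineTemplate, so the guest agent is running before the bootstrap and vCenter can report the VM's IP.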
