
SPIKE: EKS discovery #2609

Closed
5 tasks done
Tracked by #316
T-Kukawka opened this issue Jun 21, 2023 · 9 comments
Labels
provider/cluster-api-aws Cluster API based running on AWS team/phoenix Team Phoenix

Comments

@T-Kukawka
Contributor

T-Kukawka commented Jun 21, 2023

The first step to achieving EKS deployments integrated with GS is the discovery phase. In this short SPIKE we want to discover how easy it is to deploy EKS clusters, what is missing from our setup, and what we have to build on top to integrate fully with our offering.

Tasks

@calvix

calvix commented Jun 28, 2023

Issues that I encountered when trying to create a proper EKS cluster with CAPI:

  • chart-operator cannot be scheduled on the EKS cluster because there is no master/control-plane node, and the chart-operator pod has a nodeSelector that pins it to a master node (see the sketch after this list)
  • chart-operator tries to run in bootstrap mode, which means hostNetwork and connecting to the API via localhost - that obviously doesn't work, as EKS has no master nodes and the API is not reachable that way
  • due to these issues, no apps can be installed
  • if we use a HelmRelease to install the CNI, we could run chart-operator in normal mode like in vintage, but that means Cilium needs to be able to connect to the API, and that won't work as dns-operator-aws currently only reconciles AWSCluster, so there is no DNS record pointing to the EKS API ELB
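
For illustration, a minimal sketch of the kind of control-plane pinning and bootstrap-mode settings that break on EKS (the manifest below is an assumption for illustration, not copied from the actual chart-operator chart):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chart-operator
  namespace: giantswarm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chart-operator
  template:
    metadata:
      labels:
        app: chart-operator
    spec:
      # On EKS no node carries this role label, so the pod can never be scheduled.
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      # Bootstrap mode: talk to the API server over localhost on the host network,
      # which assumes the pod runs on a control-plane node - impossible on EKS.
      hostNetwork: true
      containers:
        - name: chart-operator
          image: quay.io/giantswarm/chart-operator:latest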

@T-Kukawka
Contributor Author

As discussed with Product, we will move on with dns-operator-aws so we unblock the rest of the deployments.

@calvix

calvix commented Jun 28, 2023

so the next steps:

@calvix

calvix commented Jun 30, 2023

With further progress I found out that the solution with dns-operator-aws is not really compatible with our custom DNS. api.clusterID.basedomain won't work because EKS does not allow adding an extra CN/SAN to the API server SSL certificate, meaning we have to use the EKS endpoint URL everywhere, including Cilium.

So I got inspired by the solution in cluster-cloud-director, which uses a hook job that patches the HelmRelease resource afterwards - https://github.com/giantswarm/cluster-cloud-director/blob/main/helm/cluster-cloud-director/templates/update-values-hook-job.yaml

Instead, I wrote a Job that waits until the Cluster.Spec.ControlPlaneEndpoint value is populated and then creates a ConfigMap with the EKS endpoint value, which is consumed by the Cilium HelmRelease and passes the right value to the Helm chart.

This solved the Cilium app, and the cluster got networking up.
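
A minimal sketch of that Job, assuming a Flux HelmRelease that reads Cilium user values from a ConfigMap (resource names, namespace, and image are placeholders, not the exact cluster-eks implementation):

apiVersion: batch/v1
kind: Job
metadata:
  name: eks-endpoint-to-configmap
  namespace: org-example
spec:
  backoffLimit: 20
  template:
    spec:
      # Needs a ServiceAccount with RBAC to read Clusters and write ConfigMaps.
      serviceAccountName: eks-endpoint-to-configmap
      restartPolicy: OnFailure
      containers:
        - name: propagate-endpoint
          image: quay.io/giantswarm/kubectl:latest   # placeholder - any image with kubectl works
          command:
            - /bin/sh
            - -c
            - |
              # Wait until CAPI fills in the EKS API endpoint on the Cluster resource.
              until HOST=$(kubectl get cluster example -o jsonpath='{.spec.controlPlaneEndpoint.host}') && [ -n "$HOST" ]; do
                echo "controlPlaneEndpoint not set yet, retrying"
                sleep 10
              done
              # Publish it in a ConfigMap that the Cilium HelmRelease consumes via valuesFrom.
              kubectl create configmap example-cilium-user-values \
                --from-literal=values.yaml="k8sServiceHost: ${HOST}" \
                --dry-run=client -o yaml | kubectl apply -f -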

The next step was to make chart-operator work - in cluster-apps-operator I added a single condition that checks whether the cluster is EKS and, in that case, sets values for chart-operator to run in normal mode - giantswarm/cluster-apps-operator#363
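
Conceptually, that condition just turns off the bootstrap/control-plane-specific settings for chart-operator; a purely illustrative sketch of the resulting values (the keys below are assumptions, not the real cluster-apps-operator output):

# Illustrative only - key names are assumptions, not the actual schema.
chartOperator:
  bootstrapMode:
    enabled: false   # no hostNetwork / localhost API access on EKS
  controlPlanePinning:
    enabled: false   # EKS exposes no control-plane nodes to schedule on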

So the result is an EKS cluster created from the cluster-eks repo, with a working CNI, and chart-operator can run and install additional apps. There is still an issue with adopting the coredns app, which does not work for some reason. Some other non-important default apps might be failing as well.

I updated the TODO list; the PRs need reviews.

@calvix

calvix commented Jul 7, 2023

The etcd-kubernetes-resources-count-exporter app in default-apps-aws does not make sense for EKS, as etcd is not exposed, so it should not be there. We might need to introduce a mechanism to disable certain apps from the bundle.
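
Such a mechanism could, for example, be a per-app enable flag in the bundle's user values; the key below is purely hypothetical and does not exist in default-apps-aws today:

# Hypothetical user-values override for the default-apps-aws bundle.
apps:
  etcd-kubernetes-resources-count-exporter:
    enabled: false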

@calvix

calvix commented Jul 7, 2023

current state:

 #: kk get no
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-147-56.eu-west-2.compute.internal   Ready    worker   19m   v1.24.13-eks-0a21954
ip-10-0-4-202.eu-west-2.compute.internal    Ready    worker   19m   v1.24.13-eks-0a21954
ip-10-0-43-122.eu-west-2.compute.internal   Ready    worker   19m   v1.24.13-eks-0a21954
[local] _GS/cluster-eks/helm/cluster-eks | git: add-corends-adopter | k8s: vac0eks-capi-admin@vac0eks
 #: kk get po
NAME                                                            READY   STATUS              RESTARTS   AGE
aws-pod-identity-webhook-app-57df4547fc-2s2ns                   0/1     ContainerCreating   0          7m15s
aws-pod-identity-webhook-app-57df4547fc-nqnk6                   0/1     ContainerCreating   0          7m15s
aws-pod-identity-webhook-app-57df4547fc-wcggs                   0/1     ContainerCreating   0          7m15s
aws-pod-identity-webhook-restarter-28145685-fwhlh               0/1     Completed           0          3m10s
capi-node-labeler-c6pwj                                         1/1     Running             0          9m38s
capi-node-labeler-lr7h9                                         1/1     Running             0          9m38s
capi-node-labeler-nw65w                                         1/1     Running             0          9m38s
cert-exporter-daemonset-2p9st                                   1/1     Running             0          9m38s
cert-exporter-daemonset-pblkw                                   1/1     Running             0          9m38s
cert-exporter-daemonset-tjlt5                                   1/1     Running             0          9m38s
cert-exporter-deployment-5d9dc8cd46-fqv2b                       1/1     Running             0          9m38s
cert-manager-cainjector-9cc6b5c4d-f66fl                         1/1     Running             0          6m11s
cert-manager-controller-7c688c6db-44zdt                         1/1     Running             0          6m11s
cert-manager-webhook-6c5ddc7678-j6lmz                           1/1     Running             0          6m11s
cert-manager-webhook-6c5ddc7678-mw4c5                           1/1     Running             0          6m11s
cilium-cfhtx                                                    1/1     Running             0          11m
cilium-hl56d                                                    1/1     Running             0          11m
cilium-n25st                                                    1/1     Running             0          11m
cilium-operator-f6ffd6cc4-42j4d                                 1/1     Running             0          11m
cilium-operator-f6ffd6cc4-wck9j                                 1/1     Running             0          11m
coredns-adopter-g2ddb                                           0/1     Completed           0          20m
coredns-controlplane-56565947c4-z48pf                           1/1     Running             0          8m45s
coredns-workers-6858cbbdbb-27jq2                                1/1     Running             0          8m44s
coredns-workers-6858cbbdbb-nf5mt                                1/1     Running             0          8m45s
etcd-kubernetes-resources-count-exporter-674c49c5b9-z8wmk       0/1     Pending             0          10m
external-dns-5d6545fd79-mp792                                   1/1     Running             0          41s
hubble-relay-88c84b559-rpshj                                    1/1     Running             0          11m
hubble-ui-5c7475d499-2wl89                                      2/2     Running             0          11m
metrics-server-75f9465bdc-db7jd                                 1/1     Running             0          10m
metrics-server-75f9465bdc-pm44g                                 1/1     Running             0          10m
net-exporter-9jpfn                                              1/1     Running             0          7m16s
net-exporter-f4c2j                                              1/1     Running             0          7m16s
net-exporter-wv4rc                                              1/1     Running             0          7m16s
node-exporter-node-exporter-6kqc5                               1/1     Running             0          7m17s
node-exporter-node-exporter-drwjc                               1/1     Running             0          7m17s
node-exporter-node-exporter-z2lcg                               1/1     Running             0          7m17s
prometheus-operator-app-kube-state-metrics-d7f4ff68d-x6lc8      1/1     Running             0          7m5s
prometheus-operator-app-operator-5474bb6778-gwxjq               1/1     Running             0          42s
vertical-pod-autoscaler-admission-controller-754fd4b4b9-jjqg6   1/1     Running             0          6m49s
vertical-pod-autoscaler-admission-controller-754fd4b4b9-pltxs   1/1     Running             0          6m49s
vertical-pod-autoscaler-recommender-5cf6659f65-sjn5s            1/1     Running             0          6m49s
vertical-pod-autoscaler-updater-5b445bbb66-rbjhc                1/1     Running             0          6m49s

The cluster is created and available to use; all default apps (except the etcd exporter) are running.

@calvix

calvix commented Jul 7, 2023

The current state is that the cluster can be created and is usable for basic workloads, but there is no ingress.

Ingress - for that, we need proper DNS

TODO for DNS:

  • reconcile AWSManagedControlPlane in dns-operator-aws and create a hosted zone + ingress DNS record
  • reconcile AWSManagedCluster in irsa-operator
  • reconcile AWSManagedControlPlane in capa-iam-operator in order to create an IAM role for the external-dns app, which is necessary for the creation of the ingress DNS record

Other stories that would involve other teams:

  • EKS kubeconfig and automation of opsctl login - a proper kubeconfig needs AWS credentials in order to work (there is a very short-lived kubeconfig in the CAPI MC that can be used for short debugging sessions, but it's not feasible for normal work; see the sketch after this list)
  • check if monitoring works
  • CI tests for cluster-eks
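
As a reference for the kubeconfig item above, a minimal sketch of an EKS kubeconfig that uses short-lived tokens from the AWS CLI (cluster name, region, endpoint, and CA are placeholders):

apiVersion: v1
kind: Config
clusters:
  - name: vac0eks
    cluster:
      server: https://EXAMPLE.gr7.eu-west-2.eks.amazonaws.com
      certificate-authority-data: <base64-encoded-CA>
contexts:
  - name: vac0eks
    context:
      cluster: vac0eks
      user: vac0eks-admin
current-context: vac0eks
users:
  - name: vac0eks-admin
    user:
      exec:
        # Requires valid AWS credentials in the environment; the returned token is short-lived.
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args: ["eks", "get-token", "--cluster-name", "vac0eks", "--region", "eu-west-2"]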

More questions:

  • EKS-based MC?
  • should we still use provider: capa for this, or should there be a new provider: eks?

@calvix

calvix commented Aug 4, 2023

Once the basic functionality is achieved with all the above PRs, we should do the following next:

  • opsctl login and kubectl gs login working for EKS cluster

With that, we can open EKS up for other teams to start testing their components, such as monitoring, CI, and managed apps.

@calvix

calvix commented Aug 22, 2023

Done.
Documentation for EKS clusters can be found here: https://intranet.giantswarm.io/docs/dev-and-releng/capi-eks-internal-doc/

@calvix calvix closed this as completed Aug 22, 2023