# Multi Cluster Management

This repository contains manifests and scripts to bootstrap and manage multiple clusters with [Cluster API](https://github.com/kubernetes-sigs/cluster-api). Currently only AWS clusters are supported, but more types (EKS and GCP) will be added later.

# Tech Stack

The cluster definitions are managed declaratively in a GitOps way using [FluxCD](https://fluxcd.io/). The Flux repository structure follows the ["Repo per team"](https://fluxcd.io/docs/guides/repository-structure/#repo-per-team) approach.

The deploy process follows the ["Bootstrap & Pivot"](https://cluster-api.sigs.k8s.io/clusterctl/commands/move.html) approach, with an initial temporary management cluster running on `kind`.
Flux manifests are installed on each workload cluster using the CAPI `ClusterResourceSet` feature (although this feature may be deprecated in the future). The Flux manifests are pre-generated and packaged as CRS ConfigMaps, and Flux runs in read-only mode (the deploy key does not have write permissions).
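As a sketch, a `ClusterResourceSet` that ships pre-generated Flux manifests to matching workload clusters might look like this (the resource names and the selector label are illustrative, not the exact ones used in this repo):

```yaml
apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: flux-addon              # illustrative name
  namespace: cluster-01
spec:
  clusterSelector:
    matchLabels:
      flux: enabled             # illustrative label, set on the Cluster objects
  resources:
    - kind: ConfigMap
      name: flux-manifests      # ConfigMap holding the pre-generated Flux YAML
  strategy: ApplyOnce           # apply once, do not reconcile drift
```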

The CNI of choice is `cilium`, and it is installed by a script after the workload cluster is bootstrapped by CAPI. This is because the installation requires the API server address, which in the current setup is only known at runtime.
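To illustrate why this is a runtime step: CAPI stores each workload cluster's kubeconfig in a `<cluster-name>-kubeconfig` secret, so the API server endpoint can only be read once the cluster exists. A minimal sketch of the parsing step (the `extract_kas_endpoint` helper and the sample kubeconfig are illustrative, not taken from the repo's script):

```shell
# Hypothetical helper: given a kubeconfig on stdin (in the real flow, fetched with
#   kubectl get secret "${CLUSTER}-kubeconfig" -o jsonpath='{.data.value}' | base64 -d
# ), print the host:port of the first API server entry. Cilium's Helm values
# (k8sServiceHost / k8sServicePort) would then be filled from this endpoint.
extract_kas_endpoint() {
  grep -m1 'server:' | sed -E 's#.*server: https://##'
}

sample_kubeconfig='
clusters:
- cluster:
    server: https://10.0.0.5:6443
'
printf '%s' "$sample_kubeconfig" | extract_kas_endpoint   # -> 10.0.0.5:6443
```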

Other projects used:

* [Kong OSS k8s ingress controller](https://docs.konghq.com/kubernetes-ingress-controller/)
* [Kubernetes Cluster Federation](https://github.com/kubernetes-sigs/kubefed/)

This is not a complete production-ready pattern, but rather an iterative approach to get from distinct quick-start guides to a state where multiple technologies are integrated to achieve powerful multi-cluster workflows and patterns.
# Installation

Deviations from the quick starts are improvements around the least privilege principle, a preference for an "as Code" approach over CLI commands, cost optimization, etc.
## One Time Setup

Create a CAPI IAM user. This enforces the least privilege principle and makes it possible to audit CAPI requests separately.
Refer to [aws/README.md](aws/README.md) for more details on what is required for the initial AWS setup.

My previous experiment with a permanent management cluster bootstrapped by kOps, with workload clusters applied by FluxCD: https://github.com/olga-mir/k8s/releases/tag/v0.0.1
Set up the workload clusters config as described in [config/README.md](config/README.md). Workload clusters can be added and removed on the go; they don't need to exist before running the deploy script.

More details on the deploy process can be found in [docs/bootstrap-and-pivot.md](docs/bootstrap-and-pivot.md).

## Deploy

Deploy the permanent management cluster on AWS (using a temporary `kind` cluster and then pivoting):
```
./scripts/deploy.sh
```
Flux on the management cluster will apply the CAPI manifests that are currently present in the repo.

The [deploy.sh](./scripts/deploy.sh) script installs the full stack, kind -> AWS mgmt cluster -> dev cluster, including the relevant components: Flux, KubeFed, Kong and CAPI.
When the script is complete, run the workload cluster script to finalize the workload clusters: it installs Cilium (which is currently not delivered via CRS, due to the dynamic KAS address) and the Flux secret (WIP to eliminate this step).
Without arguments, this script discovers all workload clusters and performs all the necessary adjustments:
```
./scripts/workload-cluster.sh
```
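The discovery step can be sketched as follows (a guess at the logic, assuming one `./config/cluster-<num>.env` file per workload cluster; the real script may discover clusters differently):

```shell
# Sketch: discover workload clusters by listing their config env files.
# Assumes the layout ./config/cluster-<num>.env (one file per cluster).
discover_clusters() {
  local dir=${1:-./config}
  local f
  for f in "$dir"/cluster-*.env; do
    [ -e "$f" ] || continue     # no matches -> glob stays literal, skip it
    basename "$f" .env          # e.g. cluster-01
  done
}
```

Each discovered name can then be used to fetch that cluster's kubeconfig and apply the finishing steps to it.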

## Adding a new cluster

Hands free with just one command!

To add a new cluster, create a config env file for it by copying an existing file (`./config/cluster-<num>.env`) and modifying the values. This step is intentionally manual, as the script can't or shouldn't guess these values (or it is too difficult in bash, e.g. calculating the next CIDR).
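For illustration, such an env file could look like the following (all variable names and values here are hypothetical; see [config/README.md](config/README.md) for the real schema):

```shell
# hypothetical ./config/cluster-02.env -- names and values are illustrative
CLUSTER_NAME=cluster-02
AWS_REGION=ap-southeast-2
# each cluster needs a non-overlapping CIDR, which is why this file is edited by hand
POD_CIDR=10.245.0.0/16
KUBERNETES_VERSION=v1.24.0
```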

```
./scripts/workload-cluster.sh -n cluster-02
```

This will generate all the necessary files and add the cluster to the mgmt kustomization list. The changes are then pushed to the repo (example commit from the script: https://github.com/olga-mir/k8s-multi-cluster/pull/10/commits/92ee7e094881969736ed666a0e732f073ebc53c6), where Flux will apply them and CAPI will provision the cluster. Meanwhile, `./scripts/workload-cluster.sh` keeps waiting for the cluster to come up and finalizes the installation.

On the mgmt cluster:
```
% k get cluster -A
NAMESPACE      NAME           PHASE          AGE   VERSION
cluster-01     cluster-01     Provisioned    12m
cluster-02     cluster-02     Provisioning   60s
cluster-mgmt   cluster-mgmt   Provisioned    13m
```

# Cleanup

Always delete the `cluster` objects from the management cluster(s) first (`k delete cluster`), otherwise the cloud resources are not deleted.
To delete the clusters in the clean CAPI way:
```
% ./scripts/cleanup.sh
```
The script moves all cluster definitions, including the mgmt cluster (which at this point is hosted on the mgmt cluster itself), to the `kind` cluster and deletes them in parallel.

When the CAPI way is not working for some reason (bugs), you need to delete the AWS resources that make up the clusters manually to avoid charges:

* Delete the NAT gateway(s).
* Release the Elastic IP(s).
* Terminate the EC2 instances.
* Delete the VPC (only possible once the instances and NAT gateways in it are gone).

Resources usually follow the `<cluster-name>-<resource-type>` naming pattern, e.g. `mgmt-nat`, `mgmt-vpc`.

Alternatively, use the `./scripts/brutal-aws-cleanup.sh` script. It deletes everything it can find (NAT gateways, EIPs, EC2 instances and ELBs, but not VPCs) without checking whether the resources are related to the clusters in this project, so it is not recommended if there are other resources in the account.

# Resources

* https://www.weave.works/blog/manage-thousands-of-clusters-with-gitops-and-the-cluster-api