Scale Out PoC

This document tracks the milestones, tasks, and ideas related to scaling out Arktos.

Tasks

Milestone 1 (one tenant partition, one resource partition, API gateway)

API gateway:

  • Set up a proxy to statically route requests: 1) requests for "nodes" and "nodeLeases" are redirected to the resource partition; 2) everything else is redirected to the tenant partition (see the Go sketch after this list)
  • Workaround: redirect some requests (for common system-level URLs) based on client IP
  • Switch to HAProxy if the Nginx watch issue blocks or fails the scalability test
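
A minimal Go sketch of the static routing rule in the first item above, using only the standard library. The partition endpoints and listen port are placeholders, and TLS handling is left out; this is a sketch of the idea, not the actual gateway implementation.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Placeholder endpoints; a real deployment would read these from config.
	tpURL, _ := url.Parse("http://tenant-partition-apiserver:8080")
	rpURL, _ := url.Parse("http://resource-partition-apiserver:8080")

	tpProxy := httputil.NewSingleHostReverseProxy(tpURL)
	rpProxy := httputil.NewSingleHostReverseProxy(rpURL)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Node and node-lease traffic goes to the resource partition;
		// everything else goes to the tenant partition.
		p := r.URL.Path
		if strings.Contains(p, "/nodes") || strings.Contains(p, "/leases") {
			rpProxy.ServeHTTP(w, r)
			return
		}
		tpProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```

Watch requests are long-lived streaming responses, which is why proxy timeouts and buffering (the Nginx watch issue above) matter for this component.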

Build & Test Scripts:

  • Have simple scripts to launch a tenant partition, a resource partition, and an API gateway.
  • Do we need the nodeIPAM controller?
  • Update Kubemark and the perf tests to work with the scale-out design.
  • The service token issue in the NodeLifecycle controller?
  • Create a new branch for the scale-out PoC
  • Prometheus settings for multiple API servers
  • Fix issues that emerged during test runs
    • timeout errors during pod deletion
    • pod creation errors starting from 5K (possibly due to the proxy?)

Investigation:

  • Why is there a huge latency delay? (Addressed by adding HTTP 1.1)
  • Intermittent timeouts for watch requests on the proxy
  • Manual e2e tests for key flows
  • Scheduler: node informer using a new client
  • Kubelet: node and lease updates using a new client (see the sketch below)
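
A minimal sketch of the "new client" idea in the last two items, written against vanilla client-go: build one clientset from a tenant-partition kubeconfig and a second one from a resource-partition kubeconfig, and back the node informer with the RP client while pods and everything else stay on the TP client. The kubeconfig paths and resync period are assumptions; the actual wiring in the Arktos scheduler and kubelet may differ.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig locations for the two partitions.
	tpCfg, err := clientcmd.BuildConfigFromFlags("", "/etc/arktos/tenant-partition.kubeconfig")
	if err != nil {
		panic(err)
	}
	rpCfg, err := clientcmd.BuildConfigFromFlags("", "/etc/arktos/resource-partition.kubeconfig")
	if err != nil {
		panic(err)
	}

	tpClient := kubernetes.NewForConfigOrDie(tpCfg) // pods, services, ... live in the tenant partition
	rpClient := kubernetes.NewForConfigOrDie(rpCfg) // nodes and node leases live in the resource partition

	stop := make(chan struct{})

	// Node informer backed by the resource-partition client.
	rpFactory := informers.NewSharedInformerFactory(rpClient, 30*time.Second)
	nodeLister := rpFactory.Core().V1().Nodes().Lister()

	// Pod informer (and the rest) stay on the tenant-partition client.
	tpFactory := informers.NewSharedInformerFactory(tpClient, 30*time.Second)
	podLister := tpFactory.Core().V1().Pods().Lister()

	rpFactory.Start(stop)
	tpFactory.Start(stop)

	_, _ = nodeLister, podLister // the scheduler/kubelet would consume these listers
	<-stop
}
```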

Milestone 2 (Multiple tenant partitions, one resource partition)

  • Proxy configuration changes to support multiple tenant partitions

  • HAProxy configuration

  • Cherry-pick Kubemark & setup changes from the m1 branch

  • Kubelet changes

    • Use multiple kube clients to watch all tenant partitions (see the Go sketch after this sub-list)
    • Pods keep crashing and restarting
    • Use the right kube client to update pod status
    • Use the right kube client for pod deletion
    • Use the right kube client for PVs
    • Fix the multi-tenancy bugs in the StatefulSet controller
    • Bug fix: always use the RP client for all node status & lease updates
    • Bug fix: concurrent map access error. Verify it is the same issue on the master branch; cherry-pick all recent improvements if it is.
    • How to handle resources in the system space, which can come from any tenant partition
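
A rough sketch, in vanilla client-go terms, of the "use the right kube client" items above: the kubelet keeps one client per tenant partition and picks the owning partition's client before writing a pod's status. The tenant lookup helper and the fallback to a "system" client are assumptions for illustration, and Arktos' tenant-scoped client methods would replace the plain Pods() accessor used here.

```go
package kubeletsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tpClients maps a tenant name to the clientset of the tenant partition that
// owns it. How the kubelet learns this mapping (static config, gateway
// lookup, ...) is one of the open questions in this milestone.
type tpClients map[string]kubernetes.Interface

// updatePodStatus writes the pod status through the client of the tenant
// partition that owns the pod, instead of a single global client.
// tenantOf is a hypothetical helper; in Arktos the tenant is carried in the
// object metadata.
func updatePodStatus(ctx context.Context, clients tpClients, pod *v1.Pod, tenantOf func(*v1.Pod) string) error {
	client, ok := clients[tenantOf(pod)]
	if !ok {
		// Resources in the system space can come from any tenant partition;
		// falling back to a designated "system" client is just one option.
		client = clients["system"]
	}
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```
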
  • Node controller changes

    • Use multiple kube clients to watch pods/daemonSets (see the sketch below)
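
A companion sketch for the node controller item above: start one pod informer per tenant-partition client and funnel all events into the same callback, so a controller running against the resource partition still sees pods from every tenant partition. Again vanilla client-go, with placeholder names; daemonSets would follow the same pattern.

```go
package nodecontrollersketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startPodInformers starts one pod informer per tenant-partition client and
// routes every add/update event to the same handler.
func startPodInformers(tpClients []kubernetes.Interface, onPod func(*v1.Pod), stop <-chan struct{}) {
	for _, client := range tpClients {
		factory := informers.NewSharedInformerFactory(client, 30*time.Second)
		factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				if pod, ok := obj.(*v1.Pod); ok {
					onPod(pod)
				}
			},
			UpdateFunc: func(_, newObj interface{}) {
				if pod, ok := newObj.(*v1.Pod); ok {
					onPod(pod)
				}
			},
		})
		factory.Start(stop)
	}
}
```
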
  • Kubemark component changes to support multiple clients

  • Test setup & script changes

    • Dynamically configure proxy with TP/RP IPs
    • Hollow nodes talk to multiple TPs
    • Create two tenants on TP1 and TP2 in the test deployment script
    • Change test to create workloads under a non-system tenant for density test
    • Pods stuck in "terminating" status when the test tries to delete them
    • Change test to create workloads under a non-system tenant for load test
    • Two parallel perf-tests for different tenants, supporting system space requests
    • Verify whether the tests require system resource access
    • Update wiki for scale-out test setup
  • Investigation

    • "endpoint not found" error in resource partition

Future Milestones

  • Kubectl: design support for static pods
  • Kubectl: use the right client to send events to different tenant partitions and resource partitions
  • TP API Server needs to get node objects for commands like "kubectl attach/log/.."
  • Update node controller to use a separate client for objects in tenant partition
  • Improve Kubemark to simulate more than 20K nodes.
  • Change the proxy (API gateway) to use a dynamic map lookup instead of static configuration (see the sketch after this list).
  • Reduce the number of watch connections on a tenant partition.
  • One scheduler instance can't cache an unlimited number of nodes; it needs optimizations like node preferences.
  • Scheduler sampling to improve scheduling throughput.
  • Reduce scheduling conflicts (still node preference?)
  • Some controllers in tenant partitions also cache all nodes; they need changes as well.
  • The current node controller caches all pods and StatefulSets.
  • Try to avoid aggregated watches on the API gateway; does "node.podList" help?
  • Do we still need to coordinate etcd revisions among partitions? Probably not, but this needs more detailed analysis and some changes.
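
A sketch of what the "dynamic map lookup" item above could look like in Go: a concurrency-safe tenant-to-partition routing table that the gateway consults per request instead of a static config file. How entries get populated and refreshed (a registry, a watch, an admin API) is not designed yet, and all names here, including the tenantFromPath helper, are placeholders.

```go
package gatewaysketch

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

// routingTable maps a tenant name to a reverse proxy for the tenant
// partition that owns it.
type routingTable struct {
	mu       sync.RWMutex
	byTenant map[string]*httputil.ReverseProxy
	fallback *httputil.ReverseProxy // e.g. for system-space requests
}

// set registers (or replaces) the partition endpoint that serves a tenant.
func (t *routingTable) set(tenant, endpoint string) error {
	u, err := url.Parse(endpoint)
	if err != nil {
		return err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.byTenant == nil {
		t.byTenant = map[string]*httputil.ReverseProxy{}
	}
	t.byTenant[tenant] = httputil.NewSingleHostReverseProxy(u)
	return nil
}

// route forwards the request to the partition owning the tenant, falling
// back to the default proxy when the tenant is unknown. tenantFromPath is a
// hypothetical helper that extracts the tenant from the request URL.
func (t *routingTable) route(w http.ResponseWriter, r *http.Request, tenantFromPath func(string) string) {
	t.mu.RLock()
	proxy, ok := t.byTenant[tenantFromPath(r.URL.Path)]
	if !ok {
		proxy = t.fallback
	}
	t.mu.RUnlock()
	proxy.ServeHTTP(w, r)
}
```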

Links

M2 Architecture Diagram

How to setup scale-out test