Scale Out PoC
Xiaoning Ding edited this page Nov 11, 2020 · 41 revisions
This document tracks the milestones, tasks and ideas related to scaling out Arktos.
API gateway:
- Set up a proxy to statically route requests: 1) requests for "nodes" and "nodeLeases" are redirected to the resource partition; 2) everything else is redirected to the tenant partition
- Workaround: redirect some requests (for common system-level URLs) based on client IP
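The static routing rule above can be sketched in Go as a pure function from request path to partition. This is only an illustration under stated assumptions, not the PoC's proxy configuration: the partition labels and the `routeFor` helper are hypothetical, and it assumes that "nodeLeases" refers to Lease objects in the `kube-node-lease` namespace under the `coordination.k8s.io` API group, as in upstream Kubernetes.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical partition labels; a real gateway would map these to backends.
const (
	resourcePartition = "resource-partition"
	tenantPartition   = "tenant-partition"
)

// routeFor returns the partition that should serve the given URL path.
// Assumptions (not confirmed by the source):
//   - node objects are served under /api/{version}/nodes[/...]
//   - node leases are served under
//     /apis/coordination.k8s.io/{version}/namespaces/kube-node-lease/leases[/...]
func routeFor(path string) string {
	seg := strings.Split(strings.Trim(path, "/"), "/")

	// Rule 1a: core-group node objects.
	if len(seg) >= 3 && seg[0] == "api" && seg[2] == "nodes" {
		return resourcePartition
	}
	// Rule 1b: node leases.
	if len(seg) >= 6 && seg[0] == "apis" && seg[1] == "coordination.k8s.io" &&
		seg[3] == "namespaces" && seg[4] == "kube-node-lease" && seg[5] == "leases" {
		return resourcePartition
	}
	// Rule 2: everything else goes to the tenant partition.
	return tenantPartition
}

func main() {
	for _, p := range []string{
		"/api/v1/nodes/node-1",
		"/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-1",
		"/api/v1/namespaces/default/pods",
	} {
		fmt.Printf("%s -> %s\n", p, routeFor(p))
	}
}
```

Matching on the position of the path segment, rather than on any occurrence of the word "nodes", avoids misrouting a namespace or object that happens to be named "nodes".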
Build&Test Scripts:
- Have simple scripts to launch a tenant partition, a resource partition and an API gateway.
- For tenant partition, disable node controller and workload controller manager, add deployment and replicaSet controllers into controller manager (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/)
- For resource partition, disable scheduler, workload controller manager and all other controllers except node lifecycle controller
- Do we need nodeIPAM controller?
- Update Kubemark and perf test to work with the scale-out design.
- The service token issue in NodeLifecycle controller?
- A new branch for scale-out PoC
- Prometheus settings for multi API Servers
Investigation:
- Why is there such high latency? (Added HTTP 1.1)
- Intermittent timeout for watch requests on proxy
- Nodes sometimes become not ready
- Manual e2e tests for key flows
- scheduler: node informer using a new client
- kubelet: node and lease update using a new client
- Proxy configuration changes to support multiple tenant partitions
- Kubelet changes
- Use multiple kube clients to watch all tenant partitions
- Use the right kube client to update pod status
- Node controller changes
- Use multiple kube clients to watch pods/daemonSets
- Support multi-tenants in perf tests (can be hard-coded tenant names)
- TP API Server needs to get node objects for commands like "kubectl attach/log/.."
- Update node controller to use a separate client for objects in tenant partition
- Improve Kubemark to simulate more than 20K nodes.
- Change Proxy (API Gateway) to use dynamic map lookup instead of static configuration.
- Reduce the number of watch connections on a tenant partition.
- One scheduler instance can't cache an unlimited number of nodes. It needs optimizations like node preferences.
- Scheduler sampling to improve scheduling throughput.
- Reduce scheduling conflicts (still node preference?)
- Some controllers in tenant partitions also cache all nodes. Needs change as well.
- Current node controller caches all pods and statefulsets.
- Try to avoid aggregated watch on API gateway -- does "node.podList" help?
- Do we still need to coordinate etcd revisions among partitions? Probably not. But need more detailed analysis and some changes.
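As an illustration of the "dynamic map lookup instead of static configuration" item above, the following Go sketch keeps a tenant-to-partition table that can be updated while requests are being served. Everything here is an assumption made for the example: the `TenantRouter` type, its methods, the tenant names and the endpoint URLs are hypothetical, and the source does not say how the gateway would learn about tenants or which proxy it uses.

```go
package main

import (
	"fmt"
	"sync"
)

// TenantRouter is a hypothetical in-memory table from tenant name to the
// endpoint of the tenant partition that owns that tenant.
type TenantRouter struct {
	mu     sync.RWMutex
	routes map[string]string
}

// NewTenantRouter returns an empty router.
func NewTenantRouter() *TenantRouter {
	return &TenantRouter{routes: make(map[string]string)}
}

// Set adds or replaces the endpoint for a tenant at runtime.
func (r *TenantRouter) Set(tenant, endpoint string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.routes[tenant] = endpoint
}

// Delete removes a tenant from the table.
func (r *TenantRouter) Delete(tenant string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.routes, tenant)
}

// Lookup returns the endpoint for a tenant and whether it is known.
// Many requests can look up concurrently because only a read lock is taken.
func (r *TenantRouter) Lookup(tenant string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	endpoint, ok := r.routes[tenant]
	return endpoint, ok
}

func main() {
	router := NewTenantRouter()
	// Hypothetical tenants and endpoints.
	router.Set("tenant-a", "https://tp1.example:6443")
	router.Set("tenant-b", "https://tp2.example:6443")

	if endpoint, ok := router.Lookup("tenant-a"); ok {
		fmt.Println("tenant-a is served by", endpoint)
	}
	if _, ok := router.Lookup("tenant-c"); !ok {
		fmt.Println("tenant-c is unknown")
	}
}
```

A read-write mutex fits a table that is read on every request but changed rarely, for example when a tenant is created or moved. How entries would be populated is left open here.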