Scale Out PoC
This document tracks the milestones, tasks, and ideas related to scaling out Arktos.
API gateway:
- Set up a proxy to statically route requests: 1) for "nodes" and "nodeLeases", redirect to the resource partition; 2) for everything else, redirect to the tenant partition (a sketch follows this list)
- Workaround: redirect some requests (for common system-level URLs) based on client IP
- Switch to HAProxy if the Nginx watch issue blocks or fails the scalability test
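
A minimal sketch of the static routing rule above, assuming a plain Go reverse proxy in front of one tenant partition and one resource partition. The backend addresses, port, and path matching are placeholder assumptions for illustration, not the actual gateway implementation:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Placeholder backend addresses; a real deployment would read these from config.
	rpURL, _ := url.Parse("http://resource-partition.example:8080")
	tpURL, _ := url.Parse("http://tenant-partition.example:8080")

	rpProxy := httputil.NewSingleHostReverseProxy(rpURL)
	tpProxy := httputil.NewSingleHostReverseProxy(tpURL)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Static rule: node and node-lease traffic goes to the resource partition,
		// everything else goes to the tenant partition.
		path := r.URL.Path
		if strings.Contains(path, "/nodes") || strings.Contains(path, "/leases") {
			rpProxy.ServeHTTP(w, r)
			return
		}
		tpProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8888", handler))
}
```

The same handler is where the client-IP workaround above, and later the dynamic map lookup, would plug in.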
Build & Test Scripts:
- Have simple scripts to launch a tenant partition, a resource partition and an API gateway.
- For the tenant partition, disable the node controller and the workload controller manager, and add the deployment and replicaSet controllers to the controller manager (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/)
- For the resource partition, disable the scheduler, the workload controller manager, and all other controllers except the node lifecycle controller
- Do we need the nodeIPAM controller?
- Update Kubemark and the perf tests to work with the scale-out design.
- Investigate the service token issue in the NodeLifecycle controller
- Create a new branch for the scale-out PoC
- Prometheus settings for multiple API servers
Investigation:
- Why is there a huge latency delay? (resolved by adding HTTP/1.1)
- Intermittent timeouts for watch requests on the proxy
- Manual e2e tests for key flows
- Scheduler: node informer using a new client
- Kubelet: node and lease updates using a new client
- Proxy configuration changes to support multiple tenant partitions
- HAProxy configuration
- Cherry-pick Kubemark & setup changes from the m1 branch
- Kubelet changes
- Use multiple kube clients to watch all tenant partitions (see the client-lookup sketch at the end of this page)
- Pods keep crashing and restarting
- Use the right kube client to update pod status
- Use the right kube client for pod deletion
- Use the right kube client for PVs
- Fix the multi-tenancy bugs in the statefulSet controller
- How to handle resources in the system space, which can come from any tenant partition
- Node controller changes
- Use multiple kube clients to watch pods/daemonSets
- Kubemark component changes to support multiple clients
- Test setup & script changes
- Dynamically configure the proxy with TP/RP IPs
- Hollow nodes talk to multiple TPs
- Create two tenants on TP1 and TP2 in the test deployment script
- Change the test config to distribute workloads across the multiple tenants
- Change the metrics collection to support multiple tenants
- Aggregate query results in tests at the tenant level
- Investigation
- "endpoint not found" error in resource partition
- Kubectl: use the right client to send events to different tenant partitions and resource partitions
- The TP API server needs to get node objects for commands like "kubectl attach/log/.."
- Update the node controller to use a separate client for objects in the tenant partition
- Improve Kubemark to simulate more than 20K nodes.
- Change the proxy (API gateway) to use a dynamic map lookup instead of static configuration.
- Reduce the number of watch connections on a tenant partition.
- One scheduler instance can't cache an unlimited number of nodes; it needs optimizations like node preferences.
- Scheduler sampling to improve scheduling throughput.
- Reduce scheduling conflicts (still node preference?)
- Some controllers in tenant partitions also cache all nodes; they need changes as well.
- The current node controller caches all pods and statefulSets.
- Try to avoid aggregated watches on the API gateway -- does "node.podList" help?
- Do we still need to coordinate etcd revisions among partitions? Probably not, but this needs more detailed analysis and some changes.
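
For the "use multiple kube clients" items above (kubelet, node controller, Kubemark components), a rough sketch of a per-partition client lookup, assuming standard client-go clientsets. The map, function names, and the tenant-to-partition keying are illustrative assumptions, not the actual Arktos code:

```go
package partitionclients

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// tpClients maps a tenant-partition name to a clientset built from that
// partition's kubeconfig. The keying scheme is assumed for illustration.
var tpClients = map[string]kubernetes.Interface{}

// AddTenantPartition registers a client for one tenant-partition API server.
func AddTenantPartition(name, kubeconfigPath string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return err
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	tpClients[name] = cs
	return nil
}

// ClientForTenant returns the clientset that owns a given tenant, so that pod
// status updates, pod deletions, and PV operations go back to the partition
// the object came from rather than to a single hard-coded API server.
func ClientForTenant(tenant string) (kubernetes.Interface, error) {
	// A real implementation would resolve tenant -> partition (for example from
	// the gateway's routing table); here we assume one partition per tenant key.
	if c, ok := tpClients[tenant]; ok {
		return c, nil
	}
	return nil, fmt.Errorf("no client registered for tenant %q", tenant)
}
```

Watching all tenant partitions then amounts to starting one informer or watch per entry in this map, which is also why reducing the number of watch connections per tenant partition appears as a separate item above.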