Arktos Scalability 430 2021 Tracks
- Multiple resource partitions for pod scheduling (2+ for 430) - primary goal
- A tenant can have pods physically located in 2 different RPs after scheduling
- The scheduler in a tenant partition should be able to listen to the API servers belonging to each RP
- Performance test for 50K hollow nodes. (2 TP, 2~3 RP)
- Performance test runs with SSL enabled
- QPS optimization
- Daemon set handling in RP - non primary goal
- Remove from TP
- Support daemon set in RP
- Load test
- Dynamic add/delete TP/RP design - TBD
- For design purposes only, not for implementation - the aim is to avoid hardcoding partition lists in many places
- Quality bar only
- Dynamically discover new tenant partitions based on CRD objects in the resource manager (sketched below)
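Since this track is explicitly design-only, the following is just a sketch of the CRD-based discovery idea: it assumes tenant partitions are modeled as objects of a hypothetical `TenantPartition` CRD (the group and resource names below are invented, not from the Arktos code) that the resource manager watches to add or delete TPs without hardcoding them.

```go
package partitiondiscovery

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

// tenantPartitionGVR identifies a hypothetical TenantPartition CRD;
// the group and resource names are invented for this sketch.
var tenantPartitionGVR = schema.GroupVersionResource{
	Group:    "arktos.example.com",
	Version:  "v1",
	Resource: "tenantpartitions",
}

// WatchTenantPartitions opens a watch on TP objects so that added and
// deleted events can drive dynamic add/delete of tenant partitions
// instead of a hardcoded partition list.
func WatchTenantPartitions(ctx context.Context, client dynamic.Interface) (watch.Interface, error) {
	return client.Resource(tenantPartitionGVR).Watch(ctx, metav1.ListOptions{})
}
```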
- System partition pod handling - TBD
- API gateway
- 10Kx2 cluster: 1 resource partition, supporting 20K hosts; 2 tenant partitions, each supporting 300K pods
- Density test passed
- 10Kx1 cluster: 1 resource partition, supporting 10K hosts; 1 tenant partition, supporting 300K pods
- Density test passed
- Load test completed (with known failures)
- Single cluster
- 8K cluster passed density test, load test completed (with known failures)
- 10K cluster density test completed with etcd "too many requests" errors, load test completed (with known failures)
- Code change for 2TPx1RP mostly completed and merged into master (v0.7.0)
- Enable SSL in performance test - WIP (Yunwen)
- Use insecure mode in local cluster setting for POC (Agreed on 3/1/2021)
- Kubelet
- Use a dedicated kube-client to talk to the resource manager.
- Use multiple kube-clients to connect to multiple tenant partitions.
- Track the mapping between tenant ID and kube-clients.
- Use the right kube-client to do CRUD for all objects (to verify; see the sketch below)
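A minimal sketch of the client bookkeeping described above, assuming standard client-go; the `ClientSet` type, its field names, and the kubeconfig-based construction are illustrative, not the actual Arktos kubelet code.

```go
package multiclient

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// ClientSet tracks a dedicated client for the resource manager plus one
// client per tenant partition, keyed by tenant ID.
type ClientSet struct {
	ResourceManager kubernetes.Interface
	TenantClients   map[string]kubernetes.Interface
}

// NewClientSet builds the clients from one RP kubeconfig and a map of
// tenant ID -> TP kubeconfig path.
func NewClientSet(rpKubeconfig string, tpKubeconfigs map[string]string) (*ClientSet, error) {
	rpCfg, err := clientcmd.BuildConfigFromFlags("", rpKubeconfig)
	if err != nil {
		return nil, err
	}
	rpClient, err := kubernetes.NewForConfig(rpCfg)
	if err != nil {
		return nil, err
	}
	cs := &ClientSet{
		ResourceManager: rpClient,
		TenantClients:   make(map[string]kubernetes.Interface),
	}
	for tenantID, path := range tpKubeconfigs {
		cfg, err := clientcmd.BuildConfigFromFlags("", path)
		if err != nil {
			return nil, err
		}
		tpClient, err := kubernetes.NewForConfig(cfg)
		if err != nil {
			return nil, err
		}
		cs.TenantClients[tenantID] = tpClient
	}
	return cs, nil
}

// ClientFor picks the kube-client responsible for a given tenant's
// objects; nil means the tenant is unknown.
func (c *ClientSet) ClientFor(tenantID string) kubernetes.Interface {
	return c.TenantClients[tenantID]
}
```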
- Controllers
- Node controllers (in resource partition)
- Use a dedicated kube-client to talk to the resource manager.
- Use multiple kube-clients to talk to multiple tenant partitions.
- Other controllers (in tenant partition)
- If a controller list/watches node objects, it needs to use multiple kube-clients to access multiple resource managers (see the union sketch after this list)
- DaemonSet controller (Service/PV/AttachDetach/Garbage)
- [ ] Move TTL/DaemonSet controllers to RP
- Disable in TP, enable in RP
- Identify resources that belong to the RP only
- Further perf and scalability improvements (TBD, currently non goal)
- Decide whether to partition the node object cache or cache all node objects in a single process.
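As referenced above, a hedged sketch of how a tenant-partition controller might union node objects across multiple resource managers (the direction also tracked in Issue 1042); `ListAllNodes` is a hypothetical helper, not the Arktos implementation.

```go
package multiclient

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ListAllNodes unions the node objects reported by every resource
// manager the controller is connected to, so a TP-side controller sees
// nodes from all RPs rather than just one.
func ListAllNodes(ctx context.Context, rpClients []kubernetes.Interface) ([]v1.Node, error) {
	var nodes []v1.Node
	for _, rp := range rpClients {
		list, err := rp.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return nil, err
		}
		nodes = append(nodes, list.Items...)
	}
	return nodes, nil
}
```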
- Scheduler
- Use a dedicated kube-client to talk to its tenant partition.
- Use multiple kube-clients to connect to multiple resource managers, list/watching nodes from all resource managers (see the informer sketch below)
- [?] Use the right kube-client to update node objects.
- Further perf and scalability improvements (TBD)
- Improve scheduling algorithm to reduce the possibility of scheduling conflicts.
- Improve scheduler sampling algorithm to reduce scheduling time.
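The informer sketch referenced above: one node informer per resource manager, all feeding the same scheduler-side handler. This is an illustration built on standard client-go informers; `WatchNodesFromAllRPs` and `onNodeAdd` are invented names, not the Arktos scheduler code.

```go
package multiclient

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// WatchNodesFromAllRPs starts a node informer against each resource
// manager so the scheduler observes the union of all RP node objects.
func WatchNodesFromAllRPs(rpClients []kubernetes.Interface, stopCh <-chan struct{}, onNodeAdd func(*v1.Node)) {
	for _, rp := range rpClients {
		factory := informers.NewSharedInformerFactory(rp, 30*time.Second)
		factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				if node, ok := obj.(*v1.Node); ok {
					onNodeAdd(node) // feed a single scheduler-side cache
				}
			},
		})
		factory.Start(stopCh) // informers run until stopCh is closed
	}
}
```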
- API server - TBD
- No areas requiring change have been identified so far
- Proxy
- Working on a design that will evaluate proxy vs. code change in each component (TBD)
- Performance test tools
- Cluster loader
- How to talk to nodes in the perf test (Hongwei)
- Kubemark
- Support 2 TP scale out cluster set up, insecure (0.7)
- Support 2 TP scale out cluster set up, secure mode
- Support 2 TP, 2 RP scale out cluster set up, secure mode
- Kube-up
- Support for scale-out (currently only kubemark supports scale-out)
- Performance test
- Single RP capacity test (>= 25K, preparing for 25Kx2 goal)
- QPS optimization (x2, x3, x4, etc. in density test)
- Regular density tests for 10K single cluster and 10Kx2. Each will run after the 500-node test
- 2TP (10K), 1RP (20K), 20K density test, secure mode
- Dev tools
- One box setup script for 2 TP, 1 RP (Peng, Ying)
- One box setup script for 2 TP, 2 RP (Ying)
- 1.18 Changes
- Complete golang 1.13.9 migration (Sonya)
- Metrics platform migration (YingH)
- Migrated from metrics server to Prometheus
- Get correct API responsiveness data
- CR: back-ported 1.18 scheduler and related changes to master
- Multiple RP code change:
- KCM/Scheduler/Kubelet - Hongwei
- Kubeup/kubemark - Hongwei
- Script (arktos scale out) - Yunwen
- Perf test - Ying H.
- Apply VM/CommonInfo/Action changes to back porting branch - Ying H.
- Partial Runtime support - Yunwen
- Multi tenancy - Yunwen
- Integration test - Ying/Yunwen/Hongwei (2 reviewers each commit)
- K8s code - Hongwei - hold
- Perf test
- 50K density redo (Currently 6s. Waiting for rerun in master after code merge) (QPS 20)
- 1TP/1RP 10K load test
- Scale up 8K density test (optional)
- Issue tracker
- Failed to change ttl annotation for hollow-node - Yunwen Issue 1054
- Multiple resource partition design - decided to continue the multiple-client-connection changes in all components for multiple RPs for now; will redesign if issues are encountered with the current approach. (2/17)
- Setup local cluster for multiple TPs, RPs (Done - 2/24)
- Component code changes
- TP components connect to RP directly (Done - 3/15)
- RP components connect to TP directly (Done - 3/12)
- Disable/Enable controllers in TP/RP
- Scheduler backporting
- Support multiple RPs in kube-up (Done)
- Enable SSL in performance test - master
- Code change (3/12) PR 1001
- 1TP/1RP 500 nodes perf test (3/12)
- 2TP/1RP 500 nodes (3/30)
- 1TP/1RP 15K nodes (4/6)
- Perf test code changes (Done)
- Perf test changes needed for multiple RPs (3/18)
- Disable DaemonSet test in load (3/25) PR 1050
- Performance test (WIP)
- Test single RP limit
- 1TP/1RP achieved 40K hollow nodes (3/3). RP CPU ~44%
- 15K density test in SSL mode - passed on 4/6 (QPS 20)
- Get more resources in GCP (80K US central 3/8)
- 10K density test insecure mode - benchmark (3/18)
- Multiple TPs/RPs density test
- 2TP/2RP 2x500 passed (3/27)
- 2TP/2RP 2x5K density test (3/30)
- Scale up 500 density test (3/30)
- 2TP/2RP 2x10K density test (3/31)
- 2TP/2RP 2x10K density test with double RS QPS (4/1 - density test passed but with high saturation pod start up latency)
- Scheduler back porting perf test
- 1TP1RP 10K high QPS test 4/23
- Scheduler throughput can almost reach 200
- 1TP1RP 10K RS QPS 200 4/23
- Similar pod start up latency: p50 4.9s, p90 12s, p99 16s; scheduler throughput p50 204, max 588
- Scheduler permission issue and logging - fixed 4/27
- 4TP/2RP total 30K nodes, RS QPS 100 4/27
- No failure except pod start up latency too high (community benchmark 5s at 10K level)
- Pod start up latency maximum out of 4 TPs: p50 3.4s, p90 6.9s, p99 10.8s
- Scheduling throughput p50 100, max 224 ~ 294
- 5TP/5RP total 50K nodes, RS QPS 100 4/28
- No failure except pod start up latency too high
- Pod start up latency maximum out of 5 TPs: p50 2.1s, p90 5.6s, p99 10~13s
- Scheduling throughput p50 101 ~ 103, max 177 ~ 193
- QPS tuning
- Increase GC controller QPS (3/18)
- 20->40 (PR 1034); 10K density test time reduced from 14 hours to 9.6 hours (client-side settings sketched below)
- Increase replicaset controller QPS
- 20->40 (PR 1034)
- High saturation pod start up latency, scheduler throughput 31 (4/1, 2TPx2RP 2x10K)
- Check scheduler QPS distribution - (4/5 - all used pod binding)
- Check Vinay's high-QPS log to find out how many schedulers were running and whether they were changing leaders frequently. (4/5 - no leader changes)
- Check global scheduler team optimization changes - 4/6
- Community 1.18 high QPS 10K throughput confirm - 4/13
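For context on the 20->40 bumps above: in stock client-go, per-client rate limits live on `rest.Config` as `QPS` and `Burst`. A sketch of that kind of change follows; the function name and the exact `Burst` value are illustrative, while the actual Arktos change was made in the controller configuration per PR 1034.

```go
package multiclient

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// NewHighQPSClient raises the client-side rate limits so a busy
// controller (e.g. GC or ReplicaSet) is not throttled at the usual
// 20 QPS default for controller clients.
func NewHighQPSClient(cfg *rest.Config) (kubernetes.Interface, error) {
	cfg.QPS = 40   // sustained request rate, mirroring the 20->40 change
	cfg.Burst = 60 // short bursts above the sustained rate (illustrative value)
	return kubernetes.NewForConfig(cfg)
}
```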
- Back port arktos code to 1.18 scheduler - 4/22
- 1x10K scheduler back porting with QPS 200 - 4/23
- Failed with pod start up latency tp50 4.9s, tp90 12s, tp99 16s; scheduling throughput tp50 204, tp90 311, tp99 558, max 588
- Complete golang 1.13.9 migration (Done - 3/12)
- Kube-openapi upgrade Issue 923
- Add and verify import-alias (2/10) PR 965
- Add hack/arktos_cherrypick.sh (2/19) PR 990
- Promote admission webhook API to v1. Arktos only supports v1beta1 now (2/20) PR 981
- Promote AdmissionReview to v1. Arktos only supports v1beta1 now (2/25) PR 998
- Promote CRD to v1 - (3/3) PR 1004
- Bump kube-openapi to 20200410 version and SMD to V3 (3/12) PR 1010
- Regression fix
- Failed to collect profiling of ETCD (3/11) Issue 1008 PR 1009
- Static pods being recycled on TP cluster Issue 1006 (Yunwen/Verifying)
- ETCD object counts issue in 3/10 run (3/16) PR 1027 Issue 1023
- haproxy SSL check causes api server "TLS handshake error" (3/31) PR 1060 Issue 1048
- RP server failed to collect pprof files - (4/5) PR 1058 Issue 1057
- Change scheduler PVC binder code to support multiple RPs - (3/31) PR 1063 Issue 1059
- Issues Fixed
- Kubelet failed to upload events due to authorization error - Yunwen Issue 1046 PR 1040
- KCM (deployment controller) on TP cluster failed to sync up deployment with its token - Yunwen master Issue 1039 PR 1040
- KCM on TP cluster didn't get nodes in RP cluster(s) - Yunwen master Issue 1038 PR 1040
- 1TP/1RP limit test
- 15K density test - done 4/6, QPS 20, pod start up latency p50 1.8s, p99 4.4s
- 20K density test - TODO (4/22 failed on start up latency)
- Regression fix
- 500 nodes load run finished with error: DaemonSets timeout Issue 1007
- System partition pod - how to handle when HA proxy is removed (TBD)
- Density test should be OK
- Check node authorizer in secure mode
- Kubeup/Kubemark improvement
- Start proxy at the end (Yunwen)
- TP/RP start concurrently (Hongwei)
- Benchmark on cluster throughput, pod start up latency for 1TP/1RP, and 50K cluster
- Issues
- GC controller queries its own master nodes' lease info and causes 404 errors in haproxy Issue 1047 - appears to be in master only. Fixed in POC. Parked until POC changes are ported back to master.
- [Scale out POC] pod scheduler reported binding succeeded but the pod does not appear locally Issue 1049 - related to system tenant design. Post 430
- [Scale out POC] secret not found in kubelet Issue 1052 - related to system tenant design. Post 430
- Tenant zeta request was not redirected to TP2 master correctly Issue 1056 - current proxy limitation
- Static pods being recycled on TP cluster (fixed in POC) PR 1044 Issue 1006
- Controllers on TP should union the nodes from RP cluster and local cluster - fixed in POC PR 1044 Issue 1042