Arktos Scalability 430 2021 Tracks

Goals

  1. Multiple resource partitions for pod scheduling (2+ for 430) - primary goal
    • A tenant can have pods physically located in 2 different RPs upon scheduling
    • The scheduler in a tenant partition should be able to listen to multiple API servers, one belonging to each RP (see the sketch after this list)
    • Performance test for 50K hollow nodes. (2 TP, 2~3 RP)
    • Performance test runs with SSL enabled
    • QPS optimization
  2. Daemon set handling in RP - non primary goal
    • Remove from TP
    • Support daemon set in RP
    • Load test
  3. Dynamic add/delete TP/RP design - TBD
    • For design purposes only, not for implementation; treat it as an effort to avoid hardcoding in a lot of places
    • Quality bar only
    • Dynamically discover new tenant partitions based on CRD objects in its resource manager
  4. System partition pod handling - TBD
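
To make the primary goal above concrete (a tenant-partition scheduler watching nodes across multiple resource partitions), here is a minimal client-go-style sketch. It is illustrative only: the `rpKubeconfigs` parameter, the 30-second resync, and the package name are assumptions, not the actual Arktos wiring.

```go
package sketch

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// startRPNodeInformers builds one client per resource partition (RP) and
// starts a node informer against each, so a tenant-partition scheduler can
// see the union of nodes across all RPs. Illustrative sketch only.
func startRPNodeInformers(rpKubeconfigs []string, stopCh <-chan struct{}) ([]informers.SharedInformerFactory, error) {
	var factories []informers.SharedInformerFactory
	for _, kubeconfig := range rpKubeconfigs {
		cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
		if err != nil {
			return nil, err
		}
		client, err := kubernetes.NewForConfig(cfg)
		if err != nil {
			return nil, err
		}
		factory := informers.NewSharedInformerFactory(client, 30*time.Second)
		factory.Core().V1().Nodes().Informer() // register the node informer for this RP
		factory.Start(stopCh)
		factories = append(factories, factory)
	}
	return factories, nil
}
```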

Non Goals

  1. API gateway

Current status (release 0.7 - 2021.2.6)

Performance test status

  1. 10Kx2 cluster: 1 resource partition supporting 20K hosts; 2 tenant partitions, each supporting 300K pods
    • Density test passed
  2. 10Kx1 cluster: 1 resource partition supporting 10K hosts; 1 tenant partition supporting 300K pods
    • Density test passed
    • Load test completed (with known failures)
  3. Single cluster
    • 8K cluster passed density test; load test completed (with known failures)
    • 10K cluster density test completed with etcd "too many requests" errors; load test completed (with known failures)

Design & Development status

  1. Code change for 2TPx1RP mostly completed and merged into master (v0.7.0)
  2. Enable SSL in performance test - WIP (Yunwen)
  3. Use insecure mode in local cluster setting for POC (Agreed on 3/1/2021)
  4. Kubelet
    • Use a dedicated kube-client to talk to the resource manager.
    • Use multiple kube-clients to connect to multiple tenant partitions.
    • Track the mapping between tenant ID and kube-clients.
    • Use the right kube-client to do CRUD for all objects (To verify; see the client-routing sketch after this list)
  5. Controllers
    • Node controllers (in resource partition)
      • Use a dedicated kube-client to talk to the resource manager.
      • Use multiple kube-clients to talk to multiple tenant partitions.
    • Other controllers (in tenant partition)
      • If the controller lists/watches node objects, it needs to use multiple kube-clients to access multiple resource managers.
    • DaemonSet controller (Service/PV/AttachDetach/Garbage)
      • [ ] Move TTL/DaemonSet controllers to RP
      • Disable in TP, enable in RP
    • Identify resources that belong to RP only
    • Further perf and scalability improvements (TBD, currently a non-goal)
      • Decide whether to partition node objects or keep caching all of them in a single process.
  6. Scheduler
    • Use a dedicated kube-client to talk to its tenant partition.
    • Use multiple kube-clients to connect to multiple resource managers, listing/watching nodes from all resource managers.
    • [?] Use the right kube-client to update node objects.
    • Further perf and scalability improvements (TBD)
      • Improve scheduling algorithm to reduce the possibility of scheduling conflicts.
      • Improve scheduler sampling algorithm to reduce scheduling time.
  7. API server - TBD
    • Currently no areas identified that need to be changed
  8. Proxy
    • Working on a design that will evaluate a proxy vs. code changes in each component (TBD)
  9. Performance test tools
    • Cluster loader
      • How to talk to nodes in the perf test (Hongwei)
    • Kubemark
      • Support 2 TP scale-out cluster setup, insecure mode (0.7)
      • Support 2 TP scale-out cluster setup, secure mode
      • Support 2 TP, 2 RP scale-out cluster setup, secure mode
    • Kube-up
      • Support for scale out (currently only kubemark supports scale out)
  10. Performance test
    • Single RP capacity test (>= 25K, preparing for 25Kx2 goal)
    • QPS optimization (x2, x3, x4, etc. in density test)
    • Regular density tests for the 10K single cluster and 10Kx2. Each will be done after the 500-node test
      • 2TP (10K), 1RP (20K), 20K density test, secure mode
  11. Dev tools
    • One box setup script for 2 TP, 1 RP (Peng, Ying)
    • One box setup script for 2 TP, 2 RP (Ying)
  12. 1.18 Changes
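
To illustrate the kubelet and controller items above (tracking a tenant ID to kube-client mapping and picking the right client for CRUD), here is a minimal sketch. The `clientRouter` type, the `tenantID` argument, and the field names are hypothetical, not Arktos APIs; the `UpdateStatus` signature follows upstream client-go 1.18.

```go
package sketch

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clientRouter is a hypothetical holder for the per-partition clients:
// one dedicated client for the resource partition (node-scoped objects)
// and one client per tenant partition, keyed by tenant ID.
type clientRouter struct {
	rpClient      kubernetes.Interface
	tenantClients map[string]kubernetes.Interface
}

// updatePodStatus routes a pod status update to the tenant partition that
// owns the pod; node objects would go through rpClient instead.
func (r *clientRouter) updatePodStatus(ctx context.Context, tenantID string, pod *v1.Pod) error {
	client, ok := r.tenantClients[tenantID]
	if !ok {
		return fmt.Errorf("no kube-client registered for tenant %q", tenantID)
	}
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```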

Current Work in Progress (6/1):

  1. CR for back-ported 1.18 scheduler and related changes to master
    1. Multiple RP code change:
      1. KCM/Scheduler/Kubelet - Hongwei
      2. Kubeup/kubemark - Hongwei
      3. Script (arktos scale out) - Yunwen
      4. Perf test - Ying H.
    2. Apply VM/CommonInfo/Action changes to back porting branch - Ying H.
    3. Partial Runtime support - Yunwen
    4. Multi tenancy - Yunwen
    5. Integration test - Ying/Yunwen/Hongwei (2 reviewers each commit)
    6. K8s code - Hongwei - hold
  2. Perf test
    1. 50K density redo (Currently 6s. Waiting for rerun in master after code merge) (QPS 20)
    2. 1TP/1RP 10K load test
    3. Scale up 8K density test (optional)
  3. Issue tracker
    1. Failed to change ttl annotation for hollow-node - Yunwen Issue 1054

Completed Tasks

  1. Multiple resource partition design - decided to continue with multiple-client connection changes in all components for multiple RPs for now. Will re-design if issues are encountered with the current approach. (2/17)
  2. Setup local cluster for multiple TPs, RPs (Done - 2/24)
    1. Script/manual for 2TP&1RP cluster setup with 3 hosts - insecure mode (2/19) PR 994
    2. Local dev environment: SSL enabled for the scheduler in TP connecting to RP directly (2/24) PR 1003
  3. Component code changes
    1. TP components connect to RP directly (Done - 3/15)
      1. Scheduler connects to RP directly via separate clients (2/23) PR 991
      2. KCM connects to RP directly via a separate client (3/10) PR 1015
      3. Garbage collector supports multiple RPs (3/15) PR 1025
    2. RP components connect to TP directly (Done - 3/12)
      1. Nodelifecycle controller connects to TP directly via separate clients (3/9) PR 1011
      2. Kubelet connects to TP directly via kubeconfig (3/12) PR 1021
    3. Disable/Enable controllers in TP/RP
      1. Move TTL/DaemonSet controller from TP KCM to RP KCM (3/10) PR 1015
      2. Enable service account/token controller in RP KCM local (3/15) PR 1028
    4. Scheduler backporting
  4. Support multiple RPs in kube-up (Done)
    1. Script changes to bring up and cleanup multiple RPs (2/23)
    2. Merge kube-up/kubemark code from master to POC (3/15) PR 1024
    3. Move DaemonSet/TTL controller etc. to RP KCM (3/16) PR 1031
    4. Multiple RPs work in kube-up/kubemark (3/22)
  5. Enable SSL in performance test - master
    1. Code change (3/12) PR 1001
    2. 1TP/1RP 500 nodes perf test (3/12)
    3. 2TP/1RP 500 nodes (3/30)
    4. 1TP/1RP 15K nodes (4/6)
  6. Perf test code changes (Done)
    1. Perf test changes needed for multiple RPs (3/18)
    2. Disable DaemonSet test in load (3/25) PR 1050
  7. Performance test (WIP)
    1. Test single RP limit
      1. 1TP/1RP achieved 40K hollow nodes (3/3). RP CPU ~44%
      2. 15K density test in SSL mode - passed on 4/6 (QPS 20)
    2. Get more resources in GCP (80K US central 3/8)
    3. 10K density test insecure mode - benchmark (3/18)
    4. Multiple TPs/RPs density test
      1. 2TP/2RP 2x500 passed (3/27)
      2. 2TP/2RP 2x5K density test (3/30)
      3. Scale up 500 density test (3/30)
      4. 2TP/2RP 2x10K density test (3/31)
      5. 2TP/2RP 2x10K density test with double RS QPS (4/1 - density test passed but with high saturation pod start up latency)
    5. Scheduler back porting perf test
      • 1TP1RP 10K high QPS test 4/23
        • Scheduler throughput can almost reach 200
      • 1TP1RP 10K RS QPS 200 4/23
        • Similar pod start up latency. p50 4.9s, p90 12s, p99 16s. scheduler throughput p50 204, max 588
        • Scheduler permission issue and logging - fixed 4/27
      • 4TP/2RP total 30K nodes, RS QPS 100 4/27
        • No failure except pod start up latency too high (community benchmark 5s at 10K level)
        • Pod start up latency maximum out of 4 TPs: p50 3.4s, p90 6.9s, p99 10.8s
        • Scheduling throughput p50 100, max 224 ~ 294
      • 5TP/5RP total 50K nodes, RS QPS 100 4/28
        • No failure except pod start up latency too high
        • Pod start up latency maximum out of 5 TPs: p50 2.1s, p90 5.6s, p99 10~13s
        • Scheduling throughput p50 101 ~ 103, max 177 ~ 193
  8. QPS tuning (see the client QPS sketch after this list)
    1. Increase GC controller QPS (3/18)
      1. 20->40 (PR 1034); 10K density test time reduced from 14 hours to 9.6 hours
    2. Increase replicaset controller QPS
      1. 20->40 PR 1034
      2. High saturation pod start up latency, scheduler throughput 31 (4/1, 2TPx2RP 2X10K)
        1. Check scheduler QPS distribution - (4/5 - all used pod binding)
        2. Check Vinay's high qps log to find out how many schedulers were running and whether they were changing leader frequently. (4/5 - no changing leader)
    3. Check global scheduler team optimization changes - 4/6
    4. Community 1.18 high QPS 10K throughput confirm - 4/13
  9. Back port arktos code to 1.18 scheduler - 4/22
    1. 1x10K scheduler back porting with QPS 200 - 4/23
      1. Failed with pod start up latency tp50 4.9s, tp90 12s, tp99 16s; scheduling throughput tp50 204, tp90 311, tp99 558, max 588
  10. Complete golang 1.13.9 migration (Done - 3/12)
    1. Kube-openapi upgrade Issue 923
      1. Add and verify import-alias (2/10) PR 965
      2. Add hack/arktos_cherrypick.sh (2/19) PR 990
      3. Promote admission webhook API to v1. Arktos only supports v1beta1 now (2/20) PR 981
      4. Promote admissionreview to v1. Arktos only supports v1beta1 now (2/25) PR 998
      5. Promote CRD to v1 - (3/3) PR 1004
      6. Bump kube-openapi to 20200410 version and SMD to V3 (3/12) PR 1010
  11. Regression fix
    1. Failed to collect profiling of ETCD (3/11) Issue 1008 PR 1009
    2. Static pods being recycled on TP cluster Issue 1006 (Yunwen/Verifying)
    3. ETCD object counts issue in 3/10 run (3/16) PR 1027 Issue 1023
    4. haproxy SSL check causes api server "TLS handshake error" (3/31) PR 1060 Issue 1048
    5. RP server failed to collect pprof files - (4/5) PR 1058 Issue 1057
    6. Change scheduler PVC binder code to support multiple RPs - (3/31) PR 1063 Issue 1059
  12. Issues Fixed
    1. Kubelet failed to upload events due to authorization error - Yunwen Issue 1046 PR 1040
    2. KCM (deployment controller) on TP cluster failed to sync up deployment with its token - Yunwen master Issue 1039 PR 1040
    3. KCM on TP cluster didn't get nodes in RP cluster(s) - Yunwen master Issue 1038 PR 1040
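
For the QPS tuning items above (e.g. the GC and ReplicaSet controller bump from 20 to 40), the sketch below shows the standard client-go knobs involved. The actual Arktos changes live in the PRs listed; the 40/80 values here just mirror the numbers mentioned and are not the exact configuration used.

```go
package sketch

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newTunedClient copies a rest.Config and raises the client-side rate
// limits before building a clientset. The 40/80 values mirror the 20->40
// QPS bump noted above; burst is conventionally kept at roughly 2x QPS.
func newTunedClient(cfg *rest.Config) (kubernetes.Interface, error) {
	tuned := rest.CopyConfig(cfg)
	tuned.QPS = 40
	tuned.Burst = 80
	return kubernetes.NewForConfig(tuned)
}
```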

Tasks on hold

  1. 1TP/1RP limit test
    1. 15K density test - done 4/6, QPS 20, pod start up latency p50 1.8s, p99 4.4s
    2. 20K density test - TODO (4/22 failed on start up latency)
  2. Metrics platform migration
  3. Regression fix
    1. 500 nodes load run finished with error: DaemonSets timeout Issue 1007
  4. System partition pod - how to handle when HA proxy is removed (TBD)
    1. Density test should be OK
  5. Check node authorizer in secure mode
  6. Kubeup/Kubemark improvement
    1. Start proxy at the end (Yunwen)
    2. TP/RP start concurrently (Hongwei)
  7. Benchmark on cluster throughput, pod start up latency for 1TP/1RP, and 50K cluster
  8. Issues
    1. GC controller queries its own master node's lease info and causes 404 errors in haproxy Issue 1047 - appears to be in master only. Fixed in POC. Park the issue until the POC changes are ported back to master.
    2. [Scale out POC] pod scheduler reported the pod as bound successfully but it does not appear locally Issue 1049 - related to system tenant design. Post 430
    3. [Scale out POC] secret not found in kubelet Issue 1052 - related to system tenant design. Post 430
    4. Tenant zeta request was not redirected to TP2 master correctly Issue 1056 - current proxy limitation

Issue solved in POC - pending in master

  1. Static pods being recycled on TP cluster (fixed in POC) PR 1044 Issue 1006
  2. Controllers on TP should union the nodes from RP cluster and local cluster - fixed in POC PR 1044 Issue 1042