Computing Backlog

Mengni Zhang edited this page Jun 17, 2021 · 23 revisions

This page lists the features and issues that will be addressed in Centaurus Arktos.

Scalability

Features

Scaleout Framework

  • Daemonset handling in RP - 930 system tenant only, multi-tenancy daemonset out of scope - Hongwei
  • System partition pod handling - design 930 - YingH.
    • Currently, all system tenant objects are saved in etcd with "system" as part of the directory path. Customizing the etcd directory for system objects is possible.
    • Proposal: each TP has its own system objects and does not share them with other TPs. This avoids the single-point-of-failure and data population issues.
      • Need to check whether multiple system pods from different TPs can be deployed on the same VM and cause conflicts: case by case.
        • Currently we only have system DNS pods and virtlet pods. DNS pods can be deployed to the same VM as long as there is no storage conflict. Need to check virtlet pods.
  • Scheduler benchmark - by usage requirement
  • Burst scheduling support
    • 930 QPS 40 per 10K cluster only
      • 1TP/1RP QPS 40
      • 50K cluster QPS 200: 5TP/5RP, 3TP/4RP possible < 200
    • Post 930 distribute VM allocation, VM start up latency, etc.
  • Dynamic add/delete TP/RP - post 2021
  • Create Tenant requests need to go through the proxy and be properly redirected to the correct TP API server #1056
    • Click2Cloud
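
The burst scheduling targets listed above can be checked with a little arithmetic. The sketch below assumes that the target scales linearly with cluster size (40 QPS per 10K nodes) and that each TP sustains the 40 QPS quoted for 1TP/1RP; both are one reading of the bullets, not measured results, and the function names are hypothetical.

```python
import math

QPS_PER_10K_NODES = 40  # from the backlog: "QPS 40 per 10K cluster"
QPS_PER_TP = 40         # from the backlog: "1TP/1RP QPS 40" (assumed to hold per TP)


def target_qps(nodes: int) -> float:
    """Target burst QPS for a cluster, assuming linear scaling with node count."""
    return QPS_PER_10K_NODES * nodes / 10_000


def min_tps(nodes: int) -> int:
    """Smallest number of TPs that reaches the target, assuming QPS_PER_TP each."""
    return math.ceil(target_qps(nodes) / QPS_PER_TP)


def max_qps(tps: int) -> int:
    """Aggregate QPS a given number of TPs could sustain under the same assumption."""
    return tps * QPS_PER_TP


if __name__ == "__main__":
    print(target_qps(50_000))  # target for a 50K cluster
    print(min_tps(50_000))     # TPs needed to reach it
    print(max_qps(3))          # what a 3TP setup could reach
```

Under these assumptions a 50K cluster needs 5 TPs to reach 200 QPS, and a 3TP configuration tops out below 200, which matches the note on 3TP/4RP.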

Cluster size

  • Scale out
    • Upper limit for 1TP/1RP cluster capacity 930 - YingH.
      • Verified 1TP/1RP QPS 20, 15K cluster passed density test (4/6/2021 pod start up latency p50 1.8s, p99 4.4s)
      • Consider enabling affinity (10%?) in the density test to predict production environment behavior - not for regular perf test - post 930
    • Minimum TP/RP combination for 50K nodes 930 - YingH.
    • Overall cluster size >= 100K - post 930
  • Scale up - post 930
    • Cluster size >= 10K

Throughput 930

  • Improve cluster throughput without degrading pod start up latency (<5s) - YingH/Yunwen/Hongwei
    • Release v0.8 support 100 QPS (20 per TP) with pod start up latency < 6s

Perf test tool

  • Refactor the cluster up script to start the cluster in parallel and reduce cluster up time #1036
    • Start proxy at the end (Yunwen - Done PR 1105)
    • TP/RP start concurrently (Hongwei - TODO)
  • Support VM scheduling in density test - post 930
    • Simulate delay only?
    • What to test
    • Cluster size?
  • Metrics platform migration - post 930
    • Complete migration from metrics server to Prometheus PR #980
    • Get correct API responsiveness data
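
The start-up ordering described for the cluster up refactoring (#1036), with TP/RP starting concurrently and the proxy starting at the end, can be sketched as follows. This is only an illustration in Python, not the actual script: `bring_up`, `start_partition`, and `start_proxy` are hypothetical names, and the real work is done by the Arktos shell scripts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def bring_up(
    partitions: Sequence[str],
    start_partition: Callable[[str], str],
    start_proxy: Callable[[], None],
) -> List[str]:
    """Start all TP/RP partitions concurrently, then start the proxy last.

    start_partition and start_proxy are placeholders for whatever actually
    launches a partition or the proxy (hypothetical callables).
    """
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order; any start-up exception is re-raised here.
        results = list(pool.map(start_partition, partitions))
    # Leaving the "with" block waits for every partition, so the proxy
    # only starts once all TPs and RPs are up.
    start_proxy()
    return results


if __name__ == "__main__":
    names = ["tp-1", "tp-2", "rp-1", "rp-2"]
    bring_up(
        names,
        lambda name: f"{name} started",
        lambda: print("proxy started"),
    )
```

A thread pool is used here because starting a partition is mostly waiting on external processes; a failure in any partition surfaces before the proxy is started.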

Issues

Check

  • perftest/cloudloader2 to support multiple TP/RP clusters in getting node resources #1101 - Yunwen
  • Failed to change ttl annotation for hollow-node #1054 - Resolved
  • Binding error in large scale perf runs #1089 - YingH.
  • GC controller queries its own master nodes' lease info, causing 404 errors in haproxy #1047 - Resolved
  • GC node lease object in Arktos #1106
  • fix usage of KubeTPClients[0] in kubelet for scale-out #974
  • load test failed: timed out waiting for pods using volumes to be running #978

Parking

  • Excessive pod start up latency in 50K density test #1102
  • secret not found in kubelet [system tenant only] #1052 - low (local)
  • pod scheduler reported bound successfully but not appear in local [system tenant only] #1049 - system pod handling
  • 500 nodes load run finished with error: DaemonSets timeout #1007

Scheduler backporting leftover (TBD)

  • Arktos needs to migrate CSINodeInfo to GA #1099

Runtime - post 930

Features

Issues

Multi Tenancy

Features

  • Support multi-tenancy network

Issues

  • Pod Affinity check should use tenant-namespace-podName to compare pods. #1100 (TBD)
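
The idea behind #1100 is that in a multi-tenant cluster two pods can share a namespace and name while belonging to different tenants, so a comparison on namespace and name alone can confuse them. A minimal sketch of the composite key, assuming pod metadata is available as a dictionary with `tenant`, `namespace`, and `name` fields (the helper names are hypothetical, and the actual fix would live in the Go scheduler code):

```python
from typing import NamedTuple


class PodKey(NamedTuple):
    """Identity of a pod in a multi-tenant cluster: tenant-namespace-podName."""
    tenant: str
    namespace: str
    name: str


def pod_key(pod: dict) -> PodKey:
    """Build the composite key from a pod's metadata (assumed dict layout)."""
    meta = pod["metadata"]
    return PodKey(meta["tenant"], meta["namespace"], meta["name"])


def same_pod(a: dict, b: dict) -> bool:
    """Two pods are the same only if tenant, namespace, and name all match."""
    return pod_key(a) == pod_key(b)


if __name__ == "__main__":
    web_a = {"metadata": {"tenant": "tenant-a", "namespace": "default", "name": "web"}}
    web_b = {"metadata": {"tenant": "tenant-b", "namespace": "default", "name": "web"}}
    print(same_pod(web_a, web_a))  # same tenant, namespace, and name
    print(same_pod(web_a, web_b))  # differs only by tenant
```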

Network

Features

  • Mizar integration
  • Alcor integration (TBD)

Dev Tools/Script - post 930

Issues

  • Refactor arktos-up/scale-out script to share common code #1098 (https://github.com/CentaurusInfra/arktos/issues/1098)
  • hack/arktos-worker-up.sh does not get network plugin installed #1067
  • cleanup unused /tmp/xxx-ip.txt entries in kubemark scripts #1018
  • revisit controls on HAPROXY and Backend API servers secure/insecure mode #1016
  • Make arktos-scaleout-up script run on a single dev vm (one-box) similar to arktos-up #999