Computing Backlog
Mengni Zhang edited this page Jun 17, 2021 · 23 revisions
This page lists features and issues that will be addressed in Centaurus Arktos.
- Daemonset handling in RP - 930 system tenant only, multi-tenancy daemonset out of scope - Hongwei
- System partition pod handling - design 930 - YingH.
- Current system tenant objects are all saved in etcd with "system" as part of the directory path. Customizing the system object etcd directory is possible.
- Proposal: each TP has its own system objects and does not share them with other TPs. This avoids the single-point-of-failure and data population issues.
- Need to check, case by case, whether multiple system pods from different TPs can be deployed on the same VM and cause conflicts.
- Currently we only have system DNS pods and virtlet pods. DNS pods can be deployed to the same VM as long as there is no storage conflict. Need to check virtlet pods.
- Scheduler benchmark - by usage requirement
- Burst scheduling support
- 930 QPS 40 per 10K cluster only
- 1TP/1RP QPS 40
- 50K cluster QPS 200: 5TP/5RP, 3TP/4RP possible < 200
- Post 930 distribute VM allocation, VM start up latency, etc.
- Dynamic add/delete TP/RP - post 2021
- Create Tenant requests need to go through the proxy and be properly redirected to the correct TP API server #1056
- Click2Cloud
- Scale out
- Upper limit for 1TP/1RP cluster capacity 930 - YingH.
- Verified 1TP/1RP QPS 20, 15K cluster passed density test (4/6/2021 pod start up latency p50 1.8s, p99 4.4s)
- Consider enabling affinity (10%?) in the density test for production environment prediction, not for regular perf tests - post 930
- Minimum TP/RP combination for 50K nodes 930 - YingH.
- Overall cluster size >= 100K - post 930
- Scale up - post 930
- Cluster size >= 10K
- Improve cluster throughput without pod start up latency loss (<5s) - YingH/Yunwen/Hongwei
- Release v0.8 support 100 QPS (20 per TP) with pod start up latency < 6s
- Refactor cluster up script, make cluster start up parallel, reduce cluster up time #1036
- Start proxy at the end (Yunwen - Done PR 1105)
- TP/RP start concurrently (Hongwei - TODO)
- Support VM scheduling in density test - post 930
- Simulate delay only?
- What to test
- Cluster size?
- Metrics platform migration - post 930
- Complete migration from metrics server to Prometheus PR #980
- Get correct API responsiveness data
- perftest/cloudloader2 to support multiple TP/RP clusters in getting node resources #1101 - Yunwen
- Failed to change ttl annotation for hollow-node #1054 - Resolved
- Binding error in large scale perf runs #1089 - YingH.
- GC controller queries its own master nodes' lease info and causes 404 errors in haproxy #1047 - Resolved
- GC node lease object in Arktos #1106
- fix usage of KubeTPClients[0] in kubelet for scale-out #974
- Load test failed: timed out waiting for pods using volumes to be running #978
- Excessive pod start up latency in 50K density test #1102
- secret not found in kubelet [system tenant only] #1052 - low (local)
- pod scheduler reported bound successfully but not appear in local [system tenant only] #1049 - system pod handling
- 500 nodes load run finished with error: DaemonSets timeout #1007
- Arktos needs to migrate CSINodeInfo to GA #1099
- Baremetal provision
- Container encapsulate VM - refactor
- Secure Container virtualization
- gvisor
- Microsoft 2019 DX
- Kata
- Check VM and Resource Utilization Roadmap#Post 730-2020
- Support multi-tenancy network
- Pod Affinity check should use tenant-namespace-podName to compare pods. #1100 (TBD)
- Mizar integration
- Alcor integration (TBD)
- Refactor arktos-up/scale-out script to share common code [#1098](https://github.com/CentaurusInfra/arktos/issues/1098)
- hack/arktos-worker-up.sh does not get network plugin installed #1067
- cleanup unused /tmp/xxx-ip.txt entries in kubemark scripts #1018
- revisit controls on HAPROXY and Backend API servers secure/insecure mode #1016
- Make arktos-scaleout-up script run on a single dev vm (one-box) similar to arktos-up #999
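
The burst scheduling and scale-out targets listed above (QPS 40 for 1TP/1RP, QPS 200 for a 50K cluster with 5 TPs, and 100 QPS at 20 per TP for release v0.8) can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes aggregate QPS scales linearly with the number of TPs, which the backlog does not state, and the names `aggregate_qps` and `min_tps_for` are hypothetical helpers, not Arktos code.

```python
# Hypothetical planning helpers, not part of Arktos.
# Assumption: aggregate QPS scales linearly with the number of TPs.

QPS_PER_TP = 40  # from the backlog: 1TP/1RP targets QPS 40


def aggregate_qps(tp_count: int, qps_per_tp: int = QPS_PER_TP) -> int:
    """Estimated cluster-wide QPS for a given number of TPs."""
    if tp_count < 1:
        raise ValueError("tp_count must be at least 1")
    return tp_count * qps_per_tp


def min_tps_for(target_qps: int, qps_per_tp: int = QPS_PER_TP) -> int:
    """Smallest number of TPs whose estimated QPS reaches target_qps."""
    if target_qps < 1 or qps_per_tp < 1:
        raise ValueError("target_qps and qps_per_tp must be positive")
    return -(-target_qps // qps_per_tp)  # ceiling division


if __name__ == "__main__":
    for tps in (1, 3, 5):
        print(f"{tps} TP(s): estimated QPS {aggregate_qps(tps)}")
    print("TPs needed for QPS 200:", min_tps_for(200))
    print("TPs needed for QPS 100 at 20 per TP:", min_tps_for(100, qps_per_tp=20))
```

Under this linear assumption, 3 TPs give an estimate of 120, which is consistent with the note that 3TP/4RP would likely stay below 200. Measured throughput may differ, since RP count and pod start up latency limits are not modeled here.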