Computing Backlog

Mengni Zhang edited this page Jun 17, 2021 · 23 revisions

This page lists the features and issues that will be addressed in Centaurus Arktos.

Scalability

Features

Scaleout Framework

Daemonset handling in RP - 930 system tenant only, multi-tenancy daemonset out of scope - Hongwei
System partition pod handling - design 930 - YingH.
- Current system tenant objects are all saved in etcd with "system" as part of the directory path. Customizing the system object etcd directory is possible
- Proposal: each TP has its own system objects, not shared with other TPs. This avoids the single point of failure issue and the data population issue
  - Need to check, case by case, whether multiple system pods from different TPs deployed on the same VM can cause conflicts.
    - Currently we only have system DNS pods and virtlet pods. DNS pods can be deployed to the same VM as long as there is no storage conflict. Need to check virtlet pods.
Scheduler benchmark - by usage requirement
Burst scheduling support
- 930 QPS 40 per 10K cluster only
  - 1TP/1RP QPS 40
  - 50K cluster QPS 200: 5TP/5RP, 3TP/4RP possible < 200
- Post 930 distribute VM allocation, VM start up latency, etc.
Dynamic add/delete TP/RP - post 2021
Create Tenant requests need to go through proxy and be properly redirected to correct TP API server #1056
- Click2Cloud
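
The burst scheduling targets listed above (QPS 40 per 10K cluster, QPS 200 for a 50K cluster) are consistent with a linear rule. A minimal sketch of that arithmetic, assuming linear scaling holds between those two points; the function name `target_qps` is hypothetical and not part of Arktos:

```python
def target_qps(nodes: int, qps_per_10k: int = 40) -> int:
    """Aggregate burst-scheduling QPS target for a cluster of `nodes` nodes.

    Assumption: the target scales linearly at `qps_per_10k` per 10K nodes,
    which matches the two data points in the backlog (10K -> 40, 50K -> 200).
    """
    return qps_per_10k * nodes // 10_000


if __name__ == "__main__":
    for size in (10_000, 50_000):
        print(size, target_qps(size))
```

The backlog notes that a 3TP/4RP layout may fall below 200 on a 50K cluster, so this figure is a target rather than a guaranteed result.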

Cluster size

Scale out
- Upper limit for 1TP/1RP cluster capacity 930 - YingH.
  - Verified 1TP/1RP QPS 20, 15K cluster passed density test (4/6/2021 pod start up latency p50 1.8s, p99 4.4s)
  - Consider enabling affinity (10%?) in density test for production environment prediction - not for regular perf test - post 930
- Minimum TP/RP combination for 50K nodes 930 - YingH.
- Overall cluster size >= 100K - post 930
Scale up - post 930
- Cluster size >= 10K

Throughput 930

Improve cluster throughput without pod start up latency loss (<5s) - YingH/Yunwen/Hongwei
- Release v0.8 support 100 QPS (20 per TP) with pod start up latency < 6s

Perf test tool

Refactor cluster up script, make cluster start up parallel, reduce cluster up time #1036
- Start proxy at the end (Yunwen - Done PR 1105)
- TP/RP start concurrently (Hongwei - TODO)
Support VM scheduling in density test - post 930
- Simulate delay only?
- What to test
- Cluster size?
Metrics platform migration - post 930
- Complete migration from metrics server to Prometheus PR #980
- Get correct API responsiveness data

Issues

Check

perftest/cloudloader2 to support multiple TP/RP clusters in getting node resources #1101 - Yunwen
Failed to change ttl annotation for hollow-node #1054 - Resolved
Binding error in large scale perf runs #1089 - YingH.
GC controller queries its own master nodes' lease info and cause 404 error in haproxy #1047 - Resolved
GC node lease object in Arktos #1106
fix usage of KubeTPClients[0] in kubelet for scale-out #974
load test failed time-out to waiting for pods using volumes to be running #978

Parking

Excessive pod start up latency in 50K density test #1102
secret not found in kubelet [system tenant only] #1052 - low (local)
pod scheduler reported bound successfully but not appear in local [system tenant only] #1049 - system pod handling
500 nodes load run finished with error: DaemonSets timeout #1007

Scheduler backporting leftover (TBD)

Arktos needs to migrate CSINodeInfo to GA #1099

Runtime - post 930

Features

Baremetal provision
Container encapsulate VM - refactor
Secure Container virtualization
- gvisor
- Microsoft 2019 DX
- Kata
Check VM and Resource Utilization Roadmap#Post 730-2020

Issues

Multi Tenancy

Features

Support multi-tenancy network

Issues

Pod Affinity check should use tenant-namespace-podName to compare pods. #1100 (TBD)

Network

Features

Mizar integration
Alcor integration (TBD)

Dev Tools/Script - post 930

Issues

Refactor arktos-up/scale-out script to share common code [#1098](https://github.com/CentaurusInfra/arktos/issues/1098)
hack/arktos-worker-up.sh does not get network plugin installed #1067
cleanup unused /tmp/xxx-ip.txt entries in kubemark scripts #1018
revisit controls on HAPROXY and Backend API servers secure/insecure mode #1016
Make arktos-scaleout-up script run on a single dev vm (one-box) similar to arktos-up #999