Scale Out PoC

Xiaoning Ding edited this page Nov 11, 2020 · 41 revisions

This document tracks the milestones, tasks and ideas related to scaling out Arktos.

Tasks

Milestone 1 (One tenant partition, One resource partition, API Gateway)

API gateway:

  • Set up a proxy to statically route requests: 1) requests for "nodes" and "nodeLeases" go to the resource partition; 2) everything else goes to the tenant partition
  • Workaround: redirect some requests (for common system-level URLs) based on client IP
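The static routing rule above can be sketched as a pure function from request path to backend. This is a minimal illustration, not the PoC's proxy configuration: the backend labels are hypothetical, and the two path prefixes assume upstream Kubernetes URL shapes for node and node-lease objects, which may differ in Arktos.

```go
package main

import (
	"fmt"
	"strings"
)

// Backend labels are hypothetical placeholders, not Arktos identifiers.
const (
	resourcePartition = "resource-partition"
	tenantPartition   = "tenant-partition"
)

// resourcePrefixes lists the URL prefixes that are statically routed to the
// resource partition. Assumption: upstream Kubernetes path shapes for nodes
// and node leases.
var resourcePrefixes = []string{
	"/api/v1/nodes",
	"/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases",
}

// pickBackend returns the partition that should serve the given URL path.
// A prefix matches only on a whole path segment, so "/api/v1/nodesfoo" does
// not count as a "nodes" request.
func pickBackend(path string) string {
	for _, p := range resourcePrefixes {
		if path == p || strings.HasPrefix(path, p+"/") {
			return resourcePartition
		}
	}
	return tenantPartition
}

func main() {
	paths := []string{
		"/api/v1/nodes",
		"/api/v1/nodes/node-1/status",
		"/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-1",
		"/api/v1/namespaces/default/pods",
	}
	for _, p := range paths {
		fmt.Printf("%s -> %s\n", p, pickBackend(p))
	}
}
```

Keeping the decision in one function makes the rule easy to unit test separately from whatever proxy forwards the request.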

Build & Test Scripts:

  • Have simple scripts to launch a tenant partition, a resource partition and an API gateway.
  • Do we need the nodeIPAM controller?
  • Update Kubemark and perf test to work with the scale-out design.
  • The service token issue in the NodeLifecycle controller?
  • A new branch for scale-out PoC
  • Prometheus settings for multi API Servers

Investigation:

  • Why is there such high latency? (Added HTTP 1.1)
  • Intermittent timeouts for watch requests on the proxy
  • Nodes sometimes become not ready (on hold until it reproduces and proves significant)
  • Manual e2e tests for key flows
  • scheduler: node informer using a new client
  • kubelet: node and lease update using a new client

Milestone 2 (More partitions)

  • Proxy configuration changes to support multiple tenant partitions
  • Kubelet changes
    • Use multiple kube clients to watch all tenant partitions
    • Use the right kube client to update pod status
  • Node controller changes
    • Use multiple kube clients to watch pods/daemonSets
  • Support multi-tenants in perf tests (can be hard-coded tenant names)

Future Milestones

  • TP API Server needs to get node objects for commands like "kubectl attach/log/.."
  • Update node controller to use a separate client for objects in tenant partition
  • Improve Kubemark to simulate more than 20K nodes.
  • Change Proxy (API Gateway) to use dynamic map lookup instead of static configuration.
  • Reduce the number of watch connections on a tenant partition.
  • One scheduler instance can't cache unlimited nodes. Needs optimizations like node preferences.
  • Scheduler sampling to improve scheduling throughput.
  • Reduce scheduling conflicts (still node preference?)
  • Some controllers in tenant partitions also cache all nodes. This needs to change as well.
  • The current node controller caches all pods and statefulsets.
  • Try to avoid aggregated watch on API gateway -- does "node.podList" help?
  • Do we still need to coordinate etcd revisions among partitions? Probably not, but this needs more detailed analysis and some changes.
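For the item on replacing the gateway's static configuration with a dynamic map lookup, a minimal sketch is a concurrency-safe table from tenant name to tenant-partition endpoint that can be updated while requests are being served. The type name, method names, and endpoint URLs below are all hypothetical; how the gateway extracts the tenant from a request, and where updates to the table come from, are left open.

```go
package main

import (
	"fmt"
	"sync"
)

// routeTable maps a tenant name to the endpoint of the tenant partition that
// serves it. Lookups take a read lock so concurrent requests do not block
// each other; updates take the write lock.
type routeTable struct {
	mu       sync.RWMutex
	backends map[string]string
}

func newRouteTable() *routeTable {
	return &routeTable{backends: make(map[string]string)}
}

// Set adds or reassigns the partition endpoint for a tenant.
func (t *routeTable) Set(tenant, backend string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.backends[tenant] = backend
}

// Remove drops a tenant from the table.
func (t *routeTable) Remove(tenant string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.backends, tenant)
}

// Lookup returns the endpoint for a tenant and whether one is registered.
func (t *routeTable) Lookup(tenant string) (string, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	backend, ok := t.backends[tenant]
	return backend, ok
}

func main() {
	rt := newRouteTable()
	// Endpoint URLs are placeholders.
	rt.Set("tenant-a", "https://tp-1.example:6443")
	rt.Set("tenant-b", "https://tp-2.example:6443")

	// Reassign a tenant at runtime without restarting the gateway.
	rt.Set("tenant-a", "https://tp-2.example:6443")

	for _, tenant := range []string{"tenant-a", "tenant-b", "tenant-c"} {
		if backend, ok := rt.Lookup(tenant); ok {
			fmt.Printf("%s -> %s\n", tenant, backend)
		} else {
			fmt.Printf("%s -> no partition registered\n", tenant)
		}
	}
}
```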