
Scale Out PoC

Xiaoning Ding edited this page Dec 10, 2020 · 41 revisions

This document tracks the milestones, tasks, and ideas related to scaling out Arktos.

Tasks

Milestone 1 (One tenant partition, One resource partition, API Gateway)

API gateway:

  • Set up a proxy to statically route requests: 1) for "nodes" and "nodeLeases", redirect to the resource partition; 2) for everything else, redirect to the tenant partition
  • Workaround: redirect some requests (for common system-level URLs) based on client IP
  • Switch to HAProxy if the watch issue of Nginx blocks or fails the scalability test
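
The static routing rule above can be sketched as a small reverse proxy. This is a minimal illustration under assumptions, not the actual gateway: the endpoint constants, port, and path matching are made up for the sketch.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// Hypothetical partition endpoints; a real deployment would configure these.
const (
	resourcePartition = "http://rp.internal:8080"
	tenantPartition   = "http://tp.internal:8080"
)

// targetFor applies the static rule: node and node-lease requests go to the
// resource partition, everything else to the tenant partition.
func targetFor(path string) string {
	p := strings.ToLower(path)
	if strings.Contains(p, "/nodes") || strings.Contains(p, "/leases") {
		return resourcePartition
	}
	return tenantPartition
}

func main() {
	proxy := &httputil.ReverseProxy{
		// Director rewrites each request to point at the chosen partition.
		Director: func(req *http.Request) {
			target, err := url.Parse(targetFor(req.URL.Path))
			if err != nil {
				return
			}
			req.URL.Scheme = target.Scheme
			req.URL.Host = target.Host
		},
	}
	log.Fatal(http.ListenAndServe(":8888", proxy))
}
```

The client-IP workaround and the Nginx-vs-HAProxy question would layer on top of this same routing decision.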

Build&Test Scripts:

  • Have simple scripts to launch a tenant partition, a resource partition and an API gateway.
  • Do we need nodeIPAM controller?
  • Update Kubemark and perf test to work with the scale-out design.
  • The service token issue in NodeLifecycle controller?
  • A new branch for scale-out PoC
  • Prometheus settings for multi API Servers

Investigation:

  • Why was there huge latency? -- (Addressed: added HTTP 1.1)
  • Intermittent timeout for watch requests on proxy
  • Manual e2e tests for key flows
  • scheduler: node informer using a new client
  • kubelet: node and lease update using a new client

Milestone 2 (Multiple tenant partitions, one resource partition)

  • Proxy configuration changes to support multiple tenant partitions
  • HAProxy configuration
  • Cherry-pick kubemark & setup changes from m1 branch
  • Kubelet changes
    • Use multiple kube clients to watch all tenant partitions
    • Pods keep crashing and restarting
    • Use the right kube client to update pod status
    • Use the right kube client for pod deletion
    • Use the right kube client for PVs
    • Fix the multi-tenancy bugs in the statefulSet controller
    • How to handle resources in system space which can come from any tenant partition
  • Node controller changes
    • Use multiple kube clients to watch pods/daemonSets
  • Kubemark component changes to support multiple clients
  • Test setup & script changes
    • Dynamically configure proxy with TP/RP IPs
    • Hollow nodes talk to multiple TPs
    • Create two tenants on TP1 and TP2 in the test deployment script
    • Change test config to distribute workloads to the multiple tenants
    • Change metrics collection part to support multi tenants
    • Aggregate query results in tests at the tenant level
  • Investigation
    • "endpoint not found" error in resource partition
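
The "multiple kube clients" items above (kubelet watching all tenant partitions, node controller watching pods/daemonSets across TPs) boil down to a fan-in: one watch stream per tenant partition merged into a single event channel. A minimal sketch of that pattern, with plain channels standing in for client-go watch interfaces and all names illustrative:

```go
package main

import "sync"

// Event stands in for a watch event delivered by one tenant partition's
// kube client.
type Event struct {
	Partition string
	Object    string
}

// mergeWatches fans events from one channel per tenant partition into a
// single stream, closing the output once every input channel is closed.
func mergeWatches(inputs ...<-chan Event) <-chan Event {
	out := make(chan Event)
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(c <-chan Event) {
			defer wg.Done()
			for ev := range c {
				out <- ev
			}
		}(in)
	}
	// Close the merged stream only after all per-partition streams end.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}
```

A consumer such as the kubelet sync loop can then range over the single merged channel, and the "use the right kube client" tasks reduce to keeping the Partition field attached so status updates and deletions go back to the partition the object came from.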

Future Milestones

  • Kubectl: use the right client to send events to different tenant partitions and resource partitions
  • TP API Server needs to get node objects for commands like "kubectl attach/log/.."
  • Update node controller to use a separate client for objects in tenant partition
  • Improve Kubemark to simulate more than 20K nodes.
  • Change Proxy (API Gateway) to use dynamic map lookup instead of static configuration.
  • Reduce the number of watch connections on a tenant partition.
  • One scheduler instance can't cache unlimited nodes; it needs optimizations like node preferences.
  • Scheduler sampling to improve scheduling throughput.
  • Reduce scheduling conflicts (still node preference?)
  • Some controllers in tenant partitions also cache all nodes; they need to change as well.
  • Current node controller caches all pods and statefulsets.
  • Try to avoid aggregated watch on API gateway -- does "node.podList" help?
  • Do we still need to coordinate etcd revisions among partitions? Probably not. But need more detailed analysis and some changes.
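
For the item on replacing the gateway's static configuration with dynamic map lookup, the core piece is a concurrent tenant-to-partition table that can be updated while the proxy is serving. A hedged sketch of that table; the type and method names are made up for illustration:

```go
package main

import "sync"

// partitionMap maps a tenant name to its tenant-partition endpoint and can
// be updated at runtime, unlike a static proxy configuration.
type partitionMap struct {
	mu sync.RWMutex
	m  map[string]string
}

func newPartitionMap() *partitionMap {
	return &partitionMap{m: make(map[string]string)}
}

// Set registers (or moves) a tenant to a partition endpoint.
func (p *partitionMap) Set(tenant, endpoint string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.m[tenant] = endpoint
}

// Lookup returns the endpoint for a tenant and whether the tenant is known.
func (p *partitionMap) Lookup(tenant string) (string, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	ep, ok := p.m[tenant]
	return ep, ok
}
```

The gateway's routing function would extract the tenant from the request and consult this map, falling back to a default (or an error) for unknown tenants; the same table could also back the "send events to different tenant partitions" kubectl item.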