Scale Out PoC

This document tracks the milestones, tasks, and ideas related to scaling out Arktos.

Tasks

Milestone 1 (one tenant partition, one resource partition, API gateway)

API gateway:

  • Set up a proxy to statically route requests: 1) requests for "nodes" and "nodeLeases" are redirected to the resource partition; 2) everything else is redirected to the tenant partition (see the Go sketch after this list)
  • Workaround: redirect some requests (for common system-level URLs) based on client IP
  • Switch to HAProxy if the Nginx watch issue blocks or fails the scalability test
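
A minimal Go sketch of the static routing rule in the first item above, using only the standard library. The partition endpoints and listen port are placeholders, and TLS handling is left out; this is a sketch of the idea, not the actual gateway implementation.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Placeholder endpoints; a real deployment would read these from config.
	tpURL, _ := url.Parse("http://tenant-partition-apiserver:8080")
	rpURL, _ := url.Parse("http://resource-partition-apiserver:8080")

	tpProxy := httputil.NewSingleHostReverseProxy(tpURL)
	rpProxy := httputil.NewSingleHostReverseProxy(rpURL)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Node and node-lease traffic goes to the resource partition;
		// everything else goes to the tenant partition.
		p := r.URL.Path
		if strings.Contains(p, "/nodes") || strings.Contains(p, "/leases") {
			rpProxy.ServeHTTP(w, r)
			return
		}
		tpProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```

Watch requests are long-lived streaming responses, which is why proxy timeouts and buffering (the Nginx watch issue above) matter for this component.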

Build & Test Scripts:

  • Have simple scripts to launch a tenant partition, a resource partition, and an API gateway.
  • Do we need the nodeIPAM controller?
  • Update Kubemark and the perf tests to work with the scale-out design.
  • The service token issue in the NodeLifecycle controller?
  • Create a new branch for the scale-out PoC
  • Prometheus settings for multiple API servers
  • Fix issues that emerged during test runs
    • timeout errors during pod deletion
    • pod creation errors starting from 5K (possibly due to the proxy?)

Investigation:

  • Why is there a huge latency delay? (Addressed by adding HTTP 1.1)
  • Intermittent timeouts for watch requests on the proxy
  • Manual e2e tests for key flows
  • Scheduler: node informer using a new client
  • Kubelet: node and lease updates using a new client (see the sketch below)
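
A minimal sketch of the "new client" idea in the last two items, written against vanilla client-go: build one clientset from a tenant-partition kubeconfig and a second one from a resource-partition kubeconfig, and back the node informer with the RP client while pods and everything else stay on the TP client. The kubeconfig paths and resync period are assumptions; the actual wiring in the Arktos scheduler and kubelet may differ.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig locations for the two partitions.
	tpCfg, err := clientcmd.BuildConfigFromFlags("", "/etc/arktos/tenant-partition.kubeconfig")
	if err != nil {
		panic(err)
	}
	rpCfg, err := clientcmd.BuildConfigFromFlags("", "/etc/arktos/resource-partition.kubeconfig")
	if err != nil {
		panic(err)
	}

	tpClient := kubernetes.NewForConfigOrDie(tpCfg) // pods, services, ... live in the tenant partition
	rpClient := kubernetes.NewForConfigOrDie(rpCfg) // nodes and node leases live in the resource partition

	stop := make(chan struct{})

	// Node informer backed by the resource-partition client.
	rpFactory := informers.NewSharedInformerFactory(rpClient, 30*time.Second)
	nodeLister := rpFactory.Core().V1().Nodes().Lister()

	// Pod informer (and the rest) stay on the tenant-partition client.
	tpFactory := informers.NewSharedInformerFactory(tpClient, 30*time.Second)
	podLister := tpFactory.Core().V1().Pods().Lister()

	rpFactory.Start(stop)
	tpFactory.Start(stop)

	_, _ = nodeLister, podLister // the scheduler/kubelet would consume these listers
	<-stop
}
```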

Milestone 2 (Multiple tenant partitions, one resource partition)

  • Proxy configuration changes to support multiple tenant partitions

  • HAProxy configuration

  • Cherry-pick Kubemark & setup changes from the m1 branch

  • Kubelet changes

    • Use multiple kube clients to watch all tenant partitions (see the Go sketch after this sub-list)
    • Pods keep crashing and restarting
    • Use the right kube client to update pod status
    • Use the right kube client for pod deletion
    • Use the right kube client for PVs
    • Fix the multi-tenancy bugs in the StatefulSet controller
    • Bug fix: always use the RP client for all node status & lease updates
    • Bug fix: concurrent map access error. Verify it is the same issue on the master branch; cherry-pick all recent improvements if it is.
    • How to handle resources in the system space, which can come from any tenant partition
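
A rough sketch, in vanilla client-go terms, of the "use the right kube client" items above: the kubelet keeps one client per tenant partition and picks the owning partition's client before writing a pod's status. The tenant lookup helper and the fallback to a "system" client are assumptions for illustration, and Arktos' tenant-scoped client methods would replace the plain Pods() accessor used here.

```go
package kubeletsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tpClients maps a tenant name to the clientset of the tenant partition that
// owns it. How the kubelet learns this mapping (static config, gateway
// lookup, ...) is one of the open questions in this milestone.
type tpClients map[string]kubernetes.Interface

// updatePodStatus writes the pod status through the client of the tenant
// partition that owns the pod, instead of a single global client.
// tenantOf is a hypothetical helper; in Arktos the tenant is carried in the
// object metadata.
func updatePodStatus(ctx context.Context, clients tpClients, pod *v1.Pod, tenantOf func(*v1.Pod) string) error {
	client, ok := clients[tenantOf(pod)]
	if !ok {
		// Resources in the system space can come from any tenant partition;
		// falling back to a designated "system" client is just one option.
		client = clients["system"]
	}
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```
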
  • Node controller changes

    • Use multiple kube clients to watch pods/daemonSets (see the sketch below)
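
A companion sketch for the node controller item above: start one pod informer per tenant-partition client and funnel all events into the same callback, so a controller running against the resource partition still sees pods from every tenant partition. Again vanilla client-go, with placeholder names; daemonSets would follow the same pattern.

```go
package nodecontrollersketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startPodInformers starts one pod informer per tenant-partition client and
// routes every add/update event to the same handler.
func startPodInformers(tpClients []kubernetes.Interface, onPod func(*v1.Pod), stop <-chan struct{}) {
	for _, client := range tpClients {
		factory := informers.NewSharedInformerFactory(client, 30*time.Second)
		factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				if pod, ok := obj.(*v1.Pod); ok {
					onPod(pod)
				}
			},
			UpdateFunc: func(_, newObj interface{}) {
				if pod, ok := newObj.(*v1.Pod); ok {
					onPod(pod)
				}
			},
		})
		factory.Start(stop)
	}
}
```
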
  • Kubemark component changes to support multiple clients

  • Test setup & script changes

    • Dynamically configure proxy with TP/RP IPs
    • Hollow nodes talk to multiple TPs
    • Create two tenants on TP1 and TP2 in the test deployment script
    • Change test to create workloads under a non-system tenant for density test
    • Pods stuck in "terminating" status when the test tries to delete them
    • Change test to create workloads under a non-system tenant for load test
    • Two parallel perf-tests for different tenants, supporting system space requests
    • Verify whether the tests require system resource access
    • Update wiki for scale-out test setup
  • Investigation

    • "endpoint not found" error in resource partition

Future Milestones

  • Kubectl: design support for static pods
  • Kubectl: use the right client to send events to different tenant partitions and resource partitions
  • TP API Server needs to get node objects for commands like "kubectl attach/log/.."
  • Update node controller to use a separate client for objects in tenant partition
  • Improve Kubemark to simulate more than 20K nodes.
  • Change the proxy (API gateway) to use a dynamic map lookup instead of static configuration (see the sketch after this list).
  • Reduce the number of watch connections on a tenant partition.
  • One scheduler instance can't cache an unlimited number of nodes; it needs optimizations like node preferences.
  • Scheduler sampling to improve scheduling throughput.
  • Reduce scheduling conflicts (still node preference?)
  • Some controllers in tenant partitions also cache all nodes; they need changes as well.
  • The current node controller caches all pods and StatefulSets.
  • Try to avoid aggregated watches on the API gateway; does "node.podList" help?
  • Do we still need to coordinate etcd revisions among partitions? Probably not, but this needs more detailed analysis and some changes.
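
A sketch of what the "dynamic map lookup" item above could look like in Go: a concurrency-safe tenant-to-partition routing table that the gateway consults per request instead of a static config file. How entries get populated and refreshed (a registry, a watch, an admin API) is not designed yet, and all names here, including the tenantFromPath helper, are placeholders.

```go
package gatewaysketch

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

// routingTable maps a tenant name to a reverse proxy for the tenant
// partition that owns it.
type routingTable struct {
	mu       sync.RWMutex
	byTenant map[string]*httputil.ReverseProxy
	fallback *httputil.ReverseProxy // e.g. for system-space requests
}

// set registers (or replaces) the partition endpoint that serves a tenant.
func (t *routingTable) set(tenant, endpoint string) error {
	u, err := url.Parse(endpoint)
	if err != nil {
		return err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.byTenant == nil {
		t.byTenant = map[string]*httputil.ReverseProxy{}
	}
	t.byTenant[tenant] = httputil.NewSingleHostReverseProxy(u)
	return nil
}

// route forwards the request to the partition owning the tenant, falling
// back to the default proxy when the tenant is unknown. tenantFromPath is a
// hypothetical helper that extracts the tenant from the request URL.
func (t *routingTable) route(w http.ResponseWriter, r *http.Request, tenantFromPath func(string) string) {
	t.mu.RLock()
	proxy, ok := t.byTenant[tenantFromPath(r.URL.Path)]
	if !ok {
		proxy = t.fallback
	}
	t.mu.RUnlock()
	proxy.ServeHTTP(w, r)
}
```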

Links

M2 Architecture Diagram

How to setup scale-out test