
[ws-manager] Re-create workspace pods on rejection #20243

Merged: 5 commits into main on Nov 15, 2024

Conversation


@geropl geropl commented Sep 26, 2024

Description

This PR enables ws-manager to re-create a workspace pod when it detects that the pod is stuck. This is done by stopping the failed workspace, waiting a configurable timeout (15s by default), and creating a pod with the same name/ID again.

The root cause is an unfixed race condition in the kubelet: node labels are out of sync, and suddenly the pod, although already scheduled to a node, no longer fits onto that node, so the kubelet marks it as "rejected", leaving it stuck.
It shows up on both AWS and GCP, and has a ~1% chance of occurring (from our rough investigation).

This change makes ws-manager detect this condition (which is luckily very straightforward and clearly distinguishable) and, once detected, mark the workspace with the PodRejected condition. From there on, the workspace is treated differently. This ensures that the change remains (nearly) zero-impact for the other 99% of workspaces (with the exception of some changes to context handling in ws-daemon's pod control loop, see below).
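
For illustration, a hedged sketch of what such a detection check can look like. The function name is an assumption, not this PR's actual code; the facts it relies on are standard Kubernetes behavior: a kubelet admission rejection leaves the pod in phase Failed while still bound to a node, with a reason such as "NodeAffinity" or "OutOfcpu".

```go
import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isPodRejected reports whether the kubelet rejected an already-scheduled pod.
// Sketch only: the actual check in status.go may use different reason strings.
func isPodRejected(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodFailed &&
		pod.Spec.NodeName != "" && // the pod was scheduled to a node...
		(pod.Status.Reason == "NodeAffinity" || strings.HasPrefix(pod.Status.Reason, "OutOf")) // ...but the kubelet turned it away
}
```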

To make sure this change does not break anything, it has been thoroughly stress-tested using loadgen plus a breaker program which "rejects" every created workspace exactly once (see the "DEV" commit).

TODO

  • add happy-path test
  • manual tests
    • regular workspace starts
    • prebuild workspace starts
    • regular workspace + forced "pod rejection"
  • add "WorkspaceStatus" test
  • do load test, as we saw the CPU variant on GCP as well (docs and docs)
  • discuss with folks to make sure we don't miss any edge cases
  • remove debug logging (UPPER CASE)
  • ❗ drop DEV commit before merging

Discussion: "Cool, but it's a big change. Is this worth it?"

Yes, I think so. And that's because:

  • a) it's fairly isolated: the change imposes very little risk to the majority (99%) of workspaces, which continue to function 💯% as they do today
  • b) it fixes a long-standing bug (a Kubernetes bug, granted, but one that directly affects Gitpod usability)

Changes in Detail

The general theme is that we maintain all the existing logic of starting and stopping workspaces. The only conceptual changes are:

  • Stopping a workspace with PodRejected works a bit differently, because a) we don't care about a backup, b) it might fail at any point during Pending/Creating, and c) we need to clean up everything we know about that workspace in ws-daemon, as there might be a new pod with the exact same ID in the near future
    • the brunt of the changes is actually about enabling ws-daemon to "wipe" a workspace in a synchronized way (this is only exercised in this failure condition!)
    • this includes a dedicated condition to communicate that this is done (next to BackupFailure/BackupComplete): StateWiped
  • once a workspace with PodRejected has been stopped, it eventually gets re-created

ws-manager

Two new config fields:

  • PodRecreationMaxRetries (default: 3): how many times we try to re-create a given workspace
  • PodRecreationBackoff (default: 15s): how long to wait before creating a new pod after the old one has been deleted
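
As a sketch, the two fields might look like this in the ws-manager configuration. Only the field names and defaults come from this PR; the struct name and JSON tags are assumptions:

```go
import "time"

type Configuration struct {
	// PodRecreationMaxRetries caps how many times a rejected workspace pod
	// is re-created before we give up. Default: 3.
	PodRecreationMaxRetries int `json:"podRecreationMaxRetries"`
	// PodRecreationBackoff is how long to wait after the old pod has been
	// deleted before creating the new one. Default: 15s.
	PodRecreationBackoff time.Duration `json:"podRecreationBackoff"`
}
```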

The workspace CRD gets two new fields:

  • .status.podRecreated: the number of times a workspace has gone through the re-creation logic (usually 0)
  • .status.podDeletionTime: the time the workspace pod was first seen removed by ws-manager (set in status.go)
    • this is used to make sure we wait PodRecreationBackoff before creating the new pod (see the sketch after this list)

There is a third field, podStarts, which has been there since forever and which is finally put to use now. 🙂
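
A hedged sketch of how these fields can gate the re-creation, assuming a controller-runtime reconciler (cf. workspace_controller.go mentioned below); the surrounding wiring is invented for illustration:

```go
// Give up once the retry budget is exhausted.
if ws.Status.PodRecreated >= cfg.PodRecreationMaxRetries {
	return ctrl.Result{}, nil // leave the workspace stopped
}
// Enforce the backoff relative to when the old pod was first seen gone.
if ws.Status.PodDeletionTime != nil {
	readyAt := ws.Status.PodDeletionTime.Time.Add(cfg.PodRecreationBackoff)
	if wait := time.Until(readyAt); wait > 0 {
		return ctrl.Result{RequeueAfter: wait}, nil
	}
}
ws.Status.PodRecreated++ // then create the pod with the same name/ID again
```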

Two new conditions were introduced (a minimal example of setting one follows the list):

  • PodRejected (set in status.go)
    • true: when the pod is failing (pending -> stopped)
    • false: when the pod is re-created
  • StateWiped (set in ws-daemon, and checked in status.go)
    • used to synchronize the "workspace state wiped" state from ws-daemon back to ws-manager. Only once this is set is a "rejected workspace" considered "stopped".
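
For example, setting PodRejected with the stock apimachinery helper. Gitpod's own condition helpers may differ, and the Reason/Message strings and wrapper function here are invented:

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPodRejected records on the workspace status that the kubelet rejected
// the pod. Sketch only; ws.Status.Conditions is assumed to be []metav1.Condition.
func markPodRejected(conditions *[]metav1.Condition) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "PodRejected",
		Status:  metav1.ConditionTrue,
		Reason:  "PodRejected", // Reason must be non-empty CamelCase
		Message: "kubelet rejected the pod after scheduling",
	})
}
```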

ws-daemon

Workspace control loop

On handleWorkspaceStop, we call doWipeWorkspace instead of doWorkspaceContentBackup. It calls workspace_operations.WipeWorkspace, and sets the StateWiped condition accordingly.

WipeWorkspace does (see the sketch after this list):

  • calls the regular "dispose" hooks
  • calls an additional "dispose" hook, which tries to call the IWS and, if successful, runs WipingTeardown
    • this unmounts the shiftfs mounts from inside the workspace
    • ℹ️ this might be a change we actually can use for all workspaces. But that's a separate discussion.
  • calls dispatch.DisposeWorkspace (see below)
  • disposes all stored metadata (like on regular shutdown)
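
A minimal sketch of that sequence. Only WipeWorkspace, WipingTeardown, and dispatch.DisposeWorkspace are names from this PR; every other identifier is assumed for illustration:

```go
func (o *WorkspaceOperations) WipeWorkspace(ctx context.Context, instanceID string) error {
	// 1. Run the regular "dispose" hooks.
	if err := o.runDisposeHooks(ctx, instanceID); err != nil {
		return err
	}
	// 2. Extra hook: best-effort teardown via the in-workspace service (IWS),
	//    which unmounts the shiftfs mounts from inside the workspace.
	if iws, err := o.connectIWS(ctx, instanceID); err == nil {
		_ = iws.WipingTeardown(ctx)
	}
	// 3. Stop everything the pod control loop holds for this workspace.
	o.dispatch.DisposeWorkspace(ctx, instanceID)
	// 4. Drop all stored metadata, as on a regular shutdown.
	return o.removeMetadata(instanceID)
}
```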

Pod control loop

This control loop is fueled by a PodInformer; before this change, its shutdown mechanics were not synced with the workspace control loop, and timing worked on a "best-effort" basis.

dispatch.DisposeWorkspace got added (which, again, is only called in the PodRejected case, cf. above), and it does the following (see the sketch after this list):

  • cancels the context of all running go routines
  • waits for all of them to report being cancelled (WaitGroup)
  • calls containerRuntime.DisposeContainer, which clears some internally held ID mappings between Workspace, Pod and Container
  • ℹ️ blocks future pod updates (excluding "deletions") from being processed (the pod is identified by .Name + .CreationTimestamp!)
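
A hedged sketch of that synchronization; type and field names are assumptions, only DisposeWorkspace and DisposeContainer come from this PR. The steps mirror the list above:

```go
func (d *Dispatch) DisposeWorkspace(ctx context.Context, instanceID string) {
	d.mu.Lock()
	state, ok := d.workspaces[instanceID]
	if ok {
		delete(d.workspaces, instanceID) // block future (non-delete) pod updates
	}
	d.mu.Unlock()
	if !ok {
		return
	}
	state.cancel()  // cancel the context of all running goroutines
	state.wg.Wait() // wait until every goroutine has acknowledged cancellation
	// Clear the internally held Workspace/Pod/Container ID mappings.
	_ = d.Runtime.DisposeContainer(ctx, instanceID)
}
```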

Diagram

I tried to capture the new state changes in a diagram (source):
[diagram image: Untitled-2024-10-08-1358]

Related Issue(s)

Fixes ENT-530

How to test

Check tests are sensible and 🟢

Manual: regular workspace

Manual: prebuild workspace

Manual: using loadgen and the "rejection"-program

Loom of my testing is here: https://www.loom.com/share/e2eae31965dc48829b415c81f3005f1b?sid=6b4acf8d-06d0-4d21-b3e2-ef3182f8e974
Summary screen:
[summary screenshot]

  1. open a workspace on this branch,
  2. connect to the ephemeral-gpl4 cluster (announce here so you don't block others)
  3. prepare to run a loadgen test as usual
  4. in parallel, start "rejection" tool: cd scheduler && go run main.go -kubeconfig ~/.kube/config
  5. run loadgen
  6. notice how all workspaces come up nicely 😌

TODO(gpl): add loom so others don't have to do this

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

@geropl geropl changed the title [WIP][ws-manager] Re-create workspace pods on rejection (incl. test) [WIP][ws-manager] Re-create workspace pods on rejection Oct 1, 2024
@roboquat roboquat added size/XL and removed size/L labels Oct 1, 2024
@geropl geropl force-pushed the gpl/530-retry branch 2 times, most recently from 2a35e06 to ac1c86b Compare October 8, 2024 14:43
github-actions bot (Contributor) commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the meta: stale This issue/PR is stale and will be closed soon label Oct 18, 2024
@github-actions github-actions bot closed this Oct 23, 2024

geropl commented Oct 24, 2024

Still work in progress

@geropl geropl reopened this Oct 24, 2024
@github-actions github-actions bot removed the meta: stale This issue/PR is stale and will be closed soon label Oct 24, 2024
@geropl geropl force-pushed the gpl/530-retry branch 2 times, most recently from 9aeba20 to d3b0ed0 Compare October 25, 2024 12:58

github-actions bot commented Nov 4, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the meta: stale This issue/PR is stale and will be closed soon label Nov 4, 2024
@roboquat roboquat added size/XXL and removed size/XL labels Nov 8, 2024
@github-actions github-actions bot removed the meta: stale This issue/PR is stale and will be closed soon label Nov 9, 2024
@geropl geropl force-pushed the gpl/530-retry branch 2 times, most recently from 669e328 to 2f27e7b Compare November 11, 2024 08:33

socket-security bot commented Nov 11, 2024

New dependencies detected. Learn more about Socket for GitHub ↗︎

| Package | New capabilities | Transitives | Size | Publisher |
| --- | --- | --- | --- | --- |
| golang/k8s.io/[email protected] | unsafe | 0 | 23.9 MB | |
| golang/k8s.io/[email protected] | environment, filesystem, network, shell, unsafe | 0 | 4.17 MB | |
| golang/k8s.io/[email protected] | environment, filesystem, network, shell, unsafe | 0 | 13.9 MB | |

View full report↗︎

@geropl geropl changed the title [WIP][ws-manager] Re-create workspace pods on rejection [ws-manager] Re-create workspace pods on rejection Nov 11, 2024
@geropl geropl marked this pull request as ready for review November 11, 2024 15:00
@geropl geropl requested review from a team as code owners November 11, 2024 15:00

@WVerlaek WVerlaek left a comment


Overall nice work!
I left a few comments/questions; I didn't look too deeply into the ws-daemon changes.

(Outdated, resolved review threads on: components/ws-daemon/pkg/cgroup/cgroup.go, components/ws-daemon/pkg/dispatch/dispatch.go (two threads), scheduler/main.go, components/ws-daemon/pkg/netlimit/netlimit.go)
```go
	}
}()

timeout := time.NewTicker(10 * time.Second)
```
Member:

Why do we restrict the delete to 10 seconds?

geropl (Member Author):

Added a comment.

tl;dr: If we don't have the timeout, there is a chance that this operation blocks forever (at least before I fixed the mark unmounting). Adding the timeout guarantees it never blocks the workspace from being deleted (because deletion depends on the "wiping" to finish).

We trade that for a slight chance of not wiping cleanly, which might become a problem only if the new workspace pod is scheduled to the same node again. I deem that ok, because then we are talking about an error case in the 0.1-0.01% range, which is already way better than the original 1%.
All of this, again, only applies if the mark unmounting did not already fix the problem in the first place. 👍
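
For reference, the same bound can be expressed with a context deadline; this is a sketch, not the PR's actual code, and wipeWorkspace plus the log call are stand-ins:

```go
wipeCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
if err := wipeWorkspace(wipeCtx, ws); err != nil {
	// Proceed with deletion anyway: an unclean wipe only matters if the new
	// pod lands on the same node again, a far rarer event than the ~1%
	// rejection rate this PR works around.
	log.WithError(err).Warn("workspace wipe did not finish in time")
}
```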


geropl commented Nov 14, 2024

@iQQBot I added the loom for the load test (incl. the rejector running, i.e. "breaking" every workspace pod once), and addressed all the other comments. Would be great if you could review, and do a manual smoke test on the preview (once it's there again) to verify that the "normal case" still works as expected.

Two things I still need to test (tomorrow):

  • have dev/rejector running against the preview
  • manually (+ via prebuild) start workspaces
  • check that the UX does not break 👍

Once that's done, we can merge. 🙂

@geropl geropl requested a review from a team as a code owner November 15, 2024 10:10

geropl commented Nov 15, 2024

Done testing! 🥳 🎉

Added a small logging improvement; otherwise I just need a ✔️ to 🚢!


@iQQBot iQQBot left a comment


Code LGTM. I tried regular start and stop; it works like before.

A question: if content-init has already started and the pod is re-scheduled to the same node, and the content is big enough, what will happen? Is there a possibility of a data-overlap issue?


geropl commented Nov 15, 2024

A question: if content-init has already started and the pod is re-scheduled to the same node, and the content is big enough, what will happen? Is there a possibility of a data-overlap issue?

The approach in this PR is the "wiping mode", as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (one, e.g., for content-init), so this should* not happen.

*: saying "should" because the ws-daemon code was surprisingly complex, as we have three layers of feedback loops (k8s workspace, k8s pod, containerd) that we have to synchronize. There still might be a hole somewhere, but after the extensive testing I'm absolutely sure we are talking about 0.01% here, instead of the 1% we have atm.


geropl commented Nov 15, 2024

/unhold

@roboquat roboquat merged commit 99cc66b into main Nov 15, 2024
18 checks passed
@roboquat roboquat deleted the gpl/530-retry branch November 15, 2024 12:28

iQQBot commented Nov 15, 2024

The approach in this PR is the "wiping mode", as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (one, e.g., for content-init), so this should* not happen.

content-init is a special case: it executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).


geropl commented Nov 15, 2024

content-init is a special case: it executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

But content-init happens synchronously. And all workspace-state related changes are handled by workspace_controller.go, using the k8s controller framework, which as I understand it handles one event at a time (i.e., it takes care of the synchronization).
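
Context for that claim: controller-runtime never reconciles the same object key concurrently, and overall parallelism is bounded by MaxConcurrentReconciles. A sketch of such a setup; the wiring, reconciler type, and import path for the Workspace CRD are assumptions, not necessarily this repo's actual code:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	workspacev1 "github.com/gitpod-io/gitpod/ws-manager/api/crd/v1" // assumed path
)

func setupController(mgr ctrl.Manager, r *WorkspaceReconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&workspacev1.Workspace{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}). // one event at a time
		Complete(r)
}
```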
