Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch dry run but like for real this time #943

Closed

Conversation

timothysmith0609
Copy link
Contributor

@timothysmith0609 timothysmith0609 commented Jan 19, 2024

Part of https://github.com/Shopify/infrastructure-tooling/issues/686

Related: #797

What are you trying to accomplish with this PR?
#781 implemented batched dry-run kubectl apply as part of the resource validation phase. For what I'm sure was a compelling reason at the time, if batched dry-run fails, we fallback to doing a resource-by-resource check. Unfortunately for larger resource sets, this means we end up making N kubectl calls as we individually check each resource: the very issue #781 was meant to solve.

It turns out batch_dry_run pathologically fails because we were attempting to apply non-applyable resources (e.g. Pods with generateName, etc.). The solution is to partition resources into applyable/non-applyable and batch dry-run only the former. A quick top hat proves this works.

With that out of the way, the next thing we should do is just trust the batch results and raise a FatalDeploymentError if it doesn't succeed.

@timothysmith0609 timothysmith0609 force-pushed the tsmith/batch-dry-run-but-like-for-real-this-time branch from 03703cc to e5fe76b Compare January 22, 2024 16:08
Comment on lines 288 to 293
validate_globals(resources)
applyables, individuals = resources.partition { |r| r.deploy_method == :apply }
batch_dry_run_success = validate_dry_run(applyables)
resources_to_validate = batch_dry_run_success ? individuals : resources

Krane::Concurrency.split_across_threads(resources_to_validate) do |r|
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I wanted to to just trust the output of the batched dry run test and raise a FatalDeploymentError. However, that made it very difficult to give a consistent UX with regards to giving useful feedback: the main issue being it became necessary to suppress all feedback if a sensitive document/secret was contained in the apply set.

To that end, I've kept things pretty similar, just removing resources from the per-resource checks if batched dry-run succeeds. If it fails, since that's a less common case, we default to per-resource checking in order to provide more granular feedback and highlight what the actual issue it. This is the difference between "Something somewhere went wrong" and "THIS particular file and THIS particular line are wrong".

Comment on lines +23 to +27
test_suite:
- "unit_test"
- "cli_test"
- "serial_integration_test"
- "integration_test"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integration tests are flaky and take a long time to run. Adding this extra dimension so we don't have to sit through unit/cli/serial tests each time integration fails

Comment on lines 294 to +296
Krane::Concurrency.split_across_threads(resources) do |r|
# No need to pass in kubectl (and do per-resource dry run apply) if batch dry run succeeded
if batch_dry_run_success
r.validate_definition(kubectl: nil, selector: @selector, dry_run: false)
else
r.validate_definition(kubectl: kubectl, selector: @selector, dry_run: true)
end
# No need to pass in kubectl as we batch dry run server-side apply above
r.validate_definition(kubectl: nil, selector: @selector, dry_run: false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still validate ALL resources, because there are additional krane-specific checks. However, we never pass in kubectl so we never end up running a dry-run apply (neither client-side nor server-side)

Comment on lines +320 to +324
message = "This command is namespaced and cannot be used to deploy global resources. "\
"Use GlobalDeployTask instead."
@logger.summary.add_paragraph(ColorizedString.new(message).yellow)
@logger.summary.add_paragraph(ColorizedString.new("Global resources:\n#{global_names}").yellow)
raise FatalDeploymentError, "This command is namespaced and cannot be used to deploy global resources. "\
"Use GlobalDeployTask instead."
raise FatalDeploymentError
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moves around log lines to make the FatalDeploymentError handler simpler while preserving log output

tags = statsd_tags + (dry_run ? ['dry_run:true'] : ['dry_run:false'])
Krane::StatsD.client.distribution('apply_all.duration', Krane::StatsD.duration(start), tags: tags)
if st.success?
log_pruning(out) if prune
elsif dry_run
record_dry_run_apply_failure(err, resources: resources)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging is just different enough to require separate methods for the dry-run and actual-run failures

@@ -75,10 +75,9 @@ def test_apply_failure_with_sensitive_resources_hides_template_content
refute_logs_match(%r{Kubectl err:.*something/invalid})

assert_logs_match_all([
"Command failed: apply -f",
"Template validation failed",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously these errors would occur during the deploy step (since we threw away batch dry-run apply and fell back to client-side validations). Now that we trust batch dry-run, the failure occurs earlier

Comment on lines +59 to +79
# def test_service_account_predeployed_before_unmanaged_pod
# # Add a valid service account in unmanaged pod
# service_account_name = "build-robot"
# result = deploy_fixtures("hello-cloud",
# subset: ["configmap-data.yml", "unmanaged-pod-1.yml.erb", "service-account.yml"],
# render_erb: true) do |fixtures|
# pod = fixtures["unmanaged-pod-1.yml.erb"]["Pod"].first
# pod["spec"]["serviceAccountName"] = service_account_name
# pod["spec"]["automountServiceAccountToken"] = false
# end
# # Expect the service account is deployed before the unmanaged pod
# assert_deploy_success(result)
# hello_cloud = FixtureSetAssertions::HelloCloud.new(@namespace)
# hello_cloud.assert_configmap_data_present
# hello_cloud.assert_all_service_accounts_up
# hello_cloud.assert_unmanaged_pod_statuses("Succeeded", 1)
# assert_logs_match_all([
# %r{Successfully deployed in \d.\ds: ServiceAccount/build-robot},
# %r{Successfully deployed in \d+.\ds: Pod/unmanaged-pod-.*},
# ], in_order: true)
# end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the last error I went to fix and ends up being the most important. If we rely on the output of batched dry-run, we cannot deploy resources that are codependent on one another, even if such an apply is valid and would work in practice.

For example, in this test the Pod resource references a ServiceAccount that will be deployed alongside. This fails dry-run because the ServiceAccount does not actually exist on the server. Krane` gets around this by predeploying certain resources (like ServiceAccounts...).

I'm not sure what the path forward here is. In reality, it points to a fundamental, irreconcilable, issue: krane's built-in pacing makes it impossible to server-side dry-run validate sets of resources that include co-dependencies. That really only leaves:

  • resource-by-resource server-side dry run
  • batch client-side dry run (this handles the case above but is less robust, overall)

I'm tempted to just remove the batch dry-run and keep the resource-by-resource implementation. For core we can rely on bypassing this entirely with the work in #944 and Production Platform tooling.

@timothysmith0609
Copy link
Contributor Author

Replaced by #946

@timothysmith0609 timothysmith0609 deleted the tsmith/batch-dry-run-but-like-for-real-this-time branch January 30, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant