
Replace EKS test-infra with example #1192

Merged: 4 commits merged into hashicorp:master on Mar 12, 2021

Conversation

@dak1n1 (Contributor) commented Mar 12, 2021

This copies the EKS cluster from our examples into our test-infra, so we can use it for tests.

Description

Acceptance tests

  • Have you added an acceptance test for the functionality being added?
  • Have you run the acceptance tests on this branch?

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestAccXXX'

...

Release Note

Release note for CHANGELOG:

...

References

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

@ghost added the size/XXL label on Mar 12, 2021
@alexsomesan (Member) left a comment:


Thanks Stef!

I had a look and dropped some comments here and there. I hope I haven't missed anything.

I'll give this a try myself as soon as I find some spare cycles.

kubernetes/test-infra/eks/README.md: two review comments (outdated, resolved)

Diff context:

    write_kubeconfig = false

    workers_group_defaults = {
      root_volume_type = "gp2"
@alexsomesan (Member) commented:

GP2 volumes are generally more expensive.
Is there a technical need for nodes to run on GP2 volumes? We're likely not putting much workload on the nodes during our tests.

@dak1n1 (Contributor, author) replied:

gp2 was the default in the EKS module, but they recently updated it to gp3. I put it back to the old default because gp3 isn't available in all regions.

@dak1n1 (Contributor, author) added:

This is the commit where it was added. terraform-aws-modules/terraform-aws-eks@76537d1

@alexsomesan (Member) replied:

Oh, if that's the case then disregard my comment. I must have based it on old information and likely GPx volumes are now the only choice. Sorry for the confusion :)


## Replacing the EKS cluster and re-creating the Kubernetes / Helm resources

When the cluster is initially created, the Kubernetes and Helm providers will not be initialized until authentication details are created for the cluster. However, for future operations that may involve replacing the underlying cluster (for example, changing the network where the EKS cluster resides), the EKS cluster will have to be targeted without the Kubernetes/Helm providers, as shown below. This is done by removing the `module.kubernetes-config` from Terraform State prior to replacing cluster credentials, to avoid passing outdated credentials into the providers.
@alexsomesan (Member) commented on this paragraph:

I would prefer that we leave out the first phrase. I feel it's presenting some facts inaccurately.

My understanding is that all providers are configured at the same time, BEFORE any of the resources have been created. Because the configured attribute values are only consumed by the Kubernetes / Helm provider CRUD logic at a later stage in the apply, we only see intermittent issues that we generally label as "progressive apply". This sequencing is not predictable and is influenced by the resource dependencies in the graph in very unintuitive ways. I think the first phrase here gives users the false impression that this aspect is somehow predictable. I worry that it is not.

@dak1n1 (Contributor, author) replied:

Maybe there's a way I can represent this more accurately. Let's discuss the nuances of it, and maybe I can learn from your perspective. There is probably some truth to both of our understandings of this scenario, since my understanding is based on first-hand experience and observation. The scenario I'm describing in this document is thoroughly tested. It is repeatable and predictable. I'm less certain about the "why" behind it though.

I think this is because, when you initialize the Kubernetes provider using a data source, like google_container_cluster, the reading of the data source can be deferred. In deferring the data source read, you get the predictable deferred provider initialization. This is what allows our initial builds of GKE/EKS/AKS clusters to succeed in our test-infra and examples. It's the predictable single-apply scenario.
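
For illustration, here is a minimal sketch of the data-source pattern I'm describing (not the exact config from our examples), using the standard `google_client_config` / `google_container_cluster` data sources; the cluster name and location are placeholders:

    # The provider takes its credentials from data sources, so those reads have
    # to resolve before the Kubernetes provider can talk to the API.
    data "google_client_config" "default" {}

    data "google_container_cluster" "default" {
      name     = "example-cluster"   # placeholder
      location = "us-west1"          # placeholder
    }

    provider "kubernetes" {
      host  = "https://${data.google_container_cluster.default.endpoint}"
      token = data.google_client_config.default.access_token
      cluster_ca_certificate = base64decode(
        data.google_container_cluster.default.master_auth[0].cluster_ca_certificate
      )
    }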

However, you can run into problems later, such as during cluster replacement, when the Kubernetes provider initialization is no longer delayed by the initial read of the data source. The Kubernetes provider is initialized before the data source has read the new GKE credentials, since the credentials were replaced during the apply phase. In this scenario, the data source was never re-read to give the Kubernetes provider valid credentials, and so the apply fails. In those cases, as far as I'm aware, the only way forward is to remove the Kubernetes provider resources from state, as described in this document. (Related: I opened a bug upstream to see if they can help us with this scenario).

My goal with this README is to convey the information accurately to users, to help them succeed in establishing the dependency between their GKE cluster and their Kubernetes resources.

How about this instead... I edited it to say the providers will not attempt to read from the Kubernetes API instead of saying they won't be initialized. Perhaps that information is more accurate and more relevant to the users.

Suggested change
When the cluster is initially created, the Kubernetes and Helm providers will not be initialized until authentication details are created for the cluster. However, for future operations that may involve replacing the underlying cluster (for example, changing the network where the EKS cluster resides), the EKS cluster will have to be targeted without the Kubernetes/Helm providers, as shown below. This is done by removing the `module.kubernetes-config` from Terraform State prior to replacing cluster credentials, to avoid passing outdated credentials into the providers.
When the cluster is initially created, the Kubernetes and Helm providers will not attempt to read from the Kubernetes API until authentication details are created for the cluster. However, for future operations that may involve replacing the underlying cluster (for example, changing the network where the EKS cluster resides), the EKS cluster will have to be targeted without the Kubernetes/Helm resources in state, as shown below. This is done by removing the `module.kubernetes-config` from Terraform State prior to replacing cluster credentials. Alternatively, keeping the EKS infrastructure and Kubernetes infrastructure in different state files will avoid this scenario.
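
For concreteness, the workflow that suggestion refers to would look roughly like this. `module.kubernetes-config` is the module name used in the example; the cluster module address in the targeted apply is a placeholder that will differ per configuration:

    $ terraform state rm module.kubernetes-config   # forget the Kubernetes/Helm resources in state (does not destroy them)
    $ terraform apply -target=module.cluster        # replace the EKS cluster; "module.cluster" is a placeholder address
    $ terraform apply                                # re-create the Kubernetes/Helm resources with fresh credentials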

@alexsomesan (Member) replied:

Stef, thanks for breaking things down in such detail.

I appreciate the overall solution that you are proposing here and I think removing the state of the Kubernetes resources deliberately via the state rm command before replacing the cluster is a clever and very valid solution. I think we should promote that and encourage users to adopt it. It follows good practice patterns, by making actions explicit rather than implicit and limiting the blast radius of each action. I like it.

What I'm not comfortable with is the messaging in the first phrase only. It can be interpreted by users as stating for a fact that Terraform offers any kind of sequencing guarantees that "the Kubernetes and Helm providers will not attempt to read from the Kubernetes API until authentication details are created for the cluster".

This causality, I'm afraid, is not sustained by the way Terraform is currently designed. After multiple conversations with Terraform engineers, the conclusion has always been that this is not by-design behaviour. The fact that it works (as clearly you have experienced first-hand) is a side-effect and side-effects are dangerous to promote as reliable fact.

In fact, Terraform documentation clearly advises against relying on this assumption here: https://www.terraform.io/docs/language/providers/configuration.html#provider-configuration-1. About midway in that section, it reads:
"You can use expressions in the values of these configuration arguments, but can only reference values that are known before the configuration is applied. This means you can safely reference input variables, but not attributes exported by resources (with an exception for resource arguments that are specified directly in the configuration)."

I think it's in our users' best interest (but also ours) for us to send a clear and cohesive message about how to properly use Terraform.

@dak1n1 (Contributor, author) replied:

Ok, I'll give this more thought. Maybe for now, we can just remove this README from the test-infra dir. It was just a copy/paste of our example README anyway. We can continue discussing and tuning this file over in the PR where I'm making changes to this README in a more user-facing way. This discussion is also something we can think about for our upcoming meeting regarding the Kubernetes provider authentication issues. (Phil scheduled it for us yesterday -- we'll meet in about 2 weeks.)

@dak1n1 (Contributor, author) added:

I removed the paragraph in question. It's ok to just wait until the meeting to discuss it further, since we might not do much work on the other PR until we have that talk.

@alexsomesan (Member) replied:

I think that's a great way forward! I'm happy to do a deep dive in this topic online and exchange ideas.

@dak1n1 merged commit 078e9db into hashicorp:master on Mar 12, 2021
@ghost commented Apr 12, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost locked as resolved and limited conversation to collaborators on Apr 12, 2021