-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace EKS test-infra with example #1192
Conversation
This copies the EKS cluster from our examples into our test-infra, so we can use it for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Stef!
I had a look and dropped some comments here in there. I hope I haven't missed anything.
I'll give this a try myself as soon as I find some spare cycles.
write_kubeconfig = false | ||
|
||
workers_group_defaults = { | ||
root_volume_type = "gp2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GP2 volumes are generally more expensive.
Is there a technical need for nodes to run on GP2 volumes? We're likely not putting much workload on the nodes during our tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gp2 was the default in the EKS module, but they recently updated it to gp3. I put it back to the old default because gp3 isn't available in all regions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the commit where it was added. terraform-aws-modules/terraform-aws-eks@76537d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, if that's the case then disregard my comment. I must have based it on old information and likely GPx volumes are now the only choice. Sorry for the confusion :)
kubernetes/test-infra/eks/README.md
Outdated
|
||
## Replacing the EKS cluster and re-creating the Kubernetes / Helm resources | ||
|
||
When the cluster is initially created, the Kubernetes and Helm providers will not be initialized until authentication details are created for the cluster. However, for future operations that may involve replacing the underlying cluster (for example, changing the network where the EKS cluster resides), the EKS cluster will have to be targeted without the Kubernetes/Helm providers, as shown below. This is done by removing the `module.kubernetes-config` from Terraform State prior to replacing cluster credentials, to avoid passing outdated credentials into the providers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer that we leave out the first phrase. I feel it's presenting some facts inaccurately.
My understanding is that all providers are configured at the same time, BEFORE any of the resources have been created. It is because the configured attribute values are only then consumed by the Kubernetes / Helm provider CRUD logic at a later stage in the apply, that we only see intermittent issues that we generally label as "progressive apply". This sequencing is not predictable and is influenced by the resource dependencies in the graph in very unintuitive ways. I think the first phrase here gives users the false impression that this aspect is somehow predictable. I worry that it is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there's a way I can represent this more accurately. Let's discuss the nuances of it, and maybe I can learn from your perspective. There is probably some truth to both of our understandings of this scenario, since my understanding is based on first-hand experience and observation. The scenario I'm describing in this document is thoroughly tested. It is repeatable and predictable. I'm less certain about the "why" behind it though.
I think this is because, when you initialize the Kubernetes provider using a data source, like google_container_cluster
, the reading of the data source can be deferred. In deferring the data source read, you get the predictable deferred provider initialization. This is what allows our initial builds of GKE/EKS/AKS clusters to succeed in our test-infra and examples. It's the predictable single-apply scenario.
However, you can run into problems later, such as during cluster replacement, when the Kubernetes provider initialization is no longer delayed by the initial read of the data source. The Kubernetes provider is initialized before the data source has read the new GKE credentials, since the credentials were replaced during the apply phase. In this scenario, the data source was never re-read to give the Kubernetes provider valid credentials, and so the apply fails. In those cases, as far as I'm aware, the only way forward is to remove the Kubernetes provider resources from state, as described in this document. (Related: I opened a bug upstream to see if they can help us with this scenario).
My goal with this README to be able to convey the information accurately to users, to help them to succeed in establishing the dependency between their GKE cluster and their Kubernetes resources.
How about this instead... I edited it to say the providers will not attempt to read from the Kubernetes API
instead of saying they won't be initialized. Perhaps that information is more accurate and more relevant to the users.
When the cluster is initially created, the Kubernetes and Helm providers will not be initialized until authentication details are created for the cluster. However, for future operations that may involve replacing the underlying cluster (for example, changing the network where the EKS cluster resides), the EKS cluster will have to be targeted without the Kubernetes/Helm providers, as shown below. This is done by removing the `module.kubernetes-config` from Terraform State prior to replacing cluster credentials, to avoid passing outdated credentials into the providers. | |
When the cluster is initially created, the Kubernetes and Helm providers will not attempt to read from the Kubernetes API until authentication details are created for the cluster. However, for future operations that may involve replacing the underlying cluster (for example, changing the network where the EKS cluster resides), the EKS cluster will have to be targeted without the Kubernetes/Helm resources in state, as shown below. This is done by removing the `module.kubernetes-config` from Terraform State prior to replacing cluster credentials. Alternatively, keeping the EKS infrastructure and Kubernetes infrastructure in different state files will avoid this scenario. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stef, thanks for breaking things down in such detail.
I appreciate the overall solution that you are proposing here and I think removing the state of the Kubernetes resources deliberately via the state rm
command before replacing the cluster is a clever and very valid solution. I think we should promote that and encourage users to adopt it. It follows good practice patterns, by making actions explicit rather than implicit and limiting blast radius of each action. I like it.
What I'm not comfortable with is the messaging in the first phrase only. It can be interpreted by users as stating for a fact that Terraform offers any kind of sequencing guarantees that "the Kubernetes and Helm providers will not attempt to read from the Kubernetes API until authentication details are created for the cluster".
This causality, I'm afraid, is not sustained by the way Terraform is currently designed. After multiple conversations with Terraform engineers, the conclusion has always been that this is not by-design behaviour. The fact that it works (as clearly you have experienced first-hand) is a side-effect and side-effects are dangerous to promote as reliable fact.
In fact, Terraform documentation clearly advises against relying on this assumption here: https://www.terraform.io/docs/language/providers/configuration.html#provider-configuration-1. About midway in that section, it reads:
"You can use expressions in the values of these configuration arguments, but can only reference values that are known before the configuration is applied. This means you can safely reference input variables, but not attributes exported by resources (with an exception for resource arguments that are specified directly in the configuration)."
I think it's in our users' best interest (but also ours) for us to send a clear and cohesive message about how to properly use Terraform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll give this more thought. Maybe for now, we can just remove this README from the test-infra dir. It was just a copy/paste of our example README anyway. And we can continue the discussion and tuning this file over in the PR where I'm making changes to this README in a more user-facing way. This discussion is also something we can think about for our upcoming meeting regarding the Kubernetes provider authentication issues. (Phil scheduled it for us yesterday -- we'll meet in about 2 weeks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the paragraph in question. It's ok to just wait until the meeting to discuss it further, since we might not do much work on the other PR until we have that talk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a great way forward! I'm happy to do a deep dive in this topic online and exchange ideas.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks! |
This copies the EKS cluster from our examples into our test-infra, so we can use it for tests.
Description
Acceptance tests
Output from acceptance testing:
Release Note
Release note for CHANGELOG:
References
Community Note