Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Simplifying ALB and IAM Role Configuration in modules #392

Open
henrykie opened this issue Nov 18, 2024 · 7 comments
Open

RFC: Simplifying ALB and IAM Role Configuration in modules #392

henrykie opened this issue Nov 18, 2024 · 7 comments
Labels
perforce RFC terraform Pull requests that update Terraform code

Comments

@henrykie
Copy link
Contributor

henrykie commented Nov 18, 2024

Is this related to an existing feature request or issue?

No

What part of the Cloud Game Development Toolkit does this RFC relate to?

Modules

Summary

We currently expose customization of the application load balancer that fronts the Helix Authentication Service, the Helix Swarm service, and the Unreal Horde service through module variables. This results in a significant number of variables, which requires extensive conditional creation of resources inside the module. I propose that we adopt a more opinionated and less customizable approach to ALB creation inside of the module, and support target group and service detail module outputs so extensive networking customization can be accomplished outside of the module.

The same goes for IAM roles. We support passing in an externally created role. It may make more sense to control role creation for the service inside the module, and attach permissions / policies externally by exposing the role ARN as an output.

Use case

Proposal

Currently, I control the ALB and role creation for the Helix Authentication Service through variables:

module "perforce_helix_authentication_service" {
  source                                   = "cloud-game-development-toolkit/modules/perforce/helix-authentication-service"
  ...
  # subnet specification
  helix_authentication_service_alb_subnets = aws_subnet.public_subnets[*].id
  helix_authentication_service_subnets     = aws_subnet.private_subnets[*].id
  certificate_arn                          = aws_acm_certificate.helix.arn
  ...
  # access log specification - these are wordy and confusing
  enable_helix_authentication_service_alb_access_logs = true
  helix_authentication_service_alb_access_logs_prefix = "has-access-logs"

  # Deletion protection
  enable_helix_authentication_service_alb_deletion_protection = false

  # security group attachment
  existing_security_groups = ["sg-0123456789"]
  
  # Role attachment - These also feel cumbersome
  custom_helix_authentication_service_role = "IAM ROLE ARN"
  create_helix_authentication_service_default_policy = true

  depends_on = [aws_ecs_cluster.perforce_cluster, aws_acm_certificate_validation.helix]
}

It would be simpler if these variables were not exposed. We should enable creation (turn off/on), but be opinionated on the following topics:

  1. For load balancing, we should default for access logs, deletion protection. We can enable dynamic creation and setting of these with one additional flag: debug.
  2. For IAM policies and roles we should use least privilege permissions and expose the IAM role through outputs to allow end users to extend permissions as needed.

Out of scope

The module should be responsible for creating NECESSARY resources for operating the service:

  1. ECS cluster we can turn off or on. If it is turned off, you must provide an external cluster name. We need this cluster for service definition.
  2. The service, container definitions, and supporting resources (elasticache, documentdb, efs, IAM roles, etc.) are the core of the module.
  3. A target group for the service is generally a requirement. A target group does not need any information beyond service details. It should be created by default.
  4. An ALB or NLB is a networking decision for the end user. Again, we can decide to provide simple ALB creation and provisioning through the module, but this should be a binary switch.

Potential challenges

There is nuance around how we manage security groups, both within the module and externally managed. This is something that needs further design, and probably would have holistic changes to the way we build security groups for modules.

Dependencies and Integrations

I am currently working on a sample for running Helix Core behind an NLB. This same NLB can also manage traffic for Helix Swarm and Helix Authentication Service. Currently, both Helix Swarm and Helix Authentication service create ALBs, but this is inefficient. In order to share an ALB we should be able to disable ALB creation inside of each of these modules, but pull out target groups from each for creating listeners / rules on the shared ALB and the shared NLB.

Alternative solutions

Support the injection of an externally created ALB. Then manage listener / target group attachment inside the module.
@henrykie henrykie added triage to be triaged by project maintainers RFC perforce terraform Pull requests that update Terraform code labels Nov 18, 2024
@henrykie
Copy link
Contributor Author

@jorisdon @gabebatista - thoughts?

@jorisdon jorisdon changed the title RFC: TITLE RFC: Simplifying ALB and IAM Role Configuration in Cloud Game Development Toolkit Modules Nov 20, 2024
@jorisdon jorisdon changed the title RFC: Simplifying ALB and IAM Role Configuration in Cloud Game Development Toolkit Modules RFC: Simplifying ALB and IAM Role Configuration in modules Nov 20, 2024
@jorisdon
Copy link
Contributor

I see value in simplifying the codebase, and similarly in being opinionated about specific things, but those should not lead to a loss in flexibility. Some users have no infrastructure set up yet and would just want to deploy a module that provides them with the module's functionality. Others already have an opinionated infrastructure set up and they want to add a module's functionality to it. When we're simplifying or being opinionated about things, we've got to ask ourselves: "are there any users who would object to our decisions here? if so, do they have an escape-hatch other than a fork or a pull request?" Obviously there's a trade-off there still, and I don't expect that we can support every use case easily.

On IAM: generally I'd say that having modules provision IAM roles with appropriate policies makes a lot of sense. Those roles can then be exposed as an output for customization purposes and I expect that that'd suffice for most use cases. I suppose there may be a use case where users would want to manage IAM roles outside of Terraform completely (presumably using a different tool) which would necessitate having support for passing in a role, but I wouldn't really know what the motivation behind that would be.

@jorisdon
Copy link
Contributor

jorisdon commented Nov 20, 2024

Here's an idea that might allow for a good level of customization, whilst also keeping the code readable and free from excessive configuration parameters / boolean logic: module layers with dependency injection. I am not entirely sure if this is supported well in Terraform nor if it's a common practice, but here's the idea;

Rather than providing a single module as we do now, we provide a set of modules that are layered on top of each other in a chain, with the lowest-level module providing the most core functionality only, and the highest-level module providing a more opinionated stack of services that are required, but some users may want to customize and manage elsewhere (either within their own Terraform module, or maybe even outside of Terraform entirely).

Here's an example of a bunch of layers:

  1. Helix Authentication Service Core workload; gets all of the dependencies (IAM role, ECS cluster, ALB) and supporting services passed in through variables, then provisions new resources on top of these to implement the workload. For example, this module would take care of provisioning an appropriate IAM policy, and then attaches it to the provided IAM role. It would create the ECS service, target group, etc for the provided cluster and load balancer. Conditional resource creation is not done within this module at all (unless absolutely necessary and there's no way around it). Goal of this module is to provision the resources that are always needed, in 100% of the use cases, and which need no configuration parameters.
  2. The second layer module provisions those resources that we'd recommend in 99% of use cases. In this case that'd be the IAM role; perhaps also security groups. Second layer is really what we'd recommend for users that want to manage their own networking and clusters, but it provides a guard against extreme edge cases (those could then use the first layer instead).
  3. The third layer module provisions supporting resources, like ElastiCache or EFS. Expectation is that this would suffice for 95% or so of use cases and lead to a nicely isolated best-practices environment with our recommended services, but perhaps some users may not want to use this module (and use the layer 2 one instead) if they want to use different services or share resources with other workloads.
  4. The fourth layer starts getting far more opinionated; it (by default?) provisions ECS, ALB, and similar resources.

And here's some of the rules / tenets governing layers:

  • We recommend that users adopt the highest level layer.
  • The lower level a layer is, the more strict we are against boolean logic / configuration parameters.
  • The higher level a layer is, the more opinionated it can be.
  • All layers expose output variables upwards.
  • All layers depend on / use the layer beneath them.
  • Changes in lower-level layers must never break higher-level layers.
  • Configuration parameters from lower-level layers should be supported in higher-level layers unless it makes no sense to do so (because the higher-level layer takes an opinion or provides the resource the lower-level layer depends on).
  • Conditional creation of resources is limited as much as possible.
  • We prioritize the design and support of the highest-level layer and second-level layer within a use case, and should aim for the fewest amount of layers. Expectation is that users of the other layers are power-users that know how to read the source code and understand module structure. Intermediary layers may not be documented.

I acknowledge that this wouldn't necessarily lead to less code nor duplication of code (we'd have to repeat the passing down of variables and up of outputs in each layer), but it would reduce conditional logic (where I think the complexity lies) significantly and give us an ability to be more opinionated (and change those opinions over time based on feedback) without sacrificing flexibility or causing breaking changes.

@gabebatista
Copy link
Contributor

While I like Joris' idea in concept, I worry it would add a lot of extra work (and potentially complexity) to the project. I think that having opinionated defaults with the ability to pass in your own resources would be ideal. This approach would keep things simple for non-tech users, while providing a simple way to manage changes either through creating resources in the console prior to deployment or defining resources in a higher level terraform template (like our samples). It might also be worth exploring adding additional samples to the modules that cover common use cases. This would provide more options for non-tech users while also giving additional examples for users who wish to customize the code themselves.

@gabebatista gabebatista removed the triage to be triaged by project maintainers label Nov 27, 2024
@henrykie
Copy link
Contributor Author

henrykie commented Dec 3, 2024

This is great feedback folks - thanks.

On the question of nested / layered modules

While I like this theory in practice, I think we can accomplish much of the same flexibility through variables with a flatter module structure and better variable definition. Hashicorp is pretty explicit on their guidance for nesting modules, and from my research it seems to complicate the update process for resources. Plus, we aren't expected to distinguish between a "layer 1" versus a "layer 2" versus a "layer 3" resource. You are correct that much of the complexity lies in the conditional logic - this is why I am proposing more extensive use of locals for defining logical paths.

Locals for conditional creation

Currently, when resources need to be conditionally created we use count - if we were to use nested modules we could add some sort of create variable, but I find that unnecessary. Instead of writing complex conditionals for each resource as part of the count argument, we should define high level boolean expressions via locals that can be used throughout the module. These locals can combine multiple variables together to specify logical paths for resource creation.

Simplification of overrides

I am in favor of EXTENSIVE customizations available for modules through variables. But we need to do a better job defining where the "break glass" points actually are. IAM is the best example of this. The variables custom_helix_core_role and create_default_helix_core_role are redundant. If a custom role is passed in, don't create a default role. Assume permissions are set appropriately on the role passed in. And, I feel very confident stating, a module should not modify resources passed into it as an override. That means no attaching policies to the role. Rather than passing things into modules we should expose module attributes via outputs and support extensive modification of those resources at the higher level - i.e., samples. If you want to pass an existing filesystem to the Jenkins deployment I am not convinced it is the responsibility of the Jenkins module to grant permissions to that filesystem on the IAM role created in the module. Rather, the Jenkins IAM role should be exposed as an output, and policy attachment can be specified at the root level (sample). We provide guidance on how to do this with - surprise - an example.

@kylesomers
Copy link
Member

kylesomers commented Dec 4, 2024

Conceptually I'm aligned with this idea. @henrykie Implement this into Helix Core module as PR so there's a representative example of some of these ideas in practice and associate with this issue.

@gabebatista
Copy link
Contributor

gabebatista commented Jan 7, 2025

@henrykie @jorisdon I would love your opinions on the Definition of Done (DoD) for this issue. It goes without saying that each module being modified would have its own DoD. I wrote this with the (Helix Authentication Service)[https://github.com/aws-games/cloud-game-development-toolkit/tree/d60f4984c84299628783f97dc54cfd16dd3f3283/modules/perforce/helix-authentication-service] module in mind, so slight changes would be made for each of the individual modules being modified.

Definition of Done

General Work

  • Define local variables that encompass the complex conditional logic used in count arguments for conditional creation of resources
  • Add a single "debug" flag to control access logs and deletion protection

ALB Configuration Simplification

  • Implement opinionated defaults for ALB creation
  • Replace conditional creation statements with local variables (see General Work section)
  • Expose target group and ALB details via Output Variables

IAM Role Management

  • Remove conditional creation of IAM roles
  • Implement least privilege permissions by default
  • Expose IAM role ARN as module output

Documentation & Examples

  • Update module documentation to reflect new configuration approach
  • Add example for sharing ALB across services
  • Add example for external IAM policy attachment
  • Add example for custom networking configuration
  • Update existing samples to reflect new patters

Testing & Verification

  • Verify that samples using the modified module still work
  • Verify that IAM Role ARN is being exposed via Output variable
  • Verify that Target Group and ALB details are exposed via Output variable
  • Verify that 'tf destroy' completes without issue when debug flag is enabled

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perforce RFC terraform Pull requests that update Terraform code
Projects
None yet
Development

No branches or pull requests

4 participants