Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race-condition when IAM Role has been created previously #82

Merged
merged 2 commits into from
Jul 18, 2021

Conversation

diranged
Copy link
Contributor

@diranged diranged commented Jul 2, 2021

What does this PR do?

We discovered a race condition that occurs when an AWS IAM Role already exists and you create your Iamrole resource. The problem is that when the IAM Controller calls the CreateRole function, it tries to gracefully handle an IAM Role that already exists (see https://github.com/keikoproj/iam-manager/blob/v0.0.7/pkg/awsapi/iam.go#L103-L105) Unfortunately in that situation, it does not set the RoleARN or RoleID values (see https://github.com/keikoproj/iam-manager/blob/v0.0.7/pkg/awsapi/iam.go#L110-L113).

What did I change?

After digging through the code and discussing with colleagues (@sabw8217), we agreed that it feels "right" for the code to be idempotent and handle the idea that an IAM Role might already exist.. there are a number of race conditions in these eventually-consistent APIs that AWS provides. It makes more sense to break the CreateRole function out into a few different pieces so that each one is logically testable and clean.

I have changed the interface from the controller from pkg.awapi.iam.CreateRole to pkg.aws.iam.EnsureRole. That function now makes a call to a new function GetOrCreateRole, and that function in turn makes calls to GetRole and CreateRole. Because we now make the call to GetRole before the CreateCall, we have a response object that we can use to populate the RoleARN and RoleID.

Testing

I have tested this and been able to replicate both the normal-case (no race condition) and the race-condition case.

No Race Condition Logs

2021-07-02T00:36:15.359Z        DEBUG   awsapi.iam.EnsureRole   Verifying that IAM Role exists  {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.359Z        DEBUG   awsapi.iam.GetRole      Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.458Z        INFO    awsapi.iam.GetOrCreateRole      Role does not already exist     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.458Z        DEBUG   awsapi.iam.CreateRole   Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.607Z        DEBUG   awsapi.iam.CreateRole   Successfully created the role   {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.607Z        DEBUG   awsapi.iam.EnsureRole   Verifying Tags  {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.607Z        DEBUG   awsapi.iam.VerifyTags   Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.710Z        DEBUG   awsapi.iam.EnsureRole   Attaching Tag   {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.711Z        DEBUG   awsapi.iam.TagRole      Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.836Z        DEBUG   awsapi.iam.TagRole      Successfully completed TagRole call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.836Z        DEBUG   awsapi.iam.EnsureRole   Attaching Permission Boundary   {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.836Z        DEBUG   awsapi.iam.AddPermissionBoundary        Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.939Z        DEBUG   awsapi.iam.AddPermissionBoundary        Successfuly added permission boundary   {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.939Z        DEBUG   awsapi.iam.EnsureRole   Attaching Managed policies      {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:15.939Z        DEBUG   awsapi.iam.AttachManagedRolePolicy      Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app", "policyName": "arn:aws:iam::123456:policy/EKS-us1-stage2-IamManagerController-T5N293MP8WY-default-policy"}
2021-07-02T00:36:16.087Z        DEBUG   awsapi.iam.AttachManagedRolePolicy      Successfully attached managed policies  {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app", "policyName": "arn:aws:iam::123456:policy/EKS-us1-stage2-IamManagerController-T5N293MP8WY-default-policy"}
2021-07-02T00:36:16.087Z        DEBUG   awsapi.iam.EnsureRole   Attaching Inline role policies  {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:16.087Z        DEBUG   awsapi.iam.UpdateRole   Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:16.200Z        DEBUG   awsapi.iam.UpdateRole   Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app", "api": "UpdateAssumeRolePolicy"}
2021-07-02T00:36:16.334Z        DEBUG   awsapi.iam.UpdateRole   AssumeRole Policy is successfully updated       {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:16.335Z        DEBUG   awsapi.iam.AttachInlineRolePolicy       Initiating api call     {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:36:16.478Z        DEBUG   awsapi.iam.AttachInlineRolePolicy       Successfully completed attaching InlineRolePolicy       {"request_id": "2b70bed9-f271-4fdd-998e-8601dbf0dacc", "roleName": "k8s-us1-myns-testrole-app"}

Race Condition Logs

2021-07-02T00:38:38.493Z        DEBUG   awsapi.iam.EnsureRole   Verifying that IAM Role exists  {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.493Z        DEBUG   awsapi.iam.GetRole      Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.608Z        DEBUG   awsapi.iam.GetRole      Successfully able to get the role       {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.608Z        DEBUG   awsapi.iam.EnsureRole   Verifying Tags  {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.608Z        DEBUG   awsapi.iam.VerifyTags   Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.714Z        DEBUG   awsapi.iam.EnsureRole   Attaching Tag   {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.714Z        DEBUG   awsapi.iam.TagRole      Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.865Z        DEBUG   awsapi.iam.TagRole      Successfully completed TagRole call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.865Z        DEBUG   awsapi.iam.EnsureRole   Attaching Permission Boundary   {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.865Z        DEBUG   awsapi.iam.AddPermissionBoundary        Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.970Z        DEBUG   awsapi.iam.AddPermissionBoundary        Successfuly added permission boundary   {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.970Z        DEBUG   awsapi.iam.EnsureRole   Attaching Managed policies      {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:38.970Z        DEBUG   awsapi.iam.AttachManagedRolePolicy      Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app", "policyName": "arn:aws:iam::123456:policy/EKS-us1-stage2-IamManagerController-T5N293MP8WY-default-policy"}
2021-07-02T00:38:39.089Z        DEBUG   awsapi.iam.AttachManagedRolePolicy      Successfully attached managed policies  {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app", "policyName": "arn:aws:iam::123456:policy/EKS-us1-stage2-IamManagerController-T5N293MP8WY-default-policy"}
2021-07-02T00:38:39.089Z        DEBUG   awsapi.iam.EnsureRole   Attaching Inline role policies  {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:39.089Z        DEBUG   awsapi.iam.UpdateRole   Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:39.198Z        DEBUG   awsapi.iam.UpdateRole   Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app", "api": "UpdateAssumeRolePolicy"}
2021-07-02T00:38:39.324Z        DEBUG   awsapi.iam.UpdateRole   AssumeRole Policy is successfully updated       {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:39.324Z        DEBUG   awsapi.iam.AttachInlineRolePolicy       Initiating api call     {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:39.454Z        DEBUG   awsapi.iam.AttachInlineRolePolicy       Successfully completed attaching InlineRolePolicy       {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "roleName": "k8s-us1-myns-testrole-app"}
2021-07-02T00:38:39.454Z        INFO    internal.utils.utils.ParseIRSAAnnotation        Annotation found        {"request_id": "5ee20b83-9c2e-4071-8bd1-51c2b14fd938", "name": "goflask"}

@diranged
Copy link
Contributor Author

diranged commented Jul 6, 2021

@mnkg561,
I have this PR, and a followup (see the preview at Nextdoor#4) that I'd like to get your input on... we're running these changes now in production and the behavior is much smoother. Let me know if you have any questions? I want to submit Nextdoor#4 also, but I can't until this one is merged..

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 7, 2021

@diranged Thanks for the PR. I'm trying to understand the race condition here? What was your request flow looked like when you had the race condition? We do not make the reconciliation successful until IAM API sends us the response but if the use case is the request came back again for reconciliation and for some reason IAM responded that Role doesn't exist? we are running iam-manager for more than 2 years with 300+ clusters and have not observed that behavior.

@diranged
Copy link
Contributor Author

diranged commented Jul 7, 2021

@mnkg561,
We use ArgoCD to handle our various application deployments. ArgoCD has a feature where it will "self heal" - it will replace a resource or mutate a resource back to an expected state if a human interacts with it. We noticed that during the default status case handler (which is intended to be the "create" phase lets call it), there were conditions where the RoleARN was not available and our ServiceAccounts were being set to have the rolearn annotation set to an empty string "".

I'll comment in-line in the PR to give you some pointers to where we see the problems... but overall, we were able to replicate this by repeatedly deleting an Iamrole resource and letting ArgoCD recreate it. After ~5-10 tries, you get the failure. Once we were able to replicate the failure that way, we were then later able to replicate it by simply pre-creating the IAM Role in AWS, and then trying to create an Iamrole resource in K8S.

if err != nil {
log.Error(err, "error in creating a role")
state := iammanagerv1alpha1.Error

if strings.Contains(err.Error(), awsapi.RoleAlreadyExistsError) {
// This check verifies whether or not the IAM Role somehow already exists, but is allocated to another namespace based on the tag applied to it.
if strings.Contains(err.Error(), awsapi.RoleExistsAlreadyForOtherNamespace) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this exception string because it really wasn't checking if the role exists, but was checking if it exists and was tagged for another namespace.

//OK. Successful!!
// Is this IRSA role? If yes, Create/update Service Account with required annotation
flag, saName := utils.ParseIRSAAnnotation(ctx, iamRole)
if flag {
//CreateOrUpdateServiceAccount
roleARN := resp.RoleARN
if roleARN == "" {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case - roleARN has the chance to be "" if the Role already exists in AWS. Check out my comment at https://github.com/keikoproj/iam-manager/pull/82/files#diff-cca9ed36302b8c25c0285718b2a7c46e3d65997b8731430231b4675d68fe38b4L110-L112.

//CreateOrUpdateServiceAccount
roleARN := resp.RoleARN
if roleARN == "" {
roleARN = iamRole.Status.RoleARN
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This status field only exists if the role has been fully created previously... but during an initial "create" phase, this is also an empty string..


resp := &IAMRoleResponse{}

if !roleAlreadyExists {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if an exception for ErrCodeEntityAlreadyExistsException comes back, then this is set True. If it is set to True, the code never sets the resp.RoleARN or resp.RoleID fields in the IAMRoleResponse object, which is what triggers the ServiceAccount annotation to get set to an empty string. In the event that the role already exists, we want the code to be intelligent and get the ARN/ID for us.

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 7, 2021

Thanks for the explanation. Let me test that myself in couple of days (This week is off for us as refresh days so i'm traveling and doesn't have a setup to test this :) ).

I just want to make sure that we are not taking over any other IAM roles created outside of iam-manager and also multiple iam-managers in same account is not stepping on each other. I will update you soon.

Also, can you create the issues as well if possible with those scenarios in iam-manager repo itself so if anyone faces that issues in future will know what was our thought process before implementation and what scenarios we considered etc.,

@diranged
Copy link
Contributor Author

@mnkg561 #83 < done..

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 14, 2021

Thanks @diranged Sorry for the delay. Just got back. will test this today and will update with my comments.

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 15, 2021

I was able to observe this issue and that is because of the delay in setting IamRole resource status. I created an issue #84 to fix that as well.

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 15, 2021

Looks like CI is failing and needs to fix CI Job #81 before we can merge this request

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 15, 2021

Change looks good to me. Let's wait until CI job is fixed and check the unit test cases before we proceed. Thanks @diranged. I will get that CI job fixed ASAP.

@mnkg561
Copy link
Contributor

mnkg561 commented Jul 15, 2021

@diranged can you rebase with master please?

@diranged
Copy link
Contributor Author

@mnkg561 I just rebased it… but I am out of town for 2 weeks on vacation, so I don’t have a laptop with me (😲 ). If you need to do anything else to the code, feel free to sort of take it over and merge the change and then fix whatever you need. I am actually more interested in the second PR anyways which adds continual reconciliation of the ServiceAccount.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants