-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race-condition when IAM Role has been created previously #82
Conversation
Signed-off-by: Matt Wise <[email protected]>
@mnkg561, |
@diranged Thanks for the PR. I'm trying to understand the race condition here? What was your request flow looked like when you had the race condition? We do not make the reconciliation successful until IAM API sends us the response but if the use case is the request came back again for reconciliation and for some reason IAM responded that Role doesn't exist? we are running iam-manager for more than 2 years with 300+ clusters and have not observed that behavior. |
@mnkg561, I'll comment in-line in the PR to give you some pointers to where we see the problems... but overall, we were able to replicate this by repeatedly deleting an |
if err != nil { | ||
log.Error(err, "error in creating a role") | ||
state := iammanagerv1alpha1.Error | ||
|
||
if strings.Contains(err.Error(), awsapi.RoleAlreadyExistsError) { | ||
// This check verifies whether or not the IAM Role somehow already exists, but is allocated to another namespace based on the tag applied to it. | ||
if strings.Contains(err.Error(), awsapi.RoleExistsAlreadyForOtherNamespace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed this exception string because it really wasn't checking if the role exists, but was checking if it exists and was tagged for another namespace.
//OK. Successful!! | ||
// Is this IRSA role? If yes, Create/update Service Account with required annotation | ||
flag, saName := utils.ParseIRSAAnnotation(ctx, iamRole) | ||
if flag { | ||
//CreateOrUpdateServiceAccount | ||
roleARN := resp.RoleARN | ||
if roleARN == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case - roleARN
has the chance to be ""
if the Role already exists in AWS. Check out my comment at https://github.com/keikoproj/iam-manager/pull/82/files#diff-cca9ed36302b8c25c0285718b2a7c46e3d65997b8731430231b4675d68fe38b4L110-L112.
//CreateOrUpdateServiceAccount | ||
roleARN := resp.RoleARN | ||
if roleARN == "" { | ||
roleARN = iamRole.Status.RoleARN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This status field only exists if the role has been fully created previously... but during an initial "create" phase, this is also an empty string..
|
||
resp := &IAMRoleResponse{} | ||
|
||
if !roleAlreadyExists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if an exception for ErrCodeEntityAlreadyExistsException
comes back, then this is set True
. If it is set to True
, the code never sets the resp.RoleARN
or resp.RoleID
fields in the IAMRoleResponse
object, which is what triggers the ServiceAccount annotation to get set to an empty string. In the event that the role already exists, we want the code to be intelligent and get the ARN/ID for us.
Thanks for the explanation. Let me test that myself in couple of days (This week is off for us as refresh days so i'm traveling and doesn't have a setup to test this :) ). I just want to make sure that we are not taking over any other IAM roles created outside of iam-manager and also multiple iam-managers in same account is not stepping on each other. I will update you soon. Also, can you create the issues as well if possible with those scenarios in iam-manager repo itself so if anyone faces that issues in future will know what was our thought process before implementation and what scenarios we considered etc., |
Thanks @diranged Sorry for the delay. Just got back. will test this today and will update with my comments. |
I was able to observe this issue and that is because of the delay in setting IamRole resource status. I created an issue #84 to fix that as well. |
Looks like CI is failing and needs to fix CI Job #81 before we can merge this request |
Change looks good to me. Let's wait until CI job is fixed and check the unit test cases before we proceed. Thanks @diranged. I will get that CI job fixed ASAP. |
@diranged can you rebase with master please? |
@mnkg561 I just rebased it… but I am out of town for 2 weeks on vacation, so I don’t have a laptop with me (😲 ). If you need to do anything else to the code, feel free to sort of take it over and merge the change and then fix whatever you need. I am actually more interested in the second PR anyways which adds continual reconciliation of the ServiceAccount. |
What does this PR do?
We discovered a race condition that occurs when an AWS IAM Role already exists and you create your
Iamrole
resource. The problem is that when the IAM Controller calls theCreateRole
function, it tries to gracefully handle an IAM Role that already exists (see https://github.com/keikoproj/iam-manager/blob/v0.0.7/pkg/awsapi/iam.go#L103-L105) Unfortunately in that situation, it does not set theRoleARN
orRoleID
values (see https://github.com/keikoproj/iam-manager/blob/v0.0.7/pkg/awsapi/iam.go#L110-L113).What did I change?
After digging through the code and discussing with colleagues (@sabw8217), we agreed that it feels "right" for the code to be idempotent and handle the idea that an IAM Role might already exist.. there are a number of race conditions in these eventually-consistent APIs that AWS provides. It makes more sense to break the
CreateRole
function out into a few different pieces so that each one is logically testable and clean.I have changed the interface from the controller from
pkg.awapi.iam.CreateRole
topkg.aws.iam.EnsureRole
. That function now makes a call to a new functionGetOrCreateRole
, and that function in turn makes calls toGetRole
andCreateRole
. Because we now make the call toGetRole
before theCreateCall
, we have a response object that we can use to populate theRoleARN
andRoleID
.Testing
I have tested this and been able to replicate both the normal-case (no race condition) and the race-condition case.
No Race Condition Logs
Race Condition Logs