
Rate limit handling #137

Open
irizzant opened this issue Jan 11, 2025 · 10 comments
Comments


irizzant commented Jan 11, 2025

Hello
I'm struggling to find out whether, and how, the provider supports GitHub rate limit handling.

We have many GitHub resources managed by this provider, and we are hitting GitHub rate limits.

I think the reason is that we have some managed resources which cannot be created and sit there consuming reconciliation loops.

I tried setting write_delay_ms and read_delay_ms, but that doesn't seem to fix the issue.
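For anyone reading later: in upjet-based providers the Terraform provider block is usually populated from the ProviderConfig credentials JSON, so (assuming this provider follows that pattern; the token and owner values below are placeholders) the delays would sit alongside the token:

```json
{
  "token": "<github-token>",
  "owner": "<github-org>",
  "read_delay_ms": 60000,
  "write_delay_ms": 1000
}
```

If the provider doesn't read these keys from the credentials JSON, this sketch doesn't apply as-is.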

I also read about the --max-reconcile-rate and --poll settings, but I didn't know whether the provider supported them until I dug into the code:

pollInterval = app.Flag("poll", "Poll interval controls how often an individual resource should be checked for drift.").Default("10m").Duration()
pollStateMetricInterval = app.Flag("poll-state-metric", "State metric recording interval").Default("5s").Duration()
leaderElection = app.Flag("leader-election", "Use leader election for the controller manager.").Short('l').Default("false").OverrideDefaultFromEnvar("LEADER_ELECTION").Bool()
maxReconcileRate = app.Flag("max-reconcile-rate", "The global maximum rate per second at which resources may checked for drift from the desired state.").Default("100").Int()
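For what it's worth, these flags are arguments of the provider binary, so (if the provider is installed as a standard Crossplane package) they can be tuned through a DeploymentRuntimeConfig; the names and values below are only an illustration:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-github
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - --poll=30m              # check each resource for drift less often
                - --max-reconcile-rate=10 # cap concurrent/started reconciles per second
```

The Provider object then references this via spec.runtimeConfigRef.name.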

Anyway, I'm not sure that would help. Do the default values mean that there could be 100 concurrent API calls to GitHub every 10 minutes?
Any suggestions?
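To make my question concrete, here's a back-of-envelope sketch (my own arithmetic and reading of the flags, not provider code) of what the two defaults would imply for the GitHub API request rate:

```python
# Back-of-envelope sketch; my own reading of the flags, not provider code.
# --poll: how often EACH resource is re-checked for drift.
# --max-reconcile-rate: global cap on how many checks may start per second.

POLL_SECONDS = 10 * 60        # --poll=10m (default)
MAX_RECONCILE_RATE = 100      # --max-reconcile-rate=100 (default)
GITHUB_LIMIT_PER_HOUR = 5000  # authenticated REST rate limit

def drift_requests_per_hour(num_resources: int, calls_per_check: int = 1) -> float:
    """Steady-state GitHub API calls per hour from drift checks alone,
    assuming each reconcile makes `calls_per_check` API calls."""
    checks_per_second = min(num_resources / POLL_SECONDS, MAX_RECONCILE_RATE)
    return checks_per_second * calls_per_check * 3600

print(drift_requests_per_hour(500))   # 3000.0 -> under the 5000/h limit
print(drift_requests_per_hour(1000))  # 6000.0 -> already over it
```

If that reading is right, with enough resources the drift checks alone can exceed the 5,000 requests/hour authenticated REST limit long before max-reconcile-rate becomes the bottleneck.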


lacroi-m-insta commented Jan 13, 2025

Hey @irizzant,
Have you tried experimenting with read_delay_ms?

The write_delay_ms calls seem to be authenticated calls that don't hit the same rate limits as the read ones.

See #81

I am still experiencing some issues with these values too; I will share any findings I have.

@irizzant

Hi @lacroi-m-insta,
I tried increasing read_delay_ms to 500 and this didn't fix the issue.
My suspicion is that I'm hitting the rate limit because of the number of drift reconciliations rather than the number of Create or Update operations.

For this reason I was leaning more towards the poll flag, because the docs state:

Poll interval controls how often an individual resource should be checked for drift

But maybe I'm heading in the wrong direction; that's why I was asking for help.

I'm going to increase read_delay_ms further; if you have a suggested value, that would be helpful.


lacroi-m-insta commented Jan 13, 2025

read_delay_ms to 500

Note that this is in milliseconds, so 500 means a GET request every ~0.5 seconds. I would suggest a minimum of 60000.

These requests are what check for a diff between the remote (GitHub) and your Crossplane resources.
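To spell out what 60000 means in practice, here is some rough arithmetic under my own assumption that successive GET requests are simply spaced read_delay_ms apart:

```python
# Rough sketch; assumes consecutive GET requests are serialized and
# spaced read_delay_ms apart, which also bounds drift-check throughput.

def reads_per_hour(read_delay_ms: int) -> float:
    """Maximum GitHub GET requests per hour with a fixed delay between reads."""
    return 3_600_000 / read_delay_ms

def full_sweep_minutes(num_resources: int, read_delay_ms: int) -> float:
    """Minimum time to drift-check every resource once, one GET each."""
    return num_resources * read_delay_ms / 60_000

print(reads_per_hour(60_000))           # 60.0 -> far below the 5000/h limit
print(full_sweep_minutes(300, 60_000))  # 300.0 -> 5 hours for 300 resources
```

So 60000 is very safe with respect to rate limits, but with many resources a single drift sweep becomes slow.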

@lacroi-m-insta

Also, I have only tested this for webhooks; I'm not sure whether it applies to other resources.


irizzant commented Jan 13, 2025

OK, so read_delay_ms = 60000 means the provider is going to wait 1 minute before reconciling the drift of each resource, did I understand correctly?

So the poll flag controls how often to check for drift in general, while read_delay_ms controls how quickly it reconciles individual resources, is that correct?

I'm going to set read_delay_ms to 60000 as you suggested and see if anything changes.

@irizzant

I also created this Slack thread to ask for help.

@irizzant

Does the provider Pod need to be restarted after applying the read_delay_ms change?

@irizzant

OK, as far as I can see, after applying the new read_delay_ms value the provider pod logs are much slower than before.

It also looks like the workqueue is growing after the change, which I suppose is expected:

[screenshot: workqueue depth increasing]

@irizzant

I tried to create a Repository resource after updating read_delay_ms to 60000, but it now takes much longer to create resources like that. After 15 minutes of waiting, the Repository was still not created.

I'm curious how long it takes for your resources to get synced with this setting.

I don't know if this is just a matter of tuning read_delay_ms or if it's better to focus on the poll parameter. Any suggestions?


irizzant commented Jan 16, 2025

@lacroi-m-insta it looks like the problem was an infinite reconciliation loop of TeamMembership resources, caused by a missing LateInitialize management policy and crossplane/crossplane#5918.

Without LateInitialize, Crossplane sees the resource as needing reconciliation and queues it again and again in an infinite loop; see this Slack thread.
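For reference, management policies are set on each managed resource; a hypothetical manifest (the apiVersion and forProvider fields below are illustrative placeholders, not taken from the provider docs) keeping LateInitialize in the set would look like:

```yaml
# Hypothetical example; apiVersion and forProvider fields are placeholders.
apiVersion: team.github.upbound.io/v1alpha1
kind: TeamMembership
metadata:
  name: example-membership
spec:
  # Dropping LateInitialize from this list is what triggered the loop.
  managementPolicies: ["Observe", "Create", "Update", "Delete", "LateInitialize"]
  forProvider: {}  # resource-specific fields (team, username, ...) omitted
```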

After changing the TeamMembership resources and adding LateInitialize back, I went from this:

[screenshot: operations/sec before the fix]

to this:

[screenshot: operations/sec after the fix]

I now have a regular operations/sec rate and no longer an ever-growing workqueue with thousands of TeamMembership resources.

Still, I'm wondering how well all this scales. I think that as there are more and more resources to sync, sooner or later you'll face rate limit issues.
