
Rate limit handling #137

Open
irizzant opened this issue Jan 11, 2025 · 10 comments
Comments


irizzant commented Jan 11, 2025

Hello
I'm struggling to find out whether, and how, the provider supports GitHub rate limit handling.

We have many GitHub resources managed by this provider, and we are hitting GitHub rate limits.

I think the reason is that we have some managed resources which cannot be created and sit there consuming reconciliation loops.

I tried setting write_delay_ms and read_delay_ms, but that doesn't seem to fix the issue.
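For anyone reading later: in upjet-based providers the Terraform provider block is usually populated from the ProviderConfig credentials JSON, so (assuming this provider follows that pattern; the token and owner values below are placeholders) the delays would sit alongside the token:

```json
{
  "token": "<github-token>",
  "owner": "<github-org>",
  "read_delay_ms": 60000,
  "write_delay_ms": 1000
}
```

If the provider doesn't read these keys from the credentials JSON, this sketch doesn't apply as-is.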

I also read about the --max-reconcile-rate and --poll settings, but I didn't know whether the provider supported them until I dug into the code:

pollInterval = app.Flag("poll", "Poll interval controls how often an individual resource should be checked for drift.").Default("10m").Duration()
pollStateMetricInterval = app.Flag("poll-state-metric", "State metric recording interval").Default("5s").Duration()
leaderElection = app.Flag("leader-election", "Use leader election for the controller manager.").Short('l').Default("false").OverrideDefaultFromEnvar("LEADER_ELECTION").Bool()
maxReconcileRate = app.Flag("max-reconcile-rate", "The global maximum rate per second at which resources may checked for drift from the desired state.").Default("100").Int()
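For what it's worth, these flags are arguments of the provider binary, so (if the provider is installed as a standard Crossplane package) they can be tuned through a DeploymentRuntimeConfig; the names and values below are only an illustration:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-github
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - --poll=30m              # check each resource for drift less often
                - --max-reconcile-rate=10 # cap concurrent/started reconciles per second
```

The Provider object then references this via spec.runtimeConfigRef.name.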

Anyway, I'm not sure that would help. Do the default values mean that there could be 100 concurrent API calls to GitHub every 10 minutes?
Any suggestions?
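To make my question concrete, here's a back-of-envelope sketch (my own arithmetic and reading of the flags, not provider code) of what the two defaults would imply for the GitHub API request rate:

```python
# Back-of-envelope sketch; my own reading of the flags, not provider code.
# --poll: how often EACH resource is re-checked for drift.
# --max-reconcile-rate: global cap on how many checks may start per second.

POLL_SECONDS = 10 * 60        # --poll=10m (default)
MAX_RECONCILE_RATE = 100      # --max-reconcile-rate=100 (default)
GITHUB_LIMIT_PER_HOUR = 5000  # authenticated REST rate limit

def drift_requests_per_hour(num_resources: int, calls_per_check: int = 1) -> float:
    """Steady-state GitHub API calls per hour from drift checks alone,
    assuming each reconcile makes `calls_per_check` API calls."""
    checks_per_second = min(num_resources / POLL_SECONDS, MAX_RECONCILE_RATE)
    return checks_per_second * calls_per_check * 3600

print(drift_requests_per_hour(500))   # 3000.0 -> under the 5000/h limit
print(drift_requests_per_hour(1000))  # 6000.0 -> already over it
```

If that reading is right, with enough resources the drift checks alone can exceed the 5,000 requests/hour authenticated REST limit long before max-reconcile-rate becomes the bottleneck.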


lacroi-m-insta commented Jan 13, 2025

Hey @irizzant,
Have you tried experimenting with read_delay_ms?

The write_delay_ms calls seem to be authenticated calls that don't hit the same rate limits as the read ones.

See #81

I am still experiencing some issues with these values too; I will share any findings I have.

@irizzant

Hi @lacroi-m-insta,
I tried increasing read_delay_ms to 500 and this didn't fix the issue.
My suspicion is that I'm hitting the rate limit because of the number of drift reconciliations rather than the number of Create or Update operations.

For this reason I was leaning more towards the poll flag, because the docs state:

Poll interval controls how often an individual resource should be checked for drift

But maybe I'm heading in the wrong direction; that's why I was asking for help.

I'm going to increase read_delay_ms further; if you have a suggested value, that would be helpful.


lacroi-m-insta commented Jan 13, 2025

read_delay_ms to 500

Note that this is in milliseconds, so 500 means a GET request every ~0.5 seconds. I would suggest a minimum of 60000.

These requests are what check for a diff between the remote (GitHub) and your Crossplane resources.
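To spell out what 60000 means in practice, here is some rough arithmetic under my own assumption that successive GET requests are simply spaced read_delay_ms apart:

```python
# Rough sketch; assumes consecutive GET requests are serialized and
# spaced read_delay_ms apart, which also bounds drift-check throughput.

def reads_per_hour(read_delay_ms: int) -> float:
    """Maximum GitHub GET requests per hour with a fixed delay between reads."""
    return 3_600_000 / read_delay_ms

def full_sweep_minutes(num_resources: int, read_delay_ms: int) -> float:
    """Minimum time to drift-check every resource once, one GET each."""
    return num_resources * read_delay_ms / 60_000

print(reads_per_hour(60_000))           # 60.0 -> far below the 5000/h limit
print(full_sweep_minutes(300, 60_000))  # 300.0 -> 5 hours for 300 resources
```

So 60000 is very safe with respect to rate limits, but with many resources a single drift sweep becomes slow.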

@lacroi-m-insta

Also, I have only tested this for webhooks; I'm not sure whether it applies to other resources.


irizzant commented Jan 13, 2025

OK, so read_delay_ms = 60000 means the provider is going to wait 1 minute before reconciling the drift of each resource, did I understand correctly?

So the poll flag controls how often to check for drift in general, while read_delay_ms controls how quickly it reconciles individual resources, is that correct?

I'm going to set read_delay_ms to 60000 as you suggested and see if anything changes.

@irizzant

I also created this Slack thread to ask for help.

@irizzant

Does the provider Pod need to be restarted after applying the read_delay_ms change?

@irizzant

OK, as far as I can see, after applying the new read_delay_ms value the provider pod logs are much slower than before.

It also looks like the workqueue is growing after the change, which I suppose is expected:

[screenshot: workqueue depth increasing]

@irizzant

I tried to create a Repository resource after updating read_delay_ms to 60000, but it now takes much longer to create resources like that. After 15 minutes of waiting, the Repository was still not created.

I'm curious how long it takes for your resources to get synced with this setting.

I don't know if this is just a matter of tuning read_delay_ms or if it's better to focus on the poll parameter. Any suggestions?


irizzant commented Jan 16, 2025

@lacroi-m-insta it looks like the problem was an infinite reconciliation loop of TeamMembership resources, caused by a missing LateInitialize management policy and crossplane/crossplane#5918.

Without LateInitialize, Crossplane sees the resource as needing reconciliation and queues it again and again in an infinite loop; see this Slack thread.
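For reference, management policies are set on each managed resource; a hypothetical manifest (the apiVersion and forProvider fields below are illustrative placeholders, not taken from the provider docs) keeping LateInitialize in the set would look like:

```yaml
# Hypothetical example; apiVersion and forProvider fields are placeholders.
apiVersion: team.github.upbound.io/v1alpha1
kind: TeamMembership
metadata:
  name: example-membership
spec:
  # Dropping LateInitialize from this list is what triggered the loop.
  managementPolicies: ["Observe", "Create", "Update", "Delete", "LateInitialize"]
  forProvider: {}  # resource-specific fields (team, username, ...) omitted
```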

After changing the TeamMembership resources and adding LateInitialize back, I went from this:

[screenshot: operations/sec before the fix]

to this:

[screenshot: operations/sec after the fix]

I now have a regular operations/sec rate and no longer an ever-growing workqueue with thousands of TeamMembership resources.

Still, I'm wondering how well all this scales. I think that as there are more and more resources to sync, sooner or later you'll face rate limit issues.
