[Fix] Target Allocator Manager quits if the initial sync fails #241

okankoAMZ · 2024-10-15T17:24:32Z

Description:
If the CloudWatch Agent Collector pod starts before the Target Allocator pod, the agent fails to ping the Target Allocator. That failure ends the agent's Target Allocator manager thread, so Prometheus metrics are lost. Currently, the only workaround is to restart the pods when this happens.

To fix this, I removed the early return on failure from the TA Manager's Start, so the TA Manager keeps retrying instead of exiting.

This ensures no metrics are lost. It also adds no extra cost: if the scrape_config hash matches the previously saved one (savedHash), the sync function returns immediately. If the ping keeps failing, the computed hash stays 0, which equals the initial savedHash, so no extra computation is done.
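A minimal sketch of the savedHash short-circuit, assuming the sync step hashes the raw scrape_config payload. FNV-1a is used here only as a convenient standard-library hash; `syncer`, `fetch`, `hashConfig`, `applied`, and `runDemo` are hypothetical names. The second `err != nil` check (keep the last good config if the Target Allocator becomes unreachable later) is an assumption not stated in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

var errUnreachable = errors.New("target allocator not reachable")

// syncer is a hypothetical sketch of the savedHash short-circuit; field and
// function names are illustrative, not the agent's actual code.
type syncer struct {
	savedHash uint64                 // hash of the last applied scrape_config (0 = none yet)
	fetch     func() ([]byte, error) // hypothetical: fetches scrape_configs from the TA
	applied   int                    // number of times a new config was applied
}

// hashConfig hashes the raw scrape_config payload with 64-bit FNV-1a.
func hashConfig(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// sync leaves hash at 0 when the fetch fails. Because savedHash also starts at 0,
// repeated failures before the first success hit the early return immediately.
func (s *syncer) sync() error {
	var hash uint64
	cfg, err := s.fetch()
	if err == nil {
		hash = hashConfig(cfg)
	}
	if hash == s.savedHash {
		return err // unchanged config, or still unreachable: nothing to do
	}
	if err != nil {
		return err // assumption: keep the last good config if the TA becomes unreachable later
	}
	s.savedHash = hash
	s.applied++ // stand-in for rebuilding the Prometheus scrape targets
	return nil
}

// runDemo feeds two failures, a config, the same config again, then a changed config.
func runDemo() (applied, failures int) {
	responses := []struct {
		cfg []byte
		err error
	}{
		{nil, errUnreachable},
		{nil, errUnreachable},
		{[]byte("job_name: a"), nil},
		{[]byte("job_name: a"), nil},
		{[]byte("job_name: b"), nil},
	}
	i := 0
	s := &syncer{fetch: func() ([]byte, error) {
		r := responses[i]
		i++
		return r.cfg, r.err
	}}
	for range responses {
		if err := s.sync(); err != nil {
			failures++
		}
	}
	return s.applied, failures
}

func main() {
	applied, failures := runDemo()
	fmt.Println(applied, failures) // prints "2 2"
}
```

Only the first successful fetch and the later changed config trigger an apply; the two initial failures and the repeated identical config return early.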

Testing:
Manually tested: when the first sync attempt fails, the manager keeps retrying instead of exiting.

removed return on first sync fail

a96b8e6

sky333999 approved these changes Oct 15, 2024


okankoAMZ merged commit 4e2991d into target-allocator Oct 15, 2024
138 of 146 checks passed

okankoAMZ mentioned this pull request Oct 25, 2024

Target allocator Support for the Agent #245

Merged
