[Fix] Target Allocator Manager quits if the initial sync fails #241
Description:
If the CloudWatch Agent Collector pod starts before the Target Allocator pod, its ping to the Target Allocator fails. This failure causes the Agent's Target Allocator thread to end, so Prometheus metrics are lost. Currently, the only workaround is to restart the pods when this happens.
To overcome this, I removed the `return failure` on Start of the TA Manager thread, so the TA Manager keeps retrying and no metrics are lost. This adds no extra cost: if the `scrape_config` hash (`savedHash`) is the same as the previous one, the sync function returns immediately. If the ping keeps failing, `savedHash` stays the same as the hash, which is 0, so no extra computing power is required.
Testing:
Manually tested: when the first sync fails, the TA Manager keeps retrying.