Add callback failure warning email #2190
Conversation
- Alphabetized the list of Notify's templates in config.py, because readability is nice.
app/celery/service_callback_tasks.py
Outdated
@@ -86,5 +86,5 @@ def _send_data_to_service_callback_api(self, data, service_callback_url, token,
             self.retry(queue=QueueNames.CALLBACKS_RETRY)
         except self.MaxRetriesExceededError:
             current_app.logger.warning(
-                "Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
+                f"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
Turns out we have not been properly capturing callback failure logs for a while now.
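As an illustration of the bug (not code from the PR): without the f prefix, Python never interpolates the braces, so the placeholders are logged literally.

    function_name = "process_service_callback"  # hypothetical value for illustration
    current_app.logger.warning("Retry: {function_name} has retried the max num of times")
    # logs the literal text: Retry: {function_name} has retried the max num of times
    current_app.logger.warning(f"Retry: {function_name} has retried the max num of times")
    # logs: Retry: process_service_callback has retried the max num of times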
- Added CALLBACK_FAILURE_THRESHOLD_PERCENTAGE env var
- Updated the callback email migration script with the latest changes from content
- Added a method to send the callback failure email to service owners
- Stubbed a method to query CloudWatch so we can determine if callbacks for the service have failed at least 5 times in a 30 minute time period before we send the email to the service owner (see the sketch below)
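A rough sketch of what the stubbed CloudWatch query might look like, assuming boto3 and a hypothetical custom metric; the namespace, metric name, and dimension are assumptions, not Notify's actual configuration.

    from datetime import datetime, timedelta, timezone
    import boto3

    CALLBACK_FAILURE_THRESHOLD = 5          # failures
    CALLBACK_FAILURE_WINDOW_MINUTES = 30    # look-back window

    def callback_failures_exceed_threshold(service_id: str) -> bool:
        # Sum a hypothetical "CallbackMaxRetriesExceeded" metric over the last 30 minutes.
        cloudwatch = boto3.client("cloudwatch")
        now = datetime.now(timezone.utc)
        response = cloudwatch.get_metric_statistics(
            Namespace="Notify/Callbacks",             # assumed namespace
            MetricName="CallbackMaxRetriesExceeded",  # assumed metric name
            Dimensions=[{"Name": "ServiceId", "Value": service_id}],
            StartTime=now - timedelta(minutes=CALLBACK_FAILURE_WINDOW_MINUTES),
            EndTime=now,
            Period=CALLBACK_FAILURE_WINDOW_MINUTES * 60,
            Statistics=["Sum"],
        )
        total_failures = sum(point["Sum"] for point in response["Datapoints"])
        return total_failures >= CALLBACK_FAILURE_THRESHOLD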
ALREADY_REGISTERED_EMAIL_TEMPLATE_ID = "0880fbb1-a0c6-46f0-9a8e-36c986381ceb"
APIKEY_REVOKE_TEMPLATE_ID = "a0a4e7b8-8a6a-4eaa-9f4e-9c3a5b2dbcf3"
BRANDING_REQUEST_TEMPLATE_ID = "7d423d9e-e94e-4118-879d-d52f383206ae"
CALLBACK_FAILURE_TEMPLATE_ID = "d8d580f4-86b3-4ba4-9d7c-263a630af354"
Amongst the rearranging, this env var is the one I added.
Easier to read, thank you 🔤
- fix circular dependency
- formatting
Looks good. If our threshold is simply whenever MaxRetriesExceededError
is thrown, I think we can use a redis key to ensure we only send one email per day. See comment below.
            # Retry if the response status code is server-side or 429 (too many requests).
            if not isinstance(e, HTTPError) or e.response.status_code >= 500 or e.response.status_code == 429:
                try:
                    self.retry(queue=QueueNames.CALLBACKS_RETRY)
                except self.MaxRetriesExceededError:
                    current_app.logger.warning(
-                       "Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
+                       f"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id} for service: {current_app.service.id}"
For the limits feature, we use an expiring redis key to ensure only one email is sent within a particular time period. For example, when a service first reaches the threshold, we send the email and set the key in redis. If the threshold is hit again within that period, the key is still present, so we check it and skip sending another email.
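A minimal sketch of that expiring-key guard, assuming redis-py; the key format, TTL, and send helper are placeholders rather than the actual Notify implementation.

    import redis

    redis_client = redis.Redis()  # assumes a reachable redis instance
    EMAIL_SUPPRESSION_SECONDS = 24 * 60 * 60  # hypothetical one-day window

    def maybe_send_callback_failure_email(service_id, send_email):
        # nx=True only sets the key if it does not already exist, so the first
        # caller in the window sends the email and later callers are skipped.
        cache_key = f"callback-failure-email-sent:{service_id}"
        if redis_client.set(cache_key, "true", ex=EMAIL_SUPPRESSION_SECONDS, nx=True):
            send_email(service_id)

The expiry handles resetting the flag on its own, so no scheduled cleanup job is needed.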
HEARTBEAT_TEMPLATE_EMAIL_LOW = "73079cb9-c169-44ea-8cf4-8d397711cc9d"
HEARTBEAT_TEMPLATE_EMAIL_MEDIUM = "c75c4539-3014-4c4c-96b5-94d326758a74"
HEARTBEAT_TEMPLATE_EMAIL_HIGH = "276da251-3103-49f3-9054-cbf6b5d74411"
HEARTBEAT_TEMPLATE_SMS_HIGH = "4969a9e9-ddfd-476e-8b93-6231e6f1be4a"
Check warning (Code scanning / CodeQL): Variable defined multiple times (redefined)
We are not doing this; this is already in the PRD GC Notify service.
Summary | Résumé
This PR adds the groundwork for sending a warning email to service owners when their callback service is not working correctly.
The actual email sending based on threshold is not yet implemented. It is non-trivial to ensure that we don't spam users with emails every time a retried callback reaches the max retries of 5. For the time being we will simply monitor how frequently this occurs while the remaining work is completed.
Related Issues | Cartes liées
Test instructions | Instructions pour tester la modification
TODO: Fill in test instructions for the reviewer.
Release Instructions | Instructions pour le déploiement
None.
Reviewer checklist | Liste de vérification du réviseur